Tools for large graph mining: structure and diffusion.
WWW 2008 tutorial, Beijing, China.
Jure Leskovec and
Carnegie Mellon University
How do graphs look like? How do they evolve over time? How can we find patterns, anomalies and regularities in them? How to find influential nodes in the network?
We will present both theoretical results and algorithms as well as case studies on several real applications. Our emphasis is on the intuition behind each method, and on guidelines for the practitioner.
The tutorial has four parts:
(a) Statistical properties and models and graph generators of static and evolving netoworks.
(b) Diffusion and cascading behavior in networks, where a virus or information spreads through the network. We present empirical observations, models and algorithms to find influential blogers or nodes to target for viral marketing.
(c) Tools for the analysis of static and dynamic graphs, like the Singular Value Decomposition, tensor decomposition for community detection, detecting anomalous nodes, and analyzing time evolving networks.
(d) Case studies of communication patterns of MSN Instant Messenger, how to find fraudsters on eBay, how to predict quality of web search results from the properties of webgraph, internet traffic analysis, and virus propagation results.
- Part 1: Properties, models and tools to mine the structure of large networks [1.5 hours]
- Statistical properties of static and evolving networks.
- Power law degree distributions found in static networks
- Small world phenomena and six degrees of separation
- Communities and clusters in networks
- Densification of time evolving networks
- Shrinking diameters of growing networks
- Evolution and link predictions in social networks. Where will the next edge appear in the evolution of the network?
- Models and generators. We motivate each model by the properties of networks it generates, and explain the intuition behind it.
- Erdos-Renyi random graph model, giant components, properties and phase transitions
- Preferential attachment and variants (copying model)
- Forest Fire model for time evolving networks
- Kronecker graph generators and how to fit them to large real graphs so that one can generate realistic synthetic networks
- Part 2: Diffusion and influence propagation in web and social networks [1.5 hours]
- Mathematical or virus propagation. We focus on SIS and SIR virus propagation models and on epidemic thresholds (how quickly does the virus need to spread to conquer the network?).
- Models of diffusion in social networks: we review the independent cascade model and the threshold model. And show some very recent measurements on large product recommendation dataset that help us validate the models.
- Empirical studies of diffusion of information in blogs and cascading behavior in on-line viral marketing. We explain how does the network structure influence the propagation of information and what are common topological patterns of information propagation. We also look into recommendation propagation from a large on-line retailer.
- Algorithms for selecting influential nodes in networks (who are influential bloggers? who is good a summarizer?). Here we use submodularity to design theoretically sound algorithms to detect information outbreaks in networks (what are the most informative blogs) and water distribution networks (where shall we position sensors to detect disease outbreaks quickly). E.g., see http://www.blogcascades.org.
- Part 3: Tools for static and dynamic graphs [1.5 hours]
Focus is on powerful tools for analyzing large matrices and tensors. A matrix describes a graph, while tensor can describe a graph with nodes and edges of various types, time evolving graphs, time series, etc. We will cover:
- Singular Value Decomposition (SVD), HITS and Pagerank
- Other matrix decompositions (CUR decomposition)
- Tensors and higher-order SVD
- Applications of random walk and spectral methods to measure proximity between nodes in networks, and to find best explanatory subgraph connecting a set of nodes.
- Part 4: Case studies [1.5 hours]
- Communication patterns of MSN Messenger. The application of above mentioned tools and algorithms to a large network of communication on MSN Instant Messenger (30 billion conversations, 240 million people).
- Detecting fraud on eBay. How to find fraudulent people on eBay. We present a belief propagation method that is able to find fraudulent people in large networks.
- Monitoring social and communication networks over time -- intrusion and outlier detection. An application of tensor decomposition techniques to monitor multiple time series over time and detect outliers and abnormal events
- Web projections. Exploiting the structure of web graph to predict the quality of search results, user intention to reformulate queries and to find spam search results.
- Connection subgraphs and CenterPiece subgraphs.
Who should attend
The target audience is data management, data mining and machine learning researchers and professionals who work on static or time-evolving graphs and want to know about tools and models when dealing with large network datasets. There will be special emphasis on web, blogs and on-line social networks related topics.
About the instructors
Jure Leskovec is a PhD candidate in Machine Learning at Carnegie Mellon University graduating in summer 2008. He is also a Microsoft Research Graduate Fellow. He received the ACM KDD 2005 and ACM KDD 2007 best paper awards, he also won the ACM KDD cup competition in 2003 and the Battle of the Sensor Networks 2007 competitions. Jure holds three patents. His research interests include data mining and machine learning on large real-world graphs and networks. He is especially interested in evolution, friendship formation, diffusion and cascading behavior in large social networks and the web.
Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award by the National Science Foundation (1989), the ICDM 'research contributions' award in 2006, and several ``best paper'' and teaching awards. He has delivered over 20 tutorials in database and data mining conferences, he has published over 160 refereed articles, one monograph, and holds five patents. His research interests include data mining for graphs and streams, fractals, indexing methods for spatial and multimedia bases, and data base performance.