Workshop on Incomplete Networked Data (WIND)


DATES: Tuesday March 22 and Wednesday March 23, 2016



LOCATION: Open Campus, Sandia National Laboratories, Livermore, CA


ORGANIZERS (in alphabetical order)

o   Tina Eliassi-Rad (Northeastern University)

o   James Ferry (Metron, Inc.)

o   Ali Pinar (Sandia National Labs, Livermore)

o   C. Seshadhri (UC Santa Cruz)



Many systems of critical importance are commonly modeled as networks. Understanding their structure and anticipating their weaknesses are vital to numerous applications. However, most networks are vast and distributed, and there is no mechanism for collecting their topologies completely. We cannot expect to map the entire router structure of the Internet, collect every hyperlink on the Web, or identify every social connection. We are restricted to specific data collection schemes that examine tiny portions of such networks. It has been empirically documented and theoretically proven that these measurements carry significant biases, and that direct inferences from them will be wrong. It is therefore imperative to investigate and develop new network analysis theory and techniques that take into account the incompleteness and inaccuracy of the data and the biases in data collection methods. Current state-of-the-art network analysis methods fall short, and novel theory and approaches are needed for this daunting problem.



Tuesday, March 22


8:45 - 9:00       Introduction


9:00 - 10:30     Session 1 – Sampling (Chair: Ali)


10:30 - 11:00   Break


11:00 - 12:30   Session 2 – Linear Algebra and Spectral Methods (Chair: Sesh)


12:30 - 2:00     Lunch (Box lunches will be provided for $10. Please bring cash.)


2:00 - 3:30       Session 3 – Partially Observed Networks (Chair: Sesh)


3:30 - 4:00       Break


4:00 - 5:30       Session 4 – Algorithms for Uncertain Networks (Chair: Tina)


6:00     Workshop Dinner at Terra Mia Livermore

Wednesday, March 23


9:00 - 10:10     Session 5 – Complex Representations (Chair: Tina)


10:10 - 10:30   Break


10:30 - 11:40   Session 6 – Applications (Chair: Jim)


11:40 - 12:00   Break


12:00 - 12:30   Perspectives from Real Applications


12:30 - 1:00     Concluding remarks


1:00 - 3:00       Lunch and open discussion (Box lunches will be provided for $10. Please bring cash.)




·       Tammy Kolda (Sandia): Measurements on Graphs: The Power of Wedge and Diamond Sampling [Slides]
Abstract: Massive graphs can make computations prohibitively expensive. We show that graph sampling based on wedges is a versatile technique that allows for the fast and accurate approximation of various quantities. Using wedge sampling, we can compute degree-wise and overall clustering coefficients as well as triangle counts. Our methods come with provable and practical time-approximation tradeoffs for all computations. We provide extensive results showing that our methods are orders of magnitude faster than the state of the art, while providing nearly the accuracy of full enumeration. Our MapReduce implementation has successfully used this method on graphs with 132M nodes and 4.7B edges, as well as on artificially generated networks (using the Graph500 benchmark), the largest of which has 240M nodes and 8.5B edges. These methods can be extended to other purposes as well. The maximum all-pairs dot-product search (MADS) problem finds the largest entry in the product of two matrices, which is useful for similarity search, link prediction, and collaborative filtering. Sampling diamonds, i.e., intersecting pairs of wedges, allows us to avoid computing all dot products. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods. This is based on joint work with Grey Ballard, Todd Plantenga, Ali Pinar, and C. Seshadhri.
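The wedge-sampling idea can be sketched in a few lines: pick a wedge (a path of length two) uniformly at random and check whether it closes into a triangle; the closed fraction estimates the global clustering coefficient. A minimal illustration, assuming an undirected simple graph given as an edge list (function and parameter names are ours, not the paper's):

```python
import random

def clustering_coefficient_wedge_sampling(edges, num_samples=10000, seed=0):
    """Estimate the global clustering coefficient by sampling wedges
    (paths of length 2) uniformly and reporting the closed fraction."""
    rng = random.Random(seed)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = list(adj)
    # Number of wedges centered at each node: deg choose 2.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in nodes]
    if sum(weights) == 0:
        return 0.0
    closed = 0
    for _ in range(num_samples):
        # Pick a center proportional to its wedge count,
        # then two distinct neighbors uniformly at random.
        c = rng.choices(nodes, weights=weights)[0]
        a, b = rng.sample(sorted(adj[c]), 2)
        if b in adj[a]:
            closed += 1
    return closed / num_samples
```

On large graphs, a few thousand sampled wedges already give a tight estimate, which is the source of the orders-of-magnitude speedup over full enumeration.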


·       Ravi Kumar (Google): Sampling a Uniform Node [Slides]
Abstract: In this talk we will address the basic problem of sampling a node from a large graph according to the uniform distribution, using a random walk as the basic primitive. We study the query complexity of three algorithms and show a near-tight bound expressed in terms of two key parameters of the graph, namely the average degree and the mixing time.
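One standard way to realize this primitive (a generic sketch under our own assumptions, not necessarily one of the three algorithms analyzed in the talk): run a lazy random walk past the mixing time, which converges to the degree-proportional stationary distribution, then use rejection to flatten that distribution to uniform.

```python
import random

def sample_uniform_node(adj, walk_length=50, rng=None):
    """Sample a node approximately uniformly from a connected graph.

    A lazy random walk converges to the degree-proportional stationary
    distribution; accepting node v with probability min_deg / deg(v)
    flattens that distribution to (approximately) uniform.
    """
    rng = rng or random.Random()
    nodes = sorted(adj)
    dmin = min(len(adj[u]) for u in adj)
    while True:
        v = rng.choice(nodes)              # arbitrary starting node
        for _ in range(walk_length):       # walk past the mixing time
            if rng.random() < 0.5:         # laziness avoids periodicity
                v = rng.choice(sorted(adj[v]))
        if rng.random() < dmin / len(adj[v]):   # rejection step
            return v
```

The query complexity of schemes like this is governed by exactly the two parameters in the talk: the walk length needed (mixing time) and the rejection rate (driven by the degree distribution).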


·       Dennis Feehan (UC Berkeley): Network Reporting Methods for Sample Surveys [Slides]
Surveys have traditionally been based on the idea that researchers can estimate characteristics of a population by obtaining a sample of individuals and asking them to report about themselves. Network reporting surveys generalize this traditional approach by asking survey respondents to report about members of their personal networks. By asking respondents to report about others, network reporting surveys can be used to study many important rare and hidden populations for which traditional survey methods are inadequate; for example, the approach has been used to estimate the size of epidemiologically important groups like sex workers, drug injectors, and men who have sex with men. In my talk, I first describe a framework for developing estimators from network reporting survey data. Next, I present some results from a nationally-representative survey experiment that my colleagues and I conducted in Rwanda. Finally, I outline some directions for future work.


·       Michael Mahoney (UC Berkeley): Algorithmic Methods for Characterizing Local Structure in Large Graphs [Slides]
Abstract: Large networks can have surprisingly subtle and counterintuitive properties, so much so that even characterizing their basic structure can be challenging. In many cases, there is meaningful small-scale or local structure, e.g., a small cluster or an ego-network around an individual, but no particularly meaningful large-scale structure. This is particularly challenging to deal with, since most algorithmic and statistical tools implicitly assume exactly the opposite. Recent work has focused on developing local and locally-biased algorithms, i.e., algorithms that compute solutions around an exogenously-specified seed set of nodes, in some cases without even touching most of the nodes of the graph. We will describe some recent results along these lines.


·       David Gleich (Purdue): Using Local Spectral Methods to Robustify Graph-based Learning Algorithms [Slides]
Abstract: Graph-based learning methods go by a variety of names, including semi-supervised and transductive learning. They typically use a diffusion to propagate labels from a small set of nodes with known class labels to the remaining nodes of the graph. While popular, these algorithms, when implemented in a straightforward fashion, are extremely sensitive to the details of the graph construction from the real-world data. We focus on the question of how to make these procedures robust to missing data and mistakes in the graph construction process. Here, we provide four procedures: recognizing implicit regularization in the diffusion, using a scalable push method to evaluate the diffusion, using rank-based rounding, and densifying the graph through a matrix polynomial. We study robustness with respect to the details of graph construction, errors in node labeling, degree variability, and a variety of other real-world heterogeneities, studying these methods through a precise relationship with mincut problems. For instance, the densification strategy explicitly adds new weighted edges to a sparse graph. We find that this simple densification creates a graph on which multiple diffusion methods are robust to several types of errors. This is demonstrated by a study predicting product categories from an Amazon co-purchasing network.
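The "scalable push method" for evaluating a diffusion can be illustrated with a personalized PageRank push in the style of Andersen, Chung, and Lang; this is a generic sketch under our own assumptions, not the paper's implementation:

```python
def ppr_push(adj, seed, alpha=0.15, eps=1e-6):
    """Approximate personalized PageRank by the push method: keep a
    residual vector r, and push mass from any node whose residual
    exceeds eps * degree. Only nodes near the seed are ever touched."""
    p, r = {}, {seed: 1.0}
    queue = [seed]
    while queue:
        v = queue.pop()
        if r.get(v, 0.0) <= eps * len(adj[v]):
            continue                       # stale queue entry
        rv = r.pop(v)
        p[v] = p.get(v, 0.0) + alpha * rv  # retain a teleport fraction
        share = (1 - alpha) * rv / len(adj[v])
        for u in adj[v]:                   # spread the rest to neighbors
            old = r.get(u, 0.0)
            r[u] = old + share
            if old <= eps * len(adj[u]) < r[u]:
                queue.append(u)            # u just crossed the threshold
    return p
```

The invariant sum(p) + sum(r) = 1 makes the approximation error easy to track, and the locality of the updates is what makes the method scalable.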


·       Ken Clarkson (IBM Research): Low-Rank Approximation in Input-Sparsity Time, with Regularization and Robustness [Slides]
The classical problem of finding the best rank-k approximation to a matrix A under the Frobenius norm, min_{rank(B)=k} ||A - B||_F, can be solved using the singular value decomposition. I will outline fast sketching-based approximation algorithms for two variations on this problem: one where the Frobenius norm is replaced by a more robust measure, and one where the objective has an added regularization term L*||B||_* using the nuclear (trace) norm, with L a weight parameter. Here "sketching" is a matrix compression technique, including for example random projections. These algorithms have a "leading term" in running time of tilde{O}(nnz(A)), where nnz(A) is the number of nonzero entries of A, and an additive term depending on poly(k/eps), where eps is the relative approximation error. The first algorithm includes a step taking exp(poly(k/eps)) time, and we have an NP-hardness result suggesting that this may be unavoidable. The running time of the second algorithm has a dependence on the "statistical dimension" of A, which is decreasing in L. The algorithmic techniques can be applied to broader classes of robust and regularized problems, and there are related algorithms for regression. This is joint work with Haim Avron and David Woodruff.
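For context, the classical SVD solution and the general sketching idea can be written in a few lines of NumPy; this is a generic randomized-sketch illustration, not the talk's algorithms:

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in the Frobenius norm, via the
    truncated SVD (Eckart-Young theorem)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def sketched_rank_k(A, k, oversample=10, rng=None):
    """Randomized sketch-based rank-k approximation: compress A with a
    Gaussian random projection, orthonormalize the result, and solve
    the small problem in the compressed space."""
    rng = rng or np.random.default_rng(0)
    G = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ G)           # orthonormal basis for the range
    return Q @ best_rank_k(Q.T @ A, k)   # small SVD, lifted back up
```

The expensive step shrinks from an SVD of A to an SVD of the much smaller Q.T @ A; input-sparsity-time methods replace the Gaussian projection with sparser sketches.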


·       Jure Leskovec (Stanford): Missing Data in Information Networks [Slides]
Abstract: Information networks and knowledge graphs represent many important and challenging problems. For example, identifying pairs of nodes in an information network that should be linked may be very hard and time consuming for humans. Similarly, identifying missing entities in an information network is a daunting task for humans, and automatic approaches to information network completion would have many important applications. In this talk I will present a set of approaches for automatically finding missing links and missing nodes in information networks based on the way people access and browse the information network. In particular, we develop methods that use passively collected data as implicit signals indicating which nonexistent links would be useful if they were introduced. We develop a set of scalable methods and deploy them on the full Wikipedia network.


·       Forrest Crawford (Yale): Identification and Estimation of Partially-observed Networks [Slides]
Learning about the structure and function of hidden or covert networks is a major challenge in epidemiology, social science, and intelligence analysis. Vertices in hidden networks usually cannot be enumerated or sampled in a systematic way; they can only be revealed by tracing links emanating from already-observed vertices. Observers sometimes cannot follow links directly, and instead must rely on passive observation of a dynamic process to reveal vertices and edges. In this talk, I will outline a framework for estimating network structures and network functionals from partial observation of information diffusion through the network. I will describe how analysts can reconstruct topological features of networks that are partially revealed by diffusion processes, and show how to estimate more general graph functionals, with corresponding uncertainty quantification. The framework is general and applies to network data that arise from a variety of missing-data mechanisms. In addition, I will discuss estimation of the number of vertices in a graph from small samples and detection of information cascades from point process or communication data.


·       Sucheta Soundarajan (Syracuse): Increasing the Size of Partially Observed Networks [Slides]
Networked representations of real-world phenomena are often partially observed, which leads to incomplete networks. Analysis of such incomplete networks can produce skewed results. We examine the following problem: given an incomplete network, which b nodes should be probed to bring the largest number of new nodes into the observed network? Many graph-mining tasks require having observed a considerable portion of the network; examples include community discovery, belief propagation, and information propagation. For instance, consider someone who has observed a portion (say 1%) of the Twitter retweet network via random tweet sampling. She wants to estimate the size of the largest connected component of the fully observed retweet network. To improve her estimate, she uses her limited budget to reduce the incompleteness of the network. In this work, we propose a novel algorithm, called MaxOutProbe, which uses a budget b (on the number of nodes probed) to increase the size of the observed network in terms of the number of nodes. We compare MaxOutProbe to the best existing probing approaches across incomplete networks observed via different sampling methods. Across a range of conditions, MaxOutProbe demonstrates consistently high performance compared to existing methods.
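A naive probing baseline makes the setup concrete (this is our own illustration of the problem, not MaxOutProbe itself, whose scoring of candidate probes is more sophisticated): spend the budget b on the observed nodes of highest observed degree and reveal their full neighborhoods.

```python
def probe_by_observed_degree(full_adj, observed_nodes, b):
    """Baseline probing strategy: probe the b observed nodes with the
    highest degree in the observed subgraph; each probe reveals all of
    that node's neighbors in the (hidden) full graph."""
    observed = set(observed_nodes)
    obs_deg = {v: sum(1 for u in full_adj[v] if u in observed)
               for v in observed}
    # Ties broken by node id for determinism.
    probes = sorted(observed, key=lambda v: (-obs_deg[v], v))[:b]
    for v in probes:
        observed |= full_adj[v]   # the probe reveals v's neighborhood
    return observed
```

The weakness of this baseline is visible in the code: observed degree is a biased proxy for how many *unobserved* neighbors a node has, which is exactly the quantity a good probing method must estimate.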


·       David Kempe (USC): Algorithmic Robustness in the Face of Uncertain Network Data [Slides]
Large amounts of noise are unavoidable in network data. Lack of observability, uncertainty about models, or even uncertainty about the precise meaning of an "edge" (e.g., where does "acquaintance" end and "friend" start?) combine to make social networks and their inferred parameters at best crude approximations to reality. Nonetheless, researchers and end users want to draw reliable conclusions, and perform successful interventions, based on the output of algorithms on such uncertain data. This means that algorithms must be designed with the goal of robustness in mind. In this talk, I will touch on the issue of algorithmic robustness in the context of several classical social network problems, including influence maximization and community detection. The talk will be heavy on problems, and light on answers.


·       Anil Vullikanti (Virginia Tech): Sensitivity of Structural and Dynamical Properties to Network Uncertainty [Slides]
Many phenomena can be modeled by diffusion processes on networks, and the network structure has a significant impact on the dynamics. In most applications, however, the underlying networks are inherently noisy and incomplete, since they are often inferred from indirect measurements. Surprisingly, there is limited work on understanding the effects of network uncertainty, despite a large body of research on complex networks. In this talk, we discuss the sensitivity of structural and dynamical properties to network uncertainty. First, we study the sensitivity of the core decomposition of the network. The k-core of an undirected graph is defined as the maximal subgraph in which each node has degree at least k; the core number of a node is the largest k such that it belongs to the k-core. We find that, in general, the top core is quite sensitive to both noise and sampling. Most importantly, the overlap with the top core set varies non-monotonically with the extent of perturbation/sampling. Next, we consider the sensitivity of dynamical properties. We find that the number of nodes influenced in two different models, namely the independent cascade (IC) and linear threshold (LT) models, behaves quite differently under network uncertainty. Our work has important implications for the use of network and dynamical properties, and suggests the need for a more careful characterization of missing data and the sensitivity to it.
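The core decomposition defined above can be computed by the standard peeling algorithm; a minimal sketch (our own code, not the paper's), which makes it easy to experiment with how core numbers shift when edges are dropped or added:

```python
def core_numbers(adj):
    """Core number of each node via peeling: repeatedly remove a
    minimum-degree node; its degree at removal (made non-decreasing
    over the run) is its core number."""
    deg = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=deg.get)   # current minimum-degree node
        k = max(k, deg[v])                # core numbers never decrease
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1               # peel v away from its neighbors
    return core
```

Rerunning this after randomly deleting a small fraction of edges is a direct way to reproduce, at small scale, the top-core sensitivity experiments described in the abstract.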


·       Jeremy Wendt (Sandia): Social Media Network Collection Problems [Slides]
Social media is rife with network collection problems. In this talk, I'll describe two problems we regularly encounter that lead to incomplete, muddied, or otherwise troubled data. First, web crawling is still a potent situational awareness tool. However, modern websites are intentionally designed to trap both crawlers and users within the hosting domain or its partners, not to enhance relevant data collection. Second, social media data naturally forms semantic graphs. For instance, Twitter graphs can contain edges for following, @-referring, or any number of other actions. These different edge or node types often require different querying APIs, with varying data rates. While the sampling problem is well known and well explored for traditional graphs, semantic graphs lead to both new problems and new opportunities. I will describe both the specific problems that led to our interest in this area and some early generalizations we've been exploring.


·       Jiawei Han (UIUC): Structuring and Mining of Heterogeneous Information Networks [Slides]
Abstract: Real-world data are largely unstructured but interconnected, with the majority in the form of natural language text or multimedia. One of the grand challenges is to turn such massive data into actionable knowledge. We present our vision of how to turn massive unstructured, text-rich, but interconnected data into knowledge. We propose a D2N2K (data-to-network-to-knowledge) paradigm: first, turn data into relatively structured heterogeneous information networks; then, mine such text-rich and structure-rich heterogeneous networks to generate useful knowledge. We show why this paradigm represents a promising direction and present recent progress on the development of effective methods for structuring and mining heterogeneous information networks from text data.


·       Lise Getoor (UCSC): Collective Graph Identification [Slides]
Graph data (e.g., communication data, financial transaction networks, data describing biological systems, collaboration networks, the Web, etc.) is ubiquitous. While this observational data is useful, it is usually noisy, often only partially observed, and only hints at the actual underlying social, scientific or technological structures that give rise to the interactions. For example, an email communication network provides useful insight, but is not the same as the “real” social network among individuals. In this talk, I introduce the problem of graph identification, i.e., the discovery of the true graph structure underlying an observed network. This involves inferring the nodes, edges, and node labels of a hidden graph based on evidence provided by the observed graph. I show how this can be cast as a collective probabilistic inference task and describe a scalable approach to solving this problem.


·       Jared Saia (UNM): Cooperative Computing for Autonomous Data Centers [Slides]
We present a new distributed model for graph computations motivated by limited information sharing. Two or more independent entities have collected large social graphs. They wish to compute the result of running graph algorithms on the entire set of relationships. Because the information is sensitive or economically valuable, they do not wish to simply combine the information in a single location. We consider two models for computing the solution to graph algorithms in this setting: 1) limited-sharing: the entities can share only a polylogarithmic-size subgraph; 2) low-trust: the entities must not reveal any information beyond the query answer, assuming they are all honest but curious. The second model is closely related to the classic secure multiparty computation (MPC) problem, except that the input sizes are much larger than is common. We describe efficient algorithms in these models for two graph-theoretic problems: 1) cooperative computation of s-t connectivity, and 2) finding a planted clique in a social network. This is joint work with Jonathan W. Berry, Michael J. Collins, Aaron Kearns, Cynthia A. Phillips, and Randy Smith.


·       Bradley Huffaker (CAIDA): Autonomous Systems (AS) Topology: An Introduction [Slides]
Topology maps of the Internet are indispensable for characterizing this critical infrastructure and understanding its properties, dynamics, and evolution. Most Internet mapping methods have focused on characterizing and modeling the network structure at the level of interconnected Autonomous Systems (ASes). In this talk we will introduce various ways to annotate ASes, go over known caveats, and discuss possible areas of interest.