DATES: Tuesday March 22 and Wednesday March 23, 2016

SPONSOR

LOCATION: Open Campus, Sandia National Laboratories, Livermore, CA

ORGANIZERS (in alphabetical order)
o Tina Eliassi-Rad (Northeastern University)
o James Ferry (Metron, Inc.)
o Ali Pinar (Sandia National Labs, Livermore)
o C. Seshadhri (UC Santa Cruz)

Many systems of critical importance are commonly modeled as networks. Understanding their structure and anticipating their weaknesses are vital to numerous applications. However, most networks are vast and distributed, and there is no mechanism for collecting their topologies completely. We cannot expect to know the entire router structure of the Internet, collect all hyperlinks on the Web, or identify all social connections. We are restricted to specific data collection schemes that examine tiny portions of such a network. It has been empirically documented and theoretically proven that these measurements carry significant biases, and direct inferences from them will be wrong. It is imperative to investigate and develop new network analysis theory and techniques that account for the incompleteness and inaccuracy of the data and the biases in the data collection methods. Current state-of-the-art network analysis methods fall short, and we need novel theory and approaches for this daunting problem.
Tuesday, March 22
8:45 - 9:00 Introduction
9:00 - 10:30 Session 1 – Sampling (Chair: Ali)
10:30 - 11:00 Break
11:00 - 12:30 Session 2 – Linear Algebra and Spectral Methods (Chair: Sesh)
12:30 - 2:00 Lunch (Box lunches will be provided for $10. Please bring cash.)
2:00 - 3:30 Session 3 – Partially Observed Networks (Chair: Sesh)
3:30 - 4:00 Break
4:00 - 5:30 Session 4 – Algorithms for Uncertain Networks (Chair: Tina)
6:00 Workshop Dinner at Terra Mia Livermore

Wednesday, March 23
9:00 - 10:10 Session 5 – Complex Representations (Chair: Tina)
10:10 - 10:30 Break
10:30 - 11:40 Session 6 – Applications (Chair: Jim)
11:40 - 12:00 Break
12:00 - 12:30 Perspectives from Real Applications
12:30 - 1:00 Concluding remarks
1:00 - 3:00 Lunch and open discussion (Box lunches will be provided for $10. Please bring cash.)
· Tammy Kolda (Sandia): Measurements on Graphs: The Power of Wedge and Diamond Sampling [Slides]
Abstract: Massive graphs can make computations prohibitively expensive. We show
that graph sampling based on wedges is a versatile technique that allows for the
fast and accurate approximation of various quantities. Using wedge sampling, we
can compute graph degree-wise and overall clustering coefficients as well as
triangle counts. Our methods come with provable and practical
time-approximation tradeoffs for all computations. We provide extensive results
that show our methods are orders of magnitude faster than the state-of-the-art,
while providing nearly the accuracy of full enumeration. Our MapReduce
implementation has successfully used this method on graphs with 132M nodes and
4.7B edges, as well as artificially generated networks (using the Graph500
benchmark), the largest of which has 240M nodes and 8.5B edges. These methods
can be extended to other purposes as well. The maximum all-pairs dot-product
search (MADS) problem finds the largest entry in the product of two matrices,
which is useful for similarity search, link prediction, and collaborative filtering.
Sampling diamonds, intersecting pairs of wedges, allows us to avoid computing
all dot products. Experimental results indicate that diamond sampling is orders
of magnitude faster than direct computation and requires far fewer samples than
any competing approach. We also apply diamond sampling to the special case of
maximum inner product search, and get significantly better results than the
state-of-the-art hashing methods. This is based on joint work with Grey
Ballard, Todd Plantenga, Ali Pinar, and C. Seshadhri.
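As a concrete illustration of the core idea (a minimal sketch, not the authors' optimized implementation), wedge sampling estimates the global clustering coefficient by drawing each wedge with the right probability and checking whether it closes into a triangle:

```python
import random

def wedge_sampling_transitivity(adj, num_samples=10_000, seed=0):
    """Estimate the global clustering coefficient (the fraction of
    closed wedges) by sampling wedges instead of enumerating them."""
    rng = random.Random(seed)
    centers = [v for v in adj if len(adj[v]) >= 2]
    # Each node v is the center of deg(v) choose 2 wedges.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for _ in range(num_samples):
        c = rng.choices(centers, weights=weights)[0]
        u, w = rng.sample(sorted(adj[c]), 2)
        if w in adj[u]:                # the wedge closes into a triangle
            closed += 1
    return closed / num_samples

# Toy graph: a 4-clique with one pendant vertex; its exact global
# clustering coefficient is 12/15 = 0.8.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
est = wedge_sampling_transitivity(adj)
```

The sample size controls the time/accuracy tradeoff directly: the standard error shrinks like the inverse square root of the number of sampled wedges, independent of graph size.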
· Ravi Kumar (Google): Sampling a Uniform Node [Slides]
Abstract: In this talk
we will address the basic problem of sampling a node from a large graph
according to the uniform distribution, by using random walk as the basic
primitive. We study the query complexity of three algorithms and show a
near-tight bound expressed in terms of two key parameters of the graph, namely,
the average degree and the mixing time.
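One standard way to get a uniform node from random-walk primitives alone is a Metropolis-Hastings walk, sketched below under the assumption that we can query a node's neighbor list and degree (the talk's algorithms and query-complexity bounds are more refined, but the stationary-distribution idea is the same):

```python
import random

def mh_uniform_sample(adj, start, walk_len=100, seed=0):
    """Metropolis-Hastings random walk whose stationary distribution is
    uniform over nodes: from the current node, propose a uniformly
    random neighbor and accept the move with probability
    min(1, deg(current) / deg(proposed))."""
    rng = random.Random(seed)
    cur = start
    for _ in range(walk_len):
        nxt = rng.choice(sorted(adj[cur]))
        if rng.random() < min(1.0, len(adj[cur]) / len(adj[nxt])):
            cur = nxt
    return cur

# On a star graph a plain random walk sits at the hub half the time;
# the MH walk samples all four nodes (approximately) uniformly.
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
samples = [mh_uniform_sample(adj, start=0, walk_len=50, seed=i)
           for i in range(2000)]
```

The acceptance rule exactly cancels the degree bias of the simple random walk, at the cost of a longer mixing time on some graphs.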
· Dennis Feehan (UC Berkeley): Network Reporting Methods for Sample Surveys [Slides]
Abstract: Surveys
have traditionally been based on the idea that researchers can estimate
characteristics of a population by obtaining a sample of individuals and asking
them to report about themselves. Network reporting surveys generalize this
traditional approach by asking survey respondents to report about members of
their personal networks. By asking respondents to report about others, network
reporting surveys can be used to study many important rare and hidden
populations for which traditional survey methods are inadequate; for example,
the approach has been used to estimate the size of epidemiologically important
groups like sex workers, drug injectors, and men who have sex with men. In my
talk, I first describe a framework for developing estimators from network
reporting survey data. Next, I present some results from a
nationally-representative survey experiment that my colleagues and I conducted in Rwanda. Finally, I outline some
directions for future work.
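The classic network scale-up estimator, which network reporting frameworks of this kind generalize, fits in a few lines (a minimal sketch with illustrative numbers, not the estimators developed in the talk):

```python
def scale_up_estimate(degrees, hidden_reports, population_size):
    """Basic network scale-up estimator: the hidden population's size is
    roughly N * (total reported ties to the hidden population) /
    (total personal network size of the respondents)."""
    return population_size * sum(hidden_reports) / sum(degrees)

# Three respondents with personal network sizes 100, 200 and 100 report
# knowing 1, 2 and 1 members of the hidden population, out of a total
# population of 10,000:
est = scale_up_estimate([100, 200, 100], [1, 2, 1], 10_000)
```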
· Michael Mahoney (UC Berkeley): Algorithmic
Methods for Characterizing Local Structure in Large Graphs [Slides]
Abstract: Large networks can have surprisingly subtle and counterintuitive
properties, so much so that even characterizing their basic structure can be
challenging. In many cases, there is meaningful small-scale, local structure,
e.g., a small cluster or an ego-network around an individual, but there is not
particularly meaningful large-scale structure. This is particularly challenging
to deal with since most algorithmic and statistical tools implicitly assume
exactly the opposite. Recent work has focused on developing local and
locally-biased algorithms, i.e., algorithms that compute solutions around
an exogenously-specified seed set of nodes, in some cases without even touching
most of the nodes of the graph. We will describe some recent results along
these lines.
· David Gleich (Purdue): Using Local Spectral Methods to Robustify Graph-based Learning Algorithms [Slides]
Abstract: Graph-based learning methods have a variety of names including
semi-supervised and transductive learning. They
typically use a diffusion to propagate labels from a small set of nodes with
known class labels to the remaining nodes of the graph. While popular, these
algorithms, when implemented in a straightforward fashion, are extremely
sensitive to the details of the graph construction from the real-world data. We
focus on the question of how to make these procedures robust to missing data and
mistakes in the graph construction process. Here, we provide four procedures:
recognizing implicit regularization in the diffusion, using a scalable push
method to evaluate the diffusion, using rank-based rounding, and densifying the
graph through a matrix polynomial. We study robustness with respect to the
details of graph constructions, errors in node labeling, degree variability,
and a variety of other real-world heterogeneities, studying these methods
through a precise relationship with mincut problems.
For instance, the densification strategy explicitly adds new weighted edges to
a sparse graph. We find that this simple densification creates a graph where
multiple diffusion methods are robust to several types of errors. This is
demonstrated by a study with predicting product categories from an Amazon co-purchasing
network.
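The "scalable push method" mentioned above is, in its basic form, the well-known push procedure for approximating a personalized-PageRank-style diffusion; a minimal sketch (not the paper's exact variant) looks like this:

```python
from collections import defaultdict

def ppr_push(adj, seed, alpha=0.15, eps=1e-4):
    """Approximate personalized PageRank via the push method: maintain
    an estimate p and a residual r, and repeatedly push mass from any
    node whose residual exceeds eps times its degree.  Only nodes near
    the seed are ever touched."""
    p, r = defaultdict(float), defaultdict(float)
    r[seed] = 1.0
    work = [seed]
    while work:
        u = work.pop()
        if r[u] < eps * len(adj[u]):
            continue                   # stale queue entry
        mass, r[u] = r[u], 0.0
        p[u] += alpha * mass           # keep an alpha fraction at u
        share = (1 - alpha) * mass / len(adj[u])
        for v in adj[u]:               # spread the rest to the neighbors
            r[v] += share
            if r[v] >= eps * len(adj[v]):
                work.append(v)
    return dict(p)

# Two triangles joined by a bridge; mass seeded at node 0 stays mostly
# on its own side of the bridge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
p = ppr_push(adj, seed=0)
```

Because each push retires at least `eps * deg(u) * alpha` units of residual, the total work is bounded independently of the graph size, which is what makes such diffusions usable inside robust label-propagation pipelines.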
· Ken Clarkson (IBM Research): Low-Rank Approximation in Input-Sparsity Time, with Regularization and Robustness [Slides]
Abstract: The
classical problem of finding the best rank-k approximation to a matrix A under
the Frobenius norm, min_{rank(B)=k} ||A - B||_F,
can be solved using the singular value decomposition. I will outline fast
sketching-based approximation algorithms for two variations on this problem:
where the Frobenius norm is replaced by a more robust
measure, and where the objective has an added regularization term L*||A||_*
using the nuclear (trace) norm, with L a weight parameter. Here
"sketching" is a matrix compression technique, including for example
random projections. These algorithms have a "leading term" in running
time of tilde{O}(nnz(A)), where nnz(A)
is the number of nonzero entries of A, and an additive term dependent on
poly(k/eps), where eps is the relative approximation error. The first algorithm
includes a step taking exp(poly(k/eps)), and we have
an NP-hardness result suggesting that this may be unavoidable. The running time
of the second algorithm has a dependence on the "statistical
dimension" of A, which is decreasing in L. The algorithmic techniques can
be applied to broader classes of robust and regularized problems, and there are
related algorithms for regression. This is joint work with Haim Avron and
David Woodruff.
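A toy sketch-and-solve version of the Frobenius-norm problem conveys the flavor (this sketch uses a dense Gaussian sketch for simplicity; the talk's algorithms use sparse sketches to achieve nnz(A)-time and handle the robust and regularized variants):

```python
import numpy as np

def sketched_low_rank(A, k, sketch_rows=None, seed=0):
    """Sketch-and-solve rank-k approximation: compress A with a random
    Gaussian sketch S, then project A onto the span of the top-k right
    singular vectors of the much smaller matrix S @ A."""
    rng = np.random.default_rng(seed)
    m = sketch_rows or min(A.shape[0], 4 * k + 10)
    S = rng.standard_normal((m, A.shape[0])) / np.sqrt(m)
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    V_k = Vt[:k].T                 # top-k right singular directions of S @ A
    return (A @ V_k) @ V_k.T       # project the rows of A onto their span

# If A has exact rank 2, the sketch preserves its row space almost
# surely, so the rank-2 approximation recovers A up to rounding error.
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 30))
B = sketched_low_rank(A, k=2)
```

The point of sketching is that the expensive SVD is applied to the m-by-n matrix S @ A rather than to A itself, with m depending on k and the accuracy target rather than on the number of rows of A.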
· Jure Leskovec (Stanford): Missing Data in Information Networks [Slides]
Abstract: Information networks and knowledge graphs represent many important
and challenging problems. For example, identifying pairs of nodes in an
information network that should be linked may be very hard and time consuming
for humans. Similarly, identifying missing entities in an information network
is a daunting task for humans and automatic approaches to information network
completion would have many important applications. In this talk I will present
a set of approaches for automatically finding missing links and missing nodes
in information networks based on the way people access and browse the
information network. In particular, we will develop methods that use passively
collected data as implicit signals indicating which nonexistent links would be
useful if they were to be introduced. We develop a set of scalable methods and
deploy them on the full Wikipedia network.
· Forrest Crawford (Yale): Identification and
Estimation of Partially-observed Networks [Slides]
Abstract: Learning
about the structure and function of hidden or covert networks is a major
challenge in epidemiology, social science, and intelligence analysis. Vertices
in hidden networks usually cannot be enumerated or sampled in a systematic way;
they can only be revealed by tracing links emanating from already-observed
vertices. Observers sometimes cannot follow links directly, and instead must
rely on passive observation of a dynamic process to reveal vertices and edges.
In this talk, I will outline a framework for estimating network structures and
network functionals from partial observation of
information diffusion through the network. I will describe how analysts can
reconstruct topological features of networks that are partially revealed by
diffusion processes, and show how to estimate more general graph functionals, with corresponding uncertainty quantification.
The framework is general and applies to network data that arise from a variety
of missing-data mechanisms. In addition, I will discuss estimation of the
number of vertices in a graph from small samples and detection of information
cascades from point process or communication data.
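For the vertex-count estimation mentioned at the end, one standard small-sample approach (in the same spirit, though not necessarily the talk's method) is the birthday-paradox collision estimator:

```python
import random

def estimate_num_vertices(sample_node, k=2000, seed=0):
    """Birthday-paradox estimator: draw k (approximately) uniform nodes
    with replacement, count colliding pairs C, and estimate the number
    of vertices as n ~ k(k-1) / (2C)."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(k):
        v = sample_node(rng)
        counts[v] = counts.get(v, 0) + 1
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    if collisions == 0:
        return float("inf")        # too few samples to see a collision
    return k * (k - 1) / (2 * collisions)

# With a uniform sampling oracle over 500 hidden vertices, the
# estimate should land close to 500.
est = estimate_num_vertices(lambda rng: rng.randrange(500))
```

Roughly sqrt(n) samples suffice to see collisions, which is what makes the estimate feasible even when the graph itself cannot be enumerated.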
· Sucheta Soundarajan (Syracuse): Increasing the Size of Partially Observed Networks [Slides]
Abstract: Networked
representations of real-world phenomena are often partially observed, which
leads to incomplete networks. Analysis of such incomplete networks can lead to
skewed results. We examine the following problem: given an incomplete network,
which b nodes should be probed to bring the largest number of new nodes into
the observed network? Many graph-mining tasks require having observed a
considerable amount of the network. Examples include community discovery,
belief propagation, information propagation, etc. For instance, consider
someone who has observed a portion (say 1%) of the Twitter retweet network via
random tweet sampling. She wants to estimate the size of the largest connected
component of the fully observed retweet network. To improve her estimate, she
uses her limited budget to reduce the incompleteness of the network. In this
work, we propose a novel algorithm, called MaxOutProbe,
which uses a budget b (on nodes probed) to increase the size of the observed
network in terms of the number of nodes. We compare MaxOutProbe
to the best existing probing approaches across various incomplete networks
observed via different sampling methods. Across a range of conditions, MaxOutProbe demonstrates consistently high performance in
comparison to existing methods.
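To make the problem setup concrete, here is a naive probing baseline of the kind MaxOutProbe is compared against (this is an illustrative baseline, not the MaxOutProbe algorithm; `probe` is a hypothetical oracle returning a node's full neighbor list):

```python
def greedy_probe(observed_adj, probe, budget):
    """A simple probing baseline: spend the budget on the observed
    nodes with the highest observed degree, merging each probed node's
    true neighbor list into the observed graph.  (Illustrative only;
    this is NOT the MaxOutProbe algorithm from the talk.)"""
    probed = set()
    for _ in range(budget):
        candidates = [v for v in observed_adj if v not in probed]
        if not candidates:
            break
        target = max(candidates, key=lambda v: len(observed_adj[v]))
        probed.add(target)
        for nbr in probe(target):      # oracle: true neighbors of target
            observed_adj.setdefault(nbr, set())
            observed_adj[target].add(nbr)
            observed_adj[nbr].add(target)
    return observed_adj

# Hidden ground truth: a star on 5 nodes; we have observed one edge.
full = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
observed = {0: {1}, 1: {0}}
result = greedy_probe(observed, probe=lambda v: full[v], budget=1)
```

The weakness of such baselines is exactly what motivates the problem: observed degree is a biased proxy for how many new nodes a probe will reveal.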
· David Kempe (USC): Algorithmic Robustness in the
Face of Uncertain Network Data [Slides]
Abstract: Large
amounts of noise are unavoidable in network data. Lack of observability,
uncertainty about models, or even uncertainty about the precise meaning of an
"edge" (e.g., where does "acquaintance" end and
"friend" start?) combine to make social networks and their inferred
parameters at best crude approximations to reality. Nonetheless, researchers
and end users want to draw reliable conclusions, and perform successful
interventions, based on the output of algorithms on such uncertain data. This
means that algorithms must be designed with the goal of robustness in mind. In
this talk, I will touch on the issue of algorithmic robustness in the context
of several classical social network problems, including influence maximization
and community detection. The talk will be heavy
on problems, and light on answers.
· Anil Vullikanti (Virginia Tech): Sensitivity of Structural and Dynamical Properties to Network Uncertainty [Slides]
Abstract: Many
phenomena can be modeled by diffusion processes on networks, and the network structure
has a significant impact on the dynamics. In most applications, however, the
underlying networks are inherently noisy and incomplete, since they are often
inferred by indirect measurements. Surprisingly, there is limited work on
understanding the effects of network uncertainty, despite a large body of
research on complex networks. In this talk, we discuss the sensitivity of
structural and dynamical properties to network uncertainty. First, we study the
sensitivity of the core decomposition of the network. The k-core of an
undirected graph is defined as the maximal subgraph in which each node has
degree at least k; the core number of a node is the largest k such that it
belongs to the k-core. We find that, in general, the top core is quite
sensitive to both noise and sampling. Most importantly, the overlap with the
top core set varies non-monotonically with the extent of
perturbations/sampling. Next, we consider the sensitivity of dynamical properties.
We find that the number of nodes that are influenced in two different models,
namely the IC and LT models, behaves quite differently under network
uncertainty. Our work has important implications for the use of network and
dynamical properties, and suggests the need for a more careful characterization
of the missing data and sensitivity to it.
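The core decomposition studied in the talk can be computed by the standard peeling procedure (a minimal sketch; rerunning it on sampled or perturbed copies of a graph is one simple way to probe the sensitivity discussed above):

```python
def core_numbers(adj):
    """Core decomposition by peeling: repeatedly remove nodes of degree
    at most k; a node's core number is the k at which it is removed,
    i.e. the largest k such that it belongs to the k-core."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        peel = [v for v in alive if deg[v] <= k]
        if not peel:
            k += 1
            continue
        while peel:
            v = peel.pop()
            if v not in alive:
                continue               # already removed via a duplicate
            core[v] = k
            alive.remove(v)
            for u in adj[v]:
                if u in alive:
                    deg[u] -= 1
                    if deg[u] <= k:
                        peel.append(u)
    return core

# A 4-clique (its own 3-core) with one pendant vertex (core number 1).
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
cores = core_numbers(adj)
```

Because peeling decisions cascade, deleting or missing a single edge can change core numbers far from the perturbation, which is one intuition for the sensitivity of the top core.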
· Jeremy Wendt (Sandia): Social Media Network
Collection Problems [Slides]
Abstract: Social Media is rife with network collection problems. In this
talk, I’ll describe two problems we regularly encounter that lead to
incomplete, muddied,
or otherwise troubled data. First, web crawling is
still a potent situational awareness tool. However, modern websites are
intentionally designed to trap both crawlers and users within the hosting
domain or its partners, rather than to enhance relevant data collection. Second, social media is rife with semantic graphs. For
instance, Twitter graphs could contain edges for following, @-referring, or any
number of other actions. These different edge or node types often require
different querying APIs, with varying data rates. While the sampling problem is well known and explored in
traditional graphs, semantic graphs lead to both new problems and new
opportunities. In this talk, I will describe both specific problems that led to
our interest in the area and some early generalizations we’ve been exploring.
· Jiawei Han (UIUC): Structuring and Mining of Heterogeneous
Information Networks [Slides]
Abstract: Real-world data are largely unstructured but interconnected. The
majority of such data takes the form of natural language text or multimedia.
One of the grand challenges is to turn such massive data into actionable
knowledge. We present our vision on how to turn massive unstructured,
text-rich, but interconnected data into knowledge. We propose a D2N2K (i.e.,
data-to-network-to-knowledge) paradigm, that is, first turn data into
relatively structured heterogeneous information networks, and then mine such
text-rich and structure-rich heterogeneous networks to generate useful
knowledge. We show why such a paradigm represents a promising direction and
present some recent progress on the development of effective methods for
structuring and mining of heterogeneous information networks from text data.
· Lise Getoor (UCSC): Collective Graph Identification [Slides]
Abstract: Graph
data (e.g., communication data, financial transaction networks, data describing
biological systems, collaboration networks, the Web, etc.) is ubiquitous. While
this observational data is useful, it is usually noisy, often only partially
observed, and only hints at the actual underlying social, scientific or
technological structures that give rise to the interactions. For example, an
email communication network provides useful insight, but is not the same as the
“real” social network among individuals. In this talk, I introduce the problem
of graph identification, i.e., the discovery of the true graph structure
underlying an observed network. This involves inferring the nodes, edges, and
node labels of a hidden graph based on evidence provided by the observed graph.
I show how this can be cast as a collective probabilistic inference task and
describe a scalable approach to solving this
problem.
· Jared Saia (UNM): Cooperative Computing for
Autonomous Data Centers [Slides]
Abstract: We
present a new distributed model for graph computations motivated by limited
information sharing. Two or more independent entities have collected large social
graphs. They wish to compute the result of running graph algorithms on the
entire set of relationships. Because the information is sensitive or
economically valuable, they do not wish to simply combine the information in a
single location. We consider two models for computing the solution to graph
algorithms in this setting: 1) limited-sharing: the two entities can share only
a polylogarithmic-size subgraph; 2) low-trust: the entities must not reveal
any information beyond the query answer, assuming they are all honest but
curious. This second model is closely related to the classic secure multiparty
computation problem (MPC), except that input sizes are much larger than is
common. We describe efficient algorithms in these models for two graph theoretic
problems: 1) cooperative computation of s-t connectivity, and 2) finding a
planted clique in a social network. This work is joint with Jonathan W. Berry, Michael J. Collins, Aaron Kearns,
Cynthia A. Phillips, and Randy Smith.
· Bradley Huffaker (CAIDA): Autonomous Systems (AS) Topology: An Introduction [Slides]
Abstract: Topology
maps of the Internet are indispensable for characterizing this critical
infrastructure and understanding its properties, dynamics, and evolution. Most
Internet mapping methods have focused on characterizing and modeling the
network structure at the level of interconnected Autonomous Systems (ASes). In this talk we will introduce various ways to
annotate ASes, go over relevant caveats, and discuss possible areas of interest.