2016 SIAM International
Conference on Data Mining (SDMÕ16)
May 5-7,
2016 ¥ Miami, Florida
Tutorial Title: Problems with Incomplete Networks: Biases, Skewed Results,
and Solutions
Abstract: Networked representations of physical and social phenomena
are often incomplete because the phenomena are partially observed. Working with
incomplete networks can skew analyses. Hoping to acquire the full data is often
unrealistic, but one may be able to collect data selectively to enrich the
incomplete network. For example, suppose a cyber-network administrator has
partially observed a network through trace-routes. Which parts of the partially
observed network should be more closely examined to give the best (i.e., most
complete) view of the entire network? With a limited query budget, how should
this further exploration be done? Alternatively, suppose that one has obtained
a sample of a Twitter retweet network from a Web site. The sample was collected
for some other purpose (unbeknownst to us), and so may not contain the most
useful structural information for oneÕs purposes. How should one best
supplement this sampled data? This tutorial addresses the aforementioned
questions.
Presenters
_
Tina Eliassi-Rad,
Rutgers University & Northeastern University, tina@eliassi.org
_
Sucheta Soundarajan, Syracuse
University, susounda@syr.edu
_
Ali Pinar, Sandia National
Laboratories, apinar@sandia.gov
_
Brian Gallagher, Lawrence
Livermore National Laboratory, bgallagher@llnl.gov
Schedule: This two-hour tutorial will cover the
following:
_
Session
One (1st Hour)
o Graph crawling [13]
o Graph sampling [1][2][3][4][6][7][8][21]
o Estimating network parameters [9][22][23][24]
_
Session
Two (2nd Hour)
o Enriching nodes and edges [5][19][20]
o Applications [14][15][16][17][18]
Slides: http://eliassi.org/sdm16-tutorial-slides.pdf
Target Audience and Prerequisites: Our target audience includes researchers
and practitioners in data mining and machine learning, with an interest in incomplete (a.k.a.
partially observed) networks and graphs. We are targeting people who are concerned about the latent
biases in the Òreal-worldÓ data being used in research and industry. We expect the audience to
come away with an overview of the state-of-art in enriching incomplete networks and have a better
understanding of the challenges in this area. No
assumption is made about familiarity with complex networks, graph mining, graph
sampling, and incomplete data. A brief overview of them
will be included in the tutorial.
References
[1] J. Leskovec and C. Faloutsos.
Sampling from large graphs. In KDD, pages 631Ð636, 2006.
[2] A. S.
Maiya and T. Berger-Wolf. Sampling community
structure. In WWW, pages 701Ð710, 2010.
[3] A. S.
Maiya and T. Berger-Wolf. Online sampling of high
centrality individuals in social networks. In PAKDD, pages 91Ð98, 2010.
[4] S.
Hanneke and E. P. Xing. Network completing and survey sampling. In AISTATS,
pages 209Ð215, 2009.
[5] M.
Kim and J. Leskovec. The network completion problem:
Inferring missing nodes and edges in networks. In SDM, pages 47Ð58, 2011.
[6] J. J.
Pfeiffer III, J. Neville, and P. N. Bennett. Active sampling of networks. In
MLG Workshop, 2012.
[7] J. J.
Pfeiffer III, J. Neville, and P. N. Bennett. Active exploration in networks:
Using probabilistic relationships for learning and inference. In CIKM, pages
639Ð648, 2014.
[8] D.
Thompson, J. Bennett, C. Seshadhri, and A. Pinar. A
provably-robust sampling method for generating colormaps
of large data. In LDAV, pages 77Ð84, 2013.
[9] C. Tsourakakis. Fast counting of triangles in large real
networks without counting: Algorithms and laws. In ICDM, pages 608Ð617, 2008.
[10] J.
Wu, X. Li, L. Jiao, X. Wang, and B. Sun. Minimum spanning trees for community
detection. Physica A, vol. 392, no. 9, pages
2265Ð2277, 2013.
[11] A.
S. Maiya and T. Y. Berger-Wolf. Benefits of bias:
Towards better characterization of network sampling. In KDD, pages 105Ð113,
2011.
[12] K. Avrachenkov, P. Basu, G. Neglia, B. Ribeiro, and D. Towsley.
Pay Few, Influence Most: Online Myopic Network Covering, In IEEE NetSciCom Workshop, 2014.
[13] J.
Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In
WWW, pages 161Ð172, 1998.
[14] K. Avrachenkov, N. Litvak, L. O. Prokhorenkova,
and E. Sayargulova. Quick detection of high-degree
entities in large directed networks. In ICDM, pages 54Ð65, 2014.
[15] R.
Cohen, S. Havlin, and D. Ben-Avraham. Efficient
immunization strategies for computer networks and populations. Phys. Rev.
Lett., vol. 91, no. 24, p. 247901, 2003.
[16] S.
A. Macskassy and F. Provost. Suspicion scoring based
on guilt-by association, collective inference, and focused data access. In
International Conference on Intelligence Analysis, 2005.
[17] M. Bilgic, L. Mihalkova, and L. Getoor. Active learning for networked data. In ICML, pages
79Ð86, 2010.
[18] V.
S. Sheng, F. Provost, and P. G. Ipeirotis. Get
another label? Improving data quality and data mining using multiple, noisy
labelers. In KDD, pages 614Ð622, 2008.
[19] J. Vermorel and M. Mohri.
Multi-armed bandit algorithms and empirical evaluation. In ECML, pages 437Ð448,
2005.
[20] S. Bubeck and N. Cesa-Bianchi.
Regret analysis of stochastic and nonstochastic
multi-armed bandit problems. Foundations and Trends in Machine Learning, vol.
5, no. 1, pages 1Ð122, 2012.
[21] A. Dasgupta, R. Kumar, and D. Sivakumar.
Social sampling. In KDD, pages 235Ð243, 2012.
[22] A. Dasgupta, R. Kumar, and T. Sarl—s.
On estimating the average degree. In WWW, pages 795Ð806, 2014.
[23] C.
Cooper, T. Radzik, and Y. Siantos.
Estimating network parameters using random walks. In CASoN,
pages 33Ð40, 2012.
[24] L.
Peel. Estimating network parameters for selecting community detection
algorithms. In FUSION, pages 1Ð8, 2010.