2016 SIAM International Conference on Data Mining (SDM’16)

May 5-7, 2016      Miami, Florida


Tutorial Title:  Problems with Incomplete Networks: Biases, Skewed Results, and Solutions


Abstract:  Networked representations of physical and social phenomena are often incomplete because the phenomena are partially observed. Working with incomplete networks can skew analyses. Hoping to acquire the full data is often unrealistic, but one may be able to collect data selectively to enrich the incomplete network. For example, suppose a cyber-network administrator has partially observed a network through trace-routes. Which parts of the partially observed network should be more closely examined to give the best (i.e., most complete) view of the entire network? With a limited query budget, how should this further exploration be done? Alternatively, suppose that one has obtained a sample of a Twitter retweet network from a Web site. The sample was collected for some other purpose (unbeknownst to us), and so may not contain the most useful structural information for one’s purposes. How should one best supplement this sampled data? This tutorial addresses the aforementioned questions.



_      Tina Eliassi-Rad, Rutgers University & Northeastern University, tina@eliassi.org

_      Sucheta Soundarajan, Syracuse University, susounda@syr.edu

_      Ali Pinar, Sandia National Laboratories, apinar@sandia.gov

_      Brian Gallagher, Lawrence Livermore National Laboratory, bgallagher@llnl.gov


Schedule: This two-hour tutorial will cover the following:

_      Session One (1st Hour)

o   Graph crawling [13]

o   Graph sampling [1][2][3][4][6][7][8][21]

o   Estimating network parameters [9][22][23][24]

_      Session Two (2nd Hour)

o   Enriching nodes and edges [5][19][20]

o   Applications [14][15][16][17][18]


Slides: http://eliassi.org/sdm16-tutorial-slides.pdf


Target Audience and Prerequisites: Our target audience includes researchers and practitioners in data mining and machine learning, with an interest in incomplete (a.k.a. partially observed) networks and graphs. We are targeting people who are concerned about the latent biases in the “real-world” data being used in research and industry. We expect the audience to come away with an overview of the state-of-art in enriching incomplete networks and have a better understanding of the challenges in this area. No assumption is made about familiarity with complex networks, graph mining, graph sampling, and incomplete data. A brief overview of them will be included in the tutorial.



[1] J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD, pages 631–636, 2006.

[2] A. S. Maiya and T. Berger-Wolf. Sampling community structure. In WWW, pages 701–710, 2010.

[3] A. S. Maiya and T. Berger-Wolf. Online sampling of high centrality individuals in social networks. In PAKDD, pages 91–98, 2010.

[4] S. Hanneke and E. P. Xing. Network completing and survey sampling. In AISTATS, pages 209–215, 2009.

[5] M. Kim and J. Leskovec. The network completion problem: Inferring missing nodes and edges in networks. In SDM, pages 47–58, 2011.

[6] J. J. Pfeiffer III, J. Neville, and P. N. Bennett. Active sampling of networks. In MLG Workshop, 2012.

[7] J. J. Pfeiffer III, J. Neville, and P. N. Bennett. Active exploration in networks: Using probabilistic relationships for learning and inference. In CIKM, pages 639–648, 2014.

[8] D. Thompson, J. Bennett, C. Seshadhri, and A. Pinar. A provably-robust sampling method for generating colormaps of large data. In LDAV, pages 77–84, 2013.

[9] C. Tsourakakis. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM, pages 608–617, 2008.

[10] J. Wu, X. Li, L. Jiao, X. Wang, and B. Sun. Minimum spanning trees for community detection. Physica A, vol. 392, no. 9, pages 2265–2277, 2013.

[11] A. S. Maiya and T. Y. Berger-Wolf. Benefits of bias: Towards better characterization of network sampling. In KDD, pages 105–113, 2011.

[12] K. Avrachenkov, P. Basu, G. Neglia, B. Ribeiro, and D. Towsley. Pay Few, Influence Most: Online Myopic Network Covering, In IEEE NetSciCom Workshop, 2014.

[13] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In WWW, pages 161–172, 1998.

[14] K. Avrachenkov, N. Litvak, L. O. Prokhorenkova, and E. Sayargulova. Quick detection of high-degree entities in large directed networks. In ICDM, pages 54–65, 2014.

[15] R. Cohen, S. Havlin, and D. Ben-Avraham. Efficient immunization strategies for computer networks and populations. Phys. Rev. Lett., vol. 91, no. 24, p. 247901, 2003.

[16] S. A. Macskassy and F. Provost. Suspicion scoring based on guilt-by association, collective inference, and focused data access. In International Conference on Intelligence Analysis, 2005.

[17] M. Bilgic, L. Mihalkova, and L. Getoor. Active learning for networked data. In ICML, pages 79–86, 2010.

[18] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD, pages 614–622, 2008.

[19] J. Vermorel and M. Mohri. Multi-armed bandit algorithms and empirical evaluation. In ECML, pages 437–448, 2005.

[20] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, vol. 5, no. 1, pages 1–122, 2012.

[21] A. Dasgupta, R. Kumar, and D. Sivakumar. Social sampling. In KDD, pages 235–243, 2012.

[22] A. Dasgupta, R. Kumar, and T. Sarlós. On estimating the average degree. In WWW, pages 795–806, 2014.

[23] C. Cooper, T. Radzik, and Y. Siantos. Estimating network parameters using random walks. In CASoN, pages 33–40, 2012.

[24] L. Peel. Estimating network parameters for selecting community detection algorithms. In FUSION, pages 1–8, 2010.