2016 SIAM International Conference on Data Mining (SDMÕ16)

May 5-7, 2016   ¥   Miami, Florida

 

Tutorial Title:  Problems with Incomplete Networks: Biases, Skewed Results, and Solutions

 

Abstract:  Networked representations of physical and social phenomena are often incomplete because the phenomena are partially observed. Working with incomplete networks can skew analyses. Hoping to acquire the full data is often unrealistic, but one may be able to collect data selectively to enrich the incomplete network. For example, suppose a cyber-network administrator has partially observed a network through trace-routes. Which parts of the partially observed network should be more closely examined to give the best (i.e., most complete) view of the entire network? With a limited query budget, how should this further exploration be done? Alternatively, suppose that one has obtained a sample of a Twitter retweet network from a Web site. The sample was collected for some other purpose (unbeknownst to us), and so may not contain the most useful structural information for oneÕs purposes. How should one best supplement this sampled data? This tutorial addresses the aforementioned questions.

 

Presenters

_      Tina Eliassi-Rad, Rutgers University & Northeastern University, tina@eliassi.org

_      Sucheta Soundarajan, Syracuse University, susounda@syr.edu

_      Ali Pinar, Sandia National Laboratories, apinar@sandia.gov

_      Brian Gallagher, Lawrence Livermore National Laboratory, bgallagher@llnl.gov

 

Schedule: This two-hour tutorial will cover the following:

_      Session One (1st Hour)

o   Graph crawling [13]

o   Graph sampling [1][2][3][4][6][7][8][21]

o   Estimating network parameters [9][22][23][24]

_      Session Two (2nd Hour)

o   Enriching nodes and edges [5][19][20]

o   Applications [14][15][16][17][18]

 

Slides: http://eliassi.org/sdm16-tutorial-slides.pdf

 

Target Audience and Prerequisites: Our target audience includes researchers and practitioners in data mining and machine learning, with an interest in incomplete (a.k.a. partially observed) networks and graphs. We are targeting people who are concerned about the latent biases in the Òreal-worldÓ data being used in research and industry. We expect the audience to come away with an overview of the state-of-art in enriching incomplete networks and have a better understanding of the challenges in this area. No assumption is made about familiarity with complex networks, graph mining, graph sampling, and incomplete data. A brief overview of them will be included in the tutorial.

 

References

[1] J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD, pages 631Ð636, 2006.

[2] A. S. Maiya and T. Berger-Wolf. Sampling community structure. In WWW, pages 701Ð710, 2010.

[3] A. S. Maiya and T. Berger-Wolf. Online sampling of high centrality individuals in social networks. In PAKDD, pages 91Ð98, 2010.

[4] S. Hanneke and E. P. Xing. Network completing and survey sampling. In AISTATS, pages 209Ð215, 2009.

[5] M. Kim and J. Leskovec. The network completion problem: Inferring missing nodes and edges in networks. In SDM, pages 47Ð58, 2011.

[6] J. J. Pfeiffer III, J. Neville, and P. N. Bennett. Active sampling of networks. In MLG Workshop, 2012.

[7] J. J. Pfeiffer III, J. Neville, and P. N. Bennett. Active exploration in networks: Using probabilistic relationships for learning and inference. In CIKM, pages 639Ð648, 2014.

[8] D. Thompson, J. Bennett, C. Seshadhri, and A. Pinar. A provably-robust sampling method for generating colormaps of large data. In LDAV, pages 77Ð84, 2013.

[9] C. Tsourakakis. Fast counting of triangles in large real networks without counting: Algorithms and laws. In ICDM, pages 608Ð617, 2008.

[10] J. Wu, X. Li, L. Jiao, X. Wang, and B. Sun. Minimum spanning trees for community detection. Physica A, vol. 392, no. 9, pages 2265Ð2277, 2013.

[11] A. S. Maiya and T. Y. Berger-Wolf. Benefits of bias: Towards better characterization of network sampling. In KDD, pages 105Ð113, 2011.

[12] K. Avrachenkov, P. Basu, G. Neglia, B. Ribeiro, and D. Towsley. Pay Few, Influence Most: Online Myopic Network Covering, In IEEE NetSciCom Workshop, 2014.

[13] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In WWW, pages 161Ð172, 1998.

[14] K. Avrachenkov, N. Litvak, L. O. Prokhorenkova, and E. Sayargulova. Quick detection of high-degree entities in large directed networks. In ICDM, pages 54Ð65, 2014.

[15] R. Cohen, S. Havlin, and D. Ben-Avraham. Efficient immunization strategies for computer networks and populations. Phys. Rev. Lett., vol. 91, no. 24, p. 247901, 2003.

[16] S. A. Macskassy and F. Provost. Suspicion scoring based on guilt-by association, collective inference, and focused data access. In International Conference on Intelligence Analysis, 2005.

[17] M. Bilgic, L. Mihalkova, and L. Getoor. Active learning for networked data. In ICML, pages 79Ð86, 2010.

[18] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD, pages 614Ð622, 2008.

[19] J. Vermorel and M. Mohri. Multi-armed bandit algorithms and empirical evaluation. In ECML, pages 437Ð448, 2005.

[20] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, vol. 5, no. 1, pages 1Ð122, 2012.

[21] A. Dasgupta, R. Kumar, and D. Sivakumar. Social sampling. In KDD, pages 235Ð243, 2012.

[22] A. Dasgupta, R. Kumar, and T. Sarl—s. On estimating the average degree. In WWW, pages 795Ð806, 2014.

[23] C. Cooper, T. Radzik, and Y. Siantos. Estimating network parameters using random walks. In CASoN, pages 33Ð40, 2012.

[24] L. Peel. Estimating network parameters for selecting community detection algorithms. In FUSION, pages 1Ð8, 2010.