Fall 2018: CS6220 Data Mining Techniques, Section 01, CRN 12896

 

General Information

 

Lecture time: Tuesdays & Fridays, 3:25 – 5:05 PM

Place: Behrakis Health Sciences Center, Room 320

Instructor: Tina Eliassi-Rad

Office hours: Tuesdays 5:30 – 6:30 PM in West Village H (WVH), Room 362
Also available by appointment. Email eliassi [at] ccs [dot] neu [dot] edu to set up an appointment; begin the subject line with [fa18 cs6220].

TAs: Zikun Lin & Hui “Sophie” Wang

TA office hours:

·      Zikun Lin: Wednesdays 3:00 – 4:30 PM in Nightingale Hall, Room 132i

·      Sophie Wang: Thursdays 4:00 – 5:30 PM in Ryder Hall, Room 273
Also available by appointment. Email lin.zik [at] husky [dot] neu [dot] edu and/or wang.hui1 [at] husky [dot] neu [dot] edu to set up an appointment; begin the subject line with [fa18 cs6220].

 

Overview

 

This 4-credit graduate-level course covers data mining and unsupervised learning. Its prerequisites are:

o   Graduate-level CS 5800 or CS 7800 with a minimum grade of C-.

o   Calculus and linear algebra.

o   An introductory course on statistics and probability.

o   Algorithms and programming (MATLAB, Python, or R).

 

Textbooks

 

This course does not have a designated textbook. The readings are assigned in the syllabus (see below). Here are some textbooks (all optional) related to the course.

 

Resources

 

 

Grading

 

o   Homework assignments (3 × 10% = 30%)

o   Midterm exam (20%)

o   Final exam (20%)

o   Class project (30%), broken down as follows:

o   2-page proposal & 5-minute in-class pitch (8%)

·      The proposal should answer the following questions:

-   What is the problem?

-   Why is it interesting and important?

-   Why is it hard? Why have previous approaches failed?

-   What are the key components of your approach?

-   What data sets and metrics will be used to validate the approach?

o   In-class presentation (10%)

o   Final report (12%) – maximum 6 pages

·      For guidance on writing the final report, see slide 70 of Eamonn Keogh's KDD'09 Tutorial on How to do good research, get it published in SIGKDD and get it cited!

·      Follow ACM formatting guidelines.

 

Schedule/Syllabus (Subject to Change)

 

Each entry below lists the lecture number, date, topic, and the assigned readings & notes.

Lecture 1 (F 9/7): Introduction and Overview

o   Chapter 1 of http://eliassi.org/mmds-book-v2L.pdf

o   http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

o   http://eliassi.org/ml-science-2015.pdf

Lecture 2 (T 9/11): Review of Linear Algebra & Probability (guest lecturer: Dr. Onur Varol)

o   Linear Algebra Tutorial (C.T. Abdallah, Penn)

o   Linear Algebra Review and Reference (Zico Kolter and Chuong Do, Stanford)

o   Probability Review (David Blei, Princeton)

o   Probability Theory Review (Arian Maleki and Tom Do, Stanford)

Lecture 3 (F 9/14): Social Bots (guest lecturer: Dr. Onur Varol)

o   https://arxiv.org/abs/1407.5225

o   https://arxiv.org/abs/1601.05140

o   https://arxiv.org/abs/1703.03107

Lecture 4 (T 9/18): Frequent Itemsets & Association Rules

o   Chapter 6 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Lecture 5 (F 9/21): Frequent Itemsets & Association Rules

o   Chapter 6 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
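
For a quick feel for these ideas, here is a minimal brute-force sketch in Python (an accepted course language) that finds frequent itemsets and association rules by support and confidence on a made-up basket dataset; it is illustrative only and is not the A-Priori implementation covered in the readings:

    # Illustrative only: brute-force frequent itemsets and association rules
    # on a tiny made-up basket dataset (not the A-Priori algorithm itself).
    from itertools import combinations

    baskets = [
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "cola"},
    ]
    min_support, min_confidence = 0.4, 0.7

    def support(itemset):
        """Fraction of baskets that contain every item in `itemset`."""
        return sum(itemset <= b for b in baskets) / len(baskets)

    items = sorted(set().union(*baskets))
    # Enumerate candidate itemsets up to size 3 (fine for toy data; A-Priori
    # prunes this search using the monotonicity of support).
    frequent = [frozenset(c) for k in range(1, 4)
                for c in combinations(items, k) if support(frozenset(c)) >= min_support]

    for itemset in frequent:
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, k)):
                conf = support(itemset) / support(lhs)
                if conf >= min_confidence:
                    print(f"{set(lhs)} -> {set(itemset - lhs)}  "
                          f"(support {support(itemset):.2f}, confidence {conf:.2f})")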

Lecture 6 (T 9/25): Density Estimation

o   http://ned.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html

o   http://eliassi.org/Sheather_StatSci_2004.pdf

o   Optional: Sections 6.6-6.9 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
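
As a concrete illustration of the topic (not a substitute for the readings), here is a short kernel density estimation sketch in Python with NumPy on synthetic data, using a Gaussian kernel and Silverman's rule-of-thumb bandwidth:

    # Illustrative only: 1-D Gaussian kernel density estimation on synthetic data.
    import numpy as np

    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])

    n = data.size
    h = 1.06 * data.std(ddof=1) * n ** (-1 / 5)       # rule-of-thumb bandwidth

    def kde(x, data, h):
        """Average of Gaussian kernels centered at each data point."""
        z = (x[:, None] - data[None, :]) / h          # shape (len(x), n)
        return (np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

    grid = np.linspace(-5, 5, 201)
    density = kde(grid, data, h)
    print("density integrates to ~", (density * (grid[1] - grid[0])).sum())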

Homework #1

o   out on Tuesday September 25

o   due on Friday October 5 at 11:59 PM Eastern

o   graded by Friday October 19

Lecture 7 (F 9/28): Finding Similar Items

o   Chapter 3 of http://eliassi.org/mmds-book-v2L.pdf
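
For orientation, here is a small MinHash sketch in Python (illustrative only, with made-up sets): the fraction of matching signature positions estimates the Jaccard similarity, which is the idea behind the locality-sensitive hashing pipeline in Chapter 3.

    # Illustrative only: MinHash signatures as an estimate of Jaccard similarity.
    import random

    def minhash_signature(s, seeds):
        """One minimum per seeded hash function."""
        return [min(hash((seed, x)) for x in s) for seed in seeds]

    random.seed(0)
    seeds = [random.randrange(2**32) for _ in range(128)]

    A = set("the quick brown fox jumps over the lazy dog".split())
    B = set("the quick brown fox leaps over a sleepy dog".split())

    sig_a, sig_b = minhash_signature(A, seeds), minhash_signature(B, seeds)
    estimate = sum(a == b for a, b in zip(sig_a, sig_b)) / len(seeds)
    exact = len(A & B) / len(A | B)
    print(f"estimated Jaccard = {estimate:.2f}, exact Jaccard = {exact:.2f}")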

Lectures 8 & 9 (T 10/2 & F 10/5): Mining Data Streams

o   Chapter 4 of http://eliassi.org/mmds-book-v2L.pdf
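
As one concrete example of the single-pass, limited-memory flavor of this material, here is a short reservoir-sampling sketch in Python (illustrative only); it keeps a uniform fixed-size sample of a stream in one pass:

    # Illustrative only: reservoir sampling keeps k items chosen uniformly
    # at random from a stream, using O(k) memory and one pass.
    import random

    def reservoir_sample(stream, k, seed=0):
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)           # fill the reservoir first
            else:
                j = rng.randrange(i + 1)         # keep item i with probability k/(i+1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(1_000_000), k=5))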

Lectures 10 & 11 (T 10/9 & F 10/12): Dimensionality Reduction (PCA, SVD, CUR, Kernel PCA)

o   Chapter 11 of http://eliassi.org/mmds-book-v2L.pdf

o   Section 14.5 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

o   https://alex.smola.org/papers/1999/MikSchSmoMuletal99.pdf
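
For a hands-on illustration of the PCA/SVD connection (synthetic data, NumPy only; a minimal sketch, not part of the assigned readings):

    # Illustrative only: PCA of a centered data matrix via the SVD.
    import numpy as np

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 2))                        # 2 underlying directions
    mixing = rng.normal(size=(2, 5))
    X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))    # 5-D observations

    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    explained = S**2 / np.sum(S**2)              # variance fraction per component
    scores = Xc @ Vt[:2].T                       # project onto top-2 principal axes

    print("variance explained by first two components:", explained[:2].round(3))
    print("projected data shape:", scores.shape)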

Homework #2

o   out on Friday October 12

o   due on Tuesday October 23 at 11:59 PM Eastern

o   graded by Tuesday November 6

Lectures 12 & 13 (T 10/16 & F 10/19): Clustering: K-means, Gaussian Mixture Models, Expectation Maximization (EM)

o   Chapter 9 of http://robotics.stanford.edu/~nilsson/MLBOOK.pdf

o   Sections 7.1-7.3 of http://eliassi.org/mmds-book-v2L.pdf 

o   Chapter 8 of https://www.cs.cornell.edu/jeh/book2016June9.pdf 

o   Section 14.3 of http://statweb.stanford.edu/~tibs/ElemStatLearn 

o   http://cs229.stanford.edu/notes/cs229-notes7b.pdf 

o   http://cs229.stanford.edu/notes/cs229-notes8.pdf 

o   Optional: https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

o   Optional: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf 

o   Optional: http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf
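
As a warm-up for these lectures, here is a minimal sketch of Lloyd's algorithm for k-means on synthetic 2-D data (illustrative only, NumPy): it alternates an assignment step and an update step, a pattern the readings generalize to EM for Gaussian mixtures.

    # Illustrative only: Lloyd's algorithm for k-means on synthetic 2-D data.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]     # random init
        for _ in range(n_iter):
            # Assignment step: nearest center for every point.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each center to the mean of its points.
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    centers, labels = kmeans(X, k=3)
    print("cluster centers:\n", centers.round(2))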

Lecture 14 (T 10/23): EM, K-medoids, Hierarchical Clustering, Evaluation Metrics, and Practical Issues

o   Same readings as for 10/16 & 10/19

Lecture 15 (F 10/26): Spectral Clustering

o   http://ai.stanford.edu/~ang/papers/nips01-spectral.pdf

o   http://www.cs.columbia.edu/~jebara/4772/papers/Luxburg07_tutorial.pdf
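
For a concrete illustration of the recipe in the tutorials above (a minimal sketch on synthetic two-ring data, NumPy only; not a robust implementation):

    # Illustrative only: unnormalized spectral clustering on two concentric rings.
    import numpy as np

    rng = np.random.default_rng(0)
    def ring(radius, n):
        theta = rng.uniform(0, 2 * np.pi, n)
        pts = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
        return pts + 0.05 * rng.normal(size=(n, 2))
    X = np.vstack([ring(1.0, 150), ring(3.0, 150)])

    # Gaussian (RBF) similarity graph.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2 * 0.3**2))
    np.fill_diagonal(W, 0.0)

    L = np.diag(W.sum(axis=1)) - W               # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)         # eigenvalues in ascending order

    # For two clusters, the sign of the second eigenvector splits the graph;
    # for k > 2, run k-means on the first k eigenvectors instead.
    labels = (eigvecs[:, 1] > 0).astype(int)
    print("cluster sizes:", np.bincount(labels))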

Lecture 16 (T 10/30): Midterm Exam (in-class)

o   Graded by Tuesday November 13

Lecture 17 (F 11/2): Project Proposal Pitches

Lecture 18 (T 11/6): Link Analysis

o   Chapter 5 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: http://bit.ly/2iYxo82
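
To make the core idea concrete, here is a minimal PageRank sketch on a hypothetical four-page link graph (NumPy only, illustrative; it does not handle the dead ends and spider traps discussed in Chapter 5):

    # Illustrative only: PageRank by power iteration with teleportation.
    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # page -> pages it links to
    n = 4
    M = np.zeros((n, n))                             # column-stochastic transition matrix
    for j, outs in links.items():
        for i in outs:
            M[i, j] = 1.0 / len(outs)

    beta = 0.85                                      # probability of following a link
    v = np.full(n, 1.0 / n)                          # start from the uniform distribution
    for _ in range(100):
        v_next = beta * (M @ v) + (1 - beta) / n     # follow a link or teleport
        if np.abs(v_next - v).sum() < 1e-10:
            break
        v = v_next

    print("PageRank vector:", v.round(4))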

Lecture 19 (F 11/9): Recommendation Systems

o   Chapter 9 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 20 (T 11/13): Recommendation Systems

o   Chapter 9 of http://eliassi.org/mmds-book-v2L.pdf
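
As one concrete example of the collaborative-filtering flavor of these lectures, here is a minimal item-item sketch on a made-up utility matrix (NumPy only, illustrative; it glosses over the handling of missing ratings discussed in Chapter 9):

    # Illustrative only: item-item collaborative filtering with cosine similarity.
    import numpy as np

    R = np.array([                # rows = users, columns = items; 0 means "not rated"
        [5, 4, 0, 1, 0],
        [4, 0, 4, 1, 1],
        [1, 1, 0, 5, 4],
        [0, 1, 5, 4, 0],
    ], dtype=float)

    norms = np.linalg.norm(R, axis=0)
    S = (R.T @ R) / np.outer(norms, norms)           # cosine similarity between item columns
    np.fill_diagonal(S, 0.0)

    def predict(user, item, k=2):
        """Weighted average of the user's ratings on the k most similar items."""
        rated = np.flatnonzero(R[user] > 0)
        neighbors = rated[np.argsort(S[item, rated])[::-1][:k]]
        weights = S[item, neighbors]
        return float(weights @ R[user, neighbors] / weights.sum())

    print("predicted rating of user 0 on item 2:", round(predict(0, 2), 2))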

Lecture 21 (F 11/16): Matrix Factorization

o   Chapter 14.6 of http://statweb.stanford.edu/~tibs/ElemStatLearn/

o   http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf

o   Optional: http://www.sandia.gov/~tgkolda/pubs/pubfiles/TensorReview.pdf
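
For a hands-on feel for the Lee & Seung paper listed above, here is a short sketch of its multiplicative update rules for reducing the Frobenius reconstruction error (illustrative only, random toy data, NumPy):

    # Illustrative only: multiplicative updates for non-negative matrix
    # factorization, approximately minimizing ||V - WH||_F with W, H >= 0.
    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.random((20, 12))          # non-negative data matrix
    r = 3                             # rank of the factorization
    W = rng.random((20, r))
    H = rng.random((r, 12))

    eps = 1e-10                       # guard against division by zero
    for _ in range(500):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)

    print("reconstruction error:", np.linalg.norm(V - W @ H).round(3))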

Homework #3

o   out on Friday November 16

o   due on Tuesday November 27 at 11:59 PM Eastern

o   graded by Tuesday December 11

Lecture 22 (T 11/20): Topic Models

o   http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf

o   http://www.cs.columbia.edu/~blei/papers/Blei2011.pdf

o   http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf
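
For orientation only, here is a tiny end-to-end example of fitting a topic model to a handful of made-up documents; it assumes scikit-learn is available (the readings above cover the model itself, not this library):

    # Illustrative only: LDA on a few made-up documents via scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the team won the game with a late goal",
        "the striker scored a goal in the final game",
        "the coach praised the team after the win",
        "the senate passed the budget bill after a long debate",
        "voters elected a new senator in the state election",
        "the election debate focused on the budget and taxes",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)                       # document-term count matrix
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top_terms = [terms[i] for i in topic.argsort()[::-1][:5]]
        print(f"topic {k}: {', '.join(top_terms)}")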

Lecture 23 (F 11/23): Thanksgiving Break (no class)

 

Lecture 24 (T 11/27): Hidden Markov Models & Review for Final

o   http://bit.ly/2iYxInD

o   http://mlg.eng.cam.ac.uk/zoubin/papers/ijprai.pdf

o   https://www.nature.com/articles/nbt1004-1315
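
To make the model concrete before the review, here is a short sketch of the forward algorithm for a small hypothetical HMM (illustrative only, NumPy), computing the likelihood of an observation sequence:

    # Illustrative only: the forward algorithm for a 2-state HMM with 2 symbols.
    import numpy as np

    A = np.array([[0.7, 0.3],          # state-transition probabilities
                  [0.4, 0.6]])
    B = np.array([[0.9, 0.1],          # emission probabilities P(symbol | state)
                  [0.2, 0.8]])
    pi = np.array([0.5, 0.5])          # initial state distribution

    obs = [0, 0, 1, 0, 1]              # an observed symbol sequence

    # alpha[i] = P(observations so far, current state = i)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]

    print("P(observation sequence) =", alpha.sum())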

Lecture 25 (F 11/30): Final Exam (in-class)

o   Graded by Friday December 14

Lecture 26 (T 12/4): Project Presentations, Group 1 (in-class)

Lecture 27 (F 12/7): Project Presentations, Group 2 (in-class); last class day

Project reports

o   due on Tuesday December 11 at 11:59 PM

o   graded by Sunday December 16

Final grades are due to the Registrar's Office on Monday December 17 at 2:00 PM Eastern.

 

Notes, Policies, and Guidelines

 

o   We will use Northeastern’s Blackboard for announcements, assignments, and your contributions.

o   When emailing me or the TAs about the course, begin the subject line with [fa18 cs6220].

o   Programming exercises will be accepted in MATLAB, Python, or R.

o   Homework must be done individually. Late homework is accepted up to 2 days after the deadline. A penalty of 20% will be charged for each late day.

o   The class project can be done either individually or in groups of two.

o   For your class project, you may use whatever programming language you like.

o   Any regrading request must be submitted in writing within 48 hours of the material being returned. The request must detail precisely and concisely the grading error.

o   Refresh your knowledge of the university's academic integrity and plagiarism policies. There is zero tolerance for cheating!

o   Letter grades will be assigned based on the following grade scale (with numbers rounding up; e.g., 92.4 becomes an A):

A     93-100
A-    90-92
B+    87-89
B     83-86
B-    80-82
C+    77-79
C     73-76
C-    70-72
F     < 70