Spring 2019: CS 6220 -- Data Mining Techniques, CRN 33181 cross-listed with

Spring 2019: DS 5230 -- Unsupervised Machine Learning and Data Mining, CRN 34643

 

General Information

 

Lecture time: Tuesdays & Fridays
1:35 – 3:15 PM

Place: International Village, Room 019

Instructor: Tina Eliassi-Rad

Office hours: Tuesdays 3:30 – 5:00 PM in International Village, Room 016
Also available by appointment. Email eliassi [at] ccs [dot] neu [dot] edu to set up an appointment; begin the subject line with [sp19 dm].

TA: Deeksha Doddahonnaiah

Office hours: Thursdays 3:30 – 5:00 PM in West Village F, Room 118

Also available by appointment. Email doddahonnaiah.d [at] husky [dot] neu [dot] edu; begin the subject line with [sp19 dm].

TA: Hui “Sophie” Wang

Office hours: Wednesdays 10:00 – 11:30 AM in Hastings Hall at the YMCA, Room 105

Also available by appointment. Email wang.hui1 [at] husky [dot] neu [dot] edu; begin the subject line with [sp19 dm].

 

Overview

 

This 4-credit graduate-level course covers data mining and unsupervised learning. Its prerequisites are:

o   Graduate-level CS 5800, CS 7800, or EECE 7205 with a minimum grade of C-.

o   Calculus and linear algebra.

o   An introductory course on statistics and probability.

o   Algorithms and programming (MATLAB, Python, or R).

 

Textbooks

 

This course does not have a designated textbook. The readings are assigned in the syllabus (see below). Here are some textbooks (all optional) related to the course.

 

Resources

 

 

Grading

 

o   Homework assignments (30% = 5 * 6%)

o   Midterm exam (20%)

o   Final exam (20%)

o   Class project (30%)

o   2-page proposal & 5-minute in-class pitch (8%)

·      Should include answers to the following questions:

-   What is the problem?

-   Why is it interesting and important?

-   Why is it hard? Why have previous approaches failed?

-   What are the key components of your approach?

-   What data sets and metrics will be used to validate the approach?

o   In-class presentation (10%)

o   Final report (12%) – maximum 6 pages

·      For guidance on writing the final report, see slide 70 of Eamonn Keogh's KDD'09 Tutorial on How to do good research, get it published in SIGKDD and get it cited!

·      Follow ACM formatting guidelines.

 

Schedule/Syllabus (Subject to Change)

 

Each entry below lists the lecture number, date, topic, and readings & notes.

Lecture 1 (T 1/8): Introduction and Overview

o   Chapter 1 of http://eliassi.org/mmds-book-v2L.pdf

o   http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

o   http://eliassi.org/ml-science-2015.pdf

o   https://www.pnas.org/content/pnas/114/33/8689.full.pdf

Lecture 2 (F 1/11): Review of Linear Algebra & Probability

o   Linear Algebra Tutorial (C.T. Abdallah, Penn)

o   Linear Algebra Review and Reference (Zico Kolter and Chuong Do, Stanford)

o   Probability Review (David Blei, Princeton)

o   Probability Theory Review (Arian Maleki and Tom Do, Stanford)

Lecture 3 (T 1/15): Density Estimation

o   http://ned.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html

o   http://eliassi.org/Sheather_StatSci_2004.pdf

o   Optional: Sections 6.6-6.9 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

Lecture 4 (F 1/18): Frequent Itemsets & Association Rules

o   Chapter 6 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Lecture 5 (T 1/22): Frequent Itemsets & Association Rules

o   Chapter 6 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Homework #1

o   out on Tuesday January 22

o   due on Friday February 1 at 11:59 PM Eastern

o   graded by Friday February 15

Lecture 6 (F 1/25): Finding Similar Items

o   Chapter 3 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 7 (T 1/29): Finding Similar Items

o   Chapter 3 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 8 (F 2/1): Mining Data Streams

o   Chapter 4 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 9 (T 2/5): Mining Data Streams

o   Chapter 4 of http://eliassi.org/mmds-book-v2L.pdf

Homework #2

o   out on Tuesday February 5

o   due on Friday February 15 at 11:59 PM Eastern

o   graded by Friday March 1

Lectures 10 & 11 (F 2/8 & T 2/12): Dimensionality Reduction (PCA, SVD, CUR, Kernel PCA)

o   Chapter 11 of http://eliassi.org/mmds-book-v2L.pdf

o   Section 14.5 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

o   https://alex.smola.org/papers/1999/MikSchSmoMuletal99.pdf

Lectures 12 & 13 (F 2/15 & T 2/19): Clustering: K-means, Gaussian Mixture Models, Expectation Maximization (EM)

o   Chapter 9 of http://robotics.stanford.edu/~nilsson/MLBOOK.pdf

o   Sections 7.1-7.3 of http://eliassi.org/mmds-book-v2L.pdf 

o   Chapter 8 of https://www.cs.cornell.edu/jeh/book2016June9.pdf 

o   Section 14.3 of http://statweb.stanford.edu/~tibs/ElemStatLearn 

o   http://cs229.stanford.edu/notes/cs229-notes7b.pdf 

o   http://cs229.stanford.edu/notes/cs229-notes8.pdf 

o   Optional: https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

o   Optional: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf 

o   Optional: http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf

Homework #3

o   out on Tuesday February 19

o   due on Friday March 1 at 11:59 PM Eastern

o   graded by Friday March 15

Lecture 14 (F 2/22 & T 2/26): EM, K-medoids, Hierarchical Clustering, Evaluation Metrics and Practical Issues

o   Same readings as for Lectures 12 & 13 (F 2/15 & T 2/19)

Lecture 15 (F 3/1): Midterm Exam (in-class)

Graded by Tuesday March 19

T 3/5 & F 3/8: Spring Break (no class)

 

Lecture 16 (T 3/12): Project Proposal Pitches (in-class)

Graded by Tuesday March 19

Lecture 17 (F 3/15): Spectral Clustering

o   http://ai.stanford.edu/~ang/papers/nips01-spectral.pdf

o   http://www.cs.columbia.edu/~jebara/4772/papers/Luxburg07_tutorial.pdf

Lecture 18 (T 3/19): Link Analysis

o   Chapter 5 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: http://bit.ly/2iYxo82

Homework #4

o   out on Tuesday March 19

o   due on Friday March 29 at 11:59 PM Eastern

o   graded by Friday April 12

Lecture 19 (F 3/22): Recommendation Systems

o   Chapter 9 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 20 (T 3/26): Recommendation Systems

o   Chapter 9 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 21 (F 3/29): Matrix Factorization

o   Section 14.6 of http://statweb.stanford.edu/~tibs/ElemStatLearn/

o   http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf

o   Optional: http://www.sandia.gov/~tgkolda/pubs/pubfiles/TensorReview.pdf

Homework #5

o   out on Friday March 29

o   due on Tuesday April 9 at 11:59 PM Eastern

o   graded by Tuesday April 16

Lectures 22 & 23 (T 4/2 & F 4/5): Topic Models

o   http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf

o   http://www.cs.columbia.edu/~blei/papers/Blei2011.pdf

o   http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf

Lecture 24 (T 4/9): Hidden Markov Models & Review for Final

o   http://bit.ly/2iYxInD

o   http://mlg.eng.cam.ac.uk/zoubin/papers/ijprai.pdf

o   https://www.nature.com/articles/nbt1004-1315

Lecture 25 (F 4/12): Final Exam (in-class)

Graded by Friday April 26

Lecture 26 (T 4/16): Project Presentations: Group 1 (in-class)

Lecture 27 (F 4/19): Project Presentations: Group 2 (in-class). Last day of class.

Project reports

o   due on Tuesday April 23 at 11:59 PM

o   graded by Sunday April 28

Final grades are due to the Registrar's Office on Monday April 29 at 9:00 AM Eastern.

 

Notes, Policies, and Guidelines

 

o   We will use Northeastern’s Blackboard for announcements, assignments, and your contributions.

o   When emailing me or the TA about the course, begin the subject line with [sp19 dm].

o   Programming exercises will be accepted in MATLAB, Python, or R.

o   Homeworks must be done individually. Late homeworks are accepted up to 2 days after the deadline. A penalty of 20% will be charged for each late day.

o   The class project can be done either individually or in groups of two.

o   For your class project, you may use whatever programming language you like.

o   Any regrading request must be submitted in writing within 48 hours of the material being returned. The request must detail precisely and concisely the grading error.

o   Refresh your knowledge of the university's academic integrity policy and plagiarism. There is zero tolerance for cheating!

o   Letter grades will be assigned based on the following grade scale, with numbers rounded up (e.g., 92.4 rounds up to 93, which is an A):

A: 93-100
A-: 90-92
B+: 87-89
B: 83-86
B-: 80-82
C+: 77-79
C: 73-76
C-: 70-72
F: below 70
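As a quick illustration of the scale and the rounding-up rule, the mapping can be sketched in Python (a hypothetical helper for checking your own score, not part of the course materials):

```python
import math

# Letter-grade cutoffs from the scale above: (lower bound, letter).
CUTOFFS = [(93, "A"), (90, "A-"), (87, "B+"), (83, "B"),
           (80, "B-"), (77, "C+"), (73, "C"), (70, "C-")]

def letter_grade(score: float) -> str:
    """Map a numeric course score (0-100) to a letter grade.

    The score is first rounded up to the nearest integer,
    so 92.4 becomes 93, which falls in the A range.
    """
    rounded = math.ceil(score)
    for lower_bound, letter in CUTOFFS:
        if rounded >= lower_bound:
            return letter
    return "F"  # anything below 70
```

For example, `letter_grade(92.4)` returns `"A"`, while `letter_grade(92.0)` stays at 92 and returns `"A-"`.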