Spring 2017: CS6220 Data Mining Techniques, Section 01

 

General Information

 

Lecture time: Mondays & Wednesdays
2:50 - 4:30 PM

Place: Science Engineering Complex, Room 136
(Building #83 on the campus map)

Instructor: Tina Eliassi-Rad

Office hours: Mondays 5:00 - 6:00 PM in WVH 242
Also available by appointment. Email eliassi [at] ccs [dot] neu [dot] edu to set up an appointment; begin the subject line with [sp17 cs6220].

TA: Xuan Han

TA office hours: Wednesdays and Thursdays 6:00 - 7:00 PM in WVH 242
Also available by appointment. Email han.xua [at] husky [dot] neu [dot] edu to set up an appointment; begin the subject line with [sp17 cs6220].

 

Overview

 

This 4-credit graduate-level course covers data mining and unsupervised learning. Its prerequisites are:

 

o   Graduate-level CS 5800 or graduate-level CS 7800, with a minimum grade of C- in either.

o   Calculus and linear algebra.

o   An introductory course on statistics and probability.

o   Algorithms and programming (MATLAB, Python, or R).

 

Textbooks

 

This course does not have a designated textbook. The readings are assigned in the syllabus (see below). Here are some textbooks (all optional) related to the course.

 

 

Resources

 

 

Grading

 

o   Homework assignments (3 × 10%)

o   Midterm exam (20%)

o   Final exam (20%)

o   Class project (30%)

    o   2-page proposal & 5-minute in-class pitch (8%)

        -   Should include answers to the following questions:

            -   What is the problem?

            -   Why is it interesting and important?

            -   Why is it hard? Why have previous approaches failed?

            -   What are the key components of your approach?

            -   What data sets and metrics will be used to validate the approach?

    o   In-class presentation (10%)

    o   Final report (12%), maximum 6 pages

        -   For guidance on writing the final report, see slide 70 of Eamonn Keogh's KDD'09 Tutorial on How to do good research, get it published in SIGKDD and get it cited!

        -   Follow ACM formatting guidelines.
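
As a rough illustration of how the weights above combine into a course total, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale; the names WEIGHTS and weighted_total are hypothetical and are not part of any official course tooling.

```python
# Weights (percent of the final grade) copied from the Grading list.
# The component names are illustrative labels, not official identifiers.
WEIGHTS = {
    "homework_1": 10,
    "homework_2": 10,
    "homework_3": 10,
    "midterm": 20,
    "final_exam": 20,
    "project_proposal_pitch": 8,
    "project_presentation": 10,
    "project_report": 12,
}


def weighted_total(scores):
    """Combine per-component scores (assumed 0-100 each) into a course total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Example: full marks everywhere except an 80 on the midterm.
example = dict.fromkeys(WEIGHTS, 100)
example["midterm"] = 80
print(weighted_total(example))  # 96.0
```

The three project components sum to the 30% listed for the class project, so the weights total 100.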

 

Schedule/Syllabus (Subject to Change)

 

Lec # | Date | Topic | Readings & Notes

1 | M 1/9 | Introduction and Overview

o   Chapter 1 of http://www.mmds.org/#book

o   http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

o   http://eliassi.org/ml-science-2015.pdf

2 | W 1/11 | Frequent Itemsets & Association Rules

o   Chapter 6 of http://www.mmds.org/#book

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

- | M 1/16 | MLK Jr Day (no class)

 

3 | W 1/18 | Frequent Itemsets & Association Rules

o   Chapter 6 of http://www.mmds.org/#book

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

4 | M 1/23 | Density Estimation

o   http://ned.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html

o   http://www.stat.washington.edu/courses/stat527/s13/readings/Sheather_StatSci_2004.pdf

o   Optional: Sections 6.6-6.9 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

Homework #1

o   out on Monday January 23

o   due on Sunday February 5 at 11:59 PM Eastern

o   grade by Monday February 20

5 | W 1/25 | Finding Similar Items

o   Chapter 3 of http://www.mmds.org/#book

6, 7 | M 1/30, W 2/1 | Mining Data Streams

o   Chapter 4 of http://www.mmds.org/#book

8, 9 | M 2/6, W 2/8 | Dimensionality Reduction (PCA, SVD, CUR, Kernel PCA)

o   Chapter 11 of http://www.mmds.org/#book

o   Section 14.5 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

o   https://alex.smola.org/papers/1999/MikSchSmoMuletal99.pdf

Homework #2

o   out on Wednesday February 8

o   due on Tuesday February 21 at 11:59 PM Eastern

o   grade by Monday March 13

10, 11 | M 2/13, W 2/15 | Clustering: K-means, Gaussian Mixture Models, Expectation Maximization (EM)

o   Chapter 9 of http://robotics.stanford.edu/~nilsson/MLBOOK.pdf

o   Sections 7.1-7.3 of http://www.mmds.org/#book 

o   Chapter 8 of https://www.cs.cornell.edu/jeh/book2016June9.pdf 

o   Section 14.3 of http://statweb.stanford.edu/~tibs/ElemStatLearn 

o   http://cs229.stanford.edu/notes/cs229-notes7b.pdf 

o   http://cs229.stanford.edu/notes/cs229-notes8.pdf 

o   Optional: https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

o   Optional: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf 

o   Optional: http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf

- | M 2/20 | President's Day (no class)

 

12, 13 | W 2/22, M 2/27 | EM, K-medoids, Hierarchical Clustering, Evaluation Metrics and Practical Issues

o   Same readings as for 2/13 and 2/15

14 | W 3/1 | Midterm Exam (in-class)

Grade by Monday March 13

- | M 3/06, W 3/08 | Spring Break (no class)

 

15 | M 3/13 | Project Proposal Pitches + Review of Midterm

16 | W 3/15 | Spectral Clustering

o   http://ai.stanford.edu/~ang/papers/nips01-spectral.pdf

o   http://www.cs.columbia.edu/~jebara/4772/papers/Luxburg07_tutorial.pdf

17 | M 3/20 | Link Analysis

o   Chapter 5 of http://www.mmds.org/#book

o   Optional: http://bit.ly/2iYxo82

18 | W 3/22 | Recommendation Systems

o   Chapter 9 of http://www.mmds.org/#book

19 | M 3/27 | Matrix Factorization

o   Section 14.6 of http://statweb.stanford.edu/~tibs/ElemStatLearn/

o   http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf

o   Optional: http://www.sandia.gov/~tgkolda/pubs/pubfiles/TensorReview.pdf

Homework #3

o   out on Monday March 27

o   due on Sunday April 9 at 11:59 PM Eastern

o   grade by Monday April 24

20, 21 | W 3/29, M 4/3 | Topic Models

o   http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf

o   http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

o   http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf

22 | W 4/5 | Hidden Markov Models

o   http://bit.ly/2iYxInD

o   http://digital.cs.usu.edu/%7Ecyan/CS7960/hmm-tutorial.pdf

23 | M 4/10 | Model Selection (BIC, MDL); Theory of Clustering

o   Sections 7.7-7.8 of http://statweb.stanford.edu/~tibs/ElemStatLearn/

o   Section 8.12 of https://www.cs.cornell.edu/jeh/book2016June9.pdf

o   http://www.cs.cornell.edu/home/kleinber/nips15.pdf

o   http://bit.ly/2j278xP

24 | W 4/12 | Final Exam (in-class)

Grade by Sunday April 30

- | M 4/17 | Patriot's Day (no class)

 

25 | W 4/19 | Project Presentations (in-class)

Last class day

Project reports

o   due on Wednesday April 26 at 11:59 PM

o   grade by Sunday April 30

Final grades deadline on Monday May 1 at 9:00 AM

 

Notes, Policies, and Guidelines

 

o   We will use Northeastern's Blackboard for announcements, assignments, and your contributions.

o   When emailing me or the TA about the course, begin the subject line with [sp17 cs6220].

o   Programming exercises will be accepted in MATLAB, Python, or R.

o   Homework must be done individually. Late homework is accepted up to 2 days after the deadline. A penalty of 20% will be charged for each late day.

o   The class project can be done either individually or in groups of two.

o   For your class project, you may use any programming language you like.

o   Any regrading request must be submitted in writing within one week of the material being returned. The request must describe the grading error precisely and concisely.

o   Refresh your knowledge of the university's academic integrity policy, including its rules on plagiarism. There is zero tolerance for cheating!

o   Letter grades will be assigned based on the following grade scale (scores are rounded up; e.g., 92.4 becomes an A):

 

A  | 93-100
A- | 90-92
B+ | 87-89
B  | 83-86
B- | 80-82
C+ | 77-79
C  | 73-76
C- | 70-72
F  | < 70
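
The scale above can be sketched in Python as follows. This is only an illustration: it assumes that "rounding up" means taking the ceiling of the course total to the next whole number, which is consistent with the 92.4 example but remains an interpretation. SCALE and letter_grade are hypothetical names.

```python
import math

# (minimum rounded score, letter) pairs from the grade scale, highest first.
SCALE = [
    (93, "A"),
    (90, "A-"),
    (87, "B+"),
    (83, "B"),
    (80, "B-"),
    (77, "C+"),
    (73, "C"),
    (70, "C-"),
]


def letter_grade(total):
    """Map a course total (0-100) to a letter grade.

    Assumption: the total is rounded up (ceiling) before the lookup.
    """
    rounded = math.ceil(total)
    for cutoff, letter in SCALE:
        if rounded >= cutoff:
            return letter
    return "F"


print(letter_grade(92.4))  # A
print(letter_grade(92.0))  # A-
```

Under this reading, any total strictly above 92 earns an A, while a total of exactly 92 stays an A-.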