
Fall 2019: CS 6220 -- Data Mining Techniques, CRN 12564, cross-listed with

Fall 2019: DS 5230 -- Unsupervised Machine Learning and Data Mining, CRN 15043

 

General Information

 

Lecture time: Tuesdays 11:45 AM – 1:25 PM & Thursdays 2:50 – 4:30 PM

Place: West Village G, Room 102

Instructor: Tina Eliassi-Rad

Office hours: Tuesdays 1:30 – 3:00 PM in Kariotis Hall, Room 304
Also available by appointment. Email eliassi [at] ccs [dot] neu [dot] edu to set up an appointment, and begin the subject line with [fa19 dm].

TA: Govind Bhala

Office hours: Wednesdays 4:50 PM – 6:20 PM in West Village F, Room 116

Also available by appointment. Email bhala.g [at] husky [dot] neu [dot] edu, and begin the subject line with [fa19 dm].

TA: Hui “Sophie” Wang

Office hours: Thursdays 10:00 AM – 11:30 AM in Behrakis Health Sciences Center, Room 210

Also available by appointment. Email wang.hui1 [at] husky [dot] neu [dot] edu, and begin the subject line with [fa19 dm].

 

Overview

 

This 4-credit graduate-level course covers data mining and unsupervised learning. Its prerequisites are:

o   Graduate-level CS 5800, CS 7800, or EECE 7205 with a minimum grade of C-.

o   Calculus and linear algebra.

o   An introductory course on statistics and probability.

o   Algorithms and programming (MATLAB, Python, or R).

 

Textbooks

 

This course does not have a designated textbook. The readings are assigned in the syllabus (see below). Several optional textbooks related to the course, such as Mining of Massive Datasets and The Elements of Statistical Learning, are linked from the readings in the schedule.

 

Resources

 

 

Grading

 

o   Homework assignments (24% = 4 * 6%)

o   Midterm exam (20%)

o   Final exam (26%)

o   Class project (30%)

o   2-page proposal & 5-minute in-class pitch (9%)

·      Should include answers to the following questions:

-   What is the problem?

-   Why is it interesting and important?

-   Why is it hard? Why have previous approaches failed?

-   What are the key components of your approach?

-   What data sets and metrics will be used to validate the approach?

o   Poster (9%) – See template here.  An example is available here.

o   Final report (12%) – maximum 6 pages

·      For guidance on writing the final report, see slide 70 of Eamonn Keogh's KDD'09 Tutorial on How to do good research, get it published in SIGKDD and get it cited!

·      Follow ACM formatting guidelines.

 

Schedule/Syllabus (Subject to Change)

 

Each entry below gives the lecture number, date, and topic, followed by the readings and notes for that session.

Lecture 1 (R 9/5): Introduction and Overview

o   Chapter 1 of http://eliassi.org/mmds-book-v2L.pdf

o   http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

o   http://eliassi.org/ml-science-2015.pdf

o   https://www.pnas.org/content/pnas/114/33/8689.full.pdf

Lecture 2 (T 9/10): Density Estimation

o   http://ned.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html

o   http://eliassi.org/Sheather_StatSci_2004.pdf

o   Optional: Sections 6.6-6.9 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
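As a quick illustration of the density-estimation material above, here is a minimal Python sketch of a one-dimensional Gaussian kernel density estimator with Silverman's rule-of-thumb bandwidth. It is illustrative only (the data and function names are made up here) and is not starter code for any assignment.

```python
import numpy as np

def gaussian_kde(x_grid, samples, bandwidth=None):
    """Kernel density estimate with a Gaussian kernel.

    The bandwidth defaults to Silverman's rule of thumb:
    h = 1.06 * sigma * n^(-1/5).
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    # Average one Gaussian bump per sample at each grid location.
    diffs = (x_grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (n * bandwidth)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
grid = np.linspace(-6, 6, 200)
density = gaussian_kde(grid, data)
print(density.max())            # peak of the estimated density
print(np.trapz(density, grid))  # integrates to approximately 1
```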

Lecture 3 (R 9/12): Frequent Itemsets & Association Rules

o   Chapter 6 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Lecture 4 (T 9/17): Frequent Itemsets & Association Rules

o   Chapter 6 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
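To make the frequent-itemset material above concrete, here is a minimal, illustrative Python sketch of level-wise (Apriori-style) support counting over a few toy baskets; the basket data and names are invented. The confidence of a rule X → Y can then be computed as support(X ∪ Y) / support(X).

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Naive Apriori-style miner: count candidate itemsets level by level,
    keeping only those that appear in at least `min_support` baskets."""
    baskets = [frozenset(b) for b in baskets]
    items = {i for b in baskets for i in b}
    frequent = {}
    k = 1
    current = [frozenset([i]) for i in sorted(items)]
    while current:
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidates for the next level: unions of frequent k-itemsets of size k+1.
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "diapers", "beer"},
           {"milk", "bread", "diapers"}, {"bread", "diapers", "beer"}]
for itemset, count in frequent_itemsets(baskets, min_support=2).items():
    print(set(itemset), count)
```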

Lecture 5 (R 9/19): Social Bots (Guest lecturer: Dr. Onur Varol)

o   The Rise of Social Bots: https://arxiv.org/abs/1407.5225

o   Online Human-Bot Interactions: Detection, Estimation, and Characterization: https://arxiv.org/abs/1703.03107

o   Arming The Public with Artificial Intelligence to Counter Social Bots: https://arxiv.org/abs/1901.00912

o   Deception Strategies and Threats for Online Discussions: https://arxiv.org/abs/1906.11371

Lecture 6 (T 9/24): Finding Similar Items

o   Chapter 3 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 7 (R 9/26): Finding Similar Items

o   Chapter 3 of http://eliassi.org/mmds-book-v2L.pdf

Homework #1 covers density estimation, frequent itemsets & association rules, plus finding similar items.

o   out on Thursday September 26

o   due on Sunday October 6 at 11:59 PM Eastern

o   graded by Wednesday October 16
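For the finding-similar-items lectures above, here is a minimal, illustrative Python sketch of MinHash signatures for estimating Jaccard similarity, the building block behind locality-sensitive hashing in MMDS Chapter 3. The hash family and example sets are invented for illustration and this is not assignment code.

```python
import random

def minhash_signature(s, hash_funcs):
    """Signature: for each hash function, the minimum hash value over the set."""
    return [min(h(x) for x in s) for h in hash_funcs]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
# A family of simple salted hash functions; hash() is consistent within one run,
# which is all this illustration needs.
salts = [random.getrandbits(32) for _ in range(200)]
hash_funcs = [lambda x, s=s: hash((s, x)) for s in salts]

A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox leaps over a lazy cat".split())

true_jaccard = len(A & B) / len(A | B)
est = estimated_jaccard(minhash_signature(A, hash_funcs),
                        minhash_signature(B, hash_funcs))
print(round(true_jaccard, 3), round(est, 3))
```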

Lecture 8 (T 10/1): Mining Data Streams

o   Chapter 4 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 9 (R 10/3): Mining Data Streams

o   Chapter 4 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 10 (T 10/8): Mining Data Streams

o   Chapter 4 of http://eliassi.org/mmds-book-v2L.pdf
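As one concrete example of the streaming setting above, here is a minimal Python sketch of reservoir sampling: keeping a uniform random sample of fixed size from a stream whose length is unknown in advance. It illustrates one standard technique from this part of the course and is not assignment code.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using only O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)
        else:
            # Replace a current member with probability k/n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```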

Lecture 11 (R 10/10): Dimensionality Reduction (PCA, SVD, CUR, Kernel PCA, MDS, ISOMAP)

o   Chapter 11 of http://eliassi.org/mmds-book-v2L.pdf

o   Section 14.5 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

o   https://en.wikipedia.org/wiki/Multidimensional_scaling

o   https://alex.smola.org/papers/1999/MikSchSmoMuletal99.pdf

Lecture 12 (T 10/15): Dimensionality Reduction (PCA, SVD, CUR, Kernel PCA, MDS, ISOMAP)

o   Chapter 11 of http://eliassi.org/mmds-book-v2L.pdf

o   Section 14.5 of http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

o   https://alex.smola.org/papers/1999/MikSchSmoMuletal99.pdf

o   https://en.wikipedia.org/wiki/Multidimensional_scaling

o   https://web.mit.edu/cocosci/isomap/isomap.html

Homework #2 covers mining data streams and dimensionality reduction.

o   out on Tuesday October 15

o   due on Friday October 25 at 11:59 PM Eastern

o   graded by Monday November 4
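To accompany the dimensionality-reduction lectures above, here is a minimal, illustrative Python sketch of PCA computed via the SVD of the centered data matrix; the synthetic data and function names are made up for illustration.

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the centered data: rows of Vt are principal directions."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]        # principal directions
    scores = X_centered @ components.T    # data projected onto those directions
    explained_var = (S ** 2) / (X.shape[0] - 1)
    return scores, components, explained_var[:n_components]

rng = np.random.default_rng(0)
# Correlated 3-D data that is essentially 2-dimensional.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 3)) + rng.normal(scale=0.05, size=(500, 3))
scores, components, var = pca(X, n_components=2)
print(scores.shape, var)
```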

Lecture 13 (R 10/17): Dimensionality Reduction (t-SNE and UMAP) (Guest lecturer: Mr. Leo Torres)

o   t-SNE paper: http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

o   t-SNE website: https://lvdmaaten.github.io/tsne/

o   UMAP paper: https://arxiv.org/abs/1802.03426

o   UMAP website: https://umap-learn.readthedocs.io/en/latest/

o   A nice presentation on UMAP: https://www.youtube.com/watch?v=nq6iPZVUxZU
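If you want to try the methods from this guest lecture yourself, the sketch below shows one way to do so with scikit-learn's TSNE, assuming scikit-learn is installed; umap-learn offers a similar fit_transform-style interface. This is a usage illustration, not assignment code.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Embed the 64-dimensional handwritten-digits data into 2-D for visualization.
X, y = load_digits(return_X_y=True)
embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(X)
print(embedding.shape)  # (1797, 2)

# umap-learn exposes a similar interface (assuming that package is installed):
#   import umap
#   embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
```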

Lecture 14 (T 10/22): Project Proposal Pitches (in-class)

o   Proposals are due at 9:00 AM Eastern on Tuesday October 22; there are no late days for this assignment.

o   Graded by Tuesday October 29

Lecture 15 (R 10/24): Clustering: K-means, Gaussian Mixture Models, Expectation Maximization (EM)

o   Chapter 9 of http://robotics.stanford.edu/~nilsson/MLBOOK.pdf

o   Sections 7.1-7.3 of http://eliassi.org/mmds-book-v2L.pdf 

o   Chapter 8 of https://www.cs.cornell.edu/jeh/book2016June9.pdf 

o   Section 14.3 of http://statweb.stanford.edu/~tibs/ElemStatLearn 

o   http://cs229.stanford.edu/notes/cs229-notes7b.pdf 

o   http://cs229.stanford.edu/notes/cs229-notes8.pdf 

o   Optional: https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

o   Optional: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf 

o   Optional: http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf
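Here is a minimal, illustrative Python sketch of Lloyd's algorithm for k-means, matching the clustering readings above; the synthetic blobs and function names are made up, and this is not assignment code.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and moving each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])
labels, centroids = kmeans(X, k=3)
print(np.round(centroids, 2))
```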

Lecture 16 (T 10/29): Midterm Exam (in-class)

o   Graded by Tuesday November 12

Lecture 17 (R 10/31): Clustering: K-means, Gaussian Mixture Models, Expectation Maximization (EM)

o   Chapter 9 of http://robotics.stanford.edu/~nilsson/MLBOOK.pdf

o   Sections 7.1-7.3 of http://eliassi.org/mmds-book-v2L.pdf 

o   Chapter 8 of https://www.cs.cornell.edu/jeh/book2016June9.pdf 

o   Section 14.3 of http://statweb.stanford.edu/~tibs/ElemStatLearn 

o   http://cs229.stanford.edu/notes/cs229-notes7b.pdf 

o   http://cs229.stanford.edu/notes/cs229-notes8.pdf 

o   Optional: https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

o   Optional: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf 

o   Optional: http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf

Lecture 18 (T 11/5): Clustering: EM, K-medoids, Hierarchical Clustering, Evaluation Metrics and Practical Issues

o   Chapter 9 of http://robotics.stanford.edu/~nilsson/MLBOOK.pdf

o   Sections 7.1-7.3 of http://eliassi.org/mmds-book-v2L.pdf 

o   Chapter 8 of https://www.cs.cornell.edu/jeh/book2016June9.pdf 

o   Section 14.3 of http://statweb.stanford.edu/~tibs/ElemStatLearn 

o   http://cs229.stanford.edu/notes/cs229-notes7b.pdf 

o   http://cs229.stanford.edu/notes/cs229-notes8.pdf 

o   Optional: https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf

o   Optional: http://web.itu.edu.tr/sgunduz/courses/verimaden/paper/validity_survey.pdf 

o   Optional: http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf

Homework #3 covers clustering: k-means, Gaussian mixture models, expectation maximization, k-medoids, hierarchical clustering, and evaluation metrics.

o   out on Tuesday November 5

o   due on Friday November 15 at 11:59 PM Eastern

o   graded by Monday November 25
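As a companion to the EM material in the clustering lectures above, here is a minimal, illustrative Python sketch of EM for a one-dimensional Gaussian mixture; the synthetic data and names are invented, and real implementations add log-space computations and convergence checks.

```python
import numpy as np

def em_gmm_1d(x, k, n_iters=200, seed=0):
    """EM for a 1-D Gaussian mixture: the E-step computes responsibilities,
    the M-step re-estimates weights, means, and variances from them."""
    rng = np.random.default_rng(seed)
    n = x.size
    weights = np.full(k, 1.0 / k)
    means = rng.choice(x, size=k, replace=False)
    variances = np.full(k, x.var())
    for _ in range(n_iters):
        # E-step: responsibility of component j for point i.
        densities = (np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
                     / np.sqrt(2 * np.pi * variances))
        resp = weights * densities
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 0.5, 400), rng.normal(4, 1.0, 600)])
print(em_gmm_1d(x, k=2))
```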

Lecture 19 (R 11/7): Spectral Clustering

o   http://ai.stanford.edu/~ang/papers/nips01-spectral.pdf

o   http://www.cs.columbia.edu/~jebara/4772/papers/Luxburg07_tutorial.pdf
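To connect the spectral clustering readings above to code, here is a minimal, illustrative Python sketch in the style of Ng, Jordan & Weiss: embed the nodes using eigenvectors of the symmetric normalized Laplacian and run k-means on the rows (assuming NumPy and scikit-learn are available). The toy graph is made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Normalized spectral clustering: eigenvectors of the symmetric
    normalized Laplacian, row-normalized, then clustered with k-means."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :k]                                # k smallest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Two 4-node cliques joined by a single edge.
A = np.zeros((8, 8))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1
A[3, 4] = A[4, 3] = 1
print(spectral_clustering(A, k=2))
```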

Lecture 20 (T 11/12): Matrix Factorization

o   Chapter 14.6 of http://statweb.stanford.edu/~tibs/ElemStatLearn/

o   http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf

o   Optional: http://www.sandia.gov/~tgkolda/pubs/pubfiles/TensorReview.pdf
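Here is a minimal, illustrative Python sketch of non-negative matrix factorization with the multiplicative updates from the Lee & Seung paper linked above (squared Frobenius error version); the random low-rank test matrix is made up for illustration.

```python
import numpy as np

def nmf(V, rank, n_iters=500, seed=0):
    """NMF with Lee & Seung multiplicative updates minimizing ||V - WH||_F^2."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

rng = np.random.default_rng(3)
V = rng.random((20, 4)) @ rng.random((4, 30))   # a non-negative low-rank matrix
W, H = nmf(V, rank=4)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error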

Lecture 21 (R 11/14): Recommendation Systems (Guest lecturer: Dr. Onur Varol)

o   Chapter 9 of http://eliassi.org/mmds-book-v2L.pdf

Lecture 22 (T 11/19): Recommendation Systems

o   Chapter 9 of http://eliassi.org/mmds-book-v2L.pdf
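For the recommendation-systems lectures above, here is a minimal, illustrative Python sketch of item-item collaborative filtering with cosine similarity, a deliberately simplified version of the neighborhood methods in MMDS Chapter 9; the tiny ratings matrix is invented for illustration.

```python
import numpy as np

def predict_ratings(R):
    """Item-item collaborative filtering: score items by a similarity-weighted
    average of a user's existing ratings (0 marks an unrated item)."""
    # Cosine similarity between item columns.
    norms = np.linalg.norm(R, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    sim = (R.T @ R) / (norms.T @ norms)
    # Weighted sum of the user's ratings, normalized by total similarity mass.
    weights = np.abs(sim).sum(axis=1, keepdims=True)
    return (R @ sim.T) / weights.T

# Rows = users, columns = items; 0 means "not rated yet".
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
pred = predict_ratings(R)
print(pred[1].round(2))  # higher scores suggest items to recommend to user 1
```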

Lecture 23 (R 11/21): Link Analysis

o   Chapter 5 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: http://bit.ly/2iYxo82

Homework #4 covers spectral clustering, matrix factorization, recommendation systems, and link analysis.

o   out on Thursday November 21

o   due on Sunday December 1 at 11:59 PM Eastern

o   graded by Tuesday December 10
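To go with the link-analysis readings above, here is a minimal, illustrative Python sketch of PageRank by power iteration with teleportation, in the spirit of MMDS Chapter 5; the four-page toy graph is invented for illustration.

```python
import numpy as np

def pagerank(adj, beta=0.85, tol=1e-10, max_iters=1000):
    """PageRank by power iteration with teleportation.
    `adj[i, j] = 1` means a link from page i to page j."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)
    # Column-stochastic transition matrix; dangling pages teleport uniformly.
    M = np.zeros((n, n))
    for i in range(n):
        if out_degree[i] > 0:
            M[:, i] = adj[i] / out_degree[i]
        else:
            M[:, i] = 1.0 / n
    r = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        r_new = beta * (M @ r) + (1 - beta) / n
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

# A tiny 4-page web graph.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 0],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj).round(3))
```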

Lecture 24 (T 11/26): Link Analysis and Review for Final

o   Chapter 5 of http://eliassi.org/mmds-book-v2L.pdf

o   Optional: http://bit.ly/2iYxo82

o   The final covers all the material since the beginning of the term.

Lecture 25 (R 11/28): Thanksgiving Break

 

Lecture 26 (T 12/3): Final Exam (in-class)

o   Graded by Friday December 13

Project posters

o   due on Wednesday December 4 at 11:59 PM Eastern; there are no late days for this assignment.

o   graded by Saturday December 7

Lecture 27 (R 12/5): Data Science and Ethics

o   https://www.ted.com/talks/damon_horowitz?language=en

o   https://ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms

o   https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

o   https://hbr.org/2016/12/a-guide-to-solving-social-problems-with-machine-learning

Project reports

o   due on Tuesday December 10 at 11:59 PM Eastern; there are no late days for this assignment.

o   graded by Sunday December 15

Final grades are due to the Registrar's Office on Monday December 16 at 2:00 PM Eastern.

 

 

Notes, Policies, and Guidelines

 

o   We will use Northeastern’s Blackboard for announcements, assignments, and your contributions.

o   When emailing me or the TAs about the course, begin the subject line with [fa19 dm].

o   Programming exercises will be accepted in MATLAB, Python, or R.

o   Homework must be done individually. Late homework is accepted up to 2 days after the deadline, with a 20% penalty per late day.

o   The class project can be done either individually or in groups of two.

o   For your class project, you can use whatever programming language you like.

o   Any regrading request must be submitted in writing within 48 hours of the material being returned. The request must precisely and concisely describe the grading error.

o   Refresh your knowledge of the university's academic integrity and plagiarism policies. There is zero tolerance for cheating!

o   Letter grades will be assigned based on the following scale, with scores rounded up (e.g., 92.4 rounds up to 93, which is an A):

A: 93-100
A-: 90-92
B+: 87-89
B: 83-86
B-: 80-82
C+: 77-79
C: 73-76
C-: 70-72
F: < 70