
Spring 2022: DS 5230 -- Unsupervised Machine Learning and Data Mining, CRN 32874

 

General Information

 

Lecture time: Mondays & Wednesdays, 2:50 – 4:30 PM

Place: West Village G, Room 106

Instructor: Tina Eliassi-Rad

Office hours: Tuesdays 4:30 – 6:00 PM via Zoom. Also available by appointment: email eliassi [at] ccs [dot] neu [dot] edu to set up an appointment; begin the subject line with [sp22 dm].

TA: Priya Garg

Office hours: Available by appointment. Email garg.p [at] northeastern [dot] edu; begin the subject line with [sp22 dm].

TA: Hani Haider

Office hours: Available by appointment. Email haider.sy [at] northeastern [dot] edu; begin the subject line with [sp22 dm].

TA: Oj Sindher

Office hours: Available by appointment. Email sindher.o [at] northeastern [dot] edu; begin the subject line with [sp22 dm].

 

Overview

 

This 4-credit graduate-level course covers data mining and unsupervised learning. Students are expected to have taken courses on, or otherwise have knowledge of, the following:

o   Calculus and linear algebra

o   Basic statistics and probability

o   Algorithms, data structures, and programming skills (e.g., Python, Julia, C, Java, MATLAB, or any modern programming language)

 

Textbooks

 

There is no specific textbook for this course. Readings are assigned in the syllabus (see below). Here are some textbooks (all optional) for this course. Those that are freely available online are listed first.

o   Charu C. Aggarwal, Data Mining, The Textbook, Springer 2015. (free online; visit this site and log in via your institutional account)

 

Resources

 

 

Grading

 

o   Homework assignments (20% = 4 * 5%)

o   Homework assignments are graded pass/fail: if your score on an assignment is above 70, you pass; otherwise, you fail it (i.e., receive a zero for it). A worked sketch of the overall grade weighting appears after this list.

o   Midterm exam (22%)

o   Final exam (28%)

o   Class project (30%)

o   2-page proposal (9%)

·      Should include answers to the following questions:

-   What is the problem?

-   Why is it interesting and important?

-   Why is it hard? Why have previous approaches failed?

-   What are the key components of your approach?

-   What data sets and metrics will be used to validate the approach?

o   Poster (9%) – See template here. An example is available here.

o   Final report (12%) – maximum 6 pages

·      For guidance on writing the final report, see slide 70 of Eamonn Keogh's KDD'09 Tutorial on How to do good research, get it published in SIGKDD and get it cited!

·      Follow ACM formatting guidelines.
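To make the weighting above concrete, here is a small illustrative Python sketch. The function names and sample scores are hypothetical; the weights and the homework pass/fail rule come from the breakdown above, with "above 70" read as strictly greater than 70.

    def homework_points(raw_scores):
        # Each of the 4 assignments is pass/fail: a raw score above 70 earns
        # the assignment's full 5% of the course grade; otherwise it earns 0.
        return sum(5.0 if s > 70 else 0.0 for s in raw_scores)

    def course_total(hw_raw, midterm, final, project):
        # midterm, final, and project are assumed to be on a 0-100 scale.
        return (homework_points(hw_raw)   # 20% = 4 * 5%
                + 0.22 * midterm          # midterm exam
                + 0.28 * final            # final exam
                + 0.30 * project)         # class project

    # Hypothetical example: three homework passes (85, 72, 90), one fail (68)
    print(course_total([85, 72, 68, 90], midterm=88, final=91, project=93))
    # 15 + 19.36 + 25.48 + 27.90 = 87.74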

 

Schedule/Syllabus (Subject to Change)

 

Each entry below gives the lecture number, date, and topic, followed by the readings & notes for that lecture.

Lecture 1 (W 1/19): Introduction and Overview

o   http://infolab.stanford.edu/~ullman/mmds/ch1n.pdf

o   http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf

o   http://eliassi.org/ml-science-2015.pdf

o   https://www.pnas.org/content/pnas/114/33/8689.full.pdf

Lecture 2 (M 1/24): Density Estimation

o   http://ned.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html

o   http://eliassi.org/Sheather_StatSci_2004.pdf

o   Optional: Section 6.6.1 of https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf

o   Optional: https://dl.acm.org/doi/pdf/10.1145/3422622

Lecture 3 (W 1/26): No Class (Professor Away)

Lecture 4 (M 1/31): Frequent Itemsets & Association Rules

o   http://infolab.stanford.edu/~ullman/mmds/ch6.pdf

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Lecture 5 (W 2/2): Frequent Itemsets & Association Rules

o   http://infolab.stanford.edu/~ullman/mmds/ch6.pdf

o   Optional: Sections 6.1-6.6 of http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Homework #1 covers density estimation and frequent itemsets & association rules.

o   out on Wednesday February 2

o   due on Sunday February 13 at 11:59 PM Eastern

o   graded by Wednesday February 23

Lecture 6 (M 2/7): Finding Similar Items

o   http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf

Lecture 7 (W 2/9): Finding Similar Items

o   http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf

Lecture 8 (M 2/14): Mining Data Streams

o   http://infolab.stanford.edu/~ullman/mmds/ch4.pdf

Lecture 9 (W 2/16): Mining Data Streams

o   http://infolab.stanford.edu/~ullman/mmds/ch4.pdf

Lecture 10 (M 2/21): No Class (Presidents' Day)

Lecture 11 (W 2/23): Mining Data Streams

o   http://infolab.stanford.edu/~ullman/mmds/ch4.pdf

Homework #2 covers finding similar items and mining data streams.

o   out on Wednesday February 23

o   due on Sunday March 6 at 11:59 PM Eastern

o   graded by Wednesday March 16

Lecture 12 (M 2/28): Dimensionality Reduction (SVD, CUR)

o   http://infolab.stanford.edu/~ullman/mmds/ch11.pdf

o   Chapter 3 of http://www.eliassi.org/FODS-book-2019.pdf

Lecture 13 (W 3/2): Dimensionality Reduction (PCA, Kernel PCA, MDS, ISOMAP)

o   http://www.eliassi.org/ang/cs229-notes10-pca.pdf

o   Section 14.5 of https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf

o   https://alex.smola.org/papers/1999/MikSchSmoMuletal99.pdf

o   https://en.wikipedia.org/wiki/Multidimensional_scaling

o   http://www.eliassi.org/tenenbaum-isomap-Science2000.pdf (supplementary material)

Lecture 14 (M 3/7): Dimensionality Reduction (t-SNE and UMAP)

o   t-SNE paper: http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

o   t-SNE website: https://lvdmaaten.github.io/tsne/

o   UMAP paper: https://arxiv.org/abs/1802.03426

o   UMAP website: https://umap-learn.readthedocs.io/en/latest/

o   A nice presentation on UMAP: https://www.youtube.com/watch?v=nq6iPZVUxZU

o   Optional: https://www.jmlr.org/papers/volume22/20-1061/20-1061.pdf

Homework #3 covers dimensionality reduction.

o   out on Monday March 7

o   due on Thursday March 17 at 11:59 PM Eastern

o   graded by Sunday March 27

Lecture 15 (W 3/9): Dimensionality Reduction (autoencoders)

o   https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf

o   https://www.deeplearningbook.org/contents/autoencoders.html

o   A nice presentation on autoencoders: https://www.youtube.com/watch?v=R3DNKE3zKFk

o   Optional: https://www.jeremyjordan.me/autoencoders/

Project proposals

o   due on Thursday March 10 at 11:59 PM Eastern; there are no late days for this assignment.

o   graded by Sunday March 20

Lecture 16 (M 3/14): No Class (Spring Break)

Lecture 17 (W 3/16): No Class (Spring Break)

Lecture 18 (M 3/21): Non-negative Matrix Factorization

o   http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf

o   Chapter 14.6 of https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf

o   Optional: http://eliassi.org/papers/henderson-kdd2012.pdf 

Lecture 19 (W 3/23): Midterm Exam (in-class)

Graded by Monday April 11

Lecture 20 (M 3/28): Clustering: K-means, Gaussian Mixture Models, Expectation Maximization (EM)

o   http://www.eliassi.org/ang/cs229-notes7a-kmeans.pdf

o   http://www.eliassi.org/ang/cs229-notes7b-mixture-of-guassians.pdf

o   http://www.eliassi.org/ang/cs229-notes8-em.pdf

o   Sections 7.1-7.3 of http://infolab.stanford.edu/~ullman/mmds/ch7.pdf

Lecture 21 (W 3/30): Clustering: EM, K-medoids, Hierarchical Clustering, Evaluation Metrics and Practical Issues

o   Chapter 9 of http://robotics.stanford.edu/~nilsson/MLBOOK.pdf

o   Sections 7.1-7.4 of http://www.eliassi.org/FODS-book-2019.pdf

o   Section 14.3 of https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf

o   Optional: http://www.eliassi.org/jain99data-clustering-review.pdf

o   Optional: http://www.eliassi.org/validity_survey.pdf

o   Optional: http://www.eliassi.org/dbscan.pdf

o   Optional: http://www.eliassi.org/ang/cs229-notes9-factor-analysis.pdf

Lecture 22 (M 4/4): Spectral Clustering

o   http://ai.stanford.edu/~ang/papers/nips01-spectral.pdf

o   http://www.cs.columbia.edu/~jebara/4772/papers/Luxburg07_tutorial.pdf

o   Optional: Section 7.5 of http://www.eliassi.org/FODS-book-2019.pdf

Lecture 23 (W 4/6): Recommendation Systems

o   http://infolab.stanford.edu/~ullman/mmds/ch9.pdf 

Lecture 24 (M 4/11): Midterms returned and solution set discussed

o   Any regrading request for the midterm must be made by the end of the lecture.

o   Midterms will be collected at the end of the lecture.

Lecture 25 (W 4/13): Recommendation Systems

o   http://infolab.stanford.edu/~ullman/mmds/ch9.pdf 

Homework #4 covers clustering, matrix factorization, and recommendation systems.

o   out on Wednesday April 13

o   due on Friday April 22 at 11:59 PM Eastern

o   graded by Sunday May 1

Lecture 26 (M 4/18): No Class (Patriots' Day)

Lecture 27 (W 4/20): Recommendation Systems

o   http://infolab.stanford.edu/~ullman/mmds/ch9.pdf 

Lecture 28 (M 4/25): Link Analysis

o   http://infolab.stanford.edu/~ullman/mmds/ch5.pdf

o   Optional: http://bit.ly/2iYxo82

Lecture 29 (W 4/27): Link Analysis

o   http://infolab.stanford.edu/~ullman/mmds/ch5.pdf

o   Optional: http://bit.ly/2iYxo82

Lecture 30 (M 5/2): Final Exam (in-class)

Graded by Friday May 6th

Lecture 31 (W 5/4): No Class

 

Project posters and reports

o   due on Wednesday May 4th at 11:59 PM Eastern; there are no late days for this assignment.

o   graded by Sunday May 8th

Final grades are due to the Registrar's Office on Monday May 9th at 9:00 AM Eastern.

 

Notes, Policies, and Guidelines

 

o   We will use Northeastern’s Canvas for announcements, assignments, and your contributions.

o   I do not use Piazza because (1) it is easily used for cheating and (2) I do not wish to police/moderate its content.

o   When emailing me or the TAs about the course, begin the subject line with [sp22 dm].

o   Programming exercises will be accepted in MATLAB, Python, or R.

o   Homework must be done individually. Late homework is accepted up to one day after the deadline, with a 20% penalty for the late day.

o   The class project can be done either individually or in groups of two.

o   For your class project, you may use whatever programming language you like.

o   Any regrading request must be submitted in writing within 48 hours of the material being returned. The request must state precisely and concisely the grading error.

o   Refresh your knowledge of the university's academic integrity policy and plagiarism. There is zero tolerance for cheating!

o   Letter grades will be assigned based on the following scale, with scores rounded up to the next integer (e.g., 92.4 rounds up to 93 and becomes an A); an illustrative encoding follows the table.

A    93-100
A-   90-92
B+   87-89
B    83-86
B-   80-82
C+   77-79
C    73-76
C-   70-72
F    below 70
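Read with the ceiling interpretation of "rounding up" (the one that matches the 92.4-becomes-an-A example above), the scale can be encoded as the following minimal Python sketch; the function name is hypothetical.

    import math

    # Cutoffs from the grade-scale table above, checked highest first.
    CUTOFFS = [(93, "A"), (90, "A-"), (87, "B+"), (83, "B"),
               (80, "B-"), (77, "C+"), (73, "C"), (70, "C-")]

    def letter_grade(score):
        # Scores are rounded up to the next integer before applying the scale.
        rounded = math.ceil(score)
        for cutoff, letter in CUTOFFS:
            if rounded >= cutoff:
                return letter
        return "F"  # anything that still falls below 70

    print(letter_grade(92.4))  # ceil -> 93 -> "A"
    print(letter_grade(89.0))  # ceil -> 89 -> "B+"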