Logistics
- Semester: Spring 2015
- Course number: 01:198:443
- Course title: Introduction to Data Science
- Credits: 3
- Lecture: Mondays & Thursdays 12:00pm-1:20pm
- Location: Busch Campus, SEC-205
- Course website: this page and the class Sakai site
- Instructor: Tina Eliassi-Rad
  - Office: CBIM 8
  - Office hours: Mondays 1:30pm-2:30pm
- Teaching assistant: Chetan Tonde
  - Office: CBIM (cubicle near printer room)
  - Office hours: Thursdays 3:00pm-5:00pm
Description
Advances in technology have allowed us to collect massive amounts of data. A data scientist is a person who has the skills, knowledge, and ability to extract actionable knowledge from that data, whether for the good of society, the advancement of science, or profit in business. This course covers the topics needed to solve data-science problems: data preparation (collection & integration), data characterization & presentation, data analysis (experimentation & observational studies), and data products.
Syllabus / Schedule
Textbooks
This course does not have a designated textbook. The readings are assigned in the syllabus.
Here are some textbooks (all optional) related to the course.
- Anand Rajaraman, Jurij Leskovec, and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. (free online)
- Foster Provost, Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323.
- Tom Mitchell. Machine Learning. ISBN 0070428077.
- Christopher Bishop. Pattern Recognition and Machine Learning. ISBN 0387310738.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020.
- Peter Flach. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. ISBN 1107422221.
- Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning. ISBN 0387952845. (free online)
- David J. Hand, Heikki Mannila, Padhraic Smyth. Principles of Data Mining. ISBN 026208290X.
- Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques, Third Edition. ISBN 0123814790.
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. ISBN 0321321367.
- Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques, Third Edition. ISBN 0123748569.
Prerequisites
The class requires an ability to deal with abstract mathematical concepts such as the ones covered in 01:198:112, 01:198:205, and 01:198:206. You need an introductory-level background in algorithms, probability, and linear algebra. You also need to know programming for data manipulation and analysis (e.g., Python, MATLAB, or R) and Web programming (e.g., HTML, CSS, or JavaScript). The specific programming language is mostly your choice.
Grading Policies
- Class project (40%), where you solve a data-science problem from data preparation to data product
  - Proposal report (10%) -- 2 pages maximum plus 4-minute in-class pitch -- due on Thu 3/12.
    Should include answers to the following questions:
    - What is the problem?
    - Why is it interesting and important?
    - Why is it hard? Why have previous approaches failed?
    - What are the key components of your approach?
    - What data sets and metrics will be used to validate the approach?
  - Class presentation (12%) -- 6-minute presentations -- due on Mon 4/27 (Group 1) and Mon 5/4 (Group 2). Groups will be determined later.
  - Final report (18%) -- 6 pages maximum -- due on Mon 5/11.
- Three homework assignments (36% total; 12% per HW)
  - HW#1 out on Mon 2/9; due on Mon 2/23; graded by Mon 3/9.
  - HW#2 out on Mon 3/9; due on Mon 3/23; graded by Mon 4/6.
  - HW#3 out on Mon 3/30; due on Mon 4/13; graded by Mon 4/27.
- In-class exam (24%) on Thu 4/16; graded by Mon 5/4
Notes, Policies, and Guidelines
- We will use the class Sakai site for announcements, assignments, and your contributions.
- Homeworks must be done individually. Late homeworks are accepted up to 4 days after the deadline. A penalty of 20% will be charged for each late day.
- The class project can be done either individually or in groups of two.
- Any regrading request must be submitted in writing and within one week of the returned material. The request must detail precisely and concisely the grading error.
- When emailing me or the TA about the course, begin the subject line with [sp15 cs443].
- Refresh your knowledge of the university's policies on academic integrity and plagiarism. There is zero tolerance for cheating.
- Letter grades will be assigned based on the Rutgers Undergraduate Grade Scale, which is as follows:
  - A in [90, 100]
  - B+ in [85, 89.99]
  - B in [80, 84.99]
  - C+ in [75, 79.99]
  - C in [70, 74.99]
  - D in [60, 69.99]
  - F in [0, 59.99]
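As a rough illustration of how the grading weights and the grade scale above combine, here is a minimal Python sketch. It is not the official course computation. The names (`WEIGHTS`, `CUTOFFS`, `course_score`, `letter_grade`), the assumption that each component is scored on a 0-100 scale, and the choice to assign the lower grade to a total that falls between two listed intervals (e.g., 89.995) are all assumptions made for the example. Late penalties are not modeled.

```python
# Component weights (percent of the final grade), as listed under Grading Policies.
WEIGHTS = {
    "proposal": 10,
    "presentation": 12,
    "final_report": 18,
    "hw1": 12,
    "hw2": 12,
    "hw3": 12,
    "exam": 24,
}

# Lower bound of each interval in the grade scale, highest first.
# Assumption: only lower bounds matter, so a total between two listed
# intervals (e.g., 89.995) receives the lower grade.
CUTOFFS = [(90, "A"), (85, "B+"), (80, "B"), (75, "C+"), (70, "C"), (60, "D"), (0, "F")]


def course_score(scores):
    """Return the weighted total on a 0-100 scale.

    `scores` maps each component name in WEIGHTS to a score in [0, 100]
    (hypothetical convention for this sketch).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total):
    """Map a 0-100 total to a letter using the lower bounds in CUTOFFS."""
    if not 0 <= total <= 100:
        raise ValueError("total must be in [0, 100]")
    for lower, letter in CUTOFFS:
        if total >= lower:
            return letter


example = {
    "proposal": 80, "presentation": 90, "final_report": 85,
    "hw1": 100, "hw2": 90, "hw3": 80, "exam": 75,
}
total = course_score(example)
print(total, letter_grade(total))  # prints: 84.5 B
```

The weights sum to 100, so a student who scores the same value on every component ends up with that value as the total.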
Resources & Recent Stories
- UC Berkeley's Data Science Resources
- Some software, tools, and data resources
- Claire Cain Miller. Data Science: The Numbers of Our Lives, New York Times, April 11, 2013.
- Steve Lohr. Sure, Big Data Is Great. But So Is Intuition, New York Times, December 29, 2012.
- Thomas H. Davenport and D.J. Patil. Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review, October 2012.
- Data Science: An Introduction (Wikibook)
- Shamanth Kumar, Fred Morstatter, Huan Liu. Twitter Data Analytics, Springer 2013.