Logistics
- Semester: Fall 2013
- Course number: 01:198:444
- Course title: Introduction to Data Science
- Credits: 3
- Lecture: Mondays & Thursdays 12:00pm-1:20pm
- Location: Busch Campus, SEC-220
- Course website: here and in Sakai
- Instructor: Tina Eliassi-Rad
  - Office: CBIM 8
  - Office hours: Mondays 1:30pm-2:30pm
- Teaching assistant: Vukosi Marivate
  - Office: Hill 486
  - Office hours: Thursdays 3:00pm-4:00pm
Description
Advances in technology have allowed us to collect massive amounts of data. A data scientist is a person who has the skills, knowledge, and ability to extract actionable knowledge from that data -- whether for the good of society, the advancement of science, or profit in business. This course covers the topics needed to solve data-science problems: data preparation (collection & integration), data characterization & presentation, data analysis (experimentation & observational studies), and data products.
Syllabus / Schedule
Available here.
Textbooks
This course does not have a designated textbook. The readings are assigned in the syllabus.
Here are some textbooks (all optional) related to the course.
- Anand Rajaraman, Jurij Leskovec, and Jeffrey Ullman. Mining of Massive Datasets. Cambridge University Press. 2012. (free online)
- Foster Provost, Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323.
- Tom Mitchell. Machine Learning. ISBN 0070428077.
- Christopher Bishop. Pattern Recognition and Machine Learning. ISBN 0387310738.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020.
- Peter Flach. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. ISBN 1107422221.
- Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. ISBN 0387952845. (free online)
- David J. Hand, Heikki Mannila, Padhraic Smyth. Principles of Data Mining. ISBN 026208290X.
- Jiawei Han, Micheline Kamber, Jian Pei. Data Mining: Concepts and Techniques, Third Edition. ISBN 0123814790.
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. ISBN 0321321367.
- Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques, Third Edition. ISBN 0123748569.
Prerequisites
The class requires an ability to deal with abstract mathematical concepts such as the ones covered in 01:198:112, 01:198:205, and 01:198:206. You need an introductory-level background in algorithms, probability, and linear algebra. You also need to know how to program for data manipulation and analysis (e.g., in Python, MATLAB, or R) and for the Web (e.g., HTML, CSS, or JavaScript). The specific programming language is mostly your choice.
Grading Policies
- Class project (45%), where you solve a data-science problem from data preparation to data product
  - Proposal report (10%) -- 2 pages maximum plus 5-minute in-class pitch -- due on Thu 10/24.
    Should include answers to the following questions:
    - What is the problem?
    - Why is it interesting and important?
    - Why is it hard? Why have previous approaches failed?
    - What are the key components of your approach?
    - What data sets and metrics will be used to validate the approach?
  - Class presentation (15%) -- 8-minute presentation -- due on Thu 12/5 (Group 1) and Mon 12/9 (Group 2). Groups will be determined later.
  - Final report (20%) -- 6 pages maximum -- due on Thu 12/19.
- Three homework assignments (45% total; 15% per HW)
  - HW#1 out on Mon 9/23; due on Mon 10/7; graded by Mon 10/21.
  - HW#2 out on Mon 10/21; due on Mon 11/4; graded by Mon 11/18.
  - HW#3 out on Mon 11/11; due on Mon 11/25; graded by Mon 12/9.
- Class participation (10%)
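The weights above combine into a final score as a weighted sum. The sketch below is illustrative only: it assumes every component is scored on a 0-100 scale, and the function name and dictionary keys are hypothetical rather than part of the course materials.

```python
# Weights taken from the grading policy above (fractions of the final grade).
WEIGHTS = {
    "proposal": 0.10,
    "presentation": 0.15,
    "final_report": 0.20,
    "hw1": 0.15,
    "hw2": 0.15,
    "hw3": 0.15,
    "participation": 0.10,
}


def final_score(scores):
    """Return the weighted course score.

    Assumption: each entry in `scores` is on a 0-100 scale.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

The project components (proposal, presentation, final report) sum to 45%, the three homework assignments to 45%, and participation supplies the remaining 10%.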
Notes, Policies, and Guidelines
- We will use the class Sakai site for announcements, assignments, and your contributions.
- Homework must be done individually. Late homework is accepted up to 4 days after the deadline. A penalty of 20% will be deducted for each late day.
- The class project can be done either individually or in groups of two.
- Any regrading request must be submitted in writing within one week of the material being returned. The request must describe the grading error precisely and concisely.
- Refresh your knowledge of the university's policy on academic integrity and plagiarism. There is zero tolerance for cheating.
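The late-homework rule above can be read as the following arithmetic. This is one possible interpretation, not the official rubric: it assumes that a partial day counts as a full late day, that the 20% is taken from the earned score, and that work more than 4 days late earns zero. The function name is hypothetical.

```python
import math

PENALTY_PER_DAY = 0.20  # 20% per late day, from the policy above
MAX_LATE_DAYS = 4       # late homework is accepted up to 4 days after the deadline


def late_adjusted_score(raw_score, days_late):
    """Apply the late penalty to a raw homework score.

    Assumptions (not stated in the policy): a partial day counts as a full
    late day, the penalty is a fraction of the earned score, and work more
    than 4 days late earns zero.
    """
    if days_late <= 0:
        return raw_score
    late_days = math.ceil(days_late)
    if late_days > MAX_LATE_DAYS:
        return 0.0
    return raw_score * (1 - PENALTY_PER_DAY * late_days)
```

Under these assumptions, a homework that earns 90 and arrives two days late would be recorded as 54.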
Resources & Recent Stories
- UC Berkeley's Data Science Resources
- Some software, tools, and data resources
- Claire Cain Miller. Data Science: The Numbers of Our Lives, New York Times, April 11, 2013.
- Steve Lohr. Sure, Big Data Is Great. But So Is Intuition, New York Times, December 29, 2012.
- Thomas H. Davenport and D.J. Patil. Data Scientist: The Sexiest Job of the 21st Century, Harvard Business Review, October 2012.
- Data Science: An Introduction (Wikibook)
- Shamanth Kumar, Fred Morstatter, Huan Liu. Twitter Data Analytics, Springer 2013.