Practice and Applications of Data Science (DSC 80) is a core sophomore-level course I developed for the data science curriculum. It’s an intermediate touchpoint for students in the Data Science program:
- it applies material students learned in both the introductory computational sequence (algorithms and data structures) and the introductory math/stat sequence (probability and statistics) to realistic problems a Data Scientist might encounter.
- it sets the stage for the overall structure of Data Science projects, so students understand the broader context of how more advanced material (e.g. machine learning algorithms) fits into a typical investigation.
A working version of the course notes is found here.
The assignments and projects in this course drive student learning; these are found on the course repository.
In this course, students will:
- Leave equipped to “bring their own dataset” to apply materials they are learning in upper-division classes and outside projects.
- Understand how to clean (reasonable) real-world data and the statistical implications of those cleaning decisions.
- Translate ambiguous, plain-English requirements into code that attempts to solve the problem at hand, with a critical eye toward understanding the strengths and shortcomings of their solution.
- Have a “big picture” understanding of why data science projects are structured the way they are, so they have adequate context for how more advanced modeling techniques are used in a data project.
- Get a feel for the daily work of a data scientist and how that work fits into the greater world. Particular attention is paid to data-driven decisions and the fairness of the actions they drive.
The course covers the following topics in roughly the given proportions. Topics are intertwined.
- Reading and Manipulating Data (35%)
- Exploratory Data Analysis (35%)
- Applied Prediction (15%)
- Applied Inference (15%)
|Week|Topics|
|---|---|
|Week 1|Data Science Life Cycle; Data Generating Processes|
| |Tabular Data Formats; NumPy/Pandas|
|Week 2|Data Cleaning; Faithful Representations of the DGP|
| |Hypothesis Testing via Simulation|
|Week 3|Data Granularity and Grouping|
| |Combining Data; Time Series|
|Week 4|Permutation Tests; Choices of Statistics|
| |Identifying Mechanisms of Missing Data|
|Week 5|Null Imputation and Statistical Implications|
|Week 6|Collecting Data; HTTP|
| |Parsing and Translating Data; The HTML DOM|
|Week 7|Patterns in Unstructured Text|
| |Text Features and Intro to NLP|
|Week 8|Generic Features and Intro to Modeling|
| |Modeling Pipelines; Intro to scikit-learn (building a model)|
|Week 9|Bias/Variance Trade-off (building a good model)|
| |Evaluation and Fairness; Black-Box Reasoning|
|Week 10|Inference about Evaluation/Fairness (was my model good by chance?)|
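A recurring theme in Weeks 2 and 4 is testing hypotheses by simulation rather than by formula. The idea behind a two-sample permutation test can be sketched in a few lines of plain Python (the data and function below are illustrative, not course code; in class the statistic would be computed with NumPy/Pandas):

```python
import random

def permutation_test(group_a, group_b, n_perm=5000, seed=42):
    """Two-sample permutation test on the difference in means.

    Simulates the null hypothesis (both groups come from the same
    distribution) by repeatedly shuffling the pooled data, then reports
    the fraction of shuffles whose difference in means is at least as
    extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any real association between group and value
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_perm  # simulated two-sided p-value
```

The difference in means is just one choice of statistic; Week 4’s “Choices of Statistics” examines how that choice affects what the test can detect.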
Alongside lecture, discussion section pays careful attention to helping students get comfortable using production libraries and tools, warts and all. These include:
- Practice pulling and updating code with Git.
- Experience updating and maintaining specific versions of Python libraries.
- Experience working with common Data Science Python libraries:
- pandas, NumPy
- Matplotlib, seaborn, folium
- requests, BeautifulSoup
- SciPy, scikit-learn, statsmodels
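Part of maintaining specific library versions is being able to check exactly what is installed. A stdlib-only sketch of that check (the package names are the PyPI distribution names, which can differ from the import names, e.g. `beautifulsoup4` vs. `bs4`):

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI
LIBS = ["pandas", "numpy", "matplotlib", "seaborn", "folium",
        "requests", "beautifulsoup4", "scipy", "scikit-learn", "statsmodels"]

def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

print(installed_versions(LIBS))
```

Comparing this output against a pinned requirements file is one way students learn to keep an environment reproducible.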
Alongside homework assignments that reinforce course materials, students work on five biweekly projects based on real-world data. These projects can raise ethical issues, or produce the ambiguous results that often come out of complex investigations. Examples of these projects include:
- Project 1: Computing students’ grades in a hypothetical DSC 80 course. Students practice using Pandas and learn the course syllabus very well. The observations in the data are relatable to students, and they understand the importance of getting the correct answer at the end of a complicated series of computational steps. Considering possibilities for curving the course serves to introduce issues of fairness.
- Project 2: An investigation into flight delay data. Students clean a large dataset of flight delays to build a picture of flights to and from San Diego’s airport. In particular, students investigate mechanisms of missing data in the delays, accumulating evidence of censorship on the part of certain airlines.
- Project 3: Choose a dataset for an open-ended exploratory data analysis. Students choose one of three possible datasets to clean and understand, while performing an investigation of their own creation. These datasets change from quarter to quarter, so students can use the class work to seed their own projects after the course ends.
- Project 4: Creating N-gram models from Project Gutenberg. Students create an N-gram model class from scratch using Pandas, solidifying their understanding of conditional probability in code while building a class that can generate text in the style of any training text.
- Project 5: Choose a dataset for a prediction project. Students use the dataset from Project 3 to pose and code a prediction problem. Once a model is trained, students analyze its performance in terms of a quantitative notion of fairness (an appropriately chosen parity measure).
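The parity measures in Project 5 can be illustrated with a toy demographic-parity check. This is a minimal, pure-Python sketch with invented data; the actual project uses a measure appropriate to each student’s dataset:

```python
def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rates across groups.

    predictions: 0/1 model outputs; groups: the group label of each row.
    A gap of 0 means every group receives positive predictions at the
    same rate (demographic parity); larger gaps indicate disparity.
    """
    rates = []
    for g in set(groups):
        preds_g = [p for p, grp in zip(predictions, groups) if grp == g]
        rates.append(sum(preds_g) / len(preds_g))
    return max(rates) - min(rates)

# Toy example: group "a" always predicted positive, group "b" never.
print(demographic_parity_gap([1, 1, 0, 0], ["a", "a", "b", "b"]))  # gap of 1.0
```

Whether a nonzero gap is acceptable (and whether “was my model good by chance?”) is exactly the kind of inference question Week 10 takes up.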