Practical Data Science

Practice and Applications of Data Science (DSC 80) is a core sophomore-level course I developed for the data science curriculum. It’s an intermediate touchpoint for students in the Data Science program:

  • it applies material students learned in both the introductory computational sequence (algorithms and data structures) and the introductory math/stat sequence (probability and statistics) to realistic problems a Data Scientist might encounter.
  • it sets the stage for the overall structure of Data Science projects, so students understand the broader context of how more advanced material (e.g. machine learning algorithms) fits into a typical investigation.

A working version of the course notes is found here.

The assignments and projects in this course drive student learning; these are found on the course repository.

Learning Outcomes

In this course, students will:

  • Leave equipped to “bring their own dataset” to apply materials they are learning in upper-division classes and outside projects.
  • Understand how to clean (reasonable) real-world data and understand the statistical implications of those cleaning decisions.
  • Translate ambiguous, plain-English requirements into code that attempts to meet them, with a critical eye toward understanding the strengths and shortcomings of their solution.
  • Have a “big picture” understanding of why data science projects are structured the way they are, so they have adequate context for how more advanced modeling techniques are used in a data project.
  • Get a feel for the daily work of a data scientist and the context of that work in the greater world. Particular attention is paid to the interplay between data-driven decisions and their relationship to the fairness of those actions.

Course Outline

The course covers the following topics in roughly the given proportions. Topics are intertwined.

  • Reading and Manipulating Data (35%)
  • Exploratory Data Analysis (35%)
  • Applied Prediction (15%)
  • Applied Inference (15%)

Course Schedule

Week Number   Topic
Week 1        Data Science Life Cycle; Data Generating Processes
              Tabular Data Formats; Numpy/Pandas
Week 2        Data Cleaning; Faithful Representations of DGP
              Hypothesis testing via Simulation
Week 3        Data Granularity and Grouping
              Combining Data; Time Series
Week 4        Permutation Tests; Choices of Statistics
              Identifying Mechanisms of Missing Data
Week 5        Null Imputation and Statistical Implications
Week 6        Collecting Data; HTTP
              Parsing and translating data; The HTML DOM
Week 7        Patterns in Unstructured Text
              Text Features and Intro to NLP
Week 8        Generic Features and Intro to Modeling
              Modeling Pipelines; Intro to scikit (building a model)
Week 9        Bias/Variance trade-off (building a good model)
              Evaluation and Fairness; black-box reasoning
Week 10       Inference about Evaluation/Fairness (was my model good by chance?)
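
To make one of the scheduled topics concrete, here is a minimal sketch of a permutation test (Week 4). It is not taken from the course materials: the function name permutation_test, the choice of a difference in means as the test statistic, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Hypothetical helper for illustration; the test statistic is an assumption.
    Returns the observed difference and an estimated p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()

    # Under the null hypothesis the group labels are exchangeable,
    # so we pool the values and reassign them to groups at random.
    pooled = np.concatenate([a, b])
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        stat = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(stat) >= abs(observed):
            at_least_as_extreme += 1

    return observed, at_least_as_extreme / n_permutations

# Toy data (made up): two clearly separated groups.
group_a = [10, 11, 12, 13, 14, 15, 16, 17]
group_b = [0, 1, 2, 3, 4, 5, 6, 7]
observed, p_value = permutation_test(group_a, group_b, n_permutations=2_000)
print(observed, p_value)
```

Swapping in a different statistic (a difference in medians, say) changes only the two lines that compute observed and stat, which is one way to explore how the choice of statistic affects the conclusion.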

Discussion Section

Alongside lecture, discussion section pays careful attention to helping students get comfortable using production libraries and tools, warts and all. These include:

  • Practice pulling and updating code with Git.
  • Experience updating and maintaining specific versions of Python libraries.
  • Experience working with common Data Science Python libraries:
    • Pandas, Numpy
    • Matplotlib, seaborn, folium
    • requests, beautifulsoup
    • Scipy, Scikit-Learn, Statsmodels

Course Assignments

Alongside homework assignments that reinforce course materials, students work on 5 biweekly projects based on real-world data. These projects can raise ethical issues or produce the ambiguous results that often come from complex investigations. Examples of these projects include:

  • Project 1: Computing student grades in a hypothetical DSC 80 course. Students practice using Pandas and learn the course syllabus very well. The observations in the data are relatable to students, and they come to understand the importance of getting the correct answer at the end of a complicated series of computational steps. Considering possibilities for curving the course serves as an entry point to issues of fairness.
  • Project 2: An investigation into Flight Delay data. Students clean a large dataset of flight delays, trying to build a picture of flights to and from San Diego Airport. In particular, students investigate mechanisms of missing data in delays, accumulating evidence of censorship on the part of certain airlines.
  • Project 3: Choose a dataset for a free exploratory data analysis. Students choose one of three possible datasets to clean and understand, while performing an investigation of their own design. These datasets change from quarter to quarter, so students can use the classwork to seed their own project work after the class.
  • Project 4: Creating N-gram models from Project Gutenberg. Students create an N-gram model class from scratch using Pandas. Students solidify their understanding of conditional probability in code, while creating a class that can generate text in the style of any training text.
  • Project 5: Choose a dataset for a prediction project. Students use the dataset from Project 3 to pose and code a prediction problem. Once the model is trained, students analyze its performance in terms of a quantitative notion of fairness (an appropriately chosen parity measure).
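
As a rough illustration of the idea behind Project 4, the sketch below estimates bigram conditional probabilities with Pandas. It is not the project's actual specification or solution: the function name bigram_probabilities, the restriction to N = 2, and the toy sentence are assumptions made for illustration.

```python
import pandas as pd

def bigram_probabilities(tokens):
    """Estimate P(next | current) from a list of tokens.

    Hypothetical helper for illustration; a real N-gram model would also
    handle sentence boundaries, larger N, and unseen tokens.
    """
    # Each row pairs a token with the token that follows it.
    pairs = pd.DataFrame({"current": tokens[:-1], "next": tokens[1:]})

    # Count how often each (current, next) pair occurs.
    counts = pairs.groupby(["current", "next"]).size()

    # Divide by the number of times `current` appears as a first token,
    # which gives the conditional probability of `next` given `current`.
    totals = counts.groupby(level="current").transform("sum")
    return (counts / totals).rename("prob")

# Toy text (made up) standing in for a Project Gutenberg book.
tokens = "the cat sat on the mat the cat ran".split()
probs = bigram_probabilities(tokens)
print(probs.loc["the"])
```

Generating text in the style of the training text then amounts to repeatedly sampling the next token from the distribution conditioned on the current one.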