# Practical Data Science

*Practice and Applications of Data Science* (DSC 80) is a core sophomore level course I developed for the data science curriculum. It’s an intermediate touchpoint for students in the Data Science program:

- it applies material students learned in both the introductory computational sequence (algorithms and data structures) and the introductory math/stat sequence (probability and statistics) to realistic problems a data scientist might encounter.
- it sets the stage for the overall structure of data science projects, so students understand the broader context of how more advanced material (e.g., machine learning algorithms) fits into a typical investigation.

A working version of the *course notes* is found here.

The assignments and projects in this course drive student learning; these are found on the course repository.

## Learning Outcomes

In this course, students will:

- Leave equipped to “bring their own dataset” to apply materials they are learning in upper-division classes and outside projects.
- Understand how to clean (reasonable) real-world data, and understand the statistical implications of those decisions.
- Translate ambiguous, plain-English requirements into code that attempts to solve the problem, with a critical eye toward the strengths and shortcomings of their solution.
- Have a “big picture” understanding of why data science projects are structured the way they are, so they have adequate context for how more advanced modeling techniques are used in a data project.
- Get a feel for the daily work of a data scientist and the context of that work in the greater world. Particular attention is paid to data-driven decisions and the fairness of the actions they produce.

## Course Outline

The course covers the following topics in roughly the given proportions. Topics are intertwined.

- Reading and Manipulating Data (35%)
- Exploratory Data Analysis (35%)
- Applied Prediction (15%)
- Applied Inference (15%)

## Course Schedule

Week Number | Topic |
---|---|
Week 1 | Data Science Life Cycle; Data Generating Processes |
| Tabular Data Formats; NumPy/Pandas |
Week 2 | Data Cleaning; Faithful Representations of the DGP |
| Hypothesis Testing via Simulation |
Week 3 | Data Granularity and Grouping |
| Combining Data; Time Series |
Week 4 | Permutation Tests; Choices of Statistics |
| Identifying Mechanisms of Missing Data |
Week 5 | Null Imputation and Statistical Implications |
| Midterm |
Week 6 | Collecting Data; HTTP |
| Parsing and Translating Data; The HTML DOM |
Week 7 | Patterns in Unstructured Text |
| Text Features and Intro to NLP |
Week 8 | Generic Features and Intro to Modeling |
| Modeling Pipelines; Intro to scikit-learn (building a model) |
Week 9 | Bias/Variance Trade-off (building a good model) |
| Evaluation and Fairness; Black-Box Reasoning |
Week 10 | Inference about Evaluation/Fairness (was my model good by chance?) |
| Review |
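Simulation-based inference (Weeks 2 and 4) is a recurring theme. As a rough, illustrative sketch of the style of code students write, here is a two-sample permutation test for a difference in group means using NumPy and Pandas; the column names and data below are made up for illustration, not taken from any course assignment:

```python
import numpy as np
import pandas as pd

def permutation_test(df, col, group_col, n_repetitions=1000, seed=42):
    """Permutation test for a difference in group means.

    Repeatedly shuffles the group labels and compares the observed
    difference in means to the differences seen under shuffling.
    Returns the (two-sided) proportion of shuffled differences at
    least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = df.groupby(group_col)[col].mean().diff().iloc[-1]
    diffs = []
    for _ in range(n_repetitions):
        # Shuffle the labels, breaking any association between group and value
        shuffled = df.assign(**{group_col: rng.permutation(df[group_col].values)})
        diffs.append(shuffled.groupby(group_col)[col].mean().diff().iloc[-1])
    diffs = np.array(diffs)
    return np.mean(np.abs(diffs) >= np.abs(observed))

# Hypothetical example: delays for two airlines
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "airline": ["A"] * 50 + ["B"] * 50,
    "delay": np.concatenate([rng.normal(10, 5, 50), rng.normal(12, 5, 50)]),
})
p_value = permutation_test(df, "delay", "airline")
```

The simulated p-value answers "could a difference this large plausibly arise if the labels were irrelevant?", which is exactly the reasoning pattern the Week 4 material builds on.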

## Discussion Section

Alongside lecture, discussion section pays careful attention to helping students get comfortable using production libraries and tools, warts and all. These include:

- Practice pulling and updating code with Git.
- Experience updating and maintaining specific versions of Python libraries.
- Experience working with common Data Science Python libraries:
  - Pandas, NumPy
  - Matplotlib, seaborn, folium
  - requests, BeautifulSoup
  - SciPy, scikit-learn, statsmodels

## Course assignments

Alongside homework assignments that reinforce course materials, students work on 5 biweekly projects based on real-world data. These projects can raise ethical issues, or the ambiguous results that often come from complex investigations. Examples of these projects include:

- Project 1: *Computing student grades in a hypothetical DSC 80 course*. Students practice using Pandas and learn the course syllabus very well. The observations in the data are relatable to students, and they understand the importance of getting the *correct* answer at the end of a complicated series of computational steps. Considering possibilities for curving the course serves to approach issues of fairness.
- Project 2: *An investigation into flight delay data*. Students clean a large dataset of flight delays, building a picture of flights to and from San Diego's airport. In particular, students investigate mechanisms of missing data in delays, accumulating evidence of censorship on the part of certain airlines.
- Project 3: *Choose a dataset for a free exploratory data analysis*. Students choose one of 3 possible datasets to clean and understand, while performing an investigation of their own creation. These datasets change from quarter to quarter, so students can use the class work to seed their own project work after the class.
- Project 4: *Creating N-gram models from Project Gutenberg*. Students create an N-gram model class from scratch using Pandas. Students solidify their understanding of conditional probability in code, while creating a class that can generate text in the style of any training text.
- Project 5: *Choose a dataset for a prediction project*. Students use the dataset from Project 3 to pose and code a prediction problem. Once the model is trained, students analyze its performance in terms of a quantitative notion of fairness (an appropriately chosen parity measure).
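To give a flavor of the conditional-probability reasoning in the N-gram project, here is a minimal bigram (N = 2) model in plain Python. The class name, interface, and toy training text are all illustrative assumptions, not the actual project's specification (which uses Pandas):

```python
import random
from collections import defaultdict

class BigramModel:
    """A minimal bigram (N = 2) language model.

    Stores, for each token, the list of tokens that follow it in the
    training text; sampling uniformly from that list is equivalent to
    sampling from the conditional distribution P(next | current).
    """

    def __init__(self, tokens):
        self.transitions = defaultdict(list)
        for current, nxt in zip(tokens, tokens[1:]):
            self.transitions[current].append(nxt)

    def prob(self, current, nxt):
        """Conditional probability P(nxt | current) under the model."""
        followers = self.transitions[current]
        return followers.count(nxt) / len(followers) if followers else 0.0

    def sample(self, start, length, seed=None):
        """Generate up to `length` tokens, starting from `start`."""
        rng = random.Random(seed)
        out = [start]
        for _ in range(length - 1):
            followers = self.transitions[out[-1]]
            if not followers:
                break  # dead end: the last token never had a successor
            out.append(rng.choice(followers))
        return out

tokens = "the cat sat on the mat and the cat ran".split()
model = BigramModel(tokens)
p = model.prob("the", "cat")  # "the" is followed by "cat" twice and "mat" once, so 2/3
```

Generating text is then just repeated conditional sampling, which is why the project doubles as practice with conditional probability.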