Understanding Code Through Graphs

Data science capstone domain of inquiry (DSC 180AB A04)

Developed by Aaron Fraenkel, Shivam Lakhotia.



Introduction

This domain of inquiry studies how to understand computer code through machine learning on graphs. As a first approach, we will focus on malware detection in Android applications.

In this domain, project proposals will be restricted to the following areas:

  • Computer Code Understanding (using a variety of techniques)
  • Data Driven Approaches to Malware and Cyber-Security

This domain centers on understanding computer code as data. You will spend a significant amount of time on data processing: writing logic that parses this code and measures the behaviors of interest.
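As a rough illustration of treating code as data, the sketch below parses a small text sample and records which API calls appear inside each method, then links calls that share a method. It assumes the application has already been decompiled into smali-style text; the sample snippet, the regular expressions, and the function names are all hypothetical and are not the course's prescribed pipeline.

```python
import re
from collections import defaultdict

# Hypothetical smali-style snippet (an assumption for illustration);
# real decompiled applications are far larger and messier.
SAMPLE = """\
.method public onCreate()V
    invoke-virtual {p0}, Landroid/app/Activity;->getIntent()Landroid/content/Intent;
    invoke-static {v0}, Landroid/util/Log;->d(Ljava/lang/String;)I
.end method
.method public send()V
    invoke-virtual {v1}, Landroid/telephony/SmsManager;->sendTextMessage()V
    invoke-static {v0}, Landroid/util/Log;->d(Ljava/lang/String;)I
.end method
"""

# Matches a method header and captures the method name.
METHOD_RE = re.compile(r"^\.method\s+(?:\w+\s+)*([^\s(]+)\(")
# Matches an invoke instruction and captures "Lpackage/Class;->name".
INVOKE_RE = re.compile(
    r"^\s*invoke-\w+(?:/range)?\s+\{[^}]*\},\s*(L[^;]+;->[^(]+)\("
)


def api_calls_by_method(smali_text):
    """Map each method name to the set of API calls made inside it."""
    calls = defaultdict(set)
    current = None
    for line in smali_text.splitlines():
        header = METHOD_RE.match(line)
        if header:
            current = header.group(1)
            continue
        if line.startswith(".end method"):
            current = None
            continue
        invoke = INVOKE_RE.match(line)
        if invoke and current is not None:
            calls[current].add(invoke.group(1))
    return dict(calls)


def co_occurrence_edges(calls):
    """Return undirected edges between API calls that share a method."""
    edges = set()
    for apis in calls.values():
        ordered = sorted(apis)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                edges.add((first, second))
    return edges


calls = api_calls_by_method(SAMPLE)
edges = co_occurrence_edges(calls)
```

Here `calls` is a dictionary from method name to a set of API calls, and `edges` is one simple way to turn that parse into a graph. Other relationships (for example, calls from the same package) would give different graphs over the same data.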

Result replication (introduction to topic)

The bulk of the first half of the project will narrow the task of “computer code understanding” to a specific problem: detecting whether the source code of a given application is Malware or benign. Most of the result replication will focus on static code analysis using machine learning on graphs, as developed in the following papers:

The latter half of Quarter 1 will introduce further topics, such as graph embedding techniques and adversarial ML, to inform possible avenues for further research.

An initial note on ethics

The ability to identify and combat Malware requires understanding how Malware works. Such knowledge is divorced from how it gets used by the practitioner; you, as the practitioner, must be aware of the ethical concerns surrounding the use of this knowledge and behave accordingly. In particular, while working with the material in this course, you must never engage in illegal activity related to your coursework, and you must adhere to the code of academic integrity, as laid out in the syllabus.

The topic of ethics will be revisited regularly in this domain. For more reading on the ethics of teaching Malware, see the opinion piece On the Growing Harm of Not Teaching Malware and a related study on the ethics of teaching malware.


Section Participation

Participation in the weekly discussion section is mandatory. Each week, you are responsible for doing the reading/task assigned in the schedule. Come to section prepared to ask questions about and discuss the results of these tasks.

Each week, submit your answers to the weekly questions on Canvas. These questions are meant to focus your work for the week and help prepare you for discussion. If you have questions about your work, please ask them in section or office hours (I will rarely comment on your submission).

You are responsible for the entire weekly reading/task, even if portions are not covered in the weekly questions. The weekly tasks are the building blocks for the project proposals/assignments due at the end of the quarter.


Schedule

Week Topic
1 Introduction to Code-Understanding and Malware
2 Data: Code Parsing, Malware
3 Creating Graphs from Code; What is Malware?
4 Graph Invariants as Measurements
5 Building a Baseline Model
6 Evaluating the HinDroid Result
7 Graph Embedding I: node2vec
8 Graph Embedding II: metapath2vec
9 Production and Adversarial ML
10 Present Proposals

Office Hours

Fridays 9-11AM