Introduction to Code-Understanding and Malware
Topics
This week’s assignments will guide you through the following topics:
- How graphs can capture/summarize code functionality.
- How machine learning can identify malware via graphs.
Reading
Please read the following:
Tasks
Find your favorite Python file and sketch out how you would train an
n-gram model on it (or actually try it!). Things you should consider
are:
- How would you tokenize it? (What do you consider a word?)
- What would you consider a sentence or paragraph?
- What syntactic structure would the model likely pick up, and what
would it miss? (What would co-occur with a function definition, for
example?)
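One possible answer to the tokenization question is sketched below, assuming you let Python's own lexer decide what a "word" is through the standard-library `tokenize` module. The helper name `code_tokens` and the `<NEWLINE>`, `<INDENT>`, and `<DEDENT>` placeholder symbols are illustrative choices, not part of the assignment; splitting on whitespace or characters would be equally valid starting points.

```python
import io
import tokenize

def code_tokens(source):
    """Return a list of token strings for a piece of Python source text."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Assumption: comments and blank lines are noise for this model.
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            # Layout tokens carry Python's block structure, so keep them
            # as explicit placeholder symbols instead of raw whitespace.
            tokens.append(f"<{tokenize.tok_name[tok.type]}>")
        else:
            tokens.append(tok.string)
    return tokens

sample = "def add(a, b):\n    return a + b\n"
print(code_tokens(sample))
```

With this choice, a "sentence" could be everything up to a `<NEWLINE>` symbol (one logical line), and a "paragraph" could be one function or class body, but other boundaries are worth trying.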
If you actually try this, train the model on Python projects from
GitHub repositories. Then use your language model to generate code and
see how it fails to run!
Note: Recently, transformer-based language models have proven
capable of understanding computer code surprisingly well, essentially
by throwing massive amounts of computation power at the problem.
Weekly Questions
Answer the following questions on Canvas:
- How is learning computer code different from typical approaches to
NLP (e.g. n-gram models)? What would an NLP model fail to learn?
Give an example.
- What are the different ways malware might attack a mobile device
through an application?
- What are the primary ways of analyzing source code for evidence of malware?