Topics
This week’s assignments will guide you through the following topics:
- Computational concerns of creating graphs from Smali
- Using graph statistics as features (and using these for EDA).
Reading
Please read the following:
- (Re)Read HinDroid
sections 3.1 - 3.2 (Features & HIN construction).
- From the
/teams/DSC180A_FA20/a04malware
directory, run the command
ls -l malware/ | shuf | head -3
. This will pull 3 Malware types at
random. Google them and try to find out what they do.
Bonus:
- Read MaMaDroid (Sections I,
IIABC). This paper uses something closer to an n-gram approach to
source code in the same setting.
Tasks
Complete the following tasks:
- Continue work on the ETL pipeline (work on the larger dataset)
- Complete an EDA on (a sample of) the app dataset in the
/teams
directory. You should be aiming to answer questions:
- how do the apps differ for Malware vs Not? Different types of
Malware? Random vs. Popular apps?
- which classes in an app ‘add no new information’? How might you
extract data from a subset of the smali files in an app?
- From the EDA, you’ve likely computed a table where the observations
are apps and the columns are statistics for that app. Use this to
train/test a baseline classifier (e.g. using Logistic
Regression). This will serve as a basis for later comparison.
Weekly Questions
Answer the following questions on Canvas:
- Give a summary of a Malware type in the training data that you
researched. What does it do? How does it work?
- Give an interesting observation from your EDA. Be specific, with
precise values!
- What features did you use in your baseline model? What was the model performance?
- What difficulties did you encounter that you would like to discuss
this week?