Creating Graphs from Code; What is Malware?

Topics

This week’s assignments will guide you through the following topics:

  • Think about what Malware is, and how you identify it in source code.
  • Build a data processing pipeline to extract features from smali.

Reading

Please read the following:

  • Read the HinDroid paper through section 3.1 (feature extraction). Read this thoroughly!
  • For more information on the structure of APKs and Android Bytecode (e.g. Smali), read section II of this paper.

Tasks

Note: These tasks will be difficult, and potential take awhile to compute, so start early!

Complete the following tasks:

  • Write parsing code for the test files in /teams/DSC180A_FA20_A00/a04malware/test-apps on the server. Your code should extract all the information needed to create the matrices A, B, P, I in the Hindroid paper and put that information in a csv file. My suggestion would be to have each observation in the csv correspond to an api call found in one of the classes/files. How exactly you organize it is up to you, though!
  • Calculate the distribution/counts of APIs, packages, package-families (L<family>/), and invoke-types in each app. How many distinct shared apis do each pair of apps have? distinct packages?
  • Build the adjacency matrix A for the given test apps.

Weekly Questions

Answer the following questions on Canvas:

  • How many distinct APIs occur in the Instagram App?
  • What are the dimensions of A for the test apps?
  • For each adjacency matrix A, B, P, interpret in plain english how two elements connected by an edge should be considered similar. How will these similarities help us identify malware?