This week’s assignments will guide you through the following topics:
Note how quality and coordination are measured in Aniket Kittur and Robert E. Kraut, “Harnessing the Wisdom of Crowds in Wikipedia: Quality Through Coordination,” CSCW ’08. [Link]
Complete the following tasks:
Download some edit data from the Wikimedia Data Archives [Link]. Your data should come from a file labeled ‘All pages with complete page edit history’. These files are large, so start with one of the smaller ones.
Pick a few fields and run a preliminary exploratory data analysis (EDA) on the file.
Note: This task will require you to unzip and parse XML. You can unzip either from the command line or with Python libraries. XML can be parsed in memory using BeautifulSoup. You may also find the tools here and here useful.
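As a rough illustration of the note above, the sketch below decompresses a dump in memory and counts edits per contributor. It makes several assumptions: the file is bz2-compressed, the XML follows the MediaWiki export layout (`page` > `revision` > `contributor` with a `username` or `ip` child), and the sample data, the namespace version, and the function names are made up for the example. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra installs; the same traversal could be written with BeautifulSoup.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature dump; real files are far larger and the
# namespace version in the xmlns attribute may differ.
SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <id>1</id>
      <timestamp>2008-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
    </revision>
    <revision>
      <id>2</id>
      <timestamp>2008-01-02T00:00:00Z</timestamp>
      <contributor><username>Bob</username></contributor>
    </revision>
    <revision>
      <id>3</id>
      <timestamp>2008-01-03T00:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def edits_per_editor(compressed):
    """Decompress the whole dump into memory and count revisions per editor."""
    root = ET.fromstring(bz2.decompress(compressed))
    counts = Counter()
    for elem in root.iter():
        if local_name(elem.tag) != "contributor":
            continue
        # Registered editors carry <username>; anonymous ones carry <ip>.
        for child in elem:
            if local_name(child.tag) in ("username", "ip"):
                counts[child.text] += 1
    return counts


# Simulate a compressed dump file, then summarize it.
counts = edits_per_editor(bz2.compress(SAMPLE_XML.encode("utf-8")))
print(counts.most_common())
```

This approach loads the entire file into memory, so it is only suitable for the smallest dump files.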
When doing this task, think about building a library of functions that will process this type of data without manual intervention and with a small memory footprint, as the complete dataset may contain dozens of multi-GB files!
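One possible starting point for such a library is a generator that streams revisions instead of loading the file. The sketch below is only a minimal example under stated assumptions: bz2 compression, the MediaWiki export layout in which `title` precedes the `revision` elements of a page, and hypothetical names (`iter_revisions`, `SAMPLE`). It relies on `xml.etree.ElementTree.iterparse` and clears each element once it has been processed.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_revisions(stream):
    """Yield (title, revision_id, timestamp) from an open binary XML stream."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' prefix
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            rev_id = timestamp = None
            # Only look at direct children, so <contributor><id> is ignored.
            for child in elem:
                name = child.tag.rsplit("}", 1)[-1]
                if name == "id":
                    rev_id = child.text
                elif name == "timestamp":
                    timestamp = child.text
            yield title, rev_id, timestamp
            elem.clear()  # release the revision's text and children
        elif tag == "page":
            elem.clear()  # release whatever is left of the finished page


# Hypothetical miniature dump used to exercise the generator.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <id>1</id>
      <timestamp>2008-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username><id>42</id></contributor>
    </revision>
    <revision>
      <id>2</id>
      <timestamp>2008-01-02T00:00:00Z</timestamp>
      <contributor><ip>127.0.0.1</ip></contributor>
    </revision>
  </page>
</mediawiki>"""

# bz2.open accepts a path or a file object; a real run would pass a filename.
with bz2.open(io.BytesIO(bz2.compress(SAMPLE)), "rb") as stream:
    rows = list(iter_revisions(stream))
print(rows)
```

Because the generator yields one revision at a time, the same function can be looped over many files in sequence. Emptied elements still stay attached to the root, so memory use grows slowly with the number of pages; for very large files the root's children could also be dropped periodically.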
Answer the following questions: