This week’s assignments will guide you through the following topics:
Note how quality and coordination are measured in Aniket Kittur and Robert E. Kraut. “Harnessing the Wisdom of Crowds in Wikipedia: Quality Through Coordination.” CSCW ’08. [Link]
Complete the following tasks:
Download some edit data from the Wikimedia Data Archives [Link]. Your data should come from a file listed under ‘All pages with complete page edit history’. These files are large, so try to pick one of the smaller ones.
Pick a few fields and run a preliminary exploratory data analysis (EDA) on the file.
Note: This task will require you to decompress the file and parse XML. You can decompress it either from the command line or with Python libraries. XML can be parsed in memory using BeautifulSoup. You may also find the tools here and here useful.
When doing this task, think about building a library of functions that will process this type of data without manual intervention and with a small memory footprint, as the complete dataset may contain dozens of multi-GB files!
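One possible starting point for such a library is sketched below. It assumes the dump is a `.bz2` file and that each `<page>` holds a `<title>` followed by `<revision>` elements with `<timestamp>` and `<contributor>` children; check this against the file you actually download. The function names `iter_revisions` and `edits_per_contributor` are made up for illustration. The sketch swaps in the standard library's streaming parser (`xml.etree.ElementTree.iterparse`) for an in-memory parse, so only one revision is held at a time.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(path):
    """Yield one dict per <revision>, streaming from a .bz2 dump (assumed layout)."""
    with bz2.open(path, "rb") as stream:
        title = None
        # "end" events fire once an element has been read completely.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            name = local_name(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "revision":
                row = {"title": title, "timestamp": None, "contributor": None}
                for child in elem:
                    child_name = local_name(child.tag)
                    if child_name == "timestamp":
                        row["timestamp"] = child.text
                    elif child_name == "contributor":
                        # Registered users have <username>, anonymous edits have <ip>.
                        for part in child:
                            if local_name(part.tag) in ("username", "ip"):
                                row["contributor"] = part.text
                yield row
                elem.clear()  # drop the revision text to keep memory low
            elif name == "page":
                elem.clear()


def edits_per_contributor(path):
    """Count revisions per contributor without loading the whole file."""
    return Counter(row["contributor"] for row in iter_revisions(path))
```

Because `iter_revisions` is a generator, the same function can feed a counter, a CSV writer, or a loop over many dump files without changes. Cleared elements still leave small empty stubs attached to the document root, so very large files may need extra cleanup.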
Answer the following questions: