Analytics Datasets: MediaWiki History

This dataset contains a historical record of revision, user and page events of Wikimedia wikis since 2001. The data is denormalized, meaning that all events for users, pages and revisions are stored in the same schema. This leads to some fields always being null for some events (for instance, page fields are null in events about users). Events about users and pages have been processed to rebuild a history that is as coherent as possible in terms of user renames and page moves (see the page and user history reconstruction wikitech page). In addition, some precomputed fields have been added to facilitate analyses, such as edit counts per user and per page, reverting and reverted revisions, and more. For further details visit the MediaWiki history dumps dataset wikitech page, which contains the schema and links to some code examples.

Updates

Updates to this dataset are monthly, around the end of the first week of the month. Each update contains a full dump from 2001 (the beginning of MediaWiki time) up to the current month. The reason for this particularity is the underlying data: the MediaWiki databases. Every time a user is renamed, a revision reverted, a page moved, etc., the existing related records in the logging table are updated accordingly. An event triggered today may therefore change records describing the state of the wiki 10 years ago, and the logging table is the base of the MediaWiki history reconstruction process. Thus, note that incremental downloads of these dumps may generate inconsistent data. Consider using EventStreams for real-time updates on MediaWiki changes (API docs).
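For the real-time route, the sketch below subscribes to EventStreams' public recentchange stream with the Python requests library and prints basic details of each edit event. It is a minimal illustration, assuming the stream URL and event fields of the public recentchange schema; a production consumer would also handle reconnection and multi-line events.

import json
import requests

# Public Server-Sent Events endpoint for recent changes (assumed here;
# other streams exist, see the EventStreams API docs).
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

with requests.get(STREAM_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE payload lines are prefixed with "data: "
        if line and line.startswith("data: "):
            event = json.loads(line[len("data: "):])
            if event.get("type") == "edit":
                print(event["wiki"], event["title"], event["timestamp"])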

Versioning

Each update receives the name of the last featured month, in YYYY-MM format. For example, if the dump spans from 2001 to August 2019 (inclusive), it is named 2019-08 even though it is released in the first days of September 2019. There is a folder for each available snapshot at the root of the download URL, and for storage reasons only the last two versions are kept. This shouldn't be problematic, as every version contains the whole historical dataset.
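As a small illustration of this naming rule, the hypothetical helper below derives the expected name of the latest snapshot from a given date, assuming the snapshot covering the previous month has already been published (it is released around the end of the first week of the month).

from datetime import date

def latest_snapshot_name(today: date) -> str:
    """Return the expected YYYY-MM name of the most recent snapshot.

    A snapshot is named after the last month it covers, so the dump
    released in early September 2019 is named 2019-08.
    """
    year, month = today.year, today.month
    if month == 1:            # step back to the previous month
        year, month = year - 1, 12
    else:
        month -= 1
    return f"{year:04d}-{month:02d}"

print(latest_snapshot_name(date(2019, 9, 5)))  # -> 2019-08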

Partitioning

The data is organized by wiki and time range, so it can be downloaded for a single wiki (or a set of wikis). The time split is necessary for file-size reasons. There are 3 different time-range splits: monthly, yearly and all-time. Very big wikis are partitioned monthly, medium wikis yearly, and small wikis are dumped in a single file. This ensures that files are not larger than ~2 GB, while at the same time preventing a very large number of files.
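Because the granularity used for a given wiki is not published as separate metadata, a client that wants to fetch all files for a wiki can simply try every possible time-range label. The helper below is a hypothetical sketch that enumerates them; the starting year and cut-off date are parameters you would choose yourself.

from datetime import date

def candidate_time_ranges(first_year: int = 2001, until: date = date(2019, 8, 1)):
    """Yield all possible time-range labels: 'all-time' for small wikis,
    'YYYY' for medium wikis and 'YYYY-MM' for the biggest ones."""
    yield "all-time"
    for year in range(first_year, until.year + 1):
        yield f"{year}"
        last_month = until.month if year == until.year else 12
        for month in range(1, last_month + 1):
            yield f"{year}-{month:02d}"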

File format

The file format is tab-separated values (TSV) instead of JSON in order to reduce file sizes (JSON repeats field names for every record). Although MediaWiki history data is pretty flat, it has some fields that are arrays of strings. Such arrays are encoded as: array(<value1>,<value2>,...,<valueN>). The compression algorithm is Bzip2, because it is widely used, free software, and has a high compression ratio. Note that with Bzip2, you can concatenate several compressed files and treat them as a single Bzip2 file.
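As a sketch of how such a file might be read with Python's standard library (the file name is hypothetical, following the naming scheme described below):

import bz2
import csv

def parse_array(value):
    """Decode the array(<value1>,...,<valueN>) encoding used for
    string-array fields; other values are returned unchanged."""
    if value.startswith("array(") and value.endswith(")"):
        inner = value[len("array("):-1]
        return inner.split(",") if inner else []
    return value

path = "2019-08.eswiki.2019.tsv.bz2"  # hypothetical local file

with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        fields = [parse_array(value) for value in row]
        # ... process one denormalized event record ...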

Directory structure

When choosing a file (or set of files) to download, the URL should look like this:
/<version>/<wiki>/<version>.<wiki>.<time range>.tsv.bz2
Where <version> is the YYYY-MM formatted snapshot, e.g. 2019-12; <wiki> is the wiki database name, e.g. enwiki or commonswiki; and <time range> is either YYYY-MM for big wikis, YYYY for medium wikis, or all-time for the rest (see Partitioning above). Example dump file paths are shown in the sketch below.
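The sketch below builds such paths and downloads one file with the Python requests library. The base URL and the specific wiki/time-range pairings in the comments are illustrative assumptions derived from the pattern above, not a published index.

import requests

# Assumed base URL of the public dumps server; adjust to your mirror.
BASE_URL = "https://dumps.wikimedia.org/other/mediawiki_history"

def dump_url(version, wiki, time_range):
    """Build /<version>/<wiki>/<version>.<wiki>.<time range>.tsv.bz2."""
    return f"{BASE_URL}/{version}/{wiki}/{version}.{wiki}.{time_range}.tsv.bz2"

# Illustrative paths constructed from the pattern (granularity assumed):
#   .../2019-12/enwiki/2019-12.enwiki.2019-11.tsv.bz2   big wiki, monthly
#   .../2019-12/cawiki/2019-12.cawiki.2018.tsv.bz2      medium wiki, yearly
#   .../2019-12/iewiki/2019-12.iewiki.all-time.tsv.bz2  small wiki, all-time

url = dump_url("2019-12", "iewiki", "all-time")
response = requests.get(url, timeout=120)
response.raise_for_status()
with open("2019-12.iewiki.all-time.tsv.bz2", "wb") as out:
    out.write(response.content)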

Download MediaWiki History Data

If you're interested in how this dataset is generated, have a look at the following articles:

All Analytics datasets are available under the Creative Commons CC0 dedication.