Wikimedia Releases Wikipedia Search Data Starting From Today

Posted on Sep 20 2012 - 4:39am by Editorial Staff

The Wikimedia Foundation, the parent organization behind Wikipedia, has announced the availability of anonymous search log files for Wikipedia and its sister projects, starting today. This means you can now copy, modify, distribute, and perform work on the data.

Data about search queries provides valuable feedback to the foundation's editor community, who can use it to detect topics of interest that are currently insufficiently covered. It can also help improve the search index by allowing improvements to be benchmarked against real queries.

Each line in the log files is tab-separated and contains the following 10 fields:

  • Server hostname.
  • Timestamp (UTC).
  • Wikimedia project.
  • URL-encoded search query.
  • Total number of results.
  • Lucene score of best match.
  • Interwiki result.
  • Namespace (coded as integer).
  • Namespace (human-readable).
  • Title of best matching article.
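
To make the layout concrete, here is a minimal Python sketch of how one such line might be read, assuming the ten fields appear in the order listed above. The field names, the sample lines, and the zero_result_queries helper are hypothetical and are not part of the Wikimedia release. The sketch also assumes that "+" in an encoded query stands for a space, and it keeps every value as a string because the exact formats are not described here.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import unquote_plus

FIELD_COUNT = 10  # tab-separated fields per line, as listed above


@dataclass
class SearchLogEntry:
    # Hypothetical field names, in the order the fields are listed above.
    # Values stay as strings because their exact formats are an unknown here.
    server: str
    timestamp: str
    project: str
    query: str
    total_results: str
    best_score: str
    interwiki: str
    namespace_id: str
    namespace_name: str
    best_title: str


def parse_line(line: str) -> SearchLogEntry:
    """Split one log line on tabs and URL-decode the search query."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(fields)}")
    entry = SearchLogEntry(*fields)
    # Assumption: "+" encodes a space, so unquote_plus is used, not unquote.
    entry.query = unquote_plus(entry.query)
    return entry


def zero_result_queries(lines) -> Counter:
    """Count decoded queries whose total-results field is exactly "0"."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry.total_results == "0":
            counts[entry.query] += 1
    return counts


# Made-up sample lines for illustration only; real values will differ.
sample_lines = [
    "host1\t2012-09-20 04:39:00\tenwiki\tfoo%20bar\t0\t0\t\t0\tMain\t",
    "host1\t2012-09-20 04:39:05\tenwiki\tfoo%20bar\t0\t0\t\t0\tMain\t",
    "host2\t2012-09-20 04:39:09\tenwiki\tparis\t42\t7.5\t\t0\tMain\tParis",
]

print(zero_result_queries(sample_lines).most_common(1))
```

Counting queries that return no results is one simple way to look for topics that are insufficiently covered, which is the kind of feedback described earlier.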

The log files contain queries for all Wikimedia projects and all languages, and are unsampled and anonymous. They are collected both from the search box on a wiki page, after the visitor submits the query, and from queries submitted from Special:Search pages. The search log data does not contain queries from the autocomplete search functionality, as this generates too much data.

About the Author
Editorial Staff

Editorial Staff at I2Mag is a team of subject experts led by Karan Chopra.