Grants:IEG/Wikidata Toolkit

status: draft


project:

Wikidata Toolkit

project contact:

markus@semantic-mediawiki.org

participants:

grantees:

Markus Krötzsch is the creator of Semantic MediaWiki and data architect of Wikidata. He is a Departmental Lecturer at the University of Oxford and will be leading a research group at TU Dresden starting Nov 2013.
Research assistant tbd (another person from Markus's research group at TU Dresden)
Student assistant tbd (a secondary goal of the project is to involve students in Wikipedia-related development and research)

summary:

The project will develop a toolkit and web service to query and analyse information exported from Wikidata, providing a feature-rich query API based on a robust and scalable backend.

2013 round 2

Project idea

Problem: The Wikidata project collects large amounts of data, but understanding this data requires technical means for querying and analysis that are not currently available. Even skilled developers have hardly any basis for working with Wikidata.

Solution: A modular toolkit for loading, querying, and analysing Wikidata data will make it easy for developers to use Wikidata in their applications. A web service built on top of this toolkit will offer live query capabilities to a wider range of users. The work will draw heavily on prior experience and existing tools, with the goal of unifying and improving existing partial solutions.

Motivation

Wikidata collects large amounts of data across all Wikipedia languages. The data comprises names, dates, coordinates, relationships, and URLs, as well as references for many statements. In contrast to Wikipedia, where the main way of accessing information is to read single pages, the information in Wikidata is most interesting when facts are viewed in a wider context, combining information across many subjects. For example, we can now answer the question of how the sex distribution of people with Wikipedia articles varies across languages. For Wikidata editors, complex questions are interesting for yet another reason: they use them to check data quality by looking for patterns that should not normally occur. For instance, the mother of a person should normally be female, which is currently not always the case in the data. These and many other interesting insights about Wikidata can be gained by querying the data set for certain patterns, thus revealing the true potential of the project.

Unfortunately, Wikidata does not support any advanced form of query. The basic API provided by the project is limited to retrieving elements by their label (or alias). It is not even possible to find pages that refer to another page, e.g., to find the albums recorded by a certain artist. MediaWiki's "What links here" is sometimes (ab)used as a workaround in cases where it is enough to know that another page is mentioned somewhere in the data. In all of the cases mentioned above, custom-made software is used to analyse the data from dumps. Each such analysis is a time-consuming offline process that often takes hours to complete. Even worse, the lack of technical support excludes the vast majority of users from analysing Wikidata. Even technically trained users who would be able to formulate, say, an SQL query are discouraged by the immense technological barrier of creating their own query answering system.
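To make the "albums recorded by a certain artist" example concrete, the following is a minimal sketch of the kind of one-off script that currently has to be written for such a question. It assumes the dump has already been reduced to simple (subject, property, value) statements; the `Statement` type, the `find_referrers` helper, and all identifiers such as `album_1` or `performer` are hypothetical illustrations, not Wikidata's actual export format or property names.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Statement(NamedTuple):
    """One simplified fact: subject --prop--> value (hypothetical format)."""
    subject: str
    prop: str
    value: str


def find_referrers(statements: Iterable[Statement], target: str,
                   prop: Optional[str] = None) -> Iterator[str]:
    """Yield each subject that points at `target`, scanning every statement.

    If `prop` is given, only statements using that property count.
    This is a full linear scan: fine for a toy list, slow for a real dump.
    """
    seen = set()
    for st in statements:
        if st.value != target:
            continue
        if prop is not None and st.prop != prop:
            continue
        if st.subject not in seen:
            seen.add(st.subject)
            yield st.subject


# Toy data standing in for a dump; all identifiers are made up.
DUMP = [
    Statement("album_1", "performer", "artist_A"),
    Statement("album_2", "performer", "artist_A"),
    Statement("album_3", "performer", "artist_B"),
    Statement("person_X", "influenced_by", "artist_A"),
]

albums_by_a = list(find_referrers(DUMP, "artist_A", prop="performer"))
anything_linking_to_a = list(find_referrers(DUMP, "artist_A"))
```

Restricting by property gives the precise answer (`album_1`, `album_2`), whereas ignoring the property mimics the "What links here" workaround and also returns `person_X`, which merely mentions the artist.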

The goal of this project is to develop the technical components needed to simplify query answering over up-to-date Wikidata data. The heart of this project is a robust and flexible query backend that provides an API for running a variety of queries. A web service to showcase the functionality will be created and set up to use current (or very recent) data. The main approach for achieving this is to develop a set of modular, re-usable, client-side components for in-memory query answering. While Wikidata is large (and growing quickly), its size is certainly within the range of modern main memory sizes, and the added flexibility of a memory-based model is essential to support a wider range of queries. Moreover, components for loading and updating data selectively can help to filter information so that querying is possible even on machines with commodity memory sizes.
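The combination of selective loading and in-memory querying described above could look roughly like the following sketch. It is not the planned toolkit design: the `InMemoryStore` class, its methods, and the property names (`mother`, `sex`, `label_en`) are assumptions made purely for illustration, and the input is again a hypothetical list of (subject, property, value) triples rather than a real Wikidata dump.

```python
from collections import defaultdict


class InMemoryStore:
    """Toy in-memory statement store with an optional property filter."""

    def __init__(self, keep_props=None):
        # None means "keep everything"; otherwise only these properties load.
        self.keep_props = set(keep_props) if keep_props is not None else None
        self._by_subject = defaultdict(lambda: defaultdict(set))
        self._by_value = defaultdict(set)  # (prop, value) -> subjects

    def load(self, triples):
        """Index the triples that pass the filter; return how many were kept."""
        loaded = 0
        for subject, prop, value in triples:
            if self.keep_props is not None and prop not in self.keep_props:
                continue  # selective loading: skip data we will not query
            self._by_subject[subject][prop].add(value)
            self._by_value[(prop, value)].add(subject)
            loaded += 1
        return loaded

    def values(self, subject, prop):
        """All values of `prop` on `subject` (empty set if none)."""
        return set(self._by_subject.get(subject, {}).get(prop, ()))

    def subjects_with(self, prop, value):
        """All subjects that have `prop` set to `value` (reverse lookup)."""
        return set(self._by_value.get((prop, value), ()))

    def pairs(self, prop):
        """Yield every (subject, value) pair recorded for `prop`."""
        for subject, props in self._by_subject.items():
            for value in props.get(prop, ()):
                yield subject, value


def mothers_not_female(store):
    """Quality check: (person, mother) pairs where the mother is not
    recorded as female. Mothers with no recorded sex are flagged too."""
    return sorted(
        (person, mother)
        for person, mother in store.pairs("mother")
        if "female" not in store.values(mother, "sex")
    )


# Toy data; identifiers and property names are made up.
TRIPLES = [
    ("alice", "sex", "female"),
    ("bob", "sex", "male"),
    ("carol", "mother", "alice"),
    ("dave", "mother", "bob"),       # violates the expected pattern
    ("carol", "label_en", "Carol"),  # dropped by the filter below
]

store = InMemoryStore(keep_props={"mother", "sex"})
kept = store.load(TRIPLES)
suspicious = mothers_not_female(store)
```

Once the data is indexed in memory, both the forward lookup (`values`) and the reverse lookup (`subjects_with`) are dictionary accesses rather than scans over a dump, and the property filter keeps the memory footprint proportional to what a given analysis actually needs.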

Project goals

The project has two main technical outcomes:

(1) Wikidata toolkit. A set of modular components for in-memory processing of information from Wikidata in a programmatic way

(2) Query web service. A web service to run queries against current Wikidata content that is built on top of the toolkit

In addition, the project aims at a soft outcome to ensure sustainability beyond the initial grant:

(3) Community engagement. Active involvement of volunteer developers and interested users

Outcome (1) is the heart of the project. Outcome (2) is a first application that will make (1) more tangible and help to evaluate project progress. Outcome (3) aims at increasing the long-term impact of the project. In view of (3), toolkit development will focus particularly on maintainable code and an extensible architecture.

The general goals that these outcomes should help to achieve are:

Significantly lower barrier for using and analysing Wikidata content
Improved quality control mechanisms for Wikidata editors
Higher utility and visibility of Wikidata content, beyond direct use in Wikimedia projects
Increase in content-driven applications based on Wikidata content

The following are not goals of the project: to develop new database management software (the project is read-only), to replace future Wikidata query features (they address different needs and requirements), to develop innovative user interfaces for queries and analysis (this might be a follow-up project), or to improve MediaWiki API access for programs (API access and bot frameworks are different types of toolkits; the problems addressed in the present project are not addressed by Wikidata's current web API).


Part 2: The Project Plan

Project plan

Temporary note: the rest of this proposal will be provided soon, but not today. --Markus Krötzsch (talk) 16:52, 27 September 2013 (UTC)

Scope:

Scope and activities

Tools, technologies, and techniques

Budget:

Total amount requested

Budget breakdown

Intended impact:

Target audience

Fit with strategy

Sustainability

Measures of success

Participant(s)

Discussion

Community Notification:

Please paste a link to where the relevant communities have been notified of this proposal, and to any other relevant community discussions, here.

Endorsements:

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.

Community member: add your name and rationale here.