This blog post was written for the lecture “Wikidata: Curating Data about the World with 17000 Volunteers”, the 6th lecture of the lecture series Open Technology For An Open Society. The lecture was given by Lydia Pintscher, Product Manager at Wikidata.

Many people use Wikipedia on a daily basis, be it to look something up, to get information or to find further sources on a topic. But what is Wikidata, how can Wikidata improve Wikipedia, and who benefits from it?

What is Wikidata?

As the name suggests, the Wikidata project is closely linked to Wikipedia. It is a project of the Wikimedia Foundation, the non-profit organization that takes care of structural tasks such as administration, software development and donation coordination for Wikipedia and other projects. Its aim is to make multilingual free knowledge accessible to as many people as possible and to enable them to participate in creating knowledge [1]. Most people know the foundation from its annual fundraising banner on Wikipedia.

Wikidata logo

Wikidata is a database project and is intended to be a “Wikipedia for data” [2]. As with Wikipedia, anyone who wants to can take part. The project lives and grows thanks to its more than 18,000 [3] volunteers.
Wikidata was founded primarily to support Wikipedia in its goal of providing access to knowledge. To do this, Wikidata takes data from Wikipedia, such as years, coordinates and place names, and stores it in a structured, language-independent and therefore machine-readable way. More on that later.

The Wikidata project was founded in 2012 by Denny Vrandečić and Markus Krötsch [4] and was initially financed by donations from Google, among others. It is mainly looked after by a small team at Wikimedia Deutschland e.V., the Berlin-based German chapter of the Wikimedia Foundation (the national organizations are called chapters).
Nevertheless, 200 people from all over the world came to the first WikidataCon, a conference of the Wikidata community that took place a month ago in Berlin and also celebrated the project's fifth anniversary. Today, as with Wikipedia, the funding comes from donations to the Wikimedia Foundation.

The data is published under the CC0 license (public domain), so it can be used without restrictions or conditions, including commercially.

Wikidata and Wikipedia

Example of an infobox generated by Wikidata, excerpt from a screenshot of the Wikipedia article South Pole Telescope [12]

Wikidata supports the individual language versions of Wikipedia through a central database that all Wikipedias can access and that links them together.
Wikidata's first contribution to Wikipedia was the automation of the links between Wikipedia articles on the same topic in different languages, which are located on the left margin of every Wikipedia article [5].

There are Wikipedias in almost 300 languages [6]. Each language version has its own community and its own rules, such as its own relevance criteria. There are large Wikipedias with large, active communities and many articles that are kept up to date, such as the German-language Wikipedia with 2,126,864 articles, and smaller ones, such as the Albanian-language Wikipedia with 70,593 articles [6].
This results in large differences in quality and scope as well as contradictory information between the language versions, for example a different population figure for the same city in each language version.
Current data can be integrated independently of language through automatically generated infoboxes [7], which are fed from Wikidata. If an article is not yet available in a language, the Article Placeholder [8] instead displays all the data on the topic that is available in Wikidata, thus providing information and encouraging people to write an article. Both functions are intended to support small Wikipedias in particular.

Automatically generated lists are also on the way [9] [10]; an example would be a list of all space probes that have been launched into space, which is kept up to date by automatic regular queries. With the Query Service [11], such queries can already be sent to Wikidata and displayed graphically, for example as a bar chart, timeline or, if geographic coordinates are available, points on a map.
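
As a rough illustration of such a query, the following Python sketch builds a SPARQL query for the space probe example and encodes it as a request URL for the Query Service. It only constructs the request and does not send it. The identifiers Q26529 (space probe) and P619 (launch date) are assumptions on my part and should be checked on wikidata.org; the function names are made up for this example.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed identifiers, to be verified on wikidata.org before use:
# P31 = "instance of", Q26529 = "space probe", P619 = launch date.
SPACE_PROBE = "Q26529"
LAUNCH_DATE = "P619"


def build_probe_query(limit: int = 10) -> str:
    """Return a SPARQL query listing space probes with their launch dates."""
    return (
        "SELECT ?probe ?probeLabel ?launch WHERE {\n"
        f"  ?probe wdt:P31 wd:{SPACE_PROBE} ;\n"
        f"         wdt:{LAUNCH_DATE} ?launch .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        "ORDER BY ?launch\n"
        f"LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET request URL that asks for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Only prints the URL; actually fetching it requires network access.
    print(build_request_url(build_probe_query(limit=5)))
```

Running the same query at regular intervals is, in principle, what would keep such an automatically generated list up to date.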

The world cannot be pigeonholed in small boxes - a special data structure

Extracting data from a body of text is not an easy task for a machine. It becomes even more difficult when the machine is supposed to interpret the text. This is only possible if the data and their relationships follow a structure that the machine can read.
What makes Wikidata's data special is that it is semantically searchable: it allows searches based on the meaning and context of a query, in contrast to a pure keyword search [13]. The relationships between individual pieces of data can be represented in the form subject, predicate, object, for example [Douglas Adams | is or was a | Human] [14].
Since people usually do not speak a database query language but instead phrase their search queries more or less semantically, this format is particularly interesting for artificial intelligence systems that are meant to interact with people, such as Google Search, Siri (Apple), Alexa (Amazon) or Watson (IBM).
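
The subject, predicate, object form can be sketched in a few lines of Python. This is only a simplified illustration, not Wikidata's actual storage format. The identifiers Q42 (Douglas Adams), P31 ("instance of") and Q5 (human) are the ones Wikidata uses for the example above; the second entry and the helper function are made up.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


triples = [
    # [Douglas Adams | is or was a | Human]
    Triple("Q42", "P31", "Q5"),
    # Made-up item, added only so the filter has something to exclude.
    Triple("Q_HYPOTHETICAL", "P31", "Q515"),
]


def subjects_with(predicate: str, obj: str, data: list[Triple]) -> list[str]:
    """Return every subject that carries the given predicate/object pair."""
    return [t.subject for t in data if t.predicate == predicate and t.obj == obj]


# "Which subjects are humans?" becomes a structured lookup, not a text search.
print(subjects_with("P31", "Q5", triples))  # ['Q42']
```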

The subjects are called items, and each roughly corresponds to a Wikipedia article. So that it can be used independently of language, each item is assigned a unique identifier, such as Q515. The names of the items, which usually differ from language to language, are called labels. The label for item Q515 is “Stadt” in German and “City” in English [15].
An item also has properties such as “is a” (human) or “educated at” (university XY, elementary school YZ, etc.).
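
The separation between a language-independent identifier and per-language labels could look roughly like this in Python. The class and the fallback behaviour (English first, then the bare identifier) are assumptions made for this sketch, not a description of how Wikidata resolves labels.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    qid: str                                               # language-independent identifier
    labels: dict[str, str] = field(default_factory=dict)   # language code -> label

    def label(self, lang: str, fallback: str = "en") -> str:
        """Return the label in `lang`, else the fallback language, else the ID."""
        return self.labels.get(lang) or self.labels.get(fallback) or self.qid


city = Item("Q515", {"de": "Stadt", "en": "City"})

print(city.label("de"))  # Stadt
print(city.label("fr"))  # City (no French label in this sketch, so English is used)
```

Because every statement refers to Q515 rather than to “Stadt” or “City”, the same data can be displayed in any language for which a label exists.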

In order to be able to reproduce the complexity of the world, there are only very loose restrictions on which data can be linked and how. A person can be married to a building [4]. Even contradictory statements are shown, and a property can be assigned several values as well as alternative spellings of these values [16] [17], because Wikidata is not supposed to establish the truth or settle territorial conflicts, but to map data and their sources and put them into context.
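
A minimal Python sketch of this idea follows: a property holds a list of statements, each with its own source, and nothing is overwritten or ranked. The class names, the property name and the population figures are invented for illustration and do not come from Wikidata.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    value: str
    source: str  # where the claim comes from


class Claims:
    """Maps each property to all statements made about it, without picking a winner."""

    def __init__(self) -> None:
        self._by_property: dict[str, list[Statement]] = defaultdict(list)

    def add(self, prop: str, value: str, source: str) -> None:
        # Append instead of overwrite, so contradictory values coexist.
        self._by_property[prop].append(Statement(value, source))

    def values(self, prop: str) -> list[Statement]:
        return list(self._by_property.get(prop, []))


claims = Claims()
# Invented figures from invented sources.
claims.add("population", "1,200,000", "hypothetical census A")
claims.add("population", "1,350,000", "hypothetical register B")

for s in claims.values("population"):
    print(f"{s.value} (according to {s.source})")
```

Whoever reads the data can then decide which source to trust, which matches the stated goal of mapping data and sources rather than deciding what is true.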

Other projects around Wikidata are WikiCite and the linking of the Wikidata structure with Wikimedia Commons and Wiktionary.
Wikimedia Commons is the media archive that stores all images, videos and audio files used in the individual Wikipedias. The same problems arise with this data as with the Wikipedia articles: since the metadata is unstructured and available in many different languages, it is hardly machine-readable or searchable. With the structure of Wikidata, the language-independent unique identifiers from Wikidata can be assigned to the files. Then it becomes possible, for example, to search for all images with cats in the foreground and trees in the background. [18]
It is no different with Wiktionary, Wikimedia's dictionary project. Linked to Wiktionary, Wikidata could contribute to automatic translation, among other things. [19]
WikiCite aims to support references in Wikipedia and beyond by storing as much metadata as possible about bibliographic sources such as books or scientific publications in Wikidata, for example: who the authors are, which sources were cited, which institution the authors belong to, etc. [20]

An ecosystem of databases

Wikidata already plays a role outside of the Wikimedia universe.
Museums, archives, libraries and other database projects link their databases with Wikidata in order to complement their content and make it more usable [21]. Wikidata does not see itself as a central authority into which all databases should be transferred, but as a node in an ecosystem. So instead of completely importing a database that specializes in metadata about music, Wikidata prefers to link to this database. The databases can use the unique identifiers from Wikidata to mark their contents.

Many services use Wikidata. Some do so openly, others less officially, such as Apple's artificial intelligence Siri. On October 4th, Siri claimed that the national anthem of Bulgaria was Despacito [22], the song with what is currently the most successful YouTube video of all time. The national anthem of Bulgaria has often been the target of vandalism in Wikidata [23]. [24]

Among other things, Google uses Wikidata for the Google Knowledge Graph, Google's semantic search function. The result of this search is fed into infoboxes that refer to related Wikipedia articles and other resources [25].
Google used to use its own publicly accessible database, Freebase, for this purpose. Freebase was also based on data from Wikipedia and followed a similar approach, but it was discontinued in favor of Wikidata because Wikidata can serve this purpose better.
Google actively helped migrate the Freebase data to Wikidata. A tool developed for the migration is also used when integrating other data sets into Wikidata and is therefore of lasting use. [26]

So who will benefit from Wikidata?

As described, Wikidata's primary task is to support the various Wikipedias. By increasing the clarity, data quality and usability of Wikipedia and related projects such as Wikimedia Commons, it brings enormous added value to everyone who uses Wikipedia.
Time will tell whether the smaller Wikipedias can catch up and attract more volunteers who might otherwise migrate to the English-language Wikipedia.
Most people will probably not notice that Wikidata is behind these improvements.
Perhaps the concept of Wikidata is too abstract to be widely known. That could change, however, if, as planned, the data from Wikidata can be edited directly in Wikipedia articles.

Are companies exploiting the wealth of knowledge and the people who create it?

Google and other large companies make a profit with Wikidata and thus with the work of all its volunteers, while for Wikidata itself commercial exploitability is merely an unused by-product. However, Google in particular also links extensively to Wikipedia, not only in the search results but also in the Knowledge Graph. In this way, Google contributes significantly to the popularity of Wikipedia, which will hopefully result in more participation.
Changing the license would exclude the big companies, but it is very questionable whether this public data can be licensed in any other way. In addition, many other institutions would then probably also be more hesitant to cooperate. And at least some of these companies contribute to the popularity of Wikipedia and donate to the Wikimedia Foundation. [27]
Google gave up Freebase because it recognized that Wikimedia has greater know-how in the area of community projects and that Wikidata can better serve the actual purpose of Freebase [17]. Freebase was not Google's only database project. For Google, Apple, Amazon and others, it is interesting to be able to synchronize several databases in order to increase the quality of their services; however, they are in no way dependent on Wikidata.

Those who are dependent on open data and open structures in general are all those who cannot draw on a large closed database of their own and who have to trust that Google will answer their next search query helpfully, that Facebook or Twitter will not block them tomorrow, and that Siri will definitely have the right answer to every question.
Those who are critical of monopolies in particular should support Wikidata in order to create an alternative that is open to everyone.

A project like Wikidata cannot be developed in isolation from the rest of the Internet; it thrives on openness and the resulting symbioses. It is important that it remains independent and cannot be co-opted.
It is equally important that it remains as open as possible, because the project lives through its community, i.e. the people who use the data and build new projects with it.
It would only be fair, however, for large companies to give more back, not in the form of donations but rather in the form of reliable, unconditional contributions.


