Wikidata:WikiProject Limits of Wikidata

This WikiProject aims to catalogue the current limits of Wikidata and to extrapolate their development until about 2030.
The formula depicted here describes the resolution limit of the light microscope. After it had served science for about a century, it was set in stone for a monument. The image was taken yet more years later, but two months after that, the 2014 Nobel Prize in Chemistry was announced to be awarded for overcoming this limit using fluorescent molecules and lasers.
Which of the limits of Wikidata are set in stone, and which ones should we strive to overcome?

About

This WikiProject aims to bring together various strands of conversation that touch upon the limits of Wikidata, in both technical and social terms. The aim is not to duplicate existing documentation but to collect pointers to places where the limits relevant to a given section are described or discussed.

Timeframe

While fundamental limits exist in nature, the technical and social limits we are discussing here are likely to shift over time, so any discussion of such limits will have to come with some indication of an applicable timeframe. Since the Wikimedia community has used the year 2030 as a reference point for its Movement Strategy, we will use this here as a default as well for projections into the future and contrast this with current values (which may be available via Wikidata's Grafana dashboards). If other timeframes make more sense in specific contexts, please indicate that.

Design limits

"Design limits" are the limits which exist by intentional design of the infrastructure of our systems. As design choices, they have benefits and drawbacks. Such infrastructure limits are not necessarily problems to address and may instead be environmental conditions for using the Wikidata platform.

Software

Knowledge graphs in general

MediaWiki

maxlag

mw:Manual:Maxlag parameter, as explained here.
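
As a rough illustration of how a client might respect this limit: the MediaWiki API accepts a `maxlag` parameter (in seconds) and, when server lag exceeds it, answers with an error whose code is `maxlag` instead of carrying out the request. The helper below is a hypothetical sketch, not part of any library. `send` stands in for whatever function performs the HTTP call, and the retry count and fixed wait are illustrative assumptions.

```python
import time


def call_with_maxlag(send, params, maxlag=5, max_retries=3, wait=5.0, sleep=time.sleep):
    """Call a MediaWiki API action, backing off while the servers are lagged.

    `send` is a hypothetical callable that takes a dict of API parameters
    and returns the decoded JSON response as a dict.
    """
    request = dict(params, maxlag=maxlag, format="json")
    for attempt in range(max_retries + 1):
        response = send(request)
        error = response.get("error", {})
        if error.get("code") != "maxlag":
            # Success, or an unrelated error that the caller should handle.
            return response
        if attempt < max_retries:
            # Assumed fixed pause; a real client may prefer the Retry-After header.
            sleep(wait)
    raise RuntimeError("servers still lagged after %d retries" % max_retries)
```

Passing `send` and `sleep` in as arguments keeps the retry logic separate from the HTTP library, so it can be exercised without touching the live site.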

Page load performance
Reduced loading times cut

mw:Wikimedia Performance Team/Page load performance, as explained here.

wb_terms

Wikibase

Generic Wikibase Repository
  • By design, the repository stores statements that *could* be true. There is not yet a score that describes the validity or "common sense agreement" of a statement.
Data types
  • Item
  • Monolingual string
  • Single value store, but no time-series for KPIs
Data formats
  • JSON
  • RDF
  • etc.
Generic Wikibase Client
Wikidata's Wikibase Repository
Wikidata's Wikibase Client
Wikibase Repositories other than Wikidata
Wikibase Clients other than Wikidata
Wikidata bridge
Wikimedia wikis
Non-Wikimedia wikis

Wikidata Query Service

See also Future-proof WDQS.

Triple store
Blazegraph
Virtuoso
JANUS
Apache Rya
Oxigraph
Frontend
Timeout limit

Queries to the Wikidata Query Service time out after a certain time, which is a parameter that can be set.

There are multiple related timeouts, e.g. a queryTimeout behind Blazegraph's SPARQL LOAD command or a timeout parameter for the WDQS GUI build job.

JavaScript

The default UI is heavy on JavaScript, and so are many customizations. This creates problems with pages that have lots of statements in that they load more slowly or freeze the browser.

Python

SPARQL

Hardware

"Firstly we need a machine to hold the data and do the needed processing. This blog post will use a “n1-highmem-16” (16 vCPUs, 104 GB memory) virtual machine on the Google Cloud Platform with 3 local SSDs held together with RAID 0."
"This should provide us with enough fast storage to store the raw TTL data, munged TTL files (where extra triples are added) as well as the journal (JNL) file that the blazegraph query service uses to store its data."
"This entire guide will work on any instance size with more than ~4GB memory and adequate disk space of any speed."

Functional limits

A "functional limit" exists when the system design encourages an activity, but engaging in that activity at a large scale exceeds what the system can support. For example, by design Wikidata encourages users to share data and make queries, but it cannot accommodate users doing a mass import of huge amounts of data or billions of quick queries.

A March 2019 report considered the extent to which various functions on Wikidata can scale with increased use - wikitech:WMDE/Wikidata/Scaling.

Wikidata editing

Edits by nature of account

Edits by human users
Manual edits
  • ...
Tool-assisted edits
  • ...
Edits by bots
  • ...

Edits by nature of edit

Page creations
Page modifications
Page merges
Reverts
Page deletions

Edits by size

Edits by frequency

WDQS querying

A clear example of where we encounter problems is SPARQL queries against the WDQS that ask for things of some type (P31) and involve a large number of hits, for example a query for the titles of all scholarly articles. Queries that involve fewer items of a given type do not typically cause these issues.
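
The kind of query in question can be sketched as follows. The helper `type_query` is hypothetical and only builds query text. It assumes the usual WDQS prefixes (`wd:`, `wdt:`), Q13442814 for "scholarly article" and P1476 for "title". Whether the bounded variant finishes within the timeout depends on the service and its current load.

```python
def type_query(type_qid, limit=None):
    """Build a SPARQL query for the titles (P1476) of all instances (P31) of a type."""
    lines = [
        "SELECT ?item ?title WHERE {",
        "  ?item wdt:P31 wd:%s ;" % type_qid,
        "        wdt:P1476 ?title .",
        "}",
    ]
    if limit is not None:
        # Capping the number of results lets the service stop early.
        lines.append("LIMIT %d" % int(limit))
    return "\n".join(lines)


# Unbounded: asks for every scholarly article, a very large number of hits.
unbounded = type_query("Q13442814")
# Bounded: the same pattern, but only the first 100 results are requested.
bounded = type_query("Q13442814", limit=100)
```

A `LIMIT` does not make the full result set obtainable; it only illustrates why queries touching fewer items of a type tend to succeed.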

Query timeout

This is a design limit discussed under #Timeout limit above. It manifests itself as an error when the query takes more time to run than the timeout limit allows for.

Queries by usage

One-off or rarely used queries
Showcase queries
Maintenance queries

Queries by user type

Manually run queries
Queries run through tools
Queries run by bots

Queries by visualization

  • Table
  • Map
  • Bubble chart
  • Graph
  • etc.

Multiple simultaneous queries

Wikidata dumps

Creating dumps

Using dumps

Ingesting dumps
Ingesting dumps into a Wikibase instance
Ingesting dumps into the Wikidata Toolkit

Updating Triple Store Content

Creating large numbers of new items itself does not seem to cause problems (except the aforementioned WDQS querying issue). However, there frequently is a lag between updating the wiki pages of Wikidata and the updates being propagated to the Wikidata Query Service servers.

Large item edits

One bottleneck is the editing of existing Wikidata items with a lot of properties. The underlying issue here is that, for each edit, RDF for the full item is created and that the WDQS needs to update that full RDF. Therefore, independent of the size of the edit, edits on large items stress the system more than edits on small items. There is a Phabricator ticket to change how the WDQS triple store is updated.
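
A toy model makes the scaling argument concrete. The triple counts below are invented for illustration and are not measurements. The only assumption taken from the text is that every edit causes the full RDF of the item to be regenerated and updated.

```python
def triples_rewritten(item_triples, edits):
    """Triples the query service reprocesses when each edit resends the full item."""
    return item_triples * edits


# Three one-statement edits on a small item versus on a large item
# (hypothetical sizes of 50 and 20,000 triples).
small_item = triples_rewritten(item_triples=50, edits=3)
large_item = triples_rewritten(item_triples=20_000, edits=3)

# The same three changes saved as a single edit on the large item.
merged = triples_rewritten(item_triples=20_000, edits=1)

print(small_item, large_item, merged)  # 150 60000 20000
```

In this model the cost depends on the size of the item and the number of edits, not on how much each edit changes.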

Merged QuickStatement edits

The current QuickStatements website is not always efficient in making edits: adding a statement with references can result in multiple edits. Because each of those edits triggers a full update of the item, this behaviour makes the issue described under Large item edits very visible.

Human engagement limits

"Human engagement limits" include everything to do with human ability and attention to engage in Wikidata. In general Wikidata is more successful when humans of diverse talent and ability enjoy putting more attention and time into their engagement with Wikidata.

Limits in this space include the number of contributors Wikidata has, how much time each one gives, and the capacity of Wikidata to invite more human participants to spend more time in the platform.

Wikidata users

Human users

Human Wikidata readers
Human Wikidata contributors
  • Format is machine friendly but not human-friendly - hard for new editors to understand. Necessary to ensure that Wikidata brings in data that may not be already on the internet.
  • Difficult for college classes/instructors to know how to organize mass contributions from their students, such as Wikidata_talk:WikiProject_Chemistry#Edits_from_University_of_Cambridge.
  • Effective description of each type of entity requires guidance for the users who are entering a new item: What properties need to be described for each instance of tropical cyclone (Q8092)? How do we inform each user entering a new book item that they ought to create both a version, edition, or translation (Q3331189) and a written work (Q47461344) entity for that book (per Wikidata:WikiProject_Books)? In other words, how do we make the interface self-documenting for unfamiliar users? And where we have failed to do so, how do we clean up well-intentioned but non-standard edits by hundreds or thousands of editors operating without a common framework?
Human curation
  • Human curation of massive automated inputs of data - tool needed to ensure that data taken from large databases are reliable? Can we harness the power of human curators, who may identify different errors than machine-based checks?

Tools

Tools for reading Wikidata
Tools for contributing to Wikidata
Tools for curating Wikidata

Bots

Bots that read Wikidata
Bots that contribute to Wikidata

Users of Wikidata client wikis

Users of Wikidata data dumps

Users of dynamic data from Wikidata

API

SPARQL

Linked Data Fragments

Other

Users of Wikibase repositories other than Wikidata

Content limits

"Content limits" describe how much data Wikidata can meaningfully hold. Of special concern are limits on growth: Wikidata hosts a certain amount of content now, but limits on adding further content impede the future development of the project.

A March 2019 report considered the rate of growth for Wikidata's content — wikitech:WMDE/Wikidata/Growth.

Generic

How many triples can we manage?

How many languages should be supported?

How to link to individual statements?

Items

Timeline of Wikidata item creation

How many items should there be?

The Gaia project has so far released data on over 1.6 billion stars in our galaxy. It would be nice if Wikidata could handle that. OpenStreetMap has about 540 million "ways". The number of scientific papers and their authors is on the order of 100-200 million. The total number of books ever published is probably over 130 million. OpenCorporates lists over 170 million companies. CAS Registry Numbers (see en:CAS Registry Number) have been assigned to over 200 million substances or sequences. There are over 100 large art museums in the world, each with hundreds of thousands of items in its collection, so there are likely at least tens of millions of art works or other artifacts that could be listed. According to en:Global biodiversity there may be as few as a few million or as many as a trillion species on Earth; on the low end we are already close, but if the real number is on the high end, could Wikidata handle it? Genealogical databases provide information on billions of deceased persons who have left some record of themselves; could we allow them all here?

From all these different sources, it seems likely there would be a demand for at least 1 billion items within the next decade or so; perhaps many times more than that.
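
The estimate can be checked with a quick tally of the lower-bound figures quoted above. Species and genealogical records are left out because their ranges are too wide to pin to a single number, and "tens of millions" of art works is taken as 10 million, which is an assumption made only for this sum.

```python
# Lower-bound estimates quoted above, in millions of potential items.
potential_items_millions = {
    "Gaia stars": 1600,
    "OpenStreetMap ways": 540,
    "scientific papers and authors": 100,  # low end of 100-200 million
    "books": 130,
    "companies (OpenCorporates)": 170,
    "CAS Registry Numbers": 200,
    "art works and artifacts": 10,  # "tens of millions", taken as 10
}

total = sum(potential_items_millions.values())
without_gaia = total - potential_items_millions["Gaia stars"]

print(total, without_gaia)  # 2750 1150
```

Even without the Gaia stars, these sources alone add up to more than 1 billion potential items.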

How many statements should an item have?

  • The top-listed items on Special:LongPages have over 5000 statements. This slows down operations like editing and display.

Properties

How many properties should there be?

How many statements should a property have?

Lexemes

Overview of lexicographical data as of May 2019. Does not discuss limits other than those of QuickStatements.

How many lexemes should there be?

English Wiktionary has about 6 million entries (see wikt:Wiktionary:Statistics); according to en:Wiktionary there are about 26 million entries across all the language editions. These numbers give a rough idea of potential scale; however, they cannot be translated directly into expected lexeme counts because of the structural differences between Wikidata lexemes and Wiktionary entries. Lexemes have a single language, lexical category and (general) etymology, while Wiktionary entries depend only on spelling and include all languages, lexical categories and etymologies in a single page. On the other hand, each lexeme includes a variety of spellings, owing to the various forms associated with a single lexeme and to spelling variations between language varieties. Very roughly, then, one might expect the eventual number of lexemes in Wikidata to be on the order of 10 million, while the number of forms might be 10 times as large. The vast majority of lexemes will likely have only one sense, though common lexemes may have 10 or more senses, so the expected number of senses would be somewhere between the number of lexemes and the number of forms, probably closer to the number of lexemes.

How many statements should a lexeme have?

So far there are only a handful of properties relevant for lexemes, in each case likely to have only one or a very small number of values for a given lexeme. So on the order of 1 to 10 statements per lexeme/form/sense seems to be expected. However, if we add more identifiers for dictionaries and link them, there's a possibility we may have a much larger number of external id links per lexeme in the long run - perhaps on the order of the number of dictionaries that have been published in each language?

References

How many references should there be?

How many references should a statement have?

Where should references be stored?

Subpages

Participants

The participants listed below can be notified using the following template in discussions:

{{Ping project|Limits of Wikidata}}