Wikidata:SPARQL query service/WDQS graph split

The Search team is currently running an experiment to split the Wikidata Query Service graph and use federation for the queries that need access to all subgraphs. This is a breaking change, which will require a number of queries to be rewritten, either to access a new SPARQL endpoint, or to use federation. We want to have a good understanding of the trade-offs before we commit to any long-term solution.
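To give a rough idea of what "using federation" could look like for a query author, here is a minimal sketch in Python that builds a SPARQL 1.1 query with a SERVICE clause. The endpoint URL and the helper name are hypothetical placeholders (this page does not name the actual endpoints), and the sketch assumes that statements about scholarly articles live in the separate subgraph while the author item stays in the main graph.

```python
# Hypothetical endpoint for the scholarly-articles subgraph.
# This is a placeholder, not a real WDQS URL.
SCHOLARLY_ENDPOINT = "https://scholarly.example.org/sparql"


def build_federated_query(author_qid: str,
                          scholarly_endpoint: str = SCHOLARLY_ENDPOINT) -> str:
    """Build a query intended to run against the main graph.

    The author's English label is read locally, while the list of
    scholarly articles is fetched from the other subgraph through
    a SPARQL 1.1 SERVICE clause (federation).
    """
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?article ?authorLabel WHERE {{
  # Resolved in the graph that receives the query (assumed: main graph)
  wd:{author_qid} rdfs:label ?authorLabel .
  FILTER(LANG(?authorLabel) = "en")

  # Resolved remotely in the scholarly subgraph (assumed location)
  SERVICE <{scholarly_endpoint}> {{
    ?article wdt:P50 wd:{author_qid} ;      # P50 = author
             wdt:P31 wd:Q13442814 .         # Q13442814 = scholarly article
  }}
}}
LIMIT 10
""".strip()


if __name__ == "__main__":
    print(build_federated_query("Q42"))
```

A query that only touches one subgraph would not need the SERVICE block at all; it would simply be sent to the endpoint that holds that subgraph.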

The reasons behind the plan

The underlying issue we are trying to address is the medium-term scalability and stability of the Wikidata Query Service (WDQS); problems here can hinder access to, and the ability to query, the data in Wikidata. WDQS runs on top of Blazegraph, and its graph comprises 15 billion triples. The graph is currently growing at a rate of 1 billion triples per year.
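As a back-of-the-envelope illustration of these figures, the following Python sketch projects the size of the unsplit graph, assuming growth stays linear at the stated rate. The function name is hypothetical, and the projection is only arithmetic on the two numbers above, not a forecast made by the team.

```python
# Figures taken from the paragraph above, in billions of triples.
CURRENT_TRIPLES_BILLION = 15
GROWTH_PER_YEAR_BILLION = 1


def projected_triples_billion(years: int) -> int:
    """Linear projection: current size plus a constant yearly growth.

    Assumes the growth rate does not change, which is a simplification.
    """
    return CURRENT_TRIPLES_BILLION + GROWTH_PER_YEAR_BILLION * years


if __name__ == "__main__":
    for years in (1, 3, 5):
        size = projected_triples_billion(years)
        print(f"After {years} year(s): {size} billion triples")
```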

With the current size and growth of the graph, we are experiencing a number of scalability issues:

  • Reloading (rebuilding) the graph from the Wikidata dumps takes between 1 and 2 months, sometimes more. Part of this is the inherent length of the operation; part is that the reload can crash unpredictably once the graph reaches a certain size, requiring the process to be restarted
  • More frequent stability issues with WDQS
  • Queries are taking longer to run, with more frequent timeouts

The ability to reload the graph is a critical function: it ensures data consistency and allows us to recover from potential critical data issues. It is also an indication of the stability and scalability of the system. Furthermore, the instability of the data reload process is directly linked to the size of the graph, in much the same way as the runtime stability of WDQS is.

The experiment: splitting the scholarly articles graph

For the reasons outlined above, we will run an experiment: we will split the graph, moving the scholarly articles to a separate graph. This will reduce the size of the constituent graphs, which should increase the stability of WDQS by reducing both the amount of data queried in each graph and the amount of data involved in a full graph reload. It will support continued linear growth of the graph, providing time to assess other solutions to the graph scalability issues.

It is important to remember that this is just an experiment, and that it will take time to implement. No decisions will be applied immediately, and all relevant decisions will be taken considering the input of the larger Wikidata community and the Scholia community, with the latter being the most impacted by this experiment. Our aim is to limit, as much as possible, the impact of such a decision on Wikidata’s tools and community actions.

The current proposal for the split is available at Wikidata:SPARQL query service/WDQS graph split/WDQS Split Refinement.

FAQs

What is the timeline of this project?

As of February 2024, we are setting up the test servers and making them available to the public for testing and providing feedback.

In April 2024, we will conduct internal tests based on query logs, and we expect the refinement of the graph split to be completed by then. At that point, we should have a confident understanding of the impact of the graph split.

By June or July 2024, we expect the update process to have been modified to support a split graph, and production servers to be available with a split graph updated in real time. Once the service is in production, we plan a 6-month transition period before shutting down the full graph.

Are scholarly articles going to be removed from Wikidata?

No. The experiment will only split the current query graph, moving the scholarly articles to a separate graph in Blazegraph. This will not impact the management of current and future content via the Wikidata website and other APIs.

Is WMF asking to remove scholarly articles and/or other kinds of data from Wikidata?

No. The Wikimedia Foundation does not want to and will never take editorial decisions on behalf of the community. The Wikimedia Foundation also believes the Wikidata community should be the only actor in charge of deciding what is notable and what is not, through its established community processes.

Are you planning to test other splits after scholarly articles?

No. The experiment on scholarly articles is, for the time being, the only experiment that we are planning to do. At the moment, there is no need to plan experiments on other kinds of data in Wikidata.

What are you doing to limit the impact of this experiment on Wikidata tools?

We are in contact with the Scholia community, which will likely be the most impacted, to understand the extent of the impact and provide the necessary support to mitigate any negative effects. We are also investigating whether other tools will be impacted, and we plan to contact those tools’ maintainers to provide the necessary support. If you anticipate this will have a significant effect on your tools and workflows, we would appreciate it if you could reach out to us on the talk page, in the Phabricator ticket for community feedback, or by pinging Sannita (WMF) directly.

What are the plans for replacing Blazegraph right now?

For now, we are focusing our work on this split, as we anticipate that it will buy us time to make a better decision about which alternative to move to. Reducing the size of the graph will also make a future replacement easier.

Where should feedback be sent?

You can reach out to us on the talk page, in the Phabricator ticket for community feedback, or by pinging Sannita (WMF) directly.