
Wikidata Query Service scaling update, March 2022

The Search team's paper on WDQS backend alternatives is now available. Many thanks to Andrea Westerinen and the rest of the team for all of the hard work in putting this together. As a high-level reminder, Blazegraph, which WDQS currently uses as a graph backend, is end-of-life software and is reaching, or has already reached, its limits in supporting WDQS. The highest priority for scaling WDQS is finding a new graph backend. In this paper, we narrow our options down to a shortlist of four graph backends, listed below in alphabetical order. All options come with advantages and disadvantages, and further testing will be required before settling on a final choice. Please read the paper for more information on how this shortlist was created!
- Apache Jena with the Fuseki SPARQL Server component
- QLever (some aspects, such as update support, still in development)
- RDF4J V4 (still in development)
- Virtuoso Open Source

We are continuing to productionize the WDQS analysis code that our analyst Aisha Khatun has been working on over the past several months. These analyses have been enormously useful for better understanding the structure of the Wikidata graph, its subgraphs, and the query traffic on those subgraphs. Because it will be important to replicate these analyses easily in the future, rather than depending on a single report based on a single snapshot in time, we are working to ensure that this code is easy to re-run and that the analyses can be reproduced.

WDQS experienced some outages due to memory issues with the Java Virtual Machine. We are deploying jvmquake in order to address these memory issues and mitigate further memory-related outages.

WDQS also experienced more outages due to Blazegraph deadlocking (a failure mode independent of the memory issues above). We reopened a ticket that we used to track previous deadlock issues, and will further investigate how to mitigate future outages. However, because the problem fundamentally originates in Blazegraph itself, which we will not be fixing, any solution will be imperfect and will cost WDQS users some degree of performance or functionality.

Due to recurring outages, the time it takes to attend to these issues, and the consequent inability to meet our WDQS Service Level Objective (SLO) of 99% update lag <10 minutes, we are having preliminary discussions about potentially lowering this SLO to a level that is more sustainable for the team. This would relieve some of the demand on our overburdened team and allow us to better balance firefighting work against longer-term scaling work. If we do lower the SLO, one result could be more WDQS downtime.
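
As a rough illustration of what an SLO of this shape measures, the sketch below checks a set of update-lag samples against a target. It assumes the SLO is read as "at least 99% of lag samples are under 10 minutes"; that reading, the function names, and the sample data are all hypothetical and are not taken from the team's actual monitoring.

```python
from typing import Sequence

# Assumed reading of the SLO: 99% of lag samples under 10 minutes.
SLO_TARGET = 0.99
LAG_THRESHOLD_S = 10 * 60  # 10 minutes, in seconds


def slo_compliance(lag_samples_s: Sequence[float],
                   threshold_s: float = LAG_THRESHOLD_S) -> float:
    """Return the fraction of lag samples strictly below the threshold."""
    if not lag_samples_s:
        raise ValueError("need at least one lag sample")
    within = sum(1 for lag in lag_samples_s if lag < threshold_s)
    return within / len(lag_samples_s)


def meets_slo(lag_samples_s: Sequence[float],
              target: float = SLO_TARGET,
              threshold_s: float = LAG_THRESHOLD_S) -> bool:
    """Return True if the compliant fraction reaches the target."""
    return slo_compliance(lag_samples_s, threshold_s) >= target


# Hypothetical data: 97 samples at 30 s of lag, 3 samples at 15 minutes.
samples = [30.0] * 97 + [900.0] * 3
print(slo_compliance(samples))  # 0.97
print(meets_slo(samples))       # False
```

Under this reading, lowering the SLO would correspond to reducing `SLO_TARGET` or raising `LAG_THRESHOLD_S`; which of the two is under discussion is not stated here.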

The WDQS reconciliation process is now complete. Many thanks go to David for this hard work!