Wikidata:SPARQL query service/WDQS backend update/WDQS backend alternatives

As you know, making Wikidata Query Service (WDQS) scalable is a top priority for the Search Team. Moving off the Blazegraph backend was deemed a critical step toward this goal. For this reason, over the last few months we evaluated several solutions that could help us achieve it.

Finding the alternatives

Finding an alternative backend was not hard; in fact, there are many options. The hardest part was narrowing down the possibilities. To do so, we defined specific criteria the backend must meet, drawing also on feedback from our February 2022 scaling community meetings, and evaluated the candidates against our needs.

Four potential candidates are currently short-listed to replace Blazegraph (in alphabetical order):

  1. Apache Jena with the Fuseki SPARQL Server component;
  2. QLever (some aspects, such as update support, still in development);
  3. RDF4J V4 (still in development);
  4. Virtuoso Open-Source.

You can find the full evaluation process and results in our paper, “WDQS Backend Alternatives”. The paper addresses the technical and user requirements for the WDQS backend, gathered over the last seven years of operation, the implications for the system architecture, the evaluation process, and the resulting detailed assessments of the possible alternatives.

Technical and community criteria assessments

The following tables present the results of the assessments of the candidate alternatives. The first table holds the overall technical assessments, scored from 0 (no support) to 5 (exceptional support). Note that it also includes a column evaluating the current Blazegraph solution.

All of the following criteria are discussed and explained in full in the document.

Overall technical assessments

Criteria | Blazegraph | Jena | QLever | RDF4J | Virtuoso
Scalability to 10B+ triples | 5 | 5 | 5 | 3* | 5
Scalability to 25B+ triples | 0 | 0 | 5 | 1 | 5
Full SPARQL 1.1 capabilities | 5 | 5 | 3* | 5 | 3
Federated query | 5 | 5 | 0* | 5 | 5
Ability to define custom SPARQL functions | 3 | 5 | 2 | 5 | 4
Ability to tune/define indexes and perform range lookups | 2 | 5 | 5 | 5 | 4
Support for read and write at high frequency | 5 | 5 | 0* | 3* | 3
Active open-source community | 0 | 5 | 4 | 5 | 3
Well-designed and documented code base | 5 | 5 | 5 | 5 | 2
Instrumentation for data store and query management | 2 | 5 | 2 | 4 | 4
Query plan explanation | 5 | 3 | 5 | 3 | 2
Query plan tuning/hints within SPARQL statement | 5 | 4 | 2 | 3 | 3
Query without authentication | 5 | 5 | 5 | 5 | 5
Ability to prevent write access | 0 | 5 | 0 | 5 | 5
Data store reload in 2-3 days (worst case) | 0 | 2 | 5 | 3 | 1
Query timeout and resource recovery | 2 | 5 | 4 | 4 | 3
Support for geospatial data (e.g., POINT) | 5 | 5 | 5 | 5 | 5
Support for GeoSPARQL | 2 | 5 | 2* | 4 | 3
Support for named graphs (quads) | 5 | 5 | 0 | 5 | 5
Query builder interface (ease of use) | 3 | 4 | 5 | 3 | 5
Dataset evaluation (SHACL, ShEx) | 0 | 5 | 0 | 5 | 0

NB: a * indicates a score that could improve after testing/evaluation of work in progress.
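
Several of the criteria above, in particular federated query support, shape what user queries would look like if Wikidata were split across multiple endpoints. As an illustrative sketch only (the remote endpoint URL and the graph split are hypothetical, not actual services; the wd:/wdt: prefixes are assumed to be predefined as on today's WDQS), a SPARQL 1.1 federated query joins local triples with a remote sub-graph via the SERVICE keyword:

```sparql
# Hypothetical sketch: join the local graph with a remote sub-graph.
# The endpoint URL below is illustrative only.
SELECT ?human ?article WHERE {
  ?human wdt:P31 wd:Q5 .                      # instance of: human (local graph)
  SERVICE <https://scholarly.example.org/sparql> {
    ?article wdt:P50 ?human .                 # author (remote sub-graph)
  }
}
LIMIT 10
```

Each SERVICE block is evaluated against the remote endpoint, which is why federation can add latency and may call for the longer timeouts discussed below.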

The second table summarizes, for each alternative, the implications for users in terms of query times, query complexity, and data freshness.

Server assessment by user criteria

Permit long(er) and configurable query timeouts (which translates to additional query load)
  • Jena: longer timeouts will likely be required due to federation; timeouts configurable at global and query levels.
  • QLever: timeouts configurable at query level.
  • RDF4J: longer timeouts will likely be required due to federation; timeouts configurable at query level.
  • Virtuoso: timeout implications need investigation/evaluation based on query load; an "anytime" query is not a possibility (since it is not deterministic); timeouts are global and not configurable at query level.

Query full set of triples
  • Jena: slower performance on some queries due to need for federation.
  • QLever: full capability.
  • RDF4J: slower performance on some queries due to need for federation.
  • Virtuoso: full capability.

Reflect most current data (requires ability to handle frequent writes)
  • Jena: needs investigation/evaluation; there are capabilities for streamed update.
  • QLever: proposed solution for update needs investigation/evaluation; in theory, supports real-time updates.
  • RDF4J: needs investigation/evaluation; the LMDB store should be performant.
  • Virtuoso: updates may necessitate index rebuilds and affect performance and correctness; needs investigation/evaluation.

Good query response time (requires performant indexing and join operations)
  • Jena: some slower performance due to federation; possible to tune indexes and configurations.
  • QLever: performant querying demonstrated on sample endpoint; queries that time out on Blazegraph are likely to succeed; all index permutations are supported.
  • RDF4J: some slower performance due to federation; possible to tune indexes and configurations.
  • Virtuoso: needs investigation/evaluation, since the column-wise data store may not be compatible with frequent writes; complex queries may time out or take a long time to complete.

Ease of use (all solutions support SPARQL 1.1; federation introduces additional complexity)
  • Jena: queries will be more complex since they will reference different endpoints due to federation; can evaluate HyperGraphQL for simple queries.
  • QLever: excellent UI (as demonstrated on sample endpoint) with autocomplete and graphical display of query plans; need to test full SPARQL 1.1 compliance.
  • RDF4J: with FedX support, queries should not have to change (changes would be due to splitting Wikidata into sub-graphs to reduce database size).
  • Virtuoso: queries will reference new prefixes (bif: and sql:) and use non-standard terminology; query plans are explained in SQL, which could be confusing; need to test full SPARQL 1.1 compliance.
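
The note about Virtuoso's non-standard prefixes can be made concrete with a sketch. The snippets below are illustrative only: the data is hypothetical, and the Virtuoso variant assumes a free-text index on literals has been configured on the server. The same text search looks quite different in portable SPARQL 1.1 and with Virtuoso's bif: extension:

```sparql
# Portable SPARQL 1.1: works on any compliant engine,
# but may be slow without a text index.
SELECT ?item ?label WHERE {
  ?item rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), "query service"))
}
LIMIT 10

# Virtuoso-specific variant using the bif: free-text extension
# (assumes a Virtuoso full-text index on literals is configured).
SELECT ?item ?label WHERE {
  ?item rdfs:label ?label .
  ?label bif:contains "'query service'" .
}
LIMIT 10
```

Queries written against the extension would need rewriting if the backend changed again, which is part of the migration cost weighed in the assessment.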

Our next steps

First of all, we encourage all community members to review and comment on the paper.

Clearly, the work to replace Blazegraph with a suitable alternative backend is just beginning. We have already defined a number of steps we need to take, such as:

  • determine how to support the current local SERVICE functions (labelling, geospatial calculations, etc.) in a more SPARQL-compliant manner;
  • define a set of updates and query tests and workloads that exercise the engines and SPARQL endpoints;
  • define and test different algorithms for splitting the Wikidata graph, understand how the update and query workloads would change, and the implications for the RDF stream updater;
  • begin testing and tuning the selected alternative offerings using the specific SPARQL tests and workloads defined above;
  • investigate creation of a middleware layer (between the RDF store/SPARQL endpoint and users/applications) to remove dependencies on a specific implementation and reduce churn in potential, future migrations.
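
To illustrate the first step above, consider labelling: today's WDQS label service is a Blazegraph-specific extension, while a more SPARQL-compliant query would use plain rdfs:label triples. The sketch below uses the wikibase:/bd: prefixes predefined on the current WDQS; the actual replacement mechanism is still to be determined:

```sparql
# Today on WDQS (Blazegraph-specific label service):
SELECT ?cat ?catLabel WHERE {
  ?cat wdt:P31 wd:Q146 .                       # instance of: house cat
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

# A more SPARQL-compliant equivalent using standard rdfs:label:
SELECT ?cat ?catLabel WHERE {
  ?cat wdt:P31 wd:Q146 .
  ?cat rdfs:label ?catLabel .
  FILTER(LANG(?catLabel) = "en")
}
```

The standard form runs on any SPARQL 1.1 engine, at the cost of slightly more verbose queries and no automatic language fallback.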

As we progress on these tasks, we remain committed to publishing our work and keeping the community updated. -- Sannita (WMF) (talk) (on behalf of the Search Team) 08:20, 29 March 2022 (UTC)