Wikidata:SPARQL query service/WDQS backend update/December 2021 scaling update

Wikidata Query Service State of the Union, Dec 2021

Executive Summary

  • Wikidata Query Service (WDQS) is still at risk of catastrophic failure due to Blazegraph backend limitations
  • Progress has been made on:
    • Improving update lag via the Streaming Updater
    • Community transparency and communication about the current state
    • Better understanding of user priorities, the structure of the knowledge graph, and query patterns on it
    • Publishing a plan for data deletion in the case of catastrophic failure
    • Hiring a graph consultant (start date in Jan 2022) to help identify a Blazegraph alternative
  • Ongoing and future work:
    • Work with analysts to continue deepening our understanding of how to optimize WDQS
    • Work with the graph consultant to help identify a Blazegraph alternative
    • Ensure that there are sufficient resources for continued WDQS development

Overview

In the previous Wikidata Query Service (WDQS) August scaling update, I gave an overview of the severity and risk of the current scaling challenges for WDQS. As the means of building, editing and accessing the knowledge graph built from Wikidata triples, WDQS is critical: a failure of WDQS is effectively a failure of Wikidata. This risk remains high.

As a recap, the highest risk of catastrophic failure for WDQS is that Blazegraph, the graph database backend, reaches a capacity limit on how much data it can hold. If this happens, new and updated data in Wikidata will no longer be reflected in the WDQS knowledge graph, functionally freezing the usability of Wikidata at that point in time. Other major risks are the inability to keep up with query load, and the fact that Blazegraph is end-of-life software.

The Search team’s strategy thus far has been primarily two-fold: (i) create a disaster mitigation plan in the case of catastrophic failure, and (ii) create a strategy for migrating the WDQS graph backend away from end-of-life Blazegraph, whose limitations are the source of many of our issues. To date, the disaster plan playbook has been published. While most engineering work on WDQS in the first half of 2022 will be restricted to (even more) bare-bones maintenance, there will be continued planning work around identifying a Blazegraph alternative, with a stretch goal of formulating a high-level migration plan for the engineering work.

Wikidata and WDQS are complex systems; any mistakes below are my own.

Progress Made

In the last half of 2021 or so, we have had good successes related to scaling WDQS, ranging from shipped code to community relations to a better understanding of WDQS usage.

  • Shipped the Streaming Updater to alleviate update lag, which also affects query load
    • Established a 99% SLO for WDQS update lag time (not yet met)
  • Informed the Wikidata community of the risk of catastrophic failure, and kept them informed on how we are working to scale WDQS
  • Ran a WDQS user survey and gained insights into user priorities
  • WDQS analysts have provided deeper insights into the structure of Wikidata and the queries on it
  • Published a plan for data deletion in the case of catastrophic failure
  • Hired a graph consultant (start date in Jan 2022) to help identify a Blazegraph alternative

Streaming Updater shipped

Search released the new Streaming Updater on 18 Oct 2021. This was an important backend feature that improves WDQS update lag, for which our SLO (Service Level Objective) is 99%: specifically, this (opaquely named) metric refers to the percentage of time during which every WDQS server has an update lag under 10 minutes. For the end user, this helps ensure that the knowledge graph WDQS queries is up to date, and that service is more consistent because the volume of graph updates is more evenly distributed.
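
To make the lag metric concrete, here is a minimal sketch (in Python) that approximates the update lag of whichever WDQS server answers, by comparing the schema:dateModified timestamp the service reports for Wikidata against the current time. The endpoint and triple are the real ones WDQS serves; the script name and User-Agent string are just examples.

  from datetime import datetime, timezone

  import requests

  # Ask WDQS when it last saw a Wikidata update, then compare against the
  # current time to approximate this server's update lag.
  QUERY = "SELECT ?date WHERE { <http://www.wikidata.org> schema:dateModified ?date . }"

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "wdqs-lag-check/0.1 (example)"},
      timeout=30,
  )
  resp.raise_for_status()
  stamp = resp.json()["results"]["bindings"][0]["date"]["value"]
  lag = datetime.now(timezone.utc) - datetime.fromisoformat(stamp.replace("Z", "+00:00"))
  print(f"last update seen: {stamp}; approximate lag: {lag.total_seconds():.0f}s (SLO threshold: 600s)")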

While the Streaming Updater helps mitigate the risk of being unable to handle high update/query throughput, this capability also potentially increases the risk of maxing out Blazegraph’s capacity sooner (because a greater volume of updates can be processed).

The Search team also released a publicly available three-part blog series on the process of deploying the Streaming Updater: part 1, part 2, part 3.

Community transparency

In March 2021, the Wikidata community had varying levels of insight into the state of WDQS, especially with regard to the high risk of catastrophic failure. As it is highly probable that future scaling work will require making compromises and optimizing for a narrower set of users and/or use cases, it has been essential to ensure that the Wikidata community is well informed on the state of WDQS and can participate in the scaling process.

A WDQS scaling update was publicly released to the community in Aug 2021. In addition, in October 2021, the Search team and Wikimedia Deutschland (WMDE) worked together at WikidataCon 2021 to give a talk on WDQS scaling challenges, followed by a panel to take comments, questions and feedback from community members. The Blazegraph catastrophic disaster playbook was published on 10 Dec 2021.

User survey

In Aug 2021, a WDQS user survey was launched to better understand who our users are and what their priorities are for WDQS usage. There have been ~230 responses to date. While these are likely skewed towards power users who understand SPARQL well enough to use WDQS regularly, they still provide good insight into who WDQS users are (largely people looking to improve Wiki projects) and what their priorities are (predominantly reducing timeouts, based on eyeballing the responses).

Based on the responses, the overall ranking of priorities is as follows; this was determined by looking at the combined 1st and 2nd highest priority votes for each dimension (see Chart 1):

  1. Timeouts
  2. Graph completeness
  3. Data freshness
  4. Response latency
  5. Ease of use
Chart 1: WDQS user priorities

We are still working with the Global Data & Insights team to further refine our understanding of demographic-specific priorities (based on self-reported intended use). Additionally, Design Research has helped parse and organize the free-text responses to the survey. These results and analyses still need to be aggregated to paint a fuller picture of who WDQS users are based on self-reporting, which can help inform how we move forward with scaling WDQS and prioritizing future features to take user needs into account.

Analyst reports

In the past 6+ months, we have been working with WDQS analysts to better understand the structure of the Wikidata graph, as well as the query loads on that graph. These analyses help us understand the current and historical actual usage of WDQS, where the stress points are, and highlight potential paths towards future scaling, to be taken into consideration alongside user feedback. Aisha Khatun in particular has provided a lot of helpful insights, and her analyses are available on the subpages of her personal page.

We now have better insight into the various subgraphs of Wikidata and how large they are in terms of number of items and number of triples: notably, we have confirmed that scholarly articles account for ~50% of Wikidata by either count. The sizes of the largest subgraphs appear to follow a roughly logarithmic trend, with a long tail of relatively smaller subgraphs. The same analysis shows that the subgraphs are heavily interconnected, suggesting that in the worst case scenario, splitting the graph into federated subgraphs may not confer any benefit to query computing.
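
As a concrete illustration of sizing one subgraph, the sketch below counts items that are instances of scholarly article (Q13442814) against the public endpoint. This item count is only one of the two measures mentioned above (a per-subgraph triple count is a more involved analysis), and a query like this may time out under load; the User-Agent is illustrative.

  import requests

  # Count items that are instances of "scholarly article" (Q13442814),
  # one rough measure of that subgraph's size.
  QUERY = """
  SELECT (COUNT(?item) AS ?n) WHERE {
    ?item wdt:P31 wd:Q13442814 .
  }
  """

  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "subgraph-size-example/0.1"},
      timeout=60,
  )
  resp.raise_for_status()
  print("scholarly articles:", resp.json()["results"]["bindings"][0]["n"]["value"])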

While these analyses illustrate the structure of the Wikidata graph, they need to be coupled with an analysis of the queries run against the graph -- this work is currently in progress, and is related to providing a publishable playbook for data deletion in the case of catastrophic failure. If the subgraph analysis is a map showing that houses generally have roads to many other houses, a query analysis will provide information about how much traffic is on each of those roads, and thus which roads/connections can be removed with lower impact on actual querying activity on WDQS.

Moving forward

There is still a lot of work to do to scale WDQS infrastructure so that it is reliable and futureproof, which includes technical work such as migrating off of Blazegraph to a new graph database backend. There is also a substantial amount of what I’ll refer to here as feature work, loosely defined as WDQS development that directly impacts how end users interact with the product, such as implementing a user management system via authentication. These two directions of work are clearly interconnected, as many of the immediate feature considerations serve the primary end of making WDQS stable and scalable.

Blazegraph alternative

The most important priority for WDQS remains identifying a suitable graph database replacement for Blazegraph, as well as putting together a plan for the engineering work required for this migration. While a meaningful estimate is not possible without knowing what the new backend is, the expected order of magnitude for the engineering work required is 2-3 years. It is likely that other mitigations and maintenance work will be required to minimize the risk of catastrophic failure during this process, which has the potential to prolong the backend migration depending on how the scaling vs maintenance work is managed.

The Search team has contracted a graph consultant to assist in identifying a new graph backend and in sketching a high-level migration roadmap. This contract is planned for ~5 months starting in Jan 2022. With the Search team shifting back to Search & Relevance priorities in 2022, the goal is to do the bulk of the Blazegraph migration planning during this half, with engineering work ideally starting at the beginning of FY22-23.


Features and improvements

A WDQS analyst contractor synthesized existing analyst research to identify possible future directions for scaling WDQS beyond the Blazegraph migration (a new graph database doesn’t solve all our problems, just our biggest one). These ideas fall along the aforementioned spectrum of feature and infrastructure work, and vary in how much prior thought and investigation have gone into them.

Improving the current SPARQL endpoint

The suggestions to improve the existing SPARQL endpoint (currently essentially synonymous with WDQS, as SPARQL is the only query language WDQS supports) fall more towards the infrastructure side of the spectrum, as they are largely about improving how we handle things on the backend rather than directly building user-facing features.

  • Federated subgraphs. As mentioned above, Wikidata comprises a number of subgraphs, such as scholarly articles and humans. In the long term, this singular graph could be split into subgraphs which are then federated. However, as mentioned, this approach has yet to be validated as possible and beneficial given the graph’s structure and query traffic, and it would also add the overhead of managing multiple graphs at once. Implementing this would itself be a sizable amount of work, taking at least 1-2 quarters of dedicated development.
  • Graph trimming. This tactic would reduce query compute load by trimming redundant labels from the graph, and by better handling Wikidata statements that have no qualifiers or references but for which WDQS still generates the statement node and associated info. We have not explored this approach in sufficient detail to understand the potential benefits or how much work it would take.
  • Query parsing, query caching. This feature would parse and break SPARQL queries up into recurring parts with cached results. Cached results for frequent (sub)queries would reduce the compute time needed for them (see the sketch after this list). The biggest drawback is that keeping caches up to date with fresh data is difficult. However, not every user requires fresh data, and some would be willing to trade graph freshness for stability and for not having timeouts on expensive queries.
  • Limit SPARQL functionality. SPARQL is the query language used for WDQS and allows users a lot of freedom to run powerful queries. These queries can often be too expensive for WDQS to compute in time. We could limit the functionality of SPARQL to prevent the most expensive types of queries from locking up the servers and bringing down the service. It is unclear what these limitations would be, and any meaningful limitation will inherently make some subset of users unhappy, as they will no longer be able to execute their queries. This is not a decision that WMF should make unilaterally, without community engagement and support.
  • User management & throttling. By re-engineering WDQS to require user authentication (at least for expensive queries) -- possibly through the API Gateway -- we would have better options for managing expensive users: we would have the power to selectively throttle or block them to preserve the overall functionality of WDQS. Currently, we can only ban user agents and apply global throttles, which are relatively blunt tools. This is likely also not a popular option with community users, given the current conversation around authentication for Wikimedia Commons Query Service (WCQS).
  • Async(hronous) queries. Many users in the WDQS user survey indicated that they’d be amenable to WDQS returning their query results asynchronously in exchange for more lenient timeout constraints. An async query feature could reduce compute overload by running queries when resources are available and returning results at some later time. This has not yet been technically validated as a viable way of reducing load on WDQS, though.
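
To make the query-caching idea concrete, here is a minimal sketch of a result cache keyed on a crudely normalized form of the SPARQL text, with a TTL to bound staleness. All names here are hypothetical; real parsing and decomposition into recurring cacheable subqueries would be far more involved. The TTL illustrates the freshness-for-stability trade-off described above.

  import hashlib
  import time

  CACHE_TTL_SECONDS = 300  # hypothetical freshness budget
  _cache = {}  # query hash -> (timestamp, result)

  def _key(sparql):
      # Crude normalization: collapse whitespace, then hash. A real system
      # would parse the query and cache recurring subquery results instead.
      normalized = " ".join(sparql.split())
      return hashlib.sha256(normalized.encode()).hexdigest()

  def cached_query(sparql, run_query):
      """Return a cached result if still fresh; otherwise run and cache."""
      k = _key(sparql)
      hit = _cache.get(k)
      if hit is not None and time.time() - hit[0] < CACHE_TTL_SECONDS:
          return hit[1]
      result = run_query(sparql)  # run_query: a callable hitting the real endpoint
      _cache[k] = (time.time(), result)
      return result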

SPARQL alternatives

There are also alternatives to improving our SPARQL endpoint. However, community members and users are very attached to SPARQL, and it would be undesirable to move away from it altogether -- though we should note that the feedback we have heard in favor of SPARQL likely does not include the voices of potential users who do not understand SPARQL well enough to use WDQS in the way they might want to.

Given the ~300 survey respondents, we can roughly estimate that the number of current regular users is on the order of hundreds, which is smaller than we’d like for building accessible knowledge infrastructure for the world. Even if we maintain a SPARQL endpoint, it seems overwhelmingly likely that we need to provide more accessible endpoints for accessing Wikidata content (i.e. WDQS and other new querying services) if we hope to be more inclusive in who can use this product. WMDE’s Query Builder is one such attempt to increase the accessibility of WDQS. While it is a good example of the feature work WDQS also requires to be successful, this document focuses primarily on the features most closely related to scaling WDQS.

  • MediaWiki API. The MediaWiki API could be used to retrieve entities more efficiently, rather than using SPARQL to compute these common operations more expensively (retrieving entities often feeds other query operations); see the sketch after this list. API calls could also be more easily cached, unlike SPARQL queries.
  • Text index. Similarly, text indexes are more efficient at single-node hops. The difficulty is identifying single-hop queries accurately. We cannot reliably do this programmatically, and building a front end for users to select something like a text index would require meta-knowledge of SPARQL, when using SPARQL is already niche enough.
  • JSON relational database. By indexing Wikibase JSON in a relational database, we could route simple queries to it as MySQL queries. However, we would once again need some way of identifying which queries are simple. Additionally, this might put too much strain on the relational database.
  • Local deploys of Wikibase + WDQS. Currently, it is possible for a user to deploy their own local instance of WDQS. However, this is not widely adopted as a solution, for a variety of reasons I do not fully understand yet. A major component, though, may be the technical proficiency required to set everything up and maintain it (aside from the prohibitively high hardware costs for a more casual user) -- after all, we are hardly keeping our own WDQS instance from falling over at the moment. Some current users would likely be interested in an easier way to deploy and maintain their own local version of WDQS, both to optimize it for their needs and to prevent their expensive usage from straining our WDQS instance.
  • Wikidata Linked Data Fragments (LDF). Wikidata LDF currently exists as an alternative to SPARQL that shifts some of the computational load from the server side to the client side, potentially helping with WDQS query load. Previous estimates are that ~11% of queries could be handled via LDF. However, it is not well supported, documented, or otherwise made readily accessible (I myself am still unclear about how it works or how to use it). Work on this is currently stalled; my guess is that seriously building out this functionality would require non-trivial technical work to make LDF stable/reliable, but also product and design research to make it usable by users, many of whom may not understand what LDF is or when to use it.
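
As an illustration of the MediaWiki API route above, the sketch below fetches a single entity with the Wikibase wbgetentities action instead of a SPARQL query. The action and its parameters are real; the entity (Q42, Douglas Adams) and the User-Agent are just examples.

  import requests

  # Fetch one entity directly from the Wikibase API rather than via SPARQL;
  # responses like this are also easier to cache than arbitrary queries.
  resp = requests.get(
      "https://www.wikidata.org/w/api.php",
      params={
          "action": "wbgetentities",
          "ids": "Q42",
          "props": "labels|claims",
          "languages": "en",
          "format": "json",
      },
      headers={"User-Agent": "entity-fetch-example/0.1"},
      timeout=30,
  )
  resp.raise_for_status()
  entity = resp.json()["entities"]["Q42"]
  print(entity["labels"]["en"]["value"])  # -> "Douglas Adams"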

Next steps

In the first half of 2022, the major next steps for scaling WDQS will be:

  1. Working with a graph consultant to identify a suitable Blazegraph alternative and a technical migration plan. Aside from the disaster plan, this is the highest-urgency action.
  2. Continuing to work with analysts to better understand the structure of Wikidata, its query traffic, and user needs; how WDQS can be optimized; and which scaling features (from the above list and/or whatever else comes up) to build
  3. Ensuring that WDQS has sufficient dedicated resources to continue the scaling work required

Risks

The major risks to be called out at this point in time are as follows:

  • Resources. The Search team is currently unable to keep up with all the WDQS scaling work required, both in the number of people available to focus on this work and in the expertise required to migrate off of Blazegraph. While the graph consultant should be able to help with this knowledge gap, there is also a risk that the complexity of the problem continues to expand in scope beyond what current and near-future resourcing can handle.
  • Blazegraph failure. Blazegraph is still at high risk of catastrophic failure. The published disaster mitigation playbook is a helpful band-aid, but certainly not a solution to the underlying problem. The Streaming Updater helps with keeping up with update lag and reducing those bottlenecks, but may also be accelerating our approach towards catastrophic failure from Blazegraph maxing out. It is currently unknown how much time we have before catastrophic failure, or how much time it will take us to migrate off of Blazegraph to avoid the worst of those failures.
  • Internal project dependency on WD(QS). This is to call out that other internal Wikimedia projects currently rely on WD(QS) or plan to rely on it. MediaSearch, for example, makes use of WCQS and WDQS; Abstract Wikipedia plans to use Wikidata in the future; Wikimedia Enterprise is currently working on making Wikidata dumps useful for enterprise clients, which often intersects with WDQS functionality. While there are internal WDQS servers that are less affected by heavy user query strain, these internal servers share the same Blazegraph limitations, including the risk of catastrophic failure from Blazegraph maxing out.

--MPham (WMF) (talk) 17:48, 22 December 2021 (UTC)