Wikidata:SPARQL query service/query optimization


Query optimization attempts to find the most efficient way to execute a given query from among the different possible query execution plans. Blazegraph has a built-in query optimizer which often works well. However, sometimes it is not so successful; in such cases queries may need to be optimised manually.

Automatic optimisation

The basic approach to query optimisation is to try to get the solution set to be as small as possible as soon as possible in the query execution, to make the work needed for subsequent joins as small as it can be. A subsidiary goal can be to maximise the possibility of pipelining and parallelisation. This is what Blazegraph's built-in query optimiser tries to achieve.

For example, the "Understanding SPARQL" section of the queries page contains the following query to return a list of U.S. Presidents and their spouses:

SELECT ?pres ?presLabel ?spouse ?spouseLabel WHERE {
  ?pres wdt:P31 wd:Q5 .
  ?pres wdt:P39 wd:Q11696 .
  ?pres wdt:P26 ?spouse .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}


The details of how Blazegraph has executed a particular query (and why) can be obtained via the query engine's EXPLAIN mode. This can be done by replacing the following at the start of the query-editor URL,

https://query.wikidata.org/#

with

https://query.wikidata.org/bigdata/namespace/wdq/sparql?explain&query=

to send the query to the SPARQL endpoint directly (not via the query editor), with an extra explain& in the URL before the query= string.
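The rewritten URL can be built programmatically; a minimal Python sketch (the helper name `explain_url` is illustrative, not part of the service):

```python
from urllib.parse import quote

# Direct SPARQL endpoint (bypasses the query editor)
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

def explain_url(query: str) -> str:
    """Build a URL that asks the endpoint for an EXPLAIN report
    on the query, rather than executing it normally."""
    return ENDPOINT + "?explain&query=" + quote(query)

# Example: request the explain report for a trivial query
print(explain_url("SELECT * WHERE { ?s ?p ?o } LIMIT 1"))
```

Opening the printed URL in a browser returns the optimisation report rather than the query results.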

Sending the query above to the endpoint directly, with the 'explain' option, produces this report on the query and its optimisation.

By default, a query would execute from top to bottom, with solution-set joins made in the order given. This is shown as the "Original AST" plan in the report. However, the query engine estimates (as of September 2015) the number of solutions each triple pattern would produce, and uses these cardinality estimates to reorder the joins.

It therefore concludes that the most efficient way to run the query, rather than the order specified, is instead to first find the presidents (76), then see how many of them have spouses (51 solutions), then see how many are human (47 solutions), before sending this solution set to the labelling service. This rearrangement is typical of successful query optimisation.
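For comparison, the order the optimiser chose can also be imposed by hand. The directive hint:Query hint:optimizer "None" is a Blazegraph query hint (the hint: prefix is predefined on the Wikidata endpoint) that makes the joins run exactly in the order written; a sketch:

```sparql
SELECT ?pres ?presLabel ?spouse ?spouseLabel WHERE {
  hint:Query hint:optimizer "None" .  # run joins in the written order
  ?pres wdt:P39 wd:Q11696 .           # presidents first (smallest set)
  ?pres wdt:P26 ?spouse .             # then those with a spouse
  ?pres wdt:P31 wd:Q5 .               # then confirm they are human
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
```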

A query that has difficulties

The following query attempts to find all statements referenced to the Le Figaro website, and return their subject, predicate and object. However, as written, it times out.

prefix prov: <http://www.w3.org/ns/prov#>
prefix pr: <http://www.wikidata.org/prop/reference/>
SELECT ?statement ?subject ?subjectLabel ?property ?propertyLabel ?object ?objectLabel ?refURL WHERE {
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .
  ?subject ?p ?statement .
  ?property wikibase:claim ?p .
  ?property wikibase:statementProperty ?ps .
  ?statement ?ps ?object .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}


The reason for this can be investigated by looking at the explain report for the query.

It appears that the query optimiser is seduced by the low cardinality of the patterns ?property wikibase:claim ?p and ?property wikibase:statementProperty ?ps -- only 1,756 properties take items as their objects -- without anticipating that (i) the second condition will do nothing to reduce the solution set from the first; and (ii) together these conditions will do very little to reduce the cardinality of the pattern ?subject ?p ?statement (764,223,907 triples), which the optimiser proposes to join before looking at ?statement prov:wasDerivedFrom ?ref and then ?ref pr:P854 ?refURL.

The query optimiser may also be trying to defer testing the statement FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) for as long as possible, as this requires materialisation of actual string values, rather than proceeding solely with joins based on whether items exist in statements or not.

Are there any examples of a hand-ordered version of this query?
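One plausible hand-ordering (a sketch, not from the original page) disables the optimiser with Blazegraph's hint: namespace, which is predefined on the Wikidata endpoint, and starts from the references themselves:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix pr: <http://www.wikidata.org/prop/reference/>
SELECT ?statement ?subject ?subjectLabel ?property ?propertyLabel ?object ?objectLabel ?refURL WHERE {
  hint:Query hint:optimizer "None" .  # keep the written join order
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .
  ?subject ?p ?statement .
  ?property wikibase:claim ?p .
  ?property wikibase:statementProperty ?ps .
  ?statement ?ps ?object .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
```

With the optimiser off, the reference patterns and the FILTER are evaluated before the 764-million-triple ?subject ?p ?statement pattern, rather than after it.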

Other optimisation modes

Blazegraph offers a handful of other optimisation directives, including the more expensive "runtime optimisation", which uses sampling to estimate more precisely how the size of the solution set is likely to develop under a particular sequence of joins; and a directive to run a particular join first.
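Assuming the documented Blazegraph hint vocabulary, the runtime optimiser is requested with another query hint; a minimal sketch:

```sparql
SELECT ?pres ?spouse WHERE {
  hint:Query hint:optimizer "Runtime" .  # sampling-based runtime optimisation
  ?pres wdt:P39 wd:Q11696 .
  ?pres wdt:P26 ?spouse .
}
```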

However, examining the corresponding query report, it appears that runtime optimisation has proceeded with the same query sequence as the static optimisation, and similarly times out.

As for the "runFirst" directive, I have not yet been able to work out how to apply it.
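For what it is worth, the Blazegraph documentation suggests that runFirst is attached to a specific join via the hint:Prior scope, along these lines -- though, as noted above, this has not been verified to work here:

```sparql
prefix prov: <http://www.w3.org/ns/prov#>
prefix pr: <http://www.wikidata.org/prop/reference/>
SELECT ?statement ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  hint:Prior hint:runFirst "true" .  # ask for the preceding join to run first
  ?statement prov:wasDerivedFrom ?ref .
}
```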

Other queries the optimiser has difficulty with

Performance falls off a cliff when the number in a group of interest exceeds a certain threshold

This arose in connection with a query to find humans with apparently matching dates of birth and death, for humans from a particular country:

  • France (N=125,223): tinyurl.com/obxk9av -- query report (successful)
  • Germany (N=194,098): tinyurl.com/pwthc9d -- query report (times out)

What goes wrong here occurs when the number of items with the given value of country of citizenship (P27), e.g. 194,098 for Germany, exceeds the number of items (in fact, time-value nodes) with wikibase:timePrecision "11"^^xsd:integer (169,902).

For France, with N=125,223, there is a steady fall-off in the size of solution set: requiring a date of death, then day-specific precision, then a date of birth, then day-specific precision, knocks this number down to 48,537, which is then reduced to 10 by requiring the date of birth and the date of death to match (a very quick operation).
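The original queries sit behind the shortened links above; their shape was presumably along these lines (a reconstruction, not the original: Q142 is France, P569/P570 are date of birth/death, and the final FILTER expresses the "matching dates" condition):

```sparql
SELECT ?person ?dob ?dod WHERE {
  ?person wdt:P27 wd:Q142 .                               # citizen of France
  ?person p:P569/psv:P569 ?birthNode .
  ?birthNode wikibase:timePrecision "11"^^xsd:integer .   # day-specific precision
  ?birthNode wikibase:timeValue ?dob .
  ?person p:P570/psv:P570 ?deathNode .
  ?deathNode wikibase:timePrecision "11"^^xsd:integer .
  ?deathNode wikibase:timeValue ?dod .
  FILTER (?dob = ?dod)                                    # "matching" dates
}
```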

However for Germany, with N=194,098, the query optimiser attempts to start with the 169,902 date-nodes with day-specific dates; but the query then explodes when the system tries to match these to all the people with such day-specific birthdates (1,560,806), a number which reduces when the nationality filter kicks in; but which still eats up the available query time.

(In fact, having looked at the performance data and seen just how fast Blazegraph can do all-against-all matching for really very big sets, it turns out that (with a query rewrite) the system is fast enough to run this kind of query for everything v everything -- see tinyurl.com/pn9xyye -- by matching on date-nodes first, which knocks down the size of the solution set, and not checking until after that the requirement that the date be day-specific).

Adding a names and name-labels wrapper to a query for life events makes it time out

Adding an apparently trivial wrapper, to find the name and name-label, causes the query in the immediately previous section to time out, even though without the wrapper it ran with no difficulty.

  • Query as presented: tinyurl.com/p8wzb74 -- query report
  • Hand-ordered query: tinyurl.com/o7z95vb -- query report (successful)
  • Query without the name lookup: tinyurl.com/obxk9av -- query report (successful)

With the outer layer in place, the inner layer now starts (even for France) by extracting the birth-nodes that are connected to birth-dates, and then keeping those which have precision=11.

The situation can be fixed by turning off the query optimiser and hand-ordering the query.

It seems this may be a general feature of queries with sub-select clauses: both here and in the next query the optimiser seems to want to start the sub-query with a requirement that includes one of the variables that will be projected to the outer layer.

Adding a property-labels wrapper to a query for properties makes it time out

  • Query as presented: tinyurl.com/oex59qr -- query report
  • Hand-ordered query: tinyurl.com/ppkmb3v -- query report (successful)
  • Query without property-label lookup: tinyurl.com/qe4ukbx -- query report (successful)

Same as the above: for the query including property look-ups, the optimiser apparently wants to start with all triples ?a ?p ?b and then join ?b wdt:P31 wd:Q4167410, rather than the other way round (which is what it manages to successfully do for the smaller query with no property look-up).

This doesn't appear to be an artifact of BIND, nor of the order in which the clauses are presented -- even if these are changed, the query still runs slowly with the built-in optimiser:

  • Alt query: tinyurl.com/q97a8sh -- query report (still fails)

It is still preferring to start with the assertion ?a ?p ?b, even though this is less specific -- apparently because ?p is one of the variables projected out of the inner SELECT statement.

More... ?

Add further cases here