Wikidata:SPARQL query service/WDQS graph split/Internal Federation Guide

From Wikidata

This guide is mainly addressed to WDQS users impacted by the split of scholarly articles out of the Wikidata main graph. It explains how to rewrite a SPARQL query using internal federation between the two graphs.

It also contains some useful information about internal federation for new WDQS users who may want to query both graphs.

This guide expects that you have some basic understanding of:

How to know if I need to rewrite my queries?

Before going into more detail, a simple question to ask is whether your use case targets a specific domain of items in Wikidata. If it does, and this domain is not even remotely connected to scientific/scholarly publications, then you are most likely not affected by the split.

Until December 2025, you can usually answer this question by running your query against the full graph at https://query-legacy-full.wikidata.org and comparing the results with those from https://query-main.wikidata.org/. If you don't spot any differences, chances are that the query is not affected. If you do spot differences, you may have to investigate a bit further: they are not always caused by the split and may instead be due to the use of non-deterministic SPARQL features in the query.
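This comparison can be scripted. The following is a minimal Python sketch, not an official tool. It assumes that both hosts expose the query service under the /sparql path, accept standard SPARQL protocol GET requests and return the standard SPARQL JSON results format. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Hosts come from this guide; the "/sparql" path is an assumption.
FULL_GRAPH = "https://query-legacy-full.wikidata.org/sparql"
MAIN_GRAPH = "https://query-main.wikidata.org/sparql"


def run_query(endpoint, query, user_agent="my-split-check/0.1 (example)"):
    """Send a SPARQL query with a GET request and return the parsed JSON."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": user_agent,
    })
    with urllib.request.urlopen(request, timeout=90) as response:
        return json.load(response)


def binding_rows(results):
    """Turn SPARQL JSON results into a set of order-independent rows."""
    rows = set()
    for binding in results["results"]["bindings"]:
        rows.add(frozenset((var, cell["value"]) for var, cell in binding.items()))
    return rows


def diff_results(full_results, main_results):
    """Return (rows only in the full graph, rows only in the main graph)."""
    full_rows = binding_rows(full_results)
    main_rows = binding_rows(main_results)
    return full_rows - main_rows, main_rows - full_rows
```

A call such as diff_results(run_query(FULL_GRAPH, q), run_query(MAIN_GRAPH, q)) returns two empty sets when both endpoints agree. Rows are compared as sets, so duplicate rows and ordering are ignored. A query that uses LIMIT without ORDER BY can still differ between runs for reasons unrelated to the split.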

If you maintain a tool/bot and the queries you use are generated programmatically then you might want to read further to assess whether your program is affected or not.

A quick (incomplete) test can be run at http://tiago.bio.br/query-split-tester which will run a SPARQL query against all three Wikidata WDQS-es.

When do I need to rewrite my queries?

The SPARQL endpoint at https://query.wikidata.org/ served the full graph until May 2025, when it started serving only the main graph. A legacy endpoint remains available until December 2025 at https://query-legacy-full.wikidata.org.

Your queries and/or associated tools should be migrated as soon as possible, because the split has already happened. The legacy endpoint should only be used as a stop-gap measure.

Please let us know via Wikidata:Report a technical problem/WDQS and Search about your project, progress and difficulties - the more we know about your project and progress the better we can support you.

If you are the maintainer of a tool it might be a good time to reconsider using SPARQL if not strictly required:

Please see Wikidata:Data access for more details, if in doubt please feel free to ask at Wikidata:Report a technical problem/WDQS_and_Search.

What happens to my queries/tools if I don't use the new endpoints?

Since May 2025, https://query.wikidata.org/ and https://query-main.wikidata.org/ serve only the wikidata_main graph.

In other words:

  • If, on the other hand, you depend on scientific publications your queries will start producing different results and possibly no results at all.

What is where?

Prior to the split all the data was stored in a single graph served by https://query.wikidata.org/. This is no longer the case: the data is now split into two separate graphs, wikidata_main and scholarly_articles.

The way the data is split is rather naive and follows the rules defined at Wikidata:SPARQL query service/WDQS graph split/Rules. The list of types that identify a scholarly article is visible in this query.

The split moves every item whose instance of (P31) value is directly one of the types listed above into the scholarly_articles graph. For instance, Approximate Bayesian computation (Q4781761) is part of scholarly_articles. All the data owned by this entity is also part of the scholarly graph. In other words, all the data that you can edit from the entity page is available in the scholarly graph:

  • the labels, aliases and descriptions
  • the statements, qualifiers, complex values and references
  • the sitelinks

What may not be available is the data owned by another entity linked from this item. For instance, you can list the author QIDs linked from this entity:

SELECT ?authors {
  VALUES (?entity) {(wd:Q4781761)}
  ?entity wdt:P50 ?authors
} LIMIT 10

But you cannot (without federation) list the date of birth of these authors:

SELECT ?dateOfBirth {
  VALUES (?entity) {(wd:Q4781761)}
  ?entity wdt:P50 ?authors .
  ?authors wdt:P569 ?dateOfBirth .
} LIMIT 10

The reason is that the date of birth (P569) data is owned by the entity representing the author, not the publication, and thus may not be in the scholarly graph.

Similarly querying labels (even when using the label service) of entities linked from the publication will not work without federation:

SELECT ?authorsLabel {
  VALUES (?entity) {(wd:Q4781761)}
  ?entity wdt:P50 ?authors .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
} LIMIT 20

Sometimes, it might not be obvious that your query might require data owned by another entity. In the above two examples we have the variable named ?authors that identifies the linked entities but when using property paths this variable might be hidden.

Let's rewrite our first example around date of birth with property paths to illustrate the issue:

SELECT ?dateOfBirth {
  VALUES (?entity) {(wd:Q4781761)}
  ?entity wdt:P50/wdt:P569 ?dateOfBirth .
} LIMIT 10

Note the introduction of wdt:P50/wdt:P569 which is just some syntactic sugar (Q734781) of SPARQL. Even though we don't have an explicit variable representing the authors, the date of birth remains nonetheless on another graph.

This does not necessarily mean that you cannot use property paths, but you must be vigilant about what they might hide. For instance, it is perfectly fine to use property paths to access the qualifiers of statements:

SELECT ?namedAs {
  VALUES (?entity) {(wd:Q4781761)}
  ?entity p:P50/pq:P1932 ?namedAs .
} LIMIT 20

The above query does not require federation because we access the object named as (P1932) qualifier which is owned by the publication.
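For tools that generate or store many queries, a rough scan for such hidden hops can help decide which queries to look at first. The sketch below is a heuristic only and rests on assumptions: it only recognises sequence paths made of Wikidata property IDs, and it treats the wdt: and ps: prefixes as "the value is usually another entity". The function name is hypothetical.

```python
import re

# One path step such as wdt:P50, ^wdt:P50 or wdt:P279+ (Wikidata property IDs only).
STEP = r"\^?[A-Za-z]+:P\d+[*+?]?"
SEQUENCE_PATH = re.compile(rf"{STEP}(?:\s*/\s*{STEP})+")

# Prefixes whose value is usually another entity (an assumption of this sketch).
ENTITY_VALUED_PREFIXES = {"wdt", "ps"}


def hidden_entity_hops(query):
    """Return the sequence paths that step through a linked entity."""
    flagged = []
    for match in SEQUENCE_PATH.finditer(query):
        path = match.group(0)
        steps = [step.strip().lstrip("^") for step in path.split("/")]
        # Any step except the last one that lands on an entity means the
        # following step reads data owned by that other entity.
        if any(step.split(":")[0] in ENTITY_VALUED_PREFIXES for step in steps[:-1]):
            flagged.append(path)
    return flagged
```

On the two examples above, the scan flags wdt:P50/wdt:P569 and leaves p:P50/pq:P1932 alone. It does not look at single-step paths such as wdt:P279+, which can also traverse entities in the other graph, so a clean scan is not proof that a query is unaffected.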

How do I use federation?

If your query needs to access data stored in both graphs, you must use federation. SPARQL federation is a standard feature of SPARQL that allows combining the results of multiple SPARQL endpoints.

Let's re-use our example about fetching the author birth dates which we said will no longer work because of the split:

SELECT ?dateOfBirth {
  VALUES (?entity) {(wd:Q4781761)}
  ?entity wdt:P50 ?authors .
  ?authors wdt:P569 ?dateOfBirth .
} LIMIT 10

The first question to answer is which endpoint to use for the host query. We have two choices here:

  • Run the query on the scholarly_articles endpoint and federate the wikidata_main endpoint
  • Run the query on the wikidata_main endpoint and federate the scholarly_articles endpoint

We will cover both approaches using our example.

Here we have to understand that the two triple patterns require data from two different graphs:

  • ?entity wdt:P50 ?authors requires the data available in the scholarly_articles graph
  • ?authors wdt:P569 ?dateOfBirth requires the data available in the wikidata_main graph

In other words, if the host query is run from the scholarly_articles graph, we must federate the triple patterns that require the wikidata_main graph. This is done by wrapping them with the syntax SERVICE wdsubgraph:wikidata_main { ...my patterns... }:

SELECT ?dateOfBirth {
  VALUES (?entity) {(wd:Q4781761)}
  ?entity wdt:P50 ?authors .
  SERVICE wdsubgraph:wikidata_main {
    ?authors wdt:P569 ?dateOfBirth .
  }
} LIMIT 10

The above query is to be run from the scholarly_articles endpoint but it could be run from the wikidata_main endpoint by wrapping the other triple pattern in a SERVICE clause targeting the scholarly_articles endpoint:

SELECT ?dateOfBirth {
  hint:Query hint:optimizer "None" .
  VALUES (?entity) {(wd:Q4781761)}
  SERVICE wdsubgraph:scholarly_articles {
    ?entity wdt:P50 ?authors .
  }
  ?authors wdt:P569 ?dateOfBirth .
} LIMIT 10

You might notice that we added a strange triple pattern[1] with hint:Query hint:optimizer "None". This is unfortunately one of the common requirements of federation for performant queries. When running on a single graph, the Blazegraph graph database engine is usually able to optimize the query and determine what part of the query to run first. In our case, deciding which of the following two triple patterns to run first requires some consideration:

  • ?entity wdt:P50 ?authors
  • ?authors wdt:P569 ?dateOfBirth

The former returns only the few authors linked from Approximate Bayesian computation (Q4781761), but the latter will load all the 6.5+M statements defining a date of birth (P569) property. If not given any hint, Blazegraph might prefer to resolve the triple patterns from the host query first, which in this case will resolve these 6.5+M authors, and then pass them along to the federated query to resolve the ?entity wdt:P50 ?authors triple pattern. This would result in a timeout.

By using hint:Query hint:optimizer "None" we instruct Blazegraph to trust the order in which we put the various triple patterns in the query and run them in the order they appear. This way we resolve first the few authors of Approximate Bayesian computation (Q4781761) and then resolve their respective date of birth (P569).

If you want to learn more about this please read Wikidata:SPARQL query service/WDQS graph split/Federation Limits.
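If your tool builds queries programmatically, the choice of host endpoint can be kept out of the query description itself. The sketch below is one possible approach under stated assumptions, not an existing library: build_query and its parameters are hypothetical names. Each part of the query is tagged with the graph that owns its data, and any part that does not belong to the host graph is wrapped in a SERVICE clause.

```python
def build_query(select, parts, host, keep_order=False, limit=10):
    """Render `parts` as a query to run on the endpoint of the `host` graph.

    `parts` is an ordered list of (graph, patterns) pairs; graph is None for
    patterns that need no graph data, such as VALUES.
    """
    lines = [f"SELECT {select} {{"]
    if keep_order:
        # Ask Blazegraph to run the patterns in the order they are written.
        lines.append('  hint:Query hint:optimizer "None" .')
    for graph, patterns in parts:
        if graph is None or graph == host:
            lines.extend(f"  {pattern}" for pattern in patterns)
        else:
            lines.append(f"  SERVICE wdsubgraph:{graph} {{")
            lines.extend(f"    {pattern}" for pattern in patterns)
            lines.append("  }")
    lines.append(f"}} LIMIT {limit}")
    return "\n".join(lines)


# The date-of-birth example, described once and rendered for either host.
PARTS = [
    (None, ["VALUES (?entity) {(wd:Q4781761)}"]),
    ("scholarly_articles", ["?entity wdt:P50 ?authors ."]),
    ("wikidata_main", ["?authors wdt:P569 ?dateOfBirth ."]),
]
print(build_query("?dateOfBirth", PARTS, host="scholarly_articles"))
print(build_query("?dateOfBirth", PARTS, host="wikidata_main", keep_order=True))
```

The two calls print the same two federated queries as shown in this section. Whether keep_order is needed still has to be decided per query, as explained above.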

How to deal with linked entities spread across multiple graphs?

In our simple example about the authors' dates of birth it is obvious that all authors are in the wikidata_main graph (they are hopefully all human (Q5)). It might not always be the case, so let's use another example to illustrate this.

For this we can use the main subject (P921) property which may link a subject defined in wikidata_main or possibly another publication defined in scholarly_articles.[2]

The entity Erratum: Quantum repeaters based on entanglement purification [Phys. Rev. A59, 169 (1999)] (Q59458901) is a good example, as it declares two main subject (P921) statements: one is quantum physics and the other is the paper the erratum refers to.

If we wanted to extract the instance of (P31) property of these subjects, using the full graph we would simply write:

SELECT ?subject ?subjectType {
  VALUES (?paper) {(wd:Q59458901)}
  ?paper wdt:P921 ?subject .
  ?subject wdt:P31 ?subjectType .
} LIMIT 10

But here we know that the same triple pattern ?subject wdt:P31 ?subjectType might be on either the wikidata_main or the scholarly_articles graphs.

The solution to our problem is to use a UNION. Running from the scholarly_articles graph the query becomes:

SELECT ?subject ?subjectType {
  VALUES (?paper) {(wd:Q59458901)}
  ?paper wdt:P921 ?subject .
  { ?subject wdt:P31 ?subjectType }
  UNION
  { SERVICE wdsubgraph:wikidata_main { ?subject wdt:P31 ?subjectType } }
} LIMIT 10

From the wikidata_main endpoint the query is a bit more confusing but still uses the same UNION technique:

SELECT ?subject ?subjectType {
  VALUES (?paper) {(wd:Q59458901)}
  hint:Query hint:optimizer "None" .
  SERVICE wdsubgraph:scholarly_articles { ?paper wdt:P921 ?subject }
  # Fetch the subject type from both graphs using a UNION
  { ?subject wdt:P31 ?subjectType }
  UNION
  { SERVICE wdsubgraph:scholarly_articles { ?subject wdt:P31 ?subjectType } }

} LIMIT 10

Note that here we use two federated queries to the scholarly_articles graph, a first time to fetch the subjects using ?paper wdt:P921 ?subject and a second in the UNION to fetch the types of the subjects.[3] As you can see, depending on your use case choosing the right endpoint to run the query is important and the complexity of the query might vary a lot between the two.
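For generated queries, the UNION technique is mechanical enough to be produced by a small helper. This is a minimal sketch under assumptions: the helper name is hypothetical, and it only handles the case where the same pattern must be looked up in both graphs.

```python
GRAPHS = ("wikidata_main", "scholarly_articles")


def lookup_in_both_graphs(pattern, host):
    """Return a UNION that evaluates `pattern` locally and on the other graph."""
    if host not in GRAPHS:
        raise ValueError(f"unknown graph: {host}")
    other = GRAPHS[1] if host == GRAPHS[0] else GRAPHS[0]
    return (
        f"{{ {pattern} }}\n"
        "UNION\n"
        f"{{ SERVICE wdsubgraph:{other} {{ {pattern} }} }}"
    )
```

Calling lookup_in_both_graphs("?subject wdt:P31 ?subjectType", host="scholarly_articles") yields the three UNION lines of the first query above. With host="wikidata_main", the SERVICE clause targets scholarly_articles instead.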

Complex queries

Complex queries can be challenging to rewrite, but here are a few hints to help you in the process:

  • The first thing to do is to identify the variables that link the two graphs; in our examples we used ?authors and ?subject
  • If no such variable is apparent, it may be hidden by a property path, so look for property paths and unfold them by introducing an explicit variable
  • Clean up and simplify your query before introducing federation, because a clean query is much simpler to rewrite

Please see Wikidata:SPARQL query service/WDQS graph split/Federated Queries Examples which cover some real world examples.

Where can I get some help?

You can use Wikidata:Request a query to get help in rewriting a query with federation. You can also contact the WMF Search Platform team via Wikidata:Report a technical problem/WDQS and Search if you think there is an issue with the new endpoints.

Common mistakes

Misplacing the label service

The label service must be used in the query running on the service that holds the label of the entity.

For instance, fetching the label on the host service does not work if the entity is coming from the federated service:

SELECT ?paper ?paperLabel ?author ?authorLabel ?publicationDate {
  VALUES (?paper) {(wd:Q74426266)}
  SERVICE wdsubgraph:scholarly_articles {
    ?paper wdt:P50 ?author ;
           wdt:P577 ?publicationDate .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}

In the query above we fetch some information (author and publication date) of item Q74426266. The label service is only present in the top-level query and both the labels of the author and the paper are requested. The entity behind ?paper is hosted on the federated service, so its label cannot be loaded from the host service.

The solution is to fetch the label from the federated service, and in order to avoid having to wrap our basic graph pattern with a SELECT we can use the BIND function to tell the label service that we are interested in this label:

SELECT ?paper ?paperLabel ?author ?authorLabel ?publicationDate {
  VALUES (?paper) {(wd:Q74426266)}
  SERVICE wdsubgraph:scholarly_articles {
    ?paper wdt:P50 ?author ;
           wdt:P577 ?publicationDate .
    BIND(?paperLabel AS ?paperLabel)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}

Wrapping a federated query with a SELECT

If for some reason you need to wrap the federated query with a SELECT, be aware that Blazegraph will wrap the query again to select the shared variables, so if you don't select one of the shared variables the behavior of your query might be unexpected.

For example:

SELECT (SUM(?count) AS ?total) {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE wdsubgraph:scholarly_articles {
    SELECT (COUNT(*) AS ?count) {
      ?paper wdt:P921 ?subject .
    }
  }
}

Here we naively tried to count the number of papers whose main subject (P921) is about a subject related to theoretical physics (Q18362). Note that here the shared variable is ?subject but since we do not select it we break the link and Blazegraph will no longer consider this variable the same and will count all the paper - main subject (P921) pairs. So if aggregations are required in the federated query the shared variable must always be selected.

The way to fix the query above is simply to include ?subject using a GROUP BY:

SELECT (SUM(?count) AS ?total) {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE wdsubgraph:scholarly_articles {
    SELECT ?subject (COUNT(*) AS ?count) {
      ?paper wdt:P921 ?subject .
    } GROUP BY ?subject
  }
}
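The effect of dropping the shared variable can be reproduced on toy data. The Python sketch below is an illustration only: the identifiers are made up, and it models plain SPARQL join semantics rather than Blazegraph's actual execution, so the exact number Blazegraph returns for the broken query may differ.

```python
from collections import Counter

# Toy data standing in for the two graphs (all identifiers are made up).
host_subjects = {"S1", "S2"}                  # ?subject wdt:P279+ wd:Q18362
remote_pairs = [("P1", "S1"), ("P2", "S1"),   # ?paper wdt:P921 ?subject
                ("P3", "S2"), ("P4", "S9"), ("P5", "S9")]


def total_without_shared_variable():
    # The subquery returns a single row holding only ?count. Nothing links
    # it to ?subject, so it joins with every host row.
    count = len(remote_pairs)
    return sum(count for _ in host_subjects)


def total_with_group_by():
    # The subquery returns (?subject, ?count) rows. The join keeps only the
    # rows whose ?subject also appears on the host side.
    per_subject = Counter(subject for _, subject in remote_pairs)
    return sum(n for subject, n in per_subject.items() if subject in host_subjects)
```

Here only three papers (P1, P2, P3) are about a host subject, which is what total_with_group_by() computes. total_without_shared_variable() counts all five pairs, including the unrelated subject S9, once per host row.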

Returning variables bound by OPTIONAL

You might sometimes want to use an OPTIONAL clause in the federated query. If any of its variables are shared variables extra care must be taken, especially if a variable is also used in a triple pattern of the host query.

SELECT ?paper ?venueLabel {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE wdsubgraph:scholarly_articles {
    ?paper wdt:P921 ?subject .
    OPTIONAL { ?paper wdt:P1433 ?venue }
  }
  OPTIONAL {
    ?venue rdfs:label ?venueLabel .
    FILTER (LANG(?venueLabel) = 'en')
  }
}

The above query extracts all articles whose main subject (P921) is related to theoretical physics (Q18362) and returns an optional binding ?venue, which is then used to optionally fetch its label. The issue is that ?venue may not always be bound (it's optional after all), in which case the triple pattern ?venue rdfs:label ?venueLabel will attempt to bind it. Without a restriction on ?venue this pattern is extremely broad and is likely to return far too many triples.

Solution 1 : forcing a bind with an arbitrary value

There is an ugly workaround to this problem if you need to return possibly unbound bindings from the federated query: we can simply always bind it with a sigil (Q1758446) using the COALESCE and BOUND functions:

SELECT ?paper ?venueLabel {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE wdsubgraph:scholarly_articles {
    SELECT ?paper (COALESCE(IF(BOUND(?venue), ?venue, '__no_venue__')) AS ?venue) ?subject {
      ?paper wdt:P921 ?subject .
      OPTIONAL { ?paper wdt:P1433 ?venue }
    }
  }
  OPTIONAL {
    ?venue rdfs:label ?venueLabel .
    FILTER (LANG(?venueLabel) = 'en')
  }
}

Or, a bit shorter, without the subquery:

SELECT ?paper ?venueLabel {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE wdsubgraph:scholarly_articles {
    ?paper wdt:P921 ?subject .
    OPTIONAL { ?paper wdt:P1433 ?venue }
    BIND(COALESCE(IF(BOUND(?venue), ?venue, '__no_venue__')) AS ?venue)
  }
  OPTIONAL {
    ?venue rdfs:label ?venueLabel .
    FILTER (LANG(?venueLabel) = 'en')
  }
}

Note the __no_venue__ sigil that we use when ?venue is not bound that will ensure that the subsequent ?venue rdfs:label ?venueLabel triple pattern can never match.
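Why an unbound ?venue is a problem, and why the sigil helps, follows from how SPARQL joins solutions: a variable that is unbound on one side is compatible with any value on the other side. The Python sketch below models this rule on made-up data. It illustrates the semantics and is not how Blazegraph is implemented.

```python
def compatible(left, right):
    """Two solutions join when every variable bound in both has the same value."""
    return all(right[var] == value for var, value in left.items() if var in right)


def left_join(rows, optional_rows):
    """OPTIONAL: extend each row with every compatible optional row, or keep it."""
    joined = []
    for row in rows:
        matches = [{**row, **opt} for opt in optional_rows if compatible(row, opt)]
        joined.extend(matches or [row])
    return joined


# Rows of ?venue rdfs:label ?venueLabel on the host graph (made-up data).
labels = [{"venue": "V1", "venueLabel": "Journal A"},
          {"venue": "V2", "venueLabel": "Journal B"}]

unbound = [{"paper": "P1"}]                              # ?venue left unbound
with_sigil = [{"paper": "P1", "venue": "__no_venue__"}]  # ?venue forced to a sigil
```

left_join(unbound, labels) pairs the paper with every label row, which on the real graph means every English label. left_join(with_sigil, labels) finds no compatible label row and returns the paper row unchanged.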

In this case there is also another way to rewrite the query.

Solution 2 : rewrite a query with only one optional

Another solution, which only works with scheduling hints, is to move ?paper wdt:P1433 ?venue out of its own OPTIONAL and into the second OPTIONAL. Inside that second OPTIONAL the pattern is no longer optional, so ?venue is always bound before its label is looked up. This requires a second call to the scholarly_articles service, plus a Blazegraph scheduling hint to make sure that call runs first, so that Blazegraph does not try to fetch all the English labels on Wikidata:

SELECT ?paper ?venueLabel {
  ?subject wdt:P279+ wd:Q18362 .
  SERVICE wdsubgraph:scholarly_articles {
    ?paper wdt:P921 ?subject .
  }
  OPTIONAL {
    SERVICE wdsubgraph:scholarly_articles {
      ?paper wdt:P1433 ?venue
    }
    hint:Prior hint:runFirst true . # run the SERVICE call above first so that ?venue is bound before the labels are fetched (the query times out without this hint)
    ?venue rdfs:label ?venueLabel .
    FILTER (LANG(?venueLabel) = 'en')
  }
}

Notes

  1. These are Blazegraph optimizer hints.
  2. This happens a lot for erratum articles, which reference both the work being corrected and the subject of the initial publication.
  3. We use two federated queries to avoid having to use OPTIONAL, as these clauses are challenging (see #Common mistakes).