Wikidata:A beginner-friendly course for SPARQL

From Wikidata
Jump to: navigation, search

WDQS, the Wikidata Query Service, is an awesome tool to answer many questions you might have, and this guide will teach you how to use it.

Before we start[edit]

For introduction to the service: A gentle introduction to the Wikidata Query Service.

Please don’t let that scare you away! SPARQL is complicated, but the simple basics will already get you a long way – if you want, you can stop reading after first 1-8 sections, and you’ll already know enough to read and write many interesting queries simply by adjusting examples with right items and properties.

Each section past 8 will empower you to write even more awesome queries!

Whitespace is not meaningful in SPARQL. WDQS editor indents lines for you automatically.

Let's go!

Group Graph Patterns[edit]

Johann Sebastian Bach (Q1339) had two wives. How can we see the children of Johann Sebastian Bach with his first wife, Maria Barbara Bach (Q57487)?

The simplest way to do this is to add a second triple with that restriction:

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339.     # Child  has father  Johann Sebastian Bach.
  ?child wdt:P25 wd:Q57487.    # Child  has mother     Maria Barbara Bach.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

Queries and questions with "and" conjunction are very straightforward, simply write them one after another within one group; "and" between statements (like in SQL, Python, Lua, Java) is unnecessary.

Repetition in subject[edit]

In natural language, may abbreviate second "Child has" predicate using conjunction:

Child has father Johann Sebastian Bach and mother Maria Barbara Bach.

In SPARQL, simply end a triple with a semicolon (;) instead of a period, you can add another predicate-object pair:

# 4.2.1 Predicate-Object Lists
# https://www.w3.org/TR/sparql11-query/#predObjLists

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339;     # Child  has father  Johann Sebastian Bach and
         wdt:P25 wd:Q57487.    #        has mother     Maria Barbara Bach.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

Repetition in subject and predicate[edit]

Now suppose that, out of those results, we’re interested only in those children who also were also composer (Q36834) and pianist (Q486748). The relevant property occupation (P106). Please try yourself first. Possible solution below:

# 4.2.2 Object Lists
# https://www.w3.org/TR/sparql11-query/#objLists

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339;
         wdt:P25 wd:Q57487;
         wdt:P106 wd:Q36834;   #       has occupation  composer and
         wdt:P106 wd:Q486748.  #       has occupation   pianist.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

Syntax , allows you to append another object to a triple (reusing both subject and predicate), query can be simplified to:

SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339;
         wdt:P25 wd:Q57487;
         wdt:P106 wd:Q36834,    #        has occupation composer and 
                  wd:Q486748.   #                        pianist.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

SPARQL punctuation[edit]

  • Each triple about a subject is terminated by a period;
  • Multiple predicates about the same subject can be separated by semicolons;
  • Multiple objects for the same subject and predicate can be separated by commas.
SELECT ?s1 ?s2 ?s3
WHERE
{
  ?s1 p1 o1;             # s1
      p2 o2;             # s1
      p3 o31, o32, o33.  # s1
  ?s2 p4 o41, o42.       # s2
  ?s3 p5 o5;             # s3
      p6 o6.             # s3
}

Properties of the object (Relative Clauses)[edit]

Suppose we’re not actually interested in Bach’s children, but in his grandchildren.

For this task we would use child (P40), which points from parent to child and is gender-independent. Possible solution below:

SELECT ?grandChild ?grandChildLabel
WHERE
{
  wd:Q1339 wdt:P40 ?child.                     #    Bach  has a child       ?child.
    ?child wdt:P40 ?grandChild.                #  ?child  has a child  ?grandChild.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

Brackets syntax[edit]

Note that we don't need information about the child (?child): we don’t use the variable except to talk about the grandchild.

We can refer back to them because we’ve said “someone who”: this starts a relative clause, and within that relative clause we can say things about “someone” (e. g., that he or she “has a child ?grandChild”).

In SPARQL you can use a pair of brackets ([]) in place of a variable, which acts as an anonymous variable:

# 4.1.4 Syntax for Blank Nodes https://www.w3.org/TR/sparql11-query/#QSynBlankNodes
# Example with object
SELECT ?grandChild ?grandChildLabel
WHERE
{
  wd:Q1339 wdt:P40 [ wdt:P40 ?grandChild ].   # Bach                                    has as child                     someone who has a child ?grandChild.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

Word order is fixed around predicates in SPARQL; otherwise it can be changed from SVO to OVS:

# 4.1.4 Syntax for Blank Nodes https://www.w3.org/TR/sparql11-query/#QSynBlankNodes
# Example with subject
# 9.1 Property Path Syntax https://www.w3.org/TR/sparql11-query/#pp-language
# ^ or InversePath
SELECT ?grandChild ?grandChildLabel
WHERE
{
  [ wdt:P40 ?grandChild ] ^wdt:P40 wd:Q1339.   # Someone who has a child ?grandChild     was specified in P40 property    of Bach Q1339
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

Inside the brackets, you can specify predicate-object pairs, just like after a ; after a normal triple; the implicit subject is in this case the anonymous variable that the brackets represent. (Note: also just like after a ;, you can add more predicate-object pairs with more semicolons, or more objects for the same predicate with commas.)

Retrieving items with optional information (OPTIONAL)[edit]

Let’s try to craft a query that also includes the title (P1476), illustrator (P110), publisher (P123) and publication date (P577):

# 6 Including Optional Values (malfunctioned code)
# https://www.w3.org/TR/sparql11-query/#optionals

SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
  ?book wdt:P50 wd:Q35610;
        wdt:P1476 ?title;
        wdt:P110 ?illustrator;
        wdt:P123 ?publisher;
        wdt:P577 ?published.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

It only returns two results, why is that?

Reason is very simple: 5 patterns are glued using "and" conjunctions.

In other words, to match this query, a potential result must match all the triples we listed: it must have a title, and an illustrator, and a publisher, and a publication date. If it has some of those properties, but not all of them, it won’t match.

That’s not what we want: we primarily want a list of all the books – if additional data is available, we’d like to include it, but we don’t want that to limit our list of results.

The solution is to tell SPARQL executor that those properties are optional:

  • wrap each with optional clause, line before: ?book wdt:P1476 ?title. and after: OPTIONAL { ?book wdt:P1476 ?title. }
  • order matters, place "OPTIONAL" after required patterns[1]

Instances and classes[edit]

Earlier, we noted that most Wikidata properties are “has” relations: has child, has father, has occupation. But sometimes (in fact, frequently), you also need to talk about what something is:

When we want to search for “all work of art”, it’s not enough search for all items that are direct instances of work of art:

SELECT ?work ?workLabel
WHERE
{
  ?work wdt:P31 wd:Q838948. # instance of work of art
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

That query only returns 2815 results – obviously, there are over 868119 work of art! The problem is that this misses items like Gone with the Wind, which is only an instance of film, not of work of art. We need to tell SPARQL to account following claim when searching:

< film (Q11424) View with Reasonator View with SQID > subclass of (P279) View with SQID < work of art (Q838948) View with Reasonator View with SQID >

One possible solution to this is the brackets syntax we talked about: Gone with the Wind is an instance of some class subclass of “work of art”.

But this might be not what you want:

  1. We’re no longer including items that are directly instances of work of art. In other words, subclass of relations in path can be optional.
  2. We’re still missing items that are instances of some subclass of some other subclass of “work of art” – for example, Snow White and the Seven Dwarfs is an animated film, which is a film, which is a work of art. In this case, we need to follow two “subclass of” statements – but it might also be three, four, five, any number really.
  3. For some properties, degree of nesting isn't known beforehand: not only it means that there might be a deep chain of subclass of but also such chain should be combined (wasn't covered yet) with short chains of few subclass of. The more links, the more nesting, the less query is readable by humans. Furthermore query that uses simplest syntax or brackets syntax won't match layers of underlying data exactly (3 levels in query, but 4 in data) and every time you change the data, you have to update query as well in order to match them back.

More complex, but also more flexible solution:

# instance of any subclass of work of art

SELECT ?work ?workLabel
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q838948. # one P31 and any number of P279 between the item and the class
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 42

Try it!

Now you know how to search for all work of arts, or all buildings, or all human settlements: the magic incantation wdt:P31/wdt:P279*, along with the appropriate class. This uses some more SPARQL features that I haven’t explained yet, but quite honestly, this is almost the only relevant use of those features, so you don’t need to understand how it works in order to use WDQS effectively.

subclass of (P279) is the most common transitive property (Q18647515), see others.

Property paths[edit]

Property paths are a way to very tersely write down a path of properties between two items. You can add sequence path elements with a forward slash (/):

SPARQL Items and properties Description
?river wdt:P403 wd:Q1497 mouth of the watercourse (P403), Mississippi River (Q1497) All items that flow directly into the Mississippi River
?river wdt:P403+ wd:Q1497 mouth of the watercourse (P403), Mississippi River (Q1497) All items that flow into the Mississippi River, directly or indirectly
wd:Q1339 wdt:P40 ?descendant child (P40), Johann Sebastian Bach (Q1339) Children of Johann Sebastian Bach (Q1339)
wd:Q1339 wdt:P40 ?child. ?child wdt:P40 ?descendant child (P40), Johann Sebastian Bach (Q1339) Grandchildren of Johann Sebastian Bach (Q1339)
wd:Q1339 wdt:P40 [ wdt:P40 ?descendant ] child (P40), Johann Sebastian Bach (Q1339) Grandchildren of Johann Sebastian Bach (Q1339)
wd:Q1339 wdt:P40/wdt:P40 ?descendant child (P40), Johann Sebastian Bach (Q1339) Grandchildren of Johann Sebastian Bach (Q1339)
wd:Q1339 wdt:P40+ ?descendant child (P40), Johann Sebastian Bach (Q1339) All descendants of Johann Sebastian Bach (Q1339)
wd:Q1339 wdt:P40* ?descendant child (P40), Johann Sebastian Bach (Q1339) All descendants of Johann Sebastian Bach (Q1339), including Johann Sebastian Bach (Q1339)
?descendant (wdt:P22|wdt:P25)+ wd:Q1339 father (P22), mother (P25), Johann Sebastian Bach (Q1339) All descendants of Johann Sebastian Bach (Q1339)
?work wdt:P31/wdt:P279* wd:Q838948 instance of (P31), subclass of (P279), work of art (Q838948) Instance of any subclass of work of art (Q838948)
?instance wdt:P31/wdt:P279* ?class instance of (P31), subclass of (P279), element instance in Wikidata (Q28326490), Wikidata element-class (Q28326484), Instance of any subclass of class
9 Property Paths https://www.w3.org/TR/sparql11-query/#pp-language

Symmetric properties and self-references[edit]

In Wikidata properties can refer to other elements. Sometimes items are obligated to have links between each other: symmetric property.

In practice this means that you might encounter:

self-references
in order to exclude self-references, append FILTER (?item != wd:Q801551) after last triple in Group Graph Patterns.
duplicates

Wider or narrower results[edit]

Matching Alternatives. Negation.

Over time we will lose interest in some items as well-known, visited or done in any sense. It's time to exclude them (MINUS), or to include new items (UNION):

The following query uses these:

#defaultView:ImageGrid
SELECT ?item ?itemLabel ?image ?genreLabel ?movementLabel
WHERE
{
           ?item wdt:P31/wdt:P279*          wd:Q838948   . # works of art
           ?item wdt:P276                   wd:Q19675    . # located in Louvre
           # 117 items
MINUS    { ?item wdt:P136                   wd:Q440928   } # except ONE sculptural genre (Q440928)
           # 116 items
MINUS    { ?item wdt:P136/wdt:P31/wdt:P279* wd:Q18783400 } # except ANY sculptural genre (Q18783400)
           # 113 items
  
OPTIONAL { ?item wdt:P18                   ?image       }
OPTIONAL { ?item wdt:P136                  ?genre       }
OPTIONAL { ?item wdt:P135                  ?movement    }
	SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

Try it!

We should note here about potential differences between MINUS and FILTER NOT EXISTS:

  • NOT EXISTS and MINUS represent two ways of thinking about negation, one based on testing whether a pattern exists in the data, given the bindings already determined by the query pattern, and one based on removing matches based on the evaluation of two patterns.[2]
  • One of the key differences between MINUS and NOT EXISTS is that it is a child graph pattern and so breaks the graph pattern and so the result of the query can change depending where the MINUS is placed[3][4]

Multiple values[edit]

Rarely in Wikidata, we may enter multiple values.

When we query for ?item wdt:mvproperty ?value we can get multiple tuples about values, not one about item. This is different from object-oriented approach where one record corresponds to one object.

In order to get one subject (or item) per tuple:

ignore such properties
the most radical way; do not place properties that return multiple values (wd:Q12418 wdt:P186 ?material) in "SELECT" part of your query
LIMIT 1
a less radical way, but it discards data as well: SELECT ?materialLabel { SELECT ?materialLabel WHERE { wd:Q12418 wdt:P186 ?material . SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } . } LIMIT 1 }; with a good hammer it is possible to fit square in circle
group_concat
working query. simplest query with label service wouldn't work.

Qualifiers, Ranks and Sources[edit]

RDF mappings (above) are different from what we use in SPARQL; Image is used to illustrate idea of additional nodes and links between them

Mona Lisa (Q12418) View with Reasonator View with SQID and material used (P186) View with SQID:

  1. oil paint (Q296955), the main material;
  2. poplar wood (Q291034), with the qualifier applies to part (P518)  painting surface (Q861259) – this is the material that the Mona Lisa was painted on

Suppose we want to find all paintings with their painting surface, that is, those material used (P186) statements with a qualifier applies to part (P518)  painting surface (Q861259). How do we do that? That’s more information than can be represented in a single triple.

The answer is: more triples!

Wikidata’s solution for almost everything is “more items”, and the corresponding WDQS rule is “more triples”. References, numeric precision, values with units, geocoordinates, etc., all of which we’re skipping here, also work like this.

entity
wd:
direct property
wdt: - best values; it respects ranks: only preferred values, else only normal and never deprecated
statement node
p: - link between entity (item) and statement (wds:); it selects every statement node, regardless of ranks or other information. This node then is the subject of other prefixes below;
property statements
ps: - ” to statement object using statement nodes
property qualifier
pq: - ” to statement qualifiers using statement nodes
ranks
wikibase:rank - ” to rank
references
prov:wasDerivedFrom - ” to reference nodes

Here’s a concrete example for the Mona Lisa:

   wd:Q12418  p:P186   ?st1.      # Mona Lisa: material used: ?st1                #  p: is a link between entity and a statement
        ?st1 ps:P186 wd:Q296966.  # value: oil paint                              # ps: is a link between statement and values

   wd:Q12418  p:P186   ?st2.      # Mona Lisa: material used: ?st2
       ?st2  ps:P186 wd:Q291034.  # value: poplar wood                         
       ?st2  pq:P518 wd:Q861259.  # qualifier: applies to part: painting surface  # pq: is a link between statement and qualifiers

Exercise: a query for all paintings with their painting surface?

Pagination (ORDER and LIMIT)[edit]

It’s quite common to care only about a few results: a first, first to, pioneer in; oldest, earliest; youngest, latest.

In order to get an answer our entities should be ordered and limited:

ORDER BY something sorts the results by something.
something can be any expression – for now, the only kind of expression we know are simple variables (?something), but we’ll see some other kinds later. This expression can also be wrapped in either ASC() or DESC() to specify the sorting order (ascending or descending). (If you don’t specify either, the default is ascending sort, so ASC(something) is equivalent to just something.)
LIMIT count cuts off the result list at count results,
where count is any natural number. For example, LIMIT 10 limits the query to ten results. LIMIT 1 only returns a single result.

(You can also use LIMIT without ORDER BY. In this case, the results aren’t sorted, so you don’t have any guarantee which results you’ll get. Which is fine if you happen to know that there’s only a certain number of results, or you’re just interested in some result, but don’t care about which one. In either case, adding the LIMIT can significantly speed up the query, since WDQS can stop searching for results as soon as it’s found enough to fill the limit.)

The query that returns the ten most populous countries:

SELECT DISTINCT ?country ?countryLabel ?population ?ended
# ideally we don't need a "DISTINCT" above
# but because some items have multiple P31 statements that lead to a Q3624078
# we can trim duplicates as workaround (or inspect classification and P31 links)
#SELECT ?country ?countryLabel ?population ?ended
WHERE
{
  ?country wdt:P31/wdt:P279* wd:Q3624078; #countries
           wdt:P1082         ?population; #with their population
  MINUS 
  {
  ?country wdt:P576          ?ended. 
  } # exclude "former" countries
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population) # most populous countries - descending population
LIMIT 10

Try it!

Limitations[edit]

In Wikidata sort order defined for following types of properties:

But not for:

Expressions, FILTER and BIND[edit]

This section might seem a bit less organized than the other ones, because it covers a fairly wide and diverse topic. The basic concept is that we’d now like to do something with the values that, so far, we’ve just selected and returned indiscriminately. And expressions are the way to express these operations on values. There are many kinds of expressions, and a lot of things you can do with them – but first, let’s start with the basics: data types.

Data types[edit]

Each value in SPARQL has a type, which tells you what kind of value it is and what you can do with it. The most important types are:

  • item, like wd:Q42 for no label (Q42).
  • boolean, with the two possible values true and false. Boolean values aren’t stored in statements, but many expressions return a boolean value, e. g. 2 < 3 (true) or "a" = "b" (false).
  • string, a piece of text. String literals are written in double quotes.
  • monolingual text, a string with a language tag attached. In a literal, you can add the language tag after the string with an @ sign, e. g. "Douglas Adams"@en.
  • numbers, either integers (1) or decimals (1.23).
  • dates. Date literals can be written by adding ^^xsd:dateTime (case sensitive – ^^xsd:datetime won’t work!) to an ISO 8601 date string: "2012-10-29"^^xsd:dateTime. (Wikidata does not yet support timestamps with hours, minutes, seconds, etc.)

Operators[edit]

The familiar mathematical operators are available: +, -, *, / to add, subtract, multiply or divide numbers, <, >, =, <=, >= to compare them. The inequality test ≠ is written !=. Comparison is also defined for other types; for example, "abc" < "abd" is true (lexical comparison), as is "2016-01-01"^^xsd:dateTime > "2015-12-31"^^xsd:dateTime and wd:Q4653 != wd:Q283111. And boolean conditions can be combined with && (logical and: a && b is true if both a and b are true) and || (logical or: a || b is true if either (or both) of a and b is true).

FILTER[edit]

FILTER(condition) is a clause you can insert into your SPARQL query to, well, filter the results. Inside the parentheses, you can put any expression of boolean type, and only those results where the expression returns true are used.

For example, to get a list of all humans born in 2015, we first get all humans with their date of birth –

SELECT ?person ?personLabel ?dob
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?dob.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } 
}

– and then filter that to only return the results where the year of the date of birth is 2015. There are two ways to do that: extract the year of the date with the YEAR function, and test that it’s 2015 –

FILTER(YEAR(?dob) = 2015)

– or check that the date is between Jan. 1st (inclusive), 2015 and Jan. 1st, 2016 (exclusive):

FILTER("2015-01-01"^^xsd:dateTime <= ?dob && ?dob < "2016-01-01"^^xsd:dateTime)

I’d say that the first one is more straightforward, but it turns out the second one is much faster, so let’s use that:

SELECT ?person ?personLabel ?dob
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?dob.
  FILTER("2015-01-01"^^xsd:dateTime <= ?dob && ?dob < "2016-01-01"^^xsd:dateTime)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } 
}

Try it!

Language of the value (monolingualtext)[edit]

Saint Petersburg (Q656) with multiple values in official name (P1448) or any other property Special:ListProperties/monolingualtext.

SELECT ?value ?startDate ?endDate # ?r
WHERE
{
       wd:Q656  p:P1448 ?s.
       ?s      ps:P1448 ?value.
       ?s      pq:P580  ?startDate.
       ?s      pq:P582  ?endDate.
#      ?s      wikibase:rank ?r.
       FILTER(LANG(?value) = "ru")
}

Try it!

Language of the label[edit]

We get the label with the ?human rdfs:label ?label triple, restrict it to English labels, and then check if it starts with “Mr. ”:

SELECT ?human ?label
WHERE
{
  ?human wdt:P31 wd:Q15632617;
         rdfs:label ?label.
  FILTER(LANG(?label) = "en")
  FILTER(STRSTARTS(?label, "Mr. "))
}

Try it!

Assigning to Variables[edit]

BIND(expression AS ?variable).
clause can be used to assign the result of an expression to a variable (usually a new variable, but you can also overwrite existing ones).

BIND can be used to bind the results of some calculation to a new variable. This can be an intermediate result of a larger calculation or just directly a result of the query. For example, to get the age of victims of capital punishment:

SELECT ?person ?personLabel ?age
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?born;
          wdt:P570 ?died;
          wdt:P1196 wd:Q8454.
  BIND(?died - ?born AS ?ageInDays).
  BIND(?ageInDays/365.2425 AS ?ageInYears).
  BIND(FLOOR(?ageInYears) AS ?age).
  # or, as one expression:
  #BIND(FLOOR((?died - ?born)/365.2425) AS ?age).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

BIND can also be used to simply bind constant values to variables in order to increase readability. For example, a query that finds all female priests:

SELECT ?woman ?womanLabel
WHERE
{
  ?woman wdt:P31 wd:Q5;
         wdt:P21 wd:Q6581072;
         wdt:P106 wd:Q42603.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

can be rewritten like this:

SELECT ?woman ?womanLabel
WHERE
{
  BIND(wdt:P31 AS ?instanceOf).
  BIND(wd:Q5 AS ?human).
  BIND(wdt:P21 AS ?sexOrGender).
  BIND(wd:Q6581072 AS ?female).
  BIND(wdt:P106 AS ?occupation).
  BIND(wd:Q42603 AS ?priest).
  ?woman ?instanceOf ?human;
         ?sexOrGender ?female;
         ?occupation ?priest.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

The meaningful part of the query, from ?woman to ?priest., is now probably more readable. However, the large BIND block right in front of it is pretty distracting, so this technique should be used sparingly. (In the WDQS user interface, you can also hover your mouse over any term like wd:Q123 or wdt:P123 and see the label and description for the entity, so ?female is only more readable than wd:Q6581072 if you ignore that feature.)

Testing Values[edit]

BOUND(?variable)
tests if a variable has been bound to a value (returns true or false). It’s mostly useful on variables that are introduced in an OPTIONAL clause.
IF(condition,thenExpression,elseExpression) evaluates to thenExpression
if condition evaluates to true, and to elseExpression if condition evaluates to false.
IF(true, "yes", "no") evaluates to "yes"
IF(false, "great", "terrible") evaluates to "terrible"

IF expressions are often used with BOUND as the expression. For example, suppose you have a query that shows some humans, and instead of just showing their label, you’d like to display their pseudonym (P742) if they have one, and only use the label if a pseudonym doesn’t exist. For this, you select the pseudonym in an OPTIONAL clause (it has to be optional – you don’t want to throw out results that don’t have a pseudonym), and then use BIND(IF(BOUND(… to select either the pseudonym or the label.

SELECT ?writer ?label
WHERE
{
  # French writer born in the second half of the 18th century
  ?writer wdt:P31 wd:Q5;
          wdt:P27 wd:Q142;
          wdt:P106 wd:Q36180;
          wdt:P569 ?dob.
  FILTER("1751-01-01"^^xsd:dateTime <= ?dob && ?dob < "1801-01-01"^^xsd:dateTime)
  # get the English label
  ?writer rdfs:label ?writerLabel.
  FILTER(LANG(?writerLabel) = "en")
  # get the pseudonym, if it exists
  OPTIONAL { ?writer wdt:P742 ?pseudonym. }
  # bind the pseudonym, or if it doesn’t exist the English label, as ?label
  BIND(IF(BOUND(?pseudonym),?pseudonym,?writerLabel) AS ?label).
}

Try it!

Other properties that may be used in this way include nickname (P1449), posthumous name (P1786), and taxon common name (P1843) – anything where some sort of “fallback” makes sense.

You can also combine BOUND with FILTER to ensure that at least one of several OPTIONAL blocks has been fulfilled. For example, let’s get all astronauts that went to the moon, as well as the members of Apollo 13 (Q182252) (close enough, right?). That restriction can’t be expressed as a single property path, so we need one OPTIONAL clause for “member of some moon mission” and another one for “member of Apollo 13”. But we only want to select those results where at least one of those conditions is true.

SELECT ?astronaut ?astronautLabel
WHERE
{
  ?astronaut wdt:P31 wd:Q5;
             wdt:P106 wd:Q11631.
  OPTIONAL {
    ?astronaut wdt:P450 ?mission.
    ?mission wdt:P31 wd:Q495307.
  }
  OPTIONAL {
    ?astronaut wdt:P450 wd:Q182252.
    BIND(wd:Q182252 AS ?mission).
  }
  FILTER(BOUND(?mission))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Try it!

Grouping[edit]

So far, all the queries we’ve seen were queries that found all items satisfying some conditions; in some cases, we also included extra statements on the item (paintings with materials, Arthur Conan Doyle books with title and illustrator).

But it’s very common that we don’t want a long list of all results. Instead, we might ask questions like this:

  • How many paintings were painted on canvas / poplar wood / etc.?
  • What is the highest population of each country’s cities?
  • What is the total number of guns produced by each manufacturer?
  • Who publishes, on average, the longest books?

City populations[edit]

Let’s look at the second question for now. It’s fairly simple to write a query that lists all cities along with their population and country, ordered by country:

SELECT ?country ?city ?population
WHERE
{
  ?city wdt:P31/wdt:P279* wd:Q515;
        wdt:P17 ?country;
        wdt:P1082 ?population.
}
ORDER BY ?country

Try it!

(Note: that query returns a lot of results, which might cause trouble for your browser. You might want to add a LIMIT clause.)

Since we’re ordering the results by country, all cities belonging to a country form one contiguous block in the results. To find the highest population within that block, we want to consider the block as a group, and aggregate all the individual population values into one value: the maximum. This is done with a GROUP BY clause below the WHERE block, and an aggregate function (MAX) in the SELECT clause.

SELECT ?country (MAX(?population) AS ?maxPopulation)
WHERE
{
  ?city wdt:P31/wdt:P279* wd:Q515;
        wdt:P17 ?country;
        wdt:P1082 ?population.
}
GROUP BY ?country

Try it!

We’ve replaced the ORDER BY with a GROUP BY. The effect of this is that all results with the same ?country are now grouped together into a single result. This means that we have to change the SELECT clause as well. If we kept the old clause SELECT ?country ?city ?population, which ?city and ?population would be returned? Remember, there are many results in this one result; they all have the same ?country, so we can select that, but since they can all have a different ?city and ?population, we have to tell WDQS which of those values to select. That’s the job of the aggregate function. In this case, we’ve used MAX: out of all the ?population values, we select the maximum one for the group result. (We also have to give that value a new name with the AS construct, but that’s just a minor detail.)

This is the general pattern for writing group queries: write a normal query that returns the data you want (not grouped, with many results per “group”), then add a GROUP BY clause and add an aggregate function all the non-grouped variables in the SELECT clause.

Painting materials[edit]

Let’s try it out with another question: How many paintings were painted on each material? First, write a query that just returns all paintings along with their painting material. (Take care to only use those material used (P186) statements with an applies to part (P518)  painting surface (Q861259) qualifier.)

SELECT ?material ?painting
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
}

Try it!

Next, add a GROUP BY clause on the ?material, and then an aggregate function on the other selected variable (?painting). In this case, we are interested in the number of paintings; the aggregate function for that is COUNT.

SELECT ?material (COUNT(?painting) AS ?count)
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
}
GROUP BY ?material

Try it!

One problem with this is that we don’t have the label for the materials, so the results are a bit inconvenient to interpret. If we just add the label variable, we’ll get an error:

SELECT ?material ?materialLabel (COUNT(?painting) AS ?count)
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?material

Try it!

Bad aggregate

“Bad aggregate” is an error message you’ll probably see a lot when working with group queries; it means that one of the selected variables needs an aggregate function but doesn’t have one, or it has an aggregate function but isn’t supposed to have one. In this case, WDQS thinks that there might be multiple ?materialLabels per ?material (even though we know that can’t happen), and so it complains that you’re not specifying an aggregate function for that variable.

One solution is to group over multiple variables. If you list multiple variables in the GROUP BY clause, there’s one result for each combination of those variables, and you can select all those variables without aggregate function. In this case, we’ll group over both ?material and ?materialLabel.

SELECT ?material ?materialLabel (COUNT(?painting) AS ?count)
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?material ?materialLabel

Try it!

We’re almost done with the query – just one more improvement: we’d like to see the most-used materials first. Fortunately, we’re allowed to use the new, aggregated variables from the SELECT clause (here, ?count) in an ORDER BY clause, so this is very simple:

SELECT ?material ?materialLabel (COUNT(?painting) AS ?count)
WHERE
{
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
            p:P186 [ ps:P186 ?material; pq:P518 wd:Q861259 ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?material ?materialLabel
ORDER BY DESC(?count)

Try it!

As an exercise, let’s do the other queries too.

Guns by manufacturer[edit]

What is the total number of guns produced by each manufacturer?

Publishers by number of pages[edit]

What is the average (function: AVG) number of pages of books by each publisher?

HAVING[edit]

A small addendum to that last query – if you look at the results, you might notice that the top result has an outrageously large average, over ten times that of the second place. A bit of investigation reveals that this is because that publisher (UTET (Q4002388)) only published a single book with a number of pages (P1104) statement, Grande dizionario della lingua italiana (Q3775610), which skews the results a bit. To remove outliers like that, we could try to select only publishers that published at least two books with number of pages (P1104) statements on Wikidata.

How do we do that? Normally, we restrict results with a FILTER clause, but in this case we want to restrict based on the group (the number of books), not any individual result. This is done with a HAVING clause, which can be placed right after a GROUP BY clause and takes an expression just like FILTER does:

SELECT ?publisher ?publisherLabel (AVG(?pages) AS ?avgPages)
WHERE
{
  ?book wdt:P123 ?publisher;
        wdt:P1104 ?pages.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?publisher ?publisherLabel
HAVING(COUNT(?book) > 1)
ORDER BY DESC(?avgPages)

Try it!

Aggregate functions summary[edit]

Here’s a short summary of the available aggregate functions:

  • COUNT: the number of elements. You can also write COUNT(*) to simply count all results.
  • SUM, AVG: the sum or average of all elements, respectively. If the elements aren’t numbers, you’ll get weird results.
  • MIN, MAX: the minimum or maximum value of all elements, respectively. This works for all value types; numbers are sorted numerically, strings and other types lexically.
  • SAMPLE: any element. This is occasionally useful if you know there’s only one result, or if you don’t care which one is returned.
  • GROUP_CONCAT: concatenates all elements. Rarely useful, but if you’re curious, you can look it up in the SPARQL specification.

Additionally, you can add a DISTINCT modifier for any of these functions to eliminate duplicate results. For example, if there are two results but they both have the same value in ?var, then COUNT(?var) will return 2 but COUNT(DISTINCT ?var) will only return 1. You often have to use DISTINCT when your query can return the same item multiple times – this can happen if, for example, you use ?item wdt:P31/wdt:P279* ?class, and there are multiple paths from ?item to ?class: you will get a new result for each of those paths, even though all the values in the result are identical. (If you’re not grouping, you can also eliminate those duplicate results by starting the query with SELECT DISTINCT instead of just SELECT.)

And beyond…[edit]

This guide ends here. SPARQL doesn’t: there’s still a lot that I haven’t shown you – I never promised this was going to be a complete guide! If you got this far, you already know a lot about WDQS and should be able to write some very powerful queries. But if you want to learn even more, here are some things you can look at:

  • Subqueries. You add another entire query in curly brackets ({ SELECT ... WHERE { ... } LIMIT 10 }), and the results are visible in the outer query. (If you’re familiar with SQL, you’ll have to rethink the concept a bit – SPARQL subqueries are purely “bottom-up” and can’t use values from the outer query, like SQL “correlated subqueries” can.)
  • MINUS lets you select results that don’t fit some graph pattern. FILTER NOT EXISTS is mostly equivalent (see the SPARQL spec for an example where they differ), but – at least on WDQS – usually slower by quite a bit.
  • VALUES is a simple way to introduce multiple possible values for a variable.

Your main reference for these and other topics is the SPARQL specification.

And of course, there are some parts of Wikidata still missing as well, such as references, numeric precision (100±2.5), values with units (two kilograms), geocoordinates, sitelinks, statements on properties, and more. You can see how those are modeled as triples under mw:Wikibase/Indexing/RDF Dump Format.

  1. https://wiki.blazegraph.com/wiki/index.php/SPARQL_Order_Matters
  2. https://www.w3.org/TR/sparql11-query/#neg-notexists-minus
  3. https://jena.apache.org/documentation/query/negation.html
  4. https://wiki.blazegraph.com/wiki/index.php/SPARQL_Order_Matters