Help:Dataset sizing

Purpose

This page lists and defines a few standard metrics that can be computed on a subset of Wikidata items.

For metrics used elsewhere, it attempts to provide queries that can be run on the Wikidata Query Service.

Version

This is the version as of 20200823075556. Please use the "permanent link" on the left side when quoting this page.

Introduction

Sample queries to select the items:

  • sleds: SELECT ?item WHERE { ?item wdt:P279* wd:Q181388 }
  • tennis: SELECT ?item WHERE { ?item wdt:P641 wd:Q847 }


Knowledge Graphs on the Web -- an Overview (Q86997852) proposes a few metrics:

  • a. # instances
  • b. # assertions
  • c. average linking degree
  • d. median ingoing edges
  • e. median outgoing edges
  • f. # classes
  • g. # relations
  • h. average depth of class tree
  • i. average branching factor of class tree (average width of class tree)
  • j. ontological complexity

They are described in section 3, "Comparison of Knowledge Graphs", of the paper.

Discussion at Wikidata:Request a query#Dataset sizing.

The queries below are mostly based on truthy main statements (wdt:), not on qualifiers (pq:), references (pr:), sitelinks, or labels/descriptions/aliases. Please help expand them or add alternative ways to calculate the metrics.

A few other metrics are included as well.

Basic metrics

number of instances

definition
number of distinct items
#  a. # instances
SELECT (COUNT(DISTINCT ?item) as ?nb_instance)
WHERE
{
     ?item wdt:P279* wd:Q181388 .
     # ?item wdt:P641 wd:Q847 .
}

number of assertions

#  b. # assertions
# TBD: include sitelinks?
SELECT (SUM(?st) as ?nb_assertions) 
WITH 
{
    SELECT DISTINCT ?item ?st 
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
        # ?item wdt:P641 wd:Q847 .
        ?item wikibase:statements ?st . 
    }      
} as %a
{
  INCLUDE %a 
}

average linking degree

#  c. average linking degree
# TBD: include incoming links?
SELECT (AVG(?st) as ?avg_linking_degree)
WITH 
{
    SELECT DISTINCT ?item ?st 
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
        # ?item wdt:P641 wd:Q847 .
        ?item wikibase:statements ?st . 
    }      
} as %a
{
  INCLUDE %a 
}
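
The query above averages the same per-item statement counts that the previous query sums, so metric c should equal metric b divided by metric a, provided every selected item has a wikibase:statements value. A minimal Python sketch of that arithmetic follows; the function name and the counts are made up for illustration and are not data from Wikidata.

```python
def average_linking_degree(statement_counts):
    """Mean statements per item, mirroring AVG(?st) over the selected items."""
    if not statement_counts:
        raise ValueError("no items selected")
    return sum(statement_counts) / len(statement_counts)


# Hypothetical wikibase:statements values for five items (made-up numbers).
example_counts = [3, 7, 2, 10, 8]

nb_instances = len(example_counts)                           # metric a
nb_assertions = sum(example_counts)                          # metric b
avg_linking_degree = average_linking_degree(example_counts)  # metric c

print(nb_assertions, avg_linking_degree)  # prints: 30 6.0
```

In practice this means only two of the three queries need to be run for a given subset.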

median ingoing edges

#  d. median ingoing edges
# The query returns the number of ingoing edges per item (items with none are absent from the results).
# Calculate the median of ?nb_ingoing_edges from the results afterwards.
SELECT ?item (COUNT(?wdt) as ?nb_ingoing_edges) 
WITH 
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
     # ?item wdt:P641 wd:Q847 .
    }      
} as %a
{
  INCLUDE %a 
  ?p wikibase:directClaim ?wdt ; wikibase:propertyType wikibase:WikibaseItem .
  [] ?wdt ?item 
}
GROUP BY ?item
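
The query above only returns items that have at least one ingoing edge, so a median taken over its rows alone ignores items with zero ingoing edges. Whether such items should count is a definitional choice. The Python sketch below assumes they should and pads them with 0; the function name, item identifiers and counts are hypothetical.

```python
from statistics import median


def median_edge_count(counts_by_item, selected_items):
    """Median edge count over all selected items, treating absent items as 0."""
    values = [counts_by_item.get(item, 0) for item in selected_items]
    if not values:
        raise ValueError("no items selected")
    return median(values)


# Hypothetical subset of five items; only three of them appear in the query result.
selected = ["item1", "item2", "item3", "item4", "item5"]
counts = {"item1": 4, "item2": 1, "item3": 9}

print(median(counts.values()))              # rows only: prints 4
print(median_edge_count(counts, selected))  # zero-padded: prints 1
```

The list of selected items can come from the number-of-instances selection used throughout this page.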


median outgoing edges

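#  e. median outgoing edges
# The query returns the number of outgoing edges per item, counting item-valued properties only (items with none are absent from the results).
# Calculate the median of ?nb_outgoing_edges from the results afterwards.
# Alternative method: include external-id properties.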
SELECT ?item (COUNT(?wdt) as ?nb_outgoing_edges) 
WITH 
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
     # ?item wdt:P641 wd:Q847 .
    }      
} as %a
{
  INCLUDE %a 
  ?p wikibase:directClaim ?wdt ; wikibase:propertyType wikibase:WikibaseItem .
  ?item ?wdt []
}
GROUP BY ?item
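
The median itself has to be computed outside the query. As one possible approach, the Python sketch below reads a result in the standard SPARQL JSON results format and takes the median of ?nb_outgoing_edges with the statistics module. The sample document, its entity URIs and its counts are invented for illustration.

```python
import json
from statistics import median


def counts_from_sparql_json(text, item_var, count_var):
    """Map each item URI to its integer count from a SPARQL JSON results document."""
    bindings = json.loads(text)["results"]["bindings"]
    return {row[item_var]["value"]: int(row[count_var]["value"]) for row in bindings}


# Invented sample in the SPARQL JSON results layout (not real Wikidata output).
SAMPLE = """
{
  "head": {"vars": ["item", "nb_outgoing_edges"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://example.org/entity/A"},
     "nb_outgoing_edges": {"type": "literal", "value": "5"}},
    {"item": {"type": "uri", "value": "http://example.org/entity/B"},
     "nb_outgoing_edges": {"type": "literal", "value": "2"}},
    {"item": {"type": "uri", "value": "http://example.org/entity/C"},
     "nb_outgoing_edges": {"type": "literal", "value": "8"}},
    {"item": {"type": "uri", "value": "http://example.org/entity/D"},
     "nb_outgoing_edges": {"type": "literal", "value": "3"}}
  ]}
}
"""

outgoing = counts_from_sparql_json(SAMPLE, "item", "nb_outgoing_edges")
median_outgoing = median(outgoing.values())
print(median_outgoing)  # prints: 4.0
```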


number of relations

#  g. # relations
# Currently counts the distinct properties used on, or pointing to, the items; could be expanded to other kinds of relations.

SELECT (COUNT(DISTINCT ?wdt) as ?nb_relations) 
WITH 
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
     #  ?item wdt:P641 wd:Q847 .

    }      
} as %a
{
  INCLUDE %a 
  ?p wikibase:directClaim ?wdt .
  { ?item ?wdt [] } UNION { [] ?wdt ?item }
}

number of classes (types)

definition
number of distinct values used with instance of (P31) or subclass of (P279)
query
#  f. # classes
SELECT (COUNT(DISTINCT ?class) as ?nb_classes) 
WITH 
{
    SELECT DISTINCT ?item
    WHERE
    {
        ?item wdt:P279* wd:Q181388 .
     # ?item wdt:P641 wd:Q847 .
    }      
} as %a
{
  INCLUDE %a 
  ?item (wdt:P31|wdt:P279) ?class       
}
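
Metrics h and i from the introduction (average depth and average branching factor of the class tree) have no query on this page. As a rough sketch only, they could be computed offline from exported subclass of (P279) pairs. The definitions used here are assumptions and may differ from the paper's: depth is the shortest distance of each leaf class from the root, and the branching factor is the mean number of direct subclasses over classes that have at least one. The function name and the toy edges are hypothetical.

```python
from collections import defaultdict, deque


def class_tree_stats(subclass_edges, root):
    """Return (average leaf depth, average branching factor) below `root`.

    `subclass_edges` holds (child, parent) pairs, one per subclass-of link.
    Depth is the shortest distance from `root`, since a class can have several parents.
    """
    children = defaultdict(set)
    for child, parent in subclass_edges:
        children[parent].add(child)

    depth = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first walk, so the first depth found is the shortest
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)

    leaf_depths = [d for node, d in depth.items() if not children.get(node)]
    fan_outs = [len(children[node]) for node in depth if children.get(node)]
    avg_depth = sum(leaf_depths) / len(leaf_depths) if leaf_depths else 0.0
    avg_branching = sum(fan_outs) / len(fan_outs) if fan_outs else 0.0
    return avg_depth, avg_branching


# Toy hierarchy: R has subclasses A and B; A has subclasses C and D.
toy_edges = [("A", "R"), ("B", "R"), ("C", "A"), ("D", "A")]
avg_depth, avg_branching = class_tree_stats(toy_edges, "R")
print(avg_depth, avg_branching)
```

In the toy hierarchy the leaves B, C and D sit at depths 1, 2 and 2, and both R and A have two direct subclasses.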

Most frequent

most frequently used properties

definition
properties most frequently used as main values (truthy values)
query

most frequent sitelinks

definition
most frequently linked WMF sites (Wikipedia, Commons, Wikisource, etc.)
query

most frequently used classes (types)

definition
most frequent values used with instance of (P31) or subclass of (P279). Sometimes limited to P31.
query
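
No query is listed yet for this metric. However the rows are obtained, the ranking step is a frequency count. A minimal Python sketch, assuming one class value per (item, class) row has already been exported; the function name and the sample values are hypothetical.

```python
from collections import Counter


def most_frequent(values, limit=10):
    """Return the `limit` most common values with their counts, most common first."""
    return Counter(values).most_common(limit)


# Hypothetical class values, one per (item, class) row; the labels are made up.
rows = ["class_x", "class_y", "class_x", "class_z", "class_x", "class_y"]
top = most_frequent(rows, limit=2)
print(top)  # prints: [('class_x', 3), ('class_y', 2)]
```

The same tally applies to rows of properties or sitelink sites for the other two "most frequent" metrics.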