User:Lucas Werkmeister (WMDE)/LOD Cloud

We need to collect some information in order to get Wikidata into the Linked Open Data cloud diagram (Q43984865).

Number of triples

SELECT ?count (NOW() AS ?asOf) WITH {
  SELECT (COUNT(*) AS ?count) WHERE {
    ?s ?p ?o.
  }
} AS %count WHERE {
  INCLUDE %count.
}

Result as of : 12526206009 (there’s some mismatch between the different backend servers, so let’s just say ~12.5 billion)

Links to other datasets

We count as links all external identifier properties that have a formatter URI for RDF resource (P1921) and whose associated item has a Linked Open Data Cloud ID (P8605). The following script counts the statements, qualifiers and references for each of them:

#!/usr/bin/env bash

TZ=UTC printf 'Data as of {{ISOdate|%(%FT%R:00%Z)T}}: \n\n'

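# Step 1: for each dataset with a LOD Cloud ID, fetch the "|"-joined IDs of
# its external-ID properties that have an RDF formatter URI (TSV output).
curl -s \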
    -H 'Accept: text/tab-separated-values' \
    --data-urlencode query='
      SELECT (GROUP_CONCAT(DISTINCT STRAFTER(STR(?property), STR(wd:)); separator = "|") AS ?propertyIds) ?lodCloudId WHERE {
        ?dataset wdt:P8605 ?lodCloudId;
                 wdt:P1687 ?property.
        ?property wikibase:propertyType wikibase:ExternalId;
                  wdt:P1921 ?rdfFormatterUri.
      }
      GROUP BY ?dataset ?lodCloudId
    ' \
    https://query.wikidata.org/sparql \
    | tail -n+2 \
    | while IFS=$'\t' read -r propertyIds_ lodCloudId; do
    lodCloudId=${lodCloudId#'"'}
    lodCloudId=${lodCloudId%'"'}
    propertyIds_=${propertyIds_#'"'}
    propertyIds_=${propertyIds_%'"'}
    IFS='|' read -ra propertyIds <<< "$propertyIds_"
    total=0
    for propertyId in "${propertyIds[@]}"; do
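        # Step 2: count uses of this property as statement value (ps:),
        # qualifier (pq:) and reference (pr:), summed into one number.
        count=$(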
            curl -s \
                 -H 'Accept: text/tab-separated-values' \
                 --data-urlencode query="
                   SELECT (?asStatement + ?asQualifier + ?asReference AS ?count)
                   WITH {
                     SELECT (COUNT(*) AS ?asStatement) WHERE {
                       ?subject ps:$propertyId [].
                     }
                   } AS %asStatement
                   WITH {
                     SELECT (COUNT(*) AS ?asQualifier) WHERE {
                       ?subject pq:$propertyId [].
                     }
                   } AS %asQualifier
                   WITH {
                     SELECT (COUNT(*) AS ?asReference) WHERE {
                       ?subject pr:$propertyId [].
                     }
                   } AS %asReference
                   WHERE {
                     INCLUDE %asStatement.
                     INCLUDE %asQualifier.
                     INCLUDE %asReference.
                   }
                 " \
                 https://query.wikidata.org/sparql \
                | tail -n+2
             )
        ((total+=count))
    done
    printf '%s\t%s\t%s\n' "$propertyIds_" "$total" "$lodCloudId"
done \
    | sort -t$'\t' -k2 -rn \
    | while IFS=$'\t' read -r propertyIds_ count lodCloudId; do
    IFS='|' read -ra propertyIds <<< "$propertyIds_"
    printf '; %s ({{#commaseparatedlist:{{P|%s}}' "$lodCloudId" "${propertyIds[0]}"
    if ((${#propertyIds[@]} > 1)); then
        printf '|{{P|%s}}' "${propertyIds[@]:1}"
    fi
    printf '}}): %d\n' "$count"
done
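
The parsing and formatting steps of the script above can be tried offline, without sending any query. The following is only a sketch: the helper names strip_quotes and format_row are hypothetical (they do not appear in the script), and the sample row is the libris entry from the results on this page, written the way a TSV response would quote its string cells.

```bash
#!/usr/bin/env bash
# Offline sketch of the TSV clean-up and wikitext formatting steps.
# strip_quotes and format_row are hypothetical helper names.

# String literals in a TSV result arrive wrapped in double quotes;
# remove one leading and one trailing quote.
strip_quotes() {
    local value=$1
    value=${value#'"'}
    value=${value%'"'}
    printf '%s' "$value"
}

# Turn "P5587|P906|P1182", a count and a LOD Cloud ID into one
# definition-list line of wikitext.
format_row() {
    local propertyIds_=$1 count=$2 lodCloudId=$3
    local -a propertyIds
    # Split the "|"-joined property IDs into an array.
    IFS='|' read -ra propertyIds <<< "$propertyIds_"
    printf '; %s ({{#commaseparatedlist:{{P|%s}}' "$lodCloudId" "${propertyIds[0]}"
    if ((${#propertyIds[@]} > 1)); then
        # printf repeats its format once per remaining argument.
        printf '|{{P|%s}}' "${propertyIds[@]:1}"
    fi
    printf '}}): %d\n' "$count"
}

# Sample row taken from the results on this page (no query is run).
format_row "$(strip_quotes '"P5587|P906|P1182"')" 258242 "$(strip_quotes '"libris"')"
```

Pasted into the wiki, the line this prints renders as a definition-list entry: the LOD Cloud ID with its comma-separated properties as the term, and the count as the definition.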

Data as of 2021-04-30T12:04:00UTC:

doi (DOI (P356))
27367486
viaf (VIAF ID (P214))
6148897
freebase (Freebase ID (P646))
4397020
geonames-semantic-web (GeoNames ID (P1566))
3747452
uniprotkb (UniProt protein ID (P352), UniProt journal ID (P4616))
2537481
uniprot (UniProt protein ID (P352), UniProt journal ID (P4616))
2537481
dnb-gemeinsame-normdatei (GND ID (P227))
1918745
lcsh (Library of Congress authority ID (P244))
1338101
data-bnf-fr (Bibliothèque nationale de France ID (P268))
895058
idreffr (IdRef ID (P269))
528040
oclc-fast (FAST ID (P2163))
502221
zitgist-musicbrainz (MusicBrainz release ID (P5813), MusicBrainz event ID (P6423), MusicBrainz area ID (P982), MusicBrainz instrument ID (P1330), MusicBrainz series ID (P1407), MusicBrainz recording ID (P4404), MusicBrainz artist ID (P434), MusicBrainz work ID (P435), MusicBrainz release group ID (P436), MusicBrainz label ID (P966), MusicBrainz place ID (P1004))
490029
open-library (Open Library ID (P648))
271983
libris (Libris-URI (P5587), SELIBR ID (P906), LIBRIS editions (P1182))
258242
datos-bne-es (National Library of Spain ID (P950))
207362
eunis (EUNIS ID for species (P6177))
179098
swedish-open-cultural-heritage (Swedish Open Cultural Heritage URI (P1260))
164146
europeana-sparql (Europeana entity (P7704))
160460
national-diet-library-authorities (NDL Authority ID (P349))
124052
chembl-rdf (ChEMBL ID (P592))
100223
bioportal-msh (MeSH term ID (P6680), MeSH concept ID (P6694), MeSH descriptor ID (P486), MeSH tree code (P672))
77724
bioportal-mesh-owl (MeSH term ID (P6680), MeSH concept ID (P6694), MeSH descriptor ID (P486), MeSH tree code (P672))
77714
babelnet (BabelNet ID (P2581))
65942
bag (BAG building ID (P5208), BAG residence ID (P981), BAG public space ID (P5207))
61218
bluk-bnb (BNB person ID (P5361))
42519
gemeenschappelijke-thesaurus-audiovisuele-archieven (GTAA ID (P1741))
39556
data-persee-fr (Persée author ID (P2732), Persée journal ID (P2733))
34064
getty-tgn (Getty Thesaurus of Geographic Names ID (P1667))
29226
rism (RISM ID (P5504))
27329
getty-aat (Art & Architecture Thesaurus ID (P1014))
25080
gutenberg (Project Gutenberg ebook ID (P2034), Project Gutenberg author ID (P1938))
19945
BVMC (BVMC person ID (P2799), BVMC work ID (P3976))
16483
yso (YSO ID (P2347))
14180
glottolog (Glottolog code (P1394))
10800
clld-glottolog (Glottolog code (P1394))
10800
linked-open-numbers (KIT Linked Open Numbers ID (P5176))
10327
hungarian-national-library-catalog (NSZL name authority ID (P3133))
6825
pleiades (Pleiades ID (P1584))
5625
sandrart-net (Sandrart.net person ID (P1422), Sandrart.net artwork ID (P4380))
5172
thesesfr (Theses.fr person ID (P4285))
4370
bbc-programmes (BBC programme ID (P827))
4202
eurovoc (EuroVoc ID (P5437))
3068
stw-thesaurus-for-economics (STW Thesaurus for Economics ID (P3911))
2663
nomisma_org (Nomisma ID (P2950))
1801
bioportal-snomedct (SNOMED CT ID (P5806))
801
iptc-newscodes (IPTC NewsCode (P5429))
782
bioportal-unitsontology (wurvoc.org measure ID (P3328))
610
cpv-2008 (Common Procurement Vocabulary (P5417), CPV Supplementary (P8984))
422
msc (Mathematics Subject Classification ID (P3285))
210
naics-2012 (NAICS code (P3224))
45
warsampo (FI WarSampo person ID (P3817))
17

Additionally, the following two datasets are “special” and I haven’t yet figured out a better way to represent them:

lexvo (ISO 639-3 code (P220))
8300
ocd (Italian Chamber of Deputies dati ID (P1341))
12881