Wikidata:Tools/OpenRefine/Editing/Tutorials/Third-party reconciliation

Sometimes the source you want to import data from is huge. For instance, data sources such as company registers hold far more records than Wikidata will ever have in the corresponding domain. In that case, the usual workflow of loading the source database into OpenRefine and reconciling it against Wikidata is impractical: the database is too big, reconciliation takes too long, and it will very rarely surface good matches (because the vast majority of records from the source database do not and should not exist in Wikidata).

This tutorial explains how to turn the problem around: we will instead extract existing Wikidata items with a SPARQL query that targets the corresponding domain, and reconcile these items against our data source. Our goal will be to add authority control identifiers such as VIAF ID (P214) and GND ID (P227) to items about people. We will use the LOBID reconciliation service, which lets us match records against the Integrated Authority File (Q36578) (GND).

Extracting target items with a SPARQL query

Say we want to improve how well items about German researchers are linked to authority files. We can retrieve a list of German researchers missing a GND ID (P227) like this:

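SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5;            # instance of: human
        wdt:P106 wd:Q1650915;     # occupation: researcher
        wdt:P27 wd:Q183.          # country of citizenship: Germany
  FILTER NOT EXISTS { ?item wdt:P227 ?gnd }   # keep only items without a GND ID
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 100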

Of course, this query and its limit are arbitrary: we could equally look for Brazilian organizations or Latvian places. The goal is simply to narrow the domain down to items that are likely to have an entry in the target database.
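If you prefer to fetch the query results with a script rather than downloading them from the query service interface, the following is a minimal Python sketch using only the standard library. It assumes the public endpoint https://query.wikidata.org/sparql; the User-Agent string, the function names and the English-only label setting are illustrative choices, not part of the tutorial.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumed here).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above, with labels requested in English only.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P106 wd:Q1650915;
        wdt:P27 wd:Q183.
  FILTER NOT EXISTS { ?item wdt:P227 ?gnd }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 100
"""


def build_request(query: str) -> urllib.request.Request:
    """Build a GET request asking the endpoint for SPARQL JSON results."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical User-Agent; replace it with your own contact details.
        "User-Agent": "third-party-recon-tutorial/0.1 (example@example.org)",
    }
    return urllib.request.Request(url, headers=headers)


def bindings_to_csv(results: dict) -> str:
    """Flatten SPARQL JSON results into CSV text with one row per item."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["item", "itemLabel"])
    for binding in results["results"]["bindings"]:
        # Item URIs end with the Qid, so keep only the last path segment.
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        label = binding.get("itemLabel", {}).get("value", "")
        writer.writerow([qid, label])
    return out.getvalue()


def fetch_csv(query: str = QUERY) -> str:
    """Run the query over the network and return the results as CSV text."""
    with urllib.request.urlopen(build_request(query)) as resp:
        return bindings_to_csv(json.load(resp))
```

Calling fetch_csv() and saving the returned text to a file gives a two-column CSV (Qid and label) that can be imported into OpenRefine.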

Reconciling with GND

Import the results of this query into OpenRefine. The first column contains Qids, which can be reconciled directly to Wikidata (Reconcile → Start reconciling, then choose the Wikidata service). We will also reconcile the second column, but this time against GND itself. To do that, click Reconcile → Start reconciling, then Add standard service, and enter the address of the GND reconciliation service run by LOBID: https://lobid.org/gnd/reconcile

Screenshot of the dialog to add a new reconciliation service

Just like for Wikidata, you can restrict the reconciliation by type and refine it via properties (see the documentation of the service for more details). You can then match items against GND:

Screenshot of the reconciliation process with LOBID
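OpenRefine talks to the service for you, but it can help to see roughly what a reconciliation request looks like. The sketch below builds a batch of queries in the shape used by the reconciliation API (a JSON object sent in a `queries` parameter). The helper names, the example names and the type identifier "DifferentiatedPerson" are assumptions for illustration; check the documentation of the service for the types it actually supports.

```python
import json
import urllib.parse

# Endpoint given in this tutorial.
LOBID_RECONCILE = "https://lobid.org/gnd/reconcile"


def build_queries(names, type_id=None, limit=3):
    """Build a reconciliation query batch keyed q0, q1, ... for a list of names."""
    batch = {}
    for i, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            # Restrict candidates to one type, as in the OpenRefine dialog.
            query["type"] = type_id
        batch[f"q{i}"] = query
    return batch


def encode_form(batch):
    """Encode the batch as a form body carrying the `queries` parameter."""
    return urllib.parse.urlencode({"queries": json.dumps(batch)})


# Example batch for two labels taken from the second column (example values).
example = build_queries(["Emmy Noether", "Max Planck"], type_id="DifferentiatedPerson")
```

The encoded body could then be sent to the service with an HTTP POST; the response maps each key (q0, q1, ...) to a list of scored candidates.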

Retrieving the identifiers

Once you have matched items, you can obtain the GND ID by adding a column based on the reconciled column with the GREL expression cell.recon.match.id, and the reference name in GND with cell.recon.match.name. You can also obtain this information (and much more) by using the Add columns from reconciled values operation:

Screenshot of the dialog to fetch data from GND
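To make the two expressions above more concrete, here is a rough Python analogue operating on the candidate list a reconciliation service returns for one cell. It is only a sketch: in OpenRefine the match can also be chosen by hand, whereas this function simply takes the first candidate the service flagged as a match. The identifiers and names in the usage example are hypothetical placeholders.

```python
def pick_match(candidates):
    """Return (id, name) of the first candidate flagged as a match, else None.

    Roughly mirrors cell.recon.match.id and cell.recon.match.name for a cell
    whose match was taken directly from the service response.
    """
    for candidate in candidates:
        if candidate.get("match"):
            return candidate["id"], candidate["name"]
    return None


# Hypothetical candidates for one cell (placeholder identifiers).
candidates = [
    {"id": "example-id-1", "name": "Doe, John", "score": 41.0, "match": False},
    {"id": "example-id-2", "name": "Doe, Jane", "score": 98.0, "match": True},
]
matched = pick_match(candidates)
```

Cells with no matched candidate yield None here, just as the GREL expressions yield an empty value for unmatched cells.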

Adding the ids to Wikidata

We can then create a schema to add the identifiers to Wikidata. You can also add the reference name from GND as an alias to the items:

Example schema to add the ids

This gives the following candidate edits:

Preview of the edits made
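The schema is built in OpenRefine's schema editor, not in code, but the mapping it expresses can be sketched in a few lines of Python. This is an illustration only: the row keys (qid, label, gnd_id, gnd_name) are hypothetical column names, and skipping an alias that equals the label is a design choice of this sketch, not documented OpenRefine behaviour.

```python
def candidate_edits(rows):
    """Return one edit description per row that has a matched GND ID."""
    edits = []
    for row in rows:
        gnd_id = row.get("gnd_id")
        if not gnd_id:
            continue  # unmatched rows produce no edit
        # One GND ID (P227) statement on the item identified by the Qid column.
        edit = {"item": row["qid"], "property": "P227", "value": gnd_id}
        gnd_name = row.get("gnd_name")
        if gnd_name and gnd_name != row.get("label"):
            # Only propose the GND reference name when it adds something new.
            edit["alias"] = gnd_name
        edits.append(edit)
    return edits


# Hypothetical rows with placeholder identifiers.
rows = [
    {"qid": "Q_EXAMPLE_1", "label": "Jane Doe", "gnd_id": "example-id-2", "gnd_name": "Doe, Jane"},
    {"qid": "Q_EXAMPLE_2", "label": "John Doe", "gnd_id": "", "gnd_name": ""},
]
edits = candidate_edits(rows)
```

Each resulting entry corresponds to one line of the edit preview: an identifier statement and, optionally, an alias.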

These edits can then be uploaded to Wikidata.

Other reconcilable data sources

Various other data sources can be queried via reconciliation services.

You can also create your own reconciliation service for other databases, for instance with reconcile-csv or conciliator, or by implementing the reconciliation API yourself.
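As a rough idea of what implementing the API yourself involves, the sketch below shows only the matching core: it takes a batch of queries and returns scored candidates per query key. It is not a complete service. A real implementation also needs an HTTP endpoint, a service manifest and candidate types, and usually a better matching strategy. The RECORDS table, its identifiers and the scoring by difflib string similarity are all assumptions made for this example.

```python
import difflib
import json

# Hypothetical local database: identifier -> name.
RECORDS = {
    "rec-1": "Example University",
    "rec-2": "Example Institute of Technology",
}


def reconcile(queries_json, records=RECORDS):
    """Answer a batch of reconciliation queries with scored candidates."""
    queries = json.loads(queries_json)
    response = {}
    for key, query in queries.items():
        text = query["query"].lower()
        limit = query.get("limit", 3)
        scored = []
        for rec_id, name in records.items():
            # Similarity in [0, 1] between the query and the record name.
            ratio = difflib.SequenceMatcher(None, text, name.lower()).ratio()
            scored.append({
                "id": rec_id,
                "name": name,
                "score": round(ratio * 100, 1),
                # Only an exact (case-insensitive) hit is flagged as a match.
                "match": ratio == 1.0,
            })
        scored.sort(key=lambda c: c["score"], reverse=True)
        response[key] = {"result": scored[:limit]}
    return response


# Example: one query, asking for the single best candidate.
out = reconcile(json.dumps({"q0": {"query": "example university", "limit": 1}}))
```

Flagging only exact hits as matches keeps automatic matching conservative and leaves close candidates for manual review in OpenRefine.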