Wikidata:Tools/OpenRefine/Editing/Tutorials/Working with APIs

From Wikidata
Jump to navigation Jump to search

In this tutorial, we are going to extract information from a web API and upload it to Wikidata. An API is a machine-readable version of a website, which eases the automated extraction of information from this website.

Getting started[edit]

ORCID is an initiative to provide unique identifiers to scholarly authors. Anyone can create its own profile on ORCID and get a unique identifier that they can use when authoring research articles or other artifacts. In Wikidata, the ORCID iD (P496) property can be used to link to these profiles, which often contain information about the employer of the researcher. For instance, at the time of writing Andreas Küttel (Q289574) does not indicate any affiliation, but links to which lists an Employment record. Our goal is to import this information in Wikidata.

First, let us write a small SPARQL query to find examples of researchers that do not have any affiliation on their Wikidata item, but do have an ORCID iD (P496).

SELECT ?researcher ?orcid WHERE {
  ?researcher wdt:P31 wd:Q5;
              wdt:P496 ?orcid.
  FILTER NOT EXISTS { ?researcher wdt:P108 ?institution }
Run it!

Run this query and download its result as a table. Create an OpenRefine project with it (the standard settings for the importer should be adequate). The first column contains URIs of Wikidata items. For now, OpenRefine treats them as any other URL: they are technically not reconciled to Wikidata. To give these cells a reconciled status, you need to run reconciliation on this column, which should be significantly quicker than usual (as no search happens - the Qids are just validated and returned).[1]

Querying the API[edit]

ORCID provides a public API that can be used without registration to retrieve the public contents of any ORCID profile. They have a quick tutorial explaining how the API works. You do not need to fully understand ORCID's description of the API - their guide is quite complicated because it covers advanced features that we will not need here. All we need to know is which URL we should retrieve to obtain the affiliation information, in a structured format. To retrieve the contents of the Employment section, we need to call the following address: ""+value+"/employments", where +value+ is replaced by the ORCID identifier of the person. By default, this returns a response formatted in XML, but we can ask to get JSON instead, by adding the HTTP header Accept: application/json. We can do all this in OpenRefine, using the Edit columnAdd column by fetching URLs operation on the ORCID column:

Wikidata editing with OpenRefine. API tutorial. Configuration of the URL fetching operation.
Wikidata editing with OpenRefine. API tutorial. Configuration of the URL fetching operation.

Extracting information from JSON responses[edit]

State of the project after retrieving the JSON responses.

Once the operation is complete, you obtain the JSON responses in a new column. We now need to extract information from this. Each response potentially contains a list of employers, so the first step is to extract the sub-objects for each employer on a separate line, according to OpenRefine's record model. This is unfortunately a slightly inelegant process:

  • we first need to extract the list of employers by parsing the JSON payload, using the Add column based on this column operation (in the Edit column menu of the employments column) with this expression: value.parseJson()["employment-summary"].join('###')
  • we then use the Split multi-valued cells operation (in the Edit cells menu of the new column), specifying the same separator ###, to obtain each object on its own row.

Once we have this, we can again use parseJson() to extract information from this object. We will extract the following fields (again using the Add column based on this column operation):[2]

  • Organization name, with value.parseJson()["organization"]["name"]
  • Organization country, with value.parseJson()["organization"]["address"]["country"]
  • Organization identifier, with value.parseJson()["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"]
  • Start date of the position, with value.parseJson()["start-date"]["year"]["value"]
  • End date of the position, with value.parseJson()["end-date"]["year"]["value"]
  • (Exercise: you can also extract position titles, as they can be added as qualifiers with position held (P39)).

We can now discard the JSON columns and reorder the others. This can be achieved easily with the Re-order / remove columns operation available under Edit columns from the All menu in the root column. You should get something like this:

Wikidata editing with OpenRefine. API tutorial. State of the project after extraction of fields from the JSON payload
Wikidata editing with OpenRefine. API tutorial. State of the project after extraction of fields from the JSON payload

Reconciling the institutions[edit]

Our researchers are already reconciled, because we just got them from our SPARQL query. However, the institutions need to be disambiguated. We can reconcile them by name, but that alone will not get us very far, because these names are often ambiguous. We can refine by two parameters: the country and the identifier.

  • ORCID represents countries with their ISO 3166-1 alpha-2 code (P297). However, because this property is found not on the item of the institution but of its country (P17), we cannot just use ISO 3166-1 alpha-2 code (P297) in the reconciliation dialog. We need two hops to get from the item to the country code. The Wikidata reconciliation service provides a syntax for that (inspired by SPARQL): P17/P297.
  • The identifiers we have pulled from the JSON payload are not always the same sort of identifiers. Very often they are Ringgold ID (P3500), but sometimes they are GRID ID (P2427) or DOI (P356) (that is actually stored as Crossref funder ID (P3153) in Wikidata). ORCID provides the type of identifier that they supply, so we could use that to separate them in different columns and use each column in the reconciliation dialog. This time we are going to use a trick instead: as we know that the formats of these identifiers are incompatible, we will directly ask the reconciliation interface to pick the identifier for us. This can be achieved using the | character to introduce a disjunction: P3500|P2427 will match on the union of both properties.[3] We obtain the following reconciliation dialog:
Wikidata editing with OpenRefine. API tutorial. Configuration of the reconciliation operation.
Wikidata editing with OpenRefine. API tutorial. Configuration of the reconciliation operation.

Creating the schema[edit]

Once reconciliation is done, we can now create the schema. Consult employer (P108) to learn about the expected structure of such claims. It looks like the following could do:

Wikidata Editing with OpenRefine. API tutorial. Schema.
Wikidata Editing with OpenRefine. API tutorial. Schema.

Now, we are not quite done because our table has a record structure: some of the cells in the researcher column are empty (when a researcher has multiple employers on their profile). Because schema evaluation works row-wise, the corresponding statements are currently skipped (because their subject is empty). To fix that, we can use the Fill down command (from the Edit cells menu of the researcher column). This will spread the values down the records. You should also do this for the "orcid" column, because these values are used in references. Once this is done, you can preview the edits in the Preview tab, and upload them with the Upload edits to Wikidata option of the Wikidata menu in the top right corner. You should get edits like this or that. Note that as these edits introduce multiple statements at once, their automatic edit summaries are less informative, so it is important to specify a meaningful edit summary to compensate for that.

OpenRefine Wikidata advanced editing tutorial, preview of edits adding affiliation information to researchers.
OpenRefine Wikidata advanced editing tutorial, preview of edits adding affiliation information to researchers.

What have we done?[edit]

We have been through a fairly tedious process for just 200 researchers. But the good news is that we do not need to go through all this next time. OpenRefine's undo/redo tab is a bit more advanced than what you might be used to from other software. Not only can you roll back your project to any intermediate state, but it also lets you extract the description of the operations as a JSON object, and reapply these operations to a fresh project. So, if you run the SPARQL query again (potentially with a higher limit), you can apply your edit history on the new project. If you are using OpenRefine to perform your edits directly, that counts as an operation too, so OpenRefine will do that as well. Congratulations, you have written an advanced Wikidata bot without a single line of code!

Below is a sample OpenRefine edit history that you can apply to the results of the SPARQL query to achieve the same results:


  1. This might seem like a waste of time: this reconciliation operation should be trivial. In practice, OpenRefine does fetch information for each item: its label, but also their types which are stored internally and that you can access with the cell.recon.match.type expression.
  2. It is possible to obtain shorter expressions by creating intermediate columns (such as value.parseJson()["organization"]).
  3. For simplicity we ignore the DOI (P356) because their format does not match the one used on Wikidata. Exercise: remove the DOI prefix from these values so that you can use them for reconciliation too.