Wikidata:Tools/OpenRefine/Editing/Tutorials/Basic editing
OpenRefine includes a Wikidata extension which lets you transform tabular data into Wikidata statements. This tutorial gives a broad overview of how this works.
Getting started
[edit]In this tutorial, we are going to import data about universities in the UK. Our source is a portal run by the UK government which lets us download its data as a CSV file. Open this file with OpenRefine (you can copy the link address and OpenRefine will download the file for you).
The first step is to reconcile the EstablishmentName
column to Wikidata. Click on the menu for this column and choose Reconcile → Start reconciling, and choose to reconcile against Wikidata.
OpenRefine tries to guess the type of the entities in this column and proposes university (Q3918): in practice, this is too specific as some of our rows are about colleges or academies. Therefore, choose the Reconcile against type option and select a broader type in this field, such as educational organization (Q5341295).
In the panel on the right you have the option to use other columns to refine the reconciliation process. In our case, the URN
column contains identifiers that are stored in Wikidata as DfE URN (P2253). So tick Include for this column, type DfE URN in the box on the right-hand side, and validate your choice in the drop-down suggestion dialog.
Finally, start the reconciliation process and wait for it to complete. Most cells should have been reconciled automatically thanks to the unique identifier that we have provided, but in general it is important to check the quality of these matches.
Some of the cells have not been reconciled automatically, because of discrepancies between our table and Wikidata. You can isolate them by clicking on none in the judgment facet on the left-hand side. If you have time, you can then match them manually - but this is not required to continue this tutorial: unreconciled cells will be ignored in the following steps.
Planning the import
[edit]Scroll horizontally to review the columns in this table. Many of them could be imported in Wikidata: the address of the headquarters (including the postcode), the URL of the official website, the main telephone number and even the name and title of the head of the institution.[1] We first need to understand how this sort of data is currently modeled in Wikidata. Click on a few reconciled cells and look what statements they already have.
Take University of Birmingham (Q223429) for instance. This item already has the following statements:
official website |
| ||||||||||||||
add value |
headquarters location |
| ||||||||||||||||
add value |
That shows us which Wikidata properties are used for this sort of data. To have a better idea of their current coverage for the items we have at hand, we are going to use the data enrichment capabilities of OpenRefine to pull the existing Wikidata statements in our OpenRefine project. Click on the reconciled column and choose Edit column → Add columns from reconciled values. Select the properties that we are interested in, such as official website (P856), headquarters location (P159), located in the administrative territorial entity (P131) and phone number (P1329). Validate and wait for the process to complete.
We can now use facets on these columns to evaluate their coverage on our dataset and compare their values to the ones we have in our source. For instance, click on the official website column that we have just created. Select Facet → Customized facets → Facet by blank. This creates a new facet on the left-hand site. The false value corresponds to the cells that are not empty, so the rows for which Wikidata already has an official website (P856).
At the time of writing, all these universities already have a website on Wikidata (good job Wikidatans!). The remaining two rows do not correspond to any university. This might be different for you if you have not reconciled all cells. We can use the same facet on other rows, and we discover that none of these items have a phone number (P1329) yet. Let's import that then!
Creating a schema
[edit]We now need to instruct OpenRefine how to translate our table into Wikidata statements. This is done by creating a schema: this is a template (or pattern) of statements that will be applied to each row. Click on the Wikidata button in the top right corner of your screen. Click Edit Wikidata schema to open the schema editor. You can think of the schema editor as a space where you can create new statements pretty much as you would create them directly on Wikidata. The main difference is that you can use columns as placeholders for values in these statements. For each row, these placeholders (or variables) will be replaced by the actual content of the column in that row. This is useful in our case, because the statements we want to add follow a simple pattern:
EstablishmentName 1329 TelephoneNum
First, let us indicate that the EstablishmentName column contains the items on which we want to add statements. Click the add item button to add a new item in the schema. Then, drag and drop the EstablishmentName column to the subject of the item.
Then, we can add a statement on this item template. Fill in the property (with phone number (P1329)), and drag and drop the TelephoneNum column as value for the statement. In the same way, we could add other statements on the same item, as well as labels, descriptions or aliases (for which you need to use the add term button).
Detecting and fixing issues
[edit]At this stage, you can see that the Issues tab reports 238 issues with our candidate statements.[2] Open the issues tab to survey them:
OpenRefine reports that the format of the telephone numbers that we are trying to add does not match the expected format on Wikidata. While the regular expression it provides is a bit cryptic, we can click on the property and discover that it expects a format where the international dialing code is supplied: instead of 01784434455
, we should add +44-1784434455
.
Luckily, we can fix that: we know that all these institutions are from the UK, and hence have the same international dialing code "+44". All we need is to append it in front of our values, with the initial 0 removed. Go back to the table. Find the TelephoneNum column, open its menu and select Edit cells → Transform. We can now enter a simple expression in GREL[3] to transform the values:
Apply the transformation and open the schema editor again. The warning about the format has disappeared, but OpenRefine still complains about the fact that we have not added any reference to our statements. We want to indicate the source of this information and make it easy for others to verify it. In our case, the website we drew the information from is the one DfE URN (P2253) links to. So, we can just include this identifier as a reference. The value of the identifier will change for each statement, so again we need to drag and drop the column that contains the identifier:
We can also add a retrieved date, as the data on this website might be updated in the future. We have now solved all the issues reported by OpenRefine, and we can preview the edits generated by our schema.
Uploading the changes to Wikidata
[edit]Now that we are happy with our statements, we can upload them to Wikidata. Click on the Wikidata button in the top right corner and select Push to Wikidata. You will need to login with your Wikidata account: the statements will be added via this account. This is a fairly small edit batch so we can do it on our own, but for larger groups, it is better to discuss the changes on Wikidata and request a bot flag to perform the edits.
Add a descriptive edit summary to your edits: this comment will appear in the history view of each item that you edit. Once you are done, click Perform edits. The upload will take a while and you can watch its progress by checking your own contributions. If you cancel the operation in OpenRefine, the upload will stop but existing edits will not be undone. If you have made a mistake and want to undo the entire edit group, you can use the EditGroups tool.
Conclusion
[edit]Well done, you have added more than 100 well sourced statements to Wikidata, just in a few clicks! Why don't you try this on another dataset? Oh, and if you enjoy using OpenRefine, feel free to add the {{User loves OpenRefine}}
userbox to your user page. Do you want to try something a bit more challenging? Then go to the following tutorial in this series, the Inverse Listeria tutorial.
Notes
[edit]- ↑ We could add the head of the institution with statements such as University of Birmingham (Q223429)director / manager (P1037)Michael Sterling (Q6834612), but after reviewing the associated Wikipedia page we realize that this information is already outdated by a few years, so that should also be excluded from the import.
- ↑ If OpenRefine does not report any format violation, you might have selected the wrong column. We have two columns with phone numbers in our project: the original one from our source (TelephoneNum) and the one that we pulled from Wikidata (phone number). In the schema editor we want to use the TelephoneNum, because that is where our fresh values are.
- ↑ To learn more about this expression language, visit OpenRefine's wiki (you can also use Python or Clojure if you are more familiar with these languages).