Wikidata:Tools/OpenRefine


OpenRefine is a free data wrangling tool that can be used to clean tabular data and connect it with knowledge bases, including Wikidata. It was previously developed by Google (under the name Google Refine) and has now transitioned to a community-supported project.

This page gathers OpenRefine recipes that are useful for importing datasets into Wikidata or for augmenting datasets with data extracted from Wikidata. Feel free to use the talk page to ask for help with the software. If you enjoy using this tool, you can spread the word with the {{User loves OpenRefine}} userbox.

Main features

Wikidata reconciliation

In OpenRefine terminology, reconciliation is the process of linking free-text tabular cells to identifiers in knowledge bases. OpenRefine's built-in reconciliation capabilities make it a versatile tool to reconcile tabular data to a wide range of databases, including Wikidata.

Semi-automatic reconciliation of universities in OpenRefine

OpenRefine's wiki contains a detailed guide to the reconciliation process. Here are the main features:

  • Restrict the reconciliation to a Wikidata class. Only items that are instances of this class or of one of its subclasses will be considered;
  • Use multiple columns in your dataset and match them against values of properties in Wikidata, which refines the reconciliation score and acts as a tiebreaker between namesakes;
  • Use the external identifiers shared by your dataset and Wikidata to reconcile your items;
  • Use the sitelinks provided in your dataset as identifiers: if these Wikimedia pages are linked to a Wikidata item, the cells will automatically be reconciled to that item.
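
To make the first two options more concrete, here is a minimal sketch of the kind of query batch a reconciliation client sends to a reconciliation service. It assumes the usual shape of reconciliation queries (a `query` string, an optional `type`, and a list of `properties` given as `pid`/`v` pairs). The helper names are hypothetical, no request is actually sent, and the identifiers (Q3918 for the class, P17 for the extra column) are only illustrative.

```python
import json


def build_reconciliation_query(name, type_qid=None, property_values=None, limit=3):
    """Build one reconciliation query for a single cell value.

    name            -- the free-text cell to reconcile
    type_qid        -- optional Wikidata class used to restrict the candidates
    property_values -- optional {property id: value} taken from other columns
    """
    query = {"query": name, "limit": limit}
    if type_qid is not None:
        query["type"] = type_qid
    if property_values:
        # Extra columns refine the score and help separate namesakes.
        query["properties"] = [
            {"pid": pid, "v": value} for pid, value in property_values.items()
        ]
    return query


def build_batch(queries):
    """Key the queries as q0, q1, ... so that results can be matched back to rows."""
    return {f"q{i}": query for i, query in enumerate(queries)}


batch = build_batch([
    build_reconciliation_query("University of Cambridge", "Q3918", {"P17": "United Kingdom"}),
    build_reconciliation_query("Cambridge"),  # no class restriction, no extra columns
])
payload = json.dumps(batch, indent=2)
print(payload)
```

In OpenRefine itself these queries are assembled for you from the reconciliation dialog; the sketch only shows which pieces of your table end up in them.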

If you want to use the reconciliation features, consider engaging with the following instruction materials:

Data augmentation

This screencast demonstrates how to add new columns based on a reconciled column in OpenRefine 2.8.

This feature is available from OpenRefine 2.8 onwards.

Once a column of your table is reconciled to Wikidata, you can pull data from Wikidata, creating other columns in your dataset. If there are multiple claims for a given property, the values will be grouped as records in OpenRefine: they are stored in additional rows where the original reconciled column is blank. OpenRefine's record mode might therefore be more suitable for the later transformations you want to carry out on your table.
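
The record layout described above can be pictured with a small sketch. It is not OpenRefine's implementation, only an illustration under simple assumptions: rows are plain dictionaries, blank cells are `None`, and the function and column names are hypothetical.

```python
def expand_to_records(rows, value_column):
    """Spread multi-valued cells over extra rows.

    The first value stays on the original row; each further value goes on a
    continuation row where every other column is blank.
    """
    expanded = []
    for row in rows:
        values = row[value_column] or [None]
        for index, value in enumerate(values):
            if index == 0:
                new_row = dict(row)
            else:
                new_row = {column: None for column in row}
            new_row[value_column] = value
            expanded.append(new_row)
    return expanded


def group_records(rows, key_column):
    """Group rows into records: a blank key cell continues the previous record."""
    records = []
    for row in rows:
        if row[key_column] is not None or not records:
            records.append([row])
        else:
            records[-1].append(row)
    return records


table = [
    {"item": "Item A", "value": ["x", "y"]},  # two claims for the property
    {"item": "Item B", "value": ["z"]},       # a single claim
]
flat = expand_to_records(table, "value")
records = group_records(flat, "item")
for record in records:
    print(record)
```

Working in record mode means that operations treat each group of rows as one unit, which is usually what you want after fetching multi-valued properties.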

You can use this function recursively on the newly created columns if they correspond to Wikidata items. This lets you explore the Wikidata graph along selected properties. It is also possible to configure how the properties are retrieved (for instance, by filtering on rank or references).
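
Filtering by rank or references can be sketched as follows. This is an illustration only: the option names `best`, `no_deprecated` and `any` are hypothetical labels chosen for this example, and the statements are simplified dictionaries loosely modelled on the `rank` and `references` fields of Wikidata's JSON.

```python
def filter_statements(statements, rank="best", require_references=False):
    """Select the statements to keep for one property of one item.

    rank -- "best": preferred statements if any, otherwise normal ones;
            "no_deprecated": everything except deprecated statements;
            "any": all statements.
    require_references -- if True, drop statements without any reference.
    """
    if rank == "best":
        preferred = [s for s in statements if s.get("rank") == "preferred"]
        kept = preferred or [s for s in statements if s.get("rank") == "normal"]
    elif rank == "no_deprecated":
        kept = [s for s in statements if s.get("rank") != "deprecated"]
    elif rank == "any":
        kept = list(statements)
    else:
        raise ValueError(f"unknown rank option: {rank!r}")
    if require_references:
        kept = [s for s in kept if s.get("references")]
    return kept


statements = [
    {"value": "a", "rank": "preferred", "references": [{"source": "ref 1"}]},
    {"value": "b", "rank": "normal", "references": []},
    {"value": "c", "rank": "deprecated", "references": [{"source": "ref 2"}]},
]
print([s["value"] for s in filter_statements(statements)])
print([s["value"] for s in filter_statements(statements, rank="no_deprecated")])
print([s["value"] for s in filter_statements(statements, rank="any", require_references=True)])
```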

Wikidata editing

This feature is available from OpenRefine 3.0 onwards.

OpenRefine can help you transform tabular data into Wikidata statements. This works by creating a schema, that is, a template of Wikidata edits that is applied to each row of your table. Once you have created a schema, you can:

  • preview the Wikidata edits and inspect them manually;
  • analyze and fix any issues raised automatically by the tool;
  • upload your changes to Wikidata by logging in with your own account;
  • export the changes to the QuickStatements v1 format.
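
The idea of a schema applied to each row, and of exporting the result as QuickStatements v1 commands, can be sketched as follows. The schema structure here is hypothetical and much simpler than OpenRefine's own; the sketch only assumes that a v1 command is a tab-separated line of item, property and value, with string values in double quotes. The identifiers are illustrative, the subject is the Wikidata sandbox item, and the output is not meant to be uploaded.

```python
def render_value(spec, row):
    """Resolve a value spec against a row; return None when the cell is blank."""
    if "constant" in spec:
        return spec["constant"]
    cell = row.get(spec["column"])
    if cell is None or cell == "":
        return None
    # String values are wrapped in double quotes (no escaping is attempted here).
    return f'"{cell}"' if spec.get("quote") else cell


def schema_to_quickstatements(schema, rows):
    """Apply the schema to every row and return one command line per statement."""
    lines = []
    for row in rows:
        subject = row.get(schema["subject_column"])
        if not subject:
            continue  # unreconciled row: nothing to edit
        for statement in schema["statements"]:
            value = render_value(statement["value"], row)
            if value is None:
                continue  # blank cell: skip this statement only
            lines.append("\t".join([subject, statement["property"], value]))
    return lines


schema = {
    "subject_column": "qid",
    "statements": [
        {"property": "P31", "value": {"constant": "Q3918"}},
        {"property": "P856", "value": {"column": "website", "quote": True}},
    ],
}
rows = [
    {"qid": "Q4115189", "website": "https://example.org"},
    {"qid": "", "website": "https://example.com"},  # not reconciled, skipped
]
commands = schema_to_quickstatements(schema, rows)
print("\n".join(commands))
```

Previewing the generated lines before anything is uploaded mirrors the first two steps of the list above.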

See the editing subpage for more details.

Recipes

OpenRefine workflows can be shared by copying the JSON representation of the edit history. This represents the operations you have made in OpenRefine, and can be reused by others on similar datasets. This section lists some recipes that can be useful when working with Wikidata.

  • Once you have reconciled a column to Wikidata, you can obtain the Qids in a new column by using the "Add column based on this column" operation with the following GREL expression: cell.recon.match.id
  • Share your recipe here!
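
As an illustration of what a shared recipe looks like, the sketch below builds a one-step operation history for the Qid recipe above. The field names are an assumption based on the typical shape of OpenRefine's exported JSON and should be compared with the JSON copied from your own project; the column names are hypothetical.

```python
import json


def qid_column_operation(base_column, new_column, insert_index):
    """Describe an "Add column based on this column" step that extracts Qids."""
    return {
        "op": "core/column-addition",
        "engineConfig": {"facets": [], "mode": "row-based"},
        "baseColumnName": base_column,
        "expression": "grel:cell.recon.match.id",
        "onError": "set-to-blank",
        "newColumnName": new_column,
        "columnInsertIndex": insert_index,
    }


# A history is a JSON list of operations, applied in order.
history = [qid_column_operation("university", "qid", 1)]
recipe = json.dumps(history, indent=2)
print(recipe)
```

The resulting JSON text is what you would paste when applying an operation history to another project with a column of the same name.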

Help OpenRefine

OpenRefine needs your help! There are many things you can do: