Wikidata:Data Import Guide

Data import guide
This guide has been created for anyone wishing to import data into Wikidata.

Importing data into Wikidata requires many skills, but the process can be broken down into individual steps, so the Wikidata community can work together to import data. The prerequisite skills to get started importing data are:

  • Creating and editing wiki pages, including interacting with Wikidata community members.
  • Moving information into a spreadsheet and duplicating sheets within a spreadsheet. Additional helpful skills are listed in Step 6.

The process of uploading data to Wikidata is broken down into the steps below, some of which are broken down further into parts.

Preparing the data requires minimal technical skills. Importing the data into Wikidata can be done either by requesting that the data be imported by bot (highly recommended) or by importing it yourself (only for experts and not yet documented).

Step 1: Choose data to import

Data imported into Wikidata must be:

  1. Reliable.
  2. Publicly available and preferably online.

If in doubt, please ask about the dataset on the Partnerships and Data Imports discussion page.

If the data is only available as a table in a PDF, Tabula can be very helpful for copying the data into a spreadsheet.

Step 2: Start a data import

Part A: Go to the Data Import Hub and follow instructions to create a new import.

Part B: Add the table and subheading as outlined in the Instructions for data importers section of the Data Import Hub. Please do this even if you are going to import the data yourself: it will enable others to help you and to understand what you have done, which is useful for future updates to the data.

Step 3: Describe the dataset

Complete the Description of dataset section of the table. This will mostly be a repetition of the data import request but please provide any other useful information on the dataset:

  • Name: Name of the dataset. If a formal name does not exist, please create a descriptive title.
  • Source: The source of the data, e.g. the organisation that produced it.
  • Link: A public link to the data or to a document that structured data will be created from.
  • Description: A description of the dataset, including any information that is useful to know about it.

Step 4: Import the data into a spreadsheet

Part A: Import the data into a spreadsheet. It is strongly recommended that you create an online spreadsheet to work from, as it will allow others to understand the issues and collaborate on importing the dataset. If ID numbers exist for the items, include them, as they are very helpful for the uploading process.

Part B: Complete the following fields within the Create and import data into spreadsheet section in the table on the Data Import Hub:

  • Link: Add a link to the spreadsheet you have created.
  • Done: Add information about what tasks you have done within the spreadsheet. If you have not imported all the data, add the tasks still to be completed in the To do field. Once all tasks have been completed, simply write 'all' in the field.
  • To do: Any tasks that need to be completed to import the data into the spreadsheet.
  • Notes: Any extra information that is useful to know, e.g. whether any changes were made to the original dataset or whether there were spelling mistakes.

Step 5: Define the structure of the data within Wikidata

This step is often the most difficult. However, there are many knowledgeable people within the Wikidata community who will be able to work with you to accomplish this step on the Wikidata:Partnerships and data imports page.

Part A: Look at the Wikidata glossary to understand the terms used in the following steps.

Part B: Look at examples of potentially similar data within Wikidata to understand what structure is already used for items.

  • Showcase items provide examples of items with very rich levels of data within Wikidata
  • Use the search function to search for items which may hold similar information stored in a way that could be copied for this data set.

Part C: Outline the structure within Wikidata in the table on the dataset import page. The dataset will need to be broken down into which parts of the data will be items, properties and values, and whether any qualifiers are needed. Also record any issues or notes about the data, e.g. whether the data is complete or whether it is related to any other datasets. Add what work has been done and any work still to do, e.g. proposing properties. If you need help with defining the structure of the dataset, ask on the data imports talk page.

  • Items: Which parts of the data will become items or use existing items.
  • Descriptions: The descriptions used for the items.
  • Properties: What property or properties will be used. You can search current properties on the Property list page. If any new properties need to be created, you can propose them on the Property Proposal page.
  • Qualifiers: Whether any qualifiers are needed.
  • Values: Which parts of the data will be used as values.
  • References: This can be one reference for the entire dataset or many.

Part D: Create one or more example items with the data structured in the way described. These practical examples will show how the data will be structured within an item and will surface any issues in implementing the proposed data structure.
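
The breakdown described in Part C can also be written down as a small data structure before any example item is created. The following is only a sketch for planning purposes and is not part of any Wikidata tool: the class names Statement and ItemPlan, the label, the description and the example.com reference URL are all hypothetical. The property and value reuse the "country of origin" (P495) = "Italy" (Q38) pairing given in Step 8.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One property/value pair, with optional qualifiers and references."""
    prop: str                                        # property ID, e.g. "P495"
    value: str                                       # item ID or literal value, e.g. "Q38"
    qualifiers: dict = field(default_factory=dict)   # qualifier property -> value
    references: list = field(default_factory=list)   # e.g. source URLs


@dataclass
class ItemPlan:
    """How one row of the dataset is expected to map onto a Wikidata item."""
    label: str
    description: str
    statements: list = field(default_factory=list)


# Hypothetical example item for a dataset of paintings.
plan = ItemPlan(
    label="Example painting",
    description="painting in an example collection",
    statements=[
        Statement(
            prop="P495",   # country of origin
            value="Q38",   # Italy
            references=["https://www.example.com/collection/12345"],
        ),
    ],
)

for statement in plan.statements:
    print(plan.label, statement.prop, statement.value, statement.references)
```

Writing the plan down this way makes it easier to spot which columns of the spreadsheet still lack a property or a reference before the table on the dataset import page is filled in.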

Step 6: Format the data to be imported

Part A: Duplicate the Original dataset sheet within your spreadsheet and rename the copy Structured for Wikidata.

Part B: Reorder your spreadsheet to use the following structure to make it easier for the people importing the data into Wikidata. A downloadable version of this format is available here.

The spreadsheet uses the following columns, in this order: Unique ID, Name / Title, Description for Wikidata, Description for importing data, URL, More data 1, More data 2.

  • Unique ID: A set of numbers/letters/characters that uniquely identifies each item in your dataset. This allows us to create a map from your dataset to the corresponding Wikidata items. Data can be imported without this, but it is strongly recommended to create an ID system if you do not already have one, as the import process becomes significantly easier (there is a range of other benefits too, such as increased discoverability of your content). NOTE: If the donating organisation does not have an ID system and cannot create one internally, the data importer will make up an ID system when they upload the data. The recommended format is FAKE_ID_$ (with $ representing a number).

  • Name / Title: The name/title of each item that you have some data about. For example, if you were donating data about people (dates of birth, occupation, place of death etc.), then this column should show the name of each person in the dataset. If you were donating data about books, the title of each book would be shown. NOTE: If you have names of your items in multiple languages, include an additional column for each language.

  • Description for Wikidata: A short description of the item, from a few words up to a sentence. This will describe the item within Wikidata. Descriptions can be created by combining data fields within the dataset, e.g. for a dataset of Biosphere Reserves where data on the country and year of inscription was available, the description could be 'Biosphere reserve in Democratic Republic Of The Congo, designated in 1976.'.

  • Description for importing data: A short description of the item, from a few words up to a paragraph. This field can be the same as the Description for Wikidata field. This is not for importing into Wikidata: its purpose is to help match items in your dataset with Wikidata items unambiguously. For example, the description would help us distinguish two people of the same name by providing some extra information about their lives (e.g. occupation and date of birth). NOTE: This column is not essential if you are providing data in other columns that can be used to disambiguate. For example, if 'occupation' and 'country of citizenship' are given in other columns, this would usually be enough to identify a person uniquely (along with their name, of course).

  • URL: If applicable, you should include a URL to a page about the item on your website. For example, a digital collection of a museum would have a page on their site for each item in the collection. NOTE: If your website has a URL pattern for getting to an item's page from the unique ID number, then you can just provide us with one example (e.g. www.example.com/collection/12345). We also need the unique IDs given in column A to make use of the pattern.

  • More data 1, More data 2: Any other data about an item that you would like to make available for import into Wikidata. The heading of this column might be "date of birth", "population", "area in square meters", "occupation", "height", "colour", or any other meaningful type of data that you have for some or all of the items in the dataset. You can add as many additional columns as you like for additional points of data.
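
Two of the column rules above, fallback IDs in the FAKE_ID_$ format and URLs derived from a pattern plus the unique ID, are easy to apply mechanically. The sketch below shows one possible way to do so outside a spreadsheet. The function names, the "{id}" placeholder convention and the sample rows are assumptions made for illustration; only the FAKE_ID_ prefix and the www.example.com/collection/12345 example come from this guide.

```python
def assign_fallback_ids(rows, id_key="Unique ID"):
    """Give every row without an ID a FAKE_ID_<n> value, numbering from 1."""
    counter = 0
    result = []
    for row in rows:
        row = dict(row)  # copy, so the caller's rows are not modified
        if not row.get(id_key):
            counter += 1
            row[id_key] = f"FAKE_ID_{counter}"
        result.append(row)
    return result


def url_from_pattern(pattern, unique_id):
    """Fill a unique ID into a URL pattern that contains '{id}'."""
    return pattern.format(id=unique_id)


# Hypothetical sample rows: one with an ID, one with a blank ID, one with no ID column.
rows = [
    {"Unique ID": "12345", "Name / Title": "First item"},
    {"Unique ID": "", "Name / Title": "Second item"},
    {"Name / Title": "Third item"},
]

with_ids = assign_fallback_ids(rows)
print([row["Unique ID"] for row in with_ids])
print(url_from_pattern("www.example.com/collection/{id}", "12345"))
```

Existing IDs are left untouched, so the made-up IDs only fill the gaps and never replace an identifier the donating organisation already uses.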

As an example here is a small section of the spreadsheet structure used to import data from the UNESCO Man and the Biosphere Programme.

Name of Site Description URL Country / countries Designation year Year withdrawn Midpoint Latitude Midpoint Longitude Total area of the newest data (ha) Area of all core zones Area of all buffer zones Area of all transition zones
Yangambi Biosphere reserve in Democratic Republic Of The Congo | designated in 1976 http://www.unesco.org/new/en/natural-sciences/environment/ecological-sciences/biosphere-reserves/africa/democratic-republic-of-the-congo/yangambi/ DEMOCRATIC REPUBLIC OF THE CONGO 1976 0.3333333333 24.5 220000 160000 60000
Luki Biosphere reserve in Democratic Republic Of The Congo | designated in 1976 http://www.unesco.org/new/en/natural-sciences/environment/ecological-sciences/biosphere-reserves/africa/democratic-republic-of-the-congo/luki/ DEMOCRATIC REPUBLIC OF THE CONGO 1976 -5.633333333 -13.18333333 32968 6816 5216 20936
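
The descriptions in these rows follow the pattern described above: they combine the country and the designation year. As an illustration, the sketch below builds the same text from the two fields. The function name and the dictionary layout are hypothetical, and in practice the same result can be produced with a spreadsheet formula.

```python
def build_description(site_type, country, year):
    """Combine dataset fields into a short description for a Wikidata item."""
    # str.title() capitalises every word, which turns the upper-case country
    # name from the spreadsheet into the form used in the description column.
    return f"{site_type} in {country.title()}, designated in {year}"


row = {
    "Name of Site": "Yangambi",
    "Country / countries": "DEMOCRATIC REPUBLIC OF THE CONGO",
    "Designation year": 1976,
}

description = build_description(
    "Biosphere reserve",
    row["Country / countries"],
    row["Designation year"],
)
print(description)
```

Generated descriptions should still be read through before import, since simple capitalisation rules do not suit every country name.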

Commonly used processes

This table provides some commonly used processes for formatting data so that it can be ingested into Wikidata. These processes range from the very common to those that need special formulas. Where possible, provide links to read-only Google Sheets versions.

Each entry below gives the process, what it is useful for, and any available guidance for Google Sheets, Microsoft Word, Open Office or Open Refine.

  • General introductory guidance
    Useful for: Learning how to use the programs.
    Guidance: Google Sheets; Wikipedia and Wikidata tools for Google Sheets; Add formulas and functions to a spreadsheet; Google Sheets list of functions; Keyboard shortcuts; OpenRefine's wiki; OpenRefine recipes for Wikidata.
  • Sort A - Z
    Useful for: Seeing all the rows with the same statements.
    Guidance: Sort and filter your data. Click on column, then Sort..., and choose sorting options.
  • Add up a column or row
    Useful for: Combining numbers from several cells into a single cell.
    Guidance: SUM function.
  • Matching columns
    Useful for: Matching columns of data produced by Mix'n'match to import them into QuickStatements.
  • Embed data from other sheet
    Useful for: Having a master sheet with all the data and separate sub-sheets to import individual statements into Wikidata.
    Guidance: Reference data from other sheets.
  • Extracting URLs from a list of hyperlinks
    Useful for: Often when copying a list from a website there are hyperlinks embedded in the text that are needed in a separate column in a spreadsheet.
  • Separating cell text into columns
    Useful for: Splitting cells before and after a certain character into separate columns when data is provided as an unstructured list but with a regular pattern.
    Guidance: Separate cell text into columns.
  • Create a Google search query for a term
    Useful for: Looking for Wikipedia articles for an item to find out whether it exists or not.
    Guidance: HYPERLINK function, e.g. =HYPERLINK("https://www.google.co.uk/search?q="&A1)
  • Find and Replace
    Guidance: Search and use find and replace; OpenRefine's find and replace functions.
  • Combining cells using the & operator
    Useful for: Combining information from two or more cells into a cell, e.g. constructing a URL or a phrase, or adding quotation marks around a URL for QuickStatements.
  • Add quotation marks around a field
    Useful for: QuickStatements requires quotation marks around some kinds of data, e.g. URLs and coordinates.
  • Select whole row or column
  • Prevent increasing list of numbers
    Useful for: Some ways of copying numbers to additional cells cause the numbers in the column to increase rather than be duplicated, which introduces errors into the spreadsheet.
  • Transform formula column to plain text
    Useful for: When copying data into QuickStatements it is helpful to use plain text rather than cells that rely on formulas.
    Guidance: After copying the cells: Edit > Paste Special > Paste values Only.
  • Select every Nth line
    Useful for: Often external sources do not make their content available in rows and columns, and the relevant data is kept in a repeating pattern of rows. This means you have to separate every Nth row and put it in a new column.
    Guidance: OpenRefine's transpose functions (currently undocumented).
  • Remove every empty row in a column
    Guidance: Use facets to isolate the blank cells, then remove rows with the menu in the first column.
  • Coreferencing
    Useful for: Coreferencing (or authority matching, instance matching) is the process of finding corresponding records in external datasets, which can be used to source additional information (e.g. MnM started sourcing birth-death years) or references.
    Guidance: How to use Google Sheets to Manage Wikidata Coreferencing; Guide to Wikidata reconciliation with OpenRefine; list of other related tutorials; video showing how the reconciliation process works (think "Wikidata" when they say "Freebase", think "OpenRefine" when they say "Google Refine").
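
Several of the processes listed above (removing empty rows, turning a repeating pattern of every Nth line into columns, and splitting cell text at a character) can also be scripted when a spreadsheet formula becomes unwieldy. The following is a minimal sketch in plain Python. The function names and the sample values are made up for illustration, and the spreadsheet and OpenRefine routes listed above remain the documented ones.

```python
def drop_empty(cells):
    """Remove cells that are empty or contain only whitespace."""
    return [cell for cell in cells if cell.strip()]


def rows_from_repeating_lines(lines, n):
    """Turn a flat list with a repeating pattern of n lines into rows of n columns."""
    if len(lines) % n != 0:
        raise ValueError("number of lines is not a multiple of n")
    return [lines[i:i + n] for i in range(0, len(lines), n)]


def split_cell(text, separator):
    """Split one cell into the parts before and after the first separator."""
    before, _, after = text.partition(separator)
    return before.strip(), after.strip()


# Hypothetical input: name and year on alternating lines, with blank lines between records.
raw = ["Yangambi", "1976", "", "Luki", "1976", ""]

rows = rows_from_repeating_lines(drop_empty(raw), 2)
print(rows)
print(split_cell("Yangambi - 1976", "-"))
```

The length check in rows_from_repeating_lines is deliberate: if a record is missing a line, every later row would otherwise shift by one column without any warning.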

Step 7: Choose how to import the data

Option 1: Request that the data is imported by other people

Step A: Request that the data be imported into Wikidata on the Wikidata bot request page. To make a request, click on the Add a new request button, and then link to your request on the Data Hub page.

Step B: Check the Manual work needed section of the table once the import has started to see what work has to be done manually.

Step C: Go to Step 11 once the data has been imported.


Option 2: Self import

Go to Step 8: Match the data to Wikidata.

Step 8: Match the data to Wikidata

Before you can import data about any list of items, you will need to know the corresponding Wikidata ID number for each item in the list (essentially each row of your spreadsheet). You will also sometimes need to find Wikidata IDs for people, places, concepts or other things that are used to describe your main list of items.

a) Matching 'main' items
Find a Wikidata item number for each 'row' in the data (e.g. the item number for each painting if the dataset is a collection of paintings). The options you have for this step are:
  • Mix'n'match - Use this tool when a significant amount will need to be matched manually by community members.
  • OpenRefine - Use this tool for fine-grained matching using multiple columns of your dataset, using fuzzy-matching, external identifiers or sitelinks.
  • Other semi-auto matching - When the data has some unique IDs, URLs or other data in common with Wikidata that allow you to generate a list of Q numbers (e.g. by running a query and matching common data in your import spreadsheet).
b) Matching items used as values
Find Q numbers for any additional data to be added. E.g. a painting which has "country of origin" = "Italy" needs to become P495 = Q38 (Note: the properties should have been found already in Step 5 above).
The method you use for this will depend on the specific case. The following tools are available:
With any method used, you may end up with some manual work at the end to finish it off.
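
When the dataset shares unique IDs with Wikidata, the semi-automatic matching described above amounts to a lookup. The sketch below illustrates the idea under several assumptions: the id_to_qid mapping would in practice come from a query or another tool, the row contents are invented, and Q4115189 (the Wikidata Sandbox item) is used only as a placeholder Q number. The P495 = Q38 pairing is the one from the example above.

```python
def match_by_id(rows, id_to_qid, id_key="Unique ID"):
    """Split rows into those with a known Q number and those still unmatched."""
    matched, unmatched = [], []
    for row in rows:
        qid = id_to_qid.get(row[id_key])
        if qid:
            matched.append({**row, "QID": qid})
        else:
            unmatched.append(row)
    return matched, unmatched


# Hypothetical spreadsheet rows for a collection of paintings.
rows = [
    {"Unique ID": "A1", "Name / Title": "First painting", "country of origin": "Italy"},
    {"Unique ID": "A2", "Name / Title": "Second painting", "country of origin": "Italy"},
]

id_to_qid = {"A1": "Q4115189"}   # a) 'main' items; placeholder Q number
value_to_qid = {"Italy": "Q38"}  # b) items used as values

matched, unmatched = match_by_id(rows, id_to_qid)
statements = [
    (row["QID"], "P495", value_to_qid[row["country of origin"]])
    for row in matched
]
print(statements)
print([row["Unique ID"] for row in unmatched])
```

The unmatched list is the manual work left over at the end: each of those rows needs either a match found by hand or a new item.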

Step 9: Add the data to Wikidata

There are several options available to import the data once it has been matched:
  • QuickStatements - For when there are statements, labels, descriptions etc that vary from item to item.
  • PetScan - For when you need to add the same statement(s) to a list of items.
  • Bot import - For when the job is too complex to use QuickStatements or PetScan. You can request a bot import here.
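
For the QuickStatements route, the matched spreadsheet rows have to be turned into one command per line. The sketch below assumes the tab-separated item, property, value layout of the QuickStatements version 1 syntax, with text values such as URLs wrapped in double quotes. Check the current QuickStatements help page before relying on it. The helper names are hypothetical, Q4115189 (the Wikidata Sandbox item) is a placeholder, and P973 ("described at URL") with an example.com address is used only to show a quoted value.

```python
def quote(text):
    """Wrap a text value (e.g. a URL) in double quotes."""
    return '"' + text + '"'


def statement_line(item, prop, value):
    """Join one statement into a tab-separated command line."""
    return "\t".join([item, prop, value])


lines = [
    # Item value: country of origin (P495) = Italy (Q38)
    statement_line("Q4115189", "P495", "Q38"),
    # Text value: a URL must be quoted
    statement_line("Q4115189", "P973", quote("https://www.example.com/collection/12345")),
]
print("\n".join(lines))
```

Generating the lines as plain text also avoids the problem noted in the table of commonly used processes, where cells that rely on formulas are copied instead of their values.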

Step 10: Check the data import

Once the data has been imported into Wikidata, request a query at Wikidata:Request a query to check that your data has been imported correctly. This step also highlights any issues that may have come about from importing the data. A list of useful queries for checking that the data has been imported properly will be added here soon.
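
As an illustration of the kind of check such a query can perform, the sketch below builds a SPARQL query that lists which of the imported items are still missing a given property, together with a URL for the Wikidata Query Service. It is not one of the promised queries. The function names are hypothetical, the item list is a placeholder (Q4115189 is the Wikidata Sandbox item), and the code only builds text, it does not contact the service.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def missing_property_query(qids, prop):
    """Build a SPARQL query listing the given items that lack any value for prop."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        f"SELECT ?item WHERE {{ VALUES ?item {{ {values} }} "
        f"FILTER NOT EXISTS {{ ?item wdt:{prop} [] }} }}"
    )


def query_url(query):
    """Build a request URL for the query service, asking for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = missing_property_query(["Q4115189"], "P495")
print(query)
print(query_url(query))
```

An empty result means every listed item has at least one value for the property. It does not confirm that the values themselves are correct, so spot checks of individual items are still worthwhile.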

Step 11: Summarise the import

Part A: Fill in the Date import complete and notes section of the table on the dataset import page. Add the date on which the data import into Wikidata was completed, and any notes about issues or anything else people should know to improve or maintain the data, e.g. when an updated version of the dataset will be released.

Part B: Move the dataset from the Current datasets being imported section to the Completed datasets section.