Wikidata:Data Import Guide
Importing data into Wikidata requires many skills; however, the process can be broken down into individual steps, which means the Wikidata community can work together to import data. The prerequisite skills to get started importing data are:
The process of uploading data to Wikidata can be broken down into the following steps, each of which comprises several stages:
Preparing the data requires minimal technical skills. Importing the data into Wikidata can be done either by requesting that the data be imported by bot (highly recommended) or by importing it yourself (only for experts and not yet documented).
Step 1: Choose data to import
Data imported into Wikidata must be:
If in doubt please ask about the dataset on the Partnerships and Data Imports discussion page.
If the data is only available as a table in a PDF, Tabula can be very helpful in copying the data into a spreadsheet.
Step 2: Start a data import
Part A: Go to the Data Import Hub and follow instructions to create a new import.
Part B: Add the table and subheading as outlined in the Instructions for data importers section of the Data Import Hub. Please do this even if you are going to import the data yourself; it will enable others to help you and to understand what you have done for future updates to the data.
Step 3: Describe the dataset
Step 4: Import the data into a spreadsheet
Step 5: Define the structure of the data within Wikidata
This step is often the most difficult; however, many knowledgeable people within the Wikidata community will be able to work with you on it via the Wikidata:Partnerships and data imports page.
Part A: Look at the Wikidata glossary to understand the terms used in the following steps.
Part B: Look at examples of potentially similar data within Wikidata to understand what structure is already used for items.
Part C: Outline the structure within Wikidata in the table on the dataset import page. The dataset will need to be broken down into which parts of the data will be items, properties, and values, and whether any qualifiers are needed. Also record any issues or notes about the data, e.g. whether the data is complete or whether it is related to any other datasets. Add to the table what work has been done and any work still to do, e.g. proposing properties. If you need help with defining the structure of the dataset, ask on the data imports talk page.
Part D: Create one or more example items with the data structured in the way described; these practical examples will show how the data will be structured within an item and surface any issues in implementing the proposed data structure.
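One lightweight way to draft such an example item is as QuickStatements commands (QuickStatements appears again in the formatting processes below). A hedged sketch for the Yangambi row from the biosphere-reserve example later in this guide: P31 (instance of), P17 (country) and P571 (inception) are real Wikidata properties, but Q158454 (assumed here to be 'biosphere reserve') and Q974 (assumed to be 'Democratic Republic of the Congo') should be verified before anything is run. Fields are shown pipe-separated; use TAB separators if your QuickStatements version does not accept pipes.

```
CREATE
LAST|Len|"Yangambi"
LAST|Den|"Biosphere reserve in Democratic Republic Of The Congo, designated in 1976"
LAST|P31|Q158454
LAST|P17|Q974
LAST|P571|+1976-00-00T00:00:00Z/9
```

Even if the commands are never executed, writing an example item this way makes the proposed structure concrete enough for others to review.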
Step 6: Format the data to be imported
Part A: Duplicate the Original dataset sheet within your spreadsheet and rename the copy Structured for Wikidata.
Part B: Reorder your spreadsheet to use the following structure to make it easier for the people importing the data into Wikidata. A downloadable version of this format is available here.
The spreadsheet should contain the following columns: Unique ID, Name / Title, Description for Wikidata, Description for importing data, URL, and one or more additional data columns (More data 1, More data 2, and so on).

Unique ID: A set of numbers, letters, or characters that uniquely identifies each item in your dataset. This allows us to create a map from your dataset to the corresponding Wikidata items. Data can be imported without this, but it is strongly recommended to create an ID system if you do not already have one, as the import process becomes significantly easier (there is a range of other benefits too, such as increased discoverability of your content). Note: if the donating organisation does not have an ID system and cannot create one internally, the data importer will make up an ID system when they upload the data. The recommended format is FAKE_ID_$ (with $ representing a number).

Name / Title: The name or title of each item that you have data about. For example, if you were donating data about people (dates of birth, occupation, place of death, etc.), this column should show the name of each person in the dataset. If you were donating data about books, the title of each book would be shown. Note: if you have names of your items in multiple languages, include an additional column for each language.

Description for Wikidata: A short description of the item, from a few words up to a sentence. This will describe the item within Wikidata. Descriptions can be created by combining data fields within the dataset; e.g. for a dataset of biosphere reserves where data on the country and year of inscription was available, the description could be 'Biosphere reserve in Democratic Republic Of The Congo, designated in 1976.'.

Description for importing data: A short description of the item, from a few words up to a paragraph. This field can be the same as the Description for Wikidata field. It is not for importing into Wikidata; its purpose is to help match items in your dataset with Wikidata items unambiguously. For example, the description would help us distinguish two people of the same name by providing some extra information about their lives (e.g. occupation and date of birth). Note: this column is not essential if you are providing data in other columns that can be used to disambiguate. For example, if 'occupation' and 'country of citizenship' are given in other columns, this would usually be enough to identify a person uniquely (along with their name, of course).

URL: If applicable, include a URL to the item's page on your website. For example, a digital collection of a museum would have a page on its site for each item in the collection. Note: if your website has a URL pattern for getting to an item's page from the unique ID number, then you can just provide one example (e.g. www.example.com/collection/12345); we also need the unique IDs given in the Unique ID column to make use of the pattern.

More data 1, More data 2, etc.: Any other data about an item that you would like to make available for import into Wikidata. The heading of such a column might be "date of birth", "population", "area in square meters", "occupation", "height", "colour", or any other meaningful type of data that you have for some or all of the items in the dataset. You can add as many additional columns as you like for additional points of data.
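If a dataset arrives without unique IDs, the FAKE_ID_$ convention described above can be applied with a short script rather than by hand, and descriptions can be built by combining fields. A minimal Python sketch, assuming the column names from the biosphere-reserve example below (they are illustrative, not required):

```python
def add_fake_ids(rows, id_field="Unique ID"):
    """Assign FAKE_ID_1, FAKE_ID_2, ... to any row missing an ID,
    following the FAKE_ID_$ convention ($ is a running number)."""
    counter = 1
    for row in rows:
        if not row.get(id_field):
            row[id_field] = f"FAKE_ID_{counter}"
            counter += 1
    return rows

def build_description(row):
    """Combine data fields into a short Wikidata description, e.g.
    'Biosphere reserve in Democratic Republic Of The Congo, designated in 1976.'"""
    country = row["Country / countries"].title()
    return f"Biosphere reserve in {country}, designated in {row['Designation year']}."
```

Rows that already carry an ID are left untouched, so an existing ID scheme is never overwritten.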
As an example here is a small section of the spreadsheet structure used to import data from the UNESCO Man and the Biosphere Programme.
|Name of Site||Description||URL||Country / countries||Designation year||Year withdrawn||Midpoint Latitude||Midpoint Longitude||Total area of the newest data (ha)||Area of all core zones||Area of all buffer zones||Area of all transition zones|
|Yangambi||Biosphere reserve in Democratic Republic Of The Congo, designated in 1976||http://www.unesco.org/new/en/natural-sciences/environment/ecological-sciences/biosphere-reserves/africa/democratic-republic-of-the-congo/yangambi/||DEMOCRATIC REPUBLIC OF THE CONGO||1976||0.3333333333||24.5||220000||160000||60000|
|Luki||Biosphere reserve in Democratic Republic Of The Congo, designated in 1976||http://www.unesco.org/new/en/natural-sciences/environment/ecological-sciences/biosphere-reserves/africa/democratic-republic-of-the-congo/luki/||DEMOCRATIC REPUBLIC OF THE CONGO||1976||-5.633333333||-13.18333333||32968||6816||5216||20936|
Commonly used processes
This table provides some commonly used processes for formatting data so that it can be ingested into Wikidata. These processes range from very common operations to ones that need special formulas (links to read-only Google Sheets examples are provided wherever possible).
|Process||Useful for||Google Sheets||Microsoft Excel||Open Office||OpenRefine|
|General introductory guidance||Learning how to use the programs||Google Sheets||OpenRefine's wiki|
|Sort A - Z||Seeing all the rows with the same statements||Sort and filter your data||Click on column, then Sort..., and choose sorting options.|
|Add up a column or row||Combining numbers from several cells into a single cell||SUM function|
|Matching columns||Matching columns of data produced by Mix'n'match to import them into QuickStatements|
|Embed data from other sheet||Useful for having a master sheet with all the data and separate sub-sheets to import individual statements into Wikidata||Reference data from other sheets||Use the |
|Extracting urls from list of hyperlinks||Often when copying a list from a website there are hyperlinks embedded in the text that are needed in a separate column in a spreadsheet.|
|Separating cell text into columns||Splitting cells before and after a certain character into separate columns when data is provided as an unstructured list but with a regular pattern||Separate cell text into columns||Use string functions|
|Create a Google search query for a term||Looking for Wikipedia articles for an item to find whether or not it exists||HYPERLINK function, e.g. =HYPERLINK("https://www.google.co.uk/search?q="&A1)|
|Find and Replace||Searching for text and replacing it with something else||OpenRefine's find and replace functions|
|Combining cells using & operator||Combining information from two or more cells into a single cell, e.g. constructing a URL or a phrase, or adding quotation marks around a URL for QuickStatements|
|Add quotation marks around a field||QuickStatements requires that quotation marks be added around some kinds of data, e.g. URLs and coordinates.|
|Select whole row or column||Not applicable|
|Prevent increasing list of numbers||Some ways of copying numbers to additional cells cause the numbers in the column to increase rather than be duplicated, which will introduce errors into the spreadsheet||Not applicable|
|Transform formula column to plain text||When copying data into QuickStatements it is helpful to use plain text rather than cells that rely on formulas||After copying the cells: Edit > Paste Special > Paste values Only||Not applicable|
|Select every Nth line||Often external sources do not make their content available in rows and columns and the relevant data is kept in a repeating pattern of rows; this means you have to separate every Nth row and put it in a new column||OpenRefine's transpose functions (currently undocumented)|
|Remove every empty row in a column||Use facets to isolate the blank cells, then remove rows with the menu in the first column.|
|Coreferencing||Coreferencing (also called authority matching or instance matching) is the process of finding corresponding records in external datasets, which can be used to source additional information (e.g. Mix'n'match started sourcing birth and death years) or references||How to use Google Sheets to Manage Wikidata Coreferencing||Guide to Wikidata reconciliation with OpenRefine; video showing how the reconciliation process works (think "Wikidata" when they say "Freebase", and "OpenRefine" when they say "Google Refine")|
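Several of the processes above, such as combining cells and adding quotation marks for QuickStatements, can also be done in a small script when spreadsheet formulas get unwieldy. A minimal Python sketch, assuming the column names from the biosphere-reserve example; P571 (inception) and P625 (coordinate location) are real Wikidata properties, but check that they match your data before importing:

```python
def quote(value):
    """QuickStatements requires double quotes around string values such as URLs."""
    return f'"{value}"'

def year(value):
    """Wikidata date with year precision (/9), e.g. +1976-00-00T00:00:00Z/9."""
    return f"+{int(value)}-00-00T00:00:00Z/9"

def coordinate(lat, lon):
    """Wikidata globe-coordinate syntax: @LATITUDE/LONGITUDE."""
    return f"@{lat}/{lon}"

def statements(qid, row):
    """Build QuickStatements lines for one spreadsheet row.
    The column names come from the biosphere-reserve example above."""
    return [
        f"{qid}|P571|{year(row['Designation year'])}",
        f"{qid}|P625|{coordinate(row['Midpoint Latitude'], row['Midpoint Longitude'])}",
    ]
```

Generating the statement lines this way avoids the copy-paste errors that formula columns sometimes introduce (see "Transform formula column to plain text" above).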
Step 7: Choose how to import the data
Option 1: Request that the data be imported by other people
Step A: Request that the data be imported into Wikidata on the Wikidata bot request page. To make a request, click on the Add a new request button and then link to your request on the Data Hub page.
Step B: Check the Manual work needed section of the table once the import has started to see what work has to be done manually.
Step C: Go to Step 10 once the data has been imported.
Step 8: Match the data to Wikidata
Before you can import data about any list of items, you will need to know the corresponding Wikidata ID numbers for each item in the list (essentially each row of your spreadsheet). You will also sometimes need to find Wikidata IDs for people, places, concepts or other things that are used to describe your main list of items.
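The Wikidata API's wbsearchentities action can help with this matching. A minimal sketch, not a definitive workflow: it builds a search URL for one item name and extracts candidate QIDs with descriptions so a human can confirm the right match (or reject all of them). The sample response shape follows the standard API output:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://www.wikidata.org/w/api.php"

def search_url(name, language="en"):
    """Build a wbsearchentities query URL for one item name."""
    params = {"action": "wbsearchentities", "search": name,
              "language": language, "format": "json", "limit": 5}
    return API + "?" + urlencode(params)

def candidate_qids(response):
    """Pull (QID, description) pairs out of a wbsearchentities JSON
    response; the description helps disambiguate same-name items."""
    return [(hit["id"], hit.get("description", ""))
            for hit in response.get("search", [])]

# Live lookup (requires network access):
#   hits = candidate_qids(json.load(urlopen(search_url("Yangambi"))))
```

For large datasets, tools like Mix'n'match or OpenRefine reconciliation (see the processes table above) do this at scale, but a script like this is handy for spot checks.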
Step 9: Add the data to Wikidata
There are several options available to import the data once it has been matched:
Step 10: Check the data import
Once the data has been imported into Wikidata, request a query at Wikidata:Request a query to check that the data has been imported correctly; this will also highlight any issues that arise from the import. A list of useful queries for checking imports will be added here soon.
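Until that list exists, a typical check on the Wikidata Query Service looks like the following sketch: it lists items of a given class that are still missing an imported statement. P31 (instance of) and P571 (inception) are real properties, but the class Q-id shown is a placeholder to replace with your own:

```sparql
# Items of a given class that are missing the imported statement (here P571).
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q158454 .                 # instance of <your class>; replace the Q-id
  FILTER NOT EXISTS { ?item wdt:P571 [] }    # no inception statement yet
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```

Paste the query at query.wikidata.org; an empty result means every item of the class carries the statement.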