Wikidata:Data Import Guide/Step 5: Format the data to be imported

From Wikidata
Jump to navigation Jump to search

Part A: Duplicate the Original dataset sheet within your spreadsheet and rename the copy Structured for Wikidata.

Part B: Reorder your spreadsheet to use the following structure to make it easier for the people importing the data into Wikidata. A downloadable version of this format is available here.

Unique ID Name / Title Description for Wikidata Description for importing data URL More data 1 More data 2
A set of numbers/letters/characters that uniquely identify items in your dataset. This allows us to create a map from your data set to the corresponding Wikidata items.

Data can be imported without this, but it is strongly recommended to create an ID system if you do not already have one as the import process becomes significantly easier (there are a range of other benefits too, such as increased discoverability of your content) NOTE: if the donating organisation does not have an ID system and cannot create one internally, the data importer will make up an id system at when they upload the data. The recommended format is FAKE_ID_$ (with $ representing a number)

This is the name/title of each item that you have some data about.

For example, if you were donating data about people (dates of birth, occupation, place of death etc), then this column should show the name of each person in the data set. If you were donating data about a book, the title of each book would be shown. Note: if you have names of your items in multiple languages, include an additional column for each language

A short description of the item from a few words up to a sentence. This will describe the item within Wikidata. Descriptions can be created by combining data fields within the dataset e.g For a dataset of Biosphere Reserves where data on the country and year of inscription was available, the description could be 'Biosphere reserve in Democratic Republic Of The Congo, designated in 1976.'. A short description of the item from a few words up to a paragraph. This field can be the same as the Description for Wikidata field. This is not for importing into Wikidata - its purpose is to help match items in your dataset with Wikidata items unambiguously.

For example, the description would help us distinguish two people of the same name by providing some extra info about their lives (e.g. occupation and date of birth). Note: This column is not essential if you are providing data in other columns that can be used to disambiguate. For example, if 'occupation' and 'country of citizenship' are given in other columns, this would usually be enough to identify a person uniquely (along with their name of course).

If applicable, you should include a URL to a page about on your website.

For example, a digital collection of a museum would have a page on their site for each item in the collection. NOTE: If your website has a URL pattern for getting to an item's page from the unique ID number, then you can just provide us with one example (e.g. www.example.com/collection/12345) - obviously we also need the unique IDs given in column A to make use of the pattern.

Any other data about an item that you would like to make available for import into Wikidata.

This heading of this column might be "date of birth", "population", "area in square meters", "occupation", "height", "colour", or any other meaningful type of data that you have for some or all of the items in the data set

You can add as many additional columns as you like for additional points of data.

As an example here is a small section of the spreadsheet structure used to import data from the UNESCO Man and the Biosphere Programme.

Name of Site Description URL Country / countries Designation year Year withdrawn Midpoint Latitude Midpoint Longitude Total area of the newest data (ha) Area of all core zones Area of all buffer zones Area of all transition zones
Yangambi Biosphere reserve in Democratic Republic Of The Congo | designated in 1976 http://www.unesco.org/new/en/natural-sciences/environment/ecological-sciences/biosphere-reserves/africa/democratic-republic-of-the-congo/yangambi/ DEMOCRATIC REPUBLIC OF THE CONGO 1976 0.3333333333 24.5 220000 160000 60000
Luki Biosphere reserve in Democratic Republic Of The Congo | designated in 1976 http://www.unesco.org/new/en/natural-sciences/environment/ecological-sciences/biosphere-reserves/africa/democratic-republic-of-the-congo/luki/ DEMOCRATIC REPUBLIC OF THE CONGO 1976 -5.633333333 -13.18333333 32968 6816 5216 20936

Commonly used processes[edit]

This table provides some commonly used processes to format data so that it can be ingested into Wikidata. These processes range from the very common to needing special formulas to achieve (provide links to read only Google Sheets versions wherever possible).

Process Useful for Google Sheets Microsoft Word Open Office Open Refine
General introductory guidance Learning how to use the programs Google Sheets

Wikipedia and Wikidata tools for Google Sheets

Add formulas and functions to a spreadsheet

Google Sheets list of functions

Keyboard shortcuts

OpenRefine's wiki

OpenRefine recipes for Wikidata

Sort A - Z Seeing all the rows with the same statements Sort and filter your data Click on column, then Sort..., and choose sorting options.
Add up a column or row Combining numbers from several cells into a single cell SUM fuction
Matching columns Matching columns of data produced by Mix n' Match to import them into Quickstatements
Embed data from other sheet Useful for having a master sheet with all the data and seperate sub sheets to import individual statements into Wikidata Reference data from other sheets Use the cross function
Extracting urls from list of hyperlinks Often when copying a list from a website there are hyperlinks embedded in the text that are needed in a separate column in a spreadsheet.
Seperating cell text into columns Splitting cells before and after a certain character into separate columns when data is provided as an unstructured list but with a regular pattern Separate cell text into columns Use string functions
Create a Google search query for a term Looking for Wikipedia articles for an item to find if it exists or not HYPERLINK function e.g =HYPERLINK("https://www.google.co.uk/search?q="&A1)
Find and Replace Search and use find and replace OpenRefine's find and replace functions
Combining cells using & operator Combining information from two or more cells into a cell, e.g constructing a URL or a phrase, or adding quotation marks around a URL for QuickStatements
Add quotation around a field QuickStatements requires quotation marks are added around some kinds of data e.g URLs, coordinates.
Select whole row or column Not applicable
Prevent increasing list of numbers Some ways of copying numbers to additional cells causes the numbers in the column to increase rather than be a duplicate, this will introduce errors into the spreadsheet Not applicable
Transform formula column to plain text When copying data into QuickStatements it is helpful to use plain text rather than cells that rely on formulas After copying the cells: Edit > Paste Special > Paste values Only Not applicable
Select every Nth line Often external sources do not make their content available in rows and columns and the relevant data is is kept in a repeating pattern of rows, this means you have to seperate every Nth row and put it in a new column OpenRefine's transpose functions (currently undocumented)
Remove every empty row in a column Use facets to isolate the blank cells, then remove rows with the menu in the first column.
Coreferencing Coreferencing (or authority matching, instance matching) is the process of finding corresponding records in external datasets, which can be used to source additional info (eg MnM started sourcing birth-death years) or references How to use Google Sheets to Manage Wikidata Coreferencing Guide to Wikidata reconciliation with OpenRefine

list of other related tutorials

video showing how the reconciliation process works (think "Wikidata" when they say "Freebase", think "OpenRefine" when they say "Google Refine")