Wikidata:Requests for comment/Mapping and improving the data import process

An editor has requested the community to provide input on "Mapping and improving the data import process" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.

If you have an opinion regarding this issue, feel free to comment below. Thank you!

What would the ideal data import process look like? This page maps the current data import process, including tools and documentation: what is available and what is still needed to create an ideal process. Once this information has been mapped, it can be broken down into Phabricator tasks to make progress easy to track.

We want to get as many community members as possible involved in these early stages of planning. The end goal is a well-structured list of the improvements needed to make the data import process as good as possible, which we can then turn into a set of Phabricator tasks.

This discussion follows on from the session held at WikidataCon 2017, which aimed to gather feedback from the community about problems/difficulties with the current data import flow.




Goals

The goals of each step of the import process are outlined in its own section. The overall goals of having a centralised data import process are:

  • Increase the volume and quality of data in Wikidata through:
    • Lowering the barrier to contribution to Wikidata for new contributors and experts through better documentation (e.g. more Wikidata tours).
    • Providing tools that make it much easier to import data into Wikidata.
    • Providing resources for contributors to increase their skill level.
    • Providing a high quality service to data partners.
  • Understand topic completeness on Wikidata
    • Collating catalogues of datasets on different subjects into a central record. Not all topics have WikiProjects, many WikiProjects overlap in topic, and collating all data involves many languages, data types and contributors.
    • Collating data imports into this same central record.
    • Creating a Wikidata item for each database, with the number of items in the database as a statement, so that users can compare this against the number of corresponding items on Wikidata.
  • Increase trust by organisations and individuals reusing Wikidata including other Wikimedia projects through:
    • Providing clear information of how data is added to Wikidata and where it comes from.
    • Providing clear information about Wikidata's shortcomings and the development roadmap for new tools and documentation to improve data quality.
    • Having a much higher percentage of referenced statements. This could be done by making it policy to reference statements wherever possible when importing datasets.
    • Being transparent about the data import process with documentation of decisions made when importing a dataset.
    • Linking between import documentation and items and vice versa.
    • More linking to other Wikimedia projects, e.g. Commons categories.
    • Making it easier for other Wikimedia contributors to contribute to Wikidata and gain value from doing so.
  • Increase the number of organisations and individuals contributing to Wikidata through:
    • Providing information on the purpose and potential of contributing and using Wikidata.
    • Providing metrics on data reuse.
  • Increase the number of organisations and individuals reusing data from Wikidata through:
    • More complete and more topic-specific documentation, and easier tools for reusing data, including using Wikidata as authority control.
    • An easier to understand and more attractive way to explore data on Wikidata.
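The completeness goal above (comparing the number of items recorded for a database with the number of corresponding items on Wikidata) comes down to a simple ratio. The following is a minimal sketch of that arithmetic; the function name `import_coverage` and the example numbers are illustrative assumptions, not part of any existing tool.

```python
def import_coverage(items_in_source: int, items_on_wikidata: int) -> float:
    """Return the fraction of a source database that is represented on Wikidata.

    A value above 1.0 is possible and usually hints at duplicate items
    or at an outdated item count for the source database.
    """
    if items_in_source <= 0:
        raise ValueError("items_in_source must be a positive count")
    if items_on_wikidata < 0:
        raise ValueError("items_on_wikidata cannot be negative")
    return items_on_wikidata / items_in_source


# Illustrative numbers only: a register with 2,000 entries, 1,500 of them on Wikidata.
print(f"{import_coverage(2000, 1500):.0%} of the source is on Wikidata")
```

In practice the first number would come from the statement on the database's item and the second from a count query on Wikidata.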
Phabricator task structure

A suggested structure for the Phabricator task list is a single root project called Data partnerships process with three main subprojects:

  1. Documentation
  2. Tools
  3. Community communication


Import process overview

The process of importing external datasets into Wikidata can be broken down into the following steps. Tools and documentation could combine two or more of these steps:

  1. Provide information about Wikidata to data partners
  2. Identify data to be imported
  3. Plan and record data import
  4. Match data to existing data on Wikidata
  5. Import data
  6. Maintain and improve data quality
  7. Usage of data on other Wikimedia projects
  8. Reporting on data usage

1. Provide information about Wikidata to data partners

Current situation

Some resources for data partners exist but are limited. There are no step-by-step instructions for publishing open data or for getting data imported into Wikidata. The laws around copyright and other data rights are complicated and not well understood by many Wikidata contributors and data partners.

Existing resources

Goals

Data producers have a good understanding of the purpose and potential of contributing to and using Wikidata. Many organisations use Wikidata as an authority control to make their data Linked Open Data. There is information available for different audiences interested in Wikidata, e.g. data science students, librarians, museums, galleries and archives. There is an easier to understand and more attractive way to explore data on Wikidata.

Resources needed

Task | Type | Phabricator task
Detailed case studies of successful data imports, including a section on the impact for the organisation. | Documentation |
Clear and detailed instructions for reusing identifiers from Wikidata. | Documentation |
A way for data partners to get consistent and professional support for working with Wikidata. | Unknown |
Information in people's own language, so that they do not have to speak English to work with Wikidata. | Documentation |
Some kind of visual way of exploring Wikidata items, possibly as described in the 2017 Community Wishlist. | Tools |
Provide a place for potential partners to talk to Wikimedia contributors, potentially using Wikidata:Partnerships_and_data_imports. | Tools, Documentation |

2. Identify data to be imported

Current situation

There is very little coordination of data imported on a specific topic and few records of what has and has not been imported on a subject. It's difficult for an external organisation to find out whether their data is within the scope of Wikidata.

Existing resources

Goals

A place to map all the data available on a subject, including multiple external databases, and systematically import it into Wikidata, so that we know what is available and what is missing. For example, to have a worldwide list of built heritage on Wikidata (very useful for Wiki Loves Monuments), all national and regional heritage registers would need to be imported.

More consistent use of the Data Import Hub, so that we have a searchable record showing whether someone is already working on importing a dataset (and which Wikidata users and organisations are involved).

Resources needed

Task | Type | Phabricator task
System for tagging/categorising data imports | Tools |
A WikiProject or task force that can be the first contact point for external organisations who need to find out which parts (if any) of their dataset are notable enough for Wikidata. Ideally we need a non-wiki alternative for this, like a mailing list email address or even a person who can be contacted by phone/Skype. | Community communication |



3. Plan and record data imports

Current situation

A first version of a central record of all imported datasets exists (the Data Import Hub), but it isn't used. The Data Import Hub has a notes section, a place to plan the structure of data imports and a place for users to collaborate on data imports.

The current system should be adequate for a while, but will become very difficult to manage as the number of entries increases. It's also not easy for someone without knowledge of wikitext to start and edit a Data Import Hub entry.

Property proposals often take a long time to be accepted or rejected.

Existing resources

Goals

A central recording mechanism for all external datasets imported into Wikidata. It is easy to start, manage and provide topic knowledge on a data import, so that we can cooperate with external data partners and other subject experts and capture their knowledge. There is an easy way to extract databases from external sites if they do not offer spreadsheet downloads.

Resources needed

Task | Type | Phabricator task
An easy, well-documented way to format data so that it is usable by the tools we have for importing data into Wikidata | Documentation |
The Data Import Hub becomes easier to use and makes it easier to request data imports | Documentation, Tools |
Store the data imported as part of the import process so it can be referred back to | Documentation, Tools |
Get user feedback to improve the process | Community communication, Documentation |
Find ways to capture lessons learned in importing data | Documentation, Community communication |
Plan the steps needed to turn the Data Import Hub into a tool, and/or gather ideas about existing tools we may be able to use in the meantime (e.g. a Phabricator task for each data import) | Documentation |
An easy way for non-technical people to extract databases from external websites | Tools, Documentation |
A way to increase the speed of assessing property proposals | Documentation, Community communication |
Guidance on converting unstructured data and data held on web pages, PDFs etc. into structured data usable by Wikidata. | Documentation, Tools |
Make it easier to have conversations about data imports, using Structured Discussion. | Documentation, Tools |
A maintenance section on the Data Import Hub which records expected future updates and breaks down who will do each task, including recurring tasks | Documentation, Tools |
Formalise the process for choosing a data model, with types of model for different types of datasets, e.g. a dataset of paintings | Documentation |



4. Matching data to existing data on Wikidata

Current situation

Tools exist to match data to existing items in Wikidata, but automatic matching is sometimes wrong.

Existing resources

Goals

Matching data to existing Wikidata items is faster and easier. If an external database uses other external identifiers (e.g. ISO codes), these can be used to match the data.
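Matching on a shared external identifier could look roughly like the sketch below. It assumes the Wikidata side has already been exported as (item ID, identifier value) pairs, for example from a query, and that the external data is a list of rows; the function name `match_by_identifier` and the sample data are hypothetical and not taken from any existing tool.

```python
from collections import defaultdict


def match_by_identifier(external_rows, wikidata_items, key):
    """Match external rows to Wikidata items that share an identifier value.

    external_rows:  list of dicts, each holding the identifier under `key`
    wikidata_items: list of (qid, identifier_value) pairs
    Returns (matched, ambiguous, unmatched), so that uncertain cases can be
    reviewed by a person instead of being imported automatically.
    """
    # Index Wikidata items by a normalised identifier value.
    index = defaultdict(list)
    for qid, value in wikidata_items:
        index[value.strip().upper()].append(qid)

    matched, ambiguous, unmatched = {}, {}, []
    for row in external_rows:
        value = row[key].strip().upper()
        qids = index.get(value, [])
        if len(qids) == 1:
            matched[value] = qids[0]      # exactly one candidate: safe match
        elif len(qids) > 1:
            ambiguous[value] = qids       # several items share the identifier
        else:
            unmatched.append(value)       # no item yet: candidate for creation
    return matched, ambiguous, unmatched


# Sample data for illustration: ISO 3166-1 alpha-2 codes.
rows = [{"iso": "de"}, {"iso": "FR "}, {"iso": "XX"}]
items = [("Q183", "DE"), ("Q142", "FR")]
matched, ambiguous, unmatched = match_by_identifier(rows, items, "iso")
print(matched, ambiguous, unmatched)
```

Keeping ambiguous and unmatched values in separate lists reflects the point made under the current situation: automatic matching is sometimes wrong, so only unique matches are accepted without review.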

Resources needed

  • More tips for matching data
  • Create documentation for the new Mix'n'match import, including a video
  • Mix'n'match game mode blank items issue
  • Fancy matching for external database IDs already in Wikidata

5. Data importing

Current situation


There is a series of incomplete resources to help people learn how to import data into Wikidata. Tools require a high level of technical ability and have limited documentation. Bot requests often take a long time to be assessed and carried out.

Existing resources

Goals

Significantly lowered technical barriers to uploading data. A semi-automated way of keeping existing imports synchronised with the external database. Any manual work needed is minimised where possible, clearly described and easy to contribute to.

Resources needed

Task | Type | Phabricator task
Better instructions | Documentation |
Better documentation around structuring data on Wikidata | Documentation |
Complete list of tools that are needed/useful for data imports, with detailed instructions on how to use each one. | Documentation |
Improve QuickStatements documentation | Documentation |
Improve Mix'n'match documentation | Documentation |
Improve Wikipedia and Wikidata tools (add-on for Google Sheets) documentation | Documentation |
Improve Primary sources tool (how to add new datasets) documentation | Documentation |
Improve APIs and code libraries documentation | Documentation |
Document how to extract data from a PDF | Documentation |
Keep adding spreadsheet skills to Data Import Guide#Commonly used processes | Documentation |
Documentation on manual work needed on each dataset which is easy to find and goes into a central list of certain kinds of tasks | Documentation |
Develop a staging area where people can test their imports before ‘going live’ on Wikidata. This would ideally be a complete mirror of Wikidata, but with the additional data being tested for import showing up in the interface. Once you are happy it all looks good, you can click “Publish” to go ahead and update Wikidata for real. | Tools |
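As an illustration of what formatting data for an import tool involves, the sketch below turns spreadsheet-like rows into tab-separated QuickStatements (version 1) commands that create a new item, give it an English label and add one statement with a reference URL. The row layout, the function name `to_quickstatements` and the example URL are assumptions made for this example; the output should be checked against the QuickStatements help page before use.

```python
def to_quickstatements(rows, source_url):
    """Build QuickStatements V1 commands (tab-separated) for new items.

    rows: list of dicts with the hypothetical keys "label", "property", "value"
    source_url: URL added to every statement as a reference (S854 = reference URL)
    """
    lines = []
    for row in rows:
        label = row["label"]
        if '"' in label:
            # Quoting rules for embedded quotes are not handled in this sketch.
            raise ValueError(f"label contains a double quote: {label!r}")
        lines.append("CREATE")                       # start a new item
        lines.append(f'LAST\tLen\t"{label}"')        # English label on that item
        lines.append(
            f'LAST\t{row["property"]}\t{row["value"]}\tS854\t"{source_url}"'
        )
    return "\n".join(lines)


# Illustrative row only: one item that is an instance of (P31) human (Q5).
example = [{"label": "Example item", "property": "P31", "value": "Q5"}]
print(to_quickstatements(example, "https://example.org/data"))
```

Generating the commands as plain text means they can be reviewed, or pasted into a test environment, before anything is written to Wikidata.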

6. Maintain and improve data quality

Current situation

There is no guidance on maintaining data quality. It is left to each user to invent SPARQL queries to check the data, and there is no list of possible queries. Recently Magnus wrote the recent changes tool to show changes in the results of a SPARQL query. Knowing how to create and run SPARQL queries is required to check data quality.

Existing resources

Goals

Data quality is maintained and improved over time using a set of tools with a low technical barrier to use. Users can track changes in data that has been imported, to understand what has changed and to fix any errors introduced. Errors are less likely to be introduced in the first place. Data is integrated with information from other Wikimedia projects, e.g. by attaching a Commons category to the item. It is easy to find and repair vandalism. There is effective dispute resolution for disagreements about items. There are easy-to-use processes to assess and improve item quality.

Resources needed

Task | Type | Phabricator task
More Wikidata games and more people playing them to improve data quality and add new information. | Tools, Documentation |
A way to configure Wikidata games to only work on a specific subset (for example all items that have an external ID property of organisation X) | Tools |
Easy way to match Wikidata items with content on other Wikimedia projects. | Tools |
Some kind of guidance around common mistakes made with a particular kind of data so that the same mistakes don't happen repeatedly due to lack of knowledge on the subject. | Documentation |
Signed statements are added to items | Tools | T138708
Sample queries are available and used as a base for a more automated system to check data quality, e.g. multiple items with the same external ID; missing external IDs; items merged from a list; number of items with a given instance of statement, e.g. World Heritage site. | Documentation |
Tools to find and repair vandalism | Tools |
Easy-to-use tools to assess item quality | Tools |
A way to add additional references to imported statements (could be done through the Primary Sources Tool and StrepHit, or with OpenRefine). | Tools |
A way to add additional statements to items created or improved through a data import (could be done through the Primary Sources Tool and StrepHit, or with OpenRefine) | Tools |
Provide a way to get from a statement added to an item to the data import documentation, and from the data import documentation to the statements within the items. | Tools, Documentation |
Explore how ORES can help maintain and improve data quality | Tools, Documentation |
Data quality documentation and tools for post-import 'tidying up', e.g. finding duplicates using the Mix'n'match item search | Tools, Documentation |
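Three of the sample queries named in the table (multiple items with the same external ID, missing external IDs, and the number of items that are instances of a class) could be generated along the following lines. This is only a sketch: the function names are made up for this example, the queries rely on the prefixes that the Wikidata query service predefines (`wdt:`, `wd:`), and the IDs in the usage lines (P757 and Q9259, assumed here to be the World Heritage Site ID property and the World Heritage Site class) should be checked before use. The "items merged from list" check is not covered.

```python
import re


def _entity_id(value: str, prefix: str) -> str:
    """Accept only well-formed IDs such as P123 or Q456 before building a query."""
    if not re.fullmatch(rf"{prefix}[1-9][0-9]*", value):
        raise ValueError(f"expected an ID like {prefix}123, got {value!r}")
    return value


def duplicate_external_ids(prop: str) -> str:
    """Query for identifier values that are used on more than one item."""
    prop = _entity_id(prop, "P")
    return (
        "SELECT ?value (COUNT(?item) AS ?n) WHERE {\n"
        f"  ?item wdt:{prop} ?value .\n"
        "}\n"
        "GROUP BY ?value\n"
        "HAVING (COUNT(?item) > 1)"
    )


def missing_external_ids(class_qid: str, prop: str) -> str:
    """Query for instances of a class that lack the given identifier."""
    class_qid = _entity_id(class_qid, "Q")
    prop = _entity_id(prop, "P")
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        f"  FILTER NOT EXISTS {{ ?item wdt:{prop} ?value }}\n"
        "}"
    )


def count_instances(class_qid: str) -> str:
    """Query for the number of items that are instances of a class."""
    class_qid = _entity_id(class_qid, "Q")
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?n) WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        "}"
    )


# Assumed example IDs; verify them on Wikidata before running the queries.
print(duplicate_external_ids("P757"))
print(missing_external_ids("Q9259", "P757"))
print(count_instances("Q9259"))
```

Parameterising the queries by property and class is one way a shared list of checks could be reused across imports without each user writing SPARQL from scratch.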

7. Usage of data on other Wikimedia projects

Current situation

On some other Wikimedia projects there is a lack of understanding of and trust in Wikidata, and a lack of a feeling of control and agency over it, which is preventing integration, especially on English Wikipedia. Some resources are available but many are missing or incomplete. There are incomplete instructions on how to get started contributing to Wikidata.

Existing resources

Goals

Trust among contributors to other Wikimedia projects is much higher. Other Wikimedia contributors contribute to Wikidata more easily and more frequently. Other Wikimedia projects use Wikidata widely and gain value from doing so.

Resources needed

Task | Type | Phabricator task
Documentation to make it easier for people, including existing Wikimedia contributors, to contribute to Wikidata. | Documentation |
Fix Wikidata Tours | Tools |
Create Wikidata Tours for References, Qualifiers | Documentation |
Making it easier to see changes in the data shown on other Wikimedia projects within their existing processes, e.g. watchlists | Tools |
Providing documentation on what other Wikimedia projects use Wikidata data for and how to reuse data | Documentation |
Instructions for creating Wikidata-fed content on other Wikimedia projects, e.g. infoboxes | Documentation |
Make query instructions better | Documentation |
Be consistent in how we mark statements that use Wikipedia as a reference, so that Wikipedias can decide against using that data in their infoboxes (Wikipedia does not accept Wikipedia as a source), possibly partially resolved through this | Tools, Documentation |
Find some way to deal with data about people which is acceptable to other Wikimedia projects, possibly partially resolved by Wikidata:Living_persons_(draft). | Tools, Documentation, Community communication |
Resolve issues with Wikipedia watchlists showing changes to Wikidata data that is used on a page. | Tools |
Show the value of Wikidata-fed visualisations on different language Wikipedias | Tools, Documentation, Community communication |
Provide clear instructions, with examples, on how to construct Wikidata-fed infoboxes for other Wikimedia projects. | Documentation |
Help other Wikimedia projects to find gaps in coverage of a subject, missing or incorrect categorisation, missing or incorrect uses of templates etc. using the data imported, for example with Listeria lists such as List of Royal Academicians. | Documentation, Tools |

8. Reporting on data usage

Current situation

There is currently no documentation on how to do this, and it is unclear whether additional tools are needed.

Existing resources

  • SPARQL service (but no instructions or examples)

Goals

To be able to generate a report on the data added and where it is used across Wikimedia projects. This will be especially useful for partner organisations who want to understand how widely their data is used, including in languages they usually do not reach.

Resources needed

Task | Type | Phabricator task
In the long term we should have a single tool that allows you to generate all available metrics, but the initial goal should be to gather together all of the existing resources and add them to the data partnership pages. A new "metrics" chapter should be added so that we have a place to put these resources, and some instructions on how to use them. Metrics: what facts come from a certain source; where they are used across Wikimedia projects; how many people see those pages on Wikimedia projects. | Tools |
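For the first metric (what facts come from a certain source), one possible starting point is a query that counts statements whose reference cites a given source item through "stated in" (P248), grouped by property. The sketch below only builds the query text; the function name is hypothetical, the approach assumes imports consistently reference the source item with P248, and on large sources the query may need to be narrowed to avoid timeouts on the query service.

```python
import re


def statements_by_source(source_qid: str) -> str:
    """Build a query counting statements per property that cite the source item."""
    if not re.fullmatch(r"Q[1-9][0-9]*", source_qid):
        raise ValueError(f"expected an item ID like Q123, got {source_qid!r}")
    return (
        "SELECT ?property (COUNT(?statement) AS ?n) WHERE {\n"
        "  ?statement prov:wasDerivedFrom ?ref .\n"     # statement has a reference
        f"  ?ref pr:P248 wd:{source_qid} .\n"           # reference is 'stated in' the source
        "  ?item ?p ?statement .\n"                     # item that carries the statement
        "  ?property wikibase:claim ?p .\n"             # map predicate back to its property
        "}\n"
        "GROUP BY ?property\n"
        "ORDER BY DESC(?n)"
    )


# "Q123" is a placeholder; replace it with the item for the partner's database.
print(statements_by_source("Q123"))
```

The other two metrics (usage across Wikimedia projects and page views) are not covered here, since they depend on data outside the query service.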