Wikidata:Requests for comment/Mapping and improving the data import process

An editor has requested the community to provide input on "Mapping and improving the data import process" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.

If you have an opinion regarding this issue, feel free to comment below. Thank you!

What would the ideal data import process look like? This page maps the current data import process, including tools and documentation: what is available and what is still needed to create an ideal process. Once this information has been mapped, it can be broken down into Phabricator tasks to make progress easy to track.

We want to get as many community members as possible involved in these early stages of planning. The end goal is a well-structured task list showing the improvements needed to make the data import process as good as possible, which we can turn into a set of Phabricator tasks.

This discussion follows on from the session held at WikidataCon 2017, which aimed to gather feedback from the community about problems/difficulties with the current data import flow.


Phabricator task structure

A suggested structure for the Phabricator task list is a single root project called Data partnerships process, with three main subprojects:

  1. Documentation
  2. Tools
  3. Community communication

Import process overview

The process of importing external datasets into Wikidata can be broken down into the following steps; a single tool or piece of documentation may cover two or more of them:

  1. Provide information about Wikidata to data partners
  2. Identify data to be imported
  3. Plan and record data imports
  4. Match data to existing data on Wikidata
  5. Import data
  6. Maintain and improve data quality
  7. Usage of data on other Wikimedia projects
  8. Reporting on data usage

1. Provide information about Wikidata to data partners

Current situation

Some resources for data partners exist, but they are limited; there are no step-by-step instructions for publishing open data or for getting the data imported into Wikidata. The laws around copyright and other data rights are complicated and not well understood by many Wikidata contributors and data partners alike.

Existing resources

Goals

Data producers have a good understanding of the purpose and potential of contributing to and using Wikidata. Many organisations use Wikidata as an authority control to make their data Linked Open Data. Information is available for the different audiences interested in Wikidata, e.g. data science students, librarians, museums, galleries and archives. There is an easier-to-understand and more attractive way to explore data on Wikidata.

Resources needed

  • Detailed case studies of successful data imports, including a section on the impact for the organisation.
  • Clear and detailed instructions for reusing identifiers from Wikidata.
  • A way for data partners to get consistent and professional support for working with Wikidata.
  • Information in people's own language, so they do not have to speak English to work with Wikidata.
  • Some kind of visual way of exploring Wikidata items, possibly as described in the 2017 Community Wishlist.

2. Identify data to be imported

Current situation

There is very little coordination of data imports on a specific topic, and few records of what has and has not been imported on a subject. It is difficult for an external organisation to find out whether its data is within the scope of Wikidata.

Existing resources

Goals

A place to map all the data available on a subject and systematically import it into Wikidata, so we know what is available and what is missing, including across multiple external databases. For example, to have a worldwide list of built heritage on Wikidata (very useful for Wiki Loves Monuments), all national and regional heritage registers would need to be imported.
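
A concrete way to gauge what is already covered is to ask the Wikidata Query Service how many items carry the relevant external-ID property. The sketch below is a minimal example, assuming the Python requests library; P359 (Rijksmonument ID, the identifier of the Dutch national heritage register) is just one example of such a property.

```python
# Sketch: gauge existing coverage of a dataset on Wikidata before planning
# an import, by counting items that already carry a given external-ID property.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def count_items_with_property(property_id: str) -> int:
    """Return how many Wikidata items have at least one value for property_id."""
    query = f"""
    SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {{
      ?item wdt:{property_id} ?value .
    }}
    """
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "data-import-coverage-check/0.1 (example)"},
    )
    response.raise_for_status()
    return int(response.json()["results"]["bindings"][0]["count"]["value"])

# Example: Rijksmonument ID (P359) -- how much of the Dutch heritage
# register is already on Wikidata?
print(count_items_with_property("P359"))
```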

Resources needed

  • More consistent use of the Data Import Hub, so we have a record that can be searched to identify whether someone is already working on importing a dataset (and which Wikidata users/organisations are involved)
  • System for tagging/categorising data imports.
  • A WikiProject or task force that can be the first contact point for external organisations that need to find out which parts (if any) of their dataset are notable enough for Wikidata. Ideally we also need a non-wiki alternative for this, like a mailing list address or even a person who can be contacted by phone/Skype.

3. Plan and record data imports

Current situation

A first version of a central record of all imported datasets exists (the Data Import Hub), but it isn't used. The Data Import Hub has a notes section, a place to plan the structure of data imports and a place for users to collaborate on data imports.

The current system should be OK for a while, but it will become very difficult to manage as the number of entries increases. It is also not easy for someone without knowledge of wikitext to start and edit a Data Import Hub entry.

Property proposals often take a long time to be accepted or rejected.

Existing resources

Goals

A central recording mechanism for all external datasets imported into Wikidata. It is easy to start, manage and provide topic knowledge on a data import, so we can cooperate and capture the knowledge of external data partners and other subject experts. There is an easy way to extract databases from external sites if they do not offer spreadsheet downloads.
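
On the last point, a few lines of scripting can often turn a published HTML table into a spreadsheet-friendly file. A minimal sketch, assuming the Python pandas library (with lxml or html5lib installed); the URL, table position and output filename are hypothetical.

```python
# Sketch: pull an HTML table from a data partner's website into a CSV that
# import tools can consume. Only works on well-formed <table> markup.
import pandas as pd

# Hypothetical source page; replace with the data partner's real URL.
URL = "https://example.org/heritage-register"

tables = pd.read_html(URL)   # one DataFrame per <table> element on the page
register = tables[0]         # assume the first table holds the register
register.to_csv("heritage_register.csv", index=False)
print(register.head())
```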

Resources needed

  • An easy, well-documented way to format data to get it to the point where it is usable by the tools we have for importing data into Wikidata
  • Make the Data Import Hub easier to use for recording and requesting data imports
  • Store data imported as part of the import process so it can be referred back to
  • Get user feedback to improve process
  • Find ways to capture lessons learned in importing data
  • Plan the steps needed to turn the Data Import Hub into a tool, and/or gather ideas about existing tools we may be able to use in the meantime (e.g. a Phabricator task for each data import)
  • An easy way for non-technical people to extract databases from external websites
  • A way to increase the speed of assessing property proposals
  • Guidance on converting unstructured data and data held on web pages, PDFs etc. into structured data usable by Wikidata.

4. Matching data to existing data on Wikidata

Current situation

Tools exist to match external data to existing Wikidata items, but automatic matching is sometimes wrong.

Existing resources

Goals

Matching of data to existing Wikidata items is faster and easier. If an external database uses other external identifiers, e.g. ISO codes, these can be used to match the data.
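
A minimal sketch of such identifier-based matching, assuming the Python requests library and a hypothetical countries.csv with an iso_code column; P297 is Wikidata's property for ISO 3166-1 alpha-2 country codes.

```python
# Sketch: match dataset rows to Wikidata items via a shared external
# identifier, here ISO 3166-1 alpha-2 codes (P297).
import csv
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def qid_for_identifier(property_id: str, value: str) -> str | None:
    """Return the QID of the item holding `value` for `property_id`, if any."""
    query = f'SELECT ?item WHERE {{ ?item wdt:{property_id} "{value}" . }} LIMIT 1'
    r = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "import-matcher/0.1 (example)"},
    )
    r.raise_for_status()
    bindings = r.json()["results"]["bindings"]
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1] if bindings else None

# countries.csv and its iso_code column are hypothetical inputs.
with open("countries.csv", newline="") as f:
    for row in csv.DictReader(f):
        qid = qid_for_identifier("P297", row["iso_code"])
        print(row["iso_code"], "->", qid or "no match")
```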

Resources needed

  • More tips for matching data
  • Create documentation for the new Mix'n'match import process, including a video
  • Fix the Mix'n'match Game mode blank items issue
  • Automated matching using external database IDs already present on Wikidata

5. Data importing

Current situation

A series of incomplete resources exists to help people learn how to import data into Wikidata. Tools require high technical ability and have limited documentation. Bot requests often take a long time to be assessed and completed.

Existing resources

Goals

Significantly lowered technical barriers to uploading data. A semi-automated way of keeping existing imports synchronised with the external database. Any manual work needed is minimised where possible, clearly described and easy to contribute to.

Resources needed

  • Better instructions
  • Better documentation around structuring data on Wikidata
  • Complete list of tools that are needed/useful for data imports, with detailed instructions on how to use each one.
    • QuickStatements (a command-generation sketch follows this list)
    • Mix'n'match
    • Wikipedia and Wikidata tools (addon for Google Sheets)
    • Primary sources tool (how to add new data sets)
    • APIs and code libraries
  • Document how to extract data from a PDF
  • Keep adding spreadsheet skills to Data Import Guide#Commonly used processes
  • Documentation of the manual work needed on each dataset, which is easy to find and feeds into a central list of certain kinds of tasks
  • Develop a staging area where people can test their imports before ‘going live’ on Wikidata. This would ideally be a complete mirror of Wikidata, but with the additional data being tested for import showing up in the interface. Once you are happy it all looks good, you can click “Publish” to go ahead and update Wikidata for real.
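
On the QuickStatements item above: a common low-tech route is to generate QuickStatements V1 commands from a matched spreadsheet, one tab-separated command (item, property, value) per line, then paste them into quickstatements.toolforge.org for review before running. The sketch below is illustrative only; the input columns and target QIDs are assumptions, not part of any real dataset.

```python
# Sketch: turn matched rows into QuickStatements V1 commands.
import csv

def to_quickstatements(row: dict) -> list[str]:
    """Build V1 commands for one matched row; target QIDs are placeholders."""
    qid = row["qid"]  # the Wikidata item matched in the previous step
    return [
        f"{qid}\tP31\tQ9259",                       # instance of (P31); example target QID
        f"{qid}\tP1435\t{row['designation_qid']}",  # heritage designation (P1435)
    ]

# matched_rows.csv with qid / designation_qid columns is a hypothetical input.
with open("matched_rows.csv", newline="") as f:
    for row in csv.DictReader(f):
        for command in to_quickstatements(row):
            print(command)  # paste the output into quickstatements.toolforge.org
```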

6. Maintain and improve data quality

Current situation

There is no guidance on maintaining data quality; it is left to each user to invent SPARQL queries to check the data, with no list of possible queries. Recently Magnus wrote a recent changes tool to show changes in the results of a SPARQL query. Knowing how to create and run SPARQL queries is required to check data quality.

Existing resources

Goals

Data quality is maintained and improved over time using a set of tools with a low technical barrier to use. Users can track changes in data that has been imported, to understand what has changed and to fix any errors introduced. Errors are less likely to be introduced in the first place. Data is integrated with information from other Wikimedia projects, e.g. attaching a Commons category to the item. It is easy to find and repair vandalism. There is effective dispute resolution for disagreements about items. There are easy-to-use processes to assess and improve item quality.

Resources needed

  • More Wikidata games and more people playing them to improve data quality and add new information.
  • A way to configure Wikidata games to only work on a specific subset (for example, all items that have an external ID property of organisation X)
  • Easy way to match Wikidata items with content on other Wikimedia projects.
  • Some kind of guidance around common mistakes made with a particular kind of data so that the same mistakes don't happen repeatedly due to lack of knowledge on the subject.
  • Signed statements are added to items
  • Sample queries are available and used as a base for a more automated system to check data quality (one such query is scripted after this list):
    • Multiple items with the same external ID
    • Missing external IDs
    • Items merged from list
    • Number of items with a statement instance of, e.g., World Heritage Site
  • Tools to find and repair vandalism
  • Easy to use tools to assess item quality
  • A way to add additional references to imported statements (could be done through the Primary Sources Tool and StrepHit).
  • A way to add additional statements to items created or improved through a data import (could be done through Primary Sources Tool and StrepHit)
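
As an illustration of the first sample query above ("multiple items with the same external ID"), here it is run as a small script against the public query service. A minimal sketch, assuming the Python requests library; P359 is again just an example property.

```python
# Sketch: find external-ID values shared by more than one item -- usually
# a sign of duplicate items that need merging.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?value (GROUP_CONCAT(STR(?item); separator=", ") AS ?items) WHERE {
  ?item wdt:P359 ?value .
}
GROUP BY ?value
HAVING (COUNT(?item) > 1)
LIMIT 50
"""

r = requests.get(SPARQL_ENDPOINT,
                 params={"query": QUERY, "format": "json"},
                 headers={"User-Agent": "duplicate-id-check/0.1 (example)"})
r.raise_for_status()
for binding in r.json()["results"]["bindings"]:
    print(binding["value"]["value"], "->", binding["items"]["value"])
```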

7. Usage of data on other Wikimedia projects

Current situation

There is a lack of understanding of, trust in, and feeling of control and agency over Wikidata on some other Wikimedia projects, which is preventing integration, especially on English Wikipedia. Some resources are available, but many are missing or incomplete. Instructions on how to get started contributing to Wikidata are incomplete.

Existing resources

Goals

Trust among contributors to other Wikimedia projects is much higher. Other Wikimedia contributors contribute to Wikidata more easily and more frequently. Other Wikimedia projects use Wikidata widely and gain value from doing so.

Resources needed

  • Documentation to make it easier for people including existing Wikimedia contributors to contribute to Wikidata.
  • Making it easier to see changes in the data shown on other Wikimedia projects within their existing processes, e.g. watchlists
  • Providing documentation on what other Wikimedia projects use Wikidata data for and how to reuse data
  • Instructions for creating Wikidata-fed content on other Wikimedia projects, e.g. infoboxes
  • Improve the query instructions
  • Have consistency in the way we mark when Wikipedia is used as a reference for a statement, so that Wikipedias can decide against using that data in their infoboxes (Wikipedia does not accept Wikipedia as a source), possibly partially resolved through this
  • Find some way to deal with data about people which is acceptable to other Wikimedia projects, possibly partially resolved by Wikidata:Living_persons_(draft).
  • Resolve issues with Wikipedia watch lists showing changes to Wikidata data that is used on a page.
  • Show the value of Wikidata-fed visualisations on different language Wikipedias
  • Provide clear instructions, with examples, on how to construct Wikidata-fed infoboxes for other Wikimedia projects.

8. Reporting on data usage

Current situation

There is currently no documentation on how to do this; it is unclear whether additional tools are needed.

Existing resources

  • SPARQL query service (but no instructions or examples)

Goals

  • To be able to generate a report on the data added and where it is used across Wikimedia projects. This will be especially useful for partner organisations who want to understand how widely their data is used, including in languages they do not usually reach.

Resources needed

In the long term we should have a single tool that can generate all available metrics, but the initial goal should be to gather together all of the existing resources and add them to the data partnership pages. A new "metrics" chapter should be added so that we have a place to put these resources, along with some instructions on how to use them. Reports should show (a scripted sketch of the first point follows the list):

  • What facts come from a certain source
  • Where they are used across Wikimedia projects
  • How many people see those pages on Wikimedia projects
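
On the first point, Wikidata's RDF model already makes this queryable: each referenced statement carries a prov:wasDerivedFrom link to a reference node, whose "stated in" (P248) value identifies the source. A minimal sketch, assuming the Python requests library; Q36578 (Integrated Authority File, GND) is only an example source.

```python
# Sketch: count statements whose reference says "stated in" (P248) a given
# source. On very large sources this may time out; restrict the query to a
# single property if needed.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
SOURCE_QID = "Q36578"  # example source: Integrated Authority File (GND)

QUERY = f"""
SELECT (COUNT(?statement) AS ?count) WHERE {{
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P248 wd:{SOURCE_QID} .
}}
"""

r = requests.get(SPARQL_ENDPOINT,
                 params={"query": QUERY, "format": "json"},
                 headers={"User-Agent": "source-usage-report/0.1 (example)"})
r.raise_for_status()
print(r.json()["results"]["bindings"][0]["count"]["value"],
      "statements cite this source")
```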