Wikidata:WikiProject Datasets/Data Structure/Recommendations on datasets that serve as sources for data ingestion in Wikidata


Data quality assessment

Before data is ingested into Wikidata, its quality should be assessed in order to decide whether the ingestion is worthwhile.

The data quality assessment can be done following the example given at: Ingesting data about historical monuments in the city of Zurich

Modeling and mapping the data

One of the main purposes of modeling datasets in Wikidata is to make it easier to state the source of a given statement. The following recommendations are therefore made for datasets that serve as sources for data ingestion in Wikidata.

Data catalogs and data sets should be structured and described following the W3C Data Catalog Vocabulary (DCAT). For every data set at least the following elements should be described:

Class Instance of (P31)
Dataset Dataset (Q1172284)
Distribution Digital distribution (Q269415)
Agent (publisher) Agent (Q24229398)
Concept (theme) Concept (Q151885)

Classification by «instance of» is not mandatory for agent and concept so existing items can be used instead. (Comment: What is meant by this sentence?)

Further recommendations on modeling and mapping:

  • Mappings for the data set and distribution classes are provided at: DCAT - Wikidata - Schema.org mapping. They are only a recommendation indicating to which entity a certain property should be mapped. Feel free to add additional properties if required. 
  • The data catalog is usually not modeled in Wikidata, although this is possible for big catalogs with many datasets.
  • Distributions:
    • A title is not modeled in DCAT, but every item in Wikidata should have one. Therefore, it is strongly suggested to add a title in the following format: "Distribution of {dataset title}, {month and year of release}" e.g. "Distribution of …, May 2017" 
    • Do create a link from every distribution to its data set using property «part of» (P361). 
    • The distribution – not the dataset – must be used as reference for the source property for ingested statements. This should be done with the property "imported from" (P143). (Comment: Questionable; needs verification).
  • Do use Wikidata items instead of simple textual values as often as possible. 
  • Do create some showcase items first. They help you understand the data and serve as good examples for the ingestion later on. (Note: Is this pertinent here? The sentence seems to refer to the data ingestion process as such, and not to the description of the data set)
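
The recommendations for distributions above can be sketched in a few lines of Python. This is only an illustration, assuming the commands are later pasted into the QuickStatements tool (V1 tab-separated syntax). The function names, the data set title and the item ID Q4115189 (used here purely as a placeholder for the data set item) are assumptions and not part of the recommendations.

```python
from datetime import date

# English month names are spelled out so the label does not depend on the system locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def distribution_label(dataset_title: str, released: date) -> str:
    """Build the suggested title: "Distribution of {dataset title}, {month and year of release}"."""
    return f"Distribution of {dataset_title}, {MONTHS[released.month - 1]} {released.year}"


def distribution_commands(dataset_qid: str, dataset_title: str, released: date) -> list[str]:
    """Return QuickStatements V1 lines that create a distribution item.

    dataset_qid is the item of the data set the distribution belongs to
    (a placeholder in the example below).
    """
    label = distribution_label(dataset_title, released)
    return [
        "CREATE",                       # new item for the distribution
        f'LAST\tLen\t"{label}"',        # English label in the suggested format
        "LAST\tP31\tQ269415",           # instance of: distribution class from the table above
        f"LAST\tP361\t{dataset_qid}",   # part of: link to the data set item
    ]


if __name__ == "__main__":
    # Placeholder title and item ID; replace with the real data set.
    for line in distribution_commands("Q4115189", "Example dataset", date(2017, 5, 1)):
        print(line)
```

The generated lines should be reviewed by hand before they are submitted, in particular the label and the data set item ID.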

Documentation of data ingestion

Data ingestion into Wikidata should always be comprehensible and reproducible. It is therefore suggested to create the following documentation artefacts for every ingestion. A good place to do so is a case report about the ingestion in a WikiProject. To point to this documentation, it is suggested to create a statement on the data set item using property URL (P2699) with qualifier «of» (P642) and value ingestion (Q1663054).
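
As a minimal sketch, the suggested pointer to the documentation could be written as a single QuickStatements command (V1 tab-separated syntax). The function name, the item ID and the URL in the usage line are placeholders chosen for illustration only.

```python
def documentation_statement(dataset_qid: str, report_url: str) -> str:
    """Return one QuickStatements V1 line for the data set item:
    URL (P2699) = report_url, qualified with of (P642) = ingestion (Q1663054).
    """
    return "\t".join([
        dataset_qid,          # the data set item (placeholder in the example)
        "P2699",              # URL
        f'"{report_url}"',    # URL values are written as quoted strings
        "P642",               # qualifier: of
        "Q1663054",           # ingestion
    ])


if __name__ == "__main__":
    # Placeholder item ID and URL; replace with the data set item and the case report page.
    print(documentation_statement("Q4115189", "https://example.org/case-report"))
```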

The steps the artefacts are mapped to are taken from A practical beginners user-guideline for ingesting datasets into Wikidata.

Artefact Remarks
Step 1: Goals, data structure and data quality
Diagram of the source data structure Helps understand the data to be ingested
Quality baseline Baseline that must be reached in the quality assessment before ingestion of the data actually starts
Quality assessment Results of the quality assessment
Step 2: Mapping
Mapping for data rows (for each data set) A mapping for each data element in the source data
Step 3: Data formats and data cleansing
Helper tables e.g. tables used with Reconcile, stored as CSV files
Cleanup steps Exact steps with which the data cleansing can be reproduced
Step 4: Unique identifier
Unique identifier Which field is used as unique identifier or how exactly are unique identifiers generated?
Step 5: Mapping to existing data
SPARQL queries SPARQL queries used to identify existing data
Step 6: Model the data source in Wikidata
Mapping for data catalog only if a data catalog is created
Mapping for data set(s) A table with source property, DCAT property and representation within Wikidata
Mapping for distribution(s) A table with source property, DCAT property and representation within Wikidata
Link to data catalog item only if a data catalog is created
Link to data set item(s) Link pointing to the Wikidata item
Link to distribution item(s) Link pointing to the Wikidata item
Step 7: Clean up existing data on Wikidata
List of cleanup actions done A list showing every cleanup action that is applied to existing data
Step 8: Ingest the data
Steps for ingestion Detailed steps describing how items are ingested
Help files e.g. mail merge office files used to generate commands for the QuickStatements tool
Step 9: Visualize the data
SPARQL queries Additional SPARQL queries or a reference to step 5
Step 10: Case report
Further explanations Further information explaining artefacts stated above
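
As an alternative to the mail merge files mentioned under step 8, the commands could also be generated with a short script, which is easy to keep with the case report. The following is only a sketch under stated assumptions: the CSV column names (item, property, value), the function name and every item and property ID in the sample are placeholders, and the output is QuickStatements V1 syntax, in which a reference property is written with an S prefix (S143 for "imported from" (P143)).

```python
import csv
import io


def rows_to_quickstatements(csv_text: str, distribution_qid: str) -> list[str]:
    """Turn cleaned source rows into QuickStatements V1 lines.

    Each row becomes one statement, referenced with "imported from" (S143)
    pointing at the distribution item rather than the data set item.
    """
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        commands.append("\t".join([
            row["item"],        # existing item the statement is added to
            row["property"],    # property taken from the mapping of step 2
            row["value"],       # mapped value (here an item ID)
            "S143",             # reference: imported from
            distribution_qid,   # the distribution item modeled in step 6
        ]))
    return commands


# Placeholder sample; real rows come from the cleaned source data.
SAMPLE = "item,property,value\nQ4115189,P369,Q13406268\n"

if __name__ == "__main__":
    for line in rows_to_quickstatements(SAMPLE, "Q15397819"):
        print(line)
```

Keeping the script and its input file with the documentation makes the ingestion reproducible in the sense described above.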