Wikidata:WikiProject Datasets/Data Structure/Recommendations on datasets that serve as sources for data ingestion in Wikidata
Data quality assessment
Before starting to ingest data into Wikidata, its quality should be assessed in order to decide whether it is actually worthwhile to ingest the data.
The data quality assessment can be done following the example given at: Ingesting data about historical monuments in the city of Zurich.
Modeling and mapping the data
One of the main purposes of modeling datasets in Wikidata is to make it easier to state the source of a given statement. Therefore, the following recommendations are made for datasets that serve as sources for data ingestion in Wikidata.
Data catalogs and data sets should be structured and described following the W3C Data Catalog Vocabulary (DCAT). For every data set, at least the following elements should be described:
Class | Instance of (P31) |
---|---|
Dataset | Dataset (Q1172284) |
Distribution | digital distribution (Q269415) |
Agent (publisher) | agent (Q24229398) |
Concept (topic) | concept (Q151885) |
Classification by «instance of» is not mandatory for agents and concepts: existing Wikidata items (e.g. an organization item for the publisher) can be reused as they are.
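Assuming the class mapping in the table above, item creation can be expressed as QuickStatements v1 commands. A minimal sketch; the Q-ids come from the table, while the labels and the helper name are illustrative:

```python
# Sketch: emit QuickStatements v1 commands that create a new item
# for a DCAT class and set its "instance of" (P31) statement.
# Q-ids are taken from the table above; labels are illustrative only.

DCAT_CLASS_TO_P31 = {
    "Dataset": "Q1172284",       # data set
    "Distribution": "Q269415",   # digital distribution
    "Agent": "Q24229398",        # agent; «instance of» optional, see note above
    "Concept": "Q151885",        # concept; «instance of» optional, see note above
}

def create_item_commands(dcat_class: str, english_label: str) -> list[str]:
    """Return QuickStatements v1 lines creating an item of the given class."""
    qid = DCAT_CLASS_TO_P31[dcat_class]
    return [
        "CREATE",
        f'LAST\tLen\t"{english_label}"',  # English label
        f"LAST\tP31\t{qid}",              # instance of
    ]

for line in create_item_commands("Dataset", "Example dataset"):
    print(line)
```

The commands can be pasted into the QuickStatements tool as a batch; `LAST` refers to the item created by the preceding `CREATE` line.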
Further recommendations on modeling and mapping:
- Mappings for the data set and distribution classes are provided at: DCAT - Wikidata - Schema.org mapping. They are only a recommendation indicating to which entity a certain property should be mapped. Feel free to add additional properties if required.
- The data catalog is usually not modeled in Wikidata (though this is possible for big catalogs with many datasets).
- Distributions:
- A title is not modeled in DCAT, but every item in Wikidata should have one. Therefore, it is strongly suggested to add a title in the following format: "Distribution of {dataset title}, {month and year of release}" e.g. "Distribution of …, May 2017"
- Do create a link from every distribution to its data set using property «part of» (P361).
- The distribution – not the dataset – must be used as reference for the source property for ingested statements. This should be done with the property "imported from" (P143). (Comment: Questionable; needs verification).
- Do use Wikidata items instead of simple textual values as often as possible.
- Do create some showcase items first. They help you understand the data and serve as good examples for the later ingestion. (Note: this seems to refer to the data ingestion process as such rather than to the description of the data set.)
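Two of the recommendations above, the suggested distribution title format and the «part of» (P361) link, can be sketched in a few lines. The function names and Q-ids below are illustrative placeholders:

```python
# Sketch: build the suggested distribution title
# ("Distribution of {dataset title}, {month and year of release}")
# and the QuickStatements v1 line linking a distribution to its
# data set via "part of" (P361). Q-ids here are placeholders.
from datetime import date

def distribution_title(dataset_title: str, release: date) -> str:
    """Suggested title format: 'Distribution of {dataset title}, {month year}'."""
    return f"Distribution of {dataset_title}, {release.strftime('%B %Y')}"

def part_of_command(distribution_qid: str, dataset_qid: str) -> str:
    """QuickStatements v1 line: distribution 'part of' (P361) data set."""
    return f"{distribution_qid}\tP361\t{dataset_qid}"

print(distribution_title("Historical monuments in Zurich", date(2017, 5, 1)))
# "Distribution of Historical monuments in Zurich, May 2017"
```

Generating the title from the release date keeps distribution labels consistent across monthly releases of the same data set.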
Documentation of data ingestion
Data ingestion into Wikidata should always be comprehensible and reproducible. Therefore, it is suggested to create the following documentation artefacts for every ingestion done. A good place for this is a case report about the ingestion in a WikiProject. It is suggested to create a statement on the data set item using the property URL (P2699) with the qualifier «of» (P642) and the value ingestion (Q1663054) to point to this documentation.
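That documentation-pointer statement can be written as a single QuickStatements v1 command. A sketch; the data set Q-id (here the Wikidata sandbox item) and the URL are stand-ins:

```python
# Sketch: QuickStatements v1 line adding URL (P2699) with the qualifier
# "of" (P642) = ingestion (Q1663054) to a data set item, pointing to the
# ingestion documentation. DATASET_QID and DOC_URL are placeholders.
DATASET_QID = "Q4115189"  # Wikidata sandbox item, used here as a stand-in
DOC_URL = "https://www.wikidata.org/wiki/Wikidata:WikiProject_Example"

command = f'{DATASET_QID}\tP2699\t"{DOC_URL}"\tP642\tQ1663054'
print(command)
```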
The steps the artefacts are mapped to are taken from A practical beginners user-guideline for ingesting datasets into Wikidata.
Artefact | Remarks |
---|---|
Step 1: Goals, data structure and data quality | |
diagram of the source data structure | Helps understand the data to be ingested |
Quality baseline | Quality baseline which must be reached in quality assessment to actually start ingesting the data |
Quality assessment | Results of the quality assessment |
Step 2: Mapping | |
Mapping for data rows (for each data set) | A mapping for each data element in the source data |
Step 3: Data formats and data cleansing | |
Helper tables | e.g. tables used with Reconcile, for instance as CSV files |
Cleanup steps | The exact steps with which the data cleansing can be reproduced |
Step 4: Unique identifier | |
Unique identifier | Which field is used as unique identifier or how exactly are unique identifiers generated? |
Step 5: Mapping to existing data | |
SPARQL queries | SPARQL queries used to identify existing data |
Step 6: Model the data source in Wikidata | |
Mapping for data catalog | only if a data catalog is created |
Mapping for data set(s) | A table with source property, DCAT property and representation within Wikidata |
Mapping for distribution(s) | A table with source property, DCAT property and representation within Wikidata |
Link to data catalog item | only if a data catalog is created |
Link to data set item(s) | Link pointing to the Wikidata item |
Link to distribution item(s) | Link pointing to the Wikidata item |
Step 7: Clean up existing data on Wikidata | |
List of cleanup actions done | A list showing every cleanup action that is applied to existing data |
Step 8: Ingest the data | |
Steps for ingestion | Detailed steps describing how the items are ingested |
Help files | e.g. mail-merge office files used to generate commands for the QuickStatements tool |
Step 9: Visualize the data | |
SPARQL queries | Additional SPARQL queries or a reference to step 5 |
Step 10: Case report | |
Further explanations | Further information explaining artefacts stated above |
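The mail-merge help files mentioned in step 8 can equally be replaced by a small script that turns tabular source data into QuickStatements commands, with the distribution item as the «imported from» (P143) reference recommended above. A sketch; the column names, the example property (P2093, a string-datatype property, chosen only for illustration), and the reference Q-id are placeholders:

```python
# Sketch: generate QuickStatements v1 commands from tabular source data,
# one statement per row, each referenced with "imported from" (P143)
# pointing to the distribution item (S143 is the source-property form).
# Column names, target property, and Q-ids are placeholders; the value is
# emitted as a quoted string, which assumes a string-datatype property.
import csv
import io

def rows_to_quickstatements(csv_text: str, prop: str, distribution_qid: str) -> list[str]:
    """One command per row: item, property, value, reference to the distribution."""
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        commands.append(
            f'{row["qid"]}\t{prop}\t"{row["value"]}"\tS143\t{distribution_qid}'
        )
    return commands

sample = "qid,value\nQ111,Example A\nQ222,Example B\n"
for cmd in rows_to_quickstatements(sample, "P2093", "Q4115189"):
    print(cmd)
```

Keeping the script alongside the cleaned CSV files in the case report makes the ingestion reproducible, as required by the introduction to this section.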