Wikidata talk:WikiProject Datasets

Using Wikidata to describe datasets[edit]

Today in the AKTS workshop at the eScience conference in San Diego, I will present the idea of using Wikidata to describe datasets in the rare case that Schema.org does not work. This would allow to include these datasets in dataset catalogs (such as Google Dataset Search) in a collaborative manner and that is beneficial to everyone. This relies on the dataset communities and the Wikidata communities to agree with that approach.

I will open the discussion with the dataset communities at the workshop, and I wanted to open the discussion with the Wikidata community here. I am happy to answer questions and discuss the proposal. --Denny (talk) 14:15, 24 September 2019 (UTC)[reply]

Thanks for this, @Denny: I guess we should also maintain data about the datasets that have been ingested into Wikidata; especially if updates are to be expected in the future. - Any thoughts on this? --Beat Estermann (talk) 14:22, 24 September 2019 (UTC)[reply]

Thanks, @Beat Estermann:. Yes, I totally agree. And I expect that the description of datasets ingested and the description of datasets for external dataset catalogs should align. I guess we would need some additional metadata for ingested datasets, such as "last ingested" or "up to date as of" or something like that? Do we have any examples of that already? --Denny (talk) 17:43, 25 September 2019 (UTC)[reply]

We had this some time ago, but I don't think it really got active. Good idea to relaunch this. --- Jura 14:18, 24 September 2019 (UTC)[reply]

Thanks for the pointer! --Denny (talk) 17:46, 25 September 2019 (UTC)[reply]

Great to see new activity in this space! --Daniel Mietchen (talk) 20:52, 26 September 2019 (UTC)[reply]

Very interesting idea! Some comments/questions:

First of all, I would be interested to see how you have extracted the datasets that are already available in Wikidata (and the numbers reported in the manuscript). I suppose it would probably be something like SELECT DISTINCT ?item WHERE { ?item wdt:P31/wdt:P279* wd:Q1172284 } (27459 results currently), but I am actually not totally sure. If so, are we sure that the subclass hierarchy is sane? (I already see some inconsistencies…)
Then I ask myself whether a distinction between databases and datasets matters here.
- From my point of view (physicist, not computer scientist), datasets are more like an individual file (CSV/RDF/JSON/Excel/some sort of structured plain text/whatever…) that can be downloaded, and which contains some sort of consolidated and consistent data. A dataset can be an extract from a database, but that is not necessarily the case. Usually, datasets can easily be downloaded (~1 request), while entire databases often cannot—one had to crawl all individual entries or so to gather "all data" from a database.
- In that setting, an individual Wikidata item would be a dataset, a WDQS result would also be a dataset, but entire Wikidata would be a database.
- From my experience, here at Wikidata we look at sources for data imports on a database-level, not on a dataset-level. We set up Wikidata properties with external-id type for an entire database, and link to individual datasets in that database via identifiers. For anything that should simplify data imports to Wikidata, a standardized description of databases would be more valuable than a description of datasets. Correct so far?

—MisterSynergy (talk) 19:10, 25 September 2019 (UTC)[reply]

So far, I have naively kept to the terminology employed by DCAT (see also: DCAT-AP). This standard distinguishes between dcat:Dataset and dcat:Distribution. What you designate as "database" would qualify as dcat:Dataset (it may for example take the form of a relational database or a single table). The specific available form of dcat:Dataset is called dcat:Distribution (an API, or a CSV, RDF, JSON, Excel file, etc.) that may contain the whole or part of the respective dcat:Dataset. There is a third class defined by DCAT – dcat:Catalog. "A data catalog is a curated collection of metadata about datasets". And this is where it becomes messy when we try to transpose this terminology to Wikidata:

Following the logic of DCAT, Wikidata constitutes one dcat:Dataset.
The Wikidata Query Service (SPARQL endpoint) constitutes one dcat:Distribution of Wikidata.
Any extract of Wikidata made available in any format would qualify as a dcat:Distribution as well.
The subset of Wikidata that represents a "curated collection of metadata about datasets" would qualify as dcat:Catalog.

In my view, this raises two fundamental conceptual issues:

First, an issue that may become more and more common as DCAT enters the linked data world: Where does the "metadata" end, and where does the actual "data" begin? - The big advantage of linked data is that it uses the same data space to describe the metadata and the actual data. But where do we draw the line between them? Do we have to?
Second, in the case of Wikidata, we need to either accept that Wikidata is both a huge dcat:Catalog and a co-extensive huge dcat:Dataset (mixing metadata and data), or we need to allow for dcat:Datasets being nested: One subset of Wikidata constitutes a dcat:Catalog, which in turn would qualify as a dcat:Dataset. As a consequence, we would need to accept the idea that the huge dcat:Dataset Wikidata contains many smaller dcat:Datasets - but how in the world will we delimit them? By defining SPARQL queries that capture them? But this tends to be completely arbitrary...

Any thoughts on this? Is anybody here involved in SEMIC, the European Semantic Interoperability Community, or in the development of the next major release of DCAT-AP? It would be interesting to hear how they are planning to deal with this kind of issues. --Beat Estermann (talk) 06:55, 26 September 2019 (UTC)[reply]

External identifiers[edit]

Hi All,

I wrote a proposal about expanding the statements regarding external identifiers, please have a look and add your opinions: https://www.wikidata.org/wiki/Wikidata:Project_chat#External_Identifiers_-_expanding_statements,_best_practice --Adam Harangozó (talk) 15:31, 1 November 2019 (UTC)[reply]

Infobox for Wikidata in French[edit]

I've developed a wikidata-based Infobox for Wikipedia in French for datasets (Infobox Jeu de données).

I've described the main properties I use in the Infobox on this page : Wikidata:WikiProject Datasets/Data structure.

If you have any remarks about this, feel free to open the debate here.

Do you know any other Infobox for datasets? PAC2 (talk) 07:28, 13 September 2020 (UTC)[reply]

Data models for datasets[edit]

Datasheets for Datasets (Q60487752)

This paper applies the model of the data sheet (Q1172383) to datasets.

Nutritional Labels for Data and Models (Q105552296)

This paper applies the model of the nutrition facts label (Q1531970) to datasets.

Both of these papers could guide the development of how we model datasets here on Wikidata. They also present a philosophy of the dataset. I am excited about the entire concept of packaging dataset metadata as media worth discussing and considering. Blue Rasberry (talk) 00:10, 18 February 2021 (UTC)[reply]

Canon of datasets[edit]

When students study humanities there are some works which are canon, and which students of the discipline in various places over time will all study and know. Teachers compile their canons of art, literature, case studies, and precedents.

What are the canonical datasets, and what are their characteristics? I am considering which datasets to model as examples in this wikiproject. Characteristics of the datasets I want include the following: The ideal dataset is...

free and open
large and diverse enough for use in student exercises including machine learning
the subject of academic publication in the humanities, considering the data for social significance
relevant or meaningful across language and cultural barriers
promotes neutral public benefit, as compared to a dataset which presents a corporate brand
easy to understand and appreciate even among people who know nothing of datasets

I asked some people for their suggestions and here is what I have for a start.

perhaps something from list of datasets for machine learning research (Q23038294)
The R Datasets Package
the dataset behind the software COMPAS (Q48811072), which seeks to support police in predicting crime
r-project.org packages
Iris flower data set (Q4203254), a flower dataset
Passengers of the RMS Titanic (Q462554), on kaggle
MNIST database (Q17069496), handwritten numbers
the 1990s Boston Housing Data
ImageNet (Q24901201)

Besides these miscellaneous suggestions, there are lots of websites where amateurs and professionals have listed their favorite datasets for teaching various skills related to data.

In addition to the above criteria, I am also hoping to somehow present datasets which represent some culturally significant organizations. Organizations that I have in mind are Wikipedia and Wikidata itself, Internet Archive, and OpenStreetMap, because all three of these have missions to provide all media in their domain digitally for free to everyone in the world and with community project management. If there is any other such project I would like to represent it here too, so that discussing the data can include discussing the ideas and values of accessible social movements which curate the data.

If anyone has ideas to share then thanks. After identifying some datasets the next step is modeling them out here with Wikidata properties then seeking some community comment. Blue Rasberry (talk) 02:58, 18 February 2021 (UTC)[reply]

Also, is a dataset different from a database? (Of which there may also be canonical examples - IMdB etc). Most of the modelling for dataset may also apply to database. Also, we may put items in eg class online database (Q7094076) and seek that to imply that they contain an accessible dataset (or collection of datasets) - we want the statement to stand for both thoughts, and may want the database be returned in a search for data. Similarly, statements may be referenced to an online database (Q7094076), being used synonymously with the dataset(s) it contains. Perhaps something to consider, and if we do make a distinction, then also give a model for database on the page, to clarify the nuance. Jheald (talk) 18:21, 22 February 2021 (UTC)[reply]

@Jheald: We could talk this through. In the case of Internet Movie Database (Q37312) (IMDB) there is the database which is their tool or platform or interface or container for holding and presenting their data. Check out their own terms of use - they say that their data is only for personal and noncommercial use. Since Wikidata only uses free and open data, that means it is not possible to import their dataset to here for remixing in Wikidata. There are three concepts to know here - IMDB the database which holds the data, the IMDB dataset which is their own metadata for movies on which they are claiming a copyright, and then there is the theoretical dataset of movie data which may be identical to the IMDB dataset but which somehow does not have copyright restrictions. There can be no copyright on facts like titles of movies or lists of actors, but there can be restrictions on getting access to the datasets containing those publicly known and copyright ineligible facts.

It would be tedious to make the distinction between database and dataset in all examples, and certainly corporate brands like IMDB would prefer that users assume that their product and the public data are indistinguishable and inseparable. It would not be in their interest if elsewhere people could access the public data on which they are laying a claim. At the same time, they have archives of public data which no one else has.

We can talk through these situations when we have some consensus about which datasets to model. I criticize IMDB, but movies are very popular, and maybe Wikidata should feature it as an example even though it is proprietary. Blue Rasberry (talk) 22:38, 22 February 2021 (UTC)[reply]

@Bluerasberry: I was merely giving IMdB as an example case, no endorsement or particular interest in it implied :-) But I think the thought that I was trying to get towards is that for many dataases, particularly the ones that are values of applicable 'stated in' value (P9073) and so cases of particular priority, we are lucky if we have one item for the thing; and people certainly don't seem to create separate items for the database and the data in it. So some though is needed as to how to handle that. Jheald (talk) 09:28, 23 February 2021 (UTC)[reply]

Considering some datasets[edit]

Sloan Digital Sky Survey (Q840332)

data access
teacher resources
CC-By image policy
download 12 terabytes of astronomical data
seemingly no statement about copyright claims on the data; cannot see whether they claim any of it is non-free. This might all be free.

Laser Interferometer Gravitational Wave Observatory (Q255371)

data access
PDF data management plan seems to suggest 10,000GB data and that *claims to be Open Archival Information System (Q1469973), where "open" may mean Open Definition (Q21605525) and be Wikimedia compatible

Protein Data Bank (Q766195)

something from National Institute of Standards and Technology (Q176691), perhaps from their AI project

two missions - one is meteorology, the other is security and explainability (seems like three missions?)
United States government, presumably free data?

Blue Rasberry (talk) 00:15, 23 February 2021 (UTC)[reply]

Request for ontology[edit]

Wikidata:Project_chat#Modelling_of_items_for_datasets_?

Thanks @Jheald: for bringing this to the main chat and outlining the issue there. Blue Rasberry (talk) 13:19, 22 February 2021 (UTC)[reply]

Copy/paste from Project Chat:

Question from twitter [1]:

Does #wikidata have an ontology for datasets, including content creators and curators?

I suggested the following, but I am pretty sure I missed some things (thread, unrolled):

Yes (apparently). The item for data set (Q1172284) says it's maintained by Wikidata:WikiProject_Datasets which has a page Wikidata:WikiProject_Datasets/Models
The wilkiproject also gives this, which is some sensible advice on importing Wikidata:WikiProject_Datasets/Data_Structure/Recommendations_on_datasets_that_serve_as_sources_for_data_ingestion_in_Wikidata
The Help:Sources page has this: Help:Sources#Databases, on how a wikidata statement can reference a database; though I would tend to use subject named as (P1810) instead of title (P1476) for how it appears there, also I would tend to give /either/ the database name /or/ its identifier prop, not both
Umbrella page Wikidata:Sources doesn't appear to have heard of datasets, or their wikiproject
A notable weakness in the style-guide from WikiProject datasets is that it doesn't seem to answer your original Q including who created it (creator (P170), probably; maybe collection creator (P6241))
nor who is curating / maintaining it (maintained by (P126), probably)
Also it would be nice if the style-guide identified some items that were good examples of modelling of datasets ('showcase items')... or even if it made sure the properties table used examples that actually were datasets.
Looking at the dataset item in Reasonator [2] one can scroll down to the "from related items" box, and find some items that consider themselves to be "instance of" or "subclass of" dataset
Closer to home, one could also look at Gazetteer of Early Modern England and Wales (Q105548625) [= the questioner's own project] and its relation to Viae Regiae (Q105547906) (given as operator (P137)) /end

A useful and I think very important point was made back to me, that

For academic (and credible non-academic) reuse of datasets, especially when combining datasets, for analysis and subsequent reference it is crucial to know creator, curator, version number and authoritative repository

Overall, this does seem rather a fundamental subject, both in its own right, and in its importance for citations.

It did feel that I had to go 'round the houses' a bit to find the information we had. I would have completely forgotten about WikiProject Datasets, and never found its item style-guide if I hadn't chanced to look at data set (Q1172284). And there's probably also more on here, that I never did find?

Jheald (talk) 19:06, 23 February 2021 (UTC)[reply]

Matching datasets to educational needs[edit]

This WikiProject presents datasets. Instructors teaching various techniques will use example datasets for demonstrations. If we have good modeling for the sorts of datasets that instructors need for teaching, then this WikiProject could be a resource for education.

Here are some example papers which describe learning goals in computational biology, which is one discipline which heavily uses datasets. Also for anyone who does not know, English Wikipedia's WikiProject Computational Biology (now merged to molecular biology) has been super active in university contribution to Wikipedia since the start of university partnerships. They have run a student content since 2012 and before that did other activities. This is a field which has a natural affinity for Wikimedia content.

en:Wikipedia:WikiProject Molecular Biology/Computational Biology

If anyone identifies other lists of skills which students learn by playing with datasets then please share. Perhaps this WikiProject can recommend certain datasets to use as examples when teaching certain skills. Also, we could curate those datasets well with Wikidata properties so that students could see good Wikidata-style data modelling. Blue Rasberry (talk) 21:23, 22 February 2021 (UTC)[reply]

Cut water drops with hydrophobic knife[edit]

water drop survives cutting attempt

water drop is cut

Cutting a drop of water pinned by wire loops using a superhydrophobic surface and knife (Q34303353) This paper has some heavy calculations and explanations for how a superhydrophobic knife cuts water droplets differently than a typical knife. user:Daniel Mietchen suggested these videos as an example here. He actually suggests lots of things all around Wikidata, so thanks Daniel.

The example is that it is often hard to match datasets to a real-world demonstration. In the case of these knives, they both look like knives, but with a dataset explaining behavior, one can predict how water will behave in contact with surfaces made of various materials. The videos are a demonstration of differences in outcomes when using different materials.

The paper is from 2012, I am unsure if the data is available, and if the data is available then I am unsure what we should do with it here. But I did want to express that it would be nice if we found some model dataset to present that also had some video or image which was not data, but which explained the significance of the dataset to a non-scientist. Practically no one could read this boring crazy paper and understand it, but anyone could watch these videos for 10 seconds and get the idea of exactly what the researchers did. The video makes a mundane explanation awesome. Blue Rasberry (talk) 21:36, 22 February 2021 (UTC)[reply]

Modeling AGROVOC[edit]

AGROVOC (Q292649) is currently modeled as a controlled vocabulary, but there is a great deal of information about it in English Wikipedia which is not included in Wikidata. This might be interesting to model - would we consider this a dataset as well? - PKM (talk) 20:01, 24 February 2021 (UTC)[reply]

@PKM: Yes it is a dataset and yes it is really important for the wiki community to figure out how to model multilingual technical vocabularies. This list of plant species and more is exactly the kind of difficult translation sets which the wiki community would excel in translating if we figured out how to model and manage this stuff. A big regret that I have about this dataset is that it is CC-By licensed so incompatible with import to Wikidata. The ideal dataset to model would be something we can actually import. I really like the idea of this - do you have ideas for other such vocabularies? Blue Rasberry (talk) 23:15, 19 June 2021 (UTC)[reply]

@Bluerasberry: Inter-Active Terminology for Europe (Q1520860) might be another one - I use it for translating labels. It's downloadable (and more importantly, I think, downloadable by domain). I'm not sure the downloads include the references and definitions (= descriptions), which are not freely licensed but are helpful for disambiguation and matching to QIDs. It would be nice if there was a clear CC-by license, but it appears not, though the data can be used for commercial purposes and derivative works are allowed with attribution. - PKM (talk) 20:47, 20 June 2021 (UTC)[reply]

Ah: from the Legal notice page: "The data made available is not copyright protected, and can therefore be freely downloaded and reproduced, for personal use or for further non-commercial or commercial dissemination. The European Union retains the copyright in the database structure of the downloadable file, but hereby authorises downloading and reproduction, for personal use or for further non-commercial or commercial dissemination, also of the database structure as available in the downloadable file." - PKM (talk) 20:51, 20 June 2021 (UTC)[reply]

temporal coverage of Dataset[edit]

As I added to Wikidata:WikiProject_Datasets/Data_Structure/DCAT_-_Wikidata_-_Schema.org_mapping#Dataset, for temporal coverage, aren't start of covered period (P7103) and end of covered period (P7104) more precisely appropriate than start time (P580) and end time (P582)? "4.11.1 Optional properties for Period of Time" in DCAT-AP v1.1 says schema:startDate and schema:endDate are the corresponding properties, and start time (P580) and end time (P582) says they are equivalent to those. (I think these relationships imply that start of covered period (P7103) should be the sub-property of start time (P580)) --KAMEDA, Akihiro (talk) 16:45, 24 October 2022 (UTC)[reply]