Wikidata talk:WikiProject Schemas/Subsetting

Possible fields

@Andrawaag, Daniel Mietchen, John Samuel, Jose Emilio Labra Gayo, Alejgh, Will (Wiki Ed):

I would encourage all participants to define core classes (or sets of them) that would make up a slice of interest to you, or one that you think would be widely used. Please append them here. --SCIdude (talk) 07:45, 26 June 2020 (UTC)

I am not sure a set of core classes would do it. Many Wikidata items still lack class typing. In the current state, a set of properties that describe the slice might be more effective.
I see more value in defining entity schemas for the subsets one would like to extract. For example, many items in Wikidata still lack proper references, basically making them inaccurate. Defining a set of classes alone would create a subset where the lack of referenced statements persists, whereas I would like to create a high-quality Wikidata subset, primarily by extracting only items together with their referenced statements. --
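
As a rough illustration of the "referenced statements only" idea, here is a minimal sketch that filters one entity in the Wikidata JSON dump layout, where each statement under "claims" may carry a "references" list. The function name keep_referenced_statements and the toy entity (including the placeholder ID "Q0") are assumptions made for illustration, not part of any existing tool.

```python
def keep_referenced_statements(entity):
    """Return a shallow copy of an entity dict keeping only referenced statements.

    Assumes the Wikidata JSON layout: entity["claims"] maps a property ID
    to a list of statements, each with an optional "references" list.
    """
    filtered_claims = {}
    for prop, statements in entity.get("claims", {}).items():
        # A statement counts as referenced if its "references" list is non-empty.
        referenced = [s for s in statements if s.get("references")]
        if referenced:
            filtered_claims[prop] = referenced
    result = dict(entity)
    result["claims"] = filtered_claims
    return result


# Hypothetical toy entity: one referenced statement, one unreferenced.
sample = {
    "id": "Q0",
    "claims": {
        "P31": [{"id": "s1", "references": [{"snaks": {}}]}],
        "P21": [{"id": "s2"}],
    },
}

filtered = keep_referenced_statements(sample)
print(sorted(filtered["claims"]))
```

An entity whose "claims" come back empty could then be dropped from the subset altogether, though whether to do so is a design choice the thread leaves open.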

So we have a set of schemas as the description of a slice. Again, can you give such a set that would be of interest (to you)? What medium or file format would describe such a set? A Wikidata item? A tarball? --SCIdude (talk) 15:48, 11 December 2020 (UTC)

Suitable fields

What is needed is a metric of the interconnectedness of specific slices with the rest of Wikidata. Example: we can expect scientific articles to draw on all aspects of Wikidata (fields, persons), so a slice of all articles may need links to most items of Wikidata, while molecular biology plus chemistry is fairly self-sufficient. This metric could take as input a set of classes that define the items in the slice core, and it could output an array of percentages. Made-up examples:

The first percentage is the ratio of the number of instances of the classes to all items in Wikidata; the second adds all items linked to from those instances; the third adds all items linked to from those in turn.

This would help in determining whether a slice is doable, and whether some sort of cutoff is needed, for example only linking to items without including the actual items. --SCIdude (talk) 07:29, 26 June 2020 (UTC)
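
The metric described above could be sketched roughly as follows. This is only an illustration on a made-up toy graph: the function name slice_coverage, the in-memory dict of links, and the item labels are all assumptions, and a real computation over Wikidata would need a dump or query service rather than a Python dict.

```python
def slice_coverage(links, core, total_items, hops=2):
    """Return cumulative coverage percentages for a slice.

    links: dict mapping an item to the set of items it links to (assumed input).
    core: set of items that are instances of the chosen classes.
    total_items: number of items in the whole graph.
    The first value covers the core only; each further value adds the
    items linked to from the items added in the previous step.
    """
    included = set(core)
    frontier = set(core)
    percentages = [100.0 * len(included) / total_items]
    for _ in range(hops):
        next_frontier = set()
        for item in frontier:
            next_frontier |= links.get(item, set())
        # Only items not yet in the slice count as newly added.
        next_frontier -= included
        included |= next_frontier
        frontier = next_frontier
        percentages.append(100.0 * len(included) / total_items)
    return percentages


# Made-up toy graph of 10 items, with a single core item "A".
toy_links = {"A": {"B", "C"}, "B": {"D"}, "C": {"A"}, "D": {"E"}}
coverage = slice_coverage(toy_links, core={"A"}, total_items=10)
print(coverage)  # [10.0, 30.0, 40.0]
```

A slice whose percentages climb steeply with each hop would be a candidate for the cutoff mentioned above, while one that flattens out quickly is comparatively self-contained.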

Queries

@Andrawaag, Daniel Mietchen, Jsamwrites, Jelabra, Alejgh: @SCIdude, Will (Wiki Ed), Sj:

Maybe Wikidata:Request_a_query#Dataset_sizing is of interest to you. --- Jura 09:09, 11 August 2020 (UTC)

Aha! Yes, thank you. Sj (talk) 15:25, 17 August 2020 (UTC)
It's now mostly at Help:Dataset sizing. --- Jura 16:09, 17 August 2020 (UTC)