Wikidata:WikiProject Schemas/Subsetting
About
This page assists in documenting efforts related to designing, creating and managing resources that could be described as a Wikidata subset (Q96051494), i.e. a subset (Q177646) of Wikidata, with an emphasis on stability and reusability.
History
During the hackathon of the 12th International SWAT4HCLS conference (Q64762682), Wikidata subsetting was one of the topics. A running document collected the proposals, suggestions and ideas. With this edit, that document has been moved to this page.
Rationales
- Why are we calling things 'subsets'? Aren't all query results subsets? 'Subset' emphasises the longevity and re-usability of the result, e.g. for subsequent re-querying, publication, archival etc.
- Normalizing the Wikidata data graph: when we subset, can we normalize the result to other representations / schemas, which could make it easier to merge with external data?
- Quicker and better autocomplete (less noise, precision/recall tuning)
- Part of the larger story about how Wikidata fits with a) Wikibase federation b) Linked Data c) Schema.org d) Semantic MediaWiki etc., e.g. plans to bring in very bulky areas where core Wikidata could get overwhelmed by size of some datasets.
- Faster queries, control over timeouts
- School-friendly subsets, e.g. drop porn-star info
- Censorship anti-usecases - out of scope here but should be noted.
- See also Q2 "earth is flat" and dealing with disagreement, and ability to filter on provenance and statement ranks.
- Database scaling and cost - running the whole thing is too expensive. It takes days to index and costs ~750+ USD / month to run in the cloud even with no traffic.
- Make room to grow and to overlay related data, e.g. crawled schema data, or non-CC0 open data, or data whose custodians prefer to curate it elsewhere (possibly in a Wikibase, but also LOD, Semantic MediaWiki, R2RML etc.).
- Use of inference for expert finding: we might know someone has expertise in some species and find them via inference/paths.
- Smaller, cleaner datasets: it becomes easier or more reliable to run inference strategies over the data, e.g. materialising triples on demand.
- Banking metaphor: as with interest on a bank account, if you put substantial quantities of data in you will want to be able to withdraw it too, to benefit from improvements, fixes, additions etc. Get back domain-specific data (e.g. protein info, or museum artifacts) "with interest" [drawn from the public exchange].
- Reproducible publication of smaller datasets with papers: clean addressing + versioning of a data snapshot
- Schema transformation for federated queries - using Wikidata as a tool to get a feel for the data. Federated queries can be unpredictable, so first explore that way, then pull data from Wikidata and the federated sources into a common store optimized for the query (see the sketch after this list).
- Data availability on cloud platforms - regular auto-generated subsets as triples plus the workflow that produced them, for shared per-platform data availability, including both per-database indices and interesting subsets & transformations. See also HDT.
- Making (parts of) Wikidata available in more application settings e.g. on offline PCs, or phones, or Raspberry Pi for makers and schools, … See also "Wikipedia in a Box" initiatives (kiwix.org).
- Creating a dataset for a specific research setting (e.g. a shared task), for example a machine-learning task on artworks.
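One way to read the "common store" idea above in practice: pull small CONSTRUCT slices from Wikidata and from other endpoints separately, merge them into a local graph, and query the merged graph instead of relying on live federation. A minimal sketch, assuming Python with SPARQLWrapper and rdflib; the chemical-elements pattern and the commented-out second endpoint are illustrative choices only, not part of the original proposal.

# Sketch only: fetch a small CONSTRUCT slice from Wikidata into a local
# rdflib graph; further sources would be merged the same way, after which
# queries run locally instead of via live federation.
from SPARQLWrapper import SPARQLWrapper, RDFXML
from rdflib import Graph

def fetch_construct(endpoint, query, agent="wd-subsetting-sketch/0.1 (example)"):
    sw = SPARQLWrapper(endpoint, agent=agent)
    sw.setQuery(query)
    sw.setReturnFormat(RDFXML)
    return sw.query().convert()  # returns an rdflib.Graph for CONSTRUCT queries

local = Graph()
local += fetch_construct(
    "https://query.wikidata.org/sparql",
    """
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    CONSTRUCT { ?el wdt:P246 ?symbol }
    WHERE     { ?el wdt:P31 wd:Q11344 ; wdt:P246 ?symbol }
    """)
# A second source would be merged the same way, e.g.:
# local += fetch_construct("https://sparql.uniprot.org/sparql", "...")

for row in local.query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"):
    print("triples in local subset:", row.n)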
Methods
How to actually do this?
- Subsetting by starting from nothing and choosing what we want to add (by topic, by entity ID, by property, by query, by ShEx shape, by project, or by kind of data, e.g. trim descriptions, provenance, labels, …).
- Subsetting by starting from everything and choosing what to remove (as above).
- Subsetting by time-frozen snapshot (combined with any of the above) e.g. for scientific replicability, publishing scholarly articles with relevant data attachments.
- Subsets of subsets, e.g. adding (more Wikidata) or removing. You might take a big-cities subset and put some more cities in.
- If we do make and publish subsets of Wikidata, how do we document what the subset was (VoID/DCAT/Schema.org etc.)? If it was made by software, vs. query, vs. hand-crafting, … how do we record that alongside the actual data? See also the Google Dataset Search docs.
- Andra: "I subset by triples with provenance"
- Stats page https://www.wikidata.org/wiki/Wikidata:Statistics/Wikipedia#Type_of_content
- Per-triple analysis ignores the larger byte size of description text fields. Are there other "bulky" text-valued properties? Size on disk may be hidden.
- Vague Q from Dan: Can we associate metadata on the property pages, e.g. P1234 being topically tagged, so we could mechanize output? A subset-category could list the properties it cares about (alongside basics like types and labels)
- Intertwingularity problem (https://en.wikipedia.org/wiki/Intertwingularity): domain-based subsetting is hard; maybe a pragmatic approach based on throwing out obviously irrelevant material is easier?
- Completeness: how can we tell the difference between knowing that Wikidata has entities for 100% of all known countries (a small but useful amount of information) versus having a similar number of entities that cover only, e.g., 5% of the potential total set of appropriate entities? E.g. does Wikidata also have most interesting chemicals? Proteins? Taxonomic names? Airline connections? Food cultivars?
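The completeness question above can at least be probed mechanically: count what Wikidata has for a class and compare it with an externally known total. A minimal sketch, assuming Python with SPARQLWrapper against the public query service; the class (sovereign state) and the reference figure are illustrative assumptions, and deciding which items to exclude (e.g. historical states) remains a judgment call.

# Sketch only: count Wikidata items typed as sovereign state (Q3624078) and
# compare against a user-supplied reference figure for present-day states.
from SPARQLWrapper import SPARQLWrapper, JSON

sw = SPARQLWrapper("https://query.wikidata.org/sparql",
                   agent="wd-subsetting-sketch/0.1 (example)")
sw.setQuery("""
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT (COUNT(DISTINCT ?c) AS ?n) WHERE { ?c wdt:P31 wd:Q3624078 }
""")
sw.setReturnFormat(JSON)
n = int(sw.query().convert()["results"]["bindings"][0]["n"]["value"])

EXPECTED = 195  # illustrative external figure; pick your own authority
print(f"Wikidata knows {n} sovereign states; reference figure {EXPECTED}")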
Granularity issues? Are people doing this by type, by triple, by label, by source, by sitelink to wiki, … any other strategies?
Drop descriptions, drop language labels for a very lite representation? Not very nice for i18n, but perhaps useful on very limited devices, e.g. a Raspberry Pi.
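A minimal sketch of that "lite" idea, assuming a gzipped N-Triples dump and keeping only English labels; the file names are placeholders and the string matching is deliberately naive (no real N-Triples parsing).

# Sketch only: stream an N-Triples dump, drop descriptions and all labels
# except English ones, write the slimmed file back out.
import gzip

DESCRIPTION = "<http://schema.org/description>"
LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

def keep(line: str) -> bool:
    if DESCRIPTION in line:
        return False                      # drop every description triple
    if LABEL in line and not line.rstrip(" .\n").endswith('"@en'):
        return False                      # drop non-English labels
    return True

with gzip.open("wikidata-subset.nt.gz", "rt", encoding="utf-8") as src, \
     gzip.open("wikidata-subset-lite.nt.gz", "wt", encoding="utf-8") as dst:
    dst.writelines(line for line in src if keep(line))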
Examples
- 1000 Disease Ontology terms and their Wikidata mappings to 17 mostly Indian languages and English (Q96051481)
- Take the bits you want approach:
- G2G uses SPARQL expressions (plus Cypher to map to Property Graphs) (see also http://g2g.fun/)
- Example of Wikidata2bioschemas property graph from biohackathon: https://github.com/elixir-europe/BioHackathon-projects-2019/blob/master/projects/28/src/g2g/wikidata_disease_bioschema.g2g (SPARQL + Cypher patterns)
- Tool from Andra that creates a copy of a Wikidata item
https://colab.research.google.com/drive/1DDws0pGPmP6eKwuDeSCHTCf6PYYMw9OL … it can also take only items whose statements have references (a minimal sketch of the same idea follows this list).
- Issues: how to deal with boundaries, e.g. Edinburgh having type City: should we replicate the type City? … Other cities? …
Looking at shape expressions for this. ...others?
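Not Andra's notebook, just a minimal sketch of the "only statements with references" idea mentioned above, assuming Python with requests against the public Special:EntityData API; the item Q42 is an arbitrary example.

# Sketch only: fetch one item's JSON and keep just the statements that carry
# at least one reference.
import requests

def referenced_statements(qid: str) -> dict:
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    entity = requests.get(url, timeout=30).json()["entities"][qid]
    kept = {}
    for pid, statements in entity.get("claims", {}).items():
        with_refs = [s for s in statements if s.get("references")]
        if with_refs:
            kept[pid] = with_refs
    return kept

claims = referenced_statements("Q42")
print(sum(len(v) for v in claims.values()), "referenced statements on Q42")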
What has been done around log analysis of the query logs that were shared, e.g. correlation analysis of types? Can we make a type-to-type affinity graph from query log analysis?
Anyone doing things with HDT triples?
Topical Use cases
List specific topics, fields, types, properties etc. here, including links to existing work:
- Gazetteer
- Citation analysis, …
- Use case: database of expertise in a university
- Historical epistolary analysis - people, VIAF IDs, etc.
- Move Wikicite stuff into its own KG? Take out the bibliographic data, to give it room to grow. (compare SemScholar + scite.ai data on citation type + affect)
- Currently there is a natural concern amongst WikiCite enthusiasts about moving it outside, because many of the benefits of Wikidata require you to be in core Wikidata, not out in the wider community. Would a mirror/sync be suitable?
- https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2020/01#Complete_list_of_PMIDs
- AI applications using Large Knowledge Graphs for training - selection of training datasets
Tools and Data
Collect links to projects and tools that subset, or to actual subsets, and also statistics and data questions:
- See e.g. https://www.wikidata.org/wiki/Wikidata:Statistics/Wikipedia#Type_of_content for analysis of what is in wikidata.
- Top 100 properties: https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/Top100
- Q: How much of Wikidata’s full dataset is provenance, sources, annotations, truth-levels etc.? [formulate a query - a first sketch follows this list]
- Tools: ShEx can be used to define data models, which could help define subsets. There are also tools that can extract ShEx from existing data, which could help to create subsets. See, for example, ShExer.
- https://github.com/bennofs/wdumper
- simple-wikidata-db - "a set of scripts to download the Wikidata dump, sort it into staging files, and query the data in these staged files in a distributed manner."
- related: the Linked Data Fragment Service
- Help:Dataset sizing
- Knowledge Grapher
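A first, hedged attempt at the provenance question above (the list item marked "formulate a query"): count statement nodes and how many of them carry at least one reference. Over all of Wikidata this would certainly time out on the public endpoint, so the sketch restricts the sample to one small class; assuming Python with SPARQLWrapper, and the sample class (chemical elements) is an arbitrary choice.

# Sketch only: for a small sample class, what fraction of statements have at
# least one reference (prov:wasDerivedFrom)?
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX wd:       <http://www.wikidata.org/entity/>
PREFIX wdt:      <http://www.wikidata.org/prop/direct/>
PREFIX prov:     <http://www.w3.org/ns/prov#>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT (COUNT(DISTINCT ?st) AS ?statements)
       (COUNT(DISTINCT ?refSt) AS ?referenced)
WHERE {
  ?item wdt:P31 wd:Q11344 .                # sample: chemical elements only
  ?item ?p ?st .
  ?st wikibase:rank [] .                   # keep only statement nodes
  OPTIONAL { ?st prov:wasDerivedFrom ?ref . BIND(?st AS ?refSt) }
}
"""

sw = SPARQLWrapper("https://query.wikidata.org/sparql",
                   agent="wd-subsetting-sketch/0.1 (example)")
sw.setQuery(QUERY)
sw.setReturnFormat(JSON)
b = sw.query().convert()["results"]["bindings"][0]
print(b["referenced"]["value"], "of", b["statements"]["value"],
      "sampled statements have at least one reference")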
Participants
- -- Andrawaag (talk) 09:29, 4 June 2020 (UTC)
- --Daniel Mietchen (talk) 09:34, 4 June 2020 (UTC)
- -- John Samuel (talk) 10:00, 4 June 2020 (UTC)
- -- Jose Emilio Labra Gayo (talk) 08:56, 6 June 2020 (UTC)
- -- Alejgh (talk) 10:11, 8 June 2020 (UTC)
- -- Will (Wiki Ed) (talk) 21:01, 16 June 2020 (UTC)
- -- SCIdude (talk) 06:15, 26 June 2020 (UTC)
- -- Sj (talk)
- -- Fuzheado (talk) 17:57, 13 December 2020 (UTC)
- --Hogü-456 (talk) 21:37, 16 March 2022 (UTC)
- --WolfgangFahl (talk) 07:37, 19 September 2023 (UTC)