Wikidata:WikiCite/Roadmap
- This page describes the state of the WikiCite roadmap as of 2018. Since then Wikidata:WikiCite has been merged into Wikidata:WikiProject Source MetaData. For the latest updates about WikiCite, see m:WikiCite on Meta.
- For a parallel proposal from 2020 to solve the issue of redundant references across the Projects, see Shared Citations. For an update from 2023, see here.
The future of bibliographic data in Wikidata: 4 possible federation scenarios
This document summarizes a discussion that took place during a meetup hosted at Wikimania 2018 in Cape Town to discuss possible directions for the WikiCite project, in particular the risks and benefits of 4 possible scenarios (see the raw discussion notes from the meetup).
- How to contribute
These notes try to capture a variety of perspectives and concerns expressed by participants in the meetup. Feel free to add new risks/benefits, but please discuss on the talk page before removing content or making substantial changes. You can also chime in on the talk page with your preferred scenario (while understanding that this is a set of informal proposals and not a binding RfC or formal decision process).
- Update
- Some efforts were made in 2019 to set up a separate wikibase for WikiCite experiments; details are on the talk page.
Background
WikiCite was originally envisioned as an initiative to build a knowledge base of citable sources and their bibliographic metadata to support the sourcing and fact-checking needs of Wikimedia contributors, by representing these citations as structured data entities and moving away from unresolvable (or difficult to resolve) strings and templates. However, the scope of the project has evolved and expanded significantly over time and it can now be defined in at least three distinct ways:
A database of all sources cited in Wikimedia projects
Significant progress has been made towards the goal of representing sources as structured data, through the creation of Wikidata items for works cited across Wikimedia projects, the design of a range of bibliographic data models in Wikidata, as well as the creation of bots and tools (1 - 2) to support these workflows. Major gaps remain: we have yet to ingest, at scale, sources other than academic and scientific papers (such as books, newspaper articles, web pages, and TV/radio, along with their authors), but the project is making progress towards this goal.
The notability of these items, supporting their inclusion in Wikidata, has been long established by community consensus.
A database for curated bibliographic corpora
Several specialized open bibliographic corpora have been created in Wikidata to support a variety of needs that go beyond the representation of sources cited in Wikimedia projects:
- The GeneWiki, Wikidata:WikiFactMine and ScienceSource projects use Wikidata as a bibliographic store to represent the provenance of facts extracted from the scholarly literature for their respective domains.
- The Zika Corpus was created to build a comprehensive, annotated knowledge base in Wikidata describing the state of scholarship on the Zika virus.
- Inventaire.io is creating a corpus of CC0-licensed bibliographic metadata on books, on an on-demand basis, using Wikidata as a data store.
- A dataset of scholarly citations (using Wikidata’s cites property) is being ingested on an ongoing basis, creating a sparse but growing citation graph of 70 million edges.
The notability of each of the above use cases hasn’t been formally established, but it arguably falls under the 3rd of Wikidata’s notability criteria (“fulfilling some structural need, for example: [an item] is needed to make statements made in other items more useful.”)
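To make the idea of a citation graph built from the cites property more concrete, here is a minimal in-memory sketch. It assumes the property in question is P2860 ("cites work"); the item identifiers (`Q_PAPER_A` and so on) and both helper functions are hypothetical placeholders for illustration, not real Wikidata items or an existing tool.

```python
from collections import defaultdict

CITES = "P2860"  # assumed to be the Wikidata "cites work" property


def build_citation_graph(statements):
    """Collect (subject, property, object) triples into citing -> cited sets.

    Statements that use any other property are ignored.
    """
    graph = defaultdict(set)
    for subject, prop, obj in statements:
        if prop == CITES:
            graph[subject].add(obj)
    return graph


def cited_by_counts(graph):
    """Count how many distinct items cite each cited item."""
    counts = defaultdict(int)
    for cited_items in graph.values():
        for item in cited_items:
            counts[item] += 1
    return dict(counts)


# Placeholder identifiers, not real Wikidata items.
statements = [
    ("Q_PAPER_A", "P2860", "Q_PAPER_B"),
    ("Q_PAPER_A", "P2860", "Q_PAPER_C"),
    ("Q_PAPER_B", "P2860", "Q_PAPER_C"),
    ("Q_PAPER_A", "P50", "Q_AUTHOR_X"),  # an author statement, not a citation
]

graph = build_citation_graph(statements)
edge_count = sum(len(cited) for cited in graph.values())
counts = cited_by_counts(graph)
```

Each citation statement becomes one edge, so the "70 million edges" figure above corresponds to roughly that many statements of this kind.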
The “bibliographic commons”
A part of the WikiCite community has been looking at the project as the future home for a comprehensive open body of structured data on every possible source, not limited to what is currently cited in Wikimedia projects, or to specialized corpora and use cases. A collaboratively created bibliographic commons, one that individual contributors, libraries, GLAM institutions, and metadata organizations can contribute to and rely upon, and that can act as a hub connecting a variety of bibliographic knowledge bases, has often been referred to as WikiCite’s “moonshot”.
This knowledge base would compete with existing proprietary or semi-proprietary databases, by providing an open layer of bibliographic structured data, reusable by anyone and free from copyright restrictions. However, the notability in current Wikimedia projects of the items that would need to be created to meet this ambitious goal cannot be taken for granted: to date this goal hasn’t been discussed or explicitly backed by community consensus. The technical, programmatic, and financial requirements to achieve this goal are also substantial and well beyond the scope of the current WikiCite initiative.
Growing pains
The growth of WikiCite beyond its original scope has been causing a number of scaling issues, both technical and social, that need to be addressed.
- While some bibliographic corpora in Wikidata are complete and usable as of today, a “complete” and usable citation graph would require ingesting at least a billion data points, and possibly several billion. Taking into account that a typical publication cites a few dozen references, such a complete corpus would represent 2 to 10 times the current size of Wikidata.
- The prevalence of bibliographic content in Wikidata (nearly 40% of its items as of August 2018) makes it hard to sell its value proposition as a domain-general knowledge base. For example, searching Wikidata for any given keyword is much more likely to return bibliographic items than other types of entities users might be interested in: given the way in which search and other Wikidata user interfaces currently function, it is impossible or extremely hard to separate bibliographic content or to filter results by domain. As a result, outreach efforts may suffer from an overrepresentation of one specific type of content, to the detriment of others.
- The rapid ingestion of content is taking a toll on the querying infrastructure, causing frequent timeouts.
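The size estimate in the first point above can be reproduced with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the baseline of 500 million data points for Wikidata's current size, the reading of "several billion" as 5 billion, and the figure of 30 references per publication are all assumptions chosen for the example, not figures taken from the discussion.

```python
def relative_size(data_points, baseline):
    """Return how many times larger `data_points` is than `baseline`."""
    return data_points / baseline


# Assumption: a round figure for Wikidata's current size in data points.
ASSUMED_WIKIDATA_SIZE = 500_000_000

LOWER_BOUND = 1_000_000_000  # "over a billion data points"
UPPER_BOUND = 5_000_000_000  # assumed reading of "several billion"

# Assumption: "a few dozen references" taken as 30 per publication.
ASSUMED_REFS_PER_PUBLICATION = 30

low_ratio = relative_size(LOWER_BOUND, ASSUMED_WIKIDATA_SIZE)
high_ratio = relative_size(UPPER_BOUND, ASSUMED_WIKIDATA_SIZE)

# Number of publications whose reference lists would add up to the lower bound.
publications_for_lower_bound = LOWER_BOUND // ASSUMED_REFS_PER_PUBLICATION

print(f"{low_ratio:.0f}x to {high_ratio:.0f}x the assumed baseline")
print(f"about {publications_for_lower_bound:,} publications for the lower bound")
```

Under these assumptions the ratio comes out at 2 to 10, matching the range quoted above; a different baseline or upper bound would shift the range accordingly.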
Four scenarios for the future of WikiCite
To identify strengths and weaknesses of various solutions to the above scalability problems, the group discussed 4 possible scenarios, covering a broad range of options from a centralized solution (continuing with the status quo) to a fully decentralized view, leveraging Wikibase’s federation capabilities. Our goal is to flesh out risks and benefits of each of these scenarios, to inform technical and policy decisions on the future of the project and its scope for existing and future stakeholders.
While the scenarios are presented independently, it’s worth noting that they could also co-exist (for example, there could be a separate namespace for bibliographic data and also federation with different datasets held outside of Wikimedia projects), or one could transition into another (for example, the current centralized status quo could become unsustainable due to the amount of citation data, leading to a separate namespace).
1. Centralized
All bibliographic data stays in Wikidata – i.e. the status quo.
2. Namespace
All bibliographic data stays in Wikidata, but in a dedicated namespace with a dedicated data model (similar to lexemes).
3. Sister site
Bibliographic data (other than what’s considered notable in Wikidata) moves to an official Wikimedia sister project, powered by Wikibase, and tightly integrated with other Wikimedia projects.
4. Federated
Bibliographic data is handled and curated through a federation of specialized wikibases, with no central hub or data store.
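As a rough illustration of what "no central hub" means for a consumer of the data in the federated scenario, the sketch below looks up a work by DOI across several independent stores and merges whatever each one returns. Everything here is hypothetical: `BibliographicStore` is an in-memory stand-in for a wikibase, the store names and records are invented, and the "first store to answer wins" merge rule is just one possible policy.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]


class BibliographicStore:
    """In-memory stand-in for one independent wikibase (hypothetical)."""

    def __init__(self, name: str, records: List[Record]) -> None:
        self.name = name
        # Index records by lower-cased DOI for case-insensitive lookup.
        self._by_doi = {record["doi"].lower(): record for record in records}

    def find_by_doi(self, doi: str) -> Optional[Record]:
        return self._by_doi.get(doi.lower())


def federated_lookup(
    stores: List[BibliographicStore], doi: str
) -> Tuple[Record, List[str]]:
    """Ask every store in turn; there is no central index to consult."""
    merged: Record = {}
    sources: List[str] = []
    for store in stores:
        record = store.find_by_doi(doi)
        if record is not None:
            sources.append(store.name)
            for key, value in record.items():
                # Keep the value from the first store that supplied the field.
                merged.setdefault(key, value)
    return merged, sources


# Invented example data; "10.1000/example" is a placeholder DOI.
medicine = BibliographicStore(
    "medicine", [{"doi": "10.1000/example", "title": "An example paper"}]
)
library = BibliographicStore(
    "library", [{"doi": "10.1000/EXAMPLE", "publisher": "Example Press"}]
)
history = BibliographicStore("history", [])

merged, sources = federated_lookup([medicine, library, history], "10.1000/example")
```

In the centralized scenarios the same question is a single lookup against one store, which is one way to see the trade-off between ease of querying and distributed curation that the scenarios are meant to expose.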
Risks and benefits compared
| | 1. Centralized | 2. Namespace | 3. Sister site | 4. Federated |
|---|---|---|---|---|
| Description | All bibliographic data stays in Wikidata | All bibliographic data stays in Wikidata, but in a dedicated namespace with a dedicated data model (similar to lexemes) | Bibliographic data (other than what’s considered notable in Wikidata) moves to an official Wikimedia sister project, powered by Wikibase, and tightly integrated with other Wikimedia projects. | Bibliographic data is handled and curated through a federation of specialized wikibases, with no central hub or data store. |
| Community implications | Benefits / Risks | Benefits / Risks | Benefits / Risks | Benefits / Risks |
| Technical implications | Benefits / Risks | Benefits / Risks | Benefits / Risks | Benefits / Risks |
| Governance implications | Benefits / Risks | Benefits / Risks | Benefits / Risks | Benefits / Risks |
| Data quality implications | Benefits / Risks | Benefits / Risks | Benefits / Risks | Benefits / Risks |
| Other | Benefits / Risks | Benefits / Risks | Benefits / Risks | Benefits / Risks |