Wikidata:WikiCite/Roadmap


The future of bibliographic data in Wikidata: 4 possible federation scenarios

Participants in the WikiCite roadmap meetup at Wikimania '18.

This document summarizes a discussion held during a meetup at Wikimania 2018 in Cape Town on possible directions for the WikiCite project, in particular the risks and benefits of four possible scenarios (see the raw discussion notes from the meetup).

How to contribute

These notes try to capture a variety of perspectives and concerns expressed by participants in the meetup. Feel free to add new risks/benefits, but please discuss on the talk page before removing content or making substantial changes. You can also chime in on the talk page with your preferred scenario (while understanding that this is a set of informal proposals and not a binding RfC or formal decision process).

Background

WikiCite was originally envisioned as an initiative to build a knowledge base of citable sources and their bibliographic metadata to support the sourcing and fact-checking needs of Wikimedia contributors, by representing these citations as structured data entities and moving away from unresolvable (or difficult to resolve) strings and templates. However, the scope of the project has evolved and expanded significantly over time and it can now be defined in at least three distinct ways:

A database of all sources cited in Wikimedia projects

Significant progress has been made towards the goal of representing sources as structured data, through the creation of Wikidata items for works cited across Wikimedia projects, the design of a range of bibliographic data models in Wikidata, as well as the creation of bots and tools (1 - 2) to support these workflows. Major gaps remain: we have yet to ingest, at scale, sources other than academic/scientific papers (such as books, newspaper articles, web pages, TV/radio, etc. and their authors), but the project is making progress towards this goal.

The notability of these items, supporting their inclusion in Wikidata, has been long established by community consensus.

A database for curated bibliographic corpora

Several specialized open bibliographic corpora have been created in Wikidata to support a variety of needs that go beyond the representation of sources cited in Wikimedia projects:

The notability of each of the above use cases hasn’t been formally established, but it arguably falls under the 3rd of Wikidata’s notability criteria (“fulfilling some structural need, for example: [an item] is needed to make statements made in other items more useful.”)  

The “bibliographic commons”

Part of the WikiCite community has been looking at the project as the future home for a comprehensive open body of structured data on every possible source, not limited to what is currently cited in Wikimedia projects, or to specialized corpora and use cases. A collaboratively created bibliographic commons, one that individual contributors, libraries, GLAM institutions, and metadata organizations can contribute to and rely upon, and that can act as a hub connecting a variety of bibliographic knowledge bases, has often been referred to as WikiCite’s “moonshot”.

This knowledge base would compete with existing proprietary or semi-proprietary databases, by providing an open layer of bibliographic structured data, reusable by anyone and free from copyright restrictions. However, the notability of items to be created to meet this ambitious goal in current Wikimedia projects is far from granted: to date this goal hasn’t been discussed or explicitly backed by community consensus. The technical, programmatic, and financial requirements to achieve this goal are also substantial and well beyond the scope of the current WikiCite initiative.  

Growing pains

The growth of WikiCite beyond its original scope has been causing a number of scaling issues, both technical and social, that need to be addressed.

  • While some bibliographic corpora in Wikidata are complete and usable as of today, a “complete” and usable citation graph would require ingesting at least a billion, and possibly several billion, data points. Given that a typical publication cites a few dozen references, such a complete corpus would represent 2 to 10 times the current size of Wikidata.
  • The prevalence of bibliographic content in Wikidata (nearly 40% of its items as of August 2018) makes it hard to sell its value proposition as a domain-general knowledge base. For example, searching Wikidata for any given keyword is much more likely to return bibliographic items than other types of entities users might be interested in: given the way in which search and other Wikidata user interfaces currently function, it is impossible or extremely hard to separate bibliographic content or to filter results by domain. As a result, outreach efforts may suffer from an overabundance of one type of content to the detriment of others.
  • The rapid ingestion of content is taking a toll on the querying infrastructure, causing frequent timeouts.
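
As a rough illustration of the arithmetic behind the size estimate above, the following Python sketch multiplies a number of publications by the references each one cites, then compares the result to a baseline size. All inputs (`NUM_PUBLICATIONS`, `CURRENT_SIZE`, and the two references-per-publication values) are hypothetical placeholders, not figures from the meetup; they were picked only so that the output falls within the range quoted above.

```python
def citation_links(num_publications: int, refs_per_publication: int) -> int:
    """Citation links if every publication cites the same number of references."""
    return num_publications * refs_per_publication


def relative_size(data_points: int, current_size: int) -> float:
    """How many times larger the new corpus is than the existing knowledge base."""
    return data_points / current_size


# Hypothetical inputs, for illustration only (not measured values).
NUM_PUBLICATIONS = 50_000_000   # assumed number of publications to ingest
CURRENT_SIZE = 500_000_000      # assumed current size, in the same "data point" unit
REFS_LOW, REFS_HIGH = 24, 96    # "a few dozen" references, low and high guess

low = citation_links(NUM_PUBLICATIONS, REFS_LOW)
high = citation_links(NUM_PUBLICATIONS, REFS_HIGH)

if __name__ == "__main__":
    print(f"citation links: {low:,} to {high:,}")
    print(
        "relative to current size: "
        f"{relative_size(low, CURRENT_SIZE):.1f}x to {relative_size(high, CURRENT_SIZE):.1f}x"
    )
```

Changing any of the assumed inputs shifts the result proportionally, which is why the estimate above is stated as a broad range and not as a single number.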

Four scenarios for the future of WikiCite

To identify strengths and weaknesses of various solutions to the above scalability problems, the group discussed 4 possible scenarios, covering a broad range of options from a centralized solution — continuing with the status quo — to a fully decentralized view, leveraging Wikibase’s federation capabilities. Our goal is to flesh out risks and benefits of each of these scenarios, to inform technical and policy decisions on the future of the project and its scope for existing and future stakeholders.

While the scenarios are presented independently, it’s worth noting that they could also co-exist (for example, there could be a separate namespace for bibliographic data and also federation with different datasets held outside of Wikimedia projects); or one could transition into another (for example, the current centralized status quo could become unsustainable due to the amount of citation data, leading to a separate namespace).

1. Centralized

All bibliographic data stays in Wikidata – i.e. the status quo.

2. Namespace

All bibliographic data stays in Wikidata, but in a dedicated namespace with a dedicated data model (similar to lexemes).

3. Sister site

Bibliographic data (other than what’s considered notable in Wikidata) moves to an official Wikimedia sister project, powered by Wikibase, and tightly integrated with other Wikimedia projects.

4. Federated

Bibliographic data is handled and curated through a federation of specialized wikibases, with no central hub or data store.

Risks and benefits compared

For each dimension that follows, benefits and risks are listed per scenario, in the order 1. Centralized, 2. Namespace, 3. Sister site, 4. Federated.

Community implications

1. Centralized

Benefits

  • No community fragmentation
  • Bibliographic items have the same status as other items and enjoy the benefits of the existing population of curators in Wikidata
  • This option directly supports the significant referencing needs of the Wikidata community by facilitating the reuse, discoverability and curation of source metadata, without introducing an additional learning curve.

Risks

  • Imbalance of the knowledge base
  • Friction between Wikidata and other large Wikimedia projects could be further exacerbated.
  • Tensions around Wikidata’s scope could be exacerbated: endless, unresolved debates around inclusion criteria for items/properties/corpora that are more valuable outside a Wikipedia context than inside.

2. Namespace

Benefits

  • Bibliographic items stay in Wikidata but in a segregated space, where different curation workflows can be experimented with
  • Potential opportunity to absorb more of the detractors from sister sites into the new namespace, since norms developed in that namespace can change and evolve independently.
  • Easier to convince Wikimedia communities to use “citation data only” rather than “all of Wikidata”; this could be a foot in the door to demonstrate the value of Wikidata to other users within the Wikimedia ecosystem.

Risks

  • Establishing a new space requires community labour
  • Continued friction with sister sites
  • Sets a precedent for creating multiple content namespaces within a single Wikimedia project
  • Reduces the discoverability and perceived importance of bibliographic content (anything outside Main is a second-class citizen).

3. Sister site

Benefits

  • Opportunity for experimentation with curatorial workflows
  • Potential for stronger integration with sister sites
  • Easier to develop a more automated push-pull strategy with external platforms for bibliographic data.

Risks

  • Community fragmentation is a severe risk: the project may drain the Wikidata community without reaching critical mass to become sustainable.
  • Conflict/territoriality issues during migration of existing bibliographic content to the new project.
  • Increased tension between the Wikidata core volunteer community and WMF/WMDE/WikiCite if the community feels organizational stakeholders are doing an end-run around them.

4. Federated

Benefits

  • Individual communities and institutions are able to focus on curating the types of bibliographic contents they care about, with the granularity and data modeling approach of their choice
  • Reduce instances of Wikimedia duplicating others’ efforts
  • Community of participant organizations doesn’t have to make the entire “cognitive leap” of releasing their data to community curation

Risks

  • There is no centralized curator community; this scenario may replicate the status quo of siloed knowledge bases and communities.
  • More difficult for Wikimedia community volunteers to participate in curation.
Technical implications

1. Centralized

Benefits

  • Existing tools and workflows can be reused as is for bibliographic items

Risks

  • Potential issues resulting from the huge scale of data to be ingested
  • Queries are already taxed by having so many citation items in the database

2. Namespace

Benefits

  • Bibliographic items can be configured to have a dedicated data model
  • The bibliographic corpus is more clearly delineated and can be more easily plugged into tools that expect bibliographic contents.
  • Wikipedia and Wikimedia projects have a dedicated space to map references to. Integration with editing interfaces in Wikimedia projects should be cheaper than in other scenarios
  • Could adopt a new data model built to handle bibliographic complexities such as versions and editions (e.g. FRBR) that are not handled well right now in the core data model (similar to Lexemes)
  • Opportunity to experiment with search and discovery workflows designed specifically for bibliographic data
  • Could be a way to mitigate scalability issues related to querying, search, servers, etc.

Risks

  • Existing tools and workflows designed with Wikidata can be adapted for bibliographic items, but at some cost
  • Search on Wikidata doesn’t yet support the property and lexeme namespaces very well, making the bibliographic data less accessible in the human interface.

3. Sister site

Benefits

  • Wikipedia and Wikimedia projects have a dedicated space to map references to. Integration with editing interfaces in Wikimedia projects should be cheaper than in other scenarios
  • Dedicated data model
  • Opportunity to develop tool integration specific to bibliography
  • Forking a new sister project relieves the short-term technical scalability burden on Wikidata proper.
  • Opportunity to experiment with search and discovery workflows designed specifically for bibliographic data

Risks

  • Significant operational burden for Wikimedia’s technical teams (operations, services, performance, security, etc.) to spin up and maintain a full-fledged sister project.
  • Developing a solid federation infrastructure that handles all needs for reusing concepts from Wikidata and Lexemes could create technical delays/failures.

4. Federated

Benefits

  • Potential to offload technical/infrastructure support to partners

Risks

  • License incompatibility across multiple knowledge bases may hinder the reusability of this data
  • Discoverability in a fully federated scenario is more challenging
  • Tighter integration of federated contents in Wikimedia projects is not possible with the current infrastructure
  • Federation models and technical infrastructure are not clear in the Wikidata development roadmap
  • Wikibase is not easy to install and set up yet, and separate Wikibases can’t yet feed back into Wikidata
Governance implications

1. Centralized

Benefits

  • Project already exists so limited additional coordination is required relative to establishing a new project

Risks

  • Potential mismatch between WikiCite and broader goals of the Wikidata project
  • Tension between oversight of bibliographic data models versus other models

2. Namespace

Benefits

  • Policies could differ from those of the core namespace, so bibliographic data could have standards that are different from the framework used for Wikidata concepts
  • Property/statement creation is integrated into the data model of the existing community, leading to ontological consistency

Risks

  • Not enough people engage in the new namespace to resolve disputes or identify problems
  • May still be accountable to Wikidata’s overall policies (e.g. inclusion criteria), even if those policies are not well-suited to bibliographic data, since decisions will be made by the same community.

3. Sister site

Benefits

  • Could lead to more active community development by a dedicated community development team.
  • Could lead to development of policies and norms more suitable to bibliographic data, including better integration with both sister sites and external partners
  • Could make targeted outreach to new external partners easier, because scope of project and scope of WikiCite initiative are congruent.

Risks

  • Spinning up a new sister project is a huge endeavor that requires coordination between many organizations and stakeholders, as well as dedicated technical and programmatic leadership
  • Importing property and data model creation from Wikidata risks failing to build a community of believers into the core governance of the project.
  • Ongoing conflict around resolving edge cases in scope (e.g. content that could reasonably be hosted on either a sister site or Wikidata).

4. Federated

Benefits

  • Authority control can still stay within different institutions, without the perception that they would lose control on their data and data modeling choices to “arbitrary” community decisions.

Risks

  • Lack of central control and curation
  • Professional communities have a way of making these kinds of efforts incredibly bureaucratic
  • A potential existential risk for the WikiCite initiative (if construed as a centralizing effort).
  • More difficult to promote and maintain free culture values in content inclusion criteria and contribution/curation practices
Data quality implications

1. Centralized

Benefits

  • Take advantage of existing workflows for ensuring data quality

Risks

  • Data curation processes may be overwhelmed by the abundance of citations

2. Namespace

Benefits

  • Gains the tooling and monitoring environment already built around Wikidata proper.
  • Gets greater integration with Wikidata concepts (especially topics and authors) and lexeme concepts, without having to deal with the challenges of multi-content-type federation

Risks

  • Filtering strategies for recent changes may start excluding bibliographic data, so a patrolling community would need to be developed for that area

3. Sister site

Benefits

  • Dedicated community of patrollers/reviewers with specific interest/expertise

Risks

  • Higher likelihood of semantic drift than in the Namespace or Centralized scenarios, but lower than with federation, because the WikiCite community defines the standard for this data.

4. Federated

Benefits

  • Take advantage of existing institutional expertise

Risks

  • Potential fragmentation and/or drift of data models, reducing long-term usability
Other

1. Centralized

Benefits

  • Retain full brand identity with Wikidata

Risks

  • (none listed)

2. Namespace

Benefits

  • Consistent branding that facilitates outreach to contributors and stakeholders

Risks

  • Confuse outsiders who expect all of Wikidata to be a single ‘thing’ but find it has multiple sections/spaces
  • The concept of namespaces may be confusing to new contributors and data consumers

3. Sister site

Benefits

  • Can focus branding to facilitate outreach to contributors and stakeholders
  • Ability to set own fair-use rules, giving the ability to include texts of abstracts, COI statements, and other fair-use metadata

Risks

  • Duplication. What would stay in Wikidata?
  • Not being branded Wikidata may lead to confusion and a need to re-establish its identity in the linked data space.

4. Federated

Benefits

  • (none listed)

Risks

  • Fragmented branding hinders outreach to contributors and stakeholders
  • Lack of integration with other Wikimedia projects
  • Duplication of efforts