User:John Cummings/Archive/RFC

From Wikidata

Note: please discuss on the Discussion page


Background information

What is a data schema?

A data schema is a set of rules that define the structure of the data stored in a database. In the context of Wikidata editing, data schemas can provide a standardised structure for data on a subject area. For example, all the items on museums would use the same structure to describe basic facts about them like location, collection type, date opened etc.

There are many existing schemas available online which apply to different kinds of data about the world, e.g. schema.org.

Wikidata Schemas have recently been released on Wikidata. They provide the technical means for recording the schemas for different subject areas, but they require advanced technical knowledge to create and currently have limited documentation. They also do not include any kind of process for discussion or consensus when creating schemas.

Problem statement

There is currently no place for the community to propose, or "agree" on the correct model for specific types of items (e.g. a train, plant, human author, human astronaut etc).

Now that Wikidata Schemas have been implemented, we have a place that can represent the model at a technical level (e.g. EntitySchema:E10 for a human). But without a community-driven discussion and creation process, the models will not be used as the "consensus".

This inevitably leads to all of the issues associated with data inconsistency, which Wikidata editors and third party re-users need to find a way to work around. A few of the main issues are listed below:

Difficult to query

Inconsistent data modelling makes data much harder to find, add and query. E.g. the location of museums is modelled in at least three ways on Wikidata:

  1. Using the statement 'coordinate location'
  2. As a qualifier on the 'headquarters' statement
  3. As 'coordinate location' for the building the museum is housed in, with another item for the legal entity

These were the ones found, but there may be others. This variation in the way different items record the same data makes it very difficult to work with the data, to trust that a query is returning all the information Wikidata has on that subject, and to reuse data from Wikidata on other Wikimedia projects.
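
To make the cost of this variation concrete, here is a minimal Python sketch of a lookup that has to try all three patterns before it can report a museum's coordinates. The item dictionaries, the key names ("headquarters", "housed in") and the identifiers (Q_MUSEUM_A and so on) are hypothetical simplifications and not the real Wikidata data format.

```python
from typing import Optional, Tuple

Coordinates = Tuple[float, float]

# Hypothetical, simplified items keyed by made-up identifiers.
ITEMS = {
    # Pattern 1: coordinates stated directly on the museum item.
    "Q_MUSEUM_A": {"coordinate location": (51.5, -0.1)},
    # Pattern 2: coordinates stored as a qualifier on the headquarters statement.
    "Q_MUSEUM_B": {
        "headquarters": {
            "value": "Q_CITY",
            "qualifiers": {"coordinate location": (48.9, 2.3)},
        }
    },
    # Pattern 3: the museum (legal entity) points to a separate building item.
    "Q_MUSEUM_C": {"housed in": "Q_BUILDING_C"},
    "Q_BUILDING_C": {"coordinate location": (40.4, -3.7)},
    # A museum with no location data at all.
    "Q_MUSEUM_D": {},
}


def find_coordinates(item_id: str) -> Optional[Coordinates]:
    """Try each of the three modelling patterns in turn."""
    item = ITEMS.get(item_id, {})

    # Pattern 1: direct statement.
    if "coordinate location" in item:
        return item["coordinate location"]

    # Pattern 2: qualifier on the headquarters statement.
    headquarters = item.get("headquarters")
    if headquarters and "coordinate location" in headquarters.get("qualifiers", {}):
        return headquarters["qualifiers"]["coordinate location"]

    # Pattern 3: follow the link to the building item, if there is one.
    building = ITEMS.get(item.get("housed in", ""), {})
    return building.get("coordinate location")
```

A consumer that only checks the first pattern would silently miss Q_MUSEUM_B and Q_MUSEUM_C, which is the kind of incomplete result described above.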

Repeated mistakes

There is also no way to collate subject-specific knowledge on how a given kind of data should be modelled, so editors cannot see how to model it correctly. E.g. many World Heritage sites include multiple buildings, and some include hundreds of listed buildings. Different editors have repeatedly added 'Heritage status = World Heritage site' to all the buildings inside a World Heritage site, which is incorrect. This leads to queries on World Heritage sites being completely wrong, showing hundreds of World Heritage sites in a single city.

Impedes third party re-use

There is a general lack of confidence that data will remain intact after a donation from a third party, or when simply reusing the data in another application. As there is no place to put a stamp of approval on a particular way of modelling something, it is much more difficult to have faith that it will not change (or to know where to look if you find it is changing).

Benefits of having community agreed data schemas on Wikidata

Some of the major benefits of a more unified approach with community agreed schemas are:

Data quality

  • Increases data quality and data completeness by allowing people to find and use the most appropriate schema for the subject.
  • Provides data that could be used to improve the ‘property suggester’ tool (statements which are part of the schema could be used as the highest priority suggestions).

Usability

  • Makes data easier to find and use, including being able to have simpler and more consistent queries.
  • Could be used by query tools like ProWD to understand data completeness.
  • Makes it easier to build third party applications using Wikidata, as expected models would be easy to find.

Community growth and health

  • Help people learn Wikidata more easily by having clear instructions to follow.
  • Reduce arguments and increase community health by having clear rules on data structure that everyone can follow.
  • An item creation wizard could be created based on the schema for that topic.

Existing work on data schemas on Wikidata

  • Wikidata schemas allow you to define models for any desired type/grouping of items, e.g. E10 is the schema for a human.
    • Any community agreed schema should ideally be converted into ShEx code as soon as possible.
    • They are written using the ShEx ("Shape Expressions") language. Shape Expressions are extremely flexible, allowing you to define very precise (or very broad!) conditions that the data should meet, as well as complex conditional requirements (e.g. a human item may or may not have a 'date of death', but if it does have one, then it must have a 'date of birth' statement as well).
    • Because of the complexity of learning the ShEx language, general non-technical Wikidata editors will not be able to write or edit Wikidata Schemas.
    • Once created, they can be used to do automated checking and reporting on Wikidata items. This could give you a report on what needs to be fixed, or just a completeness score for example.
  • Cradle allows people to create schemas; however, each schema is created by one user and is recorded in a separate database that is not part of Wikidata. These schemas could be used as starting points for community discussions to agree schemas.
  • Model items and showcase items both present best practice for describing a subject, but they do not explicitly describe what to include and do not include a defined schema. E.g. Douglas Adams is a model item for a person, but few of the statements for Douglas Adams are applicable to people in different professions.
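
The conditional requirement mentioned above (a 'date of death' is optional, but requires a 'date of birth' when present) and the idea of an automated report can be sketched in a few lines of Python. This is not ShEx and not how Wikidata Schemas are actually evaluated; the check_human function and the dictionary layout of the items are assumptions made purely for illustration.

```python
from typing import Dict, List


def check_human(item: Dict[str, str]) -> List[str]:
    """Return a list of problems found on a simplified 'human' item."""
    problems = []
    has_birth = "date of birth" in item
    has_death = "date of death" in item

    # 'date of death' may be absent, but if it is present
    # then 'date of birth' must be present as well.
    if has_death and not has_birth:
        problems.append("has 'date of death' but no 'date of birth'")
    return problems


# Hypothetical items using made-up values.
living_person = {"date of birth": "1970-01-01"}
incomplete_person = {"date of death": "2000-01-01"}

report = {
    "living_person": check_human(living_person),
    "incomplete_person": check_human(incomplete_person),
}
```

Here the report lists no problems for living_person and one problem for incomplete_person, which is the sort of "what needs to be fixed" output described above.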




Data schemas on Wikidata

A central place to discuss and create data schemas collaboratively

This proposal suggests a central discussion area, similar to Wikidata:Property proposal, for proposing and collaboratively developing Wikidata Schemas. 'Wikidata:Schema Requests' could use FormWizard and Visual Editor to lower the barriers to participation. This central proposal area should allow subject experts to develop schemas without requiring a deep level of technical knowledge. The schemas, once agreed, would then be recorded in Shape Expressions by people who have the technical knowledge. After a new Wikidata Schema has been created, the "proposal discussion" would be linked from its talk page (or transcluded into the page). All future discussion about the model would continue on the talk page, with the original proposal being archived.

Note: There are many ways in which the community could present the model it has agreed on. For example, we could use wiki tables, templates that show lists of expected Wikidata statements, or simply a list of bullet points initially. If the proposal here is agreed, there can be plenty of subsequent discussion about the best way to communicate the human generated plan.


Recording agreed data schemas

Models decided by the community would be recorded as new Wikidata Schemas (e.g. E10 for human) by editors who know how to write Shape Expressions.

The schema would then be linked to the corresponding Wikidata item for that class by a statement on the Wikidata item (e.g. human (Q5) --> Wikidata Schema --> E10). All items with instance of (P31) human (Q5) should then comply with this schema.

Note: The required property has already been proposed, but is on hold waiting for this Phabricator ticket, which will allow Wikidata Schemas to be used in statements.

We ultimately need some kind of link between individual instances and the Wikidata Schema that should be used to describe them. However, this would have to be in the form of User Interface enhancements to implement in the future. For example, when someone is editing Douglas Adams (Q42), it would make perfect sense to prompt the user that they should be using the schema for a "human", or have a button to check what's missing/complete. This is just an example solution, but it's included here to emphasise the end goal of Wikidata Schemas informing editors about community agreed models.
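
As a rough sketch of how such a prompt could decide which schema applies, the Python below resolves an item's schemas through its instance of (P31) values. The CLASS_TO_SCHEMA table and the schemas_for_item function are hypothetical; only the link from human (Q5) to E10 is taken from the text above, and the item is reduced to a plain dictionary rather than real Wikidata data.

```python
from typing import Dict, List

# Hypothetical table of "class item -> Wikidata Schema" statements.
# Only the Q5 -> E10 link comes from this proposal.
CLASS_TO_SCHEMA: Dict[str, str] = {"Q5": "E10"}


def schemas_for_item(item: Dict[str, List[str]]) -> List[str]:
    """Return the schemas that apply to an item via its 'instance of' (P31) values."""
    classes = item.get("P31", [])
    return [CLASS_TO_SCHEMA[c] for c in classes if c in CLASS_TO_SCHEMA]


# Simplified stand-in for Douglas Adams (Q42): instance of human (Q5).
douglas_adams = {"P31": ["Q5"]}
applicable = schemas_for_item(douglas_adams)
```

An editing interface could use a lookup like this to tell the editor of Q42 that the "human" schema applies, without the editor needing to know that schemas exist.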

Ways of finding and exploring Wikidata data schemas

The following are some suggested methods to encourage greater use and discoverability of Wikidata Schemas:

  • A statement for 'Wikidata schema' added to the Wikidata item for the subject that links to the Shape Expression (e.g. human (Q5) links to and from E10).
  • An equivalent of Wikidata:List of properties for schemas to be able to explore and search for schemas.
  • Some way of exploring how items in the same class relate to each other, e.g. the schema for humans and the schema for writers.
  • Show how far up the class tree the schema exists, e.g. the schema for a person would not include information about the taxonomic tree.
  • Some way to show the schema for the item class on each item page in that class, e.g. a template on the talk page, added by a bot, that shows the schemas for that class (such as author). A longer term version of this could be a 'completeness' scale for the item based on its schemas, and even a schema tab on each item.
  • Linking from Shape Expressions to Wikiprojects and vice versa to better involve relevant community groups in schema creation.
  • Item creation Wizards? e.g. Use the schema for an Author to create a Cradle style form for making new Authors.
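
The 'completeness' scale suggested in the list above could, in its simplest form, be the share of a schema's expected statements that an item already has. The Python below is a minimal sketch under that assumption; the completeness function and the list of expected statements for an author are invented for illustration and are not an agreed schema.

```python
from typing import Iterable, Set


def completeness(expected: Iterable[str], present: Set[str]) -> float:
    """Fraction of the schema's expected statements found on the item (0.0 to 1.0)."""
    expected_list = list(expected)
    if not expected_list:
        # An empty schema expects nothing, so any item is trivially complete.
        return 1.0
    found = sum(1 for prop in expected_list if prop in present)
    return found / len(expected_list)


# Hypothetical expected statements for an 'author' schema.
AUTHOR_EXPECTED = ["instance of", "date of birth", "occupation", "notable work"]

# Hypothetical item that has two of the four expected statements.
item_statements = {"instance of", "occupation"}
score = completeness(AUTHOR_EXPECTED, item_statements)
```

A real score would probably weight statements differently and take qualifiers and references into account; this version only counts presence.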


Note: There are already many discussions underway about exciting new ways to view and interact with ShEx Wikidata Schemas in a visual way, but these are all really a subject for future discussion. The important point to realise is that Schema pages will likely be human friendly places to visit once the initial run of experimentation and tool development runs its course.

Outstanding questions

  • Where exactly should the proposal area be? Maybe at Wikidata:Property proposal?
  • What format should the community-discussed, non-technical schemas be presented in when ready for a Shape Expression? Could we just use existing templates (e.g. Statement+), or tables in Visual Editor?
  • How could you define the most wanted or most valued statements in a schema? Maybe just 3 groups, "must have", "should have", "would be nice"?
  • How to capture agreement on granularity of items? E.g. a museum as one item or as two items, the building and the legal entity.
  • Where do we draw the line from one model to the next? ShEx allows you to extend any model to include expectations about the items that are linked (e.g. a museum should have a location, but that location should also have coordinates). It would seem logical that we do not extend beyond the model's boundaries. Each connected item should have its own model to follow.
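
One possible reading of the "must have", "should have", "would be nice" grouping is sketched below in Python: the agreed model is stored as three lists, and a report shows what an item is missing in each group. The MUSEUM_SCHEMA contents and the missing_by_tier function are hypothetical and only reuse the museum facts mentioned earlier (location, date opened, collection type).

```python
from typing import Dict, List, Set

# The three priority groups suggested above, in order of importance.
TIERS = ("must have", "should have", "would be nice")

# Hypothetical tiered model for a museum.
MUSEUM_SCHEMA: Dict[str, List[str]] = {
    "must have": ["instance of", "coordinate location"],
    "should have": ["date opened"],
    "would be nice": ["collection type"],
}


def missing_by_tier(schema: Dict[str, List[str]], present: Set[str]) -> Dict[str, List[str]]:
    """List the expected statements an item lacks, grouped by priority tier."""
    return {
        tier: [prop for prop in schema.get(tier, []) if prop not in present]
        for tier in TIERS
    }


# Hypothetical museum item with two statements.
museum_statements = {"instance of", "date opened"}
missing = missing_by_tier(MUSEUM_SCHEMA, museum_statements)
```

A plain structure like this could be drafted by subject experts in a table or bullet list and later translated into Shape Expressions by someone who knows ShEx.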