User:ChristianKl/Draft:New Ranks

In addition to the three ranks we currently have (Preferred, Normal and Deprecated), I believe we need two new ones. In this document I want to make the case for the ranks Uncertain and False.

Status quo of ranks and conflicting statements

In many cases where a data user such as a Wikipedia infobox asks for a single value of a statement, Wikidata responds with a so-called truthy value. A Wikipedia infobox about a city, for example, usually wants to show the reader a single value for its population.

Normal and Preferred

Normal: Berlin (Q64) population (P1082) 3,574,830 / point in time (P585) 31 December 2016

Preferred: Berlin (Q64) population (P1082) 3,613,495 / point in time (P585) 31 December 2017

This will return the truthy value Berlin (Q64) population (P1082) 3,613,495. An infobox that wants to display a single value can now display 3,613,495. The concept of truthy values allows Wikidata to store a complex reality of multiple values while still giving data users who want a simple reality access to a single value.

Given that a person can only be born once, you would expect people to have only one value for date of birth (P569). The case of Rosa Luxemburg (Q7231) is one where we want to store more than one value for date of birth (P569). While Rosa Luxemburg celebrated her birthday on 5 March, according to her birth certificate she was born on 25 December 1870. On Wikidata we list:

Preferred: Rosa Luxemburg (Q7231) date of birth (P569) 5 March 1871

Normal: Rosa Luxemburg (Q7231) date of birth (P569) 25 December 1870

We decided to store both values on Wikidata, and when someone asks for a truthy value we return 5 March 1871. Ranks allow us to both store the messy reality and give simple answers. Ranks are awesome. There are cases where there are two statements with normal rank; in those cases Wikidata returns the first. Given that we would prefer that our users decide which value should be returned, we have the single best value constraint (Q52060874) to indicate that there is a problem and that a user should move one of the statements to the preferred rank.
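The selection rule described in this section can be sketched in a few lines of Python. This is a simplified model for illustration, not the actual Wikibase implementation: the `Statement` class and the function names `best_statements` and `single_value` are assumptions made up for this example, and "take the first" is modelled as a choice made by the consumer that wants a single value.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    value: str
    rank: str = "normal"  # one of "preferred", "normal", "deprecated"


def best_statements(statements):
    """Return the truthy statements: preferred ones if any exist, else normal ones.

    Deprecated statements are never returned.
    """
    preferred = [s for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.rank == "normal"]


def single_value(statements):
    """Model a consumer that wants exactly one value: take the first best statement."""
    best = best_statements(statements)
    return best[0].value if best else None


berlin_population = [
    Statement("3,574,830", "normal"),     # point in time: 31 December 2016
    Statement("3,613,495", "preferred"),  # point in time: 31 December 2017
]
luxemburg_birth = [
    Statement("5 March 1871", "preferred"),
    Statement("25 December 1870", "normal"),
]
# Placeholder values: two normal-rank statements and no preferred one.
two_normals = [Statement("value A"), Statement("value B")]

print(single_value(berlin_population))
print(single_value(luxemburg_birth))
print(single_value(two_normals))
```

In the last case both statements are equally "best", which is the situation the single best value constraint (Q52060874) is meant to flag.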

Deprecated

There are also cases where we want to store information that doesn't get returned when a user asks for a truthy value. If there are two VIAF numbers that point to the same person, VIAF merges the two entries and deprecates one of the values. In that case we list it with reason for deprecation (P2241) deprecated identifier value (Q67125514). We also have reason for deprecation (P2241) applies to other person (Q35773207). Those two are ontologically very different. If a person asks Wikidata for the person who is known under the VIAF ID XYZ, Wikidata should return an item where the identifier is tagged with deprecated identifier value (Q67125514) but not one where it's tagged with applies to other person (Q35773207). Having to look at the qualifiers here is inconvenient for data reusers, and I would expect that in many cases reusers will forget to look at those qualifiers. There are also deprecation reasons like unconfirmed hypothesis (Q67203058) that again belong to an ontologically different class of statement. In those cases we don't believe anything is wrong; we just don't know whether it's right.
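To make the inconvenience concrete, here is a minimal sketch of what a data reuser has to do today when resolving an identifier. The item names `item-A`, `item-B` and `item-C` and the flat tuple layout are hypothetical and chosen only for this example; `XYZ` is the placeholder ID from the paragraph above.

```python
DEPRECATED_IDENTIFIER_VALUE = "Q67125514"
APPLIES_TO_OTHER_PERSON = "Q35773207"

# Hypothetical data: (item, VIAF ID, rank, reason for deprecation (P2241) or None)
viaf_statements = [
    ("item-A", "XYZ", "deprecated", DEPRECATED_IDENTIFIER_VALUE),
    ("item-B", "XYZ", "deprecated", APPLIES_TO_OTHER_PERSON),
    ("item-C", "ABC", "normal", None),
]


def items_for_viaf(viaf_id, statements):
    """Return the items that a given VIAF ID should resolve to.

    A deprecated statement still counts when the ID was merely retired
    (deprecated identifier value), but not when it belongs to someone else.
    """
    matches = []
    for item, value, rank, reason in statements:
        if value != viaf_id:
            continue
        if rank != "deprecated" or reason == DEPRECATED_IDENTIFIER_VALUE:
            matches.append(item)
    return matches


print(items_for_viaf("XYZ", viaf_statements))
```

A reuser who only filters on rank either loses `item-A` (by dropping all deprecated statements) or wrongly gains `item-B` (by keeping them all). Getting it right requires the extra qualifier check.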

statement disputed by (P1310)

Wikidata tries to allow for diverse perspectives and to store multiple conflicting perspectives at the same time. statement disputed by (P1310) is one tool to store conflicting ideas about truth. The only example currently used for the property is:

Waltz No. 17 in E-flat major, Op. posth. (Q16747520) composer (P86) Frédéric Chopin (Q1268) / statement disputed by (P1310) Chomiński catalog (Q16749680) with the reference stated in (P248) Kobylańska Katalog (Q16747642).

This statement is unclear. A data user doesn't know whether the reference supports the claim being true or the claim being false. This is especially true in cases where the target value of statement disputed by (P1310) isn't an information source itself but a person or an organization.

New Ranks

Uncertain

Uncertain is a new rank that isn't truthy. Low-quality data that's tagged as uncertain poses no problem for any data reuser who cares about data quality, as it's easy to ignore. It should replace usages of deprecated rank with the reason unconfirmed hypothesis (Q67203058). I see three main advantages of this new rank:

1) When we currently have automatic creation of statements, we often deal with data that isn't perfect and where we want human supervision before we declare statements to be true. The Primary Sources tool used to be a way to host those claims. Storing those claims with the rank uncertain instead would integrate them better into Wikidata. It would allow users who like to interact with the data to do so in whatever way they consider best. Users wouldn't need to use the Primary Sources tool to interact with the claims.

2) Having an uncertain rank will allow us to have higher sourcing standards for statements with normal rank, which can be accessed by data users who ask for truthy values, without being deletionist and making it harder to contribute to Wikidata. We could have a constraint "This property can only be used for statements with Normal/Preferred rank if it has a source".

3) When doing research there are many cases where a source is not 100% trustworthy but a researcher still wants to record the claim. Having an uncertain rank would allow a researcher to record the claim while recording their uncertainty about the reliability of the source at the same time. This will result in fewer unreliable claims at normal rank and thus higher data quality.
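The sourcing constraint from point 2 could be checked roughly as follows. This is a sketch under assumptions: the dictionary layout, the function names `needs_source` and `truthy`, and the sample values are invented for illustration, and the uncertain rank itself is only a proposal.

```python
def needs_source(statement):
    """Proposed constraint: a normal or preferred statement must carry a reference."""
    return statement["rank"] in ("preferred", "normal") and not statement["references"]


def truthy(statements):
    """Preferred statements if any exist, else normal ones; uncertain is never truthy."""
    preferred = [s for s in statements if s["rank"] == "preferred"]
    return preferred or [s for s in statements if s["rank"] == "normal"]


# Hypothetical statements for a single property.
statements = [
    {"value": "sourced claim", "rank": "normal", "references": ["some source"]},
    {"value": "unsourced claim", "rank": "normal", "references": []},
    {"value": "bot import awaiting review", "rank": "uncertain", "references": []},
]

violations = [s["value"] for s in statements if needs_source(s)]
print(violations)
print([s["value"] for s in truthy(statements)])
```

Under this model the unsourced normal-rank claim is flagged, and it could be moved to the uncertain rank rather than deleted. The unreviewed import never reaches users who ask for truthy values and never triggers the constraint.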

False

False is a new rank that also isn't truthy. In addition, if there's a statement with normal rank and an equivalent statement with the rank false, the statement with normal rank won't be returned as a truthy value. In cases where there are conflicting claims by different parties about what the truth happens to be, we want to make a conscious decision about what gets returned as truthy by setting the preferred rank explicitly.

1) This makes it clear whether references support or speak against a claim, and thus gives us a clearer mechanism than statement disputed by (P1310), one that actually has an influence on the truthiness of the statement.

2) Currently we have properties like does not have part (P3113) for the opposite of has part (P527). It's inconvenient to have to create new properties whenever we want this functionality. Having this functionality as a rank, where it can be used whenever we need it, is a lot cleaner.
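The intended effect of the false rank on truthy values can be sketched like this. The sketch makes two assumptions that the proposal leaves open: "equivalent" is read as "has the same value", and an explicitly preferred statement is returned even if a false-ranked counterpart exists. The function name `truthy_values` is made up for the example.

```python
def truthy_values(statements):
    """statements: list of (value, rank) pairs for one property of one item."""
    preferred = [value for value, rank in statements if rank == "preferred"]
    if preferred:
        # Assumption: an explicit preferred rank is the conscious decision
        # and therefore wins over a false-ranked counterpart.
        return preferred
    refuted = {value for value, rank in statements if rank == "false"}
    # A normal statement is suppressed when an equivalent false statement exists.
    return [
        value
        for value, rank in statements
        if rank == "normal" and value not in refuted
    ]


# composer (P86) of Waltz No. 17 in E-flat major, Op. posth. (Q16747520)
disputed = [("Frédéric Chopin", "normal"), ("Frédéric Chopin", "false")]
decided = [("Frédéric Chopin", "preferred"), ("Frédéric Chopin", "false")]

print(truthy_values(disputed))
print(truthy_values(decided))
```

With only the normal and false statements present, nothing is returned as truthy. Once an editor sets the preferred rank, the attribution is returned again, while the false-ranked statement keeps the references that speak against it.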

UX Considerations

I don't want to prescribe what the UI with 5 ranks instead of 3 should look like, and it makes sense to task the WMDE user experience people with finding the best implementation. There's a lot of room for how the ranks could be displayed in a clear way to the user. When the false rank gets used, I can imagine using strike-through for the relevant text.