Wikidata talk:Requests for comment/Data quality framework for Wikidata

Feedback #1[edit]

Hoi, so you chose to open a can of worms. Fine. Quality exists on many levels, and quality on one level does not mean quality on another. When you for instance indicate that an item is the same as a data entry in an external source, that is 100% valid. However, when you THEN import the data from that source, you assume that because of the trusted nature of that source it is "ok": for instance, indicating that a substance is a "medicine". When the literature totally debunks the substance and shows it to be worse than a placebo, you have imported a problem, a bias.

The question of quality exists on many levels. The only sane way to treat data is by comparison and by sources. When many sources agree on particular statements, you do not need to focus on those statements. When statements differ, you want to research. When they are the same you may import them, particularly when multiple sources agree. However, once there is disagreement you want to flag the differences, and you may want NOT to import.

Quality is not in sources; they are often wrong. Quality is more often in consensus and in the steady grind of finding sources where they are lacking. Thanks, GerardM (talk) 09:17, 12 August 2016 (UTC)[reply]

Question 1: How would you propose to reconcile data available in multiple sources?
Question 2: How would you engage the community to look into potential issues?
Question 3: How are suggestions about data quality understood in a multilingual environment?
Question 4: How do you propose to measure growing quality - what are the KPIs?
Question 5: When an authoritative source indicates that something is true and later research debunks this, how would you demonstrate this, and what does it mean for the "authoritative" status of said source?

Thanks, GerardM (talk) 10:06, 14 August 2016 (UTC)[reply]


I'm not sure how far the "Intrinsic dimensions" are consistent with the current practice of accommodating and ranking contradicting statements. The note under "Objectivity" seems too limited; obviously the impact varies between different types of statements.
The reliance on sources believed to be "authoritative" or "universally (non-)authoritative" is a Wikipedia problem we should try to avoid. With the stated "intrinsic dimensions", there doesn't seem to be an explicit goal of avoiding reliance on single sources.
--- Jura 13:16, 14 August 2016 (UTC)[reply]
How do you expect to avoid the problems in single sources? We do indicate that substances are a "medicine" while we know that they are ineffective at best and dangerous as well. Do not think that it is their problem: our data ends up in their projects. Thanks, GerardM (talk) 13:57, 14 August 2016 (UTC)[reply]
I don't think it can be avoided entirely, but if the purpose of this is to describe an ideal 20 items out of our 20,000,000, then I think it's worth stating it.
--- Jura 14:04, 14 August 2016 (UTC)[reply]
I think that is irresponsible. We do not work for the best 20 items; we work for all of them. If that does not fit, the framework is faulty, never mind whose framework it is. Thanks, GerardM (talk) 14:42, 14 August 2016 (UTC)[reply]
Do you mean the framework as such, or explicitly stating that we aim to provide multiple sources for a claim? Given that part of the framework relies on some unspecified use case (e.g. feed G), it's hard to assess it in general.
--- Jura 14:46, 14 August 2016 (UTC)[reply]
When we refer to a source and claim that its data is about the same subject, there is no problem. The problem arises when we import data and state that it is correct because it is from a "trusted source". A claim is imo a statement; we do not have any process going that reconciles data in different sources, so importing IS problematic. Thanks GerardM (talk) 11:22, 16 August 2016 (UTC)[reply]
@GerardM: interesting questions. I would not be able to answer all of them, but I will try to explain my point of view. I think that if we accept the definition of Wikidata as a secondary database, we should assess each piece of information with respect to its provenance, without assuming that there is a truth out there. This is why (1) assessing the reliability of sources becomes even more important and (2) appropriate ranks should be assigned to statements. With regard to Wikidata's multi-lingual environment, I would rather speak of a multi-cultural environment. Again, I think that the possibility to add contrasting statements should be used to enable different cultural points of view to be reflected. With regard to Q2 and Q4, I think it depends also on what the community thinks quality is and what the main issues are, but if you have any suggestions, I'd be happy to read them.--Alessandro Piscopo (talk) 09:31, 16 August 2016 (UTC)[reply]
Multi-cultural is distinctly different from multi-lingual. You can easily assess the quality of language support by taking labels, and the growth of those labels in a language, as a KPI. As it is, there is no language with 100% support. Multi-cultural means accepting the differences between, say, German, Austrian, and Swiss usage and having a way of dealing with those differences. Thanks, GerardM (talk) 11:25, 16 August 2016 (UTC)[reply]
So I think I misunderstood your previous comment. Then yes, multilinguality is taken into account by the framework draft, as ease of understanding. However, this considers the number of languages in which labels are available for an Item (a possible metric could be to define the coverage that some languages provide in terms of world population), but it does not take into account a measurement from a language-coverage point of view. Speaking in theoretical terms, would you include such a measurement under the same dimension (ease of understanding)? --Alessandro Piscopo (talk) 12:57, 16 August 2016 (UTC)[reply]

Ease of understanding[edit]

Ease of understanding is again something entirely different. "Ease of understanding" is about whether the data makes sense given one language, one culture. I can safely say that understanding the data of Wikidata is exceedingly difficult. There are many words in use that make little sense in the way they are ordinarily understood. That has, among other things, everything to do with the convoluted upper structure of Wikidata. Thanks, GerardM (talk) 18:48, 16 August 2016 (UTC)[reply]

What do you mean when you say that "understanding the data of Wikidata is exceedingly difficult"? Could you give some concrete examples? Thanks, --Alessandro Piscopo (talk) 16:08, 17 August 2016 (UTC)[reply]
Compare Wikidata with Reasonator. They are worlds apart, and it is much easier to understand the data in Reasonator. For all I care you can link within both examples to wherever; the experience remains the same. Wikidata is not fit for humans to understand what the data is about, and that is a quality that is desperately needed. Talking about quality in the abstract, while there is a large body of thought on the subject that could make a practical difference, is a waste of time.
In my opinion we can take your points and discuss practical ways of improving them and the metrics involved, but "hey guys, we think it is hard so make it easy for us" will not help us much. Thanks, GerardM (talk) 09:31, 23 August 2016 (UTC)[reply]
Hi GerardM, please be assured that setting up a framework following previous literature on data quality and our personal observations would have been the easy way, rather than discussing it here. As I have already written in another reply, this RfC has been opened to collect advice and opinions and to stimulate discussion about data quality within the Wikidata community, because I am convinced that this can be beneficial to Wikidata itself and because I think that, given the collaborative nature of Wikidata, this was the most appropriate way to proceed. Thanks, --Alessandro Piscopo (talk) 12:05, 24 August 2016 (UTC)[reply]
Previous literature is exactly that. Wikidata is by definition different; it is run on "community consensus" and everyone does his own thing. The notion that you could get away with just publishing is naive. Ask yourself: we have people adding substances as a "medicine" even though we KNOW that these substances are no more effective than a placebo and have negative side effects to boot, based on their ass-umption that because an official resource says a substance is permitted for medical use, it is safe to re-publish this on Wikidata. We do not have a way to meaningfully discuss this; we do not have a way to annotate the issues. It is just people with an interest thinking this is ok. A framework without practical considerations is imho a waste of paper; just a mind game. Thanks, GerardM (talk) 05:11, 25 August 2016 (UTC)[reply]

Samples[edit]

  • Description and examples: All instances of the class "human" (Q5) must have the "date of birth" (Property:P569) property.
  • Description and examples: All Italian towns should be present, given the class "comune of Italy" (Q747074).

I tend to disagree with both.
--- Jura 11:00, 12 August 2016 (UTC)[reply]

Maybe "municipality" would be a better word than "town".
--- Jura 13:16, 14 August 2016 (UTC)[reply]
Hi @Jura1: do you disagree on the terminology used or on the fact that they are used to define completeness? Could you please explain your thoughts about that? Thanks, --Alessandro Piscopo (talk) 09:34, 16 August 2016 (UTC)[reply]
It's the second point, but for the Italian sample it might just be a question of wording. All towns should link to their municipalities, and all municipalities should be found with Q747074. Not sure which one you meant.
--- Jura 17:55, 16 August 2016 (UTC)[reply]
I see. That is the case, for instance, with Q84 and Q23306. I find this distinction correct, but I believe it may be confusing for several users (I hadn't thought about that at the beginning). However, this does not bear on the definition of 'population completeness', namely that, given a class, all its instances should be present. And if not all, what would be a good measure of completeness in this case?--Alessandro Piscopo (talk) 15:55, 17 August 2016 (UTC)[reply]
  • Description and examples: All instances of the class "human" (Q5) must have the "gender" (Property:P21) property (with values male, female, etc, or unknown).
  • Description and examples: All Italian municipalities should be present, given the class "comune of Italy" (Q747074).
Maybe the above.
--- Jura 16:09, 19 August 2016 (UTC)[reply]
Hi, why would you change "date of birth" (Property:P569) with "gender" (Property:P21) in the example? Thanks, --Alessandro Piscopo (talk) 08:05, 22 August 2016 (UTC)[reply]
It seems pointless to add P569 to every person on Wikidata:WikiProject_Ancient_Rome/lists/people.
--- Jura 09:37, 23 August 2016 (UTC)[reply]
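
For what it's worth, sample constraints of this kind translate directly into queries. Here is a minimal sketch in Python, assuming the public Wikidata Query Service endpoint; a full scan of Q5 is heavy, hence the LIMIT:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# Humans (Q5) without any "sex or gender" (P21) statement, per the revised
# sample above. The LIMIT keeps the query within WDQS timeout limits.
QUERY = """
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 .
  FILTER NOT EXISTS { ?person wdt:P21 [] }
} LIMIT 100
"""

r = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                 headers={"User-Agent": "quality-rfc-sketch/0.1"})
for row in r.json()["results"]["bindings"]:
    print(row["person"]["value"])
```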

Modeling[edit]

It seems to me you can't evaluate data quality without starting with an evaluation of the models. --Melderick (talk) 22:13, 12 August 2016 (UTC)[reply]

It seems to say you need to define the task at hand and then figure out its #Contextual_dimensions, mainly "completeness", including "Schema". So if the task at hand is to fill in an infobox at Wikipedia, this defines the completeness criteria. What seems odd is that it can only be complete if an external vocabulary (supposedly non-Wikipedia or non-DBpedia) allows it to be machine-read (interpretability)? Potentially this could constrain contributors' freedom to model.
--- Jura 13:16, 14 August 2016 (UTC)[reply]
Defining the task at hand would actually be quite complicated: there are several types of tasks for which Wikidata could be used. It would be interesting to know what the community thinks Wikidata can be used for, and what it therefore thinks the quality standards should be for those tasks. --Alessandro Piscopo (talk) 09:48, 16 August 2016 (UTC)[reply]

Data or information quality?[edit]

There is data quality (Q1757694) and there is information quality (Q3412851). I just tried to tidy up these items by moving Wikilinks and labels between the two. This itself gives an example of information quality: given two Wikidata items, how many labels/Wikilinks/statements should better be moved from one to the other? As users may disagree, this question cannot be answered by a simple number. An example of data quality may be: does every Wikidata item have at least one label? -- JakobVoss (talk) 18:31, 14 August 2016 (UTC)[reply]

A bot is regularly run that adds labels to items when they have at least one link. There are also stats for this. Thanks, GerardM (talk) 17:50, 25 August 2016 (UTC)[reply]
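
JakobVoss's data-quality example (does every item have at least one label?) is mechanically checkable per item. A minimal sketch against the standard wbgetentities API; checking all of Wikidata this way would of course need a dump instead:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def has_label(qid: str) -> bool:
    # wbgetentities returns the labels attached to an item in every language;
    # an empty (or missing) "labels" field means the item has no label at all.
    r = requests.get(API, params={"action": "wbgetentities", "ids": qid,
                                  "props": "labels", "format": "json"})
    entity = r.json()["entities"][qid]
    return bool(entity.get("labels"))

print(has_label("Q42"))  # Douglas Adams: labelled in many languages -> True
```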

Completeness wrt. another dataset : definitions[edit]

Full (formal) completeness
There is a subset of Wikidata that is isomorphic to the dataset, which means a bot can reconstruct the original dataset if the models of both sides are known. Of course the transformation has to be accurate, in the sense that the meanings of both models have to be preserved by the transformation: we should not transform a date of birth into a date of death, and vice versa.
Sub (formal) completeness
There is a subset of the original dataset with an isomorphism to a subset of Wikidata. Same constraint as above. This can for example occur if the modelling power of Wikidata is not good enough to represent things in the dataset.

The two definitions above are useful if the datasets are databases, of course.

There are other kinds of datasets, such as books, with a lot of textual information that is not so easily extracted. Maybe we could define a sort of "sub-completeness" for these cases, where the Wikidata dataset is complete wrt. an extraction process, or where a human can tell that all the usable information has been extracted from the source. author  TomT0m / talk page 14:14, 16 August 2016 (UTC)[reply]
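
To make the definitions above concrete, here is a toy formalisation in Python. The item IDs are real (Q7259 is Ada Lovelace, Q7251 is Alan Turing), but the flattened representation, the mapping, and the year-only values are invented simplifications; full completeness would additionally require the inverse direction to hold:

```python
# External dataset: person -> year of birth (a simplification; real dates
# would need the same precision handling on both sides).
external = {
    "Ada Lovelace": 1815,
    "Alan Turing": 1912,
}

# The relevant Wikidata subset, flattened to (item, property) -> value.
wikidata_subset = {
    ("Q7259", "P569"): 1815,   # Ada Lovelace, date of birth
    ("Q7251", "P569"): 1912,   # Alan Turing, date of birth
}

# The transformation: which external record corresponds to which item.
mapping = {"Ada Lovelace": "Q7259", "Alan Turing": "Q7251"}

def is_sub_complete(external, wd, mapping, prop="P569"):
    # Sub (formal) completeness: every external record has a matching,
    # meaning-preserving statement in the Wikidata subset.
    return all(wd.get((mapping[name], prop)) == value
               for name, value in external.items())

print(is_sub_complete(external, wikidata_subset, mapping))  # True
```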

One of the features of the COOL-WD tool is the possibility to add completeness statements to Items. My doubt is that it would be difficult to do this at a large scale (i.e. on the whole of Wikidata). What do you think about that? Thanks, --Alessandro Piscopo (talk) 08:16, 22 August 2016 (UTC)[reply]
I have tested that tool and I don't think it is useful to add completeness statements for each statement, and I'm missing a way to get this information via a SPARQL query. For example, it is useful to get the information whether all team members of a sports team are listed in an item. The next step is to query all team items to get those teams which are incomplete, and there should be a possibility to show Wikipedia editors, via a Lua module, that the list is incomplete. I have myself thought about something like a "traffic light" that shows red, yellow or green to signal the quality of the data to editors. I mean, we have "Citation needed" at enWP; that is the same thing, we are used to showing quality problems to the public. --Molarus 10:53, 1 September 2016 (UTC)[reply]
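
Molarus's traffic-light idea can be sketched against live data, assuming an expected squad size supplied externally (e.g. by a COOL-WD-style completeness statement); note that "member of sports team" (P54) also matches former members, which a real metric would have to filter out:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

def member_count(team_qid: str) -> int:
    # Players whose "member of sports team" (P54) statement points at the
    # team; this includes former members, so it overestimates a current squad.
    query = f"""
    SELECT (COUNT(DISTINCT ?player) AS ?n)
    WHERE {{ ?player wdt:P54 wd:{team_qid} . }}
    """
    r = requests.get(WDQS, params={"query": query, "format": "json"},
                     headers={"User-Agent": "quality-rfc-sketch/0.1"})
    return int(r.json()["results"]["bindings"][0]["n"]["value"])

def traffic_light(team_qid: str, expected: int) -> str:
    # expected is the externally supplied completeness figure (hypothetical).
    n = member_count(team_qid)
    if n >= expected:
        return "green"
    return "yellow" if n >= 0.8 * expected else "red"

# e.g. traffic_light("Q...", 25) for some team item; thresholds are arbitrary.
```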

Origin of information[edit]

There is a problem with self-referencing information.

A piece of information should be supported by at least one observation. Naively we could think that the number of sources from which we can extract the same statement is a good estimate of the trustworthiness of some information, but that is not the case: we don't know whether the sources copied each other, and so on. I think that good data should have an identifiable origin - for example a scientific experiment - and that Wikidata should be able to store the fact that the initial origin of the data is known, to ensure there is a real path with no loop ... author  TomT0m / talk page 14:23, 16 August 2016 (UTC)[reply]

I think this is really a good point. However, it could be hard for users to find the origin of a piece of information or to understand whether some sources are circularly referencing each other. It would be a difficult feature to add: perhaps we can rely on ranks to characterise the trustworthiness of data, according to their source. --Alessandro Piscopo (talk) 08:21, 22 August 2016 (UTC)[reply]
Ranks already have a function and partly answer the question, but they are not the ultimate solution imho. The deprecated rank is for statements that are not considered true by today's standards, and the couple "preferred"/"normal" rank is used to discriminate between historical and present-day data, so it does not mix very well with trustworthiness. "Trustworthiness of data, according to their source" => isn't this the role of "precision" on the datatype? We also have a couple of qualifiers to deal with that, like "circa", "claim disputed by", maybe others. author  TomT0m / talk page 09:21, 22 August 2016 (UTC)[reply]
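
For illustration, TomT0m's "path with no loop" requirement amounts to cycle detection over a derives-from graph between sources; the graph below is invented:

```python
# A statement is well-grounded only if some citation chain behind it bottoms
# out at a primary observation without looping. Modelled as "source A
# derives from source B" edges; an empty edge list marks a primary origin.
cites = {
    "newspaper": ["encyclopedia"],
    "encyclopedia": ["newspaper"],   # circular pair: neither is primary
    "survey_paper": ["experiment"],
    "experiment": [],                # primary origin: cites nothing
}

def grounded(source, seen=()):
    # Grounded iff some citation path reaches a primary origin without
    # revisiting a node (i.e. without looping).
    if source in seen:
        return False
    if not cites[source]:
        return True
    return any(grounded(s, seen + (source,)) for s in cites[source])

print(grounded("newspaper"))     # False: only a loop behind it
print(grounded("survey_paper"))  # True: reaches an experiment
```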

Ideas for building metrics to measure the quality of our model[edit]

Number of constraint violations and exceptions; or the ratio of constraint violations wrt. the number of constraint matches, i.e. the number of results of a query that would retrieve all correct instances of the constraint seen as a query: for example, if a human must have a date of birth, the actual number of humans who do have a date of birth.

The idea is that a big number of constraint violations indicates either a non-relevant constraint, and hence a problem in the model we expect, or a problem in the data.

The number of constraint exceptions is also an indicator that could show a problem in the data format the community intends. The idea is that the constraint could be a bad one, and we should think of a better model if someone keeps adding exceptions constantly. I guess that for a growing dataset the ratio of exceptions should not increase over time, or should tend towards 0, for something to count as a proper exception and not as a way to make an unsustainable model work.
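
A sketch of the violation ratio described above, assuming the public Wikidata Query Service; the class/property pair (Italian comuni and population) is only an example, chosen because the class is small enough for the aggregate not to time out:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# How many instances of "comune of Italy" (Q747074) satisfy the constraint
# "should have a population (P1082)"? ?has is only bound when the OPTIONAL
# matches, so counting it gives the number of satisfying items even when an
# item carries several population values.
QUERY = """
SELECT (COUNT(DISTINCT ?item) AS ?total) (COUNT(DISTINCT ?has) AS ?satisfied)
WHERE {
  ?item wdt:P31 wd:Q747074 .
  OPTIONAL { ?item wdt:P1082 ?population . BIND(?item AS ?has) }
}
"""

r = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                 headers={"User-Agent": "quality-rfc-sketch/0.1"})
row = r.json()["results"]["bindings"][0]
total = int(row["total"]["value"])
satisfied = int(row["satisfied"]["value"])
print(f"constraint violation ratio: {1 - satisfied / total:.2%}")
```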

On a more conceptual level, a bad modelling smell imho is a bad "domain property fork". In human languages we sometimes give different names to very close notions in different fields. The idea is that the community is likely to import these peculiarities in the form of different properties, which might or might not have a common superproperty. That could be a bad thing: it is language-dependent, and some languages might use one word where another language uses two, so a model might really match one language's modelling of the field but not match at all in another language. Then a question arises: if one language can model a field totally correctly using few words, why would we use more than necessary in Wikidata? This is imho a risk of "domain model forks" that are useless if a single model is enough for the two fields. That thought comes from a concrete idea for detecting which forks are bad smells: I usually oppose property proposals for which a very close property in another field could be slightly generalised to this field with no major changes in the model. For example, a scene lighter in opera and in theatre could have different names. It might be justified to create properties specific to each, but how do we know when? I think some hint could come from the benefits in terms of actual constraints we can build with the new property, beyond checking that the field of work is the correct one. If we can only check that an "opera scene lighter" works in opera, and that a "theatre scene lighter" works in theatre, and nothing more, I think it is useless to fork the property.

Catching this at an early stage could spare the community useless work: if two domains are so close that they could be modelled the same way, but we create for each a set of roughly equivalent properties, constraints, constraint reports and so on, that is pointless duplication. Worse, the two models could diverge for no good reason, which would make the overall model more complex and probably less easy to grasp for newcomers.

A bad-smell pattern for a property fork, for example: a property P is a superproperty of P1 and P2. The domain of P (P1, P2 resp.) is a class C (C1, C2 resp.); its range is a class D (D1, D2 resp.). The constraints of P, P1 and P2 are the same if you substitute C with C1 or C2 and D with D1 or D2. Then the fork is probably counterproductive (a toy formalisation is sketched after this comment).

Metrics and bad-smell patterns about model quality could give hints to those of us who work on property proposals. author  TomT0m / talk page 10:42, 3 September 2016 (UTC)[reply]
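
A toy formalisation of the fork pattern, under the simplifying assumption that constraints other than domain and range can be compared as plain sets; all names and classes here are invented:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prop:
    name: str
    domain: str                     # class C: required type of the subject
    range_: str                     # class D: required type of the value
    extra: frozenset = frozenset()  # any constraints beyond domain/range

def is_suspect_fork(p1: Prop, p2: Prop) -> bool:
    # After substituting C1/C2 for C and D1/D2 for D, the constraint sets
    # coincide: the fork only encodes "works in the right field" and adds
    # no checking power over a single generalised property.
    return p1.extra == p2.extra

opera = Prop("opera scene lighter", "opera production", "human")
theatre = Prop("theatre scene lighter", "theatre production", "human")
print(is_suspect_fork(opera, theatre))  # True: only the field differs
```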

completeness and the "number" property[edit]

The quantity (P1114) property is intended to express, for example, the number of instances of a class, and as such it can be used to assess whether Wikidata knows every instance of some class. author  TomT0m / talk page 10:43, 2 October 2016 (UTC)[reply]
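
A minimal sketch of that check, assuming the class item actually carries a quantity (P1114) statement (if it does not, the query simply returns no rows):

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# Declared size of the class vs. the number of instances actually present.
# Equal numbers hint at population completeness in the sense above.
QUERY = """
SELECT ?declared (COUNT(DISTINCT ?inst) AS ?actual)
WHERE {
  wd:Q747074 wdt:P1114 ?declared .
  ?inst wdt:P31 wd:Q747074 .
}
GROUP BY ?declared
"""

r = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                 headers={"User-Agent": "quality-rfc-sketch/0.1"})
for row in r.json()["results"]["bindings"]:
    print("declared:", row["declared"]["value"], "actual:", row["actual"]["value"])
```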

I don't remember if this was considered. --Nemo 23:46, 22 March 2017 (UTC)[reply]

Published: What We Talk About When We Talk About Wikidata Quality: A Literature Survey[edit]

The outcome was presented at OpenSym just now and the paper is available at https://opensym.org/wp-content/uploads/2019/08/os19-paper-A17-piscopo.pdf Nemo 13:29, 22 August 2019 (UTC)[reply]