Wikidata talk:Ontology issues prioritization


Results of the survey on ontology issues reusers are facing[edit]

Wikidata ontology issues — suggestions for prioritisation 2023

Ontology issues have been identified as a major problem reusers are facing when trying to build applications, services, etc. with Wikidata's data. We dug into the issue, identified different types of ontology issues that are happening on Wikidata, and then discussed this initial classification at the Data Quality Days and WikidataCon. Earlier this year we conducted a survey among reusers. We wanted to better understand how severe and widespread each of these issues is, and to check if we missed any important issue type. The focus was on finding the issues that hinder reuse of our data the most. We now want to share the results of this survey. You can find them summarized in this slidedeck on Commons.

It would be great if you could have a look at the results. Is there anything that jumps out at you, or anything where you can provide additional context, for example? What do you think is the most important thing to do to address the issues (regardless of who'll need to do it)? We have also collected potential solutions for the most important issues, based on the many discussions we've had over the past years on this topic, for example at the two Data Quality Days events. We want to get a broader set of ideas and not limit them to existing approaches, so we'd be very happy to hear your thoughts on this topic first, and we'll summarize and share the potential solutions here in about a week.

Cheers Lydia Pintscher (WMDE) (talk) 09:01, 21 June 2023 (UTC)[reply]

Thanks very much for this interesting survey. I reflected on it yesterday and here are my first thoughts:
  • in general: I very much agree with the results of the survey and I would like to underline something I consider particularly relevant.
    • 1) "There are not a lot of resources on best practices for data modelling on Wikidata" is very much true: materials on Wikidata are scattered and often not well suited to the practical purpose of teaching simple tactics for cleaning up the ontology a bit, and there is nearly nothing elsewhere (e.g. tutorials on YouTube). A simple "best practices page" would be a very good starting point. As a supporting tool I think Wikidata Graph Builder is already very good, but I have one suggestion: the possibility of visualizing in the graph only the items which have/don't have properties X and/or Y and/or Z etc. (I'm thinking e.g. about the identifiers of important thesauruses like Art & Architecture Thesaurus ID (P1014): this would help a lot in spotting holes and oddities when reconciling a thesaurus with Wikidata). This could be done just by adding two more fields, "Having property" and "Not having property", each accepting more than one value if desired (additional note: another user asked for a similar thing for Wikipedia articles; similarly, you could add two more fields, "Having Wikimedia project article" and "Not having Wikimedia project article", each accepting more than one value if desired)
    • 2) I think the idea of phab:T295275 could have some positive effect; although Tagishsimon's comment correctly points out that it will probably not change things on a massive scale, I think it could be useful at least for a few users involved in cleaning up the ontology. I would suggest creating it as a gadget which can be enabled in the Preferences; the community will then have the possibility to discuss whether it deserves to be enabled by default or not
  • in particular, the two topics on which I am most interested and I have worked significantly are unclassified items and duplicate items:
    • 3) unclassified items: they are probably the biggest problem in terms of quantity (more than 2.4 M items presently lack both instance of (P31) and subclass of (P279)) and these items, being much more difficult to find than others, often get duplicated (yesterday I was very much concerned by this case: Q9323411 had 4 duplicates, so there were 5 items, all unclassified, one for each Wikipedia - de, fr, pl, pt, uz; and many other cases are similar ...); my first suggestion for them is just enabling by default some sort of very visible alert (I'm thinking about something similar to the message produced by MediaWiki:Gadget-UnpatrolledEdits.js) asking users to add P31 and/or P279 if they are both missing (the message should contain a link to Help:Basic membership properties, which provides very good guidance)
    • 4) duplicate items: I agree about including them as an ontology issue; they are the mirror image of "Conceptual ambiguity" (in the case of duplication, the same concept is scattered across multiple items; in the case of ambiguity/conflation, multiple concepts are stored in the same item). As I said above, this is deeply connected to unclassified items, so classifying more items will surely help in finding and merging more duplicates. As of now the main tool I miss is mw:MW2SPARQL, which has not been working for at least 3 years (try e.g. https://w.wiki/6rwH, which returns error 500) but was of significant help. A good alternative would be an extension of PetScan, which as of now only searches one Wikimedia category at a time, but could potentially search simultaneously all the Wikimedia categories connected to a given Wikidata item (e.g. the four connected to Q8882780); this would greatly help in spotting duplicates. --Epìdosis 13:26, 22 June 2023 (UTC)[reply]
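For illustration, the unclassified items discussed above can be sampled with a WDQS query along these lines (a sketch, untested against current data; a LIMIT is needed because the full set of over 2.4 million items would time out):

```sparql
# Sample of items lacking both instance of (P31) and subclass of (P279).
SELECT ?item ?sitelinks WHERE {
  ?item wikibase:sitelinks ?sitelinks .           # items with a sitelink count
  FILTER NOT EXISTS { ?item wdt:P31 [] . }        # no "instance of"
  FILTER NOT EXISTS { ?item wdt:P279 [] . }       # no "subclass of"
}
LIMIT 100
```

Sorting by ?sitelinks descending would surface the unclassified items most likely to attract duplicates across Wikipedias.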
Thank you so much! Those are great additional insights. Lydia Pintscher (WMDE) (talk) 18:41, 22 June 2023 (UTC)[reply]
Oh, I underestimated the number of unclassified items. We should indeed do something about it, but better to tackle it together with the more severely assessed problem of overgeneralization. Just adding instance of (P31) entity (Q35120) to every unclassified item will not help. JakobVoss (talk) 20:46, 22 June 2023 (UTC)[reply]
Thanks for the survey and its good summary. My thoughts while reading it were as follows. tl;dr: focus on problems that can be fixed (unclassified items, cycles...). For the rest, aim at better (methods of) documentation instead.
  • Conceptual ambiguity: this seems to be the hardest and most-mentioned problem. Unfortunately it can only be solved to some degree, because many concepts are fuzzy, concepts depend on languages, and being too precise will lead to inconsistencies as well. Wikidata should not aim to fully solve this problem; people must live with it.
  • Inconsistent modelling: can be solved by constraints, data schemas and, most importantly, good documentation and examples. But as with conceptual ambiguity there will never be one and only one consistent model, because every model has a specific viewpoint and limits.
  • Complexity introduced by conflicting real-world models: as mentioned in the summary, general ontological plurality is not a bug but a feature of Wikidata. Solution: better documentation.
  • Unclassified items: a problem that can be addressed by the Wikimedia community very well.
  • Messy upper-level ontology: should be addressed for obvious inconsistencies, but it is not fully solvable without some dictator to decide and enforce it.
  • Mix-up of meta levels: seems to be solvable by the Wikimedia community, but the devil is in the details. I guess in some areas it gets religious, just like arguments about the upper ontology.
  • Semantic drift: the solutions and workarounds seem to misunderstand the core of the problem (transitive application of subclass-of). To some degree the problem is the same as conceptual ambiguity, but it's also this: a subclass-of hierarchy only makes sense for limited sets, not for Wikidata as a whole (see messy upper-level ontology).
  • Exchanged sub-/superclasses: should be cleaned up.
  • Overgeneralization: better than unclassified items, but still a (solvable) data quality issue.
  • Redundant classification and redundant generalization: can be cleaned up to some degree, but sometimes it cannot be avoided because things are changing. It seems to be a minor problem anyway.
  • Cycles: should be cleaned up.
Other issues and problems: I often stumble upon use of subclass of (P279) when instance of (P31) should be used instead.
  • Wikidata’s ontology is not stable: that's a challenge but without changes Wikidata would not be better than ChatGPT with its cut-off date in training data. I'd put this on top of the list with conceptual ambiguity and inconsistent modelling
  • Item Completeness: an interesting topic, scoring items might help. I bet there is already a tool to do so.
  • There are not a lot of resources on best practices for data modelling on Wikidata and The whole Wikidata ontology cannot be viewed: we need better ways to document and illustrate the current state of modelling in Wikidata. Maybe some semi-automatic aggregation of subsets of Wikidata with intellectual annotation? Neither manually written texts nor purely statistical overviews are suitable.
JakobVoss (talk) 20:26, 22 June 2023 (UTC)[reply]
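The subclass-of cycles mentioned in the list above can in principle be found with a property-path query like the following (a sketch, untested; on the full graph this is likely to time out, so restricting it to a subtree of interest may be necessary):

```sparql
# Classes that reach themselves again via one or more
# subclass of (P279) steps, i.e. members of a cycle.
SELECT DISTINCT ?class WHERE {
  ?class wdt:P279+ ?class .
}
LIMIT 100
```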
Interesting that you find items that are defined as subclasses but ought to be instances. I find the reverse a lot. - PKM (talk) 21:19, 22 June 2023 (UTC)[reply]
I agree that reducing conceptual ambiguity and inconsistent modelling (which I view as being closely related) is difficult. But having better descriptions of concepts would help a lot, particularly if there was some structure to the descriptions, such as exemplars and counter-examples and disjointness information. Peter F. Patel-Schneider (talk) 17:02, 4 July 2023 (UTC)[reply]
I ran into a couple of interesting cases of conceptual ambiguity yesterday while working with unmatched items in the UNESCO Thesaurus. newspaper (Q11032) was subclass of both "publication" and "organization" (and a bunch of other things), so every instance of a newspaper was ambiguous. I created newspaper press (Q119830904) for the organization.
Also, dropping out (Q780562) (process) was conflated with "dropout" (person). I created drop-outs (Q119831851) but I still need to clean up the external identifiers, most of which are about students not the process.
This suggests some possible tools to clean up conceptual ambiguity: items that are both publications/organizations; processes/products; buildings/organizations (a well known issue), and so on.
Also it would be cool to have a tool for splitting an item that is more sophisticated than just duplicating, because once an item is split it is necessary to look at every external identifier and sitelink to assess where it belongs. I frequently get lost cleaning up the external identifiers (did I check this one yet?). If the tool would prompt you to assign each statement to the right item (with a link to read it, or its text description from the source), it would be helpful for me.
And one more thought: anything with a constraint violation of "single value" on more than two external identifiers is a good candidate for close examination for conflation of concepts (though some are perfectly valid when our ontology is less granular than someone else's). PKM (talk) 21:51, 22 June 2023 (UTC)[reply]
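The heuristic in the last paragraph could be approximated in SPARQL (a rough, untested sketch that would need narrowing to run within WDQS limits; it assumes Q19474404 is the single-value constraint item and P2302 the property-constraint property):

```sparql
# Items where more than two external-identifier properties that carry a
# single-value constraint (Q19474404) nevertheless have multiple values:
# candidates for conflated concepts.
SELECT ?item (COUNT(DISTINCT ?prop) AS ?props) WHERE {
  ?prop wikibase:propertyType wikibase:ExternalId ;
        p:P2302/ps:P2302 wd:Q19474404 ;   # has a single-value constraint
        wikibase:directClaim ?claim .
  ?item ?claim ?v1, ?v2 .
  FILTER (?v1 != ?v2)                     # at least two distinct values
}
GROUP BY ?item
HAVING (COUNT(DISTINCT ?prop) > 2)
LIMIT 50
```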
I also note that I keep finding external datasets that mark "exact matches" to items in other datasets mixing classes as we think of them (often groups of people matched to conditions or processes), and datasets where the "broader" concept is a different class by Wikidata standards. It's never going to be perfect. PKM (talk) 22:08, 22 June 2023 (UTC)[reply]
When it comes to conceptual ambiguity we could have a tool that makes it easy to split one item that has multiple incompatible P31 claims into multiple items. Maybe there could even be a button that automatically appears in the Wikidata UI on those items. ChristianKl 13:44, 27 June 2023 (UTC)[reply]
Yes, please! PKM (talk) 19:59, 27 June 2023 (UTC)[reply]
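A query along these lines could feed such a tool by listing items with incompatible classifications (a sketch, untested; the publication/organization pair is just one example of classes that should be disjoint):

```sparql
# Items classified both as some kind of publication (Q732577)
# and as some kind of organization (Q43229): likely conflations.
SELECT DISTINCT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q732577 .   # instance of a publication class
  ?item wdt:P31/wdt:P279* wd:Q43229 .    # instance of an organization class
}
LIMIT 100
```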
I think framing the issue as "we lack pages that document best practices" and asking whether a "best practice page" would help misunderstands the problem. For something to be a best practice, there needs to be a decision to make it a best practice. That usually needs either a conversation or a way for individual users to feel empowered to decide on a best practice.
Focusing on making best practices easy to consume, instead of making them easy to agree on, is unlikely to be very useful.
I think model item (P5869) is good because the bar for someone to feel empowered to add it should be lower than the bar for someone to write a best practice page. It's also very nice because it's easy to discover. ChristianKl 14:31, 23 June 2023 (UTC)[reply]
Is it possible to get a copy of the survey? I seem to remember that there was more information there (particularly more examples) than are in the report. Peter F. Patel-Schneider (talk) 16:27, 4 July 2023 (UTC)[reply]


Overview of potential solutions[edit]

Hi everyone,

Based on the many discussions we've had over the past years and now as part of this survey, I've put together an overview of the different paths I can see that could significantly improve the situation for some of the ontology issues identified. I've focused on solutions that could have a more systemic and wide-ranging effect.

I'd love to hear your thoughts about it. Are there important pieces missing? Is there something you disagree with? Generally sounds sensible?


The potential solutions considered are:
  • improve EntitySchemas
  • make constraint violations queryable (phab:T204024)
  • increase the visibility of WikiProjects (phab:T329284)
  • best practices, showcase Items and other documentation improvements, incl. making existing documentation more discoverable
  • introduce automated list generation (phab:T67626)
  • increase the visibility and adoption of sitelinks to redirects
  • make suggestions from the Property Suggester more visible
  • introduce a separate section in the UI for ontology-related Properties, incl. linking to documentation about the implications of changing them (phab:T295275)
  • make it easier to see the bigger picture of the ontology when making very local edits
  • make it easier to split an Item

How these solutions could help, by issue type:

Initially identified types of ontology issues:
  • conceptual ambiguity: more widespread checking and finding of Items that violate EntitySchemas due to conflation of concepts; more easily find Items that violate a constraint due to conflation of concepts; better documentation and examples for notorious cases of conflated concepts can help people understand how to do it correctly; integrating query results more in the other Wikimedia projects will lead to more cleanup of important conflated Items; sitelinks to redirects should decrease the conflation caused by overlapping concepts within one Wikipedia article; easier splitting should lead to less effort when untangling a conflated Item
  • inconsistent modeling: EntitySchemas should become a place for discussion and agreement on modeling and then serve as a tool for checking existing data against those agreements; more easily find Items that violate a constraint due to wrong use of a Property; make it easier to find good examples and modeling recommendations for a particular area of interest; better documentation and examples can help people model Items more according to agreed-upon or lived standards; integrating query results more in the other Wikimedia projects will force more consistent modeling; the Property Suggester makes it easier to do the right thing by picking what it recommends based on common usage in other similar Items
  • conflicting real-world models: document better for reusers how to deal with conflicting models expressed in the data. Comment: we consider it a strength of Wikidata that it is able to deal with different models of the world, so we should focus on making it easier to deal with it.
  • unclassified Items: some of the involved Items will trigger constraint violations due to missing "instance of" and "subclass of" statements, which will then be easier to find and fix; missing Items in query results will lead to people searching for them and classifying them so they show up in query results on Wikipedia; lead editors to at least add a classifying statement to an Item; make the importance of adding a classifying statement clearer
  • messy upper ontology
  • mix-up of meta levels: more agreed-upon and automatically tested EntitySchemas will help find deviations; some of the Items involved in this will trigger constraint violations, which should then be easier to find and clean up; integrating query results more in the other Wikimedia projects will lead to more cleanup; let people see how the Item they are looking at fits into the bigger picture
  • semantic drift: more agreed-upon and automatically tested EntitySchemas will help find deviations; some of the Items involved in this will trigger constraint violations, which should then be easier to find and clean up; integrating query results more in the other Wikimedia projects will lead to more cleanup; splitting up Items can help delineate the concepts they represent more clearly and thereby lead to less semantic drift; let people see how the Item they are looking at fits into the bigger picture; easier splitting will make it easier to untangle the conflation of aspects of an Item that cause semantic drift
  • exchanged sub-/superclasses: more agreed-upon and automatically tested EntitySchemas will help find deviations; some of the Items involved in this will trigger constraint violations, which should then be easier to find and clean up; integrating query results more in the other Wikimedia projects will lead to more cleanup; let people see how the Item they are looking at fits into the bigger picture
  • overgeneralisation: integrating query results more in the other Wikimedia projects will lead to more cleanup; let people see how the Item they are looking at fits into the bigger picture
  • redundant classification: more agreed-upon and automatically tested EntitySchemas will help find deviations; some of the Items involved in this will trigger constraint violations, which should then be easier to find and clean up; integrating query results more in the other Wikimedia projects will lead to more cleanup; let people see how the Item they are looking at fits into the bigger picture
  • redundant generalisation: more agreed-upon and automatically tested EntitySchemas will help find deviations; some of the Items involved in this will trigger constraint violations, which should then be easier to find and clean up; integrating query results more in the other Wikimedia projects will lead to more cleanup; let people see how the Item they are looking at fits into the bigger picture
  • "subclass of" cycles: integrating query results more in the other Wikimedia projects will lead to more cleanup; let people see how the Item they are looking at fits into the bigger picture

Newly identified types of ontology issues:
  • duplicate Items for the same entity: duplicate Items should trigger constraint violations, and making them easier to find will lead to more of them being fixed and the Items merged; duplicate Items showing up in query results in articles will be noticed more often and then merged to fix the duplication in the query result; encourage people to add more identifier statements, which would then lead to more constraint violations being triggered for duplicate Items
  • mix-up of "instance of" and "subclass of": some of the mix-ups will violate EntitySchema definitions; some of the mix-ups will trigger constraint violations, which should then be easier to find and clean up; help people understand the meaning of these Properties and when to use which, with more examples and more accessible documentation; wrong use of "instance of" and "subclass of" will lead to queries not returning expected results and triggering cleanups; let people see how the Item they are looking at fits into the bigger picture

Other issues that were brought up but are not strictly ontology issue types:
  • close/related Wikipedia articles matching different Wikidata Items, or a Wikipedia article matching several Wikidata Items: documentation about how to handle these cases can be improved and made more accessible; more use of sitelinks to redirects can help untangle some of the cases
  • the right Properties to use are difficult to identify without domain expertise or examples to copy from: make it easier to find people who have domain expertise and can help clarify modeling questions; better and more examples to copy from help people do the right thing even without deep domain expertise; more visible suggestions can nudge people towards making the right modeling choice when they are unsure. Comment: this is one of the underlying reasons for inconsistent modeling.
  • the ontology is not stable: more agreed-upon and automatically tested EntitySchemas will help find deviations; raising awareness about where and when stability in the ontology is desired and why can help people make more conscious choices about their changes to the ontology; force more stability through reuse on Wikipedia; let people more easily see the larger consequences of their local edits and then maybe reconsider the change. Comment: we will not reach the point where Wikidata's ontology is entirely static, and we don't consider this a desirable end state. Instead we focus on making unnecessary instability less likely.

Lydia Pintscher (WMDE) (talk) 13:30, 4 July 2023 (UTC)[reply]

Discussion on potential solutions[edit]

Thanks for putting this together. My first impressions:
  • Given I'm a fairly active user and have never once interacted with an "EntitySchema" I think "improve EntitySchemas" probably makes sense. Make them more visible? Easier to use? Better docs? Honestly I'm not even sure what using them buys me today. Maybe that's partly my fault.
  • I'm generally skeptical of "make constraint violations queryable". Is the problem that we can't query them? Or that we don't have enough people who care about them? It might help if this solution could do things like tell a user (or tag in recent changes) that an edit has generated tons of constraint violations in the item or elsewhere. I imagine doing that is computationally infeasible though (this is similar to "make it easier to see the bigger picture of the ontology").
  • I'm also a little unsure about "increase the visibility of WikiProjects". To me there are three general classes of modeling problems: 1) consistently modeled concepts 2) inconsistently modeled concepts 3) totally unmodeled concepts. So if someone is making an item for a person or a film or scholarly work then Wikidata has very extensive examples of how to model this. Most items in the corpus represent these kinds of concepts. The next class is things the corpus has a lot of but are not consistently modeled. For example books (vs. literary works) or genes or chemical compounds (whose ontology is kinda a mess in my opinion). This is the area where Wikiprojects could help. And the final category are totally off the wall things that are hard to model like fear (Q44619) or touchdown celebration (Q7828686) (which looks well modeled but isn't). Different solutions are needed for each of these categories of item. The first category of things is actually well served by (a project of mine) User:BrokenSegue/Psychiq which can also be used to find errors in the ontology User:BrokenSegue/PsychiqConflicts (I'm generally bullish on ML based approaches). Inconsistently modeled concepts are the ones where we really need better Wikiproject coverage. Totally unmodeled concepts are too disparate to really solve with WikiProjects.
  • I am always in favor of "best practices, showcase Items and other documentation improvements incl. making existing documentation more discoverable".
  • Generally I'm in favor of anything that encourages use of Wikidata in other Wikimedia projects like "introduce automated list generation". I recently worked on an effort to get youtube sub counts displayed on enwiki and there is a lot of resistance to Wikidata integration into Wikipedia even when there is a clear win. Part of the problem is lack of Wikipedia familiarity with Wikidata. Part of the problem is the user interface of Wikidata. I'm worried we will build out all this tooling to make automated lists and then Wikipedia will reject it because they can't figure out how to maintain it or use it. But if we could get Wikidata better adopted in Wikipedia then that would improve Wikidata quality. An example UI element I can imagine people asking for is "how do I (as someone who has never used wikidata) add something to this autogenerated list on Wikipedia". The answer cannot be "just go to wikidata and click create item and then magically know how to structure it". Wikipedians will rightfully point out that using Wikidata makes contributing to this autogenerated list much harder for the average person.
  • I don't think I understand "make suggestions from the Property Suggester more visible". More visible how? I'm imagining a totally new interface for making items where, based on the kind of thing you are adding, it walks you through a wizard that has you populate common fields and alerts you to fix constraint errors. I think there are some open source tools to this end. Could we make it easy for Wikiprojects to maintain/update such wizards?
  • "introduce a separate section in the UI for ontology-related Properties incl. linking to documentation about implications of changing them" seems like a good idea.
  • "make it easier to see the bigger picture of the ontology when making very local edits" is a good idea but lacks some detail about how it will be achieved.
  • "make it easier to split an Item": personally I haven't run into this very often, but maybe that's because I don't often bother trying to untangle an item. Untangling items is incredibly time-consuming if there are lots of inbound statements.
Generally I think it's good to think about the users these are meant to serve. Who are the users we are imagining? There are experts that import thousands of items. There are Wikipedians who are only begrudgingly adding sitelinks (and don't have time/patience to add a P31). There are users making items for things in their community (a local business) in the hopes it gets a knowledge panel. There are people dedicated to modeling all the concepts in some area, e.g. a TV show (e.g. Mickey Mouse) or subfield (genes/astronomy). Then there are users doing across-the-board work trying to improve the ontology (e.g. looking for constraint violations / finding poorly modeled concepts / etc).
Hope this helps. BrokenSegue (talk) 01:16, 5 July 2023 (UTC)[reply]
"there is a lot of resistance to Wikidata integration into Wikipedia even when there is a clear win" - a large part of the problem is that things fed directly from Wikidata break all the time on their own. Regular text will not do this. For my use case I was invested/interested enough to maintain User:Mateusz Konieczny/failing testcases, and thanks to people more invested in Wikidata who fixed 200+ reported issues it makes sense to maintain it. But for someone simply wanting a list such an investment is absurd.
Maybe a tool that uses Wikidata to detect missing list elements and notifies people would make more sense? That way people would be more insulated from gems such as "secondary education (Q14623204) is a process that takes place without human involvement" (fixed) or Andorra (Q228) classified as goods and services (unfixed). Mateusz Konieczny (talk) 06:32, 5 July 2023 (UTC)[reply]
On "make constraint violations queryable" - I use the reports under Wikidata:Database reports/Constraint violations quite a lot, but they are (A) often several days out of date, and (B) do not handle deprecated rank statements correctly (deprecated statements should be ignored in assessing constraints). So if this would fix those issues I think it would be a big win. ArthurPSmith (talk) 15:07, 7 July 2023 (UTC)[reply]
I was in the same boat with very little idea about how to make use of an EntitySchema.
@Salgo60 has been pushing it in our chat as a possible solution so I started looking for working validators recently.
I have used the entityshape userscript for some time but I wanted a way to easily batch check all items belonging to e.g. Wikidata:WikiProject Hiking trails. I found no working solution so I forked entityshape and after converting it into a reusable python module I got batch validation to work in a notebook in PAWS. See https://github.com/dpriskorn/entityshape for details. So9q (talk) 22:10, 23 July 2023 (UTC)[reply]
My usage of the module @So9q created: "RiksdagensCorpusSchema_List"
  • interestingly, this data in Wikidata ("Swedish PM members since 1885") is the best data available, so people doing research have started using this data from Wikidata; they are very active on GitHub and validate Wikidata when doing a push release
- Salgo60 (talk) 23:31, 23 July 2023 (UTC)[reply]
I'm working on improving watercraft (Q1229765). There are a lot of ontology violations, like something being both a ship (Q11446) and a ship type (Q2235308). It would have been very useful to have a constraint that prohibits this (or at least produces a warning for potential violations) and have the ability to produce a report of violations. Even better would be something on the talk page for ship (Q11446) that shows the number of violations and links to a list of the actual violations. What I had to do instead is to write query service queries but this is annoying both because the query has to be written and because there is no way to easily see all the violations in one place. Peter F. Patel-Schneider (talk) 22:40, 9 August 2023 (UTC)[reply]
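The kind of hand-written query described above looks roughly like this (a sketch of the pattern, not necessarily the exact query used; the real cleanup work also involves subclass relations, which this ignores):

```sparql
# Items claimed to be both a ship (Q11446) and a ship type (Q2235308):
# a mix-up of meta levels, since one is an object and the other a class of objects.
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q11446 ;
        wdt:P31 wd:Q2235308 .
}
```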
Thanks for your input, BrokenSegue! I'll try to reply to a few of your points:
  • EntitySchemas: They are indeed still way too obscure and not integrated in key areas. That's something we are beginning to change now with the current development work on them. But that's not gonna be enough. There will have to be more work in the future beyond that. We will also do sessions as part of the upcoming Data Modelling Days that dive deeper into what they are and how they can help. For starters, Wikidata:Schemas is a good overview if you are interested.
  • Constraint violations: I think the problem is that we don't have a good way to create and share a specific list of constraint violations. Those could form the basis for task lists for editathons and more. Right now you either have to go Item by Item and see if you randomly stumble upon one, or you have to recreate the intent of the constraint in SPARQL and write your own query, or you have to rely on generic lists of, for example, all constraint violations on Property PXXX. None of those are great and/or discoverable for people. Making them queryable can, I believe, expose them in a lot more places and tools, which could lead to more people being empowered to do something about them and interested in doing it.
  • WikiProjects: You're right that that will not solve all problems but I think there is a lot of potential there for the case you mention where it does help. This also came up repeatedly during past events like the Data Quality Days. Additionally I think this has the chance of giving more people an understanding that there are more people in Wikidata who are interested in similar topics as they are instead of it being a large anonymous blob of a community ;-)
  • More use on Wikipedia: You're right that it can't end at automated list generation. There is more work to do on tools like the Wikidata Bridge as well as general improvements on Wikidata to make it feel less alien for editors from Wikipedia. I hope that the new part of the dev team tasked with improving integration between Wikidata and the other Wikimedia projects will help here. Additionally we are looking into what we can do to improve the Item UI. That's very early stages but what you bring up is very much on my mind for this.
  • Make Property suggestions more visible: I am imagining for example the additional Properties not only popping up when you already are in the process of adding a new statement but instead slightly before that. I can imagine something like "Additional data that could be relevant on this Item: X, Y, Z". But doing that properly needs more thought. Your ideas about making it easy to build input wizards: That is one of the things I want us to be able to do based on the work we're putting into EntitySchemas. They can already serve as the source for a Cradle form for example: https://cradle.toolforge.org/?#/shex/e10
And you are of course right about the groups these solutions would serve. I have these in mind but didn't add that in the table. Maybe that would have helped.
Thanks a lot for your input! --Lydia Pintscher (WMDE) (talk) 09:28, 16 October 2023 (UTC)
I think you insufficiently consider the potential damage caused by splitting discussions by having some of them in EntitySchemas and others in the existing ways that Wikidata decisions are made. It fits the classic xkcd comic about standards.
The current page for creating new entity schema suggests: "Try and involve as many people as possible in the discussion, especially those with specific knowledge about the subject area. A good place to start is posting messages to relevant Wiki Projects on Wikidata and Wikipedia with a request for feedback (make sure you include a link to this page). You can also post messages on Wikidata:Project chat for general feedback and help finding other editors with relevant subject knowledge."
The key problem with that approach is that many of our most active users have so many items on their watchlists that there's a good chance they won't see a lot of the updates in WikiProjects through the watchlist. Having a way to notify people who signed up to a WikiProject, even if it's >50 people, would be central to actually getting attention for the discussions that need to happen. ChristianKl 15:42, 5 July 2023 (UTC)
That's a fair point. I assume you're referring to the issue where we can't ping large WikiProjects because of the ping limit. I don't have a good solution for that that doesn't open itself to abuse, but I'll think more about it. Ideas very welcome. Lydia Pintscher (WMDE) (talk) 09:29, 16 October 2023 (UTC)

I finally had the time to read the table and the very interesting comments below carefully! So, first of all, thanks for the effort of systematization, which is surely a good start IMHO. Here are my thoughts (first on all the columns, then some sparse ones; sorry for the length):

  • EntitySchemas: unfortunately I completely share BrokenSegue's experience, viz. I have never interacted with them; I mostly think that constraints, if well used, could be sufficient in many cases, but I'm probably wrong; in fact, I still don't fully understand their usefulness in daily work on items; but probably, if they are improved, they could become an effective tool. Moreover, since EntitySchemas are (or, more realistically, should/may be) a relevant tool for the establishment of coherent data models, I read again what I wrote on the subject exactly one year ago in Wikidata:Events/Data Quality Days 2022/Modeling data#Detect, decide, enforce; I have two more reflections, in the next subpoints
    • after conflicting data models are detected, the community should decide which one is to become the standard (and so which others should be replaced with the standard one); I very much agree with the issue raised above by ChristianKl: discussions on data models (and also other thematic discussions) are scattered across too many places, i.e. the Project chat (by far the most dispersive place), WikiProject talks (probably the best place), property talks and, maybe in the near future, EntitySchema talks. Apart from the fact that these discussions often have difficulty reaching an agreement, it is firstly crucial that all interested users participate in them, so the discussions need to be noticed; especially in the case of big WikiProjects (affected by the impossibility of pinging 50+ users through {{Ping project}}) we should make sure that discussions happen only in WikiProject talks, otherwise they will surely be missed by someone
    • after a data model is chosen as standard, it should be enforced; our main enforcement method is currently {{Autofix}}, which has many significant issues and limitations (I have just written about this in Wikidata talk:Events/Data Quality Days 2022/Modeling data and filed phab:T341405); this problem should IMHO be prioritized, because people (me included) often lose interest in trying to establish coherent data models when the methods for enforcing them are not fully adequate
  • querying constraint violations: again I'm somewhat skeptical, like BrokenSegue; constraints already generate (through {{Property documentation}}, added on all property talk pages) queries that could be used to spot violations, so I think that improving on this should not be a priority; of course the biggest issue is users not caring enough about them; I also agree with BrokenSegue that notifying users in some way that they "created" a new constraint violation could be more interesting, if feasible
  • visibility of WikiProjects: I agree, WikiProjects should surely be more visible, both as places where established data models are collected and as places in whose talks new data models are discussed; the concrete idea in the ticket seems very good to me, "let each WikiProject provide a Property ID or a statement that Items in their area must match": so, each WikiProject lists a series of statements which are of its interest, while a series of properties related to a certain WikiProject can just be found automatically by looking for properties linking to that WikiProject through maintained by WikiProject (P6104); then, on the basis of the presence of these statements and/or properties, the software links to the WikiProject(s) interested in the item (the link could be placed near the statement/property motivating it and/or in the item talk; I would say that the link could be generated by a gadget enabled by default, so that users who find these links annoying could easily deactivate them)
  • documentation: as I said above, documentation is always good; we especially need documentation on how to solve easy and not-so-easy problems in the ontology, like confusion between instance of (P31) and subclass of (P279) (a very common issue); I recommend also video tutorials of various lengths (here a long one I recently made)
  • automated lists: I agree with the comment of Gehel in the ticket, viz. this is mostly supported by Wikidata:Listeria; at the same time, it's also true, as stated by Bugreporter, that "Listeria is a 3rd party tool, and we need a feature built-in Wikimedia", and there are also many issues and limitations of Listeria whose solution would be welcome; surely it is true that having a good way to show lists from Wikidata could make it more visible in Wikipedia and thus involve more users in data maintenance, so I think this point is effectively useful, although maybe not a priority due to the existence of Listeria (BTW: as Listeria "is a 3rd party tool, and we need a feature built-in Wikimedia", the same can be said also for autofix - see above - whose issues and limitations are IMHO more concerning than those of Listeria, mainly because autofix edits huge amounts of Wikidata items, whilst Listeria is just used for showing data but not editing them)
  • increase the visibility of sitelinks to redirects: I surely agree; this is an effective way to solve some conflations
  • Property Suggester: again I'm a bit skeptical, like BrokenSegue: either it is completely renewed, or it is difficult to imagine how to make it more visible; however, I would suggest evaluating something similar to User:MichaelSchoenitzer/quickpresets, a gadget which I judge very useful; in particular, I would suggest showing, in unclassified items, the fields instance of (P31) and subclass of (P279) ready to be compiled, possibly accompanied by "some sort of very visible alert" enabled by default (something similar to the message produced by MediaWiki:Gadget-UnpatrolledEdits.js, as I proposed in the first section, point 3), a brief explanation of the difference between the two (since many users confuse them) and a link to Help:Basic membership properties (and maybe Wikidata:WikiProject Ontology) to deepen the topic
  • separate section of ontology-related Properties: sure, I support this, as I proposed in the first section, point 2
  • bigger picture of the ontology: just quoting BrokenSegue, it "is a good idea but lacks some detail about how it will be achieved"
  • splitting items more easily: it would surely be good; when splitting an item it is fundamental to take care of all its parts (labels/descriptions/aliases, statements, identifiers, sitelinks, incoming links); giving users some help, so that they don't forget one of these aspects and thus leave the item partially conflated, would be welcome - anyway, also here some more practical details would be needed
  • for duplicate items, I copy here what I proposed in the first section, point 4, since I don't see it in the table and I didn't get a specific answer: "the main tool I miss is mw:MW2SPARQL, which has not been working for at least 3 years (try e.g. https://w.wiki/6rwH, obtaining error 500) but was of significant help - a good alternative would be an extension of PetScan, which as of now only searches 1 Wikimedia category at a time, but could potentially search simultaneously all the Wikimedia categories connected to a given Wikidata item (e.g. the 4 ones connected to Q8882780) and this would greatly help in spotting duplicates"

Thanks again for this interesting opportunity of reflection and I hope that this feedback could be at least partially useful! --Epìdosis 11:15, 8 July 2023 (UTC)

I agree that we should try to keep all discussions in WikiProjects and create new projects if missing. I created a few myself already for areas of my interest; I encourage others to do the same. So9q (talk) 22:19, 23 July 2023 (UTC)
I like the idea of a gadget helping to highlight relevant WikiProjects :) So9q (talk) 22:30, 23 July 2023 (UTC)
Thanks a lot, Epìdosis! Trying to reply to a few of your points:
  • Autofix: Good point. I should have added that to the overview.
  • EntitySchemas: I covered this point in my above reply to BrokenSegue.
  • Constraint violations: Point taken about the lists on talk pages. But those are so obscure that most people have no idea they exist, and even if they do, the lists are not at all easy to understand or filter down to something that is relevant and manageable. These are the parts where I think making them queryable and more integrated into other tools can really make a difference. That might or might not be enough of course to get more people to help with fixing them. I am hopeful though that it would make that a lot more possible than with what we have right now.
  • Thanks for sharing the video. I'll have a look at it. Looks interesting!
  • Listeria and automated list generation: You're raising a good point that we maybe should dig into more deeply: what is it about Listeria that makes it, for example, not accepted in articles on many larger projects? Can whatever else we build address that? My understanding so far was that one of the big issues people raise against use of Listeria in main-namespace articles is that it would overwrite edits made to the local content in the next bot run.
  • Property suggester: See my comments on it to BrokenSegue please.
  • Seeing the bigger picture in the ontology: Yeah this is still pretty undefined and I have no clear idea how to achieve it. I'll see if we can have a session at the Data Modeling Days to brainstorm a bit together about what could be done.
  • MW2SPARQL: I'm not sure I understand what you did with it to address duplication of Items. Can you elaborate on that maybe?
Thanks again for your input :) Lydia Pintscher (WMDE) (talk) 09:30, 16 October 2023 (UTC)

Thanks everyone for the input and discussion so far. That's really valuable and helpful. I just wanted to say that I'm reading it all but it'll take me a bit to think it all through and get back to all the comments and suggestions because of travel. Please keep it coming :) Lydia Pintscher (WMDE) (talk) 12:22, 9 July 2023 (UTC)

A symptom fix for progress

Sorry, no useful comments; it is too hard to understand all the issues. However, from a practical point of view, I would be happy to see an ontology fix that makes a query like https://w.wiki/6xcL (scientists born in July) not list theologians, which presumably happens because of some entanglement of "science" (or "natural science") with the German idea of "Wissenschaft". Shyamal (talk) 13:51, 4 July 2023 (UTC)

Re: Make it easier to see the bigger picture of the ontology

... and also related to best practices/documentation:

I'd like to see us encourage/advertise use of the tools in the "item documentation" template. (Perhaps a video or tutorial on what the various options are and how they can be used?)

And while I'm at it, it drives me NUTS that the template is often hidden on a red-linked talk page. Putting useful information behind a red link completely breaks one of the basics of our UX, that red links go to empty pages. (I suspect this was an unintended consequence of automating the Items Documentation template for all items, which was a terrific idea.) I'd like to have the automatic item documentation template on its own subpage - maybe an "About this item" or "Item stats" link to the right of "Discussion", or in the Tools menu - I'm sure the UX team could come up with a good solution. (Is a Phabricator ticket to improve the visibility of the item documentation template appropriate?) PKM (talk) 21:31, 5 July 2023 (UTC)

I think it is totally appropriate for a ticket :)
I'm wondering if this should be thought of as part of the whole discussion of making it easier to see the big picture when making edits. Lydia Pintscher (WMDE) (talk) 09:31, 16 October 2023 (UTC)

concerns about the questionnaire

I took another look at the questionnaire and I don't think that it supports the conclusions being drawn. The reason is that the examples in the questionnaire have much more complexity than the questions therein. This leads me to conclude that the answers in the questionnaire may have more to do with the examples than with the questions.

I'm going to illustrate my concerns with the inconsistent modeling question, which uses color (Q1075) in its example.

Information about colors can be set up in many ways. Some of these are regimented. For example, there are sets of named colors, like the Pantone or HTML colors. There are also spaces of colors, like the RGB colors. Wikidata instead appropriately uses a non-regimented set of colors, mostly descriptive colors like mauve (Q604079), violet (Q428124), purple (Q3257809), and white (Q23444). There are then relationships between descriptive colors; for example mauve, which has "Pale purple colour" as its English description, is a specialization of purple, which has "range of colors with the hues between blue and red" as its English description, and also a specialization of violet. It is then reasonable to make descriptive colors be classes and the specialization relationship the subclass relationship. (What the actual instances are here is somewhat irrelevant. They could be specific colors or even colored objects.)

Under this setup there is nothing wrong with mauve being an instance of color and a subclass of one of its instances. So answers to this question should be viewed as not necessarily being about inconsistent modeling, which is not present in the example.
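For concreteness, the setup described above can be checked against the live data with a small query; this is only a sketch, and whether it returns true depends on the statements currently on mauve (Q604079):

```sparql
# Does mauve carry both of the statements described above?
ASK {
  wd:Q604079 wdt:P31  wd:Q1075 .     # mauve instance of color
  wd:Q604079 wdt:P279 wd:Q3257809 .  # mauve subclass of purple
}
```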

Of course, it would be possible for there to be inconsistent modeling related to colors. It is possible to model colors using a specific specialization relationship between colors. If both this modeling and the modeling above were to be present in Wikidata then mauve subclass of violet and mauve specialization of purple would be evidence of inconsistent modeling.

But there is indeed a problem with color. There is no information that says how colors are to be modeled in Wikidata. There should be information on how to model (and how not to model) colors, for example that using subclass between descriptive colors is fine, or that descriptive colors should not be given RGB values as values for "also known as".

Peter F. Patel-Schneider (talk) 13:30, 20 July 2023 (UTC)

instance of (P31) / subclass of (P279) confusion

Frequently I saw cases where items which are defined as instance of (P31) something are themselves subclassed (i.e. other items have a P279 relation to the first item), which does not make much sense in an ontology. Even more frequently you see instance of pointing to an item which is itself an instance of something. So I propose to clearly redefine instance of and subclass of and clean the data. I tried once to create a huge graph using only the subclass and superclass relations and ended up with many little graphs. BTW: the label of P31 says "that class of which this subject is a particular example and member; different from P279 (subclass of); for example: K2 is an instance of mountain; volcano is a subclass of mountain (and an instance of volcanic landform)". I do not agree with the last bit of this definition (and an instance of volcanic landform). I think an instance cannot be a subclass and vice versa. Ioan (talk) 15:08, 20 July 2023 (UTC)

Instances can definitely be subclasses. There are lots of examples of this. See for example minesweeper (Q202527). BrokenSegue (talk) 16:27, 20 July 2023 (UTC)
That is the problem I was talking about: minesweeper should not be an instance, but just a class (i.e. a subclass of a more general class of ships), since minesweeper itself has instances (i.e. boats which are minesweepers, like USS Serenee Q168200). In short, something which is an instance cannot have instances.
The additional problem is that minesweeper has subclasses, but instances cannot have subclasses.
So strictly speaking a Wikidata item should not have instance of (P31) and subclass of (P279) at the same time, and an item which has instance of must not have subclasses (i.e. no other item should link with subclass of to this item).
I agree it is difficult to change that, but otherwise it is difficult to exploit these relations. Ioan (talk) 13:12, 22 August 2023 (UTC)
This is a common organization of individuals and groupings of individuals. There are individuals, like K2 (Q43512); classes of individuals, like mountain (Q8502) and volcano (Q8072); and classes of classes of individuals, like volcanic landform (Q29025902). Any of the classes can participate in subclass of (P279) relationships. Watercraft are arranged similarly, with Titanic (Q25173) instance of (P31) four funnel liner (Q3362987) instance of (P31) ship type (Q2235308).
There should be better support in Wikidata for this kind of organization, though. Peter F. Patel-Schneider (talk) 22:29, 10 August 2023 (UTC)
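For reference, the pattern this thread is about, items that both have instance of (P31) and are the target of a subclass of (P279) statement, can be listed with a query along these lines; this is only a sketch, it returns many results, and as the discussion above shows there is disagreement on whether such items are errors or legitimate:

```sparql
# Sketch: items that have instance of (P31) and are themselves
# subclassed by some other item (the target of a P279 statement).
SELECT DISTINCT ?item WHERE {
  ?item wdt:P31 ?someClass .   # the item is an instance of something
  ?sub  wdt:P279 ?item .       # and some other item subclasses it
}
LIMIT 100
```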

messy top-level ontology

I was independently looking at the top level of the Wikidata ontology (class, metaclass, etc.) and decided to try to improve it. I put some details at https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Ontology#plurality_and_description_of_metaclasses - discussion is welcome. I also noticed that many of the instances of second-order class (Q24017414), third-order class (Q24017465), and fourth-order class (Q24027474), are incorrectly classified. I have done some fixups and plan to look at most of the instances. This doesn't address the rest of the upper-level ontology but might be a good place to start. Peter F. Patel-Schneider (talk) 22:33, 9 August 2023 (UTC)

I created a table with information about the top level of the Wikidata ontology and some of my comments and suggestions at User:Peter F. Patel-Schneider/metaclass table. This shows in one table what the top-level classes are, how they interact, and how many instances and subclasses they have. Peter F. Patel-Schneider (talk) 22:18, 10 August 2023 (UTC)

Mundane ontology problems

I was looking around Wikidata and ran a query to find the instances of instances of profession (Q28640). Based on my understanding of how people are linked to their profession (via occupation (P106)), I was expecting to find very few. But as of 26 August 2023 I found 3652 results. The actual number is a problem in and of itself, but there are also some surprising results.
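The query described above was presumably along these lines (a reconstruction, not necessarily the exact query that was run):

```sparql
# Items that are instances of instances of profession (Q28640).
SELECT DISTINCT ?item WHERE {
  ?profession wdt:P31 wd:Q28640 .  # something classified as a profession
  ?item wdt:P31 ?profession .      # an item that is an instance of it
}
```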

For example, the first query result (in the query results that I saw) was meat scientist (Q6804279), which is an instance of (P31) scientist (Q901) and meat (Q10990). The former is a misunderstanding of the description of scientist (Q901), "person who use scientific methods to study in an area of interest", but the second is just flat out wrong.

What can be done to detect this sort of incorrect information? One thing that could be done is to add disjointness statements in Wikidata and mandatory constraints that prevent edits that violate them. This would be a lot of work but would have a large payoff if these violations could be caught early.

The second query result was hippologist (Q10524313), which is an instance of scientist (Q901) and a subclass of zoologist (Q350979). This appears to be a simple misunderstanding of the status of scientist (Q901), which in Wikidata is a first-order class and not a second-order class. scientist (Q901) is also a subclass descendant of zoologist (Q350979) so it is obvious that something is wrong.

But the issue is how to fix this sort of problem. Creating reports is only useful if there is a (large) group of people who are willing to go through the reports and correct errors, and this does not currently appear to be the case. So instead something should be done when the problem is created.

In general, my suggestion is to detect and block ontology violations at edit time so that there is not so much that needs to be cleaned up later (or never). This probably requires significant extensions to the constraint mechanisms in Wikidata. Peter F. Patel-Schneider (talk) 14:57, 26 August 2023 (UTC)

Classes that do not fit into the correct order (e.g., first-order classes that have classes as instances)

I wrote a Python program to find classes that don't correspond to the fixed-order metaclass they are an instance or subclass of, or that might have problems related to differing orders of their instances. For example, the program finds instances of first-order class (Q21522908) that have classes as instances and are thus not first-order, or subclasses of WUA (Q4017414) that have instances that have instances that have instances that have classes as instances and are thus not third-order. The program considers an item to be a class when it has a superclass, a subclass, or an instance, or is an instance of class (Q16889133). The program found a lot of problems. In the cases that I have checked, the output is correct, but I make no claims about the overall correctness of the program.
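One of the checks described above can be sketched as a single query; as noted later in this section, the single-query form tends to time out, which is why the program iterates over instances instead:

```sparql
# Sketch: direct instances of first-order class (Q21522908) whose own
# instances have instances, making the class not first-order (an item
# counts as a class here by the program's "has an instance" criterion).
# A single query like this tends to time out; the program iterates instead.
SELECT DISTINCT ?class WHERE {
  ?class  wdt:P31 wd:Q21522908 .  # claims to be a first-order class
  ?member wdt:P31 ?class .        # a direct instance of it
  ?x      wdt:P31 ?member .       # which itself has an instance
}
LIMIT 50
```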

In many cases a count of the problematic items and an example are given along with the class and how it is related to the class from the top-level ontology. A low count for a class likely indicates that there is some data error related to the class. A high count for a class may indicate that the class should be linked to a different top-level class (but may also indicate that there are a lot of data errors related to the class).

For example in the section 'Non-first-order-class direct instances of Q21522908 "Wikidata instance class"@en out of 6'

1	Q11229       Q27084       parts-per notation                Q21522908    Wikidata instance class

reports that percent (Q11229) is the sole item that makes parts-per notation (Q27084) not be a first-order class. In the section 'Non-second-order-class indirect instances of Q24017414 "second-order class"@en out of 52'

19	Q54505       Q4162444     branch of physics                 Q109542218   class of parts

reports that quantum field theory (Q54505) is one of 19 items that make branch of physics (Q4162444) not be a second-order class and that branch of physics (Q4162444) is an instance of class of parts (Q109542218), which is a subclass of second-order class (Q24017414). In the section 'Potentially first-order class direct instances of Q24017414 "second-order class"@en out of 414'

Q21937945    Silver Badge of Janek Krasicki

reports that Silver Badge of Janek Krasicki (Q21937945) might be a first-order class (because the program didn't find any instances of it that themselves were classes).

The program doesn't just do single queries to find bad instances or subclasses of a top-level ontology class but in many cases iterates over the instances or subclasses. It uses this complex method because the single query times out. Even then some of the queries time out. Communication issues are retried, but only once. The program runs a lot of queries and puts quite a load on the Wikidata query server, so please don't run it often.

The program is at User:Peter F. Patel-Schneider/fixed-level-program. A recent set of results is at User:Peter F. Patel-Schneider/order problems.

I have fixed a number of problems identified by the program during testing, which thus do not show up in the above results, and a few that do. Quite a few of the problems are caused by an item being both an instance and a subclass of the same other item.

The question is what to do with all the problems my program found. Peter F. Patel-Schneider (talk) 11:29, 28 August 2023 (UTC)

more information on EntitySchemas

There is considerable talk of EntitySchemas on this page, but no links to a place to find out more about them. What are EntitySchemas and how are they used in Wikidata? Peter F. Patel-Schneider (talk) 19:20, 18 September 2023 (UTC)

We do have Wikidata:Schemas, and we will also be talking more about it as part of the upcoming Data Modelling Days. Lydia Pintscher (WMDE) (talk) 09:34, 16 October 2023 (UTC)
Even after attending some of the Data Modelling Days I am confused as to just what schemas are supposed to do for Wikidata in general. Or is it just that schemas are to be used by groups that are working on part of Wikidata to check what they are doing and have no use outside of the group? Peter F. Patel-Schneider (talk) 14:15, 2 December 2023 (UTC)
@Peter F. Patel-Schneider @Lydia Pintscher (WMDE) @AWesterinen The reasons for building and promoting EntitySchemas by Wikimedia Deutschland, as I understand them, are:
  • To make the documentation of current data structures for types of items more accessible, consistent, multilingual, and machine-readable. Currently, most data structures are documented only on WikiProject pages as hard-to-read tables, typically only in English.
  • To provide a way to validate types of items against their data schema. For example, making sure that it has all of the typically-expected properties.
IMO EntitySchemas aren't that useful - as you might suspect.
Basically the only current uses of them right now are:
  • To get a list of items that don't have all their "typical" properties by using https://shex-simple.toolforge.org. That might seem helpful; however, you can do this easily by just writing a SPARQL query to find all items of a class that don't have a particular property. People who do cleanup like @Moebeus still just do this and don't utilize EntitySchemas (educated guess).
  • To check individual items to make sure they have all their "typical properties" using User:Teester/CheckShex.js. This userscript provides a way to see what properties are missing on an item given an entity schema. This isn't really useful, as you have to plug in the EntitySchema ID every time you open a new entity. Constraints and the property recommendation system when adding new statements already kind of fulfill this role, and they instantly provide recommendations as to how to make an entity better.
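The SPARQL alternative mentioned in the first bullet can be sketched like this; human (Q5) and date of birth (P569) are placeholder choices for illustration, not taken from the discussion:

```sparql
# Sketch: items of a class that lack a typically-expected property.
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 .                    # items of the class (example: human)
  FILTER NOT EXISTS { ?item wdt:P569 [] }  # missing the expected property
}
LIMIT 100
```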
I believe EntitySchemas have also failed to provide a good documentation medium for data schemas in various domains. I don't see many WikiProjects linking to them or updating them. They also require knowledge of how to write ShEx, and an editor or simple viewer for ShEx is not currently easily accessible (or none that I know of exists).
Because the uses of EntitySchemas are already covered by the constraint system and querying, I don't see that they have much use. If WMDE went forward with developing a SHACL data constraint and validation system for Wikidata, which is much more flexible and enforceable than ShEx, and that replaced the current constraint system, the data quality of Wikidata would be miles better than it is today. However, I think they didn't go with that since developing a SHACL system is much more challenging than ShEx.
I know WMDE is still "working on" EntitySchemas to improve them, but because EntitySchemas don't act as an enforceable tool to improve data quality, and for the other reasons I mentioned above, I think they will always fall short of being useful and widely used.
These are just my biased thoughts based on my personal lack of use for EntitySchemas and my observation of the community's lack of use of them. Let me know if I got anything wrong. Lectrician1 (talk) 21:33, 8 December 2023 (UTC)

Data Modelling Days, online gathering, November 30 - December 2, 2023

Hello all,

Following the past events dedicated to data quality and data reuse, the Wikidata team wanted to host a new gathering dedicated to data modelling.

The Data Modelling Days will take place online over three days and will host a variety of discussions, workshops and practical sessions on the topics of Wikidata ontologies, EntitySchemas, modelling issues and various other challenges.

The event is open to everyone, regardless of your experience with modelling data on Wikidata. We particularly encourage people who are working on specific topics to join the event and present their modelling challenges.

If you know people or groups who are already discussing modelling issues on Wikidata, or would have something interesting to contribute, please share this message with them!

You can find more information on the dedicated page, where you can sign up and let us know what you are interested in. You can already propose discussions and workshops on the talk page until November 19th.

If you cannot attend, don’t worry, most sessions will be recorded, notes will be taken and slides will be shared.

We are looking forward to seeing you and learning more about your modelling challenges during the Data Modelling Days! If you have any questions, feel free to reach out to me. Best, Lea Lacroix (WMDE) (talk) 14:23, 9 October 2023 (UTC)