Wikidata:Events/Data Quality Days 2022/EntitySchemas


Introduction to Entity Schemas and Shape Expressions


Kat Thornton, Andra Waagmeester, Eric Prud'hommeaux
👥 Number of participants (including speakers): 23

🖊️ Notes & links

  • Slides: TBD
  • Kat starts with how the idea for a schema came about: they were discussing IEEE standards and decided to build a schema for them. Wikidata has 40 subclasses of IEEE standard, 9 instances of IEEE standard, and 134 items that have a value for P5638 - "A schema would help here indeed!"
  • What is a schema? Basically, a schema is a data model. Entity Schemas have been available since May 2019, and there are now 300+ of them
  • It is possible to have multiple data models per domain - in some cases this is necessary, in others not so much; there, the differences can perhaps be reconciled through consensus into one shared model
  • Schemas are expressed in the ShEx (Shape Expressions) language -- http://shex.io
  • ShEx is good because it allows humans and machines to work together - much as Wikidata does!
  • Back to the example: how should the IEEE schema be put together? Start by going through the existing items involved, then propose a schema for them and discuss it on the schema's talk page
  • Having a schema is desirable, because people who want to contribute data may want a model to follow while contributing, and may not have the time to work out on their own how to do it (or may be frustrated by the absence of instructions and by having to find things on their own)
  • Someone querying for IEEE standards will run into the same mix of subclasses, instances and P5638 items found at the beginning of the example - it is very frustrating to work out which modelling to follow
  • Internal vs. external model:
    • The National Archives of the UK maintains the PRONOM registry, so the people who work with this registry imported its model as a schema (E79) and format the data according to it
    • but sometimes you don't have an external model and need to come up with one on your own - in this case there are several tools, such as sheXer, Shape Designer, Yet Another ShEx Editor (YASHE) and many more, that can help you with that
    • Find the links here: https://www.wikidata.org/wiki/Wikidata:WikiProject_Schemas#Tools
  • Implementations: there are currently 5 outside of Wikidata, and you can use them to build your own ShEx experience
  • Also there are online validators to check if your schema works on Wikidata before committing it to the project
  • There is an interlinked environment of schemas on Wikidata; just check the schema gallery on the WikiProject Schemas main page to see how many we have
    • You might not want to restart from zero when you have to describe something - you can import an existing schema for the basic properties, and then add/modify what you need to further model your particular domain
    • An example is the schema for Danish verb (E56)
  • Schemas can be multilingual: not just the label can be expressed in several languages, but the schema itself is language-agnostic
  • What's next
    • Hopefully there will be more and more editors and schemas out there
    • this ecosystem will support the work of many more editors
    • but we need to encourage awareness about schemas
  • Scholia has several papers about ShEx, check them out! + more stuff to read/watch in the slides
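
The reuse of existing schemas mentioned above (importing a schema for the basic properties, then adding to it) can be sketched in code. This is a minimal Python sketch, not the way Wikidata itself resolves imports: it only assumes that a ShEx `IMPORT` directive points at an IRI ending in an EntitySchema ID. The sample schema text, the ID E999999, the URL pattern used in it and the helper name `find_imported_schema_ids` are all hypothetical and chosen for illustration; they are not taken from E56 or any other real schema.

```python
import re

# Hypothetical schema text for illustration only; E999999 is a made-up ID
# and the IMPORT URL pattern is an assumption.
SAMPLE_SHEX = """\
IMPORT <https://www.wikidata.org/wiki/Special:EntitySchemaText/E999999>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

start = @<#example_item>

<#example_item> {
  wdt:P31 [ wd:Q5 ]
}
"""

# ShEx keywords are case-insensitive, so match IMPORT in any case,
# at the start of a line, followed by an IRI in angle brackets.
IMPORT_PATTERN = re.compile(r"^\s*IMPORT\s*<([^>]+)>", re.IGNORECASE | re.MULTILINE)

# Assumption: the imported IRI ends in an EntitySchema ID such as "E123".
SCHEMA_ID_PATTERN = re.compile(r"/(E\d+)$")


def find_imported_schema_ids(shex_text):
    """Return the EntitySchema IDs named in IMPORT directives, in order."""
    ids = []
    for iri in IMPORT_PATTERN.findall(shex_text):
        match = SCHEMA_ID_PATTERN.search(iri)
        if match:
            ids.append(match.group(1))
    return ids


print(find_imported_schema_ids(SAMPLE_SHEX))
```

Listing the imports of a schema this way is one possible starting point for seeing which existing schemas it builds on before adding domain-specific properties.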


[OVER TO SAYED]


  • Slides: TBD
  • How to use Entity Schemas in evaluating Wikidata references? Quality is a multi-dimensional issue, so it's complex
  • Metric 1: syntactic validity of reference triples - i.e. checking whether they conform to the Referencing ShEx schema, a schema made to check the formal validity of a reference
    • Experiments on various corpora (?, gene, ship, random set of items): results show that 99.95-99.98% of references are valid, but this still means that THOUSANDS of references fail the check, mostly because of blank nodes
  • Metrics 2+3: ShEx in completeness - how many classes and fact types have a schema for references, and how many instances of X have a reference with the property mentioned in the schema?
    • How many schemas are out there and which classes do they cover?
    • For the deeper question, there are challenges: some schemas are empty or invalid, parsing is required, reference shapes might be in different schemas
    • Results are far lower in this case: out of 319 schemas, only 13 have referenced facts (+ more info in the slides)
    • There is a lot more work to do with schemas: reviewing all instances and making sure that they do not fail checks
  • Metric 4: if a fact of type X has a reference using a property Y, how many other type X facts have a reference using property Y? -- In other words: how many similar statements have the same kind of references?
    • Answer: still computing
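
Metric 4 above can be read as a simple share: among all facts of type X, what fraction carry a reference that uses property Y? The following is a minimal Python sketch of that reading, not the speaker's actual implementation. The `Fact` class, the function name `reference_property_share` and the toy data are hypothetical; P248 (stated in) and P854 (reference URL) are used only as familiar examples of reference properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A statement reduced to its type and the properties used in its references."""
    fact_type: str
    reference_properties: frozenset = frozenset()


def reference_property_share(facts, fact_type, ref_property):
    """Share of facts of `fact_type` with a reference using `ref_property`.

    Returns None when there are no facts of that type at all.
    """
    same_type = [f for f in facts if f.fact_type == fact_type]
    if not same_type:
        return None
    with_property = sum(
        1 for f in same_type if ref_property in f.reference_properties
    )
    return with_property / len(same_type)


# Toy data, invented for illustration: four facts of type "X", one of type "Z".
facts = [
    Fact("X", frozenset({"P248"})),
    Fact("X", frozenset({"P248", "P854"})),
    Fact("X", frozenset()),            # an unreferenced fact
    Fact("X", frozenset({"P854"})),
    Fact("Z", frozenset({"P248"})),
]

print(reference_property_share(facts, "X", "P248"))  # 2 of 4 facts -> 0.5
```

A low share for a given (X, Y) pair would suggest that similar statements are referenced inconsistently, which is the situation the metric is meant to surface.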


❓ Questions and discussions

  • What's the exact meaning of the class/property column?
    • The meaning of the class/property column is the class of the entity described by the schema and some of the properties used in the schema. I'm not sure of the details of how that table in the schema gallery is populated.
  • Is importing a schema an extension of the full schema? Or can you use only parts of a schema once imported?
    • Import allows you to import the entire schema
  • Is there any plan to have a GUI to support creating schemas, instead of writing directly the ShEx language? 
    • I'm not yet aware of plans for a GUI to support creating schemas. 
  • Could/should we use Recoin suggestions as a basis for which properties an entity schema should have?
    • Could be a great idea
  • [Ariel] AFAIK, enforcing a schema is still not possible. So what is the plan to make it mandatory eventually? If I'm creating a new item, can I use an existing schema to get guidance?
    • There are a couple of user scripts that have been contributed so far. One of them allows you to type in the ID of a schema and check whether the item complies with it. But there has been no work so far on what you're suggesting. It's probably still too early for schemas to have that, but you definitely have a point there
  • Could we use Cradle forms to generate schemas (since we are using schemas to generate cradle forms)?
    • [Kat] Again, great idea. Still haven't seen any work in that, but I'd support it.
    • [Andra] Also, the Cradle forms could be nice first sketches that can be drafted into ShEx using e.g. YASHE
  • [Lydia] If people are intimidated by ShEx, we can set up a page much like "request a query".
  • [Vigneron] Is there a plan to highlight syntax errors in the ShEx of a schema? (As a newbie, I'm always confused about which part has gone wrong)
    • [Lydia] Not so far. But it's a good idea! We're still in the early stages of implementing schemas on WD, but this might be included in the development
    • [Kat] For the time being, you can use the external validators to catch mistakes in what you're writing.
  • [Ariel] One thing that we're missing is to link the labels 

Schema editing session


Kat Thornton, Andra Waagmeester, Eric Prud'hommeaux
👥 Number of participants (including speakers): 19

🖊️ Notes & links