User:Lucas Werkmeister/Wikidata Shape Expressions Inference

From Wikidata

Wikidata Shape Expressions Inference is a tool I am developing as part of my master’s thesis which automatically infers a ShEx (Q29377880) schema from a set of user-specified exemplary items. You’re very welcome to try it out, e. g. to see what a data schema for items from your area of expertise might look like, and give me feedback on the talk page or elsewhere! (In fact, it helps me if you try it out even if you don’t give feedback, since I’m also interested in more samples for how long the process takes given different input data sets.)

Toolforge admin notice

If this tool is causing problems, create any file under /data/scratch/wd-shex-infer/. This will block the submission of any new jobs to the grid engine. See the README file in that directory for more information.

Usage

To start a new inference process, write a SPARQL query that will select your exemplary items (e. g. some heads of state, or some cities, or some novels by a certain author, or whatever), then visit toolforge:wd-shex-infer/job/new and enter your query there. The query should select an ?entity variable listing the selected items (other result variables are ignored) and not return more than 50 or so items (add a LIMIT if necessary). You should also give your submission a title briefly describing your dataset, and you may also add a longer description and/or a link to a page with more information (e. g. a subpage of your user page).
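As a rough illustration of the requirements above, here is a minimal sketch of a suitable query, together with a small pre-submission check. The example query (cities, via instance of (P31) and city (Q515)) and the check_query helper are assumptions made up for this illustration; the helper is not part of the tool, and the limit of 50 is only the approximate guideline mentioned above.

```python
import re

# Example query (an assumption for illustration): up to 50 items that are
# an instance of (P31) city (Q515). Only the ?entity variable matters.
EXAMPLE_QUERY = """
SELECT ?entity WHERE {
  ?entity wdt:P31 wd:Q515.
}
LIMIT 50
"""


def check_query(query, max_items=50):
    """Return a list of problems found in the query (empty if none).

    Hypothetical helper, not part of the tool: it only does a rough
    textual check and does not parse SPARQL properly.
    """
    problems = []
    # The text between SELECT and WHERE is the list of result variables.
    match = re.search(r"\bSELECT\b(.*?)\bWHERE\b", query,
                      re.IGNORECASE | re.DOTALL)
    projection = match.group(1) if match else ""
    if not re.search(r"\?entity\b", projection):
        problems.append("query does not select an ?entity variable")
    # Look for an explicit LIMIT and compare it to the guideline.
    limit = re.search(r"\bLIMIT\s+(\d+)", query, re.IGNORECASE)
    if limit is None:
        problems.append("query has no LIMIT")
    elif int(limit.group(1)) > max_items:
        problems.append(f"LIMIT {limit.group(1)} exceeds {max_items}")
    return problems


print(check_query(EXAMPLE_QUERY))
```

The query text inside EXAMPLE_QUERY is what you would paste into the form; the Python around it is only there to make the two requirements (an ?entity variable and a small result set) explicit.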

After you submit the form, you will be redirected to the page for the job you submitted, which will at this point probably say that the job is still running. Refresh the page periodically until the job has finished (this may take over an hour!), at which point you will be able to download the resulting ShEx file.

You can also look at jobs that other users have submitted; a list of them can be found on the index page.

Note that the tool limits the number of jobs that can be running at the same time. If two or more inference processes are already running, you will not be able to submit a new job until one of them has finished. (Running jobs are listed both on the index page and on the “new job” page if there are any.)

Validation

The tool currently doesn’t include any provisions for validating items against the inferred schema. You can copy the ShExC code and try to validate items against it somewhere else, but be warned: the automatically inferred schemas are rather large, and have been known to cause problems with validators (e. g. shexSpec/shex.js#31). In particular, I strongly advise against trying this with the online validator in your default browser. (You can try running it in a separate browser session or a different browser, but beware that it might freeze that browser.)

I’ve tried some strategies of reducing the size of the generated schemas (discarding unimportant parts), with mixed success; perhaps I’ll make them available via the tool in the future.

Apart from that, the inferred schemas can also be useful without validating anything against them: you can browse through them to check if there’s anything odd in them, and try to track down the cause in the input data set when you notice something wrong.

Shape Expressions

Shape Expressions are a language to express schemas for RDF data. In Wikidata, they can be used to describe what statements an item should have and what the values for those statements should look like. For example, consider this excerpt from a schema about RfCs (automatically inferred by the tool in job #20):

wd:Q1328899 { # & wd:Q43229
  wdt:P159 .;
  wdt:P17 .;
  wdt:P355 .+;
  wdt:P373 xsd:string;
  wdt:P488 @wd:Q5;
  wdt:P571 xsd:dateTime;
  wdt:P856 .
}

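This describes a shape, titled wd:Q1328899 (after standards organization (Q1328899)). An item that matches this shape must have:

* exactly one headquarters location (P159) statement, with any value;
* exactly one country (P17) statement, with any value;
* one or more subsidiary (P355) statements, with any values;
* exactly one Commons category (P373) statement, with a string value;
* exactly one chairperson (P488) statement, whose value matches the shape wd:Q5 (human);
* exactly one inception (P571) statement, with a date/time value;
* exactly one official website (P856) statement, with any value.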

Or, less formally: “a standards organization should have a human chairperson; a country, headquarters location, official website, inception date, and Commons category; and at least one subsidiary.”

Such a schema can then be used to validate whether a certain item matches a shape or not. For example, the item Deutsches Institut für Normung (Q152746) currently doesn’t match the shape wd:Q1328899 from above, because it’s missing subsidiary (P355) and chairperson (P488) statements. This can mean that such statements should be added to the item.
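To make the matching rule more concrete, here is a minimal sketch that checks only the cardinalities from the excerpt (a bare constraint means exactly one statement, .+ means one or more). It is not a ShEx validator: it ignores value types and the nested wd:Q5 shape, and the statement counts for the sample item are hypothetical, chosen to mimic an item that lacks subsidiary (P355) and chairperson (P488) statements.

```python
import math

# (min, max) number of statements per property, read off the excerpt.
SHAPE = {
    "P159": (1, 1),         # headquarters location
    "P17": (1, 1),          # country
    "P355": (1, math.inf),  # subsidiary, ".+" = one or more
    "P373": (1, 1),         # Commons category
    "P488": (1, 1),         # chairperson
    "P571": (1, 1),         # inception
    "P856": (1, 1),         # official website
}


def cardinality_violations(statement_counts, shape=SHAPE):
    """Return the properties whose statement count is outside the bounds.

    Properties not mentioned in the shape are ignored.
    """
    violations = []
    for prop, (lo, hi) in shape.items():
        count = statement_counts.get(prop, 0)
        if not lo <= count <= hi:
            violations.append(prop)
    return violations


# Hypothetical statement counts (not real data) for an item that has
# neither subsidiary (P355) nor chairperson (P488) statements.
item = {"P159": 1, "P17": 1, "P373": 1, "P571": 1, "P856": 1}
print(cardinality_violations(item))  # ['P355', 'P488']
```

An empty result only means the counts fit; a real validator would additionally check that, for example, the chairperson value is itself a human.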

For more information, please see WikiProject ShEx and/or the ShEx primer.

How it works

Well, presumably my thesis is the exhaustive answer to the question of “how does this work?” :) But to give credit where credit is due, it’s largely based on the RDF2Graph program, as described in “RDF2Graph a tool to recover, understand and validate the ontology of an RDF resource” (Q27976744). You can find the source code of my fork of RDF2Graph and a utility library it uses on GitHub (RDF2Graph and RDFSimpleCon), and the source code of this tool itself on Phabricator (tool-wd-shex-infer).

Trivia

I usually abbreviate the tool as wdsi, for wd-shex-infer (the identifier of the tool). Some people also abbreviate it as WSEI, presumably for Wikidata Shape Expressions Inference (the name of the tool). I’m sure this will never lead to confusion and there’s no need to settle on one of these ;)