Wikidata:WikiProject Schemas/Tutorial

Here we will create a simple ShEx schema and apply it to a set of Wikidata items.

The starting point is that the expected type of a Wikidata item can be derived from certain external identifiers attached to it. We take PM20 folder ID (P4293) as an example, which links to en:20th_Century_Press_Archives. The first two characters of the external ID value identify persons (pe), companies and other organizations (co), or other types of folders.

We now want to check that all Wikidata items linking to "co/" ids have some subtype of organization (Q43229) as class.

Identifying the items we want to check with a query

select ?item where {
  ?item wdt:P4293 ?pm20Id .
  filter(regex(?pm20Id, '^co/'))
}
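The query can also be sent to a SPARQL endpoint from a script. The following is a minimal sketch, not part of the original tutorial: it assumes the public endpoint https://query.wikidata.org/sparql (the one used in the manifest examples on this page) and its `query` and `format=json` URL parameters. The helper names `build_query_url` and `item_ids` are hypothetical. The sketch only builds the request URL and parses a result document; it performs no network call itself.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

# Endpoint taken from the manifest examples on this page.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """select ?item where {
  ?item wdt:P4293 ?pm20Id .
  filter(regex(?pm20Id, '^co/'))
}"""


def build_query_url(query: str, limit: Optional[int] = None) -> str:
    """Return a GET URL asking the endpoint for SPARQL JSON results."""
    if limit is not None:
        query = f"{query}\nlimit {limit}"
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


def item_ids(results: Dict) -> List[str]:
    """Extract the Q-ids bound to ?item from a SPARQL JSON results document."""
    return [
        binding["item"]["value"].rsplit("/", 1)[-1]
        for binding in results["results"]["bindings"]
    ]
```

The URL returned by `build_query_url(QUERY, limit=20)` can be fetched with any HTTP client, and the decoded JSON passed to `item_ids`.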

Creating the schema

We start with a very simple schema, which just requires that at least one instance of (P31) property with an IRI value (like a URI, but in Unicode, used to identify resources) is defined for each item:

 PREFIX wdt: <http://www.wikidata.org/prop/direct/>
 
 start = @<wikidata-instanceof>
 
 <wikidata-instanceof> {
   wdt:P31 IRI+;
 }

Since May 2019, Shape Expressions, like items and properties, have their own namespace (https://www.wikidata.org/wiki/EntitySchema:) and, like Q and P, their own prefix E followed by a numerical identifier. Anyone can create one via the Create a new EntitySchema link on Special pages. Label, description and aliases can be provided, and the text of the schema can be entered or edited in a text area.

Having done that for the above schema, we can inspect the result at EntitySchema:E97:

(For additional labels in other languages you can use Set label, description and aliases for EntitySchema on Special pages.)

N.B. E97 only makes sure that an item has at least one class. The constraint that at least one class is an organization (Q43229) or a subclass of (P279) organization (Q43229) requires a more complex recursive definition and is found here and in E98.
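To illustrate why the organization constraint is recursive, here is a minimal Python sketch of the same idea. It is not the E98 schema and not a ShEx validator: it walks a small, hypothetical in-memory table of subclass of (P279) links, with placeholder class names that are not real Wikidata ids, and reports whether any class of an item leads to organization (Q43229).

```python
from typing import Dict, Iterable, Optional, Set

ORGANIZATION = "Q43229"

# Hypothetical toy data: class -> direct superclasses (P279).
# The keys are placeholders for illustration, not real Wikidata items.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "Q_business": {ORGANIZATION},
    "Q_bank": {"Q_business"},
    "Q_human": set(),
}


def is_organization_class(
    cls: str, subclass_of: Dict[str, Set[str]], seen: Optional[Set[str]] = None
) -> bool:
    """True if cls is organization or reaches it via subclass links."""
    if cls == ORGANIZATION:
        return True
    seen = set() if seen is None else seen
    if cls in seen:  # guard against cycles in the class hierarchy
        return False
    seen.add(cls)
    return any(
        is_organization_class(parent, subclass_of, seen)
        for parent in subclass_of.get(cls, ())
    )


def conforms(instance_of: Iterable[str], subclass_of: Dict[str, Set[str]]) -> bool:
    """True if at least one P31 value is an organization class."""
    return any(is_organization_class(c, subclass_of) for c in instance_of)
```

With the toy table, `conforms(["Q_bank"], SUBCLASS_OF)` succeeds through two subclass steps, while an item whose only class is `Q_human`, or an item without any class, does not conform.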

Syntax

Schemas are written in ShEx (see the specification), more precisely in the ShEx compact syntax (ShExC). A relatively beginner-friendly introduction to the syntax can be found in this primer.

At the beginning, you need to declare the prefixes used later in the schema.

It is good practice to then include, in a comment (starting with a #), the SPARQL query for the items we want to check.

Cardinalities (?, +, *, often at the end of a line) follow the notation in the XML specification.
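As a quick reference, the meaning of the cardinality markers can be written down as minimum and maximum counts. The following Python sketch is only an illustration of that meaning, not a ShEx implementation. The names `CARDINALITIES` and `count_allowed` are hypothetical, and the empty marker stands for a constraint without any cardinality, which in ShEx means exactly one.

```python
from typing import Dict, Optional, Tuple

# (min, max) number of matching values per marker; None means unbounded.
CARDINALITIES: Dict[str, Tuple[int, Optional[int]]] = {
    "": (1, 1),      # no marker: exactly one
    "?": (0, 1),     # optional: zero or one
    "+": (1, None),  # one or more
    "*": (0, None),  # zero or more
}


def count_allowed(marker: str, count: int) -> bool:
    """True if `count` matching values satisfy the cardinality marker."""
    low, high = CARDINALITIES[marker]
    return count >= low and (high is None or count <= high)
```

For the schema above, `wdt:P31 IRI+` corresponds to `count_allowed("+", n)`, which fails only when an item has no instance of (P31) value with an IRI.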

Useful scripts

The user script User:Zvpunry/EntitySchemaHighlighter.js highlights and links property and item identifiers in the displayed schema, which makes inspection much easier. Install it by adding importScript( 'User:Zvpunry/EntitySchemaHighlighter.js' ); // Backlink: [[User:Zvpunry/EntitySchemaHighlighter.js]] to your common.js.

Screenshot of EntityShape.js on item Q98642744.

The user script User:Teester/EntityShape.js lets you check whether an item conforms to a schema and displays whether each property and statement in the entity conforms to that EntitySchema. It only works for relatively straightforward schemas at the moment. Install it by adding importScript( 'User:Teester/EntityShape.js' ); to your common.js.

Checking entities against the schema

Clicking the link "check entities against this schema" on the schema page invokes the ShEx2 Online Validator (Q65921524), which loads the schema automatically.

To fetch the items to be validated, the user has to submit a query in the small query box in the lower left corner. (Having the query included as a comment in the schema, as recommended above, makes that part a simple copy and paste.) Hitting Ctrl-Enter or the "run query to fetch entities" button loads the items for validation. Validation starts with another Ctrl-Enter.

A few items create warnings for missing P31 properties.

Simplest: Check button via user script

The most comfortable way to validate an item is the User:Teester/CheckShex.js user script (help on activating). It adds an input box for a schema id and a "Check" button to each item page. Validation for a particular schema can result in "Pass" or "Fail", with a short error message displayed and a more verbose one accessible by mouse-over.

The user script creates a similar input box on the schema page. There, an item to check can be selected with autosuggest support.

The tool works on an API based on PyShEx (Q51672520).

Full control: Manually stitching schema and queries together in a manifest file

The ShEx2 tool linked from the schema pages can also be used standalone. With a manifest file, it can load a schema, load and execute a query against a particular SPARQL endpoint, and validate the nodes selected by the query. The manifest has to be in JSON, and follow this template:

[
    {
        "schemaLabel": "<label for the schema>",
        "schemaURL": "<url of the shex file>",
        "dataLabel": "<label for the data set>",
        "data": "Endpoint: <endpoint url>",
        "queryMap": "SPARQL '''<query>'''@START",
        "status": "conformant"
    }
]

For the manifest file, the <query> has to be JSON-escaped. One way to do that is to copy the query (as above) into https://www.freeformatter.com/json-escape.html and hit "Escape". The resulting value can be inserted into the manifest file. Alternatively, one can use Template:ShEx2, which was used to generate the code above.
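The escaping can also be left to a JSON library. The sketch below uses Python's standard `json` module to fill the template and serialize it, so that line breaks and quotes in the query are escaped automatically. The function names `manifest_entry` and `build_manifest` are hypothetical helpers for illustration. The field names and the `SPARQL '''...'''@START` wrapper are taken from the template above.

```python
import json
from typing import Dict, List

QUERY = (
    "select ?item where {\n"
    "  ?item wdt:P4293 ?pm20Id .\n"
    "  filter(regex(?pm20Id, '^co/'))\n"
    "}\n"
    "limit 20"
)


def manifest_entry(
    schema_label: str, schema_url: str, data_label: str, endpoint: str, query: str
) -> Dict[str, str]:
    """Fill the manifest template for one schema/query pair."""
    return {
        "schemaLabel": schema_label,
        "schemaURL": schema_url,
        "dataLabel": data_label,
        "data": f"Endpoint: {endpoint}",
        "queryMap": f"SPARQL '''{query}'''@START",
        "status": "conformant",
    }


def build_manifest(entries: List[Dict[str, str]]) -> str:
    """Serialize the entries; json.dumps takes care of escaping the query."""
    return json.dumps(entries, indent=4)
```

Calling `build_manifest([manifest_entry(...)])` returns the text of a manifest file. Reading it back with `json.loads` yields the original, unescaped query.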

One manifest file can include references to multiple different schemas and datasets/queries.

[
    {
        "schemaLabel": "Wikidata class exists",
        "schemaURL": "https://raw.githubusercontent.com/jneubert/wd-shex-test/master/wikidata-class.shex",
        "dataLabel": "Get 20 items with PM20 ids starting with 'co/'" ,
        "data": "Endpoint: https://query.wikidata.org/sparql",
        "queryMap": "SPARQL '''select ?item where {\r\n  ?item wdt:P4293 ?pm20Id .\r\n  filter(regex(?pm20Id, '^co\/'))\r\n}\r\nlimit 20'''@START",
        "status": "conformant"
    },
    {
        "schemaLabel": "Wikidata organization class derived from PM20 ID",
        "schemaURL": "https://raw.githubusercontent.com/jneubert/wd-shex-test/master/wikidata-organization.shex",
        "dataLabel": "Get 20 items with PM20 ids starting with 'co/'" ,
        "data": "Endpoint: https://query.wikidata.org/sparql",
        "queryMap": "SPARQL '''select ?item where {\r\n  ?item wdt:P4293 ?pm20Id .\r\n  filter(regex(?pm20Id, '^co\/'))\r\n}\r\nlimit 20'''@START",
        "status": "conformant"
    },
    {
        "schemaLabel": "Wikidata organization class derived from PM20 ID",
        "schemaURL": "https://raw.githubusercontent.com/jneubert/wd-shex-test/master/wikidata-organization.shex",
        "dataLabel": "Get all items with PM20 ids starting with 'co/'" ,
        "data": "Endpoint: https://query.wikidata.org/sparql",
        "queryMap": "SPARQL '''select ?item where {\r\n  ?item wdt:P4293 ?pm20Id .\r\n  filter(regex(?pm20Id, '^co\/'))\r\n}'''@START",
        "status": "conformant"
    }
]

The manifest URL and the <url of the shex file> should be accessible via https, in order to avoid browser blocking for security reasons (the file:// scheme does not work). One easy way to achieve this is to put the files into a GitHub repository and to link to the raw content URL.

The "manifestURL" argument can be added to the request URL:

 https://rawgit.com/shexSpec/shex.js/wikidata/packages/shex-webapp/doc/shex-simple.html?manifestURL=https://raw.githubusercontent.com/jneubert/wd-shex-test/master/wikidata-pm20-organization.manifest.json
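If several manifests are in use, the request URL can be assembled programmatically. This is a small sketch under the assumption that the validator page only needs the `manifestURL` query parameter shown above. The helper name `validator_url` is hypothetical, and the page address is copied from the example URL.

```python
from urllib.parse import urlencode

# Validator page as given in the example URL above.
VALIDATOR_PAGE = (
    "https://rawgit.com/shexSpec/shex.js/wikidata/"
    "packages/shex-webapp/doc/shex-simple.html"
)


def validator_url(manifest_url: str) -> str:
    """Append the manifest location as the manifestURL query parameter."""
    # ':' and '/' are kept unescaped so the result matches the form shown above.
    return f"{VALIDATOR_PAGE}?{urlencode({'manifestURL': manifest_url}, safe=':/')}"
```

Passing the raw GitHub address of the example manifest reproduces the request URL shown above.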

That lets you select a manifest with a button from the list on the left side and a query with a button from the list on the right side:

Example Shape Expressions validation in ShEx2

From here, you can edit the schema pane as well as the query map pane and, after a page reload, re-execute the query and the validation.

More complex shapes

Some shapes are linked from Wikidata:WikiProject_ShEx.