Wikidata:WikiProject ShEx/How to get started?


Here we create a simple ShEx schema and apply it to a set of Wikidata items.

The starting point is that the expected type of a Wikidata item can be derived from certain external identifiers attached to it. We take PM20 folder ID (P4293) as an example, which links to the 20th Century Press Archives. The first two characters of the external ID value identify persons (pe), companies and other organizations (co), or other types of folders.
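The prefix rule can be sketched in a few lines of Python. This is only an illustration: the function name and the sample IDs are hypothetical, and only the two prefixes named above (pe and co) are distinguished, with everything else lumped together as "other".

```python
def pm20_folder_type(folder_id):
    """Derive the expected item type from the first two characters of a PM20 folder ID."""
    prefix = folder_id[:2]
    # Only "pe" and "co" are named in the text; all other prefixes are treated alike.
    return {"pe": "person", "co": "organization"}.get(prefix, "other")


# Hypothetical sample IDs, made up for illustration only.
for sample in ("co/000001", "pe/000001", "xx/000001"):
    print(sample, "->", pm20_folder_type(sample))
```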

We now want to check that every Wikidata item whose PM20 folder ID starts with "co/" is an instance of organization (Q43229) or of one of its subclasses.

Identifying the items we want to check with a query

select ?item where {
  ?item wdt:P4293 ?pm20Id .
  filter(regex(?pm20Id, '^co/'))
}
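
The query can also be sent to the endpoint from a script. The following is a minimal sketch, assuming the public endpoint https://query.wikidata.org/sparql (the one used in the manifest later on this page) and the standard SPARQL JSON results format. The function names and the User-Agent string are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """select ?item where {
  ?item wdt:P4293 ?pm20Id .
  filter(regex(?pm20Id, '^co/'))
}"""


def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


def item_iris(results):
    """Extract the ?item IRIs from a SPARQL JSON results document."""
    return [binding["item"]["value"] for binding in results["results"]["bindings"]]


def run_query(query):
    """Send the query over the network and return the parsed JSON results."""
    # The User-Agent value is a placeholder; replace it with one describing your tool.
    request = Request(build_query_url(query),
                      headers={"User-Agent": "shex-getting-started-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)
```

Calling `item_iris(run_query(QUERY))` would return the IRIs of the items to check. Nothing is fetched until `run_query` is called.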


Creating the schema

We start with a very simple schema, which only requires that each item has at least one instance of (P31) statement whose value is an IRI:

 PREFIX wdt: <http://www.wikidata.org/prop/direct/>
 
 start = @<wikidata-class>
 
 <wikidata-class> {
   wdt:P31 IRI+; 
 }

This expression only ensures that an item has at least one class. The constraint that at least one of these classes is organization (Q43229) or a subclass of (P279) organization (Q43229) requires a more complex, recursive definition, which is found here.

Stitching it together in a manifest file for ShEx2

ShEx2 is a simple online validator. Given a manifest file, it can load a schema, execute a query against a particular SPARQL endpoint, and validate the nodes selected by the query. The manifest must be JSON and follow this template:

  [
    {
        "schemaLabel": "<label for the schema>",
        "schemaURL": "<url of the shex file>",
        "dataLabel": "<label for the data set>",
        "data": "Endpoint: <endpoint url>",
        "queryMap": "SPARQL '''<query>'''@START",
        "status": "conformant"
    }
  ]

In the manifest file, the <query> has to be JSON-escaped. One way to do that is to paste the query (as above) into https://www.freeformatter.com/json-escape.html and hit "Escape". The resulting value can then be inserted into the manifest file. A single manifest file can reference several schemas and datasets/queries, as in this example with two entries:

[
    {
        "schemaLabel": "Wikidata class exists",
        "schemaURL": "https://raw.githubusercontent.com/jneubert/wd-shex-test/master/wikidata-class.shex",
        "dataLabel": "Get 20 items with PM20 ids starting with 'co/'",
        "data": "Endpoint: https://query.wikidata.org/sparql",
        "queryMap": "SPARQL '''select ?item where {\r\n  ?item wdt:P4293 ?pm20Id .\r\n  filter(regex(?pm20Id, '^co\/'))\r\n}\r\nlimit 20'''@START",
        "status": "conformant"
    },
    {
        "schemaLabel": "Wikidata organization class derived from PM20 ID",
        "schemaURL": "https://raw.githubusercontent.com/jneubert/wd-shex-test/master/wikidata-organization.shex",
        "dataLabel": "Get 20 items with PM20 ids starting with 'co/'",
        "data": "Endpoint: https://query.wikidata.org/sparql",
        "queryMap": "SPARQL '''select ?item where {\r\n  ?item wdt:P4293 ?pm20Id .\r\n  filter(regex(?pm20Id, '^co\/'))\r\n}\r\nlimit 20'''@START",
        "status": "conformant"
    }
]
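
The escaping step described above can also be scripted instead of using an online tool. The sketch below builds one manifest entry as a plain Python structure and lets the standard `json` module do the escaping. The helper name `manifest_entry` is made up for illustration, while the field names and values are taken from the manifest above. Note that `json.dumps` writes line breaks as `\n` and leaves `/` unescaped, whereas the manifest above uses `\r\n` and `\/`; both forms are valid JSON.

```python
import json

QUERY = """select ?item where {
  ?item wdt:P4293 ?pm20Id .
  filter(regex(?pm20Id, '^co/'))
}
limit 20"""


def manifest_entry(schema_label, schema_url, data_label, endpoint, query):
    """Build one manifest entry following the template shown above."""
    return {
        "schemaLabel": schema_label,
        "schemaURL": schema_url,
        "dataLabel": data_label,
        "data": "Endpoint: " + endpoint,
        # The raw query goes in unescaped; json.dumps escapes it on output.
        "queryMap": "SPARQL '''" + query + "'''@START",
        "status": "conformant",
    }


manifest = [
    manifest_entry(
        "Wikidata class exists",
        "https://raw.githubusercontent.com/jneubert/wd-shex-test/master/wikidata-class.shex",
        "Get 20 items with PM20 ids starting with 'co/'",
        "https://query.wikidata.org/sparql",
        QUERY,
    )
]

manifest_json = json.dumps(manifest, indent=4)
print(manifest_json)
```

Building the whole structure and serializing it once avoids escaping the query by hand, and reading the result back with `json.loads` is a quick check that the file is well-formed.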

The manifest URL and the <url of the shex file> should be accessible via HTTPS to avoid being blocked by the browser for security reasons (the file:// scheme does not work). An easy way to achieve this is to put the files into a GitHub repository and link to their raw content URLs.

The manifest is passed to the validator by adding the "manifestURL" argument to the request URL:

 https://rawgit.com/shexSpec/shex.js/wikidata/doc/shex-simple.html?manifestURL=https://raw.githubusercontent.com/jneubert/wd-shex-test/master/wikidata-pm20-organization.manifest.json
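
For several manifests, the request URL can be assembled programmatically. This is a small sketch: the function name is hypothetical, and percent-encoding every character except `:` and `/` is an assumption chosen so that a plain manifest URL stays readable, as in the example above, while characters such as `&` or `#` cannot break the query string.

```python
from urllib.parse import quote

VALIDATOR = "https://rawgit.com/shexSpec/shex.js/wikidata/doc/shex-simple.html"


def validator_url(manifest_url, validator=VALIDATOR):
    """Append the manifest location as the manifestURL argument."""
    # Letters, digits and "_.-~" are never encoded; ":" and "/" are kept via `safe`.
    return validator + "?manifestURL=" + quote(manifest_url, safe=":/")


print(validator_url(
    "https://raw.githubusercontent.com/jneubert/wd-shex-test/master/"
    "wikidata-pm20-organization.manifest.json"
))
```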

You can then select a manifest with a button from the list on the left and a query with a button from the list on the right:

Example Shape Expressions validation in ShEx2

From here, you can edit the schema pane as well as the query map pane and, after reloading the page, re-execute the query and the validation.

More complex shapes

A more comprehensive example, which creates a schema for file formats, is available here.