Wikidata:Lexical Masks

Lexical Masks: Introduction

General Principle

Lexical masks aim to represent, in a consistent way, the expected internal structure of lexical entries. Masks are defined for each language and each part-of-speech in that language.

Lexical masks are specifications of the requirements a lexical entry should fulfill. In particular, a mask defines:

  • how many forms the entry should have to be complete;
  • what features are expected for each form.

Masks are specific to part-of-speech and language. One particular part-of-speech of one particular language can have more than one mask. For example, the table below shows the specification for Italian adjective entries. It specifies that four forms are expected, and that each form should have one unique combination of gender and number features (i.e. there is one form for each feature bundle: MascGender / SingNumber, MascGender / PlurNumber, FemGender / SingNumber, and FemGender / PlurNumber).

                  Masculine Gender   Feminine Gender
Singular Number   form1              form2
Plural Number     form3              form4
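
One way to picture this mask is as the set of feature bundles obtained by combining every value of one dimension with every value of the other. The sketch below is only an illustration in Python, not the project's actual JSON format; the names `ITALIAN_ADJECTIVE_DIMENSIONS` and `expected_bundles` are hypothetical.

```python
from itertools import product

# Hypothetical in-memory picture of the Italian adjective mask:
# each dimension lists the feature values a form may take.
ITALIAN_ADJECTIVE_DIMENSIONS = {
    "Number": ["SingNumber", "PlurNumber"],
    "Gender": ["MascGender", "FemGender"],
}

def expected_bundles(dimensions):
    """Return every feature bundle the mask expects, one per form."""
    return [frozenset(combo) for combo in product(*dimensions.values())]

bundles = expected_bundles(ITALIAN_ADJECTIVE_DIMENSIONS)
for index, bundle in enumerate(bundles, start=1):
    # Numbering follows the table: form1 and form2 are singular,
    # form3 and form4 are plural.
    print(f"form{index}:", " / ".join(sorted(bundle)))
```

Adding a dimension (or a value) to the dictionary changes the number of expected forms accordingly, which is why larger masks are easier to state as dimensions than as hand-written lists.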

Of course, lexical entries can be (and often are) much more complex: in the number of forms, in how the forms are combined from the different available dimensions, and in the features used to describe the forms and the entry in general.

Distinguishing entry-level and form-level features

Lexical entries are characterized not only by their forms and the features associated with those forms, but also by features assigned at the entry level, which are inherent to the entire entry. For example, the mask for Russian nouns below shows an entry-level specification that requires a combination of animacy and gender features, and a set of form-level features specifying that each form must have a case and a number feature.

Entry-level   MascGender+Inanimate OR FemGender+Animate OR NeutGender+Animate OR ...

Form-level    Number=Sing   Number=Pau   Number=Plur
Case=Nom      form1         form10       form19
Case=Gen      form2         form11       form20
Case=Dat      form3         form12       form21
Case=Acc      form4         form13       form22
Case=Inst     form5         form14       form23
Case=Prep     form6         form15       form24
Case=Part     form7         form16       form25
Case=Loc      form8         form17       form26
Case=Voc      form9         form18       form27

Examples of entry-level features include gender and animacy for nouns, aspect and transitivity for verbs, and degree for adjectives.
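
The split between the two levels can be sketched as follows. This is a hedged illustration under stated assumptions, not the format used by the project: the names `form_slots` and `entry_level_ok` are hypothetical, and only the two entry-level alternatives that the Russian table spells out without ambiguity are listed (the table itself elides the rest).

```python
from itertools import product

# Form-level dimensions, in the order used by the Russian noun table.
NUMBERS = ["Sing", "Pau", "Plur"]
CASES = ["Nom", "Gen", "Dat", "Acc", "Inst", "Prep", "Part", "Loc", "Voc"]

# Entry-level alternatives (incomplete on purpose, see the note above).
ENTRY_LEVEL_ALTERNATIVES = [
    {"MascGender", "Inanimate"},
    {"FemGender", "Animate"},
]

def form_slots(numbers, cases):
    """Map each form name to the case and number features it must carry."""
    return {
        f"form{index}": {"Case": case, "Number": number}
        for index, (number, case) in enumerate(product(numbers, cases), start=1)
    }

def entry_level_ok(entry_features):
    """An entry passes if its features include one full alternative."""
    return any(alt <= set(entry_features) for alt in ENTRY_LEVEL_ALTERNATIVES)

slots = form_slots(NUMBERS, CASES)
print(len(slots))       # number of form-level cells in the table
print(slots["form11"])  # the genitive paucal cell
print(entry_level_ok({"FemGender", "Animate"}))
```

Entry-level features are checked once per entry, while the form-level grid is checked once per form, so the two checks stay independent.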

Accounting for more granularity: multiple masks

The configuration of lexical entries must also provide a certain level of flexibility to account for the different structures of different entries. For example, there are two masks for German nouns. The first mask, shown in the first table below, covers nouns that have an intrinsic gender (i.e. at the entry level) together with all the case and number declensions of that noun. The second mask, given in the second table, describes nouns that have no inherent gender at the entry level but have specific inflections per gender (e.g. nouns for professions).

Mask 1

Entry-level   MascGender OR FemGender OR NeutGender

Form-level    Number=Sing   Number=Plur
Case=Nom      form1         form5
Case=Gen      form2         form6
Case=Dat      form3         form7
Case=Acc      form4         form8

Mask 2

           MascGender+SingNumber   MascGender+PlurNumber   FemGender+SingNumber   FemGender+PlurNumber
Case=Nom   form1                   form5                   form9                  form13
Case=Gen   form2                   form6                   form10                 form14
Case=Dat   form3                   form7                   form11                 form15
Case=Acc   form4                   form8                   form12                 form16
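
With more than one mask per part-of-speech, an entry can be treated as acceptable when it fits any one of them. The following Python sketch illustrates that idea for the two German noun masks. It rests on assumptions that are not in the mask specification itself: the dictionary layout, the function names `matches` and `first_matching_mask`, and the choice that Mask 2 simply does not check entry-level features.

```python
from itertools import product

CASES = ["Nom", "Gen", "Dat", "Acc"]
NUMBERS = ["Sing", "Plur"]
GENDERS = ["Masc", "Fem"]

# Mask 1: gender at the entry level, forms vary by case and number (8 forms).
MASK_1 = {
    "entry_level": [{"MascGender"}, {"FemGender"}, {"NeutGender"}],
    "forms": {frozenset(combo) for combo in product(CASES, NUMBERS)},
}
# Mask 2: no entry-level gender; forms vary by case, gender and number (16 forms).
MASK_2 = {
    "entry_level": [],
    "forms": {frozenset(combo) for combo in product(CASES, GENDERS, NUMBERS)},
}
GERMAN_NOUN_MASKS = {"Mask 1": MASK_1, "Mask 2": MASK_2}

def matches(mask, entry_features, form_bundles):
    """True if the entry-level features and the form bundles fit the mask."""
    bundles = [frozenset(bundle) for bundle in form_bundles]
    entry_ok = (not mask["entry_level"]
                or any(alt <= set(entry_features) for alt in mask["entry_level"]))
    # Same count and same set means: nothing missing, nothing duplicated.
    forms_ok = len(bundles) == len(mask["forms"]) and set(bundles) == mask["forms"]
    return entry_ok and forms_ok

def first_matching_mask(masks, entry_features, form_bundles):
    """Return the name of the first mask the entry passes, or None."""
    for name, mask in masks.items():
        if matches(mask, entry_features, form_bundles):
            return name
    return None

# A noun without inherent gender but with masculine and feminine forms.
profession = list(product(CASES, GENDERS, NUMBERS))
print(first_matching_mask(GERMAN_NOUN_MASKS, set(), profession))
```

An entry that fits neither mask returns `None` and is a candidate for manual review.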

Using masks for lexicon validation

The mask model presented here is used to perform a semi-automatic evaluation of the lexicon. Each lexicon entry of a particular language (in this example, an Italian adjectival entry) is checked against the mask. During this process, we check that (1) the adjectival entry has exactly four forms, and (2) each form has one of the required unique combinations of gender and number features (e.g. there cannot be two forms that are both plural and feminine).

This evaluation process marks all entries that pass the masks as "structurally valid". Entries that do not pass the masks have to be looked at more carefully.
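
The two checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the validation code used by the project: the function name `validate`, the report layout, and the sample adjective are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Expected bundles for the Italian adjective mask (gender x number).
EXPECTED = {frozenset(combo) for combo in product(
    ["MascGender", "FemGender"], ["SingNumber", "PlurNumber"])}

def validate(forms, expected=EXPECTED):
    """Check (1) the number of forms and (2) that each bundle occurs once.

    `forms` is a list of (written form, set of features) pairs.
    """
    counts = Counter(frozenset(features) for _, features in forms)
    missing = expected - set(counts)
    unexpected = set(counts) - expected
    duplicated = {bundle for bundle, n in counts.items() if n > 1}
    return {
        "valid": len(forms) == len(expected)
                 and not (missing or unexpected or duplicated),
        "missing": missing,
        "unexpected": unexpected,
        "duplicated": duplicated,
    }

# Illustrative entry: the Italian adjective "rosso" (red).
rosso = [
    ("rosso", {"MascGender", "SingNumber"}),
    ("rossa", {"FemGender", "SingNumber"}),
    ("rossi", {"MascGender", "PlurNumber"}),
    ("rosse", {"FemGender", "PlurNumber"}),
]
print(validate(rosso)["valid"])

# Same entry with the last form mislabelled as feminine singular.
broken = rosso[:3] + [("rosse", {"FemGender", "SingNumber"})]
report = validate(broken)
print(report["valid"], report["missing"], report["duplicated"])
```

Returning the missing and duplicated bundles, rather than a bare pass or fail, gives a reviewer a starting point when an entry needs a closer look.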

How does it work in practice?

The masks are formalized as JSON files (published on GitHub). From these JSON files, other artefacts can be created for use in practice, such as ShEx files (published in Wikidata). A script available in that GitHub repository translates the JSON files to EntitySchemas.

In the ShEx file for German nouns, line 12 gives the SPARQL query that finds all lexicographic entries the ShEx file applies to (all possible focus nodes for the shape described by the ShEx file). Below that comes the description of the lexical entry: line 22 requires the grammatical gender to be given at the entry level, and lines 23ff. define the eight individual forms that constitute a German noun as per the table above.

The validation will ensure that all required forms are present, that the right combination of grammatical features is given throughout the forms, and that all entry-level values are set as required. Furthermore, as usual with RDF, the validation will not prevent the data from having additional annotations and markers, e.g. it will not interfere with semantic annotations on the lexical entries, or linkages between entries from different languages. The ShEx files exclusively check for the completeness of the morpho-syntactic forms of the lexical entry.

To run the validation, go to a specific ShEx file (see list below), and click on the link to the right of the ShEx ("check entities against this Schema"). This opens a new window with the ShEx validator script and a field to run the command. In this field, use the command shown on line 22 of the script, with a limit added (to keep the script from running for too long).

For example, for the German noun ShEx explained above, you can use the following command:

SELECT ?focus {?focus dct:language wd:Q188;wikibase:lexicalCategory wd:Q1084} limit 10

Then click on the button on the right ("validate") and you should see a list of Lexeme entries, with information about whether each one passes the ShEx or not.
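
For other languages or parts of speech, the same command can be assembled from the item IDs of the language and the lexical category. The helper below is a hypothetical convenience written in Python for illustration; it only builds the query text and does not contact Wikidata.

```python
def focus_query(language_qid, category_qid, limit=10):
    """Build a SPARQL query selecting lexemes of one language and category."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT ?focus {"
        f"?focus dct:language wd:{language_qid};"
        f"wikibase:lexicalCategory wd:{category_qid}"
        f"}} limit {limit}"
    )

# Q188 is German and Q1084 is noun, as in the command shown above.
print(focus_query("Q188", "Q1084"))
```

Keeping the limit small, as in the original command, avoids long validation runs while trying out a schema.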

List of existing Masks

  • German
  • English
  • French
  • Italian
  • Russian

See on GitHub.

List of existing ShEx files

There are more ShEx files in the GitHub repo. Feel free to copy them here.

Using Mask for Lexical Editing Form

Wikidata is developing its platform and infrastructure to support ShEx files in a wide range of use cases. Most importantly for us, we can use the files we publish to validate existing lexicographic entries. This allows for large-scale semi-automatic validation of the crowdsourced entries in Wikidata, and thus provides a feedback loop for the community to see the quality of their entries. Contributors can get a generated list of all entries that do not fulfill the constraints described in the ShEx files, and then decide case by case how to handle the data (i.e. whether it is a valid exception, whether it requires an alternative or silver mask, or whether the entry needs to be improved).

We expect that the JSON files can easily be translated into form definitions.

Languages available so far:

  • Basque
  • English
  • French
  • German
  • Hebrew
  • Italian
  • Russian


  • The paper about lexical masks was published at LREC 2020: "Introducing Lexical Masks: a New Representation of Lexical Entries for Better Evaluation and Exchange of Lexicons", by Bruno Cartoni, Daniel Calvelo Aros, Denny Vrandečić, and Saran Lertpradit.

Frequently Asked Questions

What if I don’t agree with how a mask is set up for a particular POS/language?

It is possible that the available masks are not accurate. The key question is whether the proposed structure is unsuitable only for a particular set of words, or unsuitable altogether. In the former case, it is better to create another mask (a “silver” mask), which will probably be a subset of the specification in the existing mask. In the latter case, you can change the existing mask. But be careful: the change will apply to all the entries.

There is no mask for my language, how can I set one up?

We are happy to help! Please contact us on the talk page. What we often lack is the knowledge about a given language - but we have the expertise on how to create the specifications. We would love to cooperate, and specify the masks based on your knowledge.

Our suggestion is that you create two or three full entries for a given language and part-of-speech, exemplifying what a good entry would look like, and we will then try to capture that. Any additional explanations are welcome, and it would be great if you could be available for clarifying questions. If you have another idea on how to cooperate, please let us know.