- 1 Lexical Masks: Introduction
- 2 Using masks for lexicon validation
- 3 List of existing masks
- 4 Using Mask for Lexical Editing Form
- 5 Frequently Asked Questions
Lexical Masks: Introduction
Lexical masks aim to represent, in a consistent way, the expected internal structure of lexical entries. Masks are defined for each language and each part-of-speech in that language.
Lexical masks are specifications of the requirements a lexical entry should fulfill. In particular, a mask defines:
- how many forms the entry should have to be complete;
- what features are expected for each form.
Masks are specific to part-of-speech and language. One particular part-of-speech of one particular language can have more than one mask. For example, the table below shows the specification for Italian adjective entries. It specifies that four forms are expected, and that each form should have one unique combination of gender and number features (i.e. there is one form for each feature bundle: MascGender / SingNumber, MascGender / PlurNumber, FemGender / SingNumber, and FemGender / PlurNumber).
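| ||Masculine Gender||Feminine Gender|
|Singular Number||MascGender + SingNumber||FemGender + SingNumber|
|Plural Number||MascGender + PlurNumber||FemGender + PlurNumber|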
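As a rough illustration, the Italian adjective mask can be written down as a set of feature bundles, one per expected form. This is a minimal Python sketch, not the formalization used for the actual masks. The feature labels come from the text above, while `matches_mask`, the data layout, and the sample forms of "rosso" are assumptions made for the example.

```python
from itertools import product

# Feature labels as used in the text; the data layout itself is an assumption.
GENDERS = ("MascGender", "FemGender")
NUMBERS = ("SingNumber", "PlurNumber")

# One bundle per expected form: every gender combined with every number.
ITALIAN_ADJECTIVE_MASK = {frozenset(bundle) for bundle in product(GENDERS, NUMBERS)}


def matches_mask(forms, mask):
    """True if the forms carry exactly the bundles of the mask, each one once."""
    bundles = [frozenset(features) for _, features in forms]
    return len(bundles) == len(mask) and set(bundles) == mask


# Illustrative entry, given as (representation, features) pairs.
rosso = [
    ("rosso", {"MascGender", "SingNumber"}),
    ("rossi", {"MascGender", "PlurNumber"}),
    ("rossa", {"FemGender", "SingNumber"}),
    ("rosse", {"FemGender", "PlurNumber"}),
]

print(matches_mask(rosso, ITALIAN_ADJECTIVE_MASK))      # complete entry
print(matches_mask(rosso[:3], ITALIAN_ADJECTIVE_MASK))  # one form missing
```

Forms are kept as pairs rather than in a dictionary keyed by spelling, because two forms of the same entry can be spelled identically.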
Of course, lexical entries can be (and often are) much more complex: in the number of forms, in how the forms combine the available dimensions, and in the features used to describe the forms and the entry in general.
Distinguishing entry-level and form-level features
Lexical entries are characterized not only by their forms and the features associated with those forms, but also by features assigned at the entry level, which are inherent to the entire entry. For example, the mask for Russian nouns below shows an entry-level specification that requires a combination of animacy and gender features, and a set of form-level features specifying that each form must have a case and a number feature.
|Entry-level||MascGender+Inanimate OR FemGender+Animate OR NeutGender+Animate OR ...|
Examples of entry-level features include gender and animacy for nouns, aspect and transitivity for verbs, and degree for adjectives.
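The split between the two levels can be sketched as two separate checks. The following Python sketch assumes a hypothetical mapping from feature labels to grammatical categories (the case labels in particular are invented for the example) and only verifies which categories appear at which level. It does not check the allowed gender and animacy combinations or the completeness of the paradigm.

```python
# Hypothetical mapping from feature label to grammatical category.
CATEGORY = {
    "MascGender": "gender", "FemGender": "gender", "NeutGender": "gender",
    "Animate": "animacy", "Inanimate": "animacy",
    "SingNumber": "number", "PlurNumber": "number",
    "NomCase": "case", "GenCase": "case", "DatCase": "case",
    "AccCase": "case", "InsCase": "case", "PrepCase": "case",
}

ENTRY_LEVEL = ["animacy", "gender"]  # required once per entry
FORM_LEVEL = ["case", "number"]      # required once per form


def categories(features):
    """Sorted categories of a feature set; unknown labels map to 'unknown'."""
    return sorted(CATEGORY.get(feature, "unknown") for feature in features)


def check_levels(entry_features, forms):
    """Return (entry_level_ok, form_level_ok) for one lexical entry."""
    entry_ok = categories(entry_features) == ENTRY_LEVEL
    forms_ok = all(categories(features) == FORM_LEVEL for features in forms)
    return entry_ok, forms_ok


# Gender and animacy on the entry, case and number on each form.
print(check_levels({"MascGender", "Inanimate"},
                   [{"NomCase", "SingNumber"}, {"GenCase", "PlurNumber"}]))

# Gender misplaced on a form instead of the entry: both levels fail.
print(check_levels({"Inanimate"}, [{"NomCase", "SingNumber", "MascGender"}]))
```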
Accounting for more granularity: multiple masks
The configuration of lexical entries must also provide a certain level of flexibility to account for the different structures of different entries. For example, there are two masks for German nouns. The first mask, shown in the first table below, covers nouns that have an intrinsic gender (i.e. at the entry level) together with all the case and number declensions of that noun. The second mask, given in the second table, describes nouns that have no inherent gender at the entry level but have specific inflections per gender (e.g. nouns for professions).
|Entry-level||MascGender OR FemGender OR NeutGender|
|MascGender+SingNumber||MascGender + PlurNumber||FemGender+SingNumber||FemGender+PlurNumber|
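One way to picture several masks for the same part-of-speech is a list of alternatives, where an entry is acceptable if it satisfies at least one of them. The Python sketch below is an assumption-laden illustration: the `Mask` class, the helper names and the case labels are hypothetical, the first mask is modeled as four cases times two numbers, and the second mask follows the gender and number bundles of the second table only.

```python
from dataclasses import dataclass
from itertools import product


def bundles(*dimensions):
    """All combinations of one feature per dimension, as a set of bundles."""
    return frozenset(frozenset(combo) for combo in product(*dimensions))


@dataclass(frozen=True)
class Mask:
    name: str
    entry_options: frozenset  # allowed entry-level bundles
    form_bundles: frozenset   # one bundle per expected form


GENDERS = ("MascGender", "FemGender", "NeutGender")
CASES = ("NomCase", "GenCase", "DatCase", "AccCase")  # hypothetical labels
NUMBERS = ("SingNumber", "PlurNumber")

GERMAN_NOUN_MASKS = [
    # Gender is fixed at the entry level; forms vary by case and number.
    Mask("inherent gender", bundles(GENDERS), bundles(CASES, NUMBERS)),
    # No entry-level gender; forms vary by gender and number.
    Mask("inflected per gender", frozenset({frozenset()}),
         bundles(("MascGender", "FemGender"), NUMBERS)),
]


def matching_masks(entry_features, forms, masks=GERMAN_NOUN_MASKS):
    """Names of all masks the entry satisfies; an empty list means none fit."""
    entry_bundle = frozenset(entry_features)
    form_list = [frozenset(features) for features in forms]
    return [
        mask.name
        for mask in masks
        if entry_bundle in mask.entry_options
        and len(form_list) == len(mask.form_bundles)
        and set(form_list) == mask.form_bundles
    ]


declined = [set(b) for b in bundles(CASES, NUMBERS)]
print(matching_masks({"FemGender"}, declined))
```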
Using masks for lexicon validation
The mask model presented here is used to perform a semi-automatic evaluation of the lexicon. Each lexicon entry of a particular language (in the example, an Italian adjectival entry) is run through the mask. During this process, we check that (1) the adjectival entry has indeed four forms, and (2) each form has one of the required unique combinations of gender and number features (e.g. we cannot have two forms that are plural and feminine).
This evaluation process marks all the entries that pass the masks as “structurally valid”. The entries that do not pass the masks have to be looked at more carefully.
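The two checks described above can be sketched as a function that reports why an entry fails rather than only whether it fails. This is a minimal Python illustration under the assumption that a mask is a set of feature bundles; `validate` and `label` are hypothetical names, and the real validation is done with ShEx, not with this code.

```python
from collections import Counter


def label(bundle):
    """Readable, order-independent name for a feature bundle."""
    return "+".join(sorted(bundle))


def validate(forms, mask):
    """List the problems found; an empty list means structurally valid."""
    counts = Counter(frozenset(features) for features in forms)
    problems = []
    for bundle in sorted(mask, key=label):
        if counts[bundle] == 0:
            problems.append("missing: " + label(bundle))
        elif counts[bundle] > 1:
            problems.append("duplicate: " + label(bundle))
    for bundle in sorted(set(counts) - set(mask), key=label):
        problems.append("unexpected: " + label(bundle))
    return problems


MASK = {
    frozenset({"MascGender", "SingNumber"}),
    frozenset({"MascGender", "PlurNumber"}),
    frozenset({"FemGender", "SingNumber"}),
    frozenset({"FemGender", "PlurNumber"}),
}

# Two forms are tagged feminine plural, so feminine singular is left uncovered.
suspect = [
    {"MascGender", "SingNumber"},
    {"MascGender", "PlurNumber"},
    {"FemGender", "PlurNumber"},
    {"FemGender", "PlurNumber"},
]
print(validate(suspect, MASK))
# ['duplicate: FemGender+PlurNumber', 'missing: FemGender+SingNumber']
```

An entry with four forms can therefore still fail: the count is right, but the bundles are not unique.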
How does it work in practice?
The masks are formalized in ShEx files published in Wikidata (e.g. https://www.wikidata.org/wiki/EntitySchema:E131).
In line 12 of the ShEx file, a SPARQL query is given to find all lexicographic entries the file applies to (all possible focus nodes for the shape described by the ShEx file). Below that, we see the description of the lexical entry: in line 22, we require the grammatical gender to be given at the entry level, and in lines 23ff. we see the definition of the eight individual forms that constitute a German noun as per the table above.
The validation will ensure that all required forms are present, that the right combination of grammatical features is given throughout the forms, and that all entry-level values are set as required. Furthermore, as usual with RDF, the validation will not prevent the data from having additional annotations and markers, e.g. it will not interfere with semantic annotations on the lexical entries, or with linkages between entries from different languages. The ShEx files exclusively check for the completeness of the morpho-syntactic forms of the lexical entry.
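The point that extra annotations do not affect the result can be illustrated with a checker that reads only the morpho-syntactic fields and leaves everything else alone. This is a Python sketch of the idea, not of the ShEx mechanics; the dictionary layout, the case labels, and keys such as `senses`, `translations` and `pronunciation` are hypothetical.

```python
from itertools import product

CASES = ("NomCase", "GenCase", "DatCase", "AccCase")  # hypothetical labels
NUMBERS = ("SingNumber", "PlurNumber")


def is_structurally_valid(entry):
    """Look only at gender, case and number; all other keys are ignored."""
    forms = entry.get("forms", [])
    pairs = {(form.get("case"), form.get("number")) for form in forms}
    return (
        entry.get("gender") is not None
        and len(forms) == len(CASES) * len(NUMBERS)
        and pairs == set(product(CASES, NUMBERS))
    )


plain = {
    "gender": "FemGender",
    "forms": [{"case": c, "number": n} for c, n in product(CASES, NUMBERS)],
}

# Same entry with extra, purely illustrative annotations on the entry and its forms.
annotated = dict(plain, senses=["(some gloss)"], translations={"en": "(some link)"})
annotated["forms"] = [dict(form, pronunciation="(some IPA)") for form in plain["forms"]]

print(is_structurally_valid(plain), is_structurally_valid(annotated))
```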
List of existing masks
- standard German noun (simplified, runs fast)
- standard German noun (more comprehensive, may take long in some implementations)
Using Mask for Lexical Editing Form
Wikidata is developing its platform and infrastructure to support ShEx files in a wide range of use cases across Wikidata. Most importantly for us, we can use the files we publish to validate existing lexicographic entries. This allows for large-scale semi-automatic validation of the crowdsourced entries in Wikidata, and thus provides a feedback loop for the community to see the quality of their entries. Contributors can get a generated list of all entries that do not fulfill the constraints described in the ShEx files, and then decide case by case how to handle the data (i.e. whether it is a valid exception, whether it requires an alternative or silver mask, or whether the entry needs to be improved).
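The feedback loop can be pictured as a report that separates passing entries, entries already accepted as exceptions, and entries that still need a decision. The Python sketch below is hypothetical throughout: the entry identifiers, the `review_report` function and the toy `four_forms` validator are stand-ins, and the actual list would be generated from the ShEx validation.

```python
def review_report(entries, validate, accepted_exceptions=frozenset()):
    """Split entries into valid, accepted exceptions, and entries to review."""
    valid, exceptions, to_review = [], [], {}
    for entry_id, forms in entries.items():
        problems = validate(forms)
        if not problems:
            valid.append(entry_id)
        elif entry_id in accepted_exceptions:
            exceptions.append(entry_id)
        else:
            to_review[entry_id] = problems
    return valid, exceptions, to_review


def four_forms(forms):
    """Toy validator: only checks that exactly four forms are present."""
    if len(forms) == 4:
        return []
    return [f"expected 4 forms, found {len(forms)}"]


# Hypothetical entries, identified by made-up IDs and listed by their forms.
entries = {
    "entry-a": ["f1", "f2", "f3", "f4"],
    "entry-b": ["f1", "f2"],
    "entry-c": ["f1"],
}

valid, exceptions, to_review = review_report(
    entries, four_forms, accepted_exceptions={"entry-c"}
)
print(valid, exceptions, to_review)
```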
Frequently Asked Questions
What if I don’t agree with how a mask is set up for a particular POS/language?
It is possible that the available masks are not accurate. The most important question is whether the proposed structure is unsuitable only for a particular set of words, or unsuitable altogether. If it is the former (the mask does not fit a specific set of words), it is better to create another mask (a “silver” mask), which will probably be a subset of the specification in the existing mask. If it is the latter, you can change the existing mask. But be careful: the change will apply to all the entries.
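To make the subset idea concrete, here is a small Python sketch in which a silver mask keeps only part of the bundles of the existing mask. The choice of a singular-only mask, the case labels, and the `classify` function are assumptions made for the example, not an existing mask.

```python
from itertools import product

CASES = ("NomCase", "GenCase", "DatCase", "AccCase")  # hypothetical labels
NUMBERS = ("SingNumber", "PlurNumber")

# Existing mask: every case combined with every number.
EXISTING = {frozenset(bundle) for bundle in product(CASES, NUMBERS)}

# Hypothetical silver mask: words that are only used in the singular.
SILVER = {bundle for bundle in EXISTING if "SingNumber" in bundle}


def classify(forms):
    """Name the mask an entry satisfies, trying the existing mask first."""
    found = [frozenset(features) for features in forms]
    if len(found) != len(set(found)):
        return "needs review"  # some bundle occurs more than once
    if set(found) == EXISTING:
        return "existing mask"
    if set(found) == SILVER:
        return "silver mask"
    return "needs review"


singular_only = [{case, "SingNumber"} for case in CASES]
print(SILVER < EXISTING, classify(singular_only))
```

Entries that fit the silver mask stay out of the review list without loosening the existing mask for everything else.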
There is no mask for my language, how can I set one up?
We are happy to help! Please contact us on the talk page. What we often lack is the knowledge about a given language - but we have the expertise on how to create the specifications. We would love to cooperate, and specify the masks based on your knowledge.
Our suggestion is that you create two or three full entries for a given language and part-of-speech, exemplifying what a good entry would look like, and we will then try to capture that. Any additional explanations are welcome, and it would be great if you could be available for clarifying questions. If you have another idea on how to cooperate, please let us know.