Wikidata:Comparison of Projects and Proposals for Wiktionary

Introduction

The purpose of this work is to discuss a new proposal, based on two previous ones, for supporting Wiktionary through Wikidata by providing structured language data that can then be represented in the Wiktionaries. This includes answering certain questions regarding the current structure of Wikidata, as not all decisions follow automatically. Examples are the scope and delimitation of a linguistic entity in Wikidata, the handling of derived but non-synonymous word forms and their structured representation, and the structure of an entry that may itself subsume different subcategories. The structures of similar projects (WordNet, EuroWordNet and OmegaWiki) and of lemon, a model for sharing lexical information, are also covered.

Each section contains an example item in order to illustrate the differences between the structures. In each case, the example shows how the structure represents the word “Hamburger” in English, meaning “person from Hamburg”, and the word “Hamburger” in German, which can mean either “person from Hamburg” or “hot sandwich consisting of a patty of cooked ground beef, in a sliced bun, sometimes also containing salad vegetables, condiments, or both”. (Note that the latter can be translated as “hamburger”, which, however, is not the same as “Hamburger” due to capitalization.) In each system, a box represents an entry, i.e. a separate item page. In most cases, the contents of these boxes are heavily reduced in order to obtain a more schematic representation.

Terminology

In order to avoid misunderstandings, what follows is a short outline of the terminology used throughout the document. A more extensive glossary of the terms used in the following chapters can be found under Further Wikidata-Terminology. Terms that are only needed in certain sections are introduced there.

surface form
This term is used to refer to the morphological surface form only (how a word is written), neglecting phonological surface forms (how a word is pronounced) etc. This also means that the surface form of “go” as a French noun for a board game is equal to the one of “go” as the English verb. The words “Polish” and “polish”, however, do not share the same surface form, due to the capitalization of only one of them.
gloss
A gloss is a short description of what a word/expression denotes. It does not need to be as detailed as a definition, which could be used interchangeably with the word/expression. An example of a word together with its gloss is “Fado – musical genre”.
expression
The term expression refers to identical surface forms that also share the same language and the same morphological features, and that thus also belong to the same word category. For example, the English adjective “blue” referring to the color and the English adjective “blue” referring to a melancholic state of mind are the same expression (just with different meanings). The English verbs “tear” (to rend by holding or restraining in two places and pulling apart, whether intentionally or not; to destroy or separate) and “tear” (to produce tears) are not, because their morphological values differ, for instance in their inflected forms: tear/tore/torn versus tear/teared/teared. The English noun “bike” and the English verb “bike” are not the same expression because their word categories differ. Neither are the English noun “chat” (an exchange of text or voice messages in real time through a computer network) and the French noun “chat” with the same meaning, because their languages differ.
sense
The sense of a word/multi-word expression is the word/multi-word expression combined with a meaning it is associated with. The English word “chat”, for instance, can mean “to be involved in an informal conversation”, while “chat” in French can either have the same meaning or mean “cat”. Even where the meanings coincide, the senses are different, because the expressions differ. A sense thus depends on the term it belongs to. To give another example: “Le Mépris” (film by Godard) and “Pierrot le fou” (film by Godard) are both described with the same gloss (“film by Godard”), but they do not share a sense. The sense in the first example is “Le Mépris” (film by Godard) and in the second “Pierrot le fou” (film by Godard).
entry
The term entry denotes the basic presentation-unit in any of the different dictionaries and is in this respect equal to the term page. For Wikidata, an entry is a wiki page, and is the basic editorial unit, used to track edit histories, authorship, etc.
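
The distinctions drawn above can be made concrete with a small data model. The following is only an illustrative sketch: the class names, field names and the choice to represent morphological features as a tuple of inflected forms are assumptions made for this example and are not part of any of the projects or proposals discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    """Surface form + language + word category + morphological features."""
    surface_form: str       # written form only, case-sensitive
    language: str
    word_category: str
    inflection: tuple = ()  # assumed encoding, e.g. ("tear", "tore", "torn")


@dataclass(frozen=True)
class Sense:
    """An expression combined with one of its meanings (given as a gloss)."""
    expression: Expression
    gloss: str


def same_surface_form(a: Expression, b: Expression) -> bool:
    """Compare the written forms only; language and category are ignored."""
    return a.surface_form == b.surface_form


go_fr = Expression("go", "French", "noun")
go_en = Expression("go", "English", "verb")
tear_rend = Expression("tear", "English", "verb", ("tear", "tore", "torn"))
tear_cry = Expression("tear", "English", "verb", ("tear", "teared", "teared"))
blue = Expression("blue", "English", "adjective")
blue_color = Sense(blue, "the color")
blue_mood = Sense(blue, "a melancholic state of mind")

print(same_surface_form(go_fr, go_en))   # True: same written form
print(go_fr == go_en)                    # False: language and category differ
print(tear_rend == tear_cry)             # False: inflected forms differ
print(blue_color.expression == blue_mood.expression)  # True: one expression
print(blue_color == blue_mood)           # False: two senses
```

Because a sense contains its expression, two senses with the same gloss (such as the two Godard films) still compare as different in this sketch.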

Wiktionary and Wikidata

This section introduces the functionality of both the current Wiktionary and Wikidata in order to lay the groundwork for the analyses of the three proposals. These overviews are followed by a short motivation of why linking the two wiki projects can be beneficial.

Wiktionary

Fig. 1: Example entry "Hamburger", Wiktionary
Fig. 2: Alternative Example entry "Hamburger", Wiktionary

The approach of the open-source Wiktionary is to be a multilingual dictionary with entries from all languages. In the English Wiktionary there are entries about the German word “Unterstrich” as well as about the Hungarian word “tüzes” etc. All information in an entry (except the foreign-language word itself) is written in English, and English words have entries as well. In the Thai Wiktionary, entries are written in Thai, and so on. Currently, the English version alone contains expressions from 1062 languages, of which 522 have more than ten entries[1]. At the moment, there are 170 language versions[2].


The entries are structured as follows: one entry handles all expressions from all languages that share one morphological surface form. This means, for example, that both the German adjective “arm” and the English noun “arm” are associated with the same entry. The sub-sections are structured according to language (the respective Wiktionary's own language has priority in the ordering of the sub-sections) and POS. The page covering a surface form is thus divided into expression sections, which in turn are divided into sense sections if an expression has more than one sense. If property values such as pronunciation or etymology are shared, they can appear at the beginning of the respective section; if they differ, they are split up into the respective subsections. There are some smaller differences between the language versions. The German one, for example, also lists translations into other languages (not to be confused with Wiktionaries in other languages that handle the same surface form) and links to those that are available in the German Wiktionary, whereas the English or French Wiktionaries do not treat translations into other languages. Apart from such details, the language versions follow a similar structure.


The information in Wiktionary includes pronunciation (both IPA(/SAMPA) transcriptions and audio files), etymology, anagrams, synonyms, hypernyms etc., inflection tables, example sentences, translations (and, if existing, links to the corresponding translation), links to Wikipedia pages about the concept(s) behind the sense(s) of the expression(s), and links to entries in other Wiktionaries that also handle the respective surface form. Wiktionary also has entries for compositions, acronyms, abbreviations, misspellings and simplified spellings.

Figure 1 shows how Wiktionary represents the "hamburger"-example.

Note that in some Wiktionaries, features shared by the two German words, such as the pronunciation in the example above, may be placed in front of the respective noun sections in order to avoid redundancy (see Figure 2).

Wikidata

Fig. 3: Example statement in Wikidata

Wikidata is an open-source database that stores structured data language-independently. The information about an item is thus shared by every language version: the labels of the concepts (which are linked via property links) and the property links themselves are represented in different languages, but are anchored at the ID of the respective item/property. Further, a third entity type is planned: queries. These are to be used for the automatic generation of lists, such as a list of food additives or lists of rivers together with the corresponding information about them.

Every item can have a list of statements. The information that Berlin (Q64) has the status of a state in Germany for example is represented by a statement claiming “type of administrative division: state of Germany”. These claims, consisting of a property and a value, and potentially qualifiers, are accompanied by a (possibly empty) list of references. An example statement is given in Figure 3.
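
The nesting described above (item, statement, claim with property, value and qualifiers, plus references) can be sketched as plain data classes. This is a simplified illustration only: the class and field names are assumptions for this example and do not reproduce the actual Wikibase data model or its JSON serialization.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    prop: str                                       # property label, for readability
    value: str
    qualifiers: dict = field(default_factory=dict)  # optional qualifiers


@dataclass
class Statement:
    claim: Claim
    references: list = field(default_factory=list)  # possibly empty


@dataclass
class Item:
    item_id: str
    labels: dict                                    # language code -> label
    statements: list = field(default_factory=list)


# The Berlin example from the text; labels are per language, the ID is shared.
berlin = Item(item_id="Q64", labels={"en": "Berlin", "de": "Berlin"})
berlin.statements.append(
    Statement(Claim("type of administrative division", "state of Germany"))
)

for statement in berlin.statements:
    claim = statement.claim
    print(f"{berlin.labels['en']} ({berlin.item_id}): {claim.prop}: {claim.value}")
```

Keeping the references on the statement rather than on the claim mirrors the description above, where a claim is "accompanied by" a list of references.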

The properties (in Fig. 3 “Population”) are entities that are explained in the property section and can, just like items, be created by the users. Examples are: date of birth (P569, date on which the subject was born), signature (P109, image of a person's signature) or ancestral home (P66, place of origin in China for ancestors of subject). The information currently available in Wikidata is added by users, partly manually and partly by bots. The underlying software is Wikibase. There cannot yet be an example entry for the expressions of “Hamburger”, as the possible approaches to Wikidata entries for linguistic items are the topic of this document.

Motivation

It seems obvious that having access to structured linguistic data can be of great benefit for these 170 language-dependent Wiktionary-versions.

  • Firstly, it would reduce the editing-effort, as information can be drawn from the database automatically if desired.
  • Secondly, the same holds for corrections, which could then take effect in all entries in all languages at once, so that a higher information quality can be achieved.
  • Thirdly, this may lead to more extensive entries also in Wiktionaries from smaller languages.
  • Fourthly, having a vast collection of free, structured linguistic data will be very useful for natural language processing applications, researchers, linguists and people “just personally” interested in linguistic structures that can be browsed easily.

There is also a need for certain unstructured information, not least definitions and explanations of foreign words in one's own language, which may and will diverge in context due to different cultural and linguistic backgrounds. The intention is therefore not at all to replace Wiktionary with Wikidata, but only to help maintain and extend it by offering both a structure and a basis for anchoring information.


Comparison of the Structure of other Projects

In this section, the underlying structures of the projects WordNet, EuroWordNet and OmegaWiki shall be compared in order to display structural differences between them and Wiktionary.

WordNet

Figure 4: Example entry "Hamburger", WordNet

Structure

WordNet is a free dictionary for English developed at Princeton University. Every entry (a word or multi-word term) is associated with one or more so-called synsets. These group together words that are synonymous in a certain context; the two synsets for the expression “copper” listed below serve as examples. Further, a gloss is provided, giving a short explanation/definition of the synset. In most cases, there are also example sentences.[3]

  1. S: (n) bull, cop, copper, fuzz, pig (uncomplimentary terms for a policeman)
  2. S: (n) copper, copper color (a reddish-brown color resembling the color of polished copper)

The synsets are linked to each other by ontological relations, mostly hyponym/hypernym relations. This means that, for instance, the synset {photograph, photo, exposure, picture, pic} can be represented as a direct hypernym of {wedding picture}, {still} or {snapshot, snap, shot}, while functioning as a direct hyponym of {representation}. Directed relations like “derivationally related form” are also included. The “sub-WordNets” of nouns, verbs, adjectives and adverbs are treated separately, with only very few cross-POS pointers. The structure, however, is the same in each of them. In total, there are 117,000 synsets.
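
The synset structure described above can be sketched in a few lines. This is a toy model built only from the examples given in the text; it does not use the WordNet database or any WordNet library, and the class and function names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # synsets are compared by identity, not by content
class Synset:
    lemmas: tuple
    gloss: str = ""
    hypernyms: list = field(default_factory=list, repr=False)
    hyponyms: list = field(default_factory=list, repr=False)


def add_hyponym(parent: Synset, child: Synset) -> None:
    """Record the relation in both directions."""
    parent.hyponyms.append(child)
    child.hypernyms.append(parent)


def hypernym_path(synset: Synset) -> list:
    """Follow the first hypernym upwards until a synset without one is reached."""
    path = []
    while synset.hypernyms:
        synset = synset.hypernyms[0]
        path.append(synset)
    return path


def synsets_for(word: str, synsets: list) -> list:
    """All synsets in which the word occurs, i.e. the senses of the entry."""
    return [s for s in synsets if word in s.lemmas]


representation = Synset(("representation",))
photograph = Synset(("photograph", "photo", "exposure", "picture", "pic"))
snapshot = Synset(("snapshot", "snap", "shot"))
add_hyponym(representation, photograph)
add_hyponym(photograph, Synset(("wedding picture",)))
add_hyponym(photograph, Synset(("still",)))
add_hyponym(photograph, snapshot)

copper_synsets = [
    Synset(("bull", "cop", "copper", "fuzz", "pig"),
           "uncomplimentary terms for a policeman"),
    Synset(("copper", "copper color"),
           "a reddish-brown color resembling the color of polished copper"),
]

print([s.lemmas[0] for s in hypernym_path(snapshot)])
# ['photograph', 'representation']
print(len(synsets_for("copper", copper_synsets)))
# 2
```

The lookup by word shows how one entry ("copper") maps to several synsets, while the hypernym path shows how the relations between synsets form a hierarchy.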

Terminological Contrasts and Comparison with Wiktionary

With regard to our terminology, a WordNet item corresponds to the term expression: one surface form in a certain language (in WordNet: English) with one POS tag. A WordNet synset is similar to what we define as sense, i.e. a certain meaning of a linguistic entity. In WordNet, however, it is represented by a set of synonyms, whereas Wiktionary represents a sense by attaching a gloss/definition to the linguistic entity.

WordNet thus handles synonymy between senses, as well as some other semantic relations, in the manner of a monolingual thesaurus. One of the main differences to Wiktionary lies in the representation of senses (synsets versus glosses/definitions and descriptions). Further, WordNet, in contrast to Wiktionary, does not provide morphological or phonological information about words, does not cover locutions and does not offer translations.

The Hamburger-Example

Note that WordNet only covers English words; therefore, there is no representation at all for the German words in the Hamburger example. Furthermore, WordNet does not provide an entry for the capitalized English word “Hamburger” either. Figure 4 therefore shows a sample entry for “Hamburger” as it would look according to the typical WordNet structure. As shown, there is only one synset in this case. This synset does not provide any synonyms, but only a gloss and relational information.

EuroWordNet

Figure 5: Example entry "Hamburger", EuroWordNet

Structure

While WordNet only covers English words, the EU project EuroWordNet also created wordnets in Dutch, Spanish, Italian, German, French, Czech and Estonian and linked them to each other, resulting in a multilingual database. The wordnet of one language is linked to that of another via interlingual links that are stored in the Inter-Lingual-Index (ILI). Since the purpose of these links is to match “equivalent” synsets in different languages, no relations between the individual ILI records are established. This task remains with the individual wordnets. This also makes them easy to extend, because no consensus over all groupings needs to be maintained.

Language-internal relations were broadened with the start of the project: new ones were added, and relations can now carry features such as conjunction or disjunction. For example, "airplane" can have the meronyms "door", "jet airplane" and "propeller", and the word "door" can have the holonyms "car", "room" or "airplane". EuroWordNet also stores links between synsets with different POS tags.

However, matching these synsets can be rather complicated: concepts might not exist in every language, or they can, together with their incoming and outgoing relations, be non-congruent. A concept might, for example, be a hyponym of another concept in one language but not in another. It is thus hard to infer relations from interlingual mappings.
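
The role of the Inter-Lingual-Index as an unstructured list of records to which language-specific synsets are attached can be sketched as follows. The synset identifiers, the ILI record names and the toy data are invented for this illustration (they follow the Hamburger example of this document) and do not come from the EuroWordNet database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    synset_id: str   # hypothetical identifier
    language: str
    lemmas: tuple


def equivalents(synset, synsets, ili_links, target_language):
    """Return the synsets of target_language linked to the same ILI record."""
    record = ili_links.get(synset.synset_id)
    if record is None:          # the concept has no ILI record at all
        return []
    return [
        other for other in synsets
        if other.language == target_language
        and ili_links.get(other.synset_id) == record
    ]


en_person = Synset("en-1", "en", ("Hamburger",))
de_person = Synset("de-1", "de", ("Hamburger",))
de_food = Synset("de-2", "de", ("Hamburger",))
synsets = [en_person, de_person, de_food]

# The ILI itself is flat: records carry no relations among each other.
ili_links = {
    "en-1": "ili-person-from-hamburg",
    "de-1": "ili-person-from-hamburg",
    "de-2": "ili-hot-sandwich",
}

print(equivalents(de_person, synsets, ili_links, "en"))  # one English synset
print(equivalents(de_food, synsets, ili_links, "en"))    # [] in this toy data
```

The empty result for the second lookup illustrates the matching problem described above: if no synset in the target language is attached to the same record, the mapping yields nothing, and relations would have to be sought within each wordnet separately.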

Terminological Contrasts and Comparison with Wiktionary

The terms item, synset and gloss are equal to those in WordNet, as explained above. The POS constraint between synsets from different languages, however, is relaxed: here, equivalence links may also hold between synsets whose items have different POS tags.

Just like WordNet, EuroWordNet uses synsets as its sense representation, which is a major difference to Wiktionary. In contrast to WordNet, EuroWordNet is multilingual to the extent that it connects synsets of seven languages with each other. Another important difference to Wiktionary, however, again lies in the exclusion of morphological and phonological information.

The Hamburger-Example

Since EuroWordNet, unlike WordNet, is not free to use, Figure 5 was constructed according to what the entry should look like, given the structure and other examples.

OmegaWiki

Fig. 6: Example entries "Hamburger", OmegaWiki

Structure

OmegaWiki is a multilingual open-source dictionary whose aim is to “describe all words of all languages with definitions in all languages, including lexical, terminological and ontological information”.

The internal structure relies on entries for one DefinedMeaning (DM) each, which is a combination of an expression and its definition. This definition is regarded as language-independent and is therefore translated into the different languages. In terms of the terminology introduced above, one DefinedMeaning thus corresponds to a sense. Hence, there are separate pages for the two examples below, because they are two distinct DefinedMeanings: in the first case, the expression “song” is combined with the definition “A musical piece with lyrics…”, in the second, the expression “song” is combined with the definition “The act of singing”.

  1. song: A musical piece with lyrics (or "words to sing"); prose that one can sing.
  2. song: The act of singing.

Further, there are entries for DMs in other languages that also have expressions with the same surface form. In this example, there is an entry for the Faroese word “song”, which translates to “bed”. This one, however, is represented in a different entry. There are lists per language of the different definitions a word can have, i.e. lists of all DMs associated with a surface form in that language. However, these pages only list the existing DMs per expression with their information; it is not possible to share identical information (such as syllabification within one language) between different DefinedMeanings. Such information therefore needs to be duplicated on all entry pages concerned.

Besides these DefinedMeaning pages, OmegaWiki also stores certain semantic and ontological relationships between the DMs. These include synonymy, antonymy etc. as well as hyponymy, translation into other languages etc.

Regarding these relations, it is possible to differentiate between exact and inexact ones. One example of this would be synonymy: while the English word “German” does not make any statement about the gender of the German person referred to, the German word “Deutsche” (as opposed to “Deutscher”) also encodes the information “sex: female”. Since there is no word that translates exactly into the English word “German”, where no information about the sex is represented, one could not use the hyponymy/hypernymy relation to express the mapping between these DefinedMeanings either. In the database, this is shown by the symbol “~”, meaning that the translation is not an exact one. Users can decide which language they want to use OmegaWiki in. There are more than 300 interface languages. If an entry or a piece of information does not exist in a language, it is displayed in English.
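
The DefinedMeaning structure, the marking of inexact translations and the fallback to English can be sketched as follows. This is a rough illustration and not OmegaWiki's actual schema: the class, its fields, the placement of the “~” marker in the rendered output and the placeholder definition for “German” are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DefinedMeaning:
    expression: str
    language: str
    definitions: dict                       # language code -> definition text
    # Each translation is (language code, expression, is_exact).
    translations: list = field(default_factory=list)

    def definition_in(self, lang: str) -> str:
        """Definition in the requested language, falling back to English.

        Assumes that an English definition ("en") always exists.
        """
        return self.definitions.get(lang, self.definitions["en"])

    def render_translations(self) -> list:
        """Prefix inexact translations with '~' (assumed rendering)."""
        return [
            f"{lang}: {'' if exact else '~'}{expr}"
            for lang, expr, exact in self.translations
        ]


# Two distinct DefinedMeanings for the same expression "song".
song_piece = DefinedMeaning(
    "song", "en",
    {"en": 'A musical piece with lyrics (or "words to sing"); '
           "prose that one can sing."},
)
song_act = DefinedMeaning("song", "en", {"en": "The act of singing."})

# Placeholder definition; "Deutsche" additionally encodes "sex: female".
german = DefinedMeaning(
    "German", "en",
    {"en": "person from Germany"},
    translations=[("de", "Deutsche", False)],
)

print(song_act.definition_in("fo"))   # falls back to the English text
print(german.render_translations())   # ['de: ~Deutsche']
```

Since each DefinedMeaning is its own object, information shared by both “song” entries would have to be stored twice, which corresponds to the duplication problem mentioned above.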

Terminological Contrasts and Comparison with Wiktionary

OmegaWiki offers translations, glosses in different languages and information about semantic relations. In this respect, its approach is similar to that of Wiktionary. One main difference, however, lies in the independence of the various language versions of Wiktionary. In Wiktionary, expressions can be explained/defined in the respective language. In OmegaWiki, translations of the concept definitions associated with the respective concept are stored.

The Hamburger-Example

The example of German/English ”Hamburger” in OmegaWiki is shown in Figure 6.

As described above, there are different pages for each DM. However, as shown in the box on the left-hand side, there are no separate pages for German “Hamburger” (person from Hamburg) and English “Hamburger” (person from Hamburg). Note also that the entry title on the right-hand side is “hamburger” instead of “Hamburger”, and that the German equivalent appears as a translation.

Overview

The above-mentioned projects WordNet, EuroWordNet, OmegaWiki and Wiktionary differ from each other to quite a large extent. Especially as they partly pursue different objectives, it is hard to say which ones are, in general, “better” than others with respect to their underlying structures.

In reference to the chapters above, some of the main structural differences between the four dictionaries are illustrated in the table below.

Comparison of the different Projects
System | Free | Open Source | Entry Scope | No. of Languages: Expressions | No. of Languages: Definitions | Translation
--- | --- | --- | --- | --- | --- | ---
WordNet | yes | yes | one expression with all its senses, classified into synsets | 1 | 1 | --
EuroWordNet | no | no | one expression with all its senses, classified into synsets | 7 | 7 | interlingual synset-links
Wiktionary | yes | yes | one surface-form with all its expressions with all their senses | 1062 (522 with >10 entries) | 170 | explanation in respective language
OmegaWiki | yes | yes | one sense of an expression (Defined Meaning) | 469 | 469 | translation of concept-definitions

Naturally, however, each project has positive and negative aspects, which can be seen as main criteria for the structure of dictionaries. These are outlined in this section.

Representation of Translations and Synonymies

In the projects discussed here, there are two different approaches to this matter: one is to link linguistic material from different languages to an (ontological) entity; the other is to establish translation and synonymy directly between senses. The advantage of the first approach lies in the smaller number of links: in the second approach, there will in the worst case be links from each language to every other one, yielding quadratic complexity. The second approach, however, allows a finer granularity of translations and synonymies, probably resulting in a higher quality of the information stored in the dictionary. A main advantage of online dictionaries is that space is not as much of a problem as it is for paper dictionaries, so one could take advantage of this fact and opt for the second form of translation/synonymy representation.
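
The difference in the number of links can be made concrete with a short calculation. The sketch below assumes the simplest case, namely a single concept that has exactly one sense in each of n languages and undirected links; the function names are chosen for this example only.

```python
def hub_links(n_languages: int) -> int:
    """First approach: every sense links to one shared abstract entity."""
    return n_languages


def pairwise_links(n_languages: int) -> int:
    """Second approach, worst case: every sense links to every other sense."""
    return n_languages * (n_languages - 1) // 2


for n in (7, 170):
    print(n, hub_links(n), pairwise_links(n))
# 7 7 21
# 170 170 14365
```

For the seven languages of the EuroWordNet row in the table above, the difference is small (7 versus 21 links per concept). For 170 languages, the number of Wiktionary language versions mentioned earlier, the worst case grows to 14,365 links for a single concept, which is the quadratic growth referred to above.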

Of the projects above, only Wiktionary uses this second structure: both EuroWordNet and OmegaWiki make use of abstract entities that serve as “anchors” for the linguistic material (see the illustrations of the structures in the respective chapters), and WordNet does not cover any languages other than English in the first place and therefore offers no translations at all. Regarding synonymy, it does link between the sets directly, without recourse to an abstract entity.

Required Knowledge of Foreign Language and Language Specificity of Definitions

One of the most valuable features of the structure of Wiktionary is that one does not need a deep understanding of a foreign language in order to understand what the respective foreign-language material means. This is obviously different in monolingual dictionaries such as WordNet, which does not cover any interlingual connections and thus cannot be evaluated in this respect. In OmegaWiki, there can be translations of definitions in all languages, making it multilingual. However, the content of the definition is translated instead of being formulated language-specifically. Fine differences in meaning may thus not be representable in certain cases. EuroWordNet also keeps the “definition” rather language-specific, due to the links between synsets in different languages. Here, however, a “translation” can only be represented if an equivalent synset actually exists in the language in question. If this is not the case, there is no possibility of representing the meaning of a synset in a different language at all; it can only be approximated vaguely via the relational information in the respective wordnets. This cannot be seen as a deficit of the structure of EuroWordNet, though, as its designated objective is to display equivalence relations between languages and not to give translations of foreign-language material.

The Scope of an Entry

It can be very handy to represent the different senses of one expression collectively, for instance when a user wants to look something up and is not quite sure what it refers to; in such cases, the gloss may not be sufficient for deciding between the different senses. If the senses are collected, the manual search effort is reduced. Furthermore, all information that is shared between the different senses (this could include pronunciation, etymology, morphology etc.) can be displayed more effectively, once for all senses. There is no true disadvantage to displaying the different senses of an expression grouped together, but there are certain advantages that can make the structure clearer and more concise and the look-up more user-friendly. All of the projects discussed make use of this entry-wide collection of senses for look-up, even though the representations vary: WordNet and EuroWordNet refer to them via synsets, OmegaWiki allows the display of either a DefinedMeaning (one sense) or an expression (with possibly various senses) on one page, and Wiktionary even groups together different expressions that may belong to different languages, as long as the surface form is identical. Since none of them prohibits the display of collected senses per expression when searching the database, this does not serve as a distinguishing feature between EuroWordNet, WordNet, OmegaWiki and Wiktionary. It should, however, be taken into account when structuring linguistic Wikidata entries (even though lucidity may not be a main criterion, as the Wiktionaries can process the Wikidata information very efficiently). Only Wiktionary, though, allows for structural flexibility with respect to storing information that holds for more than one sense.

Covered Linguistic Material

WordNet and EuroWordNet only cover words or multi-word expressions from a limited number of parts of speech. They do not cover phrases, semantically non-referring material, colloquial language or inflected forms. Wiktionary does cover them; OmegaWiki does so partly and would at least have the possibility due to its underlying structure.

Features

Regarding the possibilities of representing various kinds of linguistic information, the different projects do not differ to a very large extent in some respects: all four of them offer example sentences, translations (except WordNet) and, obviously, some form of definition. The same holds for relational information such as antonymy, hypernymy etc. Information about phonology or morphology (especially inflectional forms), however, is not featured in WordNet or EuroWordNet, and etymological information is covered only to a very limited extent by these two as well as by OmegaWiki. Of the four projects, Wiktionary is the only one accounting for this feature. Media files can be included in both OmegaWiki and Wiktionary.

Lemon

Since the lemon model might be a promising model for our purpose, its main structure will be outlined briefly.

Structure

The purpose of “lemon” is to offer a model for “sharing lexical information on the semantic web”. In our case, it may be useful for structuring Wikidata, as it imposes a structure with just the granularity that we wish to represent in the third proposal, i.e. it is language-dependent and differentiates between the surface form of a lexical entry and its sense, which refers to an ontology entry. For both relation types, multiple relations can be represented, and these can also be subcategorized into “common form” versus “variant” etc. The different categories are built as illustrated in the figure and are explained separately below.

Lexicon
A Lexicon contains all Lexical Entries of a certain language and labels them with the according language code.
Lexical Entry
A Lexical Entry represents one lexeme, i.e. a word or multi-word term in a certain language that has one or more forms and one or more senses.
Lexical Form
The Lexical Form of a certain Lexical Entry is described by its written representation. There can be various Lexical Forms of a Lexical Entry, which may be categorized into Canonical Form (the usual written representation), Other Form (which can be a different and less common spelling or, for example, an inflected form) and Abstract Form (a non-realizable form, for instance the stem of a word). Alternative forms can further be described via properties. An example could be “property: category plural”. Alternative written representations that are equally common can be represented accordingly. It is not necessary to decide on one variant.
Lexical Sense & Ontology
The Lexical Sense represents the relationship between the lexical entry and the ontology entry, thus, to what the entry refers. In the case of homonyms or polysemous words, one Lexical Entry refers to more than one Ontology Entry. Since one Ontology Entry can also be referred to by different Lexical Entries, there is a many-to-many-relationship between Lexical Senses and Ontology Entries.

Further features that may be of value for our plans include the possibility of representing either words or multi-word terms as lexical entries. It is also possible to store information about the decomposition into words and morphological compounds. Also it can be advantageous to be able to assign properties to the relations. The model also offers modules for automatic inflection generation, which will not be covered at this point but that may be interesting as soon as it comes to deciding if and how automatic information generation shall be handled.

Example

The following example, which also explains how translations can be handled, is taken from the lemon cookbook.

The left side of the illustration above shows three different Lexica (English, German, French), each of which has one Lexical Entry (“cat: LexicalEntry” in English, “chat: LexicalEntry” in French, “katze: LexicalEntry” in German). Each of these LexicalEntry boxes points to a Lexical Form with the written representation of the entry, including the language label. These relations carry the value “canonicalForm”. As described above, alternative forms can also be represented at various points in the system. Every Lexical Entry also points to a Sense, and these Senses are all interconnected by links carrying the label “translationOf”. Thus, translation happens between Senses. As explained above, these Senses would all point to the same Ontology Entry, but this is not illustrated in the figure.
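
The structure of this example can be mirrored in a few plain classes. The sketch below is not an RDF serialization of lemon and uses no lemon library: the class and attribute names merely echo the lemon terms used above, and the ontology identifier "ontology:Cat" is a hypothetical placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalForm:
    written_rep: str
    language: str


@dataclass(eq=False)  # senses are compared by identity
class LexicalSense:
    reference: str    # identifier of the Ontology Entry (hypothetical here)
    translation_of: list = field(default_factory=list, repr=False)


@dataclass
class LexicalEntry:
    canonical_form: LexicalForm
    other_forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)


@dataclass
class Lexicon:
    language: str
    entries: list = field(default_factory=list)


def link_translations(senses: list) -> None:
    """Interconnect all given senses, so translation holds between senses."""
    for sense in senses:
        sense.translation_of = [other for other in senses if other is not sense]


def translations(entry: LexicalEntry, lexica: list) -> list:
    """Find the entries in other lexica whose senses translate this entry's."""
    found = []
    for sense in entry.senses:
        for other in sense.translation_of:
            for lexicon in lexica:
                for candidate in lexicon.entries:
                    if any(s is other for s in candidate.senses):
                        found.append((lexicon.language,
                                      candidate.canonical_form.written_rep))
    return found


def make_entry(written_rep: str, language: str) -> LexicalEntry:
    sense = LexicalSense(reference="ontology:Cat")
    return LexicalEntry(LexicalForm(written_rep, language), senses=[sense])


cat = make_entry("cat", "en")
chat = make_entry("chat", "fr")
katze = make_entry("Katze", "de")
lexica = [Lexicon("en", [cat]), Lexicon("fr", [chat]), Lexicon("de", [katze])]
link_translations([cat.senses[0], chat.senses[0], katze.senses[0]])

print(translations(cat, lexica))
# [('fr', 'chat'), ('de', 'Katze')]
```

Note that the translation links are stored on the senses and not on the entries or forms, which is the design choice the illustration describes; the shared `reference` value stands in for the common Ontology Entry.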


Proposals

There are three main proposals that were made regarding the restructuring/extension of Wikidata in order to support Wiktionary.

Initial Proposal

The initial proposal was made by Denny Vrandečić and first announced on June 19, 2013. It is based on the introduction of two new entity types to Wikidata: expression and sense.

While the typical Wikidata item may have a label in every language (the English label for Q1749 is “Copenhagen”, the Danish label of the same item is “København” etc.), an expression would only have one label altogether. Since in this proposal the term (word or multi-word term) itself, together with its linguistic information, is what is of interest, there cannot be any translated word forms in other languages when talking about the same expression. An expression depends on the language it belongs to: the English word “Berliner” is a different expression from the German word “Berliner”. The expression “Berliner” would thus depend on the morphological surface form (and would not have a different label such as the French translation Berlinois/e or anything similar). In the case of the German word, there are (at least) two different meanings attached to the expression, “person from Berlin” and “doughnut with a sweet filling”, regardless of their different etymologies. Likewise, it would remain the same expression if the meanings required a different pronunciation, hyphenation etc., as long as the spelling is identical. This is where the notion of sense, as explained in the terminology, is introduced.

These short descriptions (such as “person from Berlin”) are called glosses. Hence, the expression “Berliner (German)” has two different senses that can be referred to with the glosses “person from Berlin” and “doughnut with a sweet filling”, respectively. The expression “Berliner (English)” has one sense, which can be referred to with the gloss “person from Berlin”. In Wikidata, there would therefore be two pages: “Berliner (English)” with the section “person from Berlin” and “Berliner (German)” with the sections “person from Berlin” and “doughnut with a sweet filling”.

Linguistic properties would be recorded as statements by the users, and both an expression and a sense can have statements. In the case of “Berliner (German)”, the statement regarding hyphenation would be attached to the expression, whereas statements regarding, for instance, synonymy or translation would need to be associated with the corresponding senses. This proposal does not plan any search aliases for inflections. Every derived form (plural forms, inflected verbs etc.) will be a discrete expression.
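As a rough illustration of the initial proposal, the two “Berliner” pages could be modelled as follows. The class names, field names and statement keys are assumptions made for this sketch, and the hyphenation value is only an illustrative entry; the proposal itself does not prescribe a concrete data format.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    gloss: str  # short description that identifies the sense
    statements: dict = field(default_factory=dict)  # e.g. synonymy, translation


@dataclass
class Expression:
    label: str     # the single label: the surface form itself
    language: str  # an expression belongs to exactly one language
    senses: list = field(default_factory=list)
    statements: dict = field(default_factory=dict)  # e.g. hyphenation

    @property
    def page_title(self):
        return f"{self.label} ({self.language})"


berliner_en = Expression("Berliner", "English", senses=[Sense("person from Berlin")])

berliner_de = Expression(
    "Berliner",
    "German",
    senses=[Sense("person from Berlin"), Sense("doughnut with a sweet filling")],
    statements={"hyphenation": "Ber-li-ner"},  # illustrative value on the expression
)

# Translation is stated on the sense, not on the expression as a whole.
berliner_de.senses[0].statements["translation"] = [
    (berliner_en.page_title, berliner_en.senses[0].gloss)
]

# An inflected form is not an alias but a discrete expression with its own page.
berliners_en = Expression("Berliners", "English")
```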

Alternative Proposal

The alternative proposal was made by User:Micru (David Cuenca) and User:Francis Tyers as a reaction to the first one and was announced on July 1, 2013. It is based on the introduction of two new entity types (defined meaning, bond) and one new data type (a paradigm).

One of the main differences from the initial proposal is the split between expressions and their senses: while in the initial proposal all senses of one expression are listed together on one page, in the alternative proposal there will be one page per sense. Following a similar terminology, these senses will be language-dependent as well (that is, “Berliner (German) – person from Berlin” will be a different entity from “Berliner (English) – person from Berlin”). What is referred to as a sense in the initial proposal is called a defined expression in this proposal. This is similar to the terminology in OmegaWiki, although not quite the same: the OmegaWiki-DM is based on a translatable definition, while in this proposal the defined expression can have its own definition in each language.

The second new entity type, the bond, shall replace property links to some extent by representing certain statements as the results of automatic searches, so that bonds are partly built automatically. This will happen whenever an automatic search or inference allows it. Examples could be the automatic linking of exact translations or exact synonymy. Since there are certain difficulties associated with these kinds of inferences (semantic drift etc.), a distinction between strong links (for example exact meronymy) and weak links (for example near-synonymy) is proposed in order to handle these phenomena better. Paradigms are language-dependent sets of rules for automatically generating derived forms. In this proposal, the generated forms shall serve as aliases to the base form of the defined expression (and optionally be stored as “inflected forms”).
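The idea of a paradigm can be illustrated with a minimal sketch. The rule set below is a deliberately simplified, hypothetical paradigm for regular English nouns, and all class and field names are assumptions; the proposal does not specify how paradigms or bonds would actually be encoded.

```python
from dataclasses import dataclass, field


@dataclass
class Paradigm:
    language: str
    rules: dict  # name of the derived form -> function(base form) -> derived form

    def generate(self, base_form):
        """Apply every rule of the paradigm to the base form."""
        return {name: rule(base_form) for name, rule in self.rules.items()}


@dataclass
class DefinedExpression:
    base_form: str
    language: str
    definition: str
    aliases: list = field(default_factory=list)  # generated forms, used for lookup


@dataclass
class Bond:
    source: str
    target: str
    relation: str  # e.g. "translation", "synonymy"
    strength: str  # "strong" (exact) or "weak" (near)


# Hypothetical, heavily simplified paradigm for regular English nouns.
EN_REGULAR_NOUN = Paradigm(
    language="English",
    rules={
        "plural": lambda base: base + "s",
        "possessive": lambda base: base + "'s",
    },
)

entry = DefinedExpression("Berliner", "English", "person from Berlin")
entry.aliases = list(EN_REGULAR_NOUN.generate(entry.base_form).values())

# A strong bond for an exact translation between two defined expressions.
bond = Bond(
    source="Berliner (English) – person from Berlin",
    target="Berliner (German) – person from Berlin",
    relation="translation",
    strength="strong",
)
```

A real paradigm would also need to cover irregular forms and language-specific exceptions, which is one reason why automatic generation is treated cautiously in the later discussion.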

Third Proposal

Fig. 7: Example entries "Hamburger", Third Proposal
Fig. 8: Example entry third proposal

The third proposal emerged predominantly out of discussions about the initial and the alternative one. It was put forward on August 2, 2013 by Denny Vrandečić. This proposal uses a slightly different terminology which is introduced below. The terms sense and gloss, however, are defined the same way as in the terminology.

  • A lexeme, also known as word or lexical entry, is what is described on one page in the lexical part of Wikidata. A lexeme consists of a lemma, a lexical category, a language, a set of forms, a set of senses, and a set of statements.
    • The lemma is the canonical form or dictionary form of the lexeme, e.g. for verbs this is usually the infinitive form, for a noun the nominative singular, etc.
    • The lexical category, also known as the part of speech or word class, defines the lexeme to be either a noun, or a verb, or an adjective, etc. The set of possible values is open and taken from the Wikidata items.
    • The language of a lexeme is taken from Wikidata items, and thus an open set.
    • A form is a specific, fully conjugated or inflected form of the lexeme. A form consists of a representation, a set of lexical properties, and a set of statements. A form always belongs to one (and exactly one) lexeme.
      • A representation is the actual string value realizing a given form, e.g. the string value "wrote" for the past tense of the lexeme for "write". All representations are indexed for search.
      • A lexical property describes the form, e.g. tense or number for verbs, case for nouns, etc. This is an open set and points to Wikidata items.
    • A sense is described by a gloss and has a set of statements. A sense always belongs to one (and exactly one) lexeme (and lexemes belong to one language only). Senses are not independent of lexemes.
      • A gloss is a short description (translatable in all languages of the Wikidata UI) of one sense of the given lexeme.
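The terminology above translates fairly directly into a nested data structure. The following sketch is one possible reading of it, not the data model of the proposal itself: the field names follow the list above, while the item identifiers (such as `item:verb`) and the gloss text are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    representation: str  # actual string value, indexed for search
    lexical_properties: list = field(default_factory=list)  # references to items
    statements: list = field(default_factory=list)


@dataclass
class Sense:
    gloss: dict  # UI language code -> short description of the sense
    statements: list = field(default_factory=list)


@dataclass
class Lexeme:
    lemma: str             # canonical (dictionary) form
    lexical_category: str  # reference to an item, e.g. for "verb"
    language: str          # reference to an item, e.g. for "English"
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)
    statements: list = field(default_factory=list)


# All "item:..." values are placeholders, not real Wikidata identifiers.
write = Lexeme(
    lemma="write",
    lexical_category="item:verb",
    language="item:English",
    forms=[
        Form("write", ["item:infinitive"]),
        Form("wrote", ["item:past-tense"]),
    ],
    senses=[Sense({"en": "to form letters or words on a surface"})],
)

representations = [form.representation for form in write.forms]
```

Because forms and senses are nested inside the lexeme, each of them belongs to exactly one lexeme by construction, as the definitions require.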

The terms Wikidata item, property, string value, qualifier, statement, and claim are taken from the Wikidata glossary and have the same meaning here. See also the further glossary.


Some of the most important alterations to the previous proposals are the following:

The ”basic” unit is the lexeme. It is not the expression, as in the initial proposal, where each morphological form was a separate expression with its own entry page; nor the sense, as in the alternative proposal (a sense may be only one part of a lexeme if the lexeme is polysemous or homonymous); nor the language-independent surface form, as is the case in Wiktionary.

Senses, forms and lexemes can have statements. This differs from the initial proposal, which demanded a separate entry for every derived form. In the third proposal, inflections are ”alternative forms” that may, but need not, have statements separate from those of their lemma. While the alternative proposal suggested statements at sense level (and, depending on the implementation of inflections, statements on either all or no inflected forms), the third proposal makes it possible to attach a statement wherever it is most useful. This way, all necessary distinctions can still be drawn, but shared information can be stored less redundantly.

Inflections are handled as aliases for search and do not need a separate entry. This is similar to the alternative proposal. However, in the third proposal, decisions about what may be computed automatically, for example via paradigms, are postponed to a stage where there is enough linguistic data in Wikidata for a more detailed discussion of this matter. Figure 8 shows the example entry, taken from the proposal, in more detail than the more schematic "Hamburger" example.
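How inflected forms can act as search aliases without having their own entries may be sketched as a simple lookup index from representations to lexemes. The lexeme identifiers and the dictionary layout are assumptions made for this illustration; the proposal leaves the actual search implementation open.

```python
from collections import defaultdict

# Hypothetical lexeme records; the identifiers are placeholders.
LEXEMES = {
    "lexeme:write": {"lemma": "write", "forms": ["write", "writes", "wrote", "written"]},
    "lexeme:see": {"lemma": "see", "forms": ["see", "sees", "saw", "seen"]},
    "lexeme:saw": {"lemma": "saw", "forms": ["saw", "saws"]},
}


def build_form_index(lexemes):
    """Map every representation to the set of lexemes that contain it."""
    index = defaultdict(set)
    for lexeme_id, lexeme in lexemes.items():
        for representation in lexeme["forms"]:
            index[representation].add(lexeme_id)
    return index


def search(index, query):
    """Return the identifiers of all lexemes that have the queried form."""
    return sorted(index.get(query, set()))


index = build_form_index(LEXEMES)
wrote_hits = search(index, "wrote")  # resolves to the entry for "write"
saw_hits = search(index, "saw")      # one surface form, two lexemes
```

The lookup here is case-sensitive on purpose, since forms such as "Hamburger" and "hamburger" are treated as different in this document.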

The Hamburger-Example

The "Hamburger"-example would in this case be represented as in Figure 7.

Overview

The table shows a comparison of some details of the three proposals.

Comparison of the Wiktionary/Wikidata-Proposals
| Proposal | Entry Scope | Inflection Handling | Statements | Storage in Wikidata |
|---|---|---|---|---|
| Initial | one expression; each morphological form separately | own entry for each inflection | possible on both expression and sense | via statements |
| Alternative | one sense of an expression | aliases to base form; stored as inflected form | attached to sense | via bonds |
| Third | lexeme with all its senses | attached to lexeme via form; can hold own statement | possible on lexeme, sense and form | via statements |


Further Wikidata-Terminology

The following definitions are taken from the Wikidata glossary and are shortened at some points.

claim
A claim is a piece of data about the entity on whose page the claim appears. A claim consists of a property (such as "Location") and a value (e.g., "Germany"), or some other relation or composite or missing value. A claim can have qualifiers, such as temporal qualifiers saying that the claim is valid within a specific time frame. Compared to the triplets used in linked data, a claim uses a property to express the predicate of a triplet and a value to express the object of a triplet. Claims form part of statements on item pages.
item
A Wikidata item is a page in the Wikidata main namespace that represents a real-life topic, concept, or subject. Items are identified by a prefixed id, or by a sitelink to an external page, or by a unique combination of multilingual label and description. Items may also have aliases to ease lookup. The main data part of an item is the list of statements about the item. An item can be viewed as the subject-part of a triplet in linked data.
property
A Wikidata property (in some languages translated to attribute) is the descriptor for a data value, or some other relation or composite or possibly missing value, but not the data value or values themselves. Each statement at an item page links to a property, and assigns the property one or several values, or some other relation or composite or possibly missing value.
qualifier
A qualifier is a part of the claim that says something about the specific claim, often in a descriptive way. A qualifier might be a term according to a specific vocabulary but can also be a variant descriptive phrase.
statement
A statement is a piece of data about an item, recorded on the item's page. A statement consists of a claim (a property-value pair such as "Location: Germany", together with optional qualifiers), augmented by optional references (giving the source for the claim) and an optional rank (used to distinguish between several claims containing the same property). Wikidata makes no assumptions about the correctness of statements, but merely collects and reports them with a reference to a source.
string
A string (short for character string) is a general term for a sequence of freely chosen characters interpreted as text (e.g. "Hello"), as opposed to data interpreted as a numerical value (3.14), a link to an item (e.g. Q1234) or a more complex datatype (the set {1,3,5,7}). In addition to a string datatype, Wikidata will support language-specific texts, "monolingual-text" and "multilingual-text", as the value of a property.
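The relation between item, claim, qualifier and statement described in this glossary can be summarised in a small sketch. The class and field names are assumptions chosen to mirror the glossary terms, and the qualifier and reference values are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Claim:
    prop: str   # the property, e.g. "Location"
    value: Any  # the value, e.g. "Germany"
    qualifiers: dict = field(default_factory=dict)  # e.g. temporal restrictions


@dataclass
class Statement:
    claim: Claim
    references: list = field(default_factory=list)  # sources for the claim
    rank: Optional[str] = None  # distinguishes claims with the same property


def as_triple(item_id, statement):
    """View a statement as a linked-data triplet: (subject, predicate, object)."""
    return (item_id, statement.claim.prop, statement.claim.value)


statement = Statement(
    claim=Claim("Location", "Germany", qualifiers={"valid from": "1990"}),  # illustrative
    references=["hypothetical source"],
    rank="preferred",
)

triple = as_triple("Q1234", statement)  # Q1234 is only an example id
```

The item supplies the subject, the property the predicate and the value the object of the triplet; qualifiers, references and rank have no direct counterpart in a plain triplet.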

References

  1. http://en.wiktionary.org/wiki/Wiktionary:Statistics
  2. http://meta.wikimedia.org/wiki/Wiktionary/Table
  3. AboutWordNet