Wikidata:Comparison of Projects and Proposals for Wiktionary


Introduction

The aim of this work is to discuss a new proposal, based on two earlier ones, for a possible support of Wiktionary by Wikidata, which would provide structured linguistic information that can then be represented in the Wiktionaries. This includes answering certain questions in relation to Wikidata's current structure, since not all decisions follow automatically. Examples are the scope and delimitation of a linguistic unit in Wikidata, the handling of derived but not synonymous forms, their structured representation, and the structure of an entry that may itself consist of different subcategories. In addition, the structures of similar projects (WordNet, EuroWordNet and OmegaWiki) and of lemon, a model for sharing lexical information, are covered.

Each section contains an example to illustrate the differences between them. In each case, the example shows how the different structures represent the word "Hamburger" in English, meaning "person from Hamburg", and the word "Hamburger" in German, which can mean either "person from Hamburg" or "hot sandwich consisting of a patty of cooked ground beef in a sliced bun, sometimes also containing salad vegetables, condiments, or both" (note that this can be translated as "hamburger", which, however, is not the same as "Hamburger" because of the capitalization). In each system, a box represents an entry, i.e. a separate item page. Naturally, in most cases the content of these boxes is greatly reduced in order to give a more schematic representation.

Terminology

To avoid misunderstandings, the following is a brief description of the terminology that is used throughout the document. A more extensive glossary of terms used in the following chapters can be found under Further Wikidata-Terminology. Terms that are needed only in certain sections are introduced there.

surface form
This term is used only to refer to the morphological surface form (how a word is written), neglecting phonological forms (how a word is pronounced) etc. This also means that the surface form of "go", a French noun for a board game, is the same as that of "go", the English verb. The words "Polish" and "polish", however, do not share the same surface form, because only one of them is capitalized.
gloss
A gloss is a short description of what the word/expression denotes. It need not be as detailed as a definition that can be used interchangeably with the word/expression. An example of a word together with its gloss is "Fado - musical genre".
expression
The term expression refers to identical surface forms that also share the feature value for language and have the same morphological properties, and thus belong to the same part of speech. The English adjective "blue" referring to the colour and the English adjective "blue" referring to a melancholic state of mind are, for example, the same expression (only with different meanings). The English verbs "tear" (to rend or separate) and "tear" (to produce tears) are not, because of different morphological values such as different inflected forms: tear/tore/torn versus tear/teared/teared. The English noun "bike" and the English verb "bike" are not the same expression because of different parts of speech, and neither are the English noun "chat" (an exchange of text or voice messages in real time over a computer network) and the French noun "chat" with the same meaning, because of different languages.
sense
The sense of a word/phrase is the word/phrase in combination with the meaning it is associated with. The English word "chat" can, for example, have the meaning of being engaged in an informal conversation; "chat" in French can either have the same meaning or that of "cat", but their senses are different, because the expressions differ. A sense depends on the term it belongs to. To give another example: "Le Mépris" (film by Godard) and "Pierrot le fou" (film by Godard) are both described with the same gloss ("film by Godard"), but they do not share senses. The sense in the first example is "Le Mépris" (film by Godard) and in the second "Pierrot le fou" (film by Godard).
entry
The term entry denotes the basic unit of presentation in any of the different dictionaries and is in this respect equivalent to the term page. For Wikidata, an entry is a wiki page; it is the basic editorial unit and is used to track editing history, authorship etc.
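The definition of expression above can be read as an identity test over four features. The following is a minimal sketch of that reading; the class name, the field names and the use of a list of inflected forms as a stand-in for "morphological properties" are assumptions made for illustration and are not part of any of the projects discussed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Expression:
    surface_form: str    # written form, compared case-sensitively
    language: str
    part_of_speech: str
    inflection: tuple    # simplified stand-in for "morphological properties"

# Two meanings of "blue" belong to one expression: all four features agree.
blue_colour = Expression("blue", "English", "adjective", ("blue", "bluer", "bluest"))
blue_mood = Expression("blue", "English", "adjective", ("blue", "bluer", "bluest"))

# The two verbs "tear" differ in their inflected forms.
tear_rend = Expression("tear", "English", "verb", ("tear", "tore", "torn"))
tear_cry = Expression("tear", "English", "verb", ("tear", "teared", "teared"))

# The nouns "chat" differ only in language.
chat_en = Expression("chat", "English", "noun", ("chat", "chats"))
chat_fr = Expression("chat", "French", "noun", ("chat", "chats"))

print(blue_colour == blue_mood)  # True
print(tear_rend == tear_cry)     # False
print(chat_en == chat_fr)        # False
```

A sense would then be a pair of such an expression and a meaning, which is why two expressions can carry the same gloss without sharing a sense.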

Wiktionary and Wikidata

This section aims to introduce the functionality of both the current Wiktionary and Wikidata in order to lay the groundwork for the analyses of the three proposals. These overviews are followed by a brief motivation of why a possible connection of the two wiki projects may be useful.

Wiktionary

Fig. 1: Example entry "Hamburger", Wiktionary
Fig. 2: Alternative example entry "Hamburger", Wiktionary

The approach of the open-source Wiktionary is to have a multilingual dictionary with entries from all languages. In the English Wiktionary there is information about the German word "Unterstrich" as well as about the Hungarian word "tüzes" etc. All information in the entry (except for any foreign-language material) is written in English, and English words have entries too. In the Thai Wiktionary the entries are written in Thai, and so on. Currently there are expressions from 1062 languages, of which 522 have more than ten entries[1], in the English version alone. Currently there are 170 language versions[2].


The entries are structured as follows: one entry handles all expressions from all languages that share one morphological surface-form. This means, for example, that both the German adjective “arm” and the English noun “arm” are associated with the same entry. The sub-sections are structured according to language (the respective Wiktionary’s language has priority when it comes to ordering the sub-sections) and POS. The page which covers a surface-form is thus divided into expression-sections, which in turn are divided into sense-sections if one expression holds more than one sense. If property-values such as pronunciation or etymology are shared, they can appear at the beginning of the respective section; if they differ, they are split up into their subsections. There are some smaller differences between the different language versions. The German one, for example, also lists translations to other languages (not to be confused with Wiktionaries in other languages that handle the same surface-form) and links to those that are available in the German Wiktionary, whereas the English or French Wiktionaries do not treat translations into other languages. Apart from such things, the different language-versions follow a similar structure.
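The nesting described above (surface-form, then language, then POS, then senses) can be sketched as a plain nested mapping. This is only an illustration of the page layout, not Wiktionary's actual storage format; the dictionary keys, the helper name `ordered_languages` and the two glosses are assumptions chosen for the "arm" example.

```python
# One entry per surface form; sections per language, then per part of speech.
entry = {
    "surface_form": "arm",
    "sections": {
        "English": {"noun": ["upper limb of the human body"]},
        "German": {"adjective": ["poor"]},
    },
}

def ordered_languages(entry, home_language):
    """Return the language sections with the Wiktionary's own language first."""
    languages = sorted(entry["sections"])
    if home_language in languages:
        languages.remove(home_language)
        languages.insert(0, home_language)
    return languages

print(ordered_languages(entry, "German"))   # ['German', 'English']
print(ordered_languages(entry, "English"))  # ['English', 'German']
```

Ordering the remaining languages alphabetically is a choice made here for the sketch; the text above only states that the home language has priority.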


The information in Wiktionary includes pronunciation (both IPA/SAMPA transcripts and audio documents), etymology, anagrams, synonyms, hypernyms etc., inflection tables, example sentences, translations (and, if existing, links to the corresponding translation), links to Wikipedia pages about the concept(s) behind the sense(s) of the expression(s) and links to entries in other Wiktionaries which also handle the respective surface-form. Wiktionary also has entries for compositions, acronyms, abbreviations, misspellings and simplified spellings.

Figure 1 shows how Wiktionary represents the "hamburger"-example.

Note that in some Wiktionaries, features shared by the two German words, such as the pronunciation in the example above, may be placed in front of the respective noun-sections in order to avoid redundancy (see Figure 2).

Wikidata

Fig. 3: Example statement in Wikidata

Wikidata is an open-source database that stores structured data language-independently. The information about items is thus shared by every language version, since the labels of the concepts (that are linked via property links) and the property links themselves are represented in different languages but anchored at the corresponding ID of the item/property. Further, a third entity type is planned: queries. These shall be used for the automatic generation of lists, such as a list of food additives or lists of rivers with the corresponding information.

Every item can have a list of statements. The information that Berlin (Q64) has the status of a state in Germany for example is represented by a statement claiming “type of administrative division: state of Germany”. These claims, consisting of a property and a value, and potentially qualifiers, are accompanied by a (possibly empty) list of references. An example statement is given in Figure 3.
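The anatomy of a statement described above (a claim made of property, value and optional qualifiers, plus a possibly empty list of references) can be sketched with a few small classes. This is a simplified illustration and not the Wikibase data model or its JSON serialization; all class and field names are assumptions, and the property is referred to by its label because the text does not give its ID.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    property_label: str                              # e.g. "type of administrative division"
    value: str
    qualifiers: dict = field(default_factory=dict)   # optional

@dataclass
class Statement:
    claim: Claim
    references: list = field(default_factory=list)   # may be empty

@dataclass
class Item:
    item_id: str
    labels: dict                                     # language code -> label
    statements: list = field(default_factory=list)

berlin = Item("Q64", {"en": "Berlin", "de": "Berlin"})
berlin.statements.append(
    Statement(Claim("type of administrative division", "state of Germany"))
)

for statement in berlin.statements:
    claim = statement.claim
    print(f"{berlin.item_id}: {claim.property_label}: {claim.value} "
          f"({len(statement.references)} references)")
```

Because the labels hang off the item ID rather than the other way round, every language version sees the same statements and only the displayed labels change.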

The properties (in Fig. 3 “Population”) are entities that are explained in the property-section and can, just like other items, be created by the users. Examples are: date of birth (P569, date on which the subject was born), signature (P109, image of a person's signature) or ancestral home (P66, place of origin in China for ancestors of subject). The information currently available in Wikidata is added by users, partly manually and partly by bots. The underlying software is Wikibase. There cannot be an example entry for the expressions of “Hamburger” yet, as the possible approaches to Wikidata-entries for linguistic items are the topic of this document.

Motivation

It seems obvious that having access to structured linguistic data can be of great benefit for these 170 language-dependent Wiktionary-versions.

  • Firstly, it would reduce the editing-effort, as information can be drawn from the database automatically if desired.
  • Secondly, as the same holds for corrections that could then have an effect in all entries in all languages at once, a higher information quality can be achieved.
  • Thirdly, this may lead to more extensive entries also in Wiktionaries from smaller languages.
  • Fourthly, having a vast collection of free, structured linguistic data will be very useful for natural language processing applications, researchers, linguists and people “just personally” interested in linguistic structures that can be browsed easily.

While acknowledging the need for certain unstructured information as well (not least, definitions and explanations of foreign words in one's own language, which might and will diverge in context due to different cultural/language backgrounds), there is no wish at all to replace Wiktionary by Wikidata, but only to help maintain and extend it by offering both a structure and a basis for anchoring information.

Comparison of the Structure of other Projects

In this section, the underlying structures of the projects WordNet, EuroWordNet and OmegaWiki shall be compared in order to display structural differences between them and Wiktionary.

WordNet

Figure 4: Example entry "Hamburger", WordNet

Structure

WordNet is a free dictionary for English from Princeton University. Every entry (a word or multi-word term) is associated with one or more so-called synsets. These group together words that are synonymous in a certain context; see the two synsets for the expression “copper” as examples. Further, a gloss is provided, giving a short explanation/definition of the synset. In most cases, there are also example sentences.[3]

  1. S: (n) bull, cop, copper, fuzz, pig (uncomplimentary terms for a policeman)
  2. S: (n) copper, copper color (a reddish-brown color resembling the color of polished copper)

The synsets are linked to each other by ontological relations, mostly hyponym/hypernym-relations. This means that, for instance, the synset {photograph, photo, exposure, picture, pic} can be represented as a direct hypernym for {wedding picture}, {still} or {snapshot, snap, shot}, while functioning as a direct hyponym of {representation}. Directed relations like “derivationally related form” are also included. The “sub-WordNets” of nouns, verbs, adjectives and adverbs are treated separately, with only very few cross-POS-pointers. The structure, however, is the same in each one of them. In total, there are 117,000 synsets.
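The hypernym/hyponym links from the photograph example can be sketched as a small graph over synsets. This is a toy model and not the WordNet database format or any WordNet API; synsets are represented as tuples of words, and the assumption that each synset has exactly one direct hypernym is a simplification made for the sketch.

```python
# Synsets from the example above, written as tuples of their member words.
PHOTOGRAPH = ("photograph", "photo", "exposure", "picture", "pic")
REPRESENTATION = ("representation",)
WEDDING_PICTURE = ("wedding picture",)
STILL = ("still",)
SNAPSHOT = ("snapshot", "snap", "shot")

# Directed relation: synset -> its direct hypernym.
direct_hypernym = {
    WEDDING_PICTURE: PHOTOGRAPH,
    STILL: PHOTOGRAPH,
    SNAPSHOT: PHOTOGRAPH,
    PHOTOGRAPH: REPRESENTATION,
}

def hypernym_chain(synset):
    """Follow hypernym links upwards until no further link is stored."""
    chain = []
    while synset in direct_hypernym:
        synset = direct_hypernym[synset]
        chain.append(synset)
    return chain

def direct_hyponyms(synset):
    """Invert the relation: all synsets whose direct hypernym is `synset`."""
    return [child for child, parent in direct_hypernym.items() if parent == synset]

print(hypernym_chain(SNAPSHOT))
print(direct_hyponyms(PHOTOGRAPH))
```

Storing only the hypernym direction and deriving hyponyms by inversion keeps the two views consistent with each other.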

Terminological Contrasts and Comparison with Wiktionary

With regard to our terminology, one WordNet-item correlates with the term expression: one surface-form with a certain language (in WordNet: English) and one POS-tag. A WordNet-synset is similar to what we define as sense – a certain meaning of a linguistic entity. Here, however, it is represented by a set of synonyms, whereas Wiktionary represents a sense by attaching a gloss/definition to the linguistic entity.

WordNet thus tackles the task of synonymy between senses and also some other semantic relations in the manner of a monolingual thesaurus. One of the main differences to Wiktionary lies in the different representations of senses (synsets versus glosses/definitions and descriptions). Further, WordNet, in contrast to Wiktionary, does not provide morphological or phonological information about words, does not cover locutions and does not offer translations.

The Hamburger-Example

Note that WordNet only covers English words, therefore there is no representation at all for the German words in the hamburger example. Furthermore, WordNet does not provide any entries for the capitalized word ”Hamburger” in English, either. Figure 4 therefore shows a sample entry for ”Hamburger” as it would be according to the typical WordNet-structure. As shown, there is only one synset in this case. Also, this one does not provide any synonyms, but only a gloss and relational information.

EuroWordNet

Figure 5: Example entry "Hamburger", EuroWordNet

Structure

While WordNet only covers English words, with the start of the EU-project EuroWordNet, WordNets in Dutch, Spanish, Italian, German, French, Czech and Estonian were also created and linked to each other, resulting in a multilingual database. Via the interlingual links that are stored in the Inter-Lingual-Index (ILI), wordnets from one language are linked to another. Since the purpose of these links is to match “equivalent” synsets in different languages, no relations between the single ILI-Records are established. This task remains with the single WordNets. This also allows an easy extension of them, because no consensus over all groupings needs to be maintained.

Language-internal relations have been broadened with the start of the project, new ones were added and relations now have features such as conjunction or disjunction - "airplane" can have the meronyms "door", "jet airplane" and "propeller". The word "door" can have the holonyms "car", "room" or "airplane". Also, in EuroWordNet, links between synsets with different POS-tags are stored.

However, matching these synsets can be rather complicated – concepts might not exist in different languages or can, together with their in- and outgoing relations, be non-congruent. A concept might for example only be a hyponym of another in one language, but not in another. It is thus hard to conclude or infer relations from interlingual mappings.

Terminological Contrasts and Comparison with Wiktionary

The terms item, synset and gloss are equal to those in WordNet, as explained above. The POS-constraint between synsets from different languages, however, is softened: here, equivalence-links may also be between synsets with items with different POS-tags.

Just as WordNet, EuroWordNet uses synsets as sense-representation, which is a major difference to Wiktionary. EuroWordNet in contrast to WordNet is multi-lingual to the extent that it connects synsets of seven languages with each other. Another important difference to Wiktionary, however, lies again in the exclusion of morphological and phonological information.

The Hamburger-Example

Since the use of EuroWordNet, unlike that of WordNet, is not free, Figure 5 is built according to what the entry should look like, judging by the structure and other examples.

OmegaWiki

Fig. 6: Example entries "Hamburger", OmegaWiki

Structure

OmegaWiki is a multilingual open-source dictionary whose aim is to „describe all words of all languages with definitions in all languages, including lexical, terminological and ontological information”.

The internal structure relies on entries regarding one DefinedMeaning (DM), which is a combination of an expression together with its definition. This definition is regarded as language-independent and is therefore translated into the different languages. In the terms discussed in the terminology, one DefinedMeaning thus corresponds to a sense. Hence, for the two examples below, there are separate pages because they are two distinct DefinedMeanings: in the first case, the expression “song” is combined with the definition “A musical piece with lyrics…”, and in the second one, the expression “song” is combined with the definition “The act of singing”.

  1. song: A musical piece with lyrics (or "words to sing"); prose that one can sing.
  2. song: The act of singing.

Further, there are entries for DMs in other languages that also have expressions with the same surface form. In this example, there is an entry for the Faroese word “song”, which translates to “bed“. This one, however, is represented on a different entry. There are lists per language with the different definitions a word can have, i.e. listing all DMs associated with surface-forms per language. However, these pages only list the existing DMs per expression with their information and it is not possible to share identical information (such as syllabication within one language) between different DefinedMeanings. These therefore need to be duplicated on all concerned entry-pages.
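The one-page-per-DefinedMeaning layout and the resulting duplication of shared information can be sketched as follows. This is an illustration only and does not reflect OmegaWiki's database schema; the class, the field names, the helper `meanings_for` and the `hyphenation` annotation are assumptions made for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class DefinedMeaning:
    expression: str
    language: str
    definitions: dict                                  # language code -> definition text
    annotations: dict = field(default_factory=dict)    # stored per page, not shared

pages = [
    DefinedMeaning(
        "song", "English",
        {"en": 'A musical piece with lyrics (or "words to sing"); prose that one can sing.'},
        {"hyphenation": "song"},
    ),
    DefinedMeaning("song", "English", {"en": "The act of singing."},
                   {"hyphenation": "song"}),           # same value, entered again
    DefinedMeaning("song", "Faroese", {"en": "bed"}),
]

def meanings_for(surface_form, language, pages):
    """Build the per-language list page: all DMs of one expression."""
    return [dm for dm in pages
            if dm.expression == surface_form and dm.language == language]

english_song = meanings_for("song", "English", pages)
print(len(english_song))                               # 2
print([dm.annotations["hyphenation"] for dm in english_song])
```

The list page is computed from the DM pages, so it can collect the meanings but cannot hold a single shared copy of the annotation.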

Besides these DefinedMeaning-pages, OmegaWiki also stores certain semantic and ontological relationships between the DMs. These include synonymy, antonymy etc. as well as hyponymy, translation to other languages etc.

Regarding these relations, it is possible to differentiate between exact relations and inexact ones. One example of these would be synonymy: While the English word “German” does not make any statement about the gender of the German person that is talked about, the German word “Deutsche” (as opposed to “Deutscher”) also encodes the information “sex: female”. Since there is no word that would exactly translate into the English word “German” (where no information about the sex is represented), one could not use the hyponymy/hypernymy-relation to express the mapping between these DefinedMeanings, either. In the database, this is shown by the symbol “~”, meaning the translation is not an exact one. The user can decide which language to use OmegaWiki in. There are more than 300 interface-languages. If an entry or piece of information does not exist in a language, it is displayed in English.

Terminological Contrasts and Comparison with Wiktionary

OmegaWiki offers translations, glosses in different languages and information about semantic relations. In this matter, its approach is similar to the one of Wiktionary. One main difference, however, lies in the independence of the various language-versions of Wiktionary. In Wiktionary, expressions can be explained/defined in the respective language. In OmegaWiki, translations of concept-definitions which are associated with the respective concept are stored.

The Hamburger-Example

The example of German/English ”Hamburger” in OmegaWiki is shown in Figure 6.

As described above, there are different pages for each DM. However, as shown in the box on the left-hand side, there are no separate pages for German “Hamburger” (person from Hamburg) and English “Hamburger” (person from Hamburg). Note also how the entry title on the right-hand side is “hamburger” instead of “Hamburger” and the German equivalent appears as a translation.

Overview

The above-mentioned projects WordNet, EuroWordNet, OmegaWiki and Wiktionary differ from each other to quite a large extent. Especially as they partly pursue different objectives, it is hard to say in general which of them have the “better” underlying structure.

In reference to the chapters above, some of the main structural differences between the four dictionaries are illustrated in the table below.

Comparison of the different Projects
System | Free | Open Source | Entry Scope | No. of Languages: Expressions | No. of Languages: Definitions | Translation
WordNet | yes | yes | one expression with all its senses, classified into synsets | 1 | 1 | --
EuroWordNet | no | no | one expression with all its senses, classified into synsets | 7 | 7 | interlingual synset-links
Wiktionary | yes | yes | one surface-form with all its expressions with all their senses | 1062 (522 with >10 entries) | 170 | explanation in respective language
OmegaWiki | yes | yes | one sense of an expression (Defined Meaning) | 469 | 469 | translation of concept-definitions

However, naturally there are some positive/negative aspects to each project, which could be seen as main criteria in the structure of dictionaries. These will be outlined in this section.

Representation of Translations and Synonymies

In the projects discussed here, there are two different approaches to this matter: one is to link linguistic material from different languages to an (ontological) entity; the other is to establish translation and synonymy directly between senses. The advantage of the first one lies in the smaller number of links: in the second approach, there will in the worst case be links from each language to every other one, yielding quadratic complexity. However, the second approach allows a finer granularity of translations and synonymies, probably resulting in a higher quality of the information stored in the dictionary. A main advantage of online dictionaries is that space is not as limited as it is for paper dictionaries, so one could take advantage of this fact and opt for the second kind of translation/synonymy-representation.
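The difference in the number of links can be made concrete with a small calculation. The sketch below assumes the simplest case, one concept with exactly one sense per language and undirected links; the function names are made up for this illustration.

```python
def links_via_entity(n_languages):
    """First approach: every language links once to a shared abstract entity."""
    return n_languages

def links_between_senses(n_languages):
    """Second approach, worst case: every pair of languages is linked directly."""
    return n_languages * (n_languages - 1) // 2

for n in (7, 170):
    print(n, links_via_entity(n), links_between_senses(n))
# 7 7 21
# 170 170 14365
```

With the seven languages of EuroWordNet the difference is small, but with the 170 language versions mentioned for Wiktionary a single concept could need up to 14365 pairwise links instead of 170.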

In the projects above, only Wiktionary uses this structure: both EuroWordNet and OmegaWiki make use of abstract entities that serve as “anchors” to the linguistic material (please see the illustrations of the structures in the respective chapters) and WordNet does not cover any other languages than English in the first place and therefore no translations at all. Regarding synonymy it does link between the sets directly, without recourse to an abstract entity.

Required Knowledge of Foreign Language and Language Specificity of Definitions

One of the most valuable factors of the structure of Wiktionary is the fact that in order to understand what the respective foreign-language material means, one does not need to have a high understanding of this language. This is obviously different in monolingual dictionaries like for instance WordNet, which does not cover any interlingual connections and can thus not be evaluated in this respect. In OmegaWiki, there can be translations of definitions in all languages, making it multi-lingual. However, the content of the definition is translated instead of being formulated specifically for each language. Thus fine differences in meaning may not be representable in certain cases. EuroWordNet also keeps the “definition” rather language-specific by means of links between synsets in different languages. However, here, a “translation” can only be represented if there actually exists an equivalent synset in the language in question. If this is not the case, there is no possibility of representing the meaning of a synset in a different language at all; this can be approached only vaguely via the relational information in the respective wordnets. However, this cannot be seen as a deficit of the structure of EuroWordNet, as the designated objective is to display equivalence relations between languages and not to give translations of foreign-language material.

The Scope of an Entry

It may be very handy to represent different senses of one expression collectively, for instance in those cases where a user wants to look something up and is not quite sure what it refers to, as in these cases the gloss may not be sufficient for deciding between the different senses. If they are collected, this reduces the manual search effort. Furthermore, all information that is shared between the different senses (this could include pronunciation, etymology, morphology etc.) can be displayed more effectively, with shared features displayed accordingly. There is no true disadvantage to displaying different senses of an expression grouped together, but there are certain advantages that can make the structure clearer and more concise and the look-up more user-friendly. All of the discussed projects make use of this entry-wide collection of senses for the look-up, even though the representations vary: WordNet and EuroWordNet refer to them via synsets, OmegaWiki allows a display of either a DefinedMeaning (one sense) or an expression (with possibly various senses) on one page, and Wiktionary even groups together different expressions that may belong to different languages as long as the surface form is identical. Since none of them prohibits the display of collected senses per expression when searching the database, this does not serve as a distinguishing feature between EuroWordNet, WordNet, OmegaWiki and Wiktionary, but it should be taken into account when structuring linguistic Wikidata-entries (even though lucidity may not be a main criterion, as the Wiktionaries can process the Wikidata-information very efficiently). However, only Wiktionary allows for structural flexibility in respect of storing information that holds for more than one sense.

Covered Linguistic Material

WordNet and EuroWordNet only cover words or multiword expressions from a limited number of parts of speech. They do not cover phrases, semantically non-referring material, colloquial language or inflected forms. Wiktionary does cover them; OmegaWiki does so partly and would at least have the possibility due to its underlying structure.

Features

Regarding the possibilities of representing various kinds of linguistic information, the different projects do not differ to a very large extent in some respects: There are sentences of example uses in all four of them, translations (except WordNet) and, obviously, some form of definitions. The same holds for relational information such as antonymy, hypernymy etc. Information about phonology or morphology (especially inflectional forms), however, is not featured in WordNet or EuroWordNet and also etymological information is only covered to a very limited extent by these two as well as by OmegaWiki. Of the four projects, Wiktionary is the only one accounting for this feature. Media files can be included in both OmegaWiki and Wiktionary.

Lemon

Since the lemon model might be a promising model for our purpose, its main structure will briefly be outlined.

Structure

The purpose of “lemon” is to offer a model for “sharing lexical information on the semantic web”. In our case, it may be useful for the structuring of Wikidata, as it imposes a structure that offers just the right amount of granularity that we wish to represent in the third proposal, i.e. it is language-dependent and differentiates between a surface form of a lexical entry and its sense, which refers to an ontology entry. In both relation types, multiple relations can be represented, and they can also be subcategorized into “common form” versus “variant” etc. The different categories are built as illustrated in the figure and will be explained separately.

Lexicon
A Lexicon contains all Lexical Entries of a certain language and labels them with the according language code.
Lexical Entry
A Lexical Entry represents one lexeme, i.e. a word or multi-word term in a certain language that has one or more forms and one or more senses.
Lexical Form
The Lexical Form of a certain Lexical Entry is described by its written representation. There can be various Lexical Forms of a Lexical Entry that may be categorized into Canonical Form – the usual written representation –, Other Form – which can be a different and less common spelling or for example an inflected form – and Abstract Form – a non-realizable form, for instance the stem of a word. Via properties, alternative forms can further be described. An example could be ”property: category plural”. Alternative written representations that are equally common can be represented accordingly. It is not necessary to decide for one variant.
Lexical Sense & Ontology
The Lexical Sense represents the relationship between the lexical entry and the ontology entry, that is, what the entry refers to. In the case of homonyms or polysemous words, one Lexical Entry refers to more than one Ontology Entry. Since one Ontology Entry can also be referred to by different Lexical Entries, there is a many-to-many relationship between Lexical Entries and Ontology Entries, mediated by the Lexical Senses.
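The four categories above can be sketched as plain Python classes, using the "Hamburger" example of this document. This is not the lemon RDF vocabulary itself; the class and field names are only loosely modelled on the categories described above, and the ontology identifiers (`onto:PersonFromHamburg`, `onto:HamburgerSandwich`) are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class LexicalForm:
    written_rep: str
    kind: str = "canonical"                           # "canonical", "other" or "abstract"
    properties: dict = field(default_factory=dict)    # e.g. {"number": "plural"}

@dataclass
class LexicalSense:
    reference: str                                    # identifier of an ontology entry

@dataclass
class LexicalEntry:
    forms: list
    senses: list

@dataclass
class Lexicon:
    language: str
    entries: list = field(default_factory=list)

english = Lexicon("en", [LexicalEntry(
    forms=[LexicalForm("Hamburger"),
           LexicalForm("Hamburgers", "other", {"number": "plural"})],
    senses=[LexicalSense("onto:PersonFromHamburg")],
)])
german = Lexicon("de", [LexicalEntry(
    forms=[LexicalForm("Hamburger")],
    senses=[LexicalSense("onto:PersonFromHamburg"),
            LexicalSense("onto:HamburgerSandwich")],
)])

def entries_referring_to(reference, lexica):
    """List (language, canonical written form) of entries with a sense for `reference`."""
    return [(lexicon.language, entry.forms[0].written_rep)
            for lexicon in lexica
            for entry in lexicon.entries
            if any(sense.reference == reference for sense in entry.senses)]

print(entries_referring_to("onto:PersonFromHamburg", [english, german]))
print(entries_referring_to("onto:HamburgerSandwich", [english, german]))
```

The German entry carries two senses while the person-from-Hamburg ontology entry is reached from two lexical entries, which illustrates both directions of the relationship.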

Further features that may be of value for our plans include the possibility of representing either words or multi-word terms as lexical entries. It is also possible to store information about the decomposition into words and morphological compounds. Also it can be advantageous to be able to assign properties to the relations. The model also offers modules for automatic inflection generation, which will not be covered at this point but that may be interesting as soon as it comes to deciding if and how automatic information generation shall be handled.

Example

The following example, which also explains how translations can be handled, is taken from the lemon cookbook.

The left side of the illustration above shows three different Lexica (English, German, French), each of which has one Lexical Entry (“cat: LexicalEntry” in English, “chat: LexicalEntry” in French, “katze: LexicalEntry” in German), and each of these LexicalEntry-boxes points to a Lexical Form with the written representation of the entry including the language label. These relations carry the value “canonicalForm”. As described above, there can also be alternative forms represented at various points in the system. Every Lexical Entry also points to a Sense and these Senses are all interconnected, carrying the label “translationOf”. Thus, translation happens between Senses. These Senses would all point to the same Ontology Entry as explained above, but this is not illustrated in the figure.
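The idea that translation is a relation between Senses rather than between entries or forms can be sketched in a few lines. The sense identifiers, the data layout and the helper `translations` are assumptions made for this illustration and are not taken from the lemon cookbook.

```python
# Each sense belongs to one lexical entry, given here as (language, written form).
senses = {
    "cat_sense": ("en", "cat"),
    "chat_sense": ("fr", "chat"),
    "katze_sense": ("de", "Katze"),
}

# "translationOf" links, stored once per pair of senses.
translation_of = [
    ("cat_sense", "chat_sense"),
    ("cat_sense", "katze_sense"),
    ("chat_sense", "katze_sense"),
]

def translations(sense_id):
    """Return {language: written form} for all senses linked to `sense_id`."""
    result = {}
    for a, b in translation_of:
        if sense_id in (a, b):
            other = b if a == sense_id else a
            language, written_rep = senses[other]
            result[language] = written_rep
    return result

print(translations("cat_sense"))    # {'fr': 'chat', 'de': 'Katze'}
```

Because the links hang off senses, a polysemous entry could have different translations for each of its senses without any change to this structure.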


Proposals

There are three main proposals that were made regarding the restructuring/extension of Wikidata in order to support Wiktionary.

Initial Proposal

The initial proposal was made by Denny Vrandečić and first announced on June 19, 2013. It is based on the introduction of two new entity types to Wikidata: expression and sense.

While the typical Wikidata-item may have a label in every language (the English label for Q1749 is “Copenhagen”, the Danish label of the same item is “København“ etc.), regarding an expression, there would only be one label altogether. Since in this proposal, the term (word or multiword term) itself together with its linguistic information is what is of interest, it seems clear that there may not be any translated word forms in the different languages when talking about the same expression. An expression itself is dependent on the language it belongs to, the English word “Berliner” is a different expression than the German word “Berliner”. The expression “Berliner” would thus be dependent on the morphological surface form (and not have a different label – the French translation Berlinois/e or anything similar). In the latter case, there are (at least) two different meanings to the expression that are attached to it – “person from Berlin” and “doughnut with a sweet filling” – regardless of their different etymologies. Likewise, it would be the same if they required a different pronunciation, hyphenation etc. as long as the spelling is identical. As explained in the terminology, the notion of sense is introduced.

These short descriptions (such as “person from Berlin”) are called glosses. Hence, the expression “Berliner (German)” has two different senses that can be referred to with the glosses “person from Berlin” and “doughnut with a sweet filling”, respectively. The expression “Berliner (English)” has one sense, which can be referred to with the gloss “person from Berlin”. In Wikidata, there would therefore be two pages: “Berliner (English)” with the section “person from Berlin” and “Berliner (German)” with the sections “person from Berlin” and “doughnut with a sweet filling”.

Linguistic properties would be recorded as statements by the users, and both an expression and a sense can have statements. In the case of “Berliner (German)”, the statement regarding hyphenation would be attached to the expression, while statements regarding, for instance, synonymy or translation would need to be associated with the corresponding senses. This proposal does not plan any search aliases for inflections. Every derived form (plural forms, inflected verbs etc.) will be a discrete expression.
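The structure of the initial proposal can be sketched as follows. This is a minimal illustration in Python, not part of the proposal: the class names `Expression` and `Sense` follow the terminology above, while the field names, the statement keys and the example statement values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    gloss: str                                      # short description of the sense
    statements: dict = field(default_factory=dict)  # e.g. synonymy, translation


@dataclass
class Expression:
    label: str                                      # the single label (surface form)
    language: str                                   # an expression belongs to one language
    senses: list = field(default_factory=list)
    statements: dict = field(default_factory=dict)  # e.g. hyphenation


# Two separate pages: "Berliner (German)" and "Berliner (English)".
berliner_de = Expression(
    "Berliner",
    "German",
    senses=[Sense("person from Berlin"), Sense("doughnut with a sweet filling")],
)
berliner_en = Expression("Berliner", "English", senses=[Sense("person from Berlin")])

# Hyphenation is stated on the expression (illustrative value).
berliner_de.statements["hyphenation"] = "Ber-li-ner"
# Synonymy is stated on the sense it applies to (illustrative value).
berliner_de.senses[1].statements["synonym"] = "Krapfen"
```

Under this proposal an inflected form such as a plural would not appear inside either object; it would be a further `Expression` of its own.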

Alternative Proposal

The alternative proposal by User:Micru (David Cuenca) and User:Francis Tyers served as a reaction to the first one and was announced on July 1, 2013. It is based on the introduction of two new entity types (defined meaning, bond) and one new data type (a paradigm).

One of the main differences from the initial proposal is the split between expressions and their senses: while in the initial proposal all senses of one expression are listed together on one page, in the alternative proposal there will be one page per sense. Following a similar terminology, these senses will be language-dependent as well (that is, “Berliner (German) – person from Berlin” will be a different entity than “Berliner (English) – person from Berlin”). What is referred to as sense in the initial proposal is called defined expression in this proposal. This is similar to the terminology in OmegaWiki, although not quite the same: the OmegaWiki-DM is based on a translatable definition, while in this proposal the defined expression can have its own definition in each language.

The second new entity type, a bond, shall replace property links to some extent. It represents certain statements as the results of automatic searches and is thus partly built automatically. This will happen whenever an automatic search or inference allows it. Examples could be the automatic linking of exact translations or exact synonymy. Since there are certain difficulties associated with these kinds of inferences (semantic drift etc.), a differentiation between strong links (for example exact meronymy) and weak links (for example near-synonymy) is proposed in order to handle these phenomena better. Paradigms are language-dependent sets of rules for automatically generating derived forms. In this proposal, those forms shall serve as aliases to the base form of the defined expression (and optionally be stored as “inflected forms”).
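The idea of a paradigm can be illustrated with a deliberately simplified sketch in Python. The proposal does not specify a rule format, so the suffix table `TOY_ENGLISH_NOUN` and the function `apply_paradigm` are hypothetical; real paradigms would need far richer rules than plain suffixes.

```python
def apply_paradigm(base_form, paradigm):
    """Generate derived forms by appending each rule's suffix to the base form."""
    return {feature: base_form + suffix for feature, suffix in paradigm.items()}


# Hypothetical toy paradigm: a regular English noun with a plural in -s.
TOY_ENGLISH_NOUN = {"singular": "", "plural": "s"}

forms = apply_paradigm("Berliner", TOY_ENGLISH_NOUN)

# Generated forms other than the base form would serve as aliases to it.
aliases = sorted(set(forms.values()) - {"Berliner"})
```

Here `aliases` contains only the generated plural, which would point back to the base form of the defined expression.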

Third Proposal

Fig. 7: Example entries "Hamburger", Third Proposal
Fig. 8: Example entry third proposal

The third proposal emerged predominantly out of discussions about the initial and the alternative one. It was put forward on August 2, 2013 by Denny Vrandečić. This proposal uses a slightly different terminology which is introduced below. The terms sense and gloss, however, are defined the same way as in the terminology.

  • A lexeme, also known as word or lexical entry, is what is described on one page in the lexical part of Wikidata. A lexeme consists of a lemma, a lexical category, a language, a set of forms, a set of senses, and a set of statements.
    • The lemma is the canonical form or dictionary form of the lexeme, e.g. for verbs this is usually the infinitive form, for a noun the nominative singular, etc.
    • The lexical category, also known as the part of speech or word class, defines the lexeme to be either a noun, or a verb, or an adjective, etc. The set of possible values is open and taken from the Wikidata items.
    • The language of a lexeme is taken from Wikidata items, and thus an open set.
    • A form is a specific, fully conjugated or inflected form of the lexeme. A form consists of a representation, a set of lexical properties, and a set of statements. A form always belongs to one (and exactly one) lexeme.
      • A representation is the actual string value realizing a given form, e.g. the string value "wrote" for the past tense of the lexeme for "write". All representations are indexed for search.
      • A lexical property describes the form, e.g. tense or number for verbs, case for nouns, etc. This is an open set and points to Wikidata items.
    • A sense is described by a gloss and has a set of statements. A sense always belongs to one (and exactly one) lexeme (and lexemes belong to one language only). Senses are not independent of lexemes.
      • A gloss is a short description (translatable in all languages of the Wikidata UI) of one sense of the given lexeme.
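The definitions above nest as follows. This is a minimal sketch in Python, not the data model as implemented: the class and field names follow the terms in the list, the `item:` strings are placeholders for Wikidata items and not real identifiers, and the gloss text is an invented example.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    representation: str                                  # the actual string value
    lexical_properties: list = field(default_factory=list)  # point to Wikidata items
    statements: list = field(default_factory=list)


@dataclass
class Sense:
    gloss: dict                                          # UI language code -> short description
    statements: list = field(default_factory=list)


@dataclass
class Lexeme:
    lemma: str                                           # canonical (dictionary) form
    lexical_category: str                                # a Wikidata item, open set
    language: str                                        # a Wikidata item, open set
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)
    statements: list = field(default_factory=list)


write = Lexeme(
    lemma="write",
    lexical_category="item:verb",   # placeholder, not a real item id
    language="item:English",        # placeholder, not a real item id
    forms=[
        Form("write", ["item:infinitive"]),
        Form("wrote", ["item:past tense"]),
    ],
    senses=[Sense({"en": "to form letters or words on a surface"})],
)
```

Forms and senses live inside exactly one lexeme, so neither is created as a page of its own.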

The terms Wikidata item, property, string value, qualifier, statement, and claim are taken from the Wikidata glossary and have the same meaning here. See also the further glossary.


Some of the most important alterations to the previous proposals are the following:

The “basic” unit is the lexeme. It is not the expression, as the initial proposal suggested, where each morphological form was a separate expression with its own entry page. Nor is it the sense, as the alternative proposal suggested (a sense may be only one part of the lexeme if the lexeme is polysemous or homonymous), nor the language-independent surface form, as is the case in Wiktionary.

Senses, forms and lexemes can have statements. This differs from the initial proposal, which required a separate entry for every derived form. In the third proposal, inflections are “alternative forms” that may, but do not need to, have statements separately from their lemma. While the alternative proposal suggested statements on sense level (and, depending on the implementation of inflections, statements on either all or no inflected forms), the third proposal makes it possible to decide where a statement is most useful. This way, all necessary distinctions can still be drawn, but shared information can be stored less redundantly.

Inflections are handled as aliases for search and do not need to have a separate entry. This is similar to the alternative proposal. However, in the third one, decisions about what may be computed automatically - for example via paradigms - are postponed to a stage where there is enough linguistic data in Wikidata for a more detailed discussion on this matter. Figure 8 shows the example entry, taken from the proposal, with more detail than the more schematic "Hamburger"-Example.
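Handling inflections as search aliases can be sketched as a lookup table from each representation to the lexeme it belongs to. This is an illustrative sketch in Python only: the functions `build_search_index` and `lookup`, the lower-casing of queries and the listed forms are assumptions, not part of the proposal.

```python
from collections import defaultdict

# Lemma -> representations of its forms (illustrative data).
lexeme_forms = {
    "write": ["write", "writes", "wrote", "written", "writing"],
    "Hamburger": ["Hamburger", "Hamburgers"],
}


def build_search_index(lexeme_forms):
    """Index every representation so that it resolves to its lexeme."""
    index = defaultdict(set)
    for lemma, representations in lexeme_forms.items():
        for representation in representations:
            index[representation.lower()].add(lemma)
    return index


def lookup(index, query):
    """Return the lemmas whose forms match the query (case-insensitive)."""
    return index.get(query.lower(), set())


index = build_search_index(lexeme_forms)
```

With this index, `lookup(index, "wrote")` leads to the lexeme "write" without "wrote" needing an entry of its own.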

The Hamburger-Example

The "Hamburger"-example would in this case be represented as in Figure 7.

Overview

The table shows a comparison of some details of the three proposals.

Comparison of the Wiktionary/Wikidata-Proposals
Proposal | Entry Scope | Inflection Handling | Statements | Storage in Wikidata
Initial | one expression; each morphological form separately | own entry for each inflection | possible on both expression and sense | via statements
Alternative | one sense of an expression | aliases to base form; stored as inflected form | attached to sense | via bonds
Third | lexeme with all its senses | attached to lexeme via form; can hold own statement | possible on lexeme, sense and form | via statements


Further Wikidata-Terminology

The following definitions are taken from the Wikidata glossary and are shortened in places.

claim
A claim is a piece of data about the entity on whose page the claim appears. A claim consists of a property (such as "Location") and a value (e.g., "Germany"), or some other relation or composite or missing value. A claim can have qualifiers, such as temporal qualifiers saying that the claim is valid within a specific time frame. Compared to the triplets used in linked data, a claim uses a property to express the predicate of a triplet and a value to express the object of a triplet. Claims form part of statements on item pages.
item
A Wikidata item is a page in the Wikidata main namespace that represents a real-life topic, concept, or subject. Items are identified by a prefixed id, or by a sitelink to an external page, or by a unique combination of multilingual label and description. Items may also have aliases to ease lookup. The main data part of an item is the list of statements about the item. An item can be viewed as the subject-part of a triplet in linked data.
property
A Wikidata property (in some languages translated to attribute) is the descriptor for a data value, or some other relation or composite or possibly missing value, but not the data value or values themselves. Each statement at an item page links to a property, and assigns the property one or several values, or some other relation or composite or possibly missing value.
qualifier
A qualifier is a part of the claim that says something about the specific claim, often in a descriptive way. A qualifier might be a term according to a specific vocabulary but can also be a variant descriptive phrase.
statement
A statement is a piece of data about an item, recorded on the item's page. A statement consists of a claim (a property-value pair such as "Location: Germany", together with optional qualifiers), augmented by optional references (giving the source for the claim) and an optional rank (used to distinguish between several claims containing the same property). Wikidata makes no assumptions about the correctness of statements, but merely collects and reports them with a reference to a source.
string
A string (short for character string) is a general term for a sequence of freely chosen characters interpreted as text (e.g. "Hello"), as opposed to data interpreted as a numerical value (3.14), a link to an item (e.g. Q1234) or a more complex datatype (the set {1,3,5,7}). In addition to a string datatype, Wikidata will support language-specific texts, "monolingual-text" and "multilingual-text", as the value of a property.
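The relationship between claim, qualifier and statement defined above can be sketched as follows. This is a minimal illustration in Python, not the Wikidata data model itself: the class and field names follow the glossary terms, while the helper `as_triple`, the rank value and the example subject, qualifier and reference are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Claim:
    prop: str                                       # the property, e.g. "Location"
    value: str                                      # the value, e.g. "Germany"
    qualifiers: dict = field(default_factory=dict)  # say something about the claim


@dataclass
class Statement:
    claim: Claim
    references: list = field(default_factory=list)  # optional sources for the claim
    rank: Optional[str] = None                      # optional rank


def as_triple(subject, statement):
    """View a statement as a linked-data triple: item, property, value."""
    return (subject, statement.claim.prop, statement.claim.value)


statement = Statement(
    Claim("Location", "Germany", qualifiers={"valid in": "example time frame"}),
    references=["example source"],
    rank="normal",  # illustrative rank value
)

triple = as_triple("example item", statement)
```

The item supplies the subject of the triple, the property the predicate and the value the object; references and rank sit on the statement, outside the claim.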

References

  1. http://en.wiktionary.org/wiki/Wiktionary:Statistics
  2. http://meta.wikimedia.org/wiki/Wiktionary/Table
  3. AboutWordNet