Wikidata:Comparison of Projects and Proposals for Wiktionary

Introduction

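The purpose of this work is to discuss a new proposal which, building on the two previous ones, would allow Wikidata to support Wiktionary by providing structured linguistic data that can then be represented in Wiktionary. This includes answering certain questions about the current structure of Wikidata, since not all decisions follow automatically. Examples are the scope and boundaries of a linguistic entity in Wikidata, the handling of derived but not synonymous forms, their structured representation, and the structure of entries that may include different sub-categories. In addition, the structure of similar projects (WordNet, EuroWordNet and OmegaWiki) is covered, as well as lemon, a model for sharing lexical information.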

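Each section contains an example entry to illustrate the differences between the systems. In each case, the example shows how the respective structure would represent the English word “Hamburger”, meaning a person from Hamburg, and the German word “Hamburger”, which can denote either a person from Hamburg or a hot sandwich consisting of a patty of cooked ground beef in a sliced bun, sometimes also containing salad vegetables, condiments, or both (note that the latter can be translated as “hamburger”, which differs from “Hamburger” in capitalization). In each system, one box represents one entry, i.e. one individual item page. In most cases the content of these boxes is greatly reduced in order to keep the illustration schematic.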

Terminology

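In order to avoid misunderstandings, the terms used in this document are briefly outlined here. The following chapters explain the terms in more detail; an in-depth explanation of Wikidata terminology can be found in the dedicated Wikidata terminology reference. Terms that are only relevant in a particular chapter are introduced in that chapter.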

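surface form
This term is used solely for the morphological surface form (how a word is written), ignoring the phonetic surface form (how a word is pronounced) and the like. This also means that the surface form of “go” as a French noun (the board game) is identical to that of “go” as an English verb. The words “Polish” and “polish”, however, do not share the same surface form, since only one of them is capitalized.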
gloss
A gloss is a short description of what the word/expression denotes but which does not need to be as detailed as a definition that can be used interchangeably with the word/expression. An example of a word together with its gloss is ”Fado – musical genre”.
expression
The term expression will refer to same surface forms that also share the feature-value for language and have the same morphological features, thus also belong to the same word category. The English adjective “blue”, referring to the color and the English adjective “blue” referring to a melancholic state of mind are for example the same expressions (just with different meanings) but the English verbs “tear” (to rend by holding or restraining in two places and pulling apart, whether intentionally or not; to destroy or separate) and “tear” (to produce tears) are not, due to different morphological values such as different inflection-forms: tear/tore/torn versus tear/teared/teared. The English noun “bike” and the English verb “bike” are not the same expressions due to different word categories and neither are the English noun “chat” (an exchange of text or voice messages in real time through a computer network) and the French noun “chat” with the same meaning, due to different languages.
sense
The sense of a word/multi-word-expression is the word/multi-word-expression combined with the meaning it is associated with. The English word “chat” for instance can have the meaning to be involved in an informal conversation, and “chat” in French can either have the same meaning or that of “cat”; however, their senses are different, as the expressions differ. A sense is dependent on the term it belongs to – to give another example: “Le Mépris” (film by Godard) and “Pierrot le fou” (film by Godard) are both described with the same gloss (“film by Godard”), but they do not share their senses. The sense in the first example is “Le Mépris” (film by Godard) and in the second “Pierrot le fou” (film by Godard).
entry
The term entry denotes the basic presentation-unit in any of the different dictionaries and is in this respect equal to the term page. For Wikidata, an entry is a wiki page, and is the basic editorial unit, used to track edit histories, authorship, etc.
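The distinctions drawn in this terminology can be made concrete with a small sketch. The class and field names below are illustrative assumptions, not part of any of the discussed systems; the inflection tuple is a simple stand-in for "morphological features", and the word category given to the film titles is likewise assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    """A surface form plus language, word category and morphological features."""
    surface_form: str        # the written form, compared case-sensitively
    language: str
    category: str            # word category / part of speech
    inflection: tuple = ()   # assumed stand-in for morphological features


@dataclass(frozen=True)
class Sense:
    """An expression combined with one meaning, described here by a gloss."""
    expression: Expression
    gloss: str


blue = Expression("blue", "English", "adjective")
blue_colour = Sense(blue, "the color")
blue_mood = Sense(blue, "a melancholic state of mind")

tear_rend = Expression("tear", "English", "verb", ("tear", "tore", "torn"))
tear_cry = Expression("tear", "English", "verb", ("tear", "teared", "teared"))

chat_en = Expression("chat", "English", "noun")
chat_fr = Expression("chat", "French", "noun")

le_mepris = Sense(Expression("Le Mépris", "French", "proper noun"), "film by Godard")
pierrot = Sense(Expression("Pierrot le fou", "French", "proper noun"), "film by Godard")

print(blue_colour.expression == blue_mood.expression)  # True: one expression, two senses
print(tear_rend == tear_cry)                           # False: inflection differs
print(chat_en == chat_fr)                              # False: language differs
print(le_mepris == pierrot)                            # False: same gloss, different senses
```

Because a sense carries its expression, two senses with an identical gloss still compare as different, which mirrors the Godard example above.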

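Wiktionary and Wikidata

This section introduces the current functionality of Wiktionary and Wikidata in order to lay the groundwork for analyzing the three proposals. These overviews are followed by a short motivation explaining why a connection between the two Wikimedia projects may be beneficial.

Wiktionary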

Fig. 1: Example entry "Hamburger", Wiktionary
Fig. 2: Alternative Example entry "Hamburger", Wiktionary

The approach of the open-source Wiktionary is to have a multi-lingual dictionary with entries from all languages. In the English Wiktionary there are entries about the German word “Unterstrich” as well as about the Hungarian word “tüzes” etc. All information in the entry (except the foreign-language vocable) is written in English, and English words have entries as well. In the Thai Wiktionary, entries are written in Thai, and so on. Currently, the English version alone contains expressions from 1062 languages, of which 522 have more than ten entries[1]. At the moment, there are 170 language versions[2].


The entries are structured as follows: one entry handles all expressions from all languages that share one morphological surface form. This means, for example, that both the German adjective “arm” and the English noun “arm” are associated with the same entry. The sub-sections are structured according to language (the respective Wiktionary’s own language has priority in the ordering of the sub-sections) and POS. The page covering a surface form is thus divided into expression-sections, which in turn are divided into sense-sections if one expression holds more than one sense. If property-values such as pronunciation or etymology are shared, they can appear at the beginning of the respective section; if they differ, they are split up into their subsections. There are some smaller differences between the language versions: the German one, for example, also lists translations into other languages (not to be confused with Wiktionaries in other languages that handle the same surface form) and links to those that are available in the German Wiktionary, whereas the English and French Wiktionaries, for example, do not treat translations into other languages. Apart from such details, the different language versions follow a similar structure.
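As a rough illustration of this nesting (surface form, then language, then POS, then senses), a heavily reduced page could be modelled as a nested dictionary. The glosses and the function names are assumptions made for this sketch, and the alphabetical ordering of the remaining languages is likewise assumed; only the priority of the Wiktionary's own language is taken from the description above.

```python
# Hypothetical, heavily reduced page for the surface form "arm".
page = {
    "surface_form": "arm",
    "languages": {
        "English": {"noun": ["upper limb of the human body"]},  # illustrative gloss
        "German": {"adjective": ["poor"]},                       # illustrative gloss
    },
}


def expressions(page):
    """Return the (language, POS) pairs, i.e. the expression-sections of a page."""
    return [
        (language, pos)
        for language, sections in page["languages"].items()
        for pos in sections
    ]


def ordered_languages(page, home_language):
    """Put the Wiktionary's own language first; the rest alphabetically (assumed)."""
    languages = sorted(page["languages"])
    if home_language in languages:
        languages.remove(home_language)
        languages.insert(0, home_language)
    return languages


print(expressions(page))                  # [('English', 'noun'), ('German', 'adjective')]
print(ordered_languages(page, "German"))  # ['German', 'English']
```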


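The information in Wiktionary includes pronunciation (IPA/SAMPA transcription and audio files), etymology, anagrams, synonyms, hypernyms etc., inflection tables, example sentences, translations (linked to the respective translation if it exists), links to Wikipedia concerning the concept behind the meaning of the expression, and links to entries in other Wiktionaries that handle the same surface form. Wiktionary also has entries for word components, acronyms, abbreviations, misspellings and simplified spellings.

Figure 1 shows how Wiktionary represents “Hamburger”.

Note that in some Wiktionaries, features shared by the two German words, such as the pronunciation in the example above, may be placed before the respective noun sections in order to avoid redundancy (see Figure 2).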

Wikidata

Fig. 3: Example statement in Wikidata

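Wikidata is an open-source database that stores structured data in a language-independent way. Information about an item is thus shared by every language version: the labels of the concepts (which are connected via property links) and of the property links themselves are expressed in different languages, but they are anchored at the respective ID of the item or property. In addition, a third entity type is planned: queries. These are to be used to generate lists automatically, for example a list of food additives or a list of rivers with the respective information.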

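Every item can have a list of statements. The information that Berlin (Q64) has the status of a German state is represented by a statement claiming “type of administrative division: state of Germany”. These statements consist of a property and a value, and possibly qualifiers, accompanied by a (possibly empty) list of references. Figure 3 gives an example statement.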
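A minimal sketch of this item and statement structure follows. It is not the Wikibase data model itself: the class names are assumptions, and properties and values are written as plain labels rather than as entity IDs to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    prop: str                                        # property, given here by its label
    value: str
    qualifiers: dict = field(default_factory=dict)   # optional
    references: list = field(default_factory=list)   # may be empty


@dataclass
class Item:
    item_id: str                                     # language-independent anchor
    labels: dict                                     # language code -> label
    statements: list = field(default_factory=list)


berlin = Item("Q64", {"en": "Berlin"})
berlin.statements.append(
    Statement("type of administrative division", "state of Germany")
)

# One item, several labels: the ID stays the same in every language.
copenhagen = Item("Q1749", {"en": "Copenhagen", "da": "København"})

print(berlin.statements[0].references)  # []
print(copenhagen.labels["da"])          # København
```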

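A property (in Figure 3: “population”) is an entity that is explained in the property section and can be created by users just like other items. Examples are date of birth (P569, the date on which the subject was born), signature (P109, an image of a person's signature) or ancestral home (P66, the place of origin of the subject's ancestors in China). The information currently available in Wikidata has been added by users, partly manually and partly by bots. The underlying software is Wikibase. Naturally, there cannot be an example entry for the expressions of “Hamburger” here, since the approach to possible Wikidata entries for linguistic items is the very subject of this document.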

Motivation

It seems obvious that having access to structured linguistic data can be of great benefit for these 170 language-dependent Wiktionary-versions.

  • Firstly, it would reduce the editing-effort, as information can be drawn from the database automatically if desired.
  • Secondly, as the same holds for corrections that could then have an effect in all entries in all languages at once, a higher information quality can be achieved.
  • Thirdly, this may lead to more extensive entries also in Wiktionaries from smaller languages.
  • Fourthly, having a vast collection of free, structured linguistic data will be very useful for natural language processing applications, researchers, linguists and people “just personally” interested in linguistic structures that can be browsed easily.

Acknowledging the need for certain unstructured information as well (not least, definitions and explanations of foreign words in one's own language, which can and will diverge in content due to different cultural and language backgrounds), there is no wish at all to replace Wiktionary with Wikidata, only to help maintain and extend it by offering both a structure and a basis for anchoring information.

Comparison of the Structure of other Projects

In this section, the underlying structures of the projects WordNet, EuroWordNet and OmegaWiki shall be compared in order to display structural differences between them and Wiktionary.

WordNet

Figure 4: Example entry "Hamburger", WordNet

Structure

WordNet is a free dictionary for English from Princeton University. Every entry (a word or multi-word term) is associated with one or more so-called synsets. These group together words that are synonymous in a certain context; see the two synsets for the expression “copper” as examples. Further, a gloss is provided, giving a short explanation/definition of the synset. In most cases, there are also example sentences.[3]

  1. S: (n) bull, cop, copper, fuzz, pig (uncomplimentary terms for a policeman)
  2. S: (n) copper, copper color (a reddish-brown color resembling the color of polished copper)

The synsets are linked to each other by ontological relations, mostly hyponym/hypernym-relations. This means that for instance the synset {photograph, photo, exposure, picture, pic} can be represented as a direct hypernym for {wedding picture}, {still} or {snapshot, snap, shot}, while functioning as a direct hyponym of {representation}. Directed relations like “derivationally related form” are also included. The “sub-WordNets” of nouns, verbs, adjectives and adverbs are treated separately with only very few cross-POS-pointers. The structure, however, is the same in each one of them. In total, there are 117,000 synsets.
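The synset structure described above can be sketched with a few lines of Python. This is a toy model under simplifying assumptions (one hypernym per synset is followed, and the class and function names are invented for illustration); it does not use or reproduce the actual WordNet database format.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    words: tuple                                    # synonymous words in this sense
    gloss: str = ""
    hypernyms: list = field(default_factory=list)   # more general synsets


representation = Synset(("representation",))
photograph = Synset(
    ("photograph", "photo", "exposure", "picture", "pic"),
    hypernyms=[representation],
)
snapshot = Synset(("snapshot", "snap", "shot"), hypernyms=[photograph])

copper_police = Synset(
    ("bull", "cop", "copper", "fuzz", "pig"),
    "uncomplimentary terms for a policeman",
)
copper_colour = Synset(
    ("copper", "copper color"),
    "a reddish-brown color resembling the color of polished copper",
)
all_synsets = [representation, photograph, snapshot, copper_police, copper_colour]


def synsets_for(word, synsets):
    """Look up every synset (sense) a word takes part in."""
    return [s for s in synsets if word in s.words]


def hypernym_chain(synset):
    """Follow the first hypernym link upward until no hypernym is left."""
    chain = []
    while synset.hypernyms:
        synset = synset.hypernyms[0]
        chain.append(synset.words[0])
    return chain


print(len(synsets_for("copper", all_synsets)))  # 2
print(hypernym_chain(snapshot))                 # ['photograph', 'representation']
```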

Terminological Contrasts and Comparison with Wiktionary

In regards to our terminology, one WordNet-item correlates with the term expression: one surface-form with a certain language (in WordNet: English) and one POS-tag. A WordNet-synset is similar to what we define as sense – a certain meaning of a linguistic entity. Here, however, it is represented by a set of synonyms, whereas Wiktionary represents a sense by attaching a gloss/definition to the linguistic entity.

WordNet thus tackles the task of synonymy between senses and also some other semantic relations in the manner of a monolingual thesaurus. One of the main differences to Wiktionary lies in the different representations of senses (synsets versus glosses/definitions and descriptions). Further, WordNet, in contrast to Wiktionary does not provide morphological or phonological information of words, does not approach locutions and does not offer translations.

The Hamburger-Example

Note that WordNet only covers English words, therefore there is no representation at all for the German words in the hamburger example. Furthermore, WordNet does not provide any entries for the capitalized word ”Hamburger” in English, either. Figure 4 therefore shows a sample entry for ”Hamburger” as it would be according to the typical WordNet-structure. As shown, there is only one synset in this case. Also, this one does not provide any synonyms, but only a gloss and relational information.

EuroWordNet

Figure 5: Example entry "Hamburger", EuroWordNet

Structure

While WordNet only covers English words, with the start of the EU project EuroWordNet, WordNets in Dutch, Spanish, Italian, German, French, Czech and Estonian were also created and linked to each other, resulting in a multilingual database. Via the interlingual links that are stored in the Inter-Lingual-Index (ILI), wordnets from one language are linked to another. Since the purpose of these links is to match “equivalent” synsets in different languages, no relations between the single ILI-Records are established. This task remains in the single WordNets. This also allows an easy extension of them because no consensus over all groupings needs to be maintained.

Language-internal relations have been broadened with the start of the project, new ones were added and relations now have features such as conjunction or disjunction - "airplane" can have the meronyms "door", "jet airplane" and "propeller". The word "door" can have the holonyms "car", "room" or "airplane". Also, in EuroWordNet, links between synsets with different POS-tags are stored.

However, matching these synsets can be rather complicated – concepts might not exist in different languages or can, together with their in- and outgoing relations, be non-congruent. A concept might for example be a hyponym of another concept in one language, but not in another. It is thus hard to conclude or infer relations from interlingual mappings.

Terminological Contrasts and Comparison with Wiktionary

The terms item, synset and gloss are equal to those in WordNet, as explained above. The POS-constraint between synsets from different languages, however, is softened: here, equivalence-links may also be between synsets with items with different POS-tags.

Just as WordNet, EuroWordNet uses synsets as sense-representation, which is a major difference to Wiktionary. EuroWordNet in contrast to WordNet is multi-lingual to the extent that it connects synsets of seven languages with each other. Another important difference to Wiktionary, however, lies again in the exclusion of morphological and phonological information.

The Hamburger-Example

Since the use of EuroWordNet is not, unlike the one of WordNet, free, the figure above is built according to what it should look like considering the structure and other examples.

OmegaWiki

Fig. 6: Example entries "Hamburger", OmegaWiki

Structure

OmegaWiki is a multilingual open-source dictionary whose aim is to „describe all words of all languages with definitions in all languages, including lexical, terminological and ontological information”.

The internal structure relies on entries regarding one DefinedMeaning (DM), which is a combination of an expression together with its definition. This definition is regarded to be language-independent and therefore translated into the different languages. Speaking in the terms discussed in the terminology, one DefinedMeaning thus corresponds to a sense. Hence, the two examples listed below have separate pages because they are two distinct DefinedMeanings – in the first case, there is the expression “song” combined with the definition “A musical piece with lyrics…”, in the second one, the expression “song” is combined with the definition “The act of singing”.

  1. song: A musical piece with lyrics (or "words to sing"); prose that one can sing.
  2. song: The act of singing.

Further, there are entries for DMs in other languages that also have expressions with the same surface form. In this example, there is an entry for the Faroese word “song”, which translates to “bed“. This one, however, is represented on a different entry. There are lists per language with the different definitions a word can have, i.e. listing all DMs associated with surface-forms per language. However, these pages only list the existing DMs per expression with their information and it is not possible to share identical information (such as syllabication within one language) between different DefinedMeanings. These therefore need to be duplicated on all concerned entry-pages.
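The page-per-DefinedMeaning structure, and the duplication it implies for shared information, can be sketched as follows. The class, the `annotations` field and its values are assumptions for illustration and do not reflect OmegaWiki's actual database schema; the Faroese entry simply reuses the translation given above in place of a full definition.

```python
from dataclasses import dataclass, field


@dataclass
class DefinedMeaning:
    expression: str
    language: str
    definitions: dict                                # language code -> translated definition
    annotations: dict = field(default_factory=dict)  # e.g. syllabication


# Shared information has to be repeated on every DefinedMeaning page.
shared = {"syllabication": "song"}  # hypothetical value

dms = [
    DefinedMeaning(
        "song", "English",
        {"en": 'A musical piece with lyrics (or "words to sing"); prose that one can sing.'},
        dict(shared),
    ),
    DefinedMeaning("song", "English", {"en": "The act of singing."}, dict(shared)),
    DefinedMeaning("song", "Faroese", {"en": "bed"}),
]


def defined_meanings_for(expression, language, pages):
    """The per-language list page: all DMs attached to one expression."""
    return [
        dm for dm in pages
        if dm.expression == expression and dm.language == language
    ]


english_song = defined_meanings_for("song", "English", dms)
print(len(english_song))                                        # 2
print(english_song[0].annotations is english_song[1].annotations)  # False: two copies
```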

Besides these DefinedMeaning-pages, OmegaWiki also stores certain semantic and ontological relationships between the DMs. These include synonymy, antonymy etc. as well as hyponymy, translation to other languages etc.

Regarding these relations, it is possible to differentiate between exact relations and inexact ones. One example of these would be synonymy: While the English word “German” does not make any statement about the gender of the German person that is talked about, the German word “Deutsche” (as opposed to ”Deutscher”) also encodes the information “sex: female”. Since there is no word that would exactly translate into the English word “German” – where no information about the sex is represented – one could not use the hyponymy/hypernymy-relation to express the mapping between these DefinedMeanings, either. In the database, this is shown by the symbol “~”, meaning the translation is not an exact one. Users can decide in which language they want to use OmegaWiki. There are more than 300 interface languages. If an entry or a piece of information does not exist in a language, it is displayed in English.

Terminological Contrasts and Comparison with Wiktionary

OmegaWiki offers translations, glosses in different languages and information about semantic relations. In this matter, its approach is similar to the one of Wiktionary. One main difference, however, lies in the independence of the various language-versions of Wiktionary. In Wiktionary, expressions can be explained/defined in the respective language. In OmegaWiki, translations of concept-definitions which are associated with the respective concept are stored.

The Hamburger-Example

The example of German/English ”Hamburger” in OmegaWiki is shown in Figure 6.

As described above, there are different pages for each DM, however, as shown in the box on the left-hand side, there are no separate pages for German ”Hamburger”–person from Hamburg and English ”Hamburger”–person from Hamburg. Note also, how the entry title on the right hand side is ”hamburger” instead of ”Hamburger” and the German equivalent appears as a translation.

Overview

The above-mentioned projects WordNet, EuroWordNet, OmegaWiki and Wiktionary differ from each other to quite a large extent. Especially as they are partly pursuing different objectives, it is hard to say which ones are, in a general way, “better” than other ones when it comes to the underlying structures of them.

In reference to the chapters above, some of the main structural differences between the four dictionaries are illustrated in the table below.

Comparison of the different Projects
System | Free | Open Source | Entry Scope | No. of Languages: Expressions | No. of Languages: Definitions | Translation
WordNet | yes | yes | one expression with all its senses, classified into synsets | 1 | 1 | --
EuroWordNet | no | no | one expression with all its senses, classified into synsets | 7 | 7 | interlingual synset-links
Wiktionary | yes | yes | one surface-form with all its expressions with all their senses | 1062 (522 with >10 entries) | 170 | explanation in respective language
OmegaWiki | yes | yes | one sense of an expression (Defined Meaning) | 469 | 469 | translation of concept-definitions

However, naturally there are some positive/negative aspects to each project, which could be seen as main criteria in the structure of dictionaries. These will be outlined in this section.

Representation of Translations and Synonymies

In the projects discussed here, there are two different approaches to this matter: One is to link linguistic material from different languages to an (ontological) entity. The other is to ensure translation and synonymy between senses. The advantage of the first one lies in the smaller number of links – in the second approach, there will in the worst case be links from each language to each other one, yielding quadratic complexity. However, this way a finer granularity of translations and synonymies can be achieved, probably resulting in a higher quality of the information stored in the dictionary. It can be seen as a main advantage of online dictionaries that the problem of space is not as relevant as it is for paper dictionaries, so one could take advantage of this fact and opt for the second version of translation/synonymy-representation.
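The difference in the number of links can be made concrete with a short calculation. The sketch assumes one sense per language for a single concept and counts each pair of languages once (undirected links); counting directed links would double the pairwise figure.

```python
def hub_links(n_languages: int) -> int:
    """Each language links once to a shared (ontological) entity."""
    return n_languages


def pairwise_links(n_languages: int) -> int:
    """Worst case for sense-to-sense links: every pair of languages is linked."""
    return n_languages * (n_languages - 1) // 2


for n in (7, 170):
    print(n, hub_links(n), pairwise_links(n))
# 7 7 21
# 170 170 14365
```

With 7 languages the overhead is still modest; at the scale of 170 language versions, the pairwise approach needs several orders of magnitude more links per concept, which is the quadratic growth mentioned above.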

In the projects above, only Wiktionary uses this structure: both EuroWordNet and OmegaWiki make use of abstract entities that serve as “anchors” to the linguistic material (please see the illustrations of the structures in the respective chapters) and WordNet does not cover any other languages than English in the first place and therefore no translations at all. Regarding synonymy it does link between the sets directly, without recourse to an abstract entity.

Required Knowledge of Foreign Language and Language Specificity of Definitions

One of the most valuable factors of the structure of Wiktionary is the fact that in order to understand what the respective foreign-language material means, one does not need to have a high understanding of this language. This is obviously different in monolingual dictionaries like for instance WordNet, which does not cover any interlingual connections and can thus not be evaluated in this respect. In OmegaWiki, there can be translations of definitions in all languages, making it multi-lingual. However, the content of the definition is translated instead of being language-specifically formulated. Thus fine differences in the meaning may not be representable in certain cases. EuroWordNet also keeps the “definition” rather language-specific, due to links between synsets in different languages. However, here, a “translation” can only be represented if there actually exists an equivalent synset in the language in question. If this is not the case, there is no possibility for representing the meaning of a synset in a different language at all – this can be approached only vaguely via the relational information in the respective wordnets. However, this cannot be seen as a deficit of the structure of EuroWordNet, as the designated objective is to display equivalence relations between languages and not to give translations of foreign-language material.

The Scope of an Entry

It may be very handy to represent different senses of one expression collectively, for instance in those cases where a user wants to look something up and is not quite sure, what it refers to, as in these cases, the gloss may not be sufficient for deciding between the different senses. If they are collected, it reduces the manual search effort. Furthermore, all information that is shared between the different senses (these could include pronunciation, etymology, morphology etc.) can be displayed in a more effective way, with shared features displayed accordingly. There is no true disadvantage of displaying different senses of an expression grouped together but certain advantages that can make the structure clearer and more concise, and the look-up user-friendlier. In the list of the discussed projects, all make use of this entry-wide collection of senses in regards to the look-up, even though the representations vary – WordNet and EuroWordNet refer to them via synsets, OmegaWiki allows a display of either a DefinedMeaning (one sense) or an expression (with possibly various senses) on one page and Wiktionary even groups together different expressions that may belong to different languages as long as the surface form is identical. Since none of them prohibits the display of collected senses per expression when it comes to searching the database, this does not serve as a feature of differentiation between EuroWordNet, WordNet, OmegaWiki and Wiktionary but should be taken into account when it comes to structuring linguistic Wikidata-entries (even though lucidity may not be a main criterion as the Wiktionaries can process the Wikidata-information very efficiently). However, only Wiktionary allows for structural flexibility in respect of storing information that holds for more than one sense.

Covered Linguistic Material

WordNet and EuroWordNet only cover words or multiword expressions from a limited number of categories of speech. They do not cover phrases, semantically non-referring material, colloquial language or inflected forms. Wiktionary does cover them and OmegaWiki does partly and would at least have the possibility due to the underlying structure.

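Features

With regard to the possibilities of representing various kinds of linguistic information, the different projects do not differ greatly in certain respects: all four offer example sentences, translations (except for WordNet) and, obviously, some form of definition. The same holds for relational information such as antonymy, hypernymy etc. However, information about phonology or morphology (especially inflected forms) does not appear in WordNet or EuroWordNet, and etymological information is covered only to a very limited extent by these two as well as by OmegaWiki. Of the four projects, Wiktionary is the only one that takes this feature into account. Media files can be included in OmegaWiki and Wiktionary.

lemon

Since the lemon model may be a promising model for our purposes, its main structure is briefly outlined here.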

Structure

The purpose of ”lemon” is to offer a model for ”sharing lexical information on the semantic web”. In our case, it may be useful for the structuring of Wikidata as it imposes a structure that offers just the right amount of granularity that we wish to represent in the third proposal, i.e. it is language-dependent and differentiates between a surface form of a lexical entry and its sense, which refers to an ontology entry. In both relation types, multiple relations can be represented and can also be subcategorized into ”common form” versus ”variant” etc. The different categories are built as illustrated in the figure and will be explained separately.

Lexicon
A Lexicon contains all Lexical Entries of a certain language and labels them with the according language code.
Lexical Entry
A Lexical Entry represents one lexeme, i.e. a word or multi-word term in a certain language that has one or more forms and one or more senses.
Lexical Form
The Lexical Form of a certain Lexical Entry is described by its written representation. There can be various Lexical Forms of a Lexical Entry that may be categorized into Canonical Form – the usual written representation –, Other Form – which can be a different and less common spelling or for example an inflected form – and Abstract Form – a non-realizable form, for instance the stem of a word. Via properties, alternative forms can further be described. An example could be ”property: category plural”. Alternative written representations that are equally common can be represented accordingly. It is not necessary to decide for one variant.
Lexical Sense & Ontology
The Lexical Sense represents the relationship between the lexical entry and the ontology entry, thus, to what the entry refers. In the case of homonyms or polysemous words, one Lexical Entry refers to more than one Ontology Entry. Since one Ontology Entry can also be referred to by different Lexical Entries, there is a many-to-many-relationship between Lexical Senses and Ontology Entries.

Further features that may be of value for our plans include the possibility of representing either words or multi-word terms as lexical entries. It is also possible to store information about the decomposition into words and morphological compounds. Also it can be advantageous to be able to assign properties to the relations. The model also offers modules for automatic inflection generation, which will not be covered at this point but that may be interesting as soon as it comes to deciding if and how automatic information generation shall be handled.

Example

The following example, which also explains how translations can be handled, is taken from the lemon cook-book.

The left side of the illustration above shows three different Lexica (English, French, German), each of which has one Lexical Entry (”cat: LexicalEntry” in English, ”chat: LexicalEntry” in French, ”katze: LexicalEntry” in German). Each of these LexicalEntry-boxes points to a Lexical Form with the written representation of the entry including the language label. These relations carry the value ”canonicalForm”. As described above, there can also be alternative forms represented at various points in the system. Every Lexical Entry also points to a Sense and these Senses are all interconnected, carrying the label ”translationOf”. Thus, translation happens between Senses. These Senses would all point to the same Ontology Entry as explained above, although this is not illustrated in the figure.
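The structure described for this example can be approximated in plain Python. This is only a sketch: the class and attribute names are loosely modelled on the terms used above rather than on the lemon vocabulary itself, the identifier "ontology:Cat" is a made-up placeholder for the shared Ontology Entry, and the plural form is added merely to show a non-canonical form.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalForm:
    written_rep: str
    kind: str = "canonical"      # "canonical", "other" or "abstract"


@dataclass(eq=False)             # senses are compared by identity, not by content
class LexicalSense:
    reference: str               # placeholder identifier of an Ontology Entry
    translation_of: list = field(default_factory=list)


@dataclass
class LexicalEntry:
    forms: list
    senses: list

    def canonical_form(self):
        return next(f.written_rep for f in self.forms if f.kind == "canonical")


@dataclass
class Lexicon:
    language: str
    entries: list = field(default_factory=list)


def link_translation(a, b):
    """Translation is a symmetric link between two senses."""
    a.translation_of.append(b)
    b.translation_of.append(a)


cat_sense, chat_sense, katze_sense = (LexicalSense("ontology:Cat") for _ in range(3))
link_translation(cat_sense, chat_sense)
link_translation(cat_sense, katze_sense)

cat = LexicalEntry([LexicalForm("cat"), LexicalForm("cats", "other")], [cat_sense])
chat = LexicalEntry([LexicalForm("chat")], [chat_sense])
katze = LexicalEntry([LexicalForm("Katze")], [katze_sense])
lexica = [Lexicon("en", [cat]), Lexicon("fr", [chat]), Lexicon("de", [katze])]


def translations(entry, lexica):
    """Find entries in any lexicon whose senses are linked to this entry's senses."""
    linked = [t for sense in entry.senses for t in sense.translation_of]
    return [
        (lexicon.language, other.canonical_form())
        for lexicon in lexica
        for other in lexicon.entries
        if any(s is t for s in other.senses for t in linked)
    ]


print(translations(cat, lexica))  # [('fr', 'chat'), ('de', 'Katze')]
```

The links are attached to senses, not to entries or forms, so a polysemous entry could carry different translations for each of its senses.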


Proposals

There are three main proposals that were made regarding the restructuring/extension of Wikidata in order to support Wiktionary.

Initial Proposal

The initial proposal was made by Denny Vrandečić and first announced on June 19, 2013. It is based on the introduction of two new entity types to Wikidata: expression and sense.

While the typical Wikidata-item may have a label in every language (the English label for Q1749 is “Copenhagen”, the Danish label of the same item is “København“ etc.), regarding an expression, there would only be one label altogether. Since in this proposal, the term (word or multiword term) itself together with its linguistic information is what is of interest, it seems clear that there may not be any translated word forms in the different languages when talking about the same expression. An expression itself is dependent on the language it belongs to, the English word “Berliner” is a different expression than the German word “Berliner”. The expression “Berliner” would thus be dependent on the morphological surface form (and not have a different label – the French translation Berlinois/e or anything similar). In the latter case, there are (at least) two different meanings to the expression that are attached to it – “person from Berlin” and “doughnut with a sweet filling” – regardless of their different etymologies. Likewise, it would be the same if they required a different pronunciation, hyphenation etc. as long as the spelling is identical. As explained in the terminology, the notion of sense is introduced.

These short descriptions (such as “person from Berlin”) are called glosses. Hence, the expression “Berliner (German)” has two different senses that can be referred to with the glosses “person from Berlin” and “doughnut with a sweet filling”, respectively. The expression “Berliner (English)” has one sense, which can be referred to with the gloss “person from Berlin”. In Wikidata, there would therefore be two pages: “Berliner (English)” with the section “person from Berlin” and “Berliner (German)” with the sections “person from Berlin” and “doughnut with a sweet filling”.

Linguistic properties would be registered as statements by the users and both an expression and a sense can have statements. While in the case of “Berliner (German)” the statement regarding hyphenation would be attached to the expression, the statements regarding for instance synonymy or translation would need to be associated with the according senses. This proposal does not plan any search-aliases for inflections. Every derivational term (plural forms, inflected verbs etc.) will be a discrete expression.

Alternative Proposal

The alternative proposal was made by User:Micru (David Cuenca) and User:Francis Tyers. It served as a reaction to the first one and was announced on July 1, 2013. It is based on the introduction of two new entity types (defined meaning, bond) and one new data type (a paradigm).

One of the main differences from the initial proposal is the split between expressions and their senses: while in the initial proposal all senses of one expression are collectively listed on one page, in the alternative proposal there will be one page per sense. Following a similar terminology, these senses will be language-dependent as well (that is: “Berliner (German) – person from Berlin” will be a different entity than “Berliner (English) – person from Berlin”). What is referred to as sense in the initial proposal is called defined expression in this proposal, similar to the terminology in OmegaWiki although not quite the same, as the OmegaWiki-DM is based on a translatable definition, while in this proposal, the defined expression can have its own definition in each language.

The second new entity type, a bond shall replace property-links to some extent, representing certain statements as results to automatic searches, thus partly being built automatically. This will happen whenever an automatic search/inference allows it. Examples could be the automatic linking of exact translations or exact synonymy. Since there are certain difficulties associated with these kinds of inferences (semantic drifts etc.), a differentiation between strong (for example exact meronymy) and weak links (for example near-synonymy) is proposed in order to handle these phenomena better. Paradigms are language-dependent sets of rules to automatically generate derived forms. In this proposal, those shall serve as aliases to the base form of the defined expression (and optionally be stored as “inflected forms”).
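To illustrate the idea of a paradigm, the following toy sketch generates derived forms from a base form and registers them as aliases. The rule set (a regular English noun with an "-s" plural) and all names are assumptions made for this example; the proposal itself does not specify how paradigms would be encoded.

```python
# A paradigm: language-dependent rules mapping a base form to derived forms.
english_regular_noun = {
    "singular": lambda base: base,
    "plural": lambda base: base + "s",
}


def apply_paradigm(base, paradigm):
    """Generate all derived forms of a base form."""
    return {label: rule(base) for label, rule in paradigm.items()}


def alias_index(bases, paradigm):
    """Map every generated form back to its base form, for look-up."""
    index = {}
    for base in bases:
        for form in apply_paradigm(base, paradigm).values():
            index[form] = base
    return index


forms = apply_paradigm("hamburger", english_regular_noun)
index = alias_index(["hamburger"], english_regular_noun)

print(forms)                # {'singular': 'hamburger', 'plural': 'hamburgers'}
print(index["hamburgers"])  # hamburger
```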

Third Proposal

Fig. 7: Example entries "Hamburger", Third Proposal
Fig. 8: Example entry third proposal

The third proposal emerged predominantly out of discussions about the initial and the alternative one. It was put forward on August 2, 2013 by Denny Vrandečić. This proposal uses a slightly different terminology which is introduced below. The terms sense and gloss, however, are defined the same way as in the terminology.

  • A lexeme, also known as word or lexical entry, is what is described on one page in the lexical part of Wikidata. A lexeme consists of a lemma, a lexical category, a language, a set of forms, a set of senses, and a set of statements.
    • The lemma is the canonical form or dictionary form of the lexeme, e.g. for verbs this is usually the infinitive form, for a noun the nominative singular, etc.
    • The lexical category, also known as the part of speech or word class, defines the lexeme to be either a noun, or a verb, or an adjective, etc. The set of possible values is open and taken from the Wikidata items.
    • The language of a lexeme is taken from Wikidata items, and thus an open set.
    • A form is a specific, fully conjugated or inflected form of the lexeme. A form consists of a representation, a set of lexical properties, and a set of statements. A form always belongs to one (and exactly one) lexeme.
      • A representation is the actual string value realizing a given form, e.g. the string value "wrote" for the past tense of the lexeme for "write". All representations are indexed for search.
      • A lexical property describes the form, e.g. tense or number for verbs, case for nouns, etc. This is an open set and points to Wikidata items.
    • A sense is described by a gloss and has a set of statements. A sense always belongs to one (and exactly one) lexeme (and lexemes belong to one language only). Senses are not independent of lexemes.
      • A gloss is a short description (translatable in all languages of the Wikidata UI) of one sense of the given lexeme.
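
The structure described in the list above can be sketched as plain Python data classes. This is only an illustration under stated assumptions: the class and field names mirror the terms of the proposal but are hypothetical, references to Wikidata items are represented by placeholder strings, and the gloss text is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    representation: str  # the actual string value, e.g. "wrote"
    # Lexical properties would point to Wikidata items; strings are placeholders.
    lexical_properties: list = field(default_factory=list)
    statements: list = field(default_factory=list)


@dataclass
class Sense:
    # Maps a UI language code to the gloss text in that language.
    gloss: dict
    statements: list = field(default_factory=list)


@dataclass
class Lexeme:
    lemma: str
    lexical_category: str  # would point to a Wikidata item
    language: str          # would point to a Wikidata item
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)
    statements: list = field(default_factory=list)


# Example lexeme "write"; the gloss wording is an assumption for illustration.
write = Lexeme(
    lemma="write",
    lexical_category="verb",
    language="English",
    forms=[
        Form("write", ["infinitive"]),
        Form("wrote", ["past tense"]),
    ],
    senses=[Sense({"en": "to form letters or words on a surface"})],
)
```

Because forms and senses are stored inside the lexeme, each of them belongs to exactly one lexeme, as the definitions require, and statements can be attached at any of the three levels.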

The terms Wikidata item, property, string value, qualifier, statement, and claim are taken from the Wikidata glossary and have the same meaning here. See also the further glossary.


Some of the most important alterations to the previous proposals are the following:

The “basic” unit is the lexeme. It is not the expression, as the initial proposal suggested (where each morphological form was a separate expression and thus had a separate entry page), nor the sense, as the alternative proposal suggested (a sense may be only one part of the lexeme if the lexeme is polysemous or homonymous), nor the language-independent surface form, as is the case in Wiktionary.

Senses, forms and lexemes can have statements. This differs from the initial proposal, which demanded a separate entry for every derived form. In the third proposal, inflections are “alternative forms” that may, but need not, have statements separate from their lemma. While the alternative proposal suggested statements at sense level (and, depending on how inflections are implemented, on either all or no inflected forms), the third proposal makes it possible to decide where a statement is most useful. This way, all necessary distinctions can still be drawn, while shared information is stored less redundantly.

Inflections are handled as aliases for search and do not need a separate entry. This is similar to the alternative proposal. However, in the third proposal, decisions about what may be computed automatically (for example via paradigms) are postponed to a stage at which Wikidata holds enough linguistic data for a more detailed discussion of the matter. Figure 8 shows the example entry, taken from the proposal, in more detail than the more schematic "Hamburger" example.
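
How inflected forms can act as search aliases without entries of their own may be sketched as follows. The sketch assumes a simple in-memory index from representation strings to lexemes; the lexeme labels, the function names build_form_index and lookup, and the choice of listed forms are hypothetical.

```python
from collections import defaultdict

# Hypothetical lexemes, each listed with the representations of its forms.
LEXEMES = {
    "write (English verb)": ["write", "writes", "wrote", "written", "writing"],
    "Hamburger (German noun)": ["Hamburger", "Hamburgers", "Hamburgern"],
}


def build_form_index(lexemes):
    """Map every representation string to the lexemes it belongs to."""
    index = defaultdict(set)
    for lexeme_id, representations in lexemes.items():
        for representation in representations:
            index[representation].add(lexeme_id)
    return index


FORM_INDEX = build_form_index(LEXEMES)


def lookup(surface_form):
    """Return the lexemes that have surface_form among their representations."""
    return sorted(FORM_INDEX.get(surface_form, set()))
```

Searching for "wrote" resolves to the lexeme for "write", so the inflected form needs no page of its own. The lookup is case-sensitive, so "hamburger" does not match the German "Hamburger", in line with the definition of surface form used in this document.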

The Hamburger-Example

The "Hamburger"-example would in this case be represented as in Figure 7.

Overview

The table shows a comparison of some details of the three proposals.

Comparison of the Wiktionary/Wikidata-Proposals
Proposal | Entry Scope | Inflection Handling | Statements | Storage in Wikidata
Initial | one expression; each morphological form separately | own entry for each inflection | possible on both expression and sense | via statements
Alternative | one sense of an expression | aliases to base form; stored as inflected form | attached to sense | via bonds
Third | lexeme with all its senses | attached to lexeme via form; can hold own statement | possible on lexeme, sense and form | via statements


Further Wikidata-Terminology

The following definitions are taken from the Wikidata glossary and have been shortened in places.

claim
A claim is a piece of data about the entity on whose page the claim appears. A claim consists of a property (such as "Location") and a value (e.g., "Germany"), or some other relation or composite or missing value. A claim can have qualifiers, such as temporal qualifiers saying that the claim is valid within a specific time frame. Compared to the triplets used in linked data, a claim uses a property to express the predicate of a triplet and a value to express the object of a triplet. Claims form part of statements on item pages.
item
A Wikidata item is a page in the Wikidata main namespace that represents a real-life topic, concept, or subject. Items are identified by a prefixed id, or by a sitelink to an external page, or by a unique combination of multilingual label and description. Items may also have aliases to ease lookup. The main data part of an item is the list of statements about the item. An item can be viewed as the subject-part of a triplet in linked data.
property
A Wikidata property (in some languages translated to attribute) is the descriptor for a data value, or some other relation or composite or possibly missing value, but not the data value or values themselves. Each statement on an item page links to a property and assigns the property one or several values, or some other relation or composite or possibly missing value.
qualifier
A qualifier is a part of the claim that says something about the specific claim, often in a descriptive way. A qualifier might be a term according to a specific vocabulary but can also be a variant descriptive phrase.
statement
A statement is a piece of data about an item, recorded on the item's page. A statement consists of a claim (a property-value pair such as "Location: Germany", together with optional qualifiers), augmented by optional references (giving the source for the claim) and an optional rank (used to distinguish between several claims containing the same property). Wikidata makes no assumptions about the correctness of statements, but merely collects and reports them with a reference to a source.
string
A string (short for character string) is a general term for a sequence of freely chosen characters interpreted as text (e.g. "Hello"), as opposed to data interpreted as a numerical value (3.14), a link to an item (e.g. Q1234) or a more complex datatype (the set {1,3,5,7}). In addition to a string datatype, Wikidata will support language-specific texts: "monolingual-text" and "multilingual-text" as the value of a property.
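
The relation between claim, qualifier, statement and the triplets of linked data, as defined above, can be sketched in Python. The class and field names are hypothetical, and the subject id, qualifier and reference values are placeholders invented for the example; only the property-value pair "Location: Germany" comes from the glossary text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Claim:
    prop: str   # the property, e.g. "Location"
    value: str  # the value, e.g. "Germany"
    # Qualifiers say something about this specific claim.
    qualifiers: dict = field(default_factory=dict)


@dataclass
class Statement:
    claim: Claim
    references: list = field(default_factory=list)  # sources for the claim
    rank: Optional[str] = None  # distinguishes claims with the same property


def as_triple(subject, claim):
    """View a claim as a linked-data triple: (subject, predicate, object)."""
    return (subject, claim.prop, claim.value)


# Placeholder qualifier and reference values (assumptions for illustration).
statement = Statement(
    claim=Claim("Location", "Germany", qualifiers={"valid from": "2013"}),
    references=["example source"],
)

triple = as_triple("Q1234", statement.claim)
```

The item supplies the subject of the triple, the property the predicate and the value the object; qualifiers, references and rank have no place in the bare triple, which is why a statement carries more information than a triple does.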