Wikidata:Lexicographical data/Development/Proposals/2013-07

This was an alternative proposal to the June 2013 proposal.

The Wikidata:Wiktionary proposal is a good starting point. However, it needs some modifications to avoid the problems that Omegawiki had, as outlined in the comments on the previous proposal, and to let Wiktionary communities flourish in their own language environments.


  • Users master one or a few languages and there is no need for a common language
  • Some equivalence between meanings/word senses can happen, but it is not a requirement
  • A single “sense” may have more than one translation, while several senses may share the same translation. Example: “estació balneària” = “bainuetxea”; “estació d’esquí” = “eski estazioa”
  • An expression forms a whole with its meaning and it is possible to cluster them
  • A meaning/word sense in one language may not have a single translation in another language: “oposició”, “laufabrauð”
  • There is no perfect solution, contributor time is limited, and in a free/open-source environment solutions can be improved iteratively

Technical proposal

The proposal introduces two entity types (defined expression and bond) and a new data type (paradigm):

  • A defined expression is a per-language, per-meaning item that contains links to other defined expressions (related, synonym, etc.) in the same language or in different languages (exact match, close match, cultural equivalent, etc.). It may also contain keywords, domains, and pronunciation. The definition and etymology may be given in other languages.
  • A bond is an automatically or manually generated query item for a given defined expression, built from that expression's links, keywords, domains, etc. “Strong bonds” are closely related matches in the same and other languages. “Weak bonds” are matches related either by meaning or by word morphology, in the same and other languages.
  • The paradigm data type is a set of rules for generating derivatives of defined expressions. Inflections, declensions, conjugations, etc. are all kinds of paradigm.
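
The three building blocks above can be pictured as plain records. The following is only a minimal sketch in Python, not the proposed Wikibase schema: the class and field names are assumptions made for illustration, while the ids (W222, B222, W111, W333, W7787) and property names come from the examples further down this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DefinedExpression:
    """One word or phrase, in one language, with one meaning."""
    id: str                                                      # e.g. "W222"
    language: str                                                # e.g. "eng"
    label: str
    descriptions: dict[str, str] = field(default_factory=dict)   # language -> gloss
    definitions: dict[str, str] = field(default_factory=dict)    # language -> definition
    links: list[tuple[str, str]] = field(default_factory=list)   # (property, target id)
    paradigm: Optional[tuple[str, str]] = None                   # (paradigm name, parameter string)


@dataclass
class Bond:
    """Stored query results attached to exactly one defined expression."""
    id: str
    expression_id: str
    strong: dict[str, list[str]] = field(default_factory=dict)   # query name -> item ids
    weak: dict[str, list[str]] = field(default_factory=dict)


# "white whale" in the sense of "obsession", with its bond item.
w222 = DefinedExpression(id="W222", language="eng", label="white whale",
                         descriptions={"eng": "obsession"})
w222.links.append(("figurative sense from", "W111"))
w222.links.append(("same definition as", "W7787"))

b222 = Bond(id="B222", expression_id=w222.id,
            strong={"morphology": ["W111", "W333"]})
```

Keeping links as (property, target) pairs mirrors the flexible, property-based approach the proposal borrows from Wikidata: new relationship types need no change to the record layout.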

What users are expected to do:

  • Maintain their dictionary in their own language (no need to know other languages)
  • Link to languages they know (no need to add links to languages they don’t know)

What users would get back:

  • A semantic dictionary in their own language, with a proper structure for inflections
  • Automatically maintained cross-language equivalents (when possible) with minimal human intervention

Defined expression examples

An example of how the English Wiktionary entry “white whale” would be converted:

Internal structure on Wikidata

  • Defined expression 1 (W111):
    • label (same for all languages, can have transliterations as aliases): white whale
    • description [eng]: animal (<-this is equivalent to glosses)
    • description [isl]: dýr
    • <definition>[eng] (formatted multilingual string): A cetacean, Delphinapterus leucas, found in the Arctic Ocean.
    • <definition>[cat]: Un cetaci, Delphinapterus leucas, que viu a l’Oceà Àrtic.
    • <etymology>[eng] (formatted multilingual string): From white +‎ whale.
    • <direct translation>[cat]: balena blanca (W0055)
    • <direct translation>[est]: valgevaal (W888)
    • <represents concept> beluga whale (Q132072)
  • Defined expression 2 (W222):
    • label: white whale
    • description [eng]: obsession
    • <figurative sense from>: white whale (W111)
    • <etymology>[eng] (formatted multilingual string): reference to Herman Melville's 1851 novel Moby-Dick.
    • <same definition as>: obsession (W7787)
  • Defined expression 3 (W333):
    • label: white whale
    • description [eng]: printing plate
    • <definition>[eng] (formatted multilingual string): A printing plate, used to manufacture a particular sports card, that is then issued as a collectible itself.
    • <domain> trading cards (W0987) (<-these could be Q items instead)
    • <domain> manufacturing (W8690)

Notes to these examples:

  • Defined expression 2 (W222) has no definition of its own because it takes it from another item (transclusion); at the same time it forms a strong bond with W7787.
  • W111 has only some translations, and it also links to the concept it represents (a Q item), which can be used as a linking hub for well-established multilingual concepts. Other defined expressions may point to W111 as a translation, and W111 may point to defined expressions that are not linked to the Q item. The relationships will not be perfect, and that is acceptable: they are created as needed and improved over time.
  • Each time a defined expression item is created, a corresponding bond item is automatically associated with it. This bond item contains several automatically generated queries that can be manually modified.
  • User-generated Wiktionary-domain queries, like "manufacturing domain words related to animals" or "size related terminology", can be associated with the bond item either manually or automatically.
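
The transclusion mentioned in the first note can be sketched as a lookup that follows <same definition as> links until it finds a definition. This is an illustrative Python sketch only: the dictionary layout and the function name resolve_definition are assumptions, and the definition text stored on W7787 is a placeholder, since this page does not give it.

```python
# Minimal in-memory store. Ids and the property name follow the examples above;
# the definition text of W7787 is a placeholder, not real data.
ITEMS = {
    "W222": {"label": "white whale", "definition": {}, "same definition as": "W7787"},
    "W7787": {"label": "obsession", "definition": {"eng": "(definition stored on W7787)"}},
}


def resolve_definition(item_id, lang, items, _seen=None):
    """Return the definition of item_id in lang, following <same definition as> links.

    Returns None when no definition in that language can be reached.
    """
    seen = _seen or set()
    if item_id in seen:                      # guard against circular links
        return None
    seen.add(item_id)

    item = items[item_id]
    text = item.get("definition", {}).get(lang)
    if text is not None:
        return text

    target = item.get("same definition as")
    if target is None or target not in items:
        return None
    return resolve_definition(target, lang, items, seen)


print(resolve_definition("W222", "eng", ITEMS))   # (definition stored on W7787)
print(resolve_definition("W222", "cat", ITEMS))   # None
```

An item's own definition takes precedence over a transcluded one, so a community could later replace the shared text with a local one without touching the link.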

Bond examples

Bond items are a collection of queries associated with a defined expression item. The first query in the bond item finds defined expressions with the same or similar morphology. The second finds related meanings of that defined expression in the same language. The third finds relations in other languages. (There could be a fourth query for phonetic resemblance or rhymes, but it is not required in the first stage.)

For instance, B222, the automatically generated bond item associated with W222 (white whale - obsession), would have these parts:

  • morphology
    • strong bond: W111 and W333
    • weak bond: “white” (B99), “whale” (B45)
  • same language meaning
    • strong bond: obsession-2 (W7787), monomania-2 (W768)
    • weak bond: compulsion (W934), fixation (W345), attention-related manias (B75302)
  • cross language meaning
    • strong bond: (ca) (cultural equivalent) fal·lera (W3445)
    • weak bond: (it) ossessione (W3444), (fr) obsession, etc.

Notes to these examples:

  • This is only an example. Strong bonds can also include sub-categories such as opposites.
  • Although "monomania-2" appears in the example, this is just to indicate that it is a particular definition of monomania. In practice the "label" associated with that "monomania" should appear either as a superscript or on mouse hover.
  • The English Wiktionary "white whale" page is equivalent to "show all morphology strong bonds for 'white whale'". In Wiktionary, however, the presentation will be different, and an elaborate user interface will be needed so that users do not have to interact with Wikidata directly.
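
The first of the bond queries described above (morphology) could be approximated as follows. This is a rough Python sketch under simplifying assumptions: "same morphology" is taken to mean an identical label, and "similar morphology" to mean a label that is one of the words of a multi-word label. The ids W9001 and W9002 are hypothetical, since the proposal only names the bonds of "white" and "whale" (B99, B45), and the sketch returns expression ids rather than bond ids.

```python
EXPRESSIONS = {
    "W111": {"label": "white whale"},
    "W222": {"label": "white whale"},
    "W333": {"label": "white whale"},
    "W7787": {"label": "obsession"},
    # Hypothetical ids for the component words.
    "W9001": {"label": "white"},
    "W9002": {"label": "whale"},
}


def morphology_bonds(item_id, expressions):
    """Return strong (identical label) and weak (component word) morphology matches."""
    label = expressions[item_id]["label"]
    words = set(label.split())
    strong, weak = [], []
    for other_id, other in expressions.items():
        if other_id == item_id:
            continue
        if other["label"] == label:
            strong.append(other_id)          # same written form, different meaning
        elif other["label"] in words:
            weak.append(other_id)            # a word contained in the expression
    return {"strong": sorted(strong), "weak": sorted(weak)}


print(morphology_bonds("W222", EXPRESSIONS))
# {'strong': ['W111', 'W333'], 'weak': ['W9001', 'W9002']}
```

Because the result is recomputed from the stored labels, the bond needs no manual maintenance when a new sense of "white whale" is added; a manually edited bond would instead store overrides on top of this result.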

Paradigm examples

The paradigm data type is a set of rules for generating derivative forms from the defined expression. In practice it will be a call to a Lua module that generates the forms. Example:

  • <Ancient Greek third declension noun> παράδειγμα|παραδείγμα|τ|παραδειγμά

This is treated as a call to the Lua module associated with the property <Ancient Greek third declension noun>, with the text string passed as its parameters. The module generates all the inflections of the noun, as shown in this table:

Each language will have its own set of paradigms for generating verb forms, plurals, etc. Some of them will have provisions for irregular forms, and it will be possible to enter the forms entirely manually if they cannot be generated automatically (i.e. with a Lua module that simply outputs its input parameters).

These generated forms will be used as aliases for the defined expression and will show up in searches and bonds. Optionally, these script-generated inflected forms could be stored in the database as "inflected forms" (I items).
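
The call mechanism described in this section can be sketched as a registry that maps a paradigm name to a function and splits the parameter string on "|". The sketch is written in Python purely for illustration (the proposal itself foresees Lua modules). The paradigm names, the toy English plural rule, and the id W9002 are assumptions; only the identity fallback is taken directly from the text above.

```python
def manual_forms(params):
    """Manual fallback: the forms are exactly the parameters that were supplied."""
    return list(params)


def english_regular_plural(params):
    """Hypothetical toy paradigm: the singular plus a plural in -s."""
    singular = params[0]
    return [singular, singular + "s"]


# Paradigm name -> generating function (stands in for "property -> Lua module").
PARADIGMS = {
    "manual forms": manual_forms,
    "English regular plural noun": english_regular_plural,
}


def expand_paradigm(name, raw):
    """Split the pipe-separated parameter string and call the paradigm's function."""
    return PARADIGMS[name](raw.split("|"))


def build_alias_index(expressions):
    """Map every generated form to the ids of the expressions it belongs to."""
    index = {}
    for item_id, (name, raw) in expressions.items():
        for form in expand_paradigm(name, raw):
            index.setdefault(form, set()).add(item_id)
    return index


print(expand_paradigm("English regular plural noun", "whale"))   # ['whale', 'whales']
print(expand_paradigm("manual forms", "child|children"))         # ['child', 'children']

index = build_alias_index({"W9002": ("English regular plural noun", "whale")})
print(index["whales"])                                           # {'W9002'}
```

The alias index is what would let a search for an inflected form such as "whales" land on the defined expression for "whale".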

User experience on Wiktionary

From the user's point of view, when browsing Wiktionary:

  • The structure of the page would be more or less the same as in Wiktionary now, perhaps clearer (word page > language(s) > definition(s) > related words / translations (for each definition))
  • Language interwiki linking will be automatic, but the system for generating the links will differ from Wikipedia's. An interwiki link will be created automatically as soon as there is relevant information about the word to show in a particular language.
  • It will be possible to show information in more than one language, so if there is no definition in the user's language but there is information in another language the user knows, it will be shown in that language.
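
The language fallback in the last point amounts to walking the user's languages in order of preference and taking the first one that has a definition. A minimal Python sketch follows; the function name and the idea of an ordered preference list are assumptions, and the definitions are those of W111 from the example above.

```python
def pick_definition(definitions, user_languages):
    """Return (language, text) for the first preferred language that has a definition.

    definitions: language code -> definition text
    user_languages: language codes ordered from most to least preferred
    """
    for lang in user_languages:
        if lang in definitions:
            return lang, definitions[lang]
    return None


w111_definitions = {
    "eng": "A cetacean, Delphinapterus leucas, found in the Arctic Ocean.",
    "cat": "Un cetaci, Delphinapterus leucas, que viu a l’Oceà Àrtic.",
}

# A reader who prefers Icelandic, then Catalan, then English gets the Catalan text.
print(pick_definition(w111_definitions, ["isl", "cat", "eng"]))
```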

When adding new definitions to Wiktionary:

  • form-based, with a flexible system of properties similar to Wikidata's
  • possible to share definitions or etymologies from other words (transclusion)
  • possible to establish relationships with the specific meanings of other words using properties. The description will be used to disambiguate (as happens now in Wikidata when connecting a property to an item that has the same name as others)

When adding translations to Wiktionary:

  • two possible ways of doing it: (1) connecting to the Q item, which imports the translations connected to that item from other languages, or (2) directly specifying the word and disambiguating the meaning (here some knowledge of the target language might be useful).
  • the added translations will spread through the system and will show up in all relevant Wiktionaries.
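
The two routes for adding translations can be combined in a single lookup, sketched below in Python. This is only an illustration under assumptions: the dictionary layout and function name are invented, and W5001 (French "béluga") is a hypothetical item added to show the Q item acting as a hub. W111, W0055, W888 and Q132072 come from the example above.

```python
EXPRESSIONS = {
    "W111": {"lang": "eng", "label": "white whale",
             "concept": "Q132072", "direct": ["W0055", "W888"]},
    # Hypothetical item: linked only to the Q item, with no direct translation links.
    "W5001": {"lang": "fra", "label": "béluga",
              "concept": "Q132072", "direct": []},
}


def translations(item_id, expressions):
    """Collect translation ids from direct links and from a shared Q item."""
    me = expressions[item_id]
    found = set(me.get("direct", []))            # route (2): explicit links
    concept = me.get("concept")
    if concept is not None:                      # route (1): the Q item as hub
        for other_id, other in expressions.items():
            same_concept = other.get("concept") == concept
            if other_id != item_id and same_concept and other["lang"] != me["lang"]:
                found.add(other_id)
    return sorted(found)


print(translations("W111", EXPRESSIONS))    # ['W0055', 'W5001', 'W888']
print(translations("W5001", EXPRESSIONS))   # ['W111']
```

The second call shows how a translation "spreads": the hypothetical French item never linked to W111 directly, yet it picks it up through the shared concept.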

When adding inflections:

  • advanced users will create the modules that generate inflection patterns (paradigms). The effect is the same as using templates to generate inflections, but these modules will be reusable across projects (for instance when showing information about a Greek word in the Russian Wiktionary, or that same Greek word in the Chinese Wiktionary).
  • in some cases it will be possible to use a property <same plural inflections as> "word"

How is this different from Omegawiki

Omegawiki has a rigid structure in which users have to link words with meanings. This is not practical because sometimes the meanings either (a) are not translated into the user's language, or (b) have no equivalent meaning to link to at all. In the proposed system there is no predefined structure: each word in each language for each meaning has an entity of its own. The system builds the structure from the information users supply (connections with other words using properties), and where convenient it reuses the concepts from Wikipedia (Q items) as a central hub.

About this proposal

This proposal was prepared by User:Francis Tyers and User:Micru as an alternative to Wikidata:Wiktionary