Wikidata:Lexicographical data/Development/Proposals/2013-08/ne


This is a draft proposal about how Wikidata could support Wiktionary. This proposal summarizes one set of views from the previous discussions and results. Previous proposals and discussions are at Wikidata:Wiktionary/Previous.

Please use the discussion page to propose changes to this proposal.

Terminology

Unfortunately, the terminology around dictionaries and lexical resources is easily confused. We therefore define a terminology that should be used strictly and consistently throughout the proposal. To make this obvious, technical terms are set in italics throughout, like this.

  • A lexeme, also known as word or lexical entry, is what is described on one page in the lexical part of Wikidata. A lexeme consists of a lemma, a lexical category, a language, a set of forms, a set of senses, and a set of statements.
    • The lemma is the canonical form or dictionary form of the lexeme, e.g. for verbs this is usually the infinitive form, for a noun the nominative singular, etc.
    • The lexical category, also known as the part of speech or word class, defines the lexeme to be either a noun, or a verb, or an adjective, etc. The set of possible values is open and taken from the Wikidata items.
    • The language of a lexeme is taken from Wikidata items, and thus an open set.
    • A form is a specific, fully conjugated or inflected form of the lexeme. A form consists of a representation, a set of lexical properties, and a set of statements. A form always belongs to one (and exactly one) lexeme.
      • A representation is the actual string value realizing a given form, e.g. the string value "wrote" for the past tense of the lexeme for "write". All representations are indexed for search.
      • A lexical property describes the form, e.g. tense or number for verbs, case for nouns, etc. This is an open set and points to Wikidata items.
    • A sense is described by a gloss and has a set of statements. A sense always belongs to one (and exactly one) lexeme (and lexemes belong to one language only). Senses are not independent of lexemes.
      • A gloss is a short description (translatable in all languages of the Wikidata UI) of one sense of the given lexeme.
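
The terminology above can be read as a small data model. The following is only a minimal sketch of that reading, not the proposal's actual implementation: the class names, field names, and the helper `indexed_representations` are hypothetical, and the item ids are the ones used in the example entry of this proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    prop: str                      # property name or id (hypothetical field name)
    value: str
    qualifiers: dict = field(default_factory=dict)


@dataclass
class Form:
    representation: str            # the actual string, e.g. "apples"
    lexical_properties: list = field(default_factory=list)  # Wikidata item ids
    statements: list = field(default_factory=list)


@dataclass
class Sense:
    glosses: dict = field(default_factory=dict)  # UI language code -> gloss text
    statements: list = field(default_factory=list)


@dataclass
class Lexeme:
    lemma: str
    lexical_category: str          # Wikidata item id
    language: str                  # Wikidata item id
    forms: list = field(default_factory=list)
    senses: list = field(default_factory=list)
    statements: list = field(default_factory=list)


def indexed_representations(lexeme):
    """Strings the search index would cover: every form's representation."""
    return {form.representation for form in lexeme.forms}


apple = Lexeme(
    lemma="apple",
    lexical_category="Q1084",      # noun
    language="Q1860",              # English
    forms=[Form("apples", ["Q146786"])],                  # plural
    senses=[Sense({"en": "fruit of the apple tree"})],
)
print(indexed_representations(apple))
```

Forms and senses are held inside the lexeme rather than stored separately, which mirrors the rule that each of them belongs to exactly one lexeme.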

The terms Wikidata item, property, string value, qualifier, statement, and claim are taken from the Wikidata glossary and have the same meaning here as there.

Notes

  • Transliterations in other scripts can be either handled by two separate lexemes or by a single lexeme with a statement on each form with the transliteration property pointing to a string value, with a qualifier describing the script. If the latter, transliterations will be indexed for search too.
  • Orthographic variants can be either done as two separate lexemes or by a single lexeme with statements on the appropriate level and qualifiers explaining the variant. If the latter, the variants will also be indexed for search.
  • Translations can be done either from sense to sense, or by a sense referencing a common Wikidata item. If the latter is done, the translations will be automatically displayed and kept up to date. This is only possible when the translation relation is symmetric and transitive, which is often not the case, but is frequent enough to merit a specific implementation.
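
The last note, translations derived from a shared Wikidata item, can be sketched as a simple grouping step. This is an illustration under assumptions: the function name, the dictionary layout, and the item id `Q-apple-fruit` are hypothetical placeholders; only the sense ids S1989 and S9000 come from the example entry in this proposal.

```python
from collections import defaultdict


def translations_by_shared_item(senses):
    """Map each sense id to the senses in other languages sharing its item."""
    by_item = defaultdict(list)
    for sense_id, info in senses.items():
        by_item[info["item"]].append(sense_id)

    result = {}
    for sense_id, info in senses.items():
        result[sense_id] = sorted(
            other
            for other in by_item[info["item"]]
            if senses[other]["language"] != info["language"]
        )
    return result


# "Q-apple-fruit" is a made-up placeholder, not a real Wikidata item id.
senses = {
    "S1989": {"lemma": "apple", "language": "English", "item": "Q-apple-fruit"},
    "S9000": {"lemma": "Apfel", "language": "German", "item": "Q-apple-fruit"},
}
print(translations_by_shared_item(senses))
```

Because every sense in a group is treated as a translation of every other, the result is symmetric and transitive by construction, which is exactly why this shortcut only fits translations that really have those properties.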

Example entry

  • (lexeme) L123 (won't be displayed)
  • (lemma) apple
  • (language) English (i.e. Q1860)
  • (lexical category) noun (i.e. Q1084)
  • (statement) pronunciation → IPA /ˈæpl̩/
  • (statement) syllable → "ap-ple"
  • (form) F272 (won't be displayed)
    • (representation) apples
    • (lexical property) plural (i.e. Q146786)
    • (statement) rhymes with → grapples (F404)
  • (sense/meaning) S2011 (won't be displayed)
    • (gloss) (en) tree of the genus Malus
    • (gloss) (de) Baum der Gattung Malus
  • (sense) S1989 (won't be displayed)
    • (gloss) (en) fruit of the apple tree
    • (gloss) (de) Frucht des Apfelbaumes
    • (statement) translation → Apfel (i.e. S9000, which is connected to W234, which has the lemma 'Apfel' and the language 'German')
    • (statement) hypernym → fruit (i.e. S239)
  • (linguistically related words)

etc.

Note that this is a single entry, i.e. forms and senses do not have their own pages but are part of the lexeme they depend on.

Editing in Wikidata

Wikidata users are not required to be multilingual: they can view any item or lexeme in their preferred UI language and also edit it there. This mostly works the same way edits are already made in Wikidata. Making the senses work will be more challenging, as they potentially require a large number of translations. Here, improvements in language fallback and work on the multilingual text datatype will provide additional help.

Wiktionary data will not get a site of its own; it will be stored in Wikidata proper.

Usage in Wiktionary

Language links (Phase 1)

  • If one holds that links should exist between lexemes, a problem arises: whereas Wikipedia articles and Wikidata items map basically 1-to-1, this is not the case for Wiktionary. A Wiktionary page contains all lexemes that share the same representation or lemma across languages and lexical categories; conversely, a specific form might have its own Wiktionary page although there is no lexeme in Wikidata for it. It follows that, if one wants to link lexemes, language links for the actual words in Wiktionary should not be moved to Wikidata and provided from there. On the other hand, a number of pages can be linked through Wikidata, especially outside the main namespace, such as the Wiktionary equivalent of the village pump. These links cannot be provided by an extension that creates language links automatically.
    In order to resolve these perceived issues, two functionalities have to be developed:
    • an extension that creates language links automatically for Wiktionary. Wikidata Phase 1 functionality should be switched on only after this is deployed to Wiktionary (in order to avoid any motivation to create items for all current Wiktionary main namespace pages)
    • extend Wikidata with the functionality to provide arbitrary access to any Wikidata item from any page on the client. Currently, access to data is restricted to only the connected item: bug 47930.
  • If, on the other hand, one holds that interwiki links should be between pages, as they currently are, then housing inter-Wiktionary links on Wikidata would be trivial, because every main-namespace link merely points to the page with the same name on the other Wiktionary.
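
The page-based view in the last bullet amounts to a lookup by identical title. A minimal sketch of that idea, assuming a hypothetical function `same_title_links` and a hypothetical in-memory map of which titles exist on which Wiktionary (a real extension would query the wikis instead):

```python
from urllib.parse import quote


def same_title_links(title, local_wiki, titles_by_wiki):
    """Link to every other Wiktionary that has a page with the same title."""
    links = {}
    for wiki, titles in sorted(titles_by_wiki.items()):
        if wiki != local_wiki and title in titles:
            # MediaWiki URLs use underscores for spaces in page titles.
            path = quote(title.replace(" ", "_"))
            links[wiki] = f"https://{wiki}.wiktionary.org/wiki/{path}"
    return links


# Illustrative data only: which page titles exist on which Wiktionary.
titles_by_wiki = {
    "en": {"apple", "Apfel"},
    "de": {"apple", "Apfel"},
    "fr": {"pomme"},
}
print(same_title_links("apple", "en", titles_by_wiki))
```

Note that no lexeme or item is involved at all: the link is decided purely by the page name, which is why this approach does not need one Wikidata item per main-namespace page.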

Using the data (Phase 2)

Once a Wiktionary is connected as a client to Wikidata, it is possible to access and display any data from Wikidata on any Wiktionary article. Note that this is explicitly a possibility: any Wiktionary project can decide from word to word or language to language whether and how they want to use any or all of the data in Wikidata.

Smaller Wiktionaries, especially for languages other than their primary one, are expected to basically just visualize the data from Wikidata in a more appropriate way. For a long time, Wikidata will not be as visually coherent and concise as a Wiktionary page can be. Larger communities, especially regarding their primary language, are rather expected to use the data from Wikidata to check part of the entries they already have and to create automatic reports listing errors. In this way Wikidata provides the Wiktionaries with a layer of quality assurance that is harder to breach than when the entries are merely maintained in isolation.
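
An automatic report of this kind could be as simple as a set comparison between the forms a Wiktionary entry lists and the forms recorded in Wikidata. The sketch below assumes a hypothetical function `form_report` and plain lists of strings as input; how either list would actually be obtained is left open.

```python
def form_report(local_forms, wikidata_forms):
    """Compare the forms of a local entry with the forms stored in Wikidata."""
    local, reference = set(local_forms), set(wikidata_forms)
    return {
        "missing_locally": sorted(reference - local),
        "not_in_wikidata": sorted(local - reference),
    }


# Illustrative input: the local entry contains the misspelling "appels".
report = form_report(["apple", "appels"], ["apple", "apples"])
print(report)
```

A discrepancy does not say which side is wrong; it only flags the entry for a human editor to review.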

Also it is not expected that everything represented in Wiktionary can be added to Wikidata. Complex etymologies, usage notes, discussions might elude Wikidata for a long while, if not forever.

Wiktionary remains Wikimedia's prime mode of disseminating lexical knowledge in the languages it is available in, just as the Wikipedias obviously retain their primacy over the respective Wikidata items. Wikidata is a supporting project, which offers a few novel modes of help to the Wiktionary projects, which they can use if they so wish.

Possible further plans

The first implementation pass should not spend too much time on making the UI for forms compact. Once we have some data and see some usage patterns, this can be done by creating tabular views per language and lexical category. Automatic creation of forms based on the morphological class of the lexeme could also be attempted then. Both extensions should wait until sufficient data has come in, so that the appropriate technical decisions can be made.
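
To make the idea of generating forms from a morphological class concrete, here is a deliberately tiny sketch. Everything in it is an assumption for illustration: the class name `en-noun-regular`, the suffix-only rule format, and the function `generate_forms` are hypothetical, and real morphology needs far richer rules than appending a suffix.

```python
# Hypothetical rule table: morphological class -> (lexical property item, suffix).
MORPHOLOGICAL_CLASSES = {
    "en-noun-regular": [("Q146786", "s")],   # plural: lemma + "s"
}


def generate_forms(lemma, morph_class):
    """Return (representation, lexical properties) pairs for a lemma."""
    return [
        (lemma + suffix, [prop])
        for prop, suffix in MORPHOLOGICAL_CLASSES[morph_class]
    ]


print(generate_forms("apple", "en-noun-regular"))
```

Generated forms would still need review, since irregular lexemes would be assigned wrong representations by any purely rule-based class.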

Technical implementation details

Lexemes, forms, and senses are entities, but forms and senses are not described on wiki pages of their own; they are described as part of their containing lexeme and are connected to it through a claim of has form or has sense. A lemma is either (a required claim with a string value on the lexeme) or (the monolingual label). Lexemes have no description that can be set by the user; it is generated automatically from the lexical category and language. The lexical category and language are both properties used in claims on the lexeme, and both are required to be set to a concrete value. Lexical properties are likewise claims on the forms. The gloss of a sense is either (a claim with a multilingual text value) or (a multilingual description). The representation of a form is either (a required claim with a string value) or (the monolingual label). There can be only one lemma, one lexical category, and one language per lexeme, only one representation per form, and only one gloss per sense. Maybe there should be similar restrictions for transliterations and sense references (as per the notes).
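
The cardinality restrictions and the generated description can be sketched as a small check over claims. This assumes the variant in which lemma, representation, and gloss are modelled as claims; the names `REQUIRED_ONCE`, `cardinality_errors`, and `generated_description`, as well as the "language, then lexical category" description format, are hypothetical.

```python
from collections import Counter

# Properties that must occur exactly once per entity kind (from the text above).
REQUIRED_ONCE = {
    "lexeme": ("lemma", "lexical category", "language"),
    "form": ("representation",),
    "sense": ("gloss",),
}


def cardinality_errors(kind, claims):
    """List violations of the 'exactly one' rules for (property, value) claims."""
    counts = Counter(prop for prop, _ in claims)
    return [
        f"{kind}: expected exactly one '{prop}', found {counts[prop]}"
        for prop in REQUIRED_ONCE[kind]
        if counts[prop] != 1
    ]


def generated_description(claims, labels):
    """Build a lexeme description from language and lexical category labels."""
    values = dict(claims)
    return f"{labels[values['language']]} {labels[values['lexical category']]}"


labels = {"Q1860": "English", "Q1084": "noun"}
apple_claims = [
    ("lemma", "apple"),
    ("language", "Q1860"),
    ("lexical category", "Q1084"),
]
print(cardinality_errors("lexeme", apple_claims))
print(generated_description(apple_claims, labels))
```

Statements that may occur many times, such as pronunciation or translation, are simply not listed in the table and are therefore never flagged.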

Acknowledgements

Seriously, too many. The proposal is heavily influenced by discussions with Duesentrieb, Micru, Francis Tyers, Lavour, Markus Krötzsch, Eloquence, EncycloPetey, 23PowerZ, User:-sche, and many others. It is also based on previous work done for OmegaWiki (thanks to GerardM and Kipcool), in the Wiktionaries, WordNet, etc., as analyzed by a number of researchers. Finally, the end result is heavily inspired by the Lemon model; thanks to the researchers involved for taking the time to answer our questions. In short, this work is not my achievement in any way; I am merely editing it and trying to bring it together in a concise form. And now that it is in a wiki, feel free to discuss and modify it. --Denny (talk) 12:01, 2 August 2013 (UTC)