Wikidata talk:Lexicographical data


Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.


On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2018/09.


10K Lexemes!

Hello all,

I'm happy to let you know that we reached 10,000 Lexemes over the weekend :) I want to thank you for all the work you've been putting into this. I especially loved the nice challenges like the one about months; I hope there will be more in the future.

However, quality and diversity are more important than quantity, and the statistics remind us that many languages still need to be improved in order to catch up with English ;)

Let's keep in touch with development news, bug fixes, and hopefully soon some exciting steps forward about Senses and queries! Lea Lacroix (WMDE) (talk) 08:16, 6 August 2018 (UTC)

Hey, what's wrong with English having the most? Another challenge that might be interesting is to see how many languages we can get the most common 1000 words added for - for example for French there's a list here: http://french.languagedaily.com/wordsandphrases/most-common-words ArthurPSmith (talk) 14:09, 7 August 2018 (UTC)
I fully support @ArthurPSmith:, there are frequency lists available on the net, also in Wiktionaries. I myself follow this list. KaMan (talk) 11:26, 8 August 2018 (UTC)
  • To do things slightly differently, it would be interesting to provide lexemes for uses that aren't easily possible with other tools. I think we still need to find good ways to link between Lexemes. Obviously, senses will help. As nouns could easily be generated from existing items, maybe verbs could be interesting to focus on. To wikify texts with lexemes, some other categories might need to be done first.
    --- Jura 17:08, 8 August 2018 (UTC)

Now that you have created so many lexemes, what do you use them for? Can anyone explain? Is there any purpose to this whole endeavour? --LA2 (talk) 20:42, 21 August 2018 (UTC)

@LA2: There are already 14K lexemes, but that is a very small number compared to the Wiktionaries. Lexemes are not usable yet, due to the lack of support in queries and the missing interface to the Wiktionaries. The only real tool that benefits from lexemes now is Wikidata Lexeme Graph Builder, which nicely presents etymological relations. You can find some more proposed uses in Wikidata:Lexicographical data/Ideas of tools and Wikidata:Lexicographical data/Ideas of queries. There is also a lot of data in Q-items that should be moved to the lexeme namespace because it is lexicographical data. KaMan (talk) 08:45, 22 August 2018 (UTC)
@LA2: <edit conflict with KaMan> Hmm, at the moment nothing, I guess. Honestly, the effort put into it is kind of a bet. I haven't checked the whole thing yet, but I'm pretty sure that Wikidata's lexeme project could succeed in solving some inherent cross-Wiktionary problems. One of them is the lack of homogeneity and the poor readability/usability for bots. Saying that "noun" is a noun sounds shitty to us as human Wiktionarians, since we (I guess) all have this information on our respective Wiktionaries, and it's of low value for human readers. But the Lexeme project, if I understand correctly, is meant to keep only the marrow, without the other things (conventions, entry architecture, presentation, etc.) specific to our respective projects; all these things help make reading easy and exhaustive for our human readers, but are absolutely shitty to deal with for bots and developers, who have to make lots of specific customizations just to bypass these obstacles and be able to read and exploit the basic data. Concretely, let's take an example. At the moment, if I want to create a tool to check and use data from three Wiktionary projects, say fr/en/sv.wiktionary, I'll have to develop and adapt my tool for each Wiktionary to be sure it matches all the little differences between these projects, or else my tool will probably run badly or not at all. In other words, I'll have to adapt my tool for each of these bodies: fr the tall one, en the fat one and sv the tiny one. A nightmare. The Lexeme project, if done smartly and well (that's another bet), could allow us to focus just on the essential marrow of the sum of all the Wiktionaries' data. In other words, the tall, the fat and the tiny become just some kind of X-ray pictures that ignore physiognomy and are easy to work with and process with software tools.
Seriously, the only cross-Wiktionary tools that exist (or have existed) are: 1) the ones that count entries/articles and compare Wiktionaries (stats are fun to know but almost totally useless) and 2) the bots that add the inter-Wiktionary links (also an almost totally useless and uninteresting thing). Period. Why have so few cross-Wiktionary tools been developed since 2005? Because it's almost impossible to develop them, and even once they are developed, they are hard to maintain; Wiktionaries are too inconsistent and changeable. Thirteen damn years have passed and we still struggle to migrate and adapt conjugation and inflection templates from one Wiktionary to another, and not because the inflection or conjugation rules changed, no. It's just because of the differences between the projects' standards. The same goes for gadgets and other tools that could benefit absolutely ALL Wiktionaries. E.g. fr:wikt:Module:traduction, which allows neophytes and total wiki-noobs to add a translation easily in a few seconds without even seeing wikisyntax; en.wiktionary had this tool at least 5 years before we saw it on the fr project! 5 years! And IMHO this tool should be used by default on all Wiktionaries, all of them. I'll let you guess why that's impossible. So maybe I'm too optimistic, but Lexemes could eventually give the Wiktionaries the auxiliary linguistic tools they are missing so much at the moment.
TL;DR: It could be useful to some enthusiastic devs. V!v£ l@ Rosière /Murmurer…/ 09:56, 22 August 2018 (UTC)
  • Personally, I want to use lexemes to automatically generate text in a smarter way (a template where I want to use such an ability already exists and is in use on nnwiki). I've gone into some more detail here.
Another purpose of the lexicographical data project is the same as the purpose of the wiktionaries: to provide users information about words they are curious about. One crucial difference between the lexicographical data stored here and on the wiktionaries is, as alluded to above, that the data stored here is structured and stored in a more rigid manner than on the wiktionaries. I find this way of relating to and entering lexicographical data to be rewarding in its own right, while, at the same time, this Wikidata project provides a service similar to the one that the wiktionaries provide. --Njardarlogar (talk) 11:55, 23 August 2018 (UTC)

Once again we crossed 10K. This time it's 10K nouns, see Wikidata:Lexicographical data/Statistics#Lexical_categories :) KaMan (talk) 06:21, 7 September 2018 (UTC)

20K Lexemes

And we just passed 20,000 today... ArthurPSmith (talk) 15:45, 10 September 2018 (UTC)

Maybe it's time to stop adding unlinked trash in a foolish hunt for another milestone and to look for a way to organize it into some kind of dictionary.--Shlomo (talk) 20:33, 11 September 2018 (UTC)
@Shlomo: Actually, ArthurPSmith, whom you are answering, is adding lexemes in a very organized way. He also completes lexemes with a full set of forms. I'm also editing in an organised way: I follow three specialized dictionaries (fruits and vegetables, animals, chemical elements) and two frequency lists (1 and 2), and add all possible properties, only waiting for senses. What more would you suggest adding to make it look more like a dictionary? KaMan (talk) 11:17, 12 September 2018 (UTC)
I didn't intend any offence to ArthurPSmith or anybody else. But the fact is that as long as we don't have senses and links to Wiktionaries (at least…), no lexeme can be correctly linked to its meaning, translations, synonyms, antonyms, hyperonyms etc. Duplicate control is AFAIK nearly impossible (I'll be glad if somebody corrects me about this), and if two bots or semi-bots started adding lexemes in the same language, the project would soon drown in redundancies. I'm afraid that making good these omissions afterwards and correcting the resulting errors will be much more work than creating/importing the data later, when the data model is more complete. It's great that there is some database on which the new implementations can be tested, but having another thousand vegetable or animal names won't bring us much further.--Shlomo (talk) 13:08, 12 September 2018 (UTC)
You're right. Ultimately it's more fun to just dive right in. That's been the historical approach on Wikimedia projects, and I think it's the right one. I'd rather wait to do any serious editing, but the editors currently creating Lexemes are solving the chicken-or-the-egg conundrum. Additionally, their edits are highlighting all sorts of problems that can consequently be solved earlier rather than later, when the onslaught starts. --Azertus (talk) 13:23, 12 September 2018 (UTC)
@Shlomo: I'm not sure about other languages, but at least for English there's not a problem with duplicates at the moment - I believe I've cleaned up all the obvious duplicate cases with the new merge tool (mentioned below), and there weren't really very many anyway. There may be more subtle duplication issues related to what is the meaning of "lexeme", but I think it's good to have a big corpus of examples to illustrate where we can go wrong. Lexeme search works fine anyway now, and merging is possible, so I don't think the duplication issue is a big deal at all. Bots should check for duplicates before adding of course - so far we have disallowed all bot edits in this space anyway. ArthurPSmith (talk) 19:52, 13 September 2018 (UTC)
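The duplicate check that bots are expected to run before adding lexemes could look roughly like the sketch below. It assumes, purely for illustration, that two lexemes are candidate duplicates when lemma, language item and lexical category item all match; the Lexeme class, the find_duplicates function and the lexeme IDs in the usage note are hypothetical and not part of any Wikidata tool or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """Hypothetical, simplified stand-in for an L-entity."""
    lexeme_id: str         # illustrative ID, not a real lexeme
    lemma: str
    language: str          # item ID of the language, e.g. "Q1860" (English)
    lexical_category: str  # item ID of the category, e.g. "Q1084" (noun)


def duplicate_key(lexeme):
    # Assumption: same lemma + language + lexical category means
    # "probably the same lexeme". Homographs with different etymologies
    # would still need a human decision before merging.
    return (lexeme.lemma, lexeme.language, lexeme.lexical_category)


def find_duplicates(lexemes):
    """Group lexeme IDs sharing a duplicate key; return groups of two or more."""
    groups = defaultdict(list)
    for lexeme in lexemes:
        groups[duplicate_key(lexeme)].append(lexeme.lexeme_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

For example, two noun entries for "book" would be reported as one candidate group, while a verb entry for "book" would be left alone. The output is only a list of candidates to review, not a list of entries to merge automatically.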

Senses are coming soon, please try them on beta!

Hello all,

Almost everything is in the title :) We've been making a lot of progress on developing the necessary features for Senses. Since we're now polishing the last details, we would like to ask you to try and play with Senses on our beta system. L1 is more like a sandbox, L53, L56 and L57 reflect the structure of a real Lexeme, and you can of course create your own Lexemes and items.

Let us know if you see anything that goes wrong or is unexpected, for example about:

  • structure of Senses
  • design of the interface
  • display of data depending on your language interface

You can leave comments in this section, or directly create tickets on Phabricator. If you do the latter, please add the tag "Lexicographical data" and add me as a subscriber.

Thanks a lot! Lea Lacroix (WMDE) (talk) 10:30, 27 August 2018 (UTC)

Hi @Lea Lacroix (WMDE):, I've created L60. All went fine, but I have one question. Do senses really need to come after forms? My language has a large number of forms for nouns, adjectives and verbs alike, and it requires a lot of scrolling to get to the senses. I would prefer senses before forms. Senses are somehow more important for me. KaMan (talk) 11:56, 27 August 2018 (UTC)
@Lea Lacroix (WMDE): Thanks for letting us try it out! The only question that came up for me was the ordering of glosses - it seems to be sorted in the order they were entered, rather than by language code? I think we should use the same ordering as the labels/definitions box (user-preferred languages, then by code) - or just directly order by code? ArthurPSmith (talk) 14:15, 27 August 2018 (UTC)
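The ordering proposed here (user-preferred languages first, then everything else by language code) can be sketched as follows. The function name order_glosses and the plain dict of glosses are assumptions made for this illustration; this is not how the Wikibase interface stores or sorts glosses.

```python
def order_glosses(glosses, preferred_languages=()):
    """Return (language_code, gloss) pairs: the user's preferred languages
    first, in the user's own order, then all remaining codes alphabetically.

    glosses: dict mapping language code -> gloss text (hypothetical shape).
    """
    # dict.fromkeys drops repeated codes while keeping the user's order.
    preferred = [code for code in dict.fromkeys(preferred_languages)
                 if code in glosses]
    remaining = sorted(code for code in glosses if code not in preferred)
    return [(code, glosses[code]) for code in preferred + remaining]
```

Dropping the preferred_languages argument gives the simpler alternative mentioned above, ordering directly by code.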
  • Nice.
    1. The edit box for the gloss seems very narrow.
    2. Somehow I thought there'd be another box like "lexical categories" or "grammatical features", but maybe statements can do.
    3. Maybe a navbar to skip to the section headers for forms or senses on top of the icon would be helpful.
    4. I hesitate about the point mentioned by KaMan about the order of the sections (before or after forms).
      --- Jura 06:26, 28 August 2018 (UTC)
Thanks for these first answers! @KaMan, Jura1: I totally get this issue about having a very long Form section. We plan to work on a compacted view, but in the meantime, what do you think about the possibility to collapse the entire Form section with a show/hide button? Would that solve your issue? Lea Lacroix (WMDE) (talk) 10:31, 28 August 2018 (UTC)
@Lea Lacroix (WMDE): I would rather think about a small ToC (table of contents) at the beginning of the lexeme page, with direct links to the Forms and Senses sections. The ids are already there (id=forms and id=senses), so the ToC would be trivial and would follow the user interface of the ToCs on talk pages. KaMan (talk) 10:50, 28 August 2018 (UTC)
@Lea Lacroix (WMDE): I made a small script to show what would be sufficient: https://wikidata.beta.wmflabs.org/wiki/User:KaMan/ToC_to_lexemes.js It produces https://ibb.co/fVzi3U I filed a task about this in Phabricator: https://phabricator.wikimedia.org/T202983 KaMan (talk) 11:51, 28 August 2018 (UTC)
Thanks a lot for the script! It will already be helpful for regular editors in the meantime. Lea Lacroix (WMDE) (talk) 12:58, 28 August 2018 (UTC)
  • Not sure. As the feature isn't really finished, it's not much use. For French, generally there wouldn't be that many except for verbs. So for anything but verbs, there isn't much use in collapsing it. For verbs, the summary mentioned earlier would be helpful.
    --- Jura 04:01, 29 August 2018 (UTC)
  • I tried it and I found the use of codes counter-intuitive. How am I supposed to know which language it is? And what if there are no codes for a given language?--Micru (talk) 09:22, 29 August 2018 (UTC)
    • @Micru: we already had this discussion (and multiple times). Yes the interface is suboptimal, it should and will be improved. And no, there is always at least one code for each and every language you can imagine. Cheers, VIGNERON (talk) 22:22, 31 August 2018 (UTC)
      • @VIGNERON: Of course there is a code for any language, you just need to type mis-x-Q... but I wonder why you consider that an optimal solution, when it is not usable at all. --Micru (talk) 08:32, 1 September 2018 (UTC)
        • @Micru: I don't consider that an optimal solution! And if people use it, it is not « not usable at all ». Cheers, VIGNERON (talk) 08:54, 1 September 2018 (UTC)
          • @VIGNERON: I speak from personal experience. When I say that it is not usable, I mean that it is not usable for me, and that is the reason why I don't use it. When the system improves I will give it another chance.--Micru (talk) 20:45, 1 September 2018 (UTC)

Somehow, I can't find a way to add a reference to a gloss. Is it a bug, a feature, or just my blindness?--Shlomo (talk) 19:32, 29 August 2018 (UTC)

Here are the things I noticed:

  • After clicking "add sense", nothing is focused. I would expect it to focus the language field.
  • The word "sense" is randomly capitalised in the "add sense" link. English doesn't capitalise all nouns like German does and it's inconsistent, since we don't capitalise other words like "item", "property", "statement". This seems to be an issue in other places for lexemes too, like the "add form" link.
  • After clicking "add" to add a gloss in another language, nothing is focused. I would expect it to focus the newly-created language field.
  • After clicking "edit" to edit existing senses, nothing is focused. The expected behaviour is trickier here because you might want to edit a specific language or add a new language, but either way, focusing nothing makes keyboard navigation very difficult.
  • No language names are ever displayed, only codes, which makes it very easy to use the wrong code without realising.
  • Languages are not sorted in any way. I would like to see the gloss matching the lexeme language at the top, but any that haven't been deliberately put at the top should be in alphabetical order.
  • There's no visual indication that a language code will not be accepted until you try to save and get an error.
  • It doesn't prevent duplicate languages, it just silently overwrites the existing gloss. One option would be to allow users to select any language, but display a message under the row saying that there is already a gloss for the language if they select one which has already been used and then prevent saving until the duplicate has been removed.
  • The list of different languages for a gloss quickly takes over the page. It should limit the number it displays by default.

- Nikki (talk) 00:01, 1 September 2018 (UTC)
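Two of the points above (an unsupported language code is only reported on save, and a duplicate language silently overwrites the existing gloss) amount to validating a gloss before it is stored. A minimal sketch, assuming a hypothetical add_gloss helper and a caller-supplied set of allowed codes; none of these names come from Wikibase:

```python
class GlossError(ValueError):
    """Raised when a gloss cannot be added (hypothetical error type)."""


def add_gloss(glosses, language_code, text, allowed_codes):
    """Return a new dict with the gloss added, refusing bad input.

    glosses: dict mapping language code -> gloss text.
    allowed_codes: set of accepted language codes (assumed to be known
    up front, so the check can run before the user tries to save).
    """
    if language_code not in allowed_codes:
        raise GlossError(f"unsupported language code: {language_code!r}")
    if language_code in glosses:
        # Refuse instead of silently overwriting the existing gloss.
        raise GlossError(f"a gloss for {language_code!r} already exists")
    updated = dict(glosses)
    updated[language_code] = text
    return updated
```

Returning a new dict rather than mutating the argument keeps the original glosses intact when validation fails, which is one way to make the "prevent saving until the duplicate has been removed" behaviour easy to reason about.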

Hello all,

Thanks for your very useful feedback! I tried to address all of your suggestions, let me know if I forgot something:

  • compact form, reordering or collapsing for Forms and Senses: this is something we need to think about at the user-experience level for the long term. I'm going to discuss it with our UX team. In the meantime, the script suggested by KaMan could be adapted once the feature is deployed.
  • Display language names instead of language codes for Glosses: ticket created
  • Ordering glosses: ticket created
  • Possibility to display some glosses at the top and collapse the others: see this ticket
  • Add a reference for a gloss: this is not available for now. Let's wait and see if there is a use for this, since a gloss is not supposed to be content taken from elsewhere.
  • Some languages can't be added for glosses: the list of available language codes for glosses is the same as the one for labels and descriptions. Therefore, some languages won't appear there. The format "mis-x-..." is not working either.
  • Impossibility to check if a gloss language is right before saving: currently being worked on (see ticket)
  • When a new gloss with an existing language is created, it silently fails: currently being worked on (see ticket)
  • Inconsistency for capitals: will be fixed (see ticket)
  • Focus problems: ticket created

Thanks, Lea Lacroix (WMDE) (talk) 13:47, 4 September 2018 (UTC)

@Lea Lacroix (WMDE): Another thing I've noticed is that the "remove" links for each language's entry in a sense are not consistent with the ones used elsewhere (such as statements and forms). - Nikki (talk) 19:33, 8 September 2018 (UTC)

HTML language tags

<h1 id="firstHeading" class="firstHeading" lang="en">
 <span class="wikibase-title">
   <span class="wikibase-title-label">continuité</span>
   <span class="wikibase-title-id">(L17676)</span>
 </span>
</h1>

If I read the HTML correctly, the applicable language tag on the section header for continuité (L17676) switches from French to English when I switch the GUI language. Supposedly by default, it includes "en". I think it should use the lexeme language instead ("fr" here).

The same should probably apply to forms. --- Jura 09:12, 2 September 2018 (UTC)

Note that the lexeme ID should still be in the interface language, as different languages use different brackets.--GZWDer (talk) 17:02, 4 September 2018 (UTC)
  • @Lydia Pintscher (WMDE): would you arrange for it to be corrected? --- Jura 09:06, 8 September 2018 (UTC)
    • Hmmm yes indeed this seems suspicious. Do you have a case where this causes a visible issue? That'd make it easier for me to formulate the ticket. --Lydia Pintscher (WMDE) (talk) 19:02, 8 September 2018 (UTC)
      @Lydia Pintscher (WMDE): I already mentioned this thread in phab:T196228. KaMan (talk) 05:03, 9 September 2018 (UTC)
      • Ah thank you! --Lydia Pintscher (WMDE) (talk) 09:52, 9 September 2018 (UTC)
      • @Lydia Pintscher (WMDE): I can't think of any examples right now where it would be visible by default, but I have some additional CSS to change the way certain scripts are displayed which makes it noticeable for me: when correctly language-tagged, ཆུ (L8249) is displayed using a bigger font size (since I generally find the default for Tibetan too small), كانون الثاني/كَانُونُ ٱلْثَانِي (L8660) displays using my preferred font for Arabic (instead of displaying using a mixture of two fonts) and I would also be able to make the Mongolian script in ном/ᠨᠣᠮ (L7957) display vertically. And yes, the forms should be language-tagged too. - Nikki (talk) 20:46, 8 September 2018 (UTC)
        • (Although as GZWDer pointed out, it should be the individual lemmas which are language-tagged, if it's only the h1, I can't distinguish the text in Cyrillic script from the text in Mongolian script) - Nikki (talk) 20:49, 8 September 2018 (UTC)
      • Maybe the question isn't where it's visible, but where it isn't: imagine someone crawls html pages and uses the tags to determine the language. They get mis-structured data and possibly don't display words if one doesn't choose "en", but the language one is interested in. After all your efforts on language codes, it seems odd that Wikidata just outputs "en".
        BTW, I could fix the issue by changing the content language of an entity. https://www.wikidata.org/w/index.php?title=Lexeme:L19576&action=info .
        Makes me wonder if we shouldn't display the entire page in the language of the entity for users that are not logged in. --- Jura 03:58, 9 September 2018 (UTC)
  • @Lydia Pintscher (WMDE): how do you want to proceed? Maybe content language could be set directly when the entity is created. --- Jura 05:37, 14 September 2018 (UTC)
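The crawler scenario raised in this thread can be illustrated with a small sketch that reports which lang value each piece of text inherits. It uses only Python's standard html.parser; the LangCollector class and effective_languages function are names made up for this example, and the sketch ignores details such as xml:lang or unclosed non-void tags.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not stay on the stack.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class LangCollector(HTMLParser):
    """Collect (effective_lang, text) pairs, inheriting lang from ancestors."""

    def __init__(self):
        super().__init__()
        self._stack = [None]  # effective lang of each open element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        # An element without its own lang inherits the parent's value.
        self._stack.append(dict(attrs).get("lang") or self._stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append((self._stack[-1], text))


def effective_languages(html):
    """Return the list of (lang, text) pairs found in an HTML fragment."""
    parser = LangCollector()
    parser.feed(html)
    parser.close()
    return parser.texts
```

Fed the heading quoted at the top of this section, the sketch reports both "continuité" and "(L17676)" as "en". If a lang="fr" attribute were added to the wikibase-title-label span only (a hypothetical change, following the suggestion to keep the ID in the interface language), the lemma would come out as "fr" while the ID stays "en".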

Looks good; below is what I found. Maybe there is a more systematic tool to validate it. Unless otherwise specified, this is about https://wikidata.beta.wmflabs.org/wiki/Lexeme:L1 when not logged in.

  • Lemma: ok (once en, once de)
  • Lemma as value: ok (en and de parts combined on link)
  • Form: ok (en)
  • Form as value of a statement on a form: ? (no lang) [form used was in en, maybe this is assumed given the content language]
  • Item label as value: ? (no lang) [label is en label]
  • Grammatical features as label: ? (no lang) [label is en label]
  • Senses gloss: ok (pl, fr, en, de)
  • Sense as value: ? (no lang) [label is hu lemma and en gloss]
  • Monolingual text as value: ok (en) [this is on https://wikidata.beta.wmflabs.org/wiki/Lexeme:L54 ]
  • Page content model: ? (en - English)

It might be worth making sure that the form when used as value works consistently. The senses one is probably tricky to do. For all, I suppose that if the lang code was "mis" instead of "de" or "en", that would be there. I find it much better than before. Maybe the main question that remains open is the content language. Should it be the lexeme language? --- Jura 21:37, 15 September 2018 (UTC)

Thanks so much for checking! If you are looking at a French Lexeme with the UI set to English, then we show you all the property labels etc. in English. While we do show you French content, we show it within an English interface. So to me it seems right to set the content language to English in this case. Others? --Lydia Pintscher (WMDE) (talk) 09:09, 16 September 2018 (UTC)
Would changing the page content model have any effect here? As far as I can tell, it ignores it and dynamically updates the page language to match the UI language, just like it does for items and I'm not aware of it causing any issues there. I agree with you that the page language should continue to dynamically update to match the UI language. I also agree with Jura that the forms and senses still need to be properly language tagged when used as values in statements, like we do for items and properties. - Nikki (talk) 10:44, 16 September 2018 (UTC)
If you use ?uselang= to display the page in a different language, you can check whether the bits which are currently displayed in the UI language are being tagged correctly. I personally pick a language with a completely different script (e.g. ja or zh), because that makes it easier to see which bits are still in Latin script. It seems that lexical category, lexeme language, grammatical features and item labels when used in statements are all correctly tagged. - Nikki (talk) 10:44, 16 September 2018 (UTC)
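The ?uselang= check described above is easy to script when several pages need to be inspected. The sketch below only builds the URL; with_uselang is a hypothetical helper name, and nothing here fetches or inspects the page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_uselang(url, language_code):
    """Return url with its uselang query parameter set to language_code."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous uselang.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "uselang"]
    query.append(("uselang", language_code))
    # safe=":" keeps titles such as "Lexeme:L19576" readable.
    return urlunsplit(parts._replace(query=urlencode(query, safe=":")))
```

For instance, with_uselang("https://www.wikidata.org/wiki/Lexeme:L17676", "ja") yields the same address with ?uselang=ja appended, which makes any text still shown in Latin script stand out.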

Roman numerals

There is a small disagreement between @Jura1: and me about these characters. For example, I (L19340). First of all: should it be a lexeme or not? If yes, then what language should it have: multiple languages (Q20923490), Roman numerals (Q38918), Latin (Q397)? And what about the lexical category: symbol (Q80071), character (Q3241972), numeral (Q63116)? Then there is the problem with Unicode: there is a Unicode character to represent Roman numerals; should this be mentioned? If yes, then as a second form, or as another representation of the first form with the language code set to "mis-x-Q8819"? What do you think? KaMan (talk) 08:51, 8 September 2018 (UTC)

@KaMan, Jura1: there is definitely a lexeme somewhere, maybe on its own, but maybe it's just a variant inside the lexeme "unus"? Not sure...
For the language, it's clearly not Latin (it is still used every day in most languages). I would say mis-x-Q38918.
I probably wouldn't mention the specific Unicode code point Ⅰ, as Unicode itself says it shouldn't be used ("For most purposes, it is preferable to compose the Roman numerals from sequences of the appropriate Latin letters." The Unicode Standard 6.0, p. 486). Or at least, if it's mentioned, there should be some sort of caveat to indicate that this character shouldn't be used.
Cdlt, VIGNERON (talk) 10:20, 8 September 2018 (UTC)
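The difference between the dedicated code point and the plain Latin letter can be checked with Python's standard unicodedata module. The two helper names below are made up for this sketch; the behaviour relies on the compatibility decompositions that Unicode defines for the Roman numeral characters in the Number Forms block.

```python
import unicodedata


def is_roman_numeral_codepoint(char):
    """True for dedicated characters such as U+2160 ROMAN NUMERAL ONE."""
    return "ROMAN NUMERAL" in unicodedata.name(char, "")


def compose_from_latin_letters(text):
    """Replace dedicated Roman numeral characters with Latin letters.

    Only the numeral characters are normalized (NFKC); everything else in
    the text is left exactly as it was. Characters without a compatibility
    decomposition stay unchanged.
    """
    return "".join(
        unicodedata.normalize("NFKC", char)
        if is_roman_numeral_codepoint(char) else char
        for char in text
    )
```

With this, compose_from_latin_letters("\u2160") returns the ordinary letter "I", and a string written with U+2169 and U+2163 comes back as "XIV".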
I added "use" statements to I (L19340). Does this help? --- Jura 10:47, 12 September 2018 (UTC)
@Jura1: I like the idea, but limited use (Q56598132) is strange: use is not limited here, it's strongly discouraged and almost forbidden. On Wikisource, I regularly replace all Ⅰ with I, as this character didn't exist at the time. Cdlt, VIGNERON (talk) 10:03, 16 September 2018 (UTC)
If you don't think the quote supports the statement, please replace it with a better one. --- Jura 08:57, 18 September 2018 (UTC)

What is a lexeme

According to English Wikipedia, "a lexeme is an abstract unit of morphological analysis in linguistics, that roughly corresponds to a set of forms taken by a single word". So what is a word? Again from en.wiki: "a word is the smallest element that can be uttered in isolation with objective or practical meaning". In view of these definitions, what about ´ (L19234), ` (L19235), ¸ (L19236), ˆ (L19237), ¨ (L19238)? Are they really lexemes? KaMan (talk) 15:34, 8 September 2018 (UTC)

  • It is debatable if we should have L-entities for letters and other characters in a given language or rather rely on items.
    The usefulness of L-entities depends on their consistency for a given language, indirectly, across languages. In general, I think a question to ask is if we can add interesting statements to these L-entities in a given language and link them from other lexemes. --- Jura 15:52, 8 September 2018 (UTC)
Is there some reliable source that describes or lists diacritical signs as separate lexemes of a particular language? I strongly doubt it, but if there is one, let the sourced ones have their "WD-lexemes" (L-entities, if you want). The same principle could be applied to punctuation marks, whitespace characters, emoticons, music notes, mathematical operators, astronomical & astrological signs, hallmarks, circuit diagram symbols, chess figures etc. etc. without having to undergo the same exhausting discussion a million times.--Shlomo (talk) 22:46, 8 September 2018 (UTC)
I agree with Shlomo. So @Jura1: let's wait a little to see if they are sourced, and if not, I will request their removal. KaMan (talk) 08:00, 9 September 2018 (UTC)
The ones above are a fairly popular topic for French, even cédille independently of "c-cédille". --- Jura 08:05, 9 September 2018 (UTC)
Well, we have diacritical signs in Polish too and they are very "popular". The question is if reliable dictionaries mention them as separate lexemes in languages. KaMan (talk) 08:13, 9 September 2018 (UTC)
As for letters and other characters, they are not considered lexemes. Entries about letters can be alphabetized in print dictionaries. For most other things, this has to be included differently. Obviously, Wikidata isn't a print publication. I think we need to differentiate L-entity and lexeme. This is why the interface spells it "Lexeme" and not "lexeme". --- Jura 08:18, 9 September 2018 (UTC)
I am not discussing letters here. Letters are considered lexemes in my dictionaries. Diacritical signs are not. KaMan (talk) 08:30, 9 September 2018 (UTC)
Why wouldn't your definition apply to these? L-entities are for "Lexemes" not entries in your dictionaries. --- Jura 08:37, 9 September 2018 (UTC)
Please note that our Wikidata:Lexicographical_data/Documentation#Data_Model, which defines L-entities, refers to the lexeme definition in Wikipedia that I quoted. KaMan (talk) 08:47, 9 September 2018 (UTC)

Is question mark part of interrogative lemma?

Again following en.wiki: "In morphology and lexicography, a lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words". So I assume the lemma of combien@fr is "combien", because that's how French dictionaries mention it (https://fr.wiktionary.org/wiki/combien). But @Jura1: keeps adding a question mark at the end of it in combien? (L1724), saying it's useful. Should it be there? KaMan (talk) 08:43, 9 September 2018 (UTC)

Based on the examples on the French Wiktionary page, I would say, no, it shouldn't be part of the lemma. Only one of the 17 examples has the question mark directly after the word and 5 of the examples don't have a question mark at all. - Nikki (talk) 14:29, 9 September 2018 (UTC)
  • I'm not sure if Wiktionary's (or 18th-century printers) technical limitations should limit us in the way we display a lemma. This is different from the forms part. --- Jura 14:56, 9 September 2018 (UTC)
I don't think diacritical marks should be included unless they are an intrinsic part of the word and always written that way. In the case of combien?, there is not always a question mark right after the word, so it shouldn't be included. —Rua (mew) 15:07, 16 September 2018 (UTC)

@Jura1: could you explain why you think that the main lemma of the lexeme combien? (L1724) should be « combien? » (and same question for the item How many? (Q54311997)). Cdlt, VIGNERON (talk) 12:12, 10 September 2018 (UTC)

@Jura1: ping again because you did not explain why you add "?" and now you started with "!" oh! (L21330) KaMan (talk) 14:57, 16 September 2018 (UTC)
  • I think it's a useful marker for the human-readable version (i.e. the lemma) in French and possibly other Latin-script languages. This holds even if the actual forms are used in indirect statements. I think it compares well with "-" for pre/suffix in lemmas or "*" in others. --- Jura 09:00, 18 September 2018 (UTC)
An affix will not appear on its own, and the hyphen indicates this. One could argue that representing an affix without a hyphen (or in some other way that indicates how it is used) would be incorrect or misleading because it does not appear on its own, unlike other lexemes.
The asterisk, as far as I understand it, is used to indicate that the lexeme is unattested - so here you could also argue that to not indicate that the lexeme is unlike other lexemes would be misleading.
In short, I would say that these two cases are fundamentally different from e.g. adding question marks to interrogative words. --Njardarlogar (talk) 09:44, 18 September 2018 (UTC)

Spelling variants

Hello, sorry if something like this has already been solved, but I came across some cases of Czech words I don't know how to deal with. If I understand correctly, spelling variants should be considered one lexeme, and thus the English words colour and color are one lexeme. At the same time, it is not possible to enter multiple lemmas with the same spelling variant code, and thus colour is marked en and color en-x-Q7976. But how should I mark the following Czech spelling variants?

  1. verb infinitives ending with -t vs. verb infinitives ending with -ti (e.g. vozit × voziti): Well, this is not a spelling variant but a different form of the lemma. Verb lemmas ending with -ti are considered somewhat obsolete but are still in use in specific cases, and it is not that long since they were considered the basic dictionary forms, so they can be found in early 20th-century dictionaries. They could be added in the Forms section, but in my opinion both should be in the lemma part of the page.
  2. suffix -ismus vs. -izmus (e.g. symbolismus × symbolizmus): -ismus is the older variant but is still considered basic and neutral.
  3. spelling variants that are considered equal (e.g. kurz × kurs): one can use whichever they want

None of the examples above are dependent on region and none of them could be considered strictly archaic. --Lexicolover (talk) 13:13, 9 September 2018 (UTC)

I have no idea. It's a similar problem for "yogurt" and "yoghurt" in English. - Nikki (talk) 14:21, 9 September 2018 (UTC)
  • For #2 and #3, I'm not sure if there is a clear answer. I take it that there is no difference in sense or pronunciation and this concerns numerous words.
    • A. Ideally, one would be able to enter both for each form on an L-entity, but, unless you find or create distinct x-codes for the s- and z- flavors, it can't be done. Somehow the constraint was introduced recently, maybe it can be dropped again.
    • B. An alternative would be to enter them as separate forms. Looking at some of the entries, it seems doable (2*14 forms), but still suboptimal.
    • C. Wiktionary seems to follow a third option: handle each flavor separately.
For #1, you could follow (a) as well, but personally, I wouldn't bother, as the lemma is there mainly for display purposes. --- Jura 14:47, 9 September 2018 (UTC)
Jura, the lemma is not only for display purposes. For the etymology property derived from (P5191) it has a strict meaning. KaMan (talk) 15:39, 9 September 2018 (UTC)
It's an editorial choice. If it mattered for that property, you'd need to opt for (C.) as Wiktionary --- Jura 15:50, 9 September 2018 (UTC)

@Lexicolover: I had a similar problem with akowski/AK-owski (L3529). I would choose one (the more frequent, more modern one) as the base with the language code set to "cs", and then give the other a language code which points to the feature highlighted in the second form, like "cs-x-Q9751" (kurz) or "cs-x-Q9956" (symbolismus) or "cs-x-Q9893" (voziti). KaMan (talk) 15:55, 9 September 2018 (UTC)

Thank you all for your ideas. For #1 I decided to go with Forms in the end. And for the others I will think more about it. --Lexicolover (talk) 21:15, 13 September 2018 (UTC)

I've run into this problem now when considering Middle Dutch. It's a historical language, written when spelling was not yet standardised, and spellings can differ by region (reflecting a pronunciation difference), by author (reflecting the particular preferences of the author), by work (because an author's preferences could change) and even within a work. Modelling all this with lemmas/representations is a pretty horrible task, because you have to find a distinguishing item for each spelling variety. How would you distinguish a spelling used "in region R by author A in work W on line L of the manuscript"? There's no way there's ever going to be an item to tag the lemma with. Assuming that we don't want tons of alternative lemmas, we can pick one that's representative (e.g. a normalised spelling or pronunciation respelling), but that still leaves the problem for form representations. I have no clue how to handle those in situations like this. —Rua (mew) 19:21, 19 September 2018 (UTC)

To give you some idea, here's a particularly frequent and irregular verb: http://gtb.inl.nl/iWDB/search?wdb=VMNW&actie=article&id=ID78532 (forms are listed near the bottom). —Rua (mew) 19:25, 19 September 2018 (UTC)

@Rua: But the language code is just a string, and it is not limited to a single item when specifying a spelling variant. If you have item Q123 for region R, item Q456 for author A, item Q789 for work W, then you can write the language code in the form "nl-x-Q123-Q456-Q789". Moreover, if one of them is not notable and has no item, then you can always replace QXXXX with a normal string describing the variant, so the spelling used "in region R by author A in work W on line L of the manuscript" can have the language code "nl-x-R-A-W-L". KaMan (talk) 07:02, 20 September 2018 (UTC)
It does not work, interface accepts only one id. KaMan (talk) 08:14, 21 September 2018 (UTC)
  • I think they should rather be separate forms. --- Jura 07:14, 20 September 2018 (UTC)
  • Conceptually, I don't see much of an issue with including 100 forms, each one with a different "attested in" statement. Wikidata GUI isn't necessarily optimal nor the benchmark. For items, it has gotten better recently, but was hard on items with many statements. For lexemes, apparently, some work will be done, but I'd rather count on summaries of forms done by query or LUA. Without a suitable tool, adding many forms is obviously complicated. This shouldn't be much of an issue once QuickStatements is available. --- Jura 07:14, 20 September 2018 (UTC)
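The language-code convention discussed in this section (a base code such as "cs" plus a private-use part such as "-x-Q9751") can be sketched in a few lines. This is only an illustration: the function names are made up here, and the single-qualifier restriction simply mirrors the remark above that the interface accepts only one id.

```python
import re


def variant_language_code(base: str, qualifier: str) -> str:
    """Build a code like 'cs-x-Q9751' from a base language code and ONE qualifier.

    Hypothetical helper: it rejects hyphens in the qualifier because, per the
    discussion, the interface currently accepts only a single id.
    """
    if not re.fullmatch(r"[a-z]{2,3}", base):
        raise ValueError(f"unexpected base language code: {base!r}")
    if not re.fullmatch(r"[A-Za-z0-9]+", qualifier):
        raise ValueError(f"qualifier must be a single alphanumeric token: {qualifier!r}")
    return f"{base}-x-{qualifier}"


def split_variant_language_code(code: str):
    """Split 'cs-x-Q9751' into ('cs', 'Q9751'); a plain code yields (code, None)."""
    base, separator, qualifier = code.partition("-x-")
    return base, (qualifier if separator else None)
```

A combined code such as "nl-x-Q123-Q456-Q789" is deliberately rejected by this sketch; supporting it would only make sense if the interface restriction were lifted.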

Linking forms to lexemes

How would we handle a case where a form of a lexeme has a sense that is specific to the form? Can we make a lexeme for the form and link the two somehow? A case in point is lid (L17502), whose diminutive "ledeke(n)" (ledeken (L19638)) has its own distinct sense. Should we create the second lexeme and link both with an equivalent of exact match (P2888)? --Azertus (talk) 15:17, 9 September 2018 (UTC)

In my language diminutives are separate lexemes so I create base lexeme las (L19630) then diminutive lasek (L19632) and then diminutive of diminutive laseczek (L19633) and connect them all with property derived from (P5191) with qualifier instance of (P31)=diminutive (Q108709). KaMan (talk) 15:32, 9 September 2018 (UTC)
Should we at least use subject has role (P2868) instead of P31 as a qualifier? --Azertus (talk) 13:45, 10 September 2018 (UTC)
I don't know your language, but in mine the wording of the label of P31 sounds better. There was a property proposal for such cases at Wikidata:Property proposal/mode of derivation, and perhaps we should return to it. KaMan (talk) 14:41, 10 September 2018 (UTC)
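The chain of diminutives described above (las → lasek → laseczek, each linked with derived from (P5191) and the qualifier instance of (P31) = diminutive (Q108709)) can be sketched as plain data. The dictionary layout and the function name are assumptions made for illustration, not the actual Wikibase data format.

```python
# Hypothetical in-memory model: lexeme id -> its "derived from" statement.
DERIVATIONS = {
    "L19632": {"property": "P5191", "value": "L19630", "qualifiers": {"P31": "Q108709"}},
    "L19633": {"property": "P5191", "value": "L19632", "qualifiers": {"P31": "Q108709"}},
}


def derivation_chain(lexeme_id, derivations):
    """Follow 'derived from' links from a lexeme back to its base lexeme."""
    chain = [lexeme_id]
    seen = {lexeme_id}
    while chain[-1] in derivations:
        parent = derivations[chain[-1]]["value"]
        if parent in seen:  # guard against accidental cycles in the data
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain
```

Swapping the qualifier property (for example to subject has role (P2868), as asked above) would only change the key inside "qualifiers"; the traversal stays the same.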

Prefix - lemma and form(s)

Looking at the sample: Lexeme:L11294, there is currently:

  • "inter-" as lemma
  • "inter-" as form F1 (same as the lemma)
  • "inter" as form F2

When screening words for the prefix, F2 is probably most useful ("inter"). When presenting the entity, "inter-" probably works best. F1 is mostly there for completeness' sake, but maybe we could do away with it. --- Jura 09:01, 10 September 2018 (UTC)

I already felt uncomfortable seeing two forms in such cases when the prefix clearly has one form. I would go for one form per lexeme. In most cases my dictionaries give "inter-" (apart from Wiktionaries, for example here: https://sjp.pwn.pl/sjp/inter;2561755.html), but I have one dictionary where a plus is used, as in "inter+" (http://sgjp.pl/leksemy/#91337/inter), and it looks good; IMO "+" has a better meaning than "-". KaMan (talk) 09:36, 10 September 2018 (UTC)
I don't think "-" (or "+") is actually part of the form. It's just added to the lemma for better readability. --- Jura 09:45, 10 September 2018 (UTC)
Ok, but it has a meaning: "this form does not occur standalone, something follows it". While we are here to build something new with lexemes, OTOH we are not here to reinvent the wheel: if the vast majority of dictionaries show the form of prefixes as "inter-", then we should follow that. No doubt prefixes like "inter-" do not have two forms, with and without "-". KaMan (talk) 10:29, 10 September 2018 (UTC)
I think you are mixing the "form"-field and the "lemma"-field. --- Jura 10:35, 10 September 2018 (UTC)
I mean both; for prefixes I mean that the one form (obligatory for every lexeme) is identical to the lemma. If dictionaries do not specify any form other than the lemma, then this is the default, first and only form of this lexeme. KaMan (talk) 10:41, 10 September 2018 (UTC)
I take note of your POV. Thanks for sharing it. --- Jura 10:44, 10 September 2018 (UTC)
My POV is that a form should exist if there is data that can be put in it. Here I must say, I don't see what data would require two separate forms. Cdlt, VIGNERON (talk) 12:22, 10 September 2018 (UTC)
Maybe the two forms can be useful to express (using statements) that e.g. inter-be is composed of the forms inter- and be and that interact is composed of the forms inter and act? --Azertus (talk) 12:55, 10 September 2018 (UTC)
"inter-be" is a different story because "inter-" here is not a prefix to "be" but a separate meaning. It's like 《be "inter-"》, and the middle "-" separates stems. KaMan (talk) 15:19, 10 September 2018 (UTC)
+1 with KaMan. And it depends if we want to use combines (P5238) for morphology or etymology, but in the second case, "interact" is not composed of "inter"+"act", it's derived from "interaction". PS: this question concerns (at least) two lexemes : inter- (L11294) and inter- (L18029). Cdlt, VIGNERON (talk) 07:53, 11 September 2018 (UTC)
We do have derived from form (P5548) which allows to link F2 directly. So there is some debate whether F1 is needed in some cases. --- Jura 10:24, 11 September 2018 (UTC)
  • "Inter-" is not a form? What about a phrase "inter- and intranational"? --Infovarius (talk) 12:16, 12 September 2018 (UTC)
    @Infovarius: very good point! I withdraw my remark. Cdlt, VIGNERON (talk) 09:33, 13 September 2018 (UTC)

File use on forms

It seems that these don't appear. Apparently this was noticed before, but I can't find a ticket for it. @Lydia Pintscher (WMDE): Is there one?

Sample at Lexeme:L1234#L1234-F3 with https://commons.wikimedia.org/wiki/Special:GlobalUsage/Eleph_Tarangire_river.jpg . The same works when use is directly on the L-entity. --- Jura 10:24, 11 September 2018 (UTC)

That's a very good point, thanks for reporting it. There's a task now. Lea Lacroix (WMDE) (talk) 15:54, 11 September 2018 (UTC)
But images should not be used in forms... --Infovarius (talk) 11:56, 12 September 2018 (UTC)
Not all Commons media links are images, e.g. pronunciation audio. Also, there are use cases for images on forms, e.g. scripts with limited support on computers (because no fonts are available and/or because it's a script with complex rendering) often have image versions on Commons showing how the word should appear. - Nikki (talk) 12:23, 12 September 2018 (UTC)
  • pronunciation audio (P443) seems a fairly good sample for what should be on forms. BTW, it works on L-entities directly, but not on Forms of L-entities. --- Jura 05:33, 14 September 2018 (UTC)

List of suggested lexemes in the interface

A couple of observations regarding the entering of lexemes in the interface, e.g. for properties:

  • I am not sure how the list of suggested lexemes is sorted currently; but I think it would be more natural if the list had the language of the user interface first, then perhaps the languages in the babel box on the user's user page and finally an alphabetically sorted list (when sorting perfect matches, at the very least)
  • I think lemmas should have priority over forms; if you enter "ans" now, the first suggestion is ans (L10081-F2), although the lemma of ans (L20210) is a perfect match (though maybe the language of the lexeme should weight the most, per the point above).
  • the software should pay more attention to the precise letter that is being used: if you enter "åt" now, at (L3263) appears directly after the perfect match åt (L4400), whereas åtti (L980) and åtak (L2011) appear later in the list after yet more lexemes starting with the incorrect letter "a" instead of "å".

--Njardarlogar (talk) 19:11, 11 September 2018 (UTC)
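The three observations above can be combined into one sort key. The sketch below is just one possible ordering (exact match, then literal prefix match, then language preference, then lemma before form, then alphabetical); the `Suggestion` type and the function name are invented for illustration and say nothing about how the actual search backend works.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    entity_id: str   # e.g. "L4400" or "L10081-F2"
    text: str        # lemma or form representation
    language: str    # language code of the lexeme
    is_lemma: bool   # True for a lemma, False for a form


def rank_suggestions(query, suggestions, ui_language, babel_languages=()):
    """Order suggestions so precise, preferred-language lemmas come first."""
    # Normalise so that composed and decomposed "å" compare equal,
    # while "å" and plain "a" stay distinct.
    wanted = unicodedata.normalize("NFC", query)

    def sort_key(s):
        text = unicodedata.normalize("NFC", s.text)
        if s.language == ui_language:
            language_rank = 0
        elif s.language in babel_languages:
            language_rank = 1
        else:
            language_rank = 2
        return (
            text != wanted,                 # exact matches first
            not text.startswith(wanted),    # then matches on the precise letters
            language_rank,                  # UI language, babel languages, rest
            not s.is_lemma,                 # lemmas before forms
            s.language,
            text,
        )

    return sorted(suggestions, key=sort_key)
```

Whether the language should outweigh an exact match is exactly the open question raised in the second bullet; moving `language_rank` to the front of the tuple would give that alternative.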

auxiliary verb statements; help needed ;-)

Some help needed at L20388 (Lexeme talk:L20388); I found no example of the proper statements for an auxiliary verb of this kind (the Hungarian future tense is formed with the infinitive + the auxiliary verb "fog"); I ask for help with the proper statements for that auxiliary. As a side note, I can't find a way to look up all lexemes in the Hungarian language, nor subclasses (like Hungarian verb lexemes). Would be neat to have, doncha' think? ;-) --grin 09:23, 12 September 2018 (UTC)


Unrelated question, but is there a way (as of now) to attach a meaning to the lexeme? In the case of homonyms it gets really confusing without this, and in normal cases it is just hard. --grin 09:26, 12 September 2018 (UTC)

@Grin:
That's a nice trick, thanks! --grin 11:57, 12 September 2018 (UTC)
@Grin: with Special:RecentChangesLinked you can monitor changes in your language lexemes, see: Wikidata_talk:Lexicographical_data/Archive/2018/08#Monitoring_recent_changes_per_language_is_now_possible KaMan (talk) 12:03, 12 September 2018 (UTC)
Thanks! Just for the record, the full link is this. --grin 10:35, 14 September 2018 (UTC)
  • I attach meanings by linking items in external dictionaries. For this I proposed a series of properties for Polish dictionaries. They serve as authority control for lexemes. KaMan (talk) 10:00, 12 September 2018 (UTC)
Thanks. I see Lea down here informed us about the planned Sense activation, which will resolve this part of my question. --grin 11:57, 12 September 2018 (UTC)
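For the side question about listing all lexemes of one language (optionally restricted to one lexical category), a query could eventually look like the one built below. This is a sketch under assumptions: it presumes lexemes become queryable through a SPARQL endpoint and that the predicates `dct:language`, `wikibase:lemma` and `wikibase:lexicalCategory` are exposed for them; the function name is made up, and the item ids for the language and the category are supplied by the caller.

```python
import re

_QID = re.compile(r"Q[1-9][0-9]*")


def build_lexeme_query(language_qid, category_qid=None, limit=100):
    """Return SPARQL text listing lexemes of one language (hypothetical helper)."""
    if not _QID.fullmatch(language_qid):
        raise ValueError(f"not an item id: {language_qid!r}")
    if category_qid is not None and not _QID.fullmatch(category_qid):
        raise ValueError(f"not an item id: {category_qid!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")

    lines = [
        "SELECT ?lexeme ?lemma WHERE {",
        f"  ?lexeme dct:language wd:{language_qid} ;",
        "          wikibase:lemma ?lemma .",
    ]
    if category_qid is not None:
        # Optional restriction, e.g. to verbs only.
        lines.append(f"  ?lexeme wikibase:lexicalCategory wd:{category_qid} .")
    lines.append("}")
    lines.append(f"LIMIT {limit}")
    return "\n".join(lines)
```

Calling it with the item for the Hungarian language, plus the item for "verb" as the category, would produce the "Hungarian verb lexemes" list asked for above.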

Senses will be released on October 18th

Hello all,

I'm glad to announce that the Senses will be available on wikidata.org on October 18th. With this last part of the Lexemes being available, we'll be able to add more meaning to the data and connect the Lexemes with each other better through new properties.

In the meantime, here's what you can do:

  • have a look at the data model for a few examples of what's possible to do with Senses
  • continue testing the senses on beta and giving feedback
  • have a look at the property proposals and discuss about the properties you would need to create once the Senses are enabled
  • ask questions or discuss about any matter on this page

Please keep in mind that the requests I phrased in my previous announcement are still valid: "We kindly ask you to not plan any mass import from any source for the moment. [...] We strongly encourage you to discuss with the communities before considering any import from the Wiktionaries. [...] I hope that we will all be kind and patient, both with other editors and with the software that may not work exactly as we want it to at the beginning :)"

Cheers, Lea Lacroix (WMDE) (talk) 09:31, 12 September 2018 (UTC)

Great news! :) As for "We strongly encourage you to discuss...." @Lea Lacroix (WMDE): I asked the community of Polish Wiktionary on 11 June about the possibility of an import and found no objections. I also got agreement from the author of our automatic IPA system that he can use his bot to add IPA to all Polish forms in Wikidata lexemes that lack it. I am waiting with this until the software stabilizes, and in the meantime I add it by hand. I also inform the Polish community weekly about progress with lexemes, new properties, tools, and achievements. KaMan (talk) 11:03, 12 September 2018 (UTC)

Once senses are released, there will be increased interest from users of various languages. There is enough time to prepare a nicely translated interface of the lexicographical part of Wikidata for them. I encourage everyone to visit translatewiki.net and add all missing translations to "Wikibase Lexeme". For example, the link to untranslated Swedish messages is https://translatewiki.net/wiki/Special:Translate?action=translate&group=ext-wikibaselexeme&filter=%21translated&language=sv (you can get the list for any other language by replacing the language code at the end of the URL with your own). KaMan (talk) 09:13, 20 September 2018 (UTC)
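The URL pattern in the comment above can be generated for any language code. A minimal sketch, assuming the query parameters stay exactly as in the Swedish example (the function name is made up):

```python
from urllib.parse import urlencode


def untranslated_messages_url(language_code: str) -> str:
    """Build the translatewiki.net link to untranslated "Wikibase Lexeme" messages."""
    params = {
        "action": "translate",
        "group": "ext-wikibaselexeme",
        "filter": "!translated",   # encoded as %21translated in the URL
        "language": language_code,
    }
    return "https://translatewiki.net/wiki/Special:Translate?" + urlencode(params)
```

For "sv" this reproduces the link quoted above; any other language code only changes the final parameter.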

statement counts

Somehow statement counts in search results don't include statements on forms. The same probably goes for everywhere else.

Sample:

    Chwefror (L8435)
    Welsh noun
    0 statements, 1 form - 10:49, 12 September 2018

--- Jura 10:51, 12 September 2018 (UTC)

Yeah we discussed that iirc and decided to only count the statement on the Lexeme-level because it is semantically more correct. The other statements are on a sub-entity. We have the same issue with the number of statements in page information. If there is a strong desire to do it differently we can reconsider. --Lydia Pintscher (WMDE) (talk) 20:47, 15 September 2018 (UTC)
  • Would the count per sub-entity be available somehow? Hard to say now if most statements would be on subentities or not. For search results, it might be better to not show it. --- Jura 21:02, 15 September 2018 (UTC)
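The difference between the lexeme-level count and a per-sub-entity count can be illustrated on a lexeme's JSON. This sketch assumes the usual layout with a "claims" map on the lexeme and on each entry of "forms" and "senses"; the function names are invented here, and the sample in the usage note is not real data.

```python
def count_statements(entity) -> int:
    """Count statements in one entity's "claims" map (property id -> list)."""
    claims = entity.get("claims") or {}
    if not isinstance(claims, dict):
        # An empty claims map may be serialised as an empty list.
        return 0
    return sum(len(statements) for statements in claims.values())


def statement_counts(lexeme) -> dict:
    """Return separate counts for the lexeme itself, its forms and its senses."""
    return {
        "lexeme": count_statements(lexeme),
        "forms": sum(count_statements(f) for f in lexeme.get("forms", [])),
        "senses": sum(count_statements(s) for s in lexeme.get("senses", [])),
    }
```

A lexeme whose only statement sits on its single form would yield `{"lexeme": 0, "forms": 1, "senses": 0}`, which corresponds to the "0 statements, 1 form" line shown in the search result sample.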

Lexemes about Planets

Hi y'all,

Last July, I suggested creating lexemes about months (Wikidata_talk:Lexicographical_data/Archive/2018/08#Lexemes_about_months). It raised some questions (not all solved) and created a very good dynamic. Lexemes about months were created in (almost) 44 languages!

So, nudged by Jura1 and KaMan, here is a new suggestion: create lexemes about the 8 planets.

List

Here are the planets we already have (ordered by the Qid of the language):

  1. French (Q150): mercure (L20398), vénus (L20396), terre (L9225), mars (L1185)
  2. German (Q188): Merkur (L22580), Venus (L22581), Erde (L6778), Mars (L22582), Jupiter (L22583), Saturn (L22584), Uranus (L22585), Neptun (L22586)
  3. Polish (Q809): Merkury (L20449), Wenus (L20876), Ziemia (L21062), Mars (L21355), Jowisz (L21633), Saturn (L21795), Uran (L22573), Neptun (L22769)
  4. Spanish (Q1321): Mercurio (L22606), Venus (L22607), Tierra (L22608), Marte (L22609), Júpiter (L22610), Saturno (L22611), Urano (L22612), Neptuno (L22613)
  5. English (Q1860): Mercury (L21960), Venus (L20397), Earth (L21961), Mars (L8627), Jupiter (L21962), Saturn (L21963), Uranus (L21964), Neptune (L21965), (also Pluto (L21966), Ceres (L21967), Vesta (L21968), Eris (L21969), Haumea (L21970), Makemake (L21971))
  6. Dutch (Q7411): Mercurius (L20389), Venus (L20390), Aarde (L20391), Jupiter (L20392), Saturnus (L20393), Uranus (L20394), Neptunus (L20395)
  7. Bengali (Q9610): বুধ (L20405), শুক্র (L20406), পৃথিবী (L20407), মঙ্গল (L20408), বৃহস্পতি (L20409), শনি (L20410), ইউরেনাস (L20411), নেপচুন (L20412)
  8. Breton (Q12107): Merc'her (L20903), Gwener (L20904), Douar (L20905), Meurzh (L20906), Yaou (L20907), Sadorn (L20908), Ouran (L20909), Neizhan (L20910)
  9. Malayalam (Q36236): ബുധൻ (L2601), ശുക്രൻ (L20399), ഭൂമി (L1482), ചൊവ്വ (L20400), വ്യാഴം (L20401), ശനി (L20402), യുറാനസ് (L20403), നെപ്റ്റ്യൂൺ (L20404)
  10. Kannada (Q33673):
  11. Danish (Q9035): Merkur (L22774), Venus (L22775), Mars (L22776)

Wikidata items: Mercury (Q308), Venus (Q313), Earth (Q2), Mars (Q111), Jupiter (Q319), Saturn (Q193), Uranus (Q324), Neptune (Q332)

Discussions

  • Once we get senses, we can move the item for this sense (P5137) statements to the new section. The interesting discovery for the above is that some find it useful to create separate l-entities based on sense (mercury the element, compared to Mercury the planet). --- Jura 05:46, 13 September 2018 (UTC)
    Yes, I already added this to maintenance queries for lexemes. KaMan (talk) 06:29, 13 September 2018 (UTC)
    Yes, this is indeed a perfect example of where senses will be helpful (thankfully they're coming soon).
    But even with senses, I think we *need* to have separate lexemes for separate words. There is distinct information about mercury the metal, Mercury the planet and Mercury the god. These are different words that are not interchangeable, with different histories and contexts. For instance, the metal mercury comes from the god (derived from (P5191)) and has at least two senses: the element itself (mercury (Q925), synonym of quicksilver or hydrargyrum), but also by metonymy a synonym of thermometer and by extension of temperature. The same goes for Mercury the god (the god itself or by extension a messenger). If there is only one lexeme for all of these, it would have a lot of senses and we would be unable to see the clusters of senses. Plus, other lexemes would benefit from separate lexemes: "mercuric" and "mercurous" would use combines (P5238) with mercury the metal (not the planet nor the god), and "mercurial" comes from the planet (and so on). If we use forms, I fear it will quickly become a mess. Cdlt, VIGNERON (talk) 09:44, 13 September 2018 (UTC)
    In Polish Mercury the god and Mercury the planet are of different gender so they have different set of forms so I cannot hold them in one lexeme. KaMan (talk) 10:05, 13 September 2018 (UTC)
    You don't have to. There's nothing wrong in having one lexeme in English while two/three lexemes in Polish.--Shlomo (talk) 12:24, 13 September 2018 (UTC)
    @Shlomo: indeed but isn't there a problem having one lexeme on Wikidata when there is two or three lexemes in a language? For instance, if you only have one lexeme, how do you say that one lexeme (form?) comes from the other? Cheers, VIGNERON (talk) 13:12, 13 September 2018 (UTC)
    But we have Senses. So we would place etymology statements to each sense separately. And do you want to post a question "should we separate homonyms into separate Lexemes?" But the answer is no because if they have the same grammatical category they can't be created both. --Infovarius (talk) 13:28, 14 September 2018 (UTC)
    @Infovarius: You could place etymology to each sense separately, but you can't do the same with forms (and their respective properties). Separate Lexemes for homonyms is the correct way, at least in the cases where the linguists of the concerned language think of them so.--Shlomo (talk) 18:04, 16 September 2018 (UTC)
    @Infovarius: « if they have the same grammatical category they can't be created both. » what ? why ? That is already done, see tour (L2330), tour (L2331) and tour (L2332) for instance. Cdlt, VIGNERON (talk) 18:20, 16 September 2018 (UTC)
    @VIGNERON: If there are two or three lexemes in a language, we should create two or three lexemes here as well. The point is (was), that Mercury (deity) and Mercury (planet) can be considered two lexemes in one language (e.g. Polish) even if the corresponding translations into other language (e.g. French) are considered "only" different senses of one lexeme. Different languages can have different approaches and Wikidata should reflect it.--Shlomo (talk) 18:04, 16 September 2018 (UTC)
  • @Jura1, VIGNERON: Lexemes in French without a capital letter don't refer to a planet (example with vénus). We have to create other lexemes, don't we? Tubezlob (🙋) 15:47, 14 September 2018 (UTC)
    • @Tubezlob: Yes, I think so (but it's not just a question of capital initial). For instance in « Aussitôt un mercure m’est dépêché, je vole sur ses pas. » (s:fr:La Belle Alsacienne/2) is it a separate lexeme or just a form of Mercure the god? Here I'm not sure, but the god, the planet and the metal is clearly (at least) 3 lexemes for me. Cdlt, VIGNERON (talk) 16:01, 14 September 2018 (UTC)
      • I think it's an editorial choice .. Some forms don't apply to all senses. --- Jura 16:05, 14 September 2018 (UTC)
        • I agree with Jura, it should not be forgotten that writers can use figure of speech. But in the context of La Belle Alsacienne (mercure as a noun, for a messenger), it refers to the sense no 2 in the TLF. Tubezlob (🙋) 16:15, 14 September 2018 (UTC)
      • It's good choice of samples to sort this out. For French, I mostly avoided the question for now and generally didn't create more than one lexeme per lexical category. I think if we take all factors mentioned by VIGNERON, we end up with one lexeme per sense. This may be easier for some things and avoids using too many qualifiers, but probably requires to cross-references these lexemes more. Besides, homographs aren't really easy to navigate. Not sure if there would be much benefit in having 10 instead of 1 for Warszawa (L2041). Factors obviously differ for each language. --- Jura 16:52, 14 September 2018 (UTC)
        • I use rule one lexeme per lexical category and gender and main etymology (families of senses with the same gender and thus forms are together). KaMan (talk) 17:02, 14 September 2018 (UTC)
          • @Jura1: « one lexeme per lexical category » but here mercure the metal is a common noun and Mercure the god and the planet are proper nouns, so not the same lexical category ;). If there is no objection, I'll create new lexemes for the planets in French. Cdlt, VIGNERON (talk) 18:23, 14 September 2018 (UTC)
            • Lexical category for both is noun (Q1084), as for a messenger. --- Jura 04:15, 15 September 2018 (UTC)
              • As long as we use proper noun (Q147276) as a lexical category for some entries, it would be incongruent to not use it for other entries that also concern proper nouns. The way I see it, when we use noun (Q1084) as the lexical category, we implicitly mean common noun (Q2428747). --Njardarlogar (talk) 09:15, 15 September 2018 (UTC)
                • One can be seen as a parent for the other. Maybe it should include "has <..>" in a statement and point to the proper noun. While for a given language, it might be possible to choose an optimal approach for many entities, we probably need to find a way that makes it possible to use both: either one lexeme per sense or one lexeme for all. --- Jura 07:58, 16 September 2018 (UTC)
                  • @Jura1: yes, one is the parent of the other (as stated in the items), but that doesn't mean they are the same thing; they are two different lexical categories. For me, we have here a situation very close to tour (L2330), tour (L2331) and tour (L2332) (which has been discussed at length several times). Cheers, VIGNERON (talk) 10:07, 16 September 2018 (UTC)
                    • I think for these too, you should attempt to formulate a version that combines all in one. We don't necessarily have all information beforehand and lexemes, just like items, should be able to grow incrementally. --- Jura 09:05, 18 September 2018 (UTC)
                      • @Jura1: wait what? do you really mean merge tour (L2330), tour (L2331) and tour (L2332) in one lexeme? how would it even be possible? all dictionaries and grammar books always describe them as three separate lexemes for centuries, I don't see why and how Wikidata could merge them. I can attempt to imagine a merge, but it would be very heavy (for instance, the gender will be repeated in qualifier at least 100 to 200 times) and it clearly wouldn't be functional (I don't see how I could cluster the 10 senses for tour (L2330) to distinguish them from the 10 senses for tour (L2331) and to distinguish them from the 5 senses for tour (L2332), especially ). Cdlt, VIGNERON (talk) 11:27, 18 September 2018 (UTC)
                        • I don't think you should actually merge these items, but we need to have a model that functions when some or all of it were merged. --- Jura 15:05, 18 September 2018 (UTC)

Should we even add proper nouns at all?

I'm not sure I understand the purpose of adding proper nouns as lexemes - surely each proper noun is better covered via an item, i.e. the Q namespace? It would have only the one sense, and you wouldn't have a plural form. I guess for some languages there may be other forms to worry about, but I definitely don't see much point in English. ArthurPSmith (talk) 19:00, 17 September 2018 (UTC)

I think you answered your own question. Proper nouns inflect, have pronunciations on all of those inflected forms, have grammatical gender, etymology, etc. There have been some discussions on English Wiktionary to treat them as noun (Q1084) instead, but in the end nothing changed. Maybe people on Wikidata will think differently. —Rua (mew) 19:11, 17 September 2018 (UTC)
If we don't have English lexemes for planets, how would we handle the etymology for words like wikt:en:マーキュリー? :) Lexemes for proper nouns are pretty boring in English, but they're still valid lexemes. - Nikki (talk) 20:12, 17 September 2018 (UTC)
@ArthurPSmith: we absolutely need proper nouns as lexemes. Especially as one Q can have several names, and so several lexemes (inside one language, obviously); it is not a 1-to-1 relation. And a proper name can have plural(s) and multiple senses, even in English (as already said, mercury is both the metal and by extension the temperature, like in "The mercury is climbing to 10 degrees"), and a bunch of other lexical information that wouldn't fit in a Q item. Cdlt, VIGNERON (talk) 11:40, 18 September 2018 (UTC)
But mercury the metal is not a proper noun. Only Mercury (meaning the planet) is a proper noun. So sure, we should have a lexeme for 'mercury', but I don't see a huge need for one for 'Mercury'. The etymology argument that Nikki raises perhaps is a reason. My biggest concern though is proper nouns are essentially infinite in number - the name of every human being, city, organization, building, street, celestial object, etc. We haven't set out notability requirements for lexemes at this point - if we allow any proper noun to be a lexeme, then we're going to have to require some level of notability as well or wikidata will be overwhelmed. Maybe etymological or other lexical notability is sufficient? ArthurPSmith (talk) 15:02, 18 September 2018 (UTC)
@ArthurPSmith: I would say that lexeme is notable if it can be sourced by notable dictionary. KaMan (talk) 15:17, 18 September 2018 (UTC)
Notability doesn't work for lexemes. We should include even obscure lexemes that don't appear in dictionaries. And lexemes that do appear in dictionaries aren't necessarily real words that people actually use. English Wiktionary has its own criteria for inclusion, not based on notability, for this very reason. Actual use is what should count, not appearance in a dictionary. —Rua (mew) 15:49, 18 September 2018 (UTC)
I agree with you (I think) regarding lexemes other than proper nouns. But would you extend this to proper nouns as well? What about multi-word names for things (like one of my favorites, the "United States Army Tank Automotive Research, Development and Engineering Center" (Q7889528))? Where does it end? ArthurPSmith (talk) 18:43, 18 September 2018 (UTC)
The example you gave is definitely over the top and not desirable, but you have to consider languages with inflections. It is possible that someone will encounter a proper noun in an inflected form of some kind, and unless we have a lexeme and a form for it, they won't find anything. Is that something we want? And if we want it in some cases but not others, then which cases? —Rua (mew) 19:23, 18 September 2018 (UTC)
I think it's less about which proper nouns we want and more about what information we want. We basically want the lexemes to be useful in some way, by being needed for structural reasons (e.g. etymology) or by providing extra information about the word (e.g. grammatical information), or something similar. For example, if someone wants to create lexemes for hundreds of villages in England with no other information other than the name, then that's not useful to us and should be strongly discouraged. But if someone wants to create lexemes for hundreds of villages in England and add the etymology of the name (with a reference), then that is far more interesting. For things which often have different names in other languages (like solar system planets), I think the lexemes will be useful as a way of providing translations once we have senses. - Nikki (talk) 22:18, 18 September 2018 (UTC)
OTOH @Jura1: is creating hundreds of French lexemes all the time "with no other information other than the name" (though there is the mandatory lexical category), assuming that they can be extended incrementally in the future. For the Polish language, hundreds of villages in Poland make sense because they all have inflection and pronunciation (which can be added by the bot we established on Polish Wiktionary). Every village has nouns to describe its citizens, and the inflection for citizens can be built automatically (I already included it in Module:Lexeme-pl). We have Dictionary of names of towns and inhabitants with inflection and language tips (Q55798440), so anybody can extend it. So starting from an empty village lexeme we can build a whole family of statements and other lexemes. If it makes sense in one language, why not in another? KaMan (talk) 09:13, 19 September 2018 (UTC)
Right now, we can't add lexemes properly because we don't have senses. I expect that, at some point in the future once lexeme support is more complete, empty lexemes will become candidates for deletion in the same way that empty items are. - Nikki (talk) 14:30, 19 September 2018 (UTC)
What about empty lexemes that are linked to by other lexemes, e.g. for etymology? —Rua (mew) 19:12, 19 September 2018 (UTC)
Oops, I meant to also say "unused" too. Empty lexemes which are used by other lexemes would generally be fine since they would meet the third criterion of the current notability policy. It would obviously still be a good idea to add more detail to them though. :) - Nikki (talk) 21:35, 19 September 2018 (UTC)
@Nikki: Hmm, but then there is a problem with Q-items used to describe a spelling variant in the language code. For example, normalized spelling (Q56669831) looks unused, but its Q-id can be used as a string in a language code. KaMan (talk) 07:09, 20 September 2018 (UTC)
@Nikki: I don't see the point in deleting empty lexemes. For starters, a lot of the work for the lexicographical project is just to find the lexemes and give them entries here. Second, the most interesting thing about a lexeme regarding deletion I would say is whether or not it deserves an entry here - i.e. can we attest it according to our rules for attestation? For this purpose, I think it would be useful to have a lot of standard resources (like dictionaries) for every language, where we can make a quick and basic check about the validity of any lexeme in the language. Similarly, we could require that every lexeme is attested either indirectly (e.g. in a dictionary) or directly (in a given work). This would rule out all empty lexemes, but also lexemes with all forms entered that lack attestation. --Njardarlogar (talk) 17:19, 20 September 2018 (UTC)
An empty lexeme is by definition empty, so there would be no information about where the lexeme is attested. It's up to the people who claim it exists to show that it exists by adding verifiable data to it (which makes it no longer empty), not for everyone else to somehow show that it doesn't. - Nikki (talk) 08:29, 21 September 2018 (UTC)
The approach on en.Wiktionary is to always require a verification step before deleting something. If no attestation is found during that verification, it is deleted. That way you don't have to provide verification in advance, which is a pain and would slow everything down tremendously. Someone once remarked that if everything currently lacking citations were deleted on en.Wiktionary, most of the dictionary would be deleted, which can't be the intention. So challenge first, then delete if unverified after a while. —Rua (mew) 08:51, 21 September 2018 (UTC)
Hmm, I took a look at what somebody entered for Venus and I guess having one lexeme for a proper noun with 4(+?) different senses is helpful. Maybe notability requirements for proper nouns should include having more than 1 sense linkable to a wikidata item? ArthurPSmith (talk) 15:13, 18 September 2018 (UTC)

Noun: one or several lexemes

difference in | 1 lexeme | 2+ lexemes
sense | add several senses | add applicable sense to lexeme; link other(s) with homograph lexeme; duplicate forms on each
etym. | add etym. to each sense | add etym. to lexeme base; link other(s) with homograph lexeme; duplicate forms on each
gender | add gender to each sense | add gender to lexeme base; link other(s) with homograph lexeme; duplicate forms on each
common/proper | add several senses; use lexical category "noun" | add applicable sense to lexeme; link other(s) with homograph lexeme; duplicate forms on each
caps/lowercase | add several forms; qualify forms to applicable senses | add applicable sense to lexeme; link other(s) with homograph lexeme; add only applicable forms
singular/plural | add several forms; qualify forms to applicable senses | add applicable sense; if possible link other(s) with homograph lexeme; add only applicable forms
pronunciation | add the same form twice; qualify forms to applicable senses, add pronunciation | add applicable sense; if possible link other(s) with homograph lexeme; add form and applicable pronunciation
forms/spelling | add several forms or alternate forms; qualify forms to applicable senses | add applicable sense; if possible link other(s) with homograph lexeme; add only applicable forms

Following the discussion above, here is a summary of some ways a noun could be split between one or several entities. --- Jura 19:09, 19 September 2018 (UTC)

Difference in spelling? —Rua (mew) 19:13, 19 September 2018 (UTC)
I'd include that in "forms" above. I updated the table. --- Jura 07:03, 20 September 2018 (UTC)

I think we should also consider how lexemes are described in Wiktionaries. I think the aim of lexicographical data is also to serve data to other Wikimedia projects and we should somehow fit with our model to models used by Wiktionaries. KaMan (talk) 07:26, 20 September 2018 (UTC)

  • I don't see the impact on this table. Besides, I suppose you are aware of the limits set on using Wiktionary's or other dictionaries' models. As for re-use, Wiktionary currently can't retrieve any Wikidata, but capabilities are independent of the model, at least if the different lexemes are suitably interlinked.
The table attempts to outline what the two approaches look like. One language might just use one for a given difference, but across languages, they might differ and I think we should bear that in mind. --- Jura 07:36, 20 September 2018 (UTC)
To me it's clear that words (not just nouns) should NOT be split simply on sense - many senses are very similar or have a common origin. However, we DO split them already if there is a difference in lexical category: cross not only has different meanings as noun, verb, and adjective, it also serves a different language role and has different forms related to those roles, etc, so splitting into multiple lexemes makes perfect sense. Lexemes should also be clearly split, even if they have the same lexical category, if they differ in common pronunciation. For example "tear" (a rip in cloth) vs "tear" (water from the eyes). Does that mean we should generally have separate lexemes whenever there are two or more distinct etymologies for a lexeme? I think if the senses associated with one etymology are clearly distinct from those associated with the other, then yes, it would make sense to separate at that level - that way we have a clear grouping of senses together. I assume the gendered noun issue is related to and would be resolved with this etymology split (but obviously it's not a problem for English so I'm not so familiar with it). So what about proper nouns? In some sense they have a distinct grammatical role, but the only difference in forms (for English at least) is that they are always capitalized. Use of the plural may be rarer, but it certainly happens, for example for parts of names but also when a proper noun is used as an archetype (how many "Einsteins" are there?). But I think there's a big difference in etymology - the origin of a proper noun is in the process of naming the thing involved, not necessarily related to any meaning of the words used. So if we are splitting based on etymology, proper nouns definitely should have their own separate lexemes. And of course they do have at least the capitalization difference for forms (in English). ArthurPSmith (talk) 15:26, 20 September 2018 (UTC)
  • Thanks for your input. I added singular/plural and pronunciation (also discussed below) to the table. --- Jura 15:50, 20 September 2018 (UTC)
  • @Jura1: What do you specifically have in mind with regards to "qualify forms to applicable senses"? That sounds complicated. I should note there are also cases where the same string in the same lexical category with the same pronunciation still has different forms associated with different etymologies (and senses) - for example in English lie (L4181) meaning putting something in a horizontal position vs lie (L4180) to tell a falsehood. That seems a clear case where they need to be separate lexemes. I think splitting lexemes in general based on (significant, not disputed) etymology makes the most sense here. ArthurPSmith (talk) 18:36, 20 September 2018 (UTC)
  • I made english (L2373) with lowercase "english". If you decide to add capitalized "English" there too, you'd need to add qualifiers to the form, to indicate if it applies to lowercase or capitalized form. As mentioned before, for some languages, the applicable solution might only be in one column. --- Jura 18:45, 20 September 2018 (UTC)
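The splitting rule argued for in this thread (separate lexemes when lexical category, pronunciation or etymology differ, several senses on one lexeme otherwise) can be sketched as a grouping key. This is only an illustration: the `Sense` record, its field names, the informal pronunciation and etymology labels, and the sample glosses are all hypothetical and are not part of the Wikidata data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """Hypothetical flat record for one sense of a word."""
    lemma: str
    category: str       # lexical category, e.g. "noun" or "verb"
    pronunciation: str  # informal label, not real IPA
    etymology: str      # informal placeholder label for the origin
    gloss: str


def group_into_lexemes(senses):
    """Group senses so that each group would become one lexeme.

    Senses sharing lemma, category, pronunciation and etymology
    stay together; any difference in those starts a new lexeme.
    """
    groups = defaultdict(list)
    for s in senses:
        key = (s.lemma, s.category, s.pronunciation, s.etymology)
        groups[key].append(s.gloss)
    return dict(groups)


senses = [
    Sense("tear", "noun", "rhymes with 'bare'", "etym-A", "a rip in cloth"),
    Sense("tear", "noun", "rhymes with 'beer'", "etym-B", "water from the eyes"),
    Sense("cross", "noun", "same", "etym-C", "a shape of two intersecting lines"),
    Sense("cross", "noun", "same", "etym-C", "a burden to bear"),
    Sense("cross", "verb", "same", "etym-C", "to go across"),
]
lexemes = group_into_lexemes(senses)
```

With this sample, the two "tear" nouns end up in separate groups, while the two noun senses of "cross" share one group and the verb gets its own. Capitalization differences (the "english"/"English" case) are deliberately left out of the key, since the thread does not settle how to treat them.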

Please enable search for lexemes

Currently lexemes aren't searched by default when users who aren't logged in use search. Can this be activated? ----- Jura 08:58, 13 September 2018 (UTC)

I just tested, even logged out, I can look for lexemes, both on Special:Search with any method that I described here, and while editing another Lexeme. Can you describe precisely what you can't do? Lea Lacroix (WMDE) (talk) 10:22, 13 September 2018 (UTC)
Even logged-in, Special:Search/schmilblick doesn't return the Lexeme. When looking in the Lexeme namespace, it's appearing. I think it's not a problem of logged-in vs logged-out, but the fact that the namespace needs to be mentioned, either by starting the search word with L: or by selecting the namespace on the search page. Lea Lacroix (WMDE) (talk) 11:11, 13 September 2018 (UTC)
Yes, if you click on the "advanced" tab, you can see that by default it only searches the main and property namespaces when logged out or when a logged in user hasn't changed the settings (using the checkbox on the "advanced" tab when searching). Now that we have a third namespace for data, the default search should be updated to search all three namespaces. - Nikki (talk) 11:49, 13 September 2018 (UTC)
it's probably some site configuration setting. --- Jura 05:39, 14 September 2018 (UTC)
Unfortunately, it's not as easy as it seems. Item and Lexeme search work differently (because they don't have the same components: items have label and description, lexemes have language and lexical category, etc.). For now, it's not possible to search in both by default. This task gives more details.
For now, just add L: before the keyword and you'll find the results in the Lexeme namespace. Lea Lacroix (WMDE) (talk) 06:55, 14 September 2018 (UTC)
I don't see how a random visitor could find "just add L: before the keyword".
The ticket doesn't seem to be about Special:Search/schmilblick, but auto-completion on schmilblick.
It already works for properties, e.g. Special:Search/internetowym słowniku id. Would you check with your engineers? --- Jura 10:43, 14 September 2018 (UTC)
Properties have a structure very similar to that of items (with label, description), whereas Lexemes are different. That's why it is more difficult to search through both namespaces at the same time.
More research on this is planned, but it's not the main priority for now: a form of search is available, now we want to focus on deploying Senses and getting lexicographical data in the query service. Lea Lacroix (WMDE) (talk) 08:48, 19 September 2018 (UTC)
  • Sounds good. In the meantime, could you create a ticket noting the user request? We need to make sure it gets correctly added to the pipeline. I assume that's your main role, isn't it? --- Jura 10:29, 19 September 2018 (UTC)
phab:T204813. Lea Lacroix (WMDE) (talk) 10:39, 19 September 2018 (UTC)
Thanks. To comment on @Smalyshev (WMF):'s options at https://phabricator.wikimedia.org/T204813#4600585 . In the short term, I'd favor #1 (enable common search for Lexeme+Item). I don't think relevancy is much of an issue for the above sample. Search results aren't necessarily optimal even when looking just for items. Currently, one could get the impression that Wikidata has no content when it actually does. --- Jura 19:06, 20 September 2018 (UTC)
That will essentially remove all the improvements we've made for Wikidata search over the last several quarters (they'll still be available, but useless because nobody will be using this search mode as it's not enabled by default and unobvious how to enable it). I think it will be very much of an issue. Smalyshev (WMF) (talk) 19:11, 20 September 2018 (UTC)
Does that mean I shouldn't enable search for lexemes by default for my personal preferences because it renders results for item search less relevant? I wonder if users are aware of that. Currently, I find Special:Search/schmilblick helpful when searching P+I+L. --- Jura 19:15, 20 September 2018 (UTC)
@Jura1: if you compare results with P+I and P+I+L you'll notice that the ranking is different and the highlighting is missing. For simple cases (exact match, one word, same language) it may be still serviceable, but in general default search is substantially worse than specialized one (that's why we took time to create the latter!) so I think defaulting to it would be wrong. Yes, that presents a challenge of how to deal with Lexemes. I'll keep thinking about it and welcome others to do the same, we'll find some solution to it. Smalyshev (WMF) (talk) 20:26, 20 September 2018 (UTC)
That's odd. I de-activate "L" by default then. Maybe a link in the search results to query lexemes only would help. In the short term, that seems fairly easy to implement. In the medium term, including separate results for lexemes would be nice, similarly to what itwiki does in their default output (Wikidata items in addition to articles). --- Jura 11:07, 23 September 2018 (UTC)
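For readers who want to script the workaround discussed in this section (restricting a search to the Lexeme namespace), here is a minimal sketch that builds a MediaWiki search API URL. The namespace number 146 is taken from the WhatLinksHere links quoted elsewhere on this page; the numbers 0 (items) and 120 (properties) used for the combined "P+I+L" case are assumptions, and the helper name `lexeme_search_url` is made up for this example. The sketch only builds the URL and makes no claim about how the results are ranked.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"
LEXEME_NS = 146  # namespace number seen in the WhatLinksHere links on this page


def lexeme_search_url(term, namespaces=(LEXEME_NS,)):
    """Build a search API URL limited to the given namespaces."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        # Several namespaces are joined with "|" (sent as %7C).
        "srnamespace": "|".join(str(n) for n in namespaces),
        "format": "json",
    }
    return API + "?" + urlencode(params)


# Lexemes only.
url = lexeme_search_url("schmilblick")
# Items, properties and lexemes together (0 and 120 are assumed numbers).
combined = lexeme_search_url("schmilblick", (0, 120, LEXEME_NS))
```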

Help a disgruntled Wiktionary editor

I'm a long-time English Wiktionary editor, and when Wikidata first came along the potential for storing structured lexical data seemed very enticing. It always seemed silly to me to duplicate all the work on every Wiktionary, and store everything in a format that doesn't enforce any structure but leaves it up to convention (which is not even strictly defined, resulting in many difficult-to-parse entries). However, when the prospect of using Wikidata on en.Wiktionary came into view, many editors displayed a knee-jerk reaction against it, wanting to retain local control for better or worse, and locked down all attempts to use Wikidata before it was even off the ground. Proposals to use Wikidata even in minor ways, like linking senses to the Wikidata items they represent, were shot down and never went anywhere. The seeming opposition to any innovation whatsoever on en.Wiktionary got me thinking my efforts would be better spent here, where the data I enter actually means something (in both an information sense and a human value sense).

I have very little experience with Wikidata, however, and I'm not sure how a typical en.Wiktionary entry would map to Wikidata lexemes, or even whether it can map at all. I have looked briefly at the documentation and data module, but many questions still remain. The language I have worked with the most recently is Northern Sami, which is probably rather obscure and doesn't have many Wikidata lexemes, or even any. So I'll give an example Wiktionary entry for Northern Sami: guossi. This entry has many of the things that are typical on en.Wiktionary, and I wonder how and to what degree these things can be done in Wikidata:

  • It has an etymology that states it is inherited from an ancestral language, Proto-Samic, giving the reconstructed form in that language. It is also added to the category Northern Sami terms inherited from Proto-Samic. Does Wikidata support etymological inheritance from earlier forms (in the same or a different language), and does it have reconstructions?
  • It has a pronunciation that is generated automatically by a template, based on the spelling (or in this case, a respelling). The number of syllables is counted, and it is categorised in Northern Sami 2-syllable words. Is automatically-generated pronunciation possible? It would be highly desirable for languages like Northern Sami that have a rather regular spelling, but also allow for specific exceptions like in this case.
  • The lemma is defined as a noun, and the headword-line below the "Noun" header shows a respelled version of the written form, that includes a special mark. This mark is not part of the normal written form, but is often included in reference works such as dictionaries. Can Wikidata handle the difference between dictionary/technical spellings and everyday orthography? Consider as another example the use of macrons to mark long vowels in Latin.
  • The lemma is given an inflection, using a template that automatically generates forms. I have gathered from this project's documentation that automatically generating forms isn't possible yet (I hope it is very soon!). But Wiktionary is able to generate the forms, using Lua, based on the type of template and the parameters it is given. In this case, se-infl-noun-even indicates that the inflectional type of the word is "even" (an inflectional type specific to the Sami languages), and the argument guosˈsi gives the stem of the word from which the Lua code can derive all the forms. It also derives a category, Northern Sami even i-stem nouns. If forms cannot be auto-generated, can inflectional class information at least be added, so that users have at least some information about the forms, and so that a potential future form-generating script can use this information later? Also, being able to get a listing of all lemmas belonging to a certain inflectional class is important, can Wikidata do this?
  • A derived term is listed, again including a mark used in dictionaries but not in normal spelling. The mark is automatically removed by the linking template to generate the page name (the actual written form). Wiktionary currently requires manually listing derived terms, but if the etymology of the other term is known, and it specifies a derivation from a base term, can this be figured out automatically?
  • Finally, a link to an external site that contains information about related terms in different languages and such. Wikidata could presumably take over this role eventually, but linking to the external site may still be useful. Could something like an "Álgu id" be added to a Lexeme?

If I could start adding entries here, rather than at Wiktionary, that seems more productive, since Wiktionary could easily use the data here (if only they chose to...), while Wikidata can't easily use Wiktionary's "data". So if someone could answer my questions that would be wonderful. I really hope I can do as much here as I do on Wiktionary. Rua (talk) 01:04, 14 September 2018 (UTC)

@Rua:
KaMan (talk) 06:34, 14 September 2018 (UTC)
(edit conflict) Hi @Rua:,
Lexemes on Wikidata are still pretty new, so I'll give my personal answer (just my own view, and not necessarily right):
First, for the mapping between a Wiktionary (WT) entry and a Wikidata (WD) lexeme, it's indeed a bit complicated. A WT entry is based on a lemma, so one entry can contain several lexemes. On the other hand, one lexeme can be split across several WT entries. For instance, en:wikt:cat#English and en:wikt:cat#Irish are the same entry but two different lexemes, while en:wikt:cat#English and en:wikt:cats#English are two entries but the same lexeme, which is cat (L7) here.
  • If there are sources, there is no problem with having reconstructions (see
  • It's not possible right now, but it will probably be created soon (as soon as someone is motivated to create this tool ;) )
  • Here the community doesn't yet agree on exactly how to do it, but yes, Wikidata can handle different spellings; the only question is what the best way to handle them is (mainly this or that, I prefer the second).
  • Again, it's not possible right now, but it will probably be created soon. Half of it already exists, as we have Wikidata Lexeme Forms by Lucas Werkmeister.
  • Are you speaking of having a graph of etymology like this one?
  • Yes, it can and it should link to this site! Either with a direct property (a specific property to be created, or the general described at URL (P973)) and/or in references.
Cdlt, VIGNERON (talk) 07:09, 14 September 2018 (UTC)
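The many-to-many relation between Wiktionary pages and lexemes described in the reply above can be sketched with a small lookup table. Only cat (L7) comes from this discussion; the identifier used for the Irish lexeme is a hypothetical placeholder, and the function names are invented for this example.

```python
# (Wiktionary page title, language section) -> lexeme identifier
FORM_INDEX = {
    ("cat", "English"): "L7",
    ("cats", "English"): "L7",
    ("cat", "Irish"): "L?-irish-cat",  # hypothetical placeholder, not a real ID
}


def lexemes_on_page(title):
    """All lexemes that have at least one form spelled like the page title."""
    return sorted({lex for (t, _), lex in FORM_INDEX.items() if t == title})


def pages_for_lexeme(lexeme):
    """All page titles on which some form of the lexeme appears."""
    return sorted({t for (t, _), lex in FORM_INDEX.items() if lex == lexeme})
```

Here `lexemes_on_page("cat")` lists two lexemes (one page, several lexemes), while `pages_for_lexeme("L7")` lists two pages (one lexeme, several pages).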
Thank you for your replies! Let me see if I can answer them in turn.
  • @KaMan: For etymologies, Wiktionary distinguishes between inheritance, borrowing, calque and other derivation. A term can be borrowed and inherited from the same language, like how French has inherited many terms from Latin but has also borrowed terms from Latin. I think this is a useful distinction to make, how can I make it on Wikidata?
  • The "conjugational class" property is useful, but I don't think making a distinction between conjugation and declension is very useful. Northern Sami doesn't make this distinction, there are just generic inflectional classes that apply to both verbs and nominals. Can I request "inflectional class" or would that conflict with "conjugational class"? And would this property be machine-readable so that a bot or script can use the data to generate forms?
  • @VIGNERON: The problem with treating them as different "spellings" is that the forms with the extra marks aren't really spellings that are normally used, they are dictionary spellings. They show extra information about pronunciation that is missing from the normal spelling. I guess you can compare them to enPR for English, although that goes much further than just adding a few extra marks here and there to the normal spelling. I see that forms allow for a "pronunciation" property; is this property required to be in IPA? If not, then perhaps this respelling can be added that way. But maybe a "pronunciation respelling" property is more valuable.
  • The "Lexeme Forms" tool looks useful, but not really for Northern Sami. Northern Sami nouns and verbs have lots of forms, way too many to enter them all manually.
  • The graph builder is similar, but I'm talking specifically about something to answer the question: "Which other lexemes in this language are derived from this one?". So not including other languages. For that, there would be either "...inherited from this one" or "...borrowed from this one", a distinction which Wikidata apparently does not make yet.
And I have another question: would it be useful to have a "Wikidata for Wiktionarians" (or specifically English Wiktionarians) page to explain a lot of this information for future newcomers like me? Wiktionary has Wiktionary for Wikipedians already. Rua (talk) 10:53, 14 September 2018 (UTC)
I created Wikidata:Property proposal/Álgu ID, I hope I did it right. Rua (talk) 11:22, 14 September 2018 (UTC)
@Rua:
  • There is a property proposal to describe derivation, see: Wikidata:Property proposal/mode of derivation. If you like this idea you can support it by adding {{s}} ~~~~ at the end.
  • "inflectional class" would not conflict with "conjugational class", but the two can be hard to distinguish for some. However, I looked into https://en.wikipedia.org/wiki/Northern_Sami#Inflection_types and I cannot think of any other name than "inflectional class". Perhaps it will be useful in other languages too.
  • A "Wikidata for Wiktionarians" page would be very useful in my opinion, though it could also be done by extending the Wikidata:Wiktionary page.
KaMan (talk) 11:27, 14 September 2018 (UTC)
For the record, I brought up the need to specify general inflection classes earlier, but nothing came out of it then. Maybe something can come out of this now.
The list of Northern Sami lexemes provided in the link above is incomplete for some reason, there is also Sápmi (L2800) (furthermore, for Lule Sami (Q56322) we have Sábme (L2900), and for Southern Sami (Q13293) Saepmie (L3500), but neither of these appear in the relevant lists). --Njardarlogar (talk) 11:46, 14 September 2018 (UTC)
I created sāmē (L20895), did I do that correctly? Does the asterisk need to be included? Rua (talk) 12:16, 14 September 2018 (UTC)
I added reconstructed word (Q55074511). I don't know the convention for Proto-Samic in the etymological literature, but for Proto-Slavic there is an established convention in dictionaries to use an asterisk in reconstructed lexeme labels. Polish Wiktionary has it hardcoded in its rules: https://pl.wiktionary.org/wiki/Wikis%C5%82ownik:Zasady_tworzenia_hase%C5%82#S%C5%82owa_rekonstruowane KaMan (talk) 12:31, 14 September 2018 (UTC)
I think that's a general convention for all reconstructions, so I'll add it. What about the "mis" code? The Proto-Slavic entry has more after it, but I don't know how it works. Rua (talk) 12:36, 14 September 2018 (UTC)
@Rua: I think "mis" is OK, but I'd better ask @VIGNERON: whether it's sufficient. The id in the Proto-Slavic entry only points to the Proto-Slavic language itself; it's a Wikidata identifier. KaMan (talk) 12:47, 14 September 2018 (UTC)

Special:WhatLinksHere shows incomplete results for lexemes

KaMan's point of view

@Lea Lacroix (WMDE), Lydia Pintscher (WMDE): It was reported to be fixed (phabricator:T195302) but it looks like Special:WhatLinksHere still does not show all lexemes. Here is one report about this list. Another example is mrchy (L5600), which should be visible on this list. This last list contains 7 entries while according to statistics it should contain 31 entries. Should I create a new task for this or attach it to the previous task where it was reported to be solved? KaMan (talk) 12:10, 14 September 2018 (UTC)
  • It needs to be edited. Ideally someone would run a bot through all earlier lexemes. It could also set the content language for existing lexemes. This would also refresh the html header to include the lemma for all where it's missing. --- Jura 04:26, 15 September 2018 (UTC)
    But one of the subtasks (phabricator:T198301) of the earlier task on Phabricator was to loop over all existing lexemes and purge them, so an empty edit should not be needed. Perhaps this subtask failed. I don't know. KaMan (talk) 05:09, 15 September 2018 (UTC)
I replied on the ticket. It looks like we might have a bug. ·addshore· talk to me! 07:20, 17 September 2018 (UTC)

Rua's point of view

I found that sámegiella (L558) was missing from Special:WhatLinksHere&target=Q33947&namespace=146, but it appeared once I edited it. Because I didn't know it existed earlier, I created L20899 which had to be merged once I discovered the duplicate. Is this related to lexemes not showing up in searches? Oddly, sámegiella (L558) did show up in a search, that's how I found it. Rua (talk) 19:11, 14 September 2018 (UTC)

@Rua: Task phabricator:T198301 was rerun and now everything should be ok. KaMan (talk) 08:19, 18 September 2018 (UTC)

Porting or reusing Wiktionary inflection modules

I've done a lot of work on Wiktionary to produce modules for inflecting the various Sami languages (Q56463), such as wikt:Module:smi-common (common code used for all Sami languages), wikt:Module:se-common (common code for Northern Sami (Q33947)) and wikt:Module:se-nominals/wikt:Module:se-verbs. I would like to use this code to generate forms on Wikidata, but it would be a shame to have to duplicate all the code and maintain it in two places. Is there a way that Wikidata can reuse Wiktionary's modules? The modules are structured in such a way that the process creating the forms is clearly separated from the part that presents it in a table to the user, so it would be easy to add code that converts the table of forms to JSON or some other format that Wikidata understands. Rua (talk) 12:44, 14 September 2018 (UTC)
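As a rough illustration of the conversion step mentioned above (table of forms to JSON), here is a minimal sketch in Python rather than Lua. The output shape (one record per form with `representations` and `grammaticalFeatures`) is an assumption loosely modelled on how lexeme forms are described on this page, not a confirmed import format; the function name is invented, the item IDs for singular and plural are the ones cited elsewhere in this discussion, and the sample words come from the en-noun example rather than from a Sami module.

```python
import json

# Item IDs for grammatical features, as cited elsewhere on this page.
FEATURES = {"singular": "Q110786", "plural": "Q146786"}


def forms_to_json(forms, language):
    """Convert {(feature, ...): written form} into a JSON list of form records.

    The record layout is an assumed target format for illustration only.
    """
    records = []
    for features, text in forms.items():
        records.append({
            "representations": {language: {"language": language, "value": text}},
            "grammaticalFeatures": [FEATURES[f] for f in features],
        })
    return json.dumps(records, ensure_ascii=False, indent=2)


# A generated inflection table, keyed by the features of each form.
table = {("singular",): "taxon", ("plural",): "taxa"}
payload = forms_to_json(table, "en")
```

Keeping the generation step and the serialisation step separate, as the Wiktionary modules reportedly already do, means only a small function like this would need to be written per target format.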

Grammatical features that appear more than once

I thought of the case of Turkish nouns. Turkish nouns have grammatical number (Q104083), case (Q128234) and possessive forms. Within the property of "possessive" there is grammatical person (Q690940) and grammatical number (Q104083). Grammatical number appears twice. For example, teyzeleriniz means "your maternal aunts", and is composed of teyze "maternal aunt" + ler (plural suffix) + no case ending (nominative case) + iniz (second-person plural possessive). How would one denote the grammatical features for such a form? Rua (talk) 13:00, 14 September 2018 (UTC)

I already asked this question in Wikidata talk:Lexicographical data/Archive/2018/06#Pronominal forms. I tried to apply one of the proposed solutions at שָׁלוֹם/שלום (L384), but I'm not quite satisfied with this method. If you or anybody else have/has a better idea, I'd be happy to discuss it.--Shlomo (talk) 15:34, 14 September 2018 (UTC)
That discussion went on a tangent for sure! Is it possible to add qualifiers to grammatical features? If so, then a solution like "possessive" with the person and number as its qualifiers would work. Rua (talk) 15:39, 14 September 2018 (UTC)
@Rua, Shlomo: What if we used properties rather than grammatical features to describe such a complicated case? I mean using has quality (P1552) on the form with two values: first value singular (Q110786) with qualifier of (P642) noun (Q1084), and second value plural (Q146786) with qualifier of (P642) personal pronoun (Q468801)? KaMan (talk) 16:27, 14 September 2018 (UTC)
@KaMan: Technically, it could be possible, but I don't like the idea of (mis)using the properties (N.B. a property created for another purpose) for description of grammatical features, if we already have a separate position for them.--Shlomo (talk) 18:30, 16 September 2018 (UTC)
  • It's probably a simpler form of the problem, but for se figurer (L10633) one could easily define the forms of the verb on figurer (L10632) and all forms of the reflexive pronoun (Q953129) on se (L9159). --- Jura 04:21, 15 September 2018 (UTC)
    • That's what English Wiktionary already does. One of the senses is just marked "reflexive". Rua (talk) 11:03, 15 September 2018 (UTC)
      • That would be directly on figurer (L10632)? The disadvantage I see is that forms are only partially defined there. --- Jura 07:53, 16 September 2018 (UTC)
    @Jura1: IMHO it's not a simpler form of the problem addressed above, but a completely different problem. And again, there can be different approaches in different languages. The German linguists think of reflexive verbs as reflexive forms of a verb lexeme which is common for active, passive and reflexive forms. Russian dictionaries have one common lexeme for active and passive forms (if they both exist) but a separate lexeme for the reflexive one.--Shlomo (talk) 18:30, 16 September 2018 (UTC)
    That's because in German the reflexive pronoun remains a separate word, while in Russian it's fused with the verb and written as one word. Spanish and Italian are like Russian in that respect. —Rua (mew) 15:57, 18 September 2018 (UTC)
    Well, not exactly. In Spanish, the reflexive pronoun is fused only in some forms (irse vs. me voy). In Czech or Polish, the pronoun always remains a separate word like in German, but the reflexive verb is a separate lexeme like in Russian. The point is, we have to use in every language the system accepted by the linguists of that language and we shouldn't try to create unified rules for all languages of the world (+ Klingon…) --Shlomo (talk) 17:38, 22 September 2018 (UTC)
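To make the Turkish problem from the start of this section concrete, here is a minimal sketch comparing a flat set of grammatical features with the "feature plus qualifiers" idea suggested in the thread. The `Feature` class and the dictionary layout are hypothetical and do not describe what Wikidata forms actually support; they only show why a flat list loses information when grammatical number applies twice.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical grammatical feature with optional qualifiers."""
    name: str
    qualifiers: dict = field(default_factory=dict)


# teyze + ler (plural) + no case ending (nominative) + iniz (2nd person plural possessive)
form = {
    "representation": "teyzeleriniz",
    "features": [
        Feature("plural"),
        Feature("nominative"),
        Feature("possessive", {"person": "second", "number": "plural"}),
    ],
}


def flatten(features):
    """Collapse nested features into a flat set, as a plain feature list would."""
    flat = set()
    for f in features:
        flat.add(f.name)
        flat.update(f.qualifiers.values())
    return flat


flat = flatten(form["features"])
```

In the flat set, "plural" appears only once, so it is no longer clear whether it describes the aunts, the possessors, or both; the nested version keeps the two apart.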

New script to create new entities (including lexemes)

I have created a (very very immature and buggy) script at User:GZWDer/newentity.js. To use it:

  1. Add importScript( 'User:GZWDer/newentity.js'); to your common.js (or, alternatively copy this to browser console)
  2. Click "Create entity" at sidebar
  3. Copy e.g. {{en-noun|herald}} to the box and click "Create"
  4. Done. The dialog will not close (you may create another one).

Note that the template {{En-noun}} depends on Module:Lexeme-en, which is currently very immature too. To support more languages and parts of speech, you may create or improve similar modules.

It does not support creating multiple entities at once, feel free to improve (fork or completely rewrite) it. --GZWDer (talk) 04:00, 15 September 2018 (UTC)

Note that you must check the plural generated by the module. The module does some replacements, but the result will not always be correct. For creating nouns with irregular plurals, use {{en-noun|taxon|taxa}}. Also, it currently does not support uncountable nouns.--GZWDer (talk) 04:23, 15 September 2018 (UTC)
Verbs are also supported: {{en-verb|bully}}. Note that, due to real-life issues, I will not be able to comment or answer questions until January 2019.--GZWDer (talk) 04:59, 15 September 2018 (UTC)
Thank you @GZWDer:! I will definitely test it today with the Polish module. I have added your script to phabricator:T202282. Before you go on your break, are you able to also make a script which adds forms to an existing lexeme? KaMan (talk) 05:51, 15 September 2018 (UTC)
It works great! I can temporarily work around adding to existing lexemes by creating a new one and merging it with the old one. This way I created L21350 and then merged it with dostatni (L21348), and I got 84 forms in one step. Finally! Thank you very much, GZWDer. I will add your script to the Wikidata:Tools/Lexicographical_data page. KaMan (talk) 07:33, 16 September 2018 (UTC)
@GZWDer: Thank you for creating this great tool! I have created Module:Lexeme-ja and it works well with this tool (熱い/あつい (L21547), 鬱陶しい/うっとうしい (L21612), 雅/みやび (L21613), 複雑/ふくざつ (L21614), etc). If possible, could you please create a tool that edits an existing lexeme in a similar way? --Okkn (talk) 19:02, 16 September 2018 (UTC)

Is phoneme a lexeme?[edit]

https://www.wikidata.org/w/index.php?title=Special%3AWhatLinksHere&target=Q8183&namespace=146 KaMan (talk) 16:53, 15 September 2018 (UTC)

I would say they're not, phonemes don't have meanings. - Nikki (talk) 08:59, 16 September 2018 (UTC)
I would say the same. KaMan (talk) 10:57, 16 September 2018 (UTC)
+1, and what data would be stored on this page? I don't see any useful lexicographic information. Cdlt, VIGNERON (talk) 12:15, 16 September 2018 (UTC)
I would think not, though I think we could absolutely attest lexemes without meanings. --Njardarlogar (talk) 12:46, 16 September 2018 (UTC)
Like boo? —Rua (mew) 18:54, 17 September 2018 (UTC)
Two scenarios that came to my mind are: 1) an extinct language or language variety where we can attest a word but not infer its meaning, or 2) words used by one or more authors that are attestable but whose meanings are not inferable (e.g. because they are nonsensical). --Njardarlogar (talk) 08:50, 18 September 2018 (UTC)
And why lang=fr? I'd say it should be lang=mul but this code is not applicable, Lea Lacroix? --Infovarius (talk) 19:42, 18 September 2018 (UTC)

Language code not recognised: sga[edit]

I'm not able to enter a new lexeme for Old Irish, because it does not know the language code of Old Irish, which is sga. It asks me to enter the language code in addition to the language item (Old Irish (Q35308)), which is redundant. But then if I put in sga as the language code, it says "the supplied language code was not recognised". Rua (talk) 17:26, 15 September 2018 (UTC)

I just had it happen again with Middle English (Q36395). Rua (talk) 17:30, 15 September 2018 (UTC)

  • You need to type "mis" in the field that shows up. Result is something like Lexeme:L20161. --- Jura 18:20, 15 September 2018 (UTC)
    • But these languages have real ISO codes, why can't I use those? ang is recognised for Old English... Rua (talk) 18:29, 15 September 2018 (UTC)
      • There are five groups: (1) the ones one can use, (2) the ones that are available for labels or monolingual strings, (3) the ones that haven't been requested for that yet, (4) the ones langcom doesn't like, and (5) the ones that aren't valid IETF lang tags.
        Beyond (3) and (5), it's not clear to me what goes where, but it might explain the current system.
        Currently, we can just make sure the correct item is at the top of the L-entity. Items have the advantage of being flexible. --- Jura 18:59, 15 September 2018 (UTC)

Idea: arrange forms into paradigms[edit]

Right now you can specify multiple forms for the same grammatical features, in case there are multiple alternatives. However, it is also possible that there are two entire alternative sets of forms, or paradigms. For example, on en.Wiktionary there's the entry ráhpis, which has two separate inflection tables illustrating the paradigms: one follows the odd inflection (Q56633433) and the other follows the contracted inflection (Q56633449). While it is possible to simply enter these as alternative forms with the same grammatical feature, it is useful to indicate that certain forms belong together. A speaker is likely to use forms from only one paradigm for example, and unlikely to mix forms from different paradigms. It may even be considered bad style to mix, like it is bad style to mix color and colour in the same work.

So I think it may be useful if forms could be grouped into paradigms, which then in turn are part of lexemes. —Rua (mew) 19:36, 15 September 2018 (UTC)

Instead of introducing a whole new datatype/layer, this could be implemented with a property on each form that says which paradigm it belongs to. It would just be a string or integer; forms with the same value would be considered to belong together. —Rua (mew) 19:54, 15 September 2018 (UTC)
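A minimal sketch of how such a marker could be consumed, assuming each form carries a hypothetical paradigm value (a string or integer, as suggested above; no such property exists yet). The form representations below are placeholders, not real forms of any lexeme.

```python
from collections import defaultdict


def group_by_paradigm(forms):
    """Group form records by their paradigm marker.

    `forms` is an iterable of (representation, features, paradigm) tuples.
    The `paradigm` value is the hypothetical marker proposed in this thread,
    not an existing Wikidata property.
    """
    groups = defaultdict(list)
    for representation, features, paradigm in forms:
        # Features are stored as a frozenset because their order carries no meaning.
        groups[paradigm].append((representation, frozenset(features)))
    return dict(groups)


# Placeholder data: two alternative forms share the same grammatical features
# but belong to different paradigms.
forms = [
    ("form-A-sg", {"nominative", "singular"}, 1),
    ("form-A-pl", {"nominative", "plural"}, 1),
    ("form-B-sg", {"nominative", "singular"}, 2),
    ("form-B-pl", {"nominative", "plural"}, 2),
]
paradigms = group_by_paradigm(forms)
```

Forms with the same marker end up in the same group, so a tool could render one inflection table per group instead of mixing alternatives from different paradigms.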

Mutually exclusive grammatical features on forms?[edit]

I just noticed that portray (L12) has one form marked as both "simple past" and "past participle". But these are mutually exclusive, the form is either the past tense or the past participle, but not both at the same time. The forms for both of these just happen to be identical. Intuitively, these should be separate forms, rather than being combined into one. Is that correct?

Aside from that, why is "simple present" and "simple past" used here? Why not just present tense (Q192613) and past tense (Q1994301). Are these "simple" tenses somehow special? —Rua (mew) 09:44, 17 September 2018 (UTC)

@Rua: As for two features in single form IIRC that was choice of @ArthurPSmith: and discussed in this thread: Wikidata_talk:Lexicographical_data/Archive/2018/07#Some_thoughts_on_what's_missing. KaMan (talk) 10:07, 17 September 2018 (UTC) (Hmm, to self, but looks like ArthurPSmith did not edit portray (L12))
I was originally just following precedent on this (regarding all of Rua's questions) but I'm open to changing it. I think a bot can relatively easily fix these if we do reach a consensus on how to do it in the future. In fact, recently I have been using Lucas's form tool which automatically creates separate forms for the two different past forms even if they are identical strings. As to why to use "simple present" etc. I think the main reference would be en:Simple present - this may be peculiar to English, but the point is to distinguish these forms from the other present tenses used in the language (present perfect, present progressive). ArthurPSmith (talk) 14:51, 17 September 2018 (UTC)

New items for Germanic conjugation classes[edit]

For those working on Germanic languages, the following items are now available. Some already existed, but I created most of them just now.

You can add these to lexemes with has conjugation class (P5206). You should add these based on the current state of things in a language, not the historical/original state. That means that a verb that was once strong but is now weak should be considered weak. —Rua (mew) 16:27, 17 September 2018 (UTC)

Thanks, but has conjugation class (P5206) is dedicated to language elements, not lexemes, see example at Property:P5206#P1855. Perhaps you mean conjugation class (P5186) KaMan (talk) 16:50, 17 September 2018 (UTC)
Oops yes, sorry. Do what I mean, not what I say! —Rua (mew) 17:00, 17 September 2018 (UTC)
@Rua: You can always add both, one with an end period (P3416) (and of normal rank) and another with a start period (P3415) (and of preferred rank), so that whenever SPARQL starts to work on lexemes, using "wdt:P5186" only returns the value with preferred rank. Mahir256 (talk) 18:45, 17 September 2018 (UTC)
I'm sorry, I can't follow what that is in response to. —Rua (mew) 18:47, 17 September 2018 (UTC)
@Rua: This is in response to "You should add these based on the current state of things in a language, not the historical/original state.". One of the main purposes of the rank system is to distinguish historical information from current information when there are multiple values for a given property. Mahir256 (talk) 18:51, 17 September 2018 (UTC)
Ah, I got you now. I agree that we can do that, but only to a point. To use English as an example, there's a point where the language is no longer considered English, but rather Middle English. So no usage for what is considered modern English can go back before 1500. —Rua (mew) 18:53, 17 September 2018 (UTC)
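The rank behaviour described above can be illustrated with a small sketch. It only models the selection rule (preferred rank wins, otherwise normal rank, deprecated never); the function name and the string values are made up for illustration and are not part of any Wikidata API.

```python
def truthy_values(statements):
    """Return the values a "wdt:"-style lookup would see.

    `statements` is a list of (value, rank) pairs, with rank one of
    "preferred", "normal" or "deprecated". Preferred-rank values win;
    if there are none, normal-rank values are returned; deprecated
    values are never returned.
    """
    preferred = [value for value, rank in statements if rank == "preferred"]
    if preferred:
        return preferred
    return [value for value, rank in statements if rank == "normal"]


# Illustrative values only: a verb that used to be strong and is now weak.
statements = [
    ("strong verb", "normal"),   # historical class, would carry an end period
    ("weak verb", "preferred"),  # current class, would carry a start period
]
current = truthy_values(statements)
```

With both statements present, only the current class is returned, while the historical one stays available to queries that look at all statements.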

I created vriezen (L21685) and added one of the items to it. But it also has an additional irregular feature, Grammatischer Wechsel (Q1542100). I've added it as a qualifier to the conjugation, but I figure it could also be added directly as a property of the lexeme itself. What would be better? —Rua (mew) 17:07, 17 September 2018 (UTC)

Indicating the pronoun in conjugation forms?[edit]

In Dutch, there are different second-person singular forms depending on which pronoun you're using. Spanish has something similar. There is currently no item to use as a grammatical feature for this, but I don't think an item would make sense, because the pronouns are lexemes and not items. How can I indicate that the form is restricted to use with certain (one or more) pronouns? Is there a property I can use? —Rua (mew) 18:33, 17 September 2018 (UTC)

I actually have even more issues when creating the forms for zijn (L21707). Some forms are used only colloquially, some are archaic, and some are only used when the pronoun immediately follows the verb form. How do you indicate all this stuff?? —Rua (mew) 18:52, 17 September 2018 (UTC)

@Rua: For colloquially or archaic forms I use instance of (P31) with colloquial form (Q55228835), obsolete form (Q54943392), former form (Q56247521). See Template:Lexicographical properties for more values. KaMan (talk) 03:55, 18 September 2018 (UTC)
@Rua: For case when "the pronoun immediately follows the verb form" I recall something similar in examples listed on page requires grammatical feature (P5713). KaMan (talk) 12:39, 18 September 2018 (UTC)
I don't think that's applicable here. What I mean is that when you say "you are" in Dutch, you say jij bent. But when you swap the two words around, like in a question, it becomes ben jij. The -t ending is dropped, but only when the subject pronoun jij immediately follows it. This applies to all verbs, this is just one example. —Rua (mew) 13:59, 18 September 2018 (UTC)
If it isn't included in the form directly, I think we should include it in some other way. --- Jura 15:01, 18 September 2018 (UTC)
It should be included in the form, but how? —Rua (mew) 15:54, 18 September 2018 (UTC)
Maybe with requires grammatical feature (P5713) on the form. You may need to make some item for the rule.--- Jura 18:32, 18 September 2018 (UTC)

List of properties used for Lexeme[edit]

Where can I find complete list of properties used for Lexemes? Wikidata:List of properties/Wikidata property for lexemes seems incomplete. -Nizil Shah (talk) 05:59, 18 September 2018 (UTC)

@Nizil Shah: I try to keep Template:Lexicographical properties up to date. KaMan (talk) 06:30, 18 September 2018 (UTC)


Count number of vowels[edit]

What would we need to add to Wikidata to be able to count the number of vowels in a reliable way? This even if lemmas and forms don't have IPA statements. The idea is that a query could do that for a given language (e.g. French). One could query for words with 2 or 3 vowels. --- Jura 08:49, 18 September 2018 (UTC)

@Jura1: do you have an example where the number of vowels is not trivial, or at least simple, to find? I guess you are speaking of vowel (Q36244) (vowel, the sound) and not vowel letter (Q9398093) (vowel, the letter). Is it worth asking people to add this information when it is just as easy, and far better, to enter the IPA? Cdlt, VIGNERON (talk) 12:12, 18 September 2018 (UTC)
  • I don't really know if it's trivial. Here is a list of names: [1]. Can you add a column with the number of vowels? If you volunteer to fill IPA when needed, that might even be better. --- Jura 14:55, 18 September 2018 (UTC)
    • @Jura1: Obviously not: these are QIDs, and for each of these names there can be multiple lexemes with multiple pronunciations (as you can see, for most of these names the native label (P1705) is very often in "multiple languages"). A good example is Berger (Q1260304), which can be "shepherd" in French \bɛʁʒe\ or "mountaineer" in German \ˈbɛʁɡɐ\; that said, in both cases there are obviously 2 vowels. I can help add IPA to lexemes, but my point is: if someone can count the vowels, then the same person is probably able to add the IPA. It takes the same level of knowledge, but the second is more useful and includes the first. Cdlt, VIGNERON (talk) 17:44, 18 September 2018 (UTC)
  • The count can be done under the assumption that it's in French (given the statement on the item). If L-entities were available, we could do the same with lexemes. The idea of queries is that the query server does the count. --- Jura 18:28, 18 September 2018 (UTC)
  • As an input, I think the query should mainly use a selection of words (labels here, but generally lemmas and forms when this becomes possible). The language here needs to be given, but would be defined on the entity for lexemes. The nice thing with the name list provided above is that it can indeed be analyzed with different languages. Maybe a more interesting one would be "Jean". As the definition of vowel, the query should just take vowel (Q36244). I think this is an application that is generally possible with dictionaries, even without IPA. Accordingly, it's something Wikidata and the Wikidata query server should be able to do. @KaMan: how would you go about it? --- Jura 05:59, 19 September 2018 (UTC)
    I don't know. I admit I'm less interested in this type of application. I'm more interested in building dictionaries with centralised data. KaMan (talk) 06:34, 19 September 2018 (UTC)
    • Well, it's part of that. If you have a French print dictionary at hand, I can find for you where you get the necessary information there. Here we need to find out how to include it in Wikidata. --- Jura 06:38, 19 September 2018 (UTC)
  • The necessary steps are probably
    • (1) define the vowels
    • (2) define corresponding graphemes
    • (3) prioritize the graphemes (longer first?)
    • (4) > count
Seems like something that can generally be done with information included in print dictionaries and something query server should be able to handle with structured Wikidata. --- Jura 07:23, 20 September 2018 (UTC)
  • Adding brackets (samples: "⟨o⟩" and "⟨eau⟩") could help distinguish them. Let's see if I can manage to insert them. --- Jura 10:40, 23 September 2018 (UTC)
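The four steps listed above can be sketched in Python as a greedy, longest-grapheme-first scan. The grapheme inventory below is a toy set invented for illustration; it is neither complete nor an accurate description of French orthography, and nothing like it is stored in Wikidata today.

```python
def count_vowel_graphemes(word, graphemes):
    """Count vowel graphemes in `word`, trying longer graphemes first."""
    # Step 3: prioritize longer graphemes so "eau" is matched before "e".
    ordered = sorted(graphemes, key=len, reverse=True)
    text = word.lower()
    count = 0
    i = 0
    while i < len(text):
        for grapheme in ordered:
            if text.startswith(grapheme, i):
                count += 1          # step 4: count one vowel per grapheme
                i += len(grapheme)
                break
        else:
            i += 1                  # no vowel grapheme starts at this position
    return count


# Steps 1 and 2, reduced to a toy inventory (an assumption, for illustration only).
FRENCH_TOY_GRAPHEMES = {"a", "e", "i", "o", "u", "y", "ou", "oi", "au", "ai", "eau"}

bateau = count_vowel_graphemes("bateau", FRENCH_TOY_GRAPHEMES)  # "a" + "eau"
oiseau = count_vowel_graphemes("oiseau", FRENCH_TOY_GRAPHEMES)  # "oi" + "eau"
```

With this toy table, "Jean" would be counted as two graphemes ("e" and "a") although it is pronounced with a single vowel, which shows that the result is only as reliable as the per-language grapheme data behind it.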

Sample application: generate a print dictionary?[edit]

As a proof of concept, I think it would be interesting to attempt to generate printable mini-dictionaries. Maybe a monolingual one, a bilingual one, and a specialized one for a small list of words (e.g. 100), but including the usual other tables included in dictionaries. --- Jura 08:49, 18 September 2018 (UTC)

@Jura1: you should add that to Wikidata:Lexicographical data/Ideas of tools. Cdlt, VIGNERON (talk) 11:43, 18 September 2018 (UTC)
  • Eventually, but I think it would need some preparation from the editorial side before (content and structure) --- Jura 14:57, 18 September 2018 (UTC)

Do grammatical features have an order?[edit]

I noticed that when entering grammatical features, if you then reload the page, the features are shown in a different order. Is there any particular order to them, such as by item ID? —Rua (mew) 19:24, 18 September 2018 (UTC)

@Rua: no, there is no order in the grammatical features, nor in the data in general. Do you need an order? In general, order doesn't matter in data, and a different order can be inferred from other data. Take the children of a person: when you reuse the list of children, you can sort them alphabetically or by birthdate. Cdlt, VIGNERON (talk) 20:17, 18 September 2018 (UTC)
I am currently working on implementing pywikibot code for forms, and was wondering if I could use a set (which is unordered by nature) to store grammatical features, or if I should use a list (which is ordered). Your reply answers that, thank you! —Rua (mew) 20:22, 18 September 2018 (UTC)
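A short sketch of why a set fits here, assuming features are handled as item ID strings. This is not the actual pywikibot code, and the IDs are placeholders.

```python
def same_features(a, b):
    """Compare two collections of grammatical features, ignoring order."""
    return frozenset(a) == frozenset(b)


# Placeholder item IDs: the same two features, returned in a different
# order after the page is reloaded.
before_reload = ["Q111", "Q222"]
after_reload = ["Q222", "Q111"]

as_lists_equal = before_reload == after_reload            # order-sensitive
as_sets_equal = same_features(before_reload, after_reload)  # order-insensitive
```

A list comparison would report a spurious change after every reload, while the set comparison treats both orders as the same form.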

A few questions![edit]

Hi guys, I have a few questions. I'm still trying to understand how Wikidata and Wiktionary can work together.

1. I noticed there is experiment (L110), and experiment (Q101965). One is a lexeme (which is another concept I'm trying to understand), and the other is an entry about it. What is the difference? Do they fulfill different roles? Is it possible to link experiment (Q101965) to experiment (L110) with a property?

2. Is there a list of lexemes by language? If so, where?

3. Can scripts be written using lexemes, like Mbabel does with properties?

Thanks a lot for the answers. Tetizeraz (talk) 20:58, 18 September 2018 (UTC)

1. Not yet, but in a month or so we will be able to, when senses are introduced.
2. You can sort of get one if you go to the item (Q page) for a language, then go to the "What links here" page for that item, and then filter the results by namespace so that only pages in the Lexeme namespace are shown.
3. I don't know about that.
Rua (mew) 21:09, 18 September 2018 (UTC)
Thanks for the reply Rua (talk • contribs • logs). Would you mind another question? What are the senses you mention in your first answer?
The things that define the meanings of the lexemes. They are currently missing from Wikidata, but are planned to be added in October. —Rua (mew) 22:13, 18 September 2018 (UTC)
@Tetizeraz:
1. experiment (L110) is about a single word in a single language, while experiment (Q101965) is about a language-independent concept. You can connect them using the property item for this sense (P5137). This property should be placed under a sense in experiment (L110), but the senses feature is not ready yet, so for now we use this property on the lexeme itself. You can read what a sense is on the documentation page where the data model is described, see Wikidata:Lexicographical data/Documentation
2. As Rua answered, You can find all Portuguese lexemes on this page https://www.wikidata.org/w/index.php?title=Special%3AWhatLinksHere&target=Q5146&namespace=146
3. Usage of lexemes on other projects is not possible for now but is on the agenda for the future (mainly towards Wiktionaries). There is a list what will be added in the future, see Wikidata:Lexicographical_data/Documentation#What_will_be_added_in_the_future
KaMan (talk) 07:27, 19 September 2018 (UTC)
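The "What links here" approach from answer 2 can also be sketched against the MediaWiki API, using the standard `list=backlinks` query module. This is a rough illustration, not a tested tool: it only builds the request URL and reads titles out of a response, and the namespace number 146 is taken from the links in this thread.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"
LEXEME_NAMESPACE = 146  # as used in the WhatLinksHere links in this thread


def lexeme_backlinks_url(language_item, limit=50):
    """Build an API URL listing Lexeme pages that link to a language item."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": language_item,
        "blnamespace": LEXEME_NAMESPACE,
        "bllimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def titles_from_response(data):
    """Extract page titles from a decoded `list=backlinks` JSON response."""
    return [page["title"] for page in data["query"]["backlinks"]]


url = lexeme_backlinks_url("Q5146")  # Q5146 is the item used in answer 2
```

Like the web page, this lists every lexeme page that links to the item, so it can include lexemes that mention the language without being in it.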

Thanks Rua (talk • contribs • logs), Jura1 (talk • contribs • logs), KaMan (talk • contribs • logs) for the answers! :) Tetizeraz (talk) 03:31, 21 September 2018 (UTC)

I got a en.Wiktionary template to generate Wikidata forms[edit]

I did a bit of coding on en.Wiktionary and got a regular inflection template to generate JSON suitable for Wikidata. You can see it here: https://en.wiktionary.org/wiki/Special:ExpandTemplates?wpInput={{se-infl-verb-even%7Cealli%7Coutput=Wikidata}} . If you expand the template with output=Wikidata, you get Wikidata forms in JSON format. Without it, you just get a regular Wiktionary HTML table with forms.

I think this is very useful for multiple reasons:

  • No need to maintain code in two places at once; it stays at its place of origin and gets reused.
  • No need to learn a new system, just use the templates you are already familiar with if you are a Wiktionary editor.
  • A JavaScript gadget on Wikidata could run "expandTemplates" remotely on Wiktionary, then use the returned JSON to generate new forms on Wikidata. I don't know if there are cross-site scripting barriers that prevent this, though.
  • A bot could get a Wiktionary page, find the template, add output=Wikidata to it, then expand it, and get JSON from which to generate forms on Wikidata. This could make importing of forms very easy!
  • If the template is able to generate other things like pronunciations automatically, it could convert these to claims and add them, too.

For those who want to see how it works, look at wikt:Module:se-verbs and scroll down to the bottom. Currently there is still a problem, in that the generated representations are in dictionary spelling, not the spelling used by writers of Northern Sami. Because it's not clear how to handle these yet, I can't generate Northern Sami forms just yet, but for languages that have no such thing as dictionary spellings, it should be fine. —Rua (mew) 21:24, 18 September 2018 (UTC)

If these points are possible at all (and I think they are not, except 4), the main flaw I see is that this template is at en.wikt. Why not ru.wikt or another Wiktionary? I would then suggest having it at Wikidata and reusing it in every Wiktionary. But this brings us back to filling Wikidata lexemes, which of course could be done as in your 4th point. --Infovarius (talk) 04:34, 19 September 2018 (UTC)
The point is to reuse the templates that already exist, instead of copying them to Wikidata and having to maintain both. You can't use templates from another wiki, so Wiktionary can't use Wikidata templates. But you can call expandTemplates. —Rua (mew) 09:27, 19 September 2018 (UTC)
Also, I got point 3 to work. See https://www.wikidata.org/w/index.php?title=User:Rua/common.js&oldid=748795819 . If you put that in your common.js, it will display an alert window with the expanded wikitext, including the JSON for all the forms. The trick seems to be to include the origin parameter, which allows for cross-site scripting among Wikimedia sites.
@GZWDer:, do you think you could do something with this knowledge, i.e. allow users to enter templates from particular Wiktionary sites, and expand them remotely through this method? —Rua (mew) 09:58, 19 September 2018 (UTC)
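For the bot variant (point 4), a rough sketch in Python of the same idea, using the standard `action=expandtemplates` API module. The helper names are made up, the sketch only builds the request and decodes a response, and it assumes the expanded wikitext consists of nothing but JSON, which may not hold for the real template output.

```python
import json
from urllib.parse import urlencode

WIKTIONARY_API = "https://en.wiktionary.org/w/api.php"


def with_wikidata_output(template_call):
    """Append |output=Wikidata to a template call such as '{{name|arg}}'."""
    if not (template_call.startswith("{{") and template_call.endswith("}}")):
        raise ValueError("not a template call: %r" % template_call)
    return template_call[:-2] + "|output=Wikidata}}"


def expand_url(template_call):
    """Build an expandtemplates request URL for the modified template call."""
    params = {
        "action": "expandtemplates",
        "text": with_wikidata_output(template_call),
        "prop": "wikitext",
        "format": "json",
    }
    return WIKTIONARY_API + "?" + urlencode(params)


def forms_from_response(data):
    """Decode the forms from a decoded expandtemplates JSON response.

    Assumption: the expanded wikitext is pure JSON.
    """
    return json.loads(data["expandtemplates"]["wikitext"])


call = with_wikidata_output("{{se-infl-verb-even|ealli}}")
```

The `origin` parameter mentioned above matters for scripts running in a browser; a bot making requests from outside the browser does not need it.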

Search box currently on Lexeme:L123/Lexeme:L1234[edit]

To look up words, the form and lemma lookup on these sandbox pages are most helpful.

MediaWiki has a form extension that allows setting up search forms. Could the same be done for these autolookup forms? This way we could place them somewhere other than the sandbox. --- Jura 07:12, 19 September 2018 (UTC)

Hello Jura,
I don't really understand what you're asking for.
  • What is the "search box currently on the sandbox" you're referring to? Is that about the tip of editing a statement and looking for a Lexeme or Form in the value field?
  • What exactly do you want to achieve? What action would you like to be able to do (more easily)? Looking for something inside an existing page, or looking for a Lexeme or a Form in the whole database?
Thanks, Lea Lacroix (WMDE) (talk) 10:24, 19 September 2018 (UTC)

The idea is to have the function that finds forms or lemmas in statements (as you suggested we use on L123) available in a form like the one above. --- Jura 10:43, 19 September 2018 (UTC)

I am not sure I understand, but is Ordia able to help? https://tools.wmflabs.org/ordia/ --Finn Årup Nielsen (fnielsen) (talk) 09:05, 20 September 2018 (UTC)
The search box there does the same as the one above. I'm looking for one that users could place and that would work like the one at the top right corner of the website, but for lexemes, not items. --- Jura 15:02, 20 September 2018 (UTC)

New form lookup speed (newly created lexemes)[edit]

Somehow it seems to take a long time till forms of new lexemes show up in autocomplete (to use the form in a statement). This takes longer than the lemma or newly added forms to preexisting lexemes. --- Jura 08:15, 19 September 2018 (UTC)

Is that a matter of seconds, minutes, hours? Can you give us a few examples of the time you encountered while editing? Lea Lacroix (WMDE) (talk) 10:27, 19 September 2018 (UTC)
  • It didn't show up when I created carpere (L22578) and wanted to add "carpe" in a qualifier at carpe diem (L22577) . Can you give it a try and tell me if you can confirm? --- Jura 10:35, 19 September 2018 (UTC)
  • It seems it was a problem of leading whitespace. Removing that fixed it. [2]. Maybe past problems were similar. --- Jura 11:32, 19 September 2018 (UTC)
  • The same problem occurred with items yesterday. I waited several hours before new items (or even new labels) could be used in statements. And PetScan just ignores such new items. --Infovarius (talk) 11:24, 21 September 2018 (UTC)

Language-specific considerations and conventions?[edit]

On Wiktionary there are often arrangements and conventions that apply per language, and these are documented in pages like wikt:WT:About Dutch or wikt:WT:About Middle Dutch. They cover whatever is necessary to make things consistent for that particular language, such as what spelling standards to use, what additional diacritics, and anything else. We should probably make something like this for languages on Wikidata as well. For example, for the Middle Dutch lexemes we have so far, I followed the spelling normalization standards of English Wiktionary, but it's quite possible that someone finds a different spelling in their dictionary (because Middle Dutch spelling is variable) and decides to change the spelling of some lemmas. For example they might change seggen (L22722) to say secgen. This leads to inconsistency and the potential for edit warring, so it would be nice if the preferred/desired shape a lexeme should have is documented somewhere. —Rua (mew) 09:37, 20 September 2018 (UTC)

@Rua: See Wikidata:Lexicographical data/Documentation/Languages KaMan (talk) 09:43, 20 September 2018 (UTC)
Ah, I had missed that. Thank you! —Rua (mew) 09:44, 20 September 2018 (UTC)

I created Wikidata:Lexicographical data/Documentation/Languages/dum. If anyone has feedback, can you leave it on the talk page? Thanks in advance! —Rua (mew) 10:59, 20 September 2018 (UTC)

Can't add a lemma with two items in the code[edit]

I'm trying to create lemmas for the suffix -mus/-eamos (L22801), but I'm not able to enter all the lemmas I want. There are two different forms of the lemma, one when the suffix is attached to a word with even inflection (Q56633409) and another when attached to a word with odd inflection (Q56633433). However, the second variant also has a variant -eamọs with Pronunciation respelling (Q7249970), and I have entered a pronunciation respelling in guossi (L22017) with this item in the lemma code. So I thought I would enter the variant for -mus/-eamos (L22801) by combining the items, as se-x-Q56633433-Q7249970. But this is rejected, so I can't enter it. What should I do? —Rua (mew) 14:07, 20 September 2018 (UTC)

  • Interesting experiment, but I think you can only add one at a time. Maybe https://www.wikidata.org/w/index.php?oldid=749416521 --- Jura 14:24, 20 September 2018 (UTC)
    • I thought of this, but there are pairs of lemmas that are written the same, while being distinguishable by the pronunciation respelling. Consider the triplet bassi (L22826) ("holy", adjective), bassi (L22827) ("holiday", noun), bassi (L22828) ("cleaner, janitor", noun). The difference in the length of the consonants, indicated by the pronunciation respelling, more clearly distinguishes these. Without it, the two nouns would both just say "bassi, Northern Sami noun", and then which do you choose? I already have this problem with identical-looking suffixes like -dit (L22794) versus -dit (L22795). So if the pronunciation respellings are going to be added only on forms, then I hope that there will be a better way to distinguish lemmas. —Rua (mew) 14:40, 20 September 2018 (UTC)
@Rua: Indeed, I just tried it on sandbox (L123) and cannot add it either. Sorry for my earlier comment, I had assumed it was possible. I don't know what to do in your case :( KaMan (talk) 14:51, 20 September 2018 (UTC)
  • Lemma and lexical category aren't meant to fully disambiguate. Only checking senses can sort this out. I should probably add "pronunciation" to the table above. --- Jura 14:54, 20 September 2018 (UTC)
  • BTW, if you want to reference the exact form using the suffix, you need to make two separate forms (F1 and F2). BTW, maybe this should be two lexemes anyways. --- Jura 15:22, 20 September 2018 (UTC)
    • They're the same thing, with the same etymology and meaning. All that differs is which one you use, and that depends on the word you attach it to. They are allomorph (Q1124301). —Rua (mew) 17:42, 20 September 2018 (UTC)
  • If the code you are trying to build matches one that is an IETF language tag, you could just make an item for that and add it. --- Jura 12:57, 23 September 2018 (UTC)

Bug: can't add lexeme into etymology[edit]

I'm trying to add combines (P5238) to čála (L23029) with the values čállit (L23028) and -a/-at (L22784). But after selecting the property, if I fill in čállit into the value box, it finds no results. If I fill in the lexeme ID manually, it still says "no match found", even though a lexeme with that ID clearly exists. Because no match is found, I'm unable to publish the property. This seems like a bug? —Rua (mew) 19:35, 20 September 2018 (UTC)

I'm getting it with suola (L23031) too now. —Rua (mew) 19:38, 20 September 2018 (UTC)

I managed to edit čála (L23029), but suollit (L23030) still can't be found. —Rua (mew) 20:46, 20 September 2018 (UTC)

Sometimes the auto-complete search/lookup indexing is delayed a bit, that might explain it if you were trying to do this within a minute or two of creating the lexeme you wanted to use as the value? ArthurPSmith (talk) 20:48, 20 September 2018 (UTC)
It's been over an hour already. —Rua (mew) 20:58, 20 September 2018 (UTC)
Ok, now suollit (L23030) works too. Still, why won't it even allow me to fill in the ID? It obviously exists, it doesn't even have to do any searching, yet it tells me it doesn't exist? —Rua (mew) 21:00, 20 September 2018 (UTC)

Requesting permission for bot import: Proto-Uralic and Proto-Samic lexemes[edit]

Now that Álgu ID (P5903) and Uralonet ID (P5902) have been created, I would like to do an import of all lexemes on English Wiktionary for these two languages. Only lemmas would be imported, and the Álgu and Uralonet IDs if they are present in the Wiktionary entry. If a Proto-Samic lexeme derives from a Proto-Uralic one and is marked as such in Wiktionary, the bot will add a property for that. I have User:MewBot, a bot account that I have used a lot on Wiktionary, but it doesn't have a bot flag yet. —Rua (mew) 12:28, 21 September 2018 (UTC)

How many will there be? We now have 8063 English lexemes. English and French lexemes are added by hand, but in large batches. After all, how can we test whether we are ready for imports if we never actually do one? @Lea Lacroix (WMDE), Lydia Pintscher (WMDE): ? KaMan (talk) 12:53, 21 September 2018 (UTC)
The categories are wikt:Category:Proto-Uralic lemmas and wikt:Category:Proto-Samic lemmas. —Rua (mew) 13:24, 21 September 2018 (UTC)
With Proto-Uralic there is a bit of an issue in that it really contains several languages, Proto-Uralic (Q288765) proper, Proto-Finno-Ugric (Q2499870) and Proto-Finno-Permic (which doesn't even have an item on Wikidata). Linguists assign forms to one of these based on which branches it can be found in, but they are largely treated as one language otherwise. The reason is that these three "languages" are pretty much the same phonologically and very few changes from one to the next can be reconstructed, so the main differences are in reconstructable vocabulary. Uralonet only has entries for the "oldest" reconstruction that can be made, so if there is Proto-Uralic, there will be no Proto-Finno-Ugric or Proto-Finno-Permic. English Wiktionary has completely merged these languages, treating them all as one Proto-Uralic language and treating Finno-Ugric and Finno-Permic as dialects of it. If entries are imported, then this would be the case for Wikidata's new lexemes too. —Rua (mew) 13:33, 21 September 2018 (UTC)

I have gotten the Proto-Uralic/Finno-Ugric part to work, it's creating entries with the right part of speech and adding the Uralonet ID if there is one. I only did a few edits to make sure it all works, using my own account. —Rua (mew) 18:44, 21 September 2018 (UTC)

Creating Proto-Samic entries now works too. The Álgu ID is added, and a property for the derivation is added if a parent lexeme exists. I think it's all ready to run? —Rua (mew) 19:40, 21 September 2018 (UTC)
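The import steps described in this thread can be sketched roughly as follows. This is only an illustration of the logic, not MewBot's actual code: the `WiktionaryEntry` record, the `plan_import` function, the "derived from" label and the sample entries are all hypothetical, and nothing here talks to Wiktionary or Wikidata. Only the property IDs P5902 (Uralonet ID) and P5903 (Álgu ID) come from the discussion above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record for one Wiktionary lemma entry (not a real API type).
@dataclass
class WiktionaryEntry:
    lemma: str
    language: str                       # "Proto-Uralic" or "Proto-Samic"
    part_of_speech: str
    external_id: Optional[str] = None   # Uralonet or Álgu ID, if present
    parent_lemma: Optional[str] = None  # Proto-Uralic ancestor marked in Wiktionary

# External-ID property per language, as named in the thread.
ID_PROPERTY = {"Proto-Uralic": "P5902", "Proto-Samic": "P5903"}


def plan_import(entries: List[WiktionaryEntry]) -> List[dict]:
    """Return one planned lexeme per entry, parents before descendants."""
    # Stable sort: Proto-Uralic entries first, so parent lexemes exist
    # before any Proto-Samic entry tries to link to them.
    ordered = sorted(entries, key=lambda e: e.language != "Proto-Uralic")
    known_parents = set()
    plans = []
    for entry in ordered:
        statements = []
        if entry.external_id:
            statements.append((ID_PROPERTY[entry.language], entry.external_id))
        # Only add the derivation if the parent lexeme is actually planned.
        if entry.parent_lemma and entry.parent_lemma in known_parents:
            statements.append(("derived from", entry.parent_lemma))
        if entry.language == "Proto-Uralic":
            known_parents.add(entry.lemma)
        plans.append({
            "lemma": entry.lemma,
            "language": entry.language,
            "category": entry.part_of_speech,
            "statements": statements,
        })
    return plans


# Made-up sample entries, for illustration only.
sample_entries = [
    WiktionaryEntry("*child", "Proto-Samic", "noun", "42", "*parent"),
    WiktionaryEntry("*parent", "Proto-Uralic", "noun", "7"),
    WiktionaryEntry("*orphan", "Proto-Samic", "verb", None, "*missing"),
]
plans = plan_import(sample_entries)
```

Sorting the parents first is a design choice for the sketch: it lets the derivation link be skipped cleanly when no parent lexeme exists, which matches the "if a parent lexeme exists" condition above.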

We were asked by Lea in Wikidata talk:Lexicographical data/Archive/2018/05#About_mass_imports_and_tools not to do any mass-imports yet and she reiterated that a couple of weeks ago in #Senses_will_be_released_on_October_18th, so you should wait until she says it's ok before actually running it. In the meantime, make sure you've read our page about bots at Wikidata:Bots. You could already add {{bot|Rua}} to your bot's user page, redirect its talk page to yours (unless you want to keep them separate) and start the request for permissions. - Nikki (talk) 20:38, 21 September 2018 (UTC)

Ok, I added a request at Wikidata:Requests for permissions/Bot/MewBot and I'll wait for a response from the project leaders. —Rua (mew) 09:39, 22 September 2018 (UTC)
Hey :) It's fine from our side now as a trial if: 1) it's not going to make Wiktionary people hate us 2) there are no legal issues 3) you believe you are ready to handle it socially 4) you're aware that APIs etc might still change so tools built on top of it like bots might need to be adjusted. --Lydia Pintscher (WMDE) (talk) 13:08, 22 September 2018 (UTC)
Wiktionary people already hate Wikidata, that's kind of why I'm here. But I did a lot of work creating these entries on Wiktionary and I'm not about to let it go to waste. As far as legal issues go, mere lemmas, forms, etymological relationships and links to external databases are not copyrightable information, they are simple facts about the language. Importing senses and other prose text would be another matter, but I'm not doing that. —Rua (mew) 13:12, 22 September 2018 (UTC)
Oh, and regarding tools, I submitted this to the pywikibot team so that it can be used easily by anyone using pywikibot in the future. If anything changes, me or someone else can submit the appropriate patch to pywikibot to bring it up to date. My own experiences with Gerrit are rather sour though, it's totally unintuitive and I have no clue how to even post messages like on GitHub, or contact the project owners in any way, so I'd prefer it if someone else did in the future. —Rua (mew) 13:20, 22 September 2018 (UTC)

Examples with phrases: Do we have any?

I suppose we can have phrases in Lexeme-Wikidata, such as "like a bull in a china shop" [3]. However, I do not see any statement about them. I found multiword entries planetary nebula (L2883), nuclear magnetic resonance spectroscopy (L10213) and magnetic resonance microscopy (L10216) with no statements. I do not recall any properties targeted for phrases. — Finn Årup Nielsen (fnielsen) (talk) 13:06, 21 September 2018 (UTC)

@Fnielsen: Like w aksamitnych rękawiczkach (L17516) or Układ Słoneczny (L2954)? KaMan (talk) 14:49, 21 September 2018 (UTC)
@KaMan: Thanks! I am wondering about the use of combines (P5238). I thought it was only for compound words, — not phrases. — Finn Årup Nielsen (fnielsen) (talk) 15:21, 21 September 2018 (UTC)
@Fnielsen: We can of course introduce a new property in the future, but in the meantime, so as not to lose information, I think it is better to use combines (P5238) than nothing. OTOH, in my language the label of combines (P5238) fits use for phrasemes, and I like to keep the number of etymological properties minimal so that they stay simple to use. KaMan (talk) 15:29, 21 September 2018 (UTC)

Hypocorism

Hi. Do we have any property that can be used for hypocorism (Q1130279)? And if not, I am not sure whether it should be a given name → hypocorism or a hypocorism → given name property. --Lexicolover (talk) 16:01, 21 September 2018 (UTC)

Hmmm, perhaps you can use derived from (P5191) with mode of derivation (P5886) set to hypocorism (Q1130279). KaMan (talk) 16:35, 21 September 2018 (UTC)
Well, in my understanding hypocorism is not a process but rather a result that can be achieved in different ways. --Lexicolover (talk) 19:29, 21 September 2018 (UTC)
@Lexicolover: Hmm, then I would use derived from (P5191) or combines (P5238) set to these "different ways" and instance of (P31) set to hypocorism (Q1130279). But if you feel uncomfortable with it, then there can always be a new property. KaMan (talk) 04:36, 22 September 2018 (UTC)

Idea: when creating a lexeme, show if matching lexemes already exist

When you create a lexeme, it is often useful to know whether someone else has already created it. This avoids creating duplicates that have to be merged again later. To help with this, I think it would be useful if the Special:NewLexeme page automatically tried to find existing lemmas that match what you typed, and showed them to you in some way, with their language and lexical category. That way you can judge for yourself whether these lexemes match what you are about to create, or are different. Perhaps the search could be further limited by language and lexical category, once you have selected them. —Rua (mew) 15:24, 22 September 2018 (UTC)

I agree, such a feature on Special:NewLexeme would be great. It is already built into Wikidata Lexeme Forms and it is very helpful. There, you need to tick a checkbox to indicate that the new lexeme is not a duplicate. KaMan (talk) 15:42, 22 September 2018 (UTC)
  • There are a couple of fixes still needed (and hopefully in the pipeline) for NewLexeme:
  1. prefill F1 with lemma
  2. allow fill by url
  3. allow direct entry for "mis"-languages
  4. allow entry of gloss for S1
  I wonder where they are at? --- Jura 12:55, 23 September 2018 (UTC)
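The duplicate check proposed at the top of this thread can be sketched as a small in-memory filter. This is not how Special:NewLexeme or Wikidata Lexeme Forms implement it; the `Lexeme` record, the `find_possible_duplicates` function, the lexeme IDs and the sample list are hypothetical and only show the matching rule: compare the typed lemma against existing lemmas, then optionally narrow by language and lexical category once those are selected.

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, simplified view of an existing lexeme.
@dataclass(frozen=True)
class Lexeme:
    lexeme_id: str
    lemma: str
    language: str          # item ID of the language
    lexical_category: str  # item ID of the lexical category


def normalize(text: str) -> str:
    """Compare lemmas case-insensitively and in one Unicode form."""
    return unicodedata.normalize("NFC", text).casefold().strip()


def find_possible_duplicates(
    existing: List[Lexeme],
    lemma: str,
    language: Optional[str] = None,
    lexical_category: Optional[str] = None,
) -> List[Lexeme]:
    """Return existing lexemes whose lemma matches what the user typed.

    Language and lexical category only narrow the result once given,
    mirroring the "once you selected them" part of the proposal.
    """
    wanted = normalize(lemma)
    return [
        lex for lex in existing
        if normalize(lex.lemma) == wanted
        and (language is None or lex.language == language)
        and (lexical_category is None or lex.lexical_category == lexical_category)
    ]


# Made-up sample data, for illustration only.
existing_lexemes = [
    Lexeme("L-a", "water", "Q1860", "Q1084"),
    Lexeme("L-b", "water", "Q1860", "Q24905"),
    Lexeme("L-c", "fire", "Q1860", "Q1084"),
]
all_matches = find_possible_duplicates(existing_lexemes, "Water")
noun_matches = find_possible_duplicates(
    existing_lexemes, "Water", language="Q1860", lexical_category="Q1084"
)
```

Showing `all_matches` first and `noun_matches` after the selections are made would let the editor judge whether the new lexeme really is a duplicate.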

Why not use Wiktionary or other online dictionaries?

With lexemes, Wikidata sort of reinvents the wheel. Besides the many Wiktionaries, there is OmegaWiki as well. They do more or less the same thing. Very likely there are other sites on the internet performing a similar task. I believe cooperation will be profitable for all of us.  Klaas `Z4␟` V:  10:21, 23 September 2018 (UTC)

We know this. I want to cooperate with Wiktionaries and other databases if possible. There is already a series of properties which link us to external dictionaries. I myself get support from the Polish Wiktionary community. @KlaasZ4usV: or do you have something specific in mind? KaMan (talk) 10:41, 23 September 2018 (UTC)

To be honest, I just want to start a general discussion about using lexemes here on Wikidata, KaMan. Thanks for your contributions! Klaas `Z4␟` V:  10:54, 23 September 2018 (UTC)

@KlaasZ4usV: Well, previously en.wikipedia had some man's date of birth in an infobox, and all the other Wikipedias had no date of birth. When Wikidata appeared and imported this from en.wikipedia, it could be shared and presented in all Wikipedias. With lexemes it is exactly the same in relation to Wiktionaries. Once inflection or etymology is imported from one Wiktionary, all language versions of Wiktionary can reuse this data too. Wikidata is Commons for data. Lexemes in Wikidata are Commons for lexicographical data. And because it is a database with its own query service, tools and queries can be written that were impossible or hard with the textual data of Wiktionaries. There is already one tool which shows the power of lexemes stored in a database. KaMan (talk) 11:15, 23 September 2018 (UTC)

OmegaWiki once started at a Wikimania conference as a wiki for all Wiktionaries together. Now it is privately hosted and supports more than 600 languages in a database like the one on which Wikidata is built. We use the latter and Commons as well. See also their @OmegaWiki twitter-account. Klaas `Z4␟` V:  12:49, 23 September 2018 (UTC)

Looking at http://www.omegawiki.org/Special:RecentChanges I don't see much movement there. Perhaps a strong connection with the rest of Wikidata (and its large community) makes the difference. KaMan (talk) 12:59, 23 September 2018 (UTC)

Too few contributors. Most potential new users stay with the Wiktionaries and other WMF sites. Most OmegaWiki users would like to join WMF again. Why they left is a mystery to me. Klaas `Z4␟` V:  13:27, 23 September 2018 (UTC)

@KlaasZ4usV: And this is probably where we differ from OmegaWiki. Wikidata maintainers don't want to just take Wiktionary users. As can be read in the roadmap, the developers want to enable "Editing data directly from Wiktionary". KaMan (talk) 13:52, 23 September 2018 (UTC)

Language specific data dumps?

Would it be possible to create simple daily dump files per language (QID)? A tsv could include:

  1. lemma, language-QID, lid, langcode
  2. form, l-fid, lid, langcode
  3. l-sid, langcode, gloss (once we get senses)

This could be done at least for languages with more than a given number of entities (e.g. 1000). It could make it easier to select words to add in English. --- Jura 10:51, 23 September 2018 (UTC)
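The three proposed files could be produced along these lines. This is a minimal sketch, assuming each lexeme is available as a dictionary shaped like the Wikibase lexeme JSON (`id`, `language`, `lemmas`, `forms` with `representations`, `senses` with `glosses`); that shape is an assumption here, and the function names, file names and sample lexeme are hypothetical. No such per-language dump exists at the time of this discussion.

```python
import csv
import io
from typing import Dict, Iterable, List


def lemma_rows(lexeme: dict) -> Iterable[List[str]]:
    # File 1: lemma, language-QID, lid, langcode
    for langcode, lemma in lexeme.get("lemmas", {}).items():
        yield [lemma["value"], lexeme["language"], lexeme["id"], langcode]


def form_rows(lexeme: dict) -> Iterable[List[str]]:
    # File 2: form, l-fid, lid, langcode
    for form in lexeme.get("forms", []):
        for langcode, rep in form.get("representations", {}).items():
            yield [rep["value"], form["id"], lexeme["id"], langcode]


def sense_rows(lexeme: dict) -> Iterable[List[str]]:
    # File 3: l-sid, langcode, gloss (empty until senses exist)
    for sense in lexeme.get("senses", []):
        for langcode, gloss in sense.get("glosses", {}).items():
            yield [sense["id"], langcode, gloss["value"]]


def to_tsv(rows: Iterable[List[str]]) -> str:
    """Serialize rows as tab-separated text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def dump_language(lexemes: List[dict], language_qid: str) -> Dict[str, str]:
    """Build the three TSV files for all lexemes of one language."""
    selected = [lx for lx in lexemes if lx["language"] == language_qid]
    return {
        "lemmas.tsv": to_tsv(r for lx in selected for r in lemma_rows(lx)),
        "forms.tsv": to_tsv(r for lx in selected for r in form_rows(lx)),
        "senses.tsv": to_tsv(r for lx in selected for r in sense_rows(lx)),
    }


# Made-up sample lexemes, for illustration only.
sample_lexemes = [
    {
        "id": "L1",
        "language": "Q1860",
        "lemmas": {"en": {"language": "en", "value": "example"}},
        "forms": [
            {"id": "L1-F1",
             "representations": {"en": {"language": "en", "value": "examples"}}},
        ],
        "senses": [],
    },
    {
        "id": "L2",
        "language": "Q150",
        "lemmas": {"fr": {"language": "fr", "value": "exemple"}},
        "forms": [],
        "senses": [],
    },
]
english_dump = dump_language(sample_lexemes, "Q1860")
```

A threshold such as the 1000 entities mentioned above could be applied by checking the number of selected lexemes before writing the files.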

tsv? Isn't an archive like .zip, .gz etc. better? They very likely exist already. The only thing to do is to make them available for download. Klaas `Z4␟` V:  13:07, 23 September 2018 (UTC)
  • tsv.zip if you prefer ;) --- Jura 13:10, 23 September 2018 (UTC)
LOL. Data dumps are made for backup/recovery purposes, mostly in a standardized format depending on the RDBMS. Ask any developer. Klaas `Z4␟` V:  13:22, 23 September 2018 (UTC)
Well, currently there is none and the full dump is likely impractical for me. --- Jura 13:32, 23 September 2018 (UTC)

There is a ticket about dumps, see Wikidata:Database_download#Lexicographical_data. @Jura1: perhaps you should describe your needs in the ticket on Phabricator. KaMan (talk) 13:28, 23 September 2018 (UTC)

  • That seems to be a technical thread. I think most users could live with the three above. --- Jura 13:32, 23 September 2018 (UTC)


condamné (L24601)

Just wondering who added that .. ;) Interestingly, it was just four months ago. --- Jura 13:59, 23 September 2018 (UTC)