Wikidata talk:Lexicographical data/Archive/2018/08

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

New tool: Wikidata Lexeme Forms

I created another tool for lexicographical data: Wikidata Lexeme Forms. It shows you a form to create a new lexeme with a standard set of forms, e. g. the declensions of a German or Latin noun. It should be possible to support most languages and lexeme types – if the data model for your language is already clear, please let me know! (English verbs don’t seem to be ready yet, as discussed above, so I didn’t add those yet. For English nouns, the tool probably doesn’t make sense anyways, since English lost case forms in nouns.) --Lucas Werkmeister (talk) 11:38, 13 June 2018 (UTC)

Yes, this is the way it should be done. But what to do with the ~hundred Russian declensions? And I thought that German nouns have many more paradigms... --Infovarius (talk) 13:34, 13 June 2018 (UTC)
@Infovarius: well, different declensions don’t really matter to the tool, I think, since it asks you to enter all the different forms manually. The tool only needs enough information to form a grammatically correct sentence around the form, that’s why it’s split up by grammatical gender in German and Latin. (Other languages might need other criteria here.) Does that help with the situation in Russian as well? --Lucas Werkmeister (talk) 14:41, 13 June 2018 (UTC)
  • An interesting start. Looks like we lack a way to state the applicable declension in Wikidata and store (or retrieve) the relevant parts.
    --- Jura 13:46, 13 June 2018 (UTC)

@Lucas Werkmeister: I've tried it to create Lexeme:L2879. It works quite well. Just one question: why have 3 entries (masculinum, femininum, neutrum)? One entry and then a selector for the gender would be better to use, no? PS: among the Latin cases, the vocative case (Q185077) and locative case (Q202142) are missing (not the most used Latin cases but it would be nice to have them). Cdlt, VIGNERON (talk) 15:11, 13 June 2018 (UTC)

@VIGNERON: regarding three entries: to some extent, development so far has been governed by what is better to implement, not to use ;) but I’m not sure what this selector would look like. A dropdown selector on the page where you enter the forms directly would require a lot of work, I think (to dynamically update the page each time – currently, the tool doesn’t use JavaScript at all), and I’m not sure if it would make much sense: if you started entering some forms and then switch the gender, what should happen? But I could probably group the entries on the index page (one entry for “German/Latin noun” and then three sub-entries for the genders) – would that be an improvement?
regarding cases: hm, yeah, I’m not sure about that… should all lexemes have forms for those cases? If I understand correctly, vocative and locative don’t really apply to most nouns. --Lucas Werkmeister (talk) 15:56, 13 June 2018 (UTC)
@Lucas Werkmeister: Ok, I understand. Another, simpler solution is indeed to group the entries; for example, you could list them like this:
  • deutsches Substantiv [without link]
    • Maskulinum
    • Femininum
    • Neutrum
Locative is indeed rare and it could be skipped but (as far as I know) the vocative always exists (very often the same as the nominative but not always; identical or not, the vocative form should be stored in the lexeme).
Cdlt, VIGNERON (talk) 18:42, 13 June 2018 (UTC)
@VIGNERON: Okay, vocative added. (You can use the new “advanced” mode, where forms may be left empty, for words that really don’t have a vocative form.) --Lucas Werkmeister (talk) 23:00, 13 June 2018 (UTC)
@VIGNERON: I’ve now grouped the entries on the index page by their language codes, which is the simplest solution for now. Perhaps I’ll improve it later. --Lucas Werkmeister (talk) 13:41, 17 June 2018 (UTC)
@Lucas Werkmeister: Singular and Plural are capitalized in German. -- IvanP (talk) 21:02, 13 June 2018 (UTC)
@IvanP: WTF, indeed… I thought it must be used as an adjective in „Nominativ [Ss]ingular“ etc., because otherwise I don’t understand how that construction works (just two nouns next to each other?). Thanks! --Lucas Werkmeister (talk) 23:04, 13 June 2018 (UTC)

It would be nice to enhance the form so that alternative forms (for a German noun, typically the genitive singular of masculine and neuter nouns, with endings "-s" or "-es") can be entered simultaneously. Surely they can be added afterwards, though I don't expect too many tool-using contributors to do it, considering the tool is intended (and expected) to make the job easier, not harder ;) --Shlomo (talk) 06:07, 14 June 2018 (UTC)

@Shlomo: Hm, I’m not sure what a good user interface for this would look like, to be honest :/ --Lucas Werkmeister (talk) 15:34, 14 June 2018 (UTC)
@Lucas Werkmeister: As for German, I suppose it would be enough to add another input field for Genitiv Singular and for Dativ Singular; something like:
Genitiv Singular
 Das Eigentum des [Hundes_____].
 Das Eigentum des [Hunds______].
Dativ Singular
 Das gehört dem   [Hund_______].
 Das gehört dem   [Hunde______].
In the long run, it would surely be better to find a more general solution. There are languages with more than two variant forms, languages where variant forms can appear in different cases, languages where variants are not 100 % equivalent, so that the hint sentence has to vary as well.--Shlomo (talk) 07:48, 18 June 2018 (UTC)
@Shlomo: I finally found a good user interface for this :) it’s really simple: you enter „Hund/Hunde“ in the input fields, and two forms will be created. (I’ve updated the placeholders to indicate this where it makes sense.) --Lucas Werkmeister (talk) 20:41, 21 July 2018 (UTC)
@Lucas Werkmeister: I've tested this new feature on aronia (L7313) and it works great, it made work very easy, thank You. KaMan (talk) 09:03, 22 July 2018 (UTC)
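
The slash convention described above ("Hund/Hunde" in one input field yields two forms) can be sketched roughly as follows. This is only an illustration, not the tool's actual code: the helper names `split_variants` and `build_forms` and the dictionary layout are assumptions made for this example.

```python
def split_variants(field_value):
    """Split an input such as 'Hundes/Hunds' into separate representations.

    Hypothetical helper: the real tool may treat whitespace or escaping differently.
    """
    variants = [part.strip() for part in field_value.split("/")]
    return [variant for variant in variants if variant]


def build_forms(fields):
    """Turn (grammatical features, input text) pairs into one form per variant."""
    forms = []
    for features, text in fields:
        for representation in split_variants(text):
            forms.append({"representation": representation, "features": list(features)})
    return forms


# One input field with two variants produces two forms with identical features.
genitive = build_forms([(("Genitiv", "Singular"), "Hundes/Hunds")])
print([form["representation"] for form in genitive])
```

A field without a slash simply yields a single form, so the ordinary case needs no special handling.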

Update: the tool now supports an “advanced” mode (click the corresponding button next to the “submit” one) where you can leave out some forms (e. g. for words that only have singular forms) and also specify a lexeme ID so that the forms are added to that lexeme instead of a new lexeme being created. --Lucas Werkmeister (talk)

@Lucas Werkmeister: I suggest an option to create a Lexeme without stating a grammatical gender for pluralia tantum such as Großeltern (cf. Genus von Pluraliatantum). -- IvanP (talk) 18:50, 15 June 2018 (UTC)
@IvanP: good point, does the proposal at User:Lucas Werkmeister/Wikidata Lexeme Forms/German#New template: deutsches Substantiv (Pluraletantum, kein Genus) look okay to you? --Lucas Werkmeister (talk) 12:24, 17 June 2018 (UTC)
@Lucas Werkmeister: Yes. -- IvanP (talk) 12:30, 17 June 2018 (UTC)
@IvanP: ✓ Done, thanks for the suggestion! --Lucas Werkmeister (talk) 13:02, 17 June 2018 (UTC)

I’ve started sketching out a process to add more templates to the tool. The general page about the tool (for now) is at User:Lucas Werkmeister/Wikidata Lexeme Forms. To add support for your language, take everything in the “adding support for a new language” section and copy it into a new subpage of that page, then replace all the explanations and examples (the “definition” parts of the definition lists – : in Wikitext) with the appropriate values; fill in the input box there and follow the instructions. (And at some point in this process please ping me as well, of course!) See /English and /German for two examples (discussed below and above this message, respectively). --Lucas Werkmeister (talk) 12:24, 17 June 2018 (UTC)

Contrary to what I said above, I think it would make sense to add support for some English templates after all, so that people can at least look at the tool in a language they’re likely to understand :) I’ve sketched out a very basic model for English nouns on User:Lucas Werkmeister/Wikidata Lexeme Forms/English#New template: nouns, based on some noun lexemes I’ve seen (e. g. lemon (L1921)) – does that look okay to you? --Lucas Werkmeister (talk) 12:24, 17 June 2018 (UTC)

@Lucas Werkmeister: I've finished /Polish with a first basic Polish template. Please let me know if there is anything to fix. Before the next Polish template, it would be valuable to be able to add claims to forms, which is not possible with the current template format: only claims on lexemes are allowed. KaMan (talk) 08:22, 20 June 2018 (UTC)
@KaMan: Thank you very much! Adding claims to forms makes a lot of sense, I’ve added that now – do you want to edit the Polish template or is it only necessary for future templates? --Lucas Werkmeister (talk) 13:18, 21 June 2018 (UTC)
@Lucas Werkmeister: for future templates only KaMan (talk) 13:25, 21 June 2018 (UTC)
@KaMan: alright, thanks! The template is live at toolforge:lexeme-forms/template/polish-noun/, can you try it out? --Lucas Werkmeister (talk) 13:48, 21 June 2018 (UTC)
@Lucas Werkmeister: I've tried it out and it works great both in easy and advanced mode. Thank You. I'll let you know when I prepare next Polish templates. KaMan (talk) 07:48, 22 June 2018 (UTC)
@KaMan: great, thank you! --Lucas Werkmeister (talk) 12:32, 22 June 2018 (UTC)
Update: I’ve added the template for English nouns, without possessive case for now (but do let me know if you have opinions on that). --Lucas Werkmeister (talk) 16:24, 12 July 2018 (UTC)

@Lucas Werkmeister: I've created the next Polish template. Please let me know whether it's fine with claims for forms. KaMan (talk) 14:33, 24 June 2018 (UTC)

@KaMan: thanks! I’d like to understand what this means, though: the first “noun” template can be used for most nouns (including non-personal masculine ones), is that correct? Should its label perhaps be updated to clarify, or will someone who speaks Polish understand it anyways (e. g. because personal nouns are fairly rare)? --Lucas Werkmeister (talk) 21:41, 24 June 2018 (UTC)
@Lucas Werkmeister: You're right. I've clarified the label of the first template to say that it's the basic and simplest declension. KaMan (talk) 07:21, 25 June 2018 (UTC)
@KaMan: Thanks! Sorry that I’m so late to reply, and unfortunately I’m also going offline for a few days now – but I’ll update the tool next Tuesday or Wednesday. --Lucas Werkmeister (talk) 20:53, 29 June 2018 (UTC)
@KaMan: sorry, I need one more thing – the identifier for the template (the part that’s used in the URL). For the first one I guessed polish-noun, but now I’m not sure what the identifier for the second template should be, so I think it would be better if you provided it… (I’ve updated the first template on /Polish already.) --Lucas Werkmeister (talk) 21:04, 4 July 2018 (UTC)
@Lucas Werkmeister: identifier added. KaMan (talk) 07:24, 5 July 2018 (UTC)
@KaMan: thanks, template is live now. --Lucas Werkmeister (talk) 14:31, 8 July 2018 (UTC)

@Lucas Werkmeister: On Special:Preferences#mw-prefsection-watchlist I have enabled adding pages I create to my Special:Watchlist. But pages I created with your tool are not added to my watched pages. Could this be fixed, or is it an effect of using an external tool? KaMan (talk) 08:38, 4 July 2018 (UTC)

@KaMan: hm, that’s odd – I tried creating a lexeme via the wbeditentity API on Special:ApiSandbox and it ended up on my watchlist, but when created via the tool (which uses the same API) it didn’t. Not sure what could be the reason… perhaps OAuth? I’ll look into this, thanks for reporting! --Lucas Werkmeister (talk) 21:04, 4 July 2018 (UTC)
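
For readers unfamiliar with the API call mentioned above, a lexeme-creation request to `wbeditentity` might be assembled roughly like this. It is a sketch under stated assumptions: the payload layout and the item IDs Q188 (German) and Q1084 (noun) are written from memory and should be checked against the API documentation, the helper name `build_new_lexeme_request` is hypothetical, and nothing is sent over the network. The grammatical feature IDs are the ones quoted elsewhere on this page (genitive case Q146233, singular Q110786).

```python
import json


def build_new_lexeme_request(lemma, language_code, language_item, category_item, forms):
    """Build the parameters of a wbeditentity call that creates a lexeme.

    Only the parameter dictionary is built; a real request would additionally
    need authentication (e.g. OAuth) and an edit token.
    """
    data = {
        "type": "lexeme",
        "lemmas": {language_code: {"language": language_code, "value": lemma}},
        "language": language_item,
        "lexicalCategory": category_item,
        "forms": [
            {
                "add": "",  # marks the form as new
                "representations": {
                    language_code: {"language": language_code, "value": text}
                },
                "grammaticalFeatures": features,
                "claims": [],
            }
            for text, features in forms
        ],
    }
    return {
        "action": "wbeditentity",
        "new": "lexeme",
        "data": json.dumps(data),
        "format": "json",
    }


params = build_new_lexeme_request(
    "Hund", "de", "Q188", "Q1084", [("Hundes", ["Q146233", "Q110786"])]
)
print(params["data"])
```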

@Lucas Werkmeister: something strange happened with my new Polish template. On the template page, Form 2 and Form 10 both have genitive case (Q146233) in their grammatical features, but when a noun is created there is DR Congo at the 2004 Summer Olympics (Q146223) instead. I fixed it by hand. KaMan (talk) 09:36, 9 July 2018 (UTC)

@KaMan: sorry, I typo’ed the item ID :( should be fixed now. --Lucas Werkmeister (talk) 14:18, 9 July 2018 (UTC)

@Lucas Werkmeister: I created a third Polish template. Can You add it in your spare time? Thanks in advance. KaMan (talk) 12:26, 12 July 2018 (UTC)

@KaMan: done, thank you! Hopefully I didn’t typo any item IDs this time :) --Lucas Werkmeister (talk) 16:14, 12 July 2018 (UTC)
@Lucas Werkmeister: unfortunately there are six typos this time. Here is the change I had to make to bring it in line with the template: https://www.wikidata.org/w/index.php?title=Lexeme%3AL6494&type=revision&diff=710070701&oldid=710069560 KaMan (talk) 09:17, 13 July 2018 (UTC)
@KaMan: oof, that sucks, I’m sorry. Better now? --Lucas Werkmeister (talk) 11:41, 14 July 2018 (UTC)
@Lucas Werkmeister: Nope, I had to apply six fixes again: https://www.wikidata.org/w/index.php?title=Lexeme%3AL6597&type=revision&diff=710701749&oldid=710699998 In forms F9..F14 there should be potential form (Q54944750) in claims like in form F8. KaMan (talk) 12:53, 14 July 2018 (UTC)
@KaMan: grr, I forgot to actually apply the fix before restarting the service. Try again please? --Lucas Werkmeister (talk) 13:39, 14 July 2018 (UTC)
@Lucas Werkmeister: Works fine now, thank You. KaMan (talk) 14:42, 14 July 2018 (UTC)

@Lucas Werkmeister: Hi, I created User:Lucas Werkmeister/Wikidata Lexeme Forms/Finnish. It includes a template for Finnish nouns. The template is still missing the comitative case (Q838581), because I don't yet know how to model it. However, due to the number of grammatical cases in Finnish, it would be helpful if the ones listed in the template could be added to the tool. Shinnin (talk) 11:28, 16 July 2018 (UTC)

@Shinnin: thank you, the template should be live now. Please try it out and check that I didn’t typo any of the item IDs! --Lucas Werkmeister (talk) 19:42, 16 July 2018 (UTC)
@Lucas Werkmeister: the item IDs seem to be correct. Thanks for adding them. On another topic, when editing an existing lexeme kirjasto (L6795) in the advanced mode, I get the warning message stating that L6795 has the same language and lemma as the one I'm 'trying to create'. This happens when I try to add 'kirjasto' as a nominative singular form to L6795. I'm editing a lexeme, not creating a new one, so it's confusing to get the error message in this situation. Shinnin (talk) 10:58, 17 July 2018 (UTC)
@Shinnin: thanks, that’s a great suggestion! Implemented. --Lucas Werkmeister (talk) 20:29, 17 July 2018 (UTC)

@Lucas Werkmeister: Three Norwegian Nynorsk templates should now be ready for implementation. --Njardarlogar (talk) 07:40, 25 July 2018 (UTC)

@Njardarlogar: thanks, they’re live now! --Lucas Werkmeister (talk) 20:06, 25 July 2018 (UTC)
All seem to be working as intended. Thumbs up. --Njardarlogar (talk) 08:58, 26 July 2018 (UTC)

@Lucas Werkmeister: I have added the 46 Basque cases to Lexeme:L8720 by hand, and then read about this tool. Could it be used for the Basque language? -Theklan (talk) 08:20, 1 August 2018 (UTC)

@Theklan: sure, I don’t see why not. Instructions for adding language support are over on User:Lucas Werkmeister/Wikidata Lexeme Forms. --Lucas Werkmeister (talk) 09:59, 1 August 2018 (UTC)

New features deployed in July

Hello all,

The development team is still working actively to improve the interface of Lexemes, and get the Senses ready. Some tasks are in the background, others provide new features that you can try. Here are the main things that are now live on Wikidata:

  • Language, Lexical Category, and Grammatical features of a Lexeme now show up in Special:WhatLinksHere. Here's the example with verb. While waiting for better search features and integration in the Query Service, this can already help you find the Lexemes that you're looking for more easily. (ticket)
  • "L" is now an alias for the "Lexeme" namespace, which means that links like L:L123 work. It can also be used in templates. (ticket)
  • Adding two lemmas with the same spelling variant is now blocked when editing; an error message will appear: "It is not possible to enter multiple lemma with the same spelling variant." This is because adding several lemmas in a Lexeme should be used to display different scripts, for example. If you find an example that would not suit this rule, please share it here so we can discuss possible solutions. (ticket)

If you find any bug or issue, feel free to add a comment on the related ticket, or write on this talk page.

If you're curious about knowing what the developers are currently working on, you can easily find out by following the "development" section of the Weekly Summary :)

Cheers, Lea Lacroix (WMDE) (talk) 12:10, 31 July 2018 (UTC)

@Lea Lacroix (WMDE): Thanks for the update! On the first item - you mean when an item is used as a language, lexical category or grammatical feature it now shows up in WhatLinksHere? Because at first I thought you meant the lexemes in WhatLinksHere would show their language, lexical category etc. but they just display like normal. Anyway, yes, this looks very helpful, thanks! ArthurPSmith (talk) 16:59, 31 July 2018 (UTC)
Yes, thanks for rephrasing. Indeed, for now the Lexemes appear in Special:WhatLinksHere with only the first Lemma and the L-number. If you wish for another way to display them, let me know. Lea Lacroix (WMDE) (talk) 07:47, 1 August 2018 (UTC)
It would be nice if Language, Lexical Category, and Grammatical features of a Lexeme would show up in Special:WhatLinksHere.--Micru (talk) 08:14, 1 August 2018 (UTC)

Missing languages

North Levantine Arabic (Q22809485) (apc) - includes Lebanese Arabic (Q1516642) - is missing.

I've tried to add:

--Kolja21 (talk) 11:42, 2 August 2018 (UTC)

Wikidata:Lexicographical data/Glossary

I have created this page. I don't know whether it should be incorporated to the main Wikidata:Glossary.--GZWDer (talk) 20:00, 2 August 2018 (UTC)

Hello GZWDer, it's a good thing that each term related to the Lexeme definition is clearly defined, but shouldn't it be included in Wikidata:Lexicographical data/Documentation#Data Model, in order to have the right information in one place? Djiboun (talk) 09:04, 3 August 2018 (UTC)

Lexemes with numerous Forms

In some languages, some lexemes have a high number of forms, like:

and probably also in other languages.

So the tool Wikidata Lexeme Forms created by @Lucas Werkmeister: even if it's a very useful tool for creating lexemes with a low number of forms, is not appropriate for such lexemes with numerous forms.

Here are some ideas that could simplify our work (perhaps they already exist or are being developed):

  • Develop a new tool for creating such lexemes and all their forms, for example, as suggested by @Theklan: by prefilling based on some parameters and rules (some Wiktionaries have already set these rules, like Structure des verbes en français for a verb (Q24905) in French (Q150));
  • Find a way to change several forms at the same time. I don't have a precise case in mind, but it could be to change all the forms which are singular (Q110786);
  • Create, or improve the Wikidata:Lexicographical data/Statistics created by @GZWDer: a statistical tool to get the number of lexemes by language but also the number of forms (average, min, max) and, when it is ready, the number of senses.

I hope this will help. Djiboun (talk) 10:27, 3 August 2018 (UTC)

I agree that the lack of a tool to enter complex regular declensions and conjugations is noticeable. I completely stopped filling in forms for Polish verbs and adjectives because of that. In Polish Wiktionary we have a template for adjectives that produces more than 40 forms from four parameters. I dream about porting it to Wikidata somehow but I don't know how. KaMan (talk) 11:05, 3 August 2018 (UTC)
Support. The first idea would be the best. Having something like @Lucas Werkmeister:'s Wikidata Lexeme Forms but with a prefilled option that has to be validated before saving (to check for errors) would be ideal. Spanish verbs also have a huge variation, and filling them in by hand would be a nightmare, especially considering that most of them are regular verbs. -Theklan (talk) 11:10, 3 August 2018 (UTC)
Support. Lucas is great already but a level 2 with "prefilled suggested forms" would be an even greater improvement! (in the same tool or in another one, it doesn't really matter). Having a human confirm that the automatically suggested forms are indeed correct is an important point I think. For changing several data at the same time, it is something that will be possible (in the near future I hope) with tools like QuickStatements. And yes, we need more statistics. For me another big question is: where to store the rules? (there is no clear place right now... and I fear that some people will reinvent the wheel again and again :/ ). Cheers, VIGNERON (talk) 13:56, 3 August 2018 (UTC)
My idea is:
  • Create a number of modules that output lexeme entity objects in JSON, e.g. {{en-noun|utility}} will generate a lexeme object about the English noun "utility" with the forms "utility" and "utilities".
  • Create a script to create an entity (including lexemes, but also useful for items and properties) based on the output of some wikitext (above).

--GZWDer (talk) 15:35, 3 August 2018 (UTC)
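
The first step of this idea could look something like the sketch below. It is written in Python rather than as a wiki module, and everything in it is an assumption made for illustration: the function names, the deliberately naive plural rule, and the simplified "number" label used in place of grammatical-feature item IDs. Irregular nouns would need an explicit override, and a human should still confirm the suggested forms before saving, as requested above.

```python
import json


def en_noun_plural(singular):
    """Naive English plural rule; irregular nouns need an explicit override."""
    vowels = "aeiou"
    if singular.endswith("y") and len(singular) > 1 and singular[-2] not in vowels:
        return singular[:-1] + "ies"
    if singular.endswith(("s", "x", "z", "ch", "sh")):
        return singular + "es"
    return singular + "s"


def en_noun(singular, plural=None):
    """Return a lexeme-like dictionary with a singular and a plural form."""
    plural = plural or en_noun_plural(singular)

    def form(text, number):
        # "number" is a simplified stand-in for grammatical-feature item IDs.
        return {
            "representations": {"en": {"language": "en", "value": text}},
            "number": number,
        }

    return {
        "type": "lexeme",
        "lemmas": {"en": {"language": "en", "value": singular}},
        "forms": [form(singular, "singular"), form(plural, "plural")],
    }


print(json.dumps(en_noun("utility"), indent=2))
```

The second step (a script that turns this output into an entity) would then take the JSON as its input.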

My first attempt

I created Lexeme:L9696 for a Belarusian adjective with 27 forms (венецыянскі). Now what? Is there any link from that object to the existing pages in Wiktionary? In Russian/Belarusian/Ukrainian, there is a tradition to print accents in dictionaries (венецыя́нскі). Should this be used here? I see from вода (Lexeme:L189) that statements have been added that link to audio recordings of each form, but accents are not used. --LA2 (talk) 00:41, 4 August 2018 (UTC)

@LA2: I suppose we should have some property like "word with stress" or "position of stress", but we don't have any yet. --Infovarius (talk) 19:52, 4 August 2018 (UTC)
In these languages, all words and forms (with more than one syllable) have stress, and the stress can be on different syllables in different forms of the same word (пишу́, I write, пи́шешь, you write).
I assume now that we have some words in Wikidata, this should be used to generate entries in Wiktionary, just like Wikipedia's fact boxes can draw some data from Wikidata, right? How far away is that? Can Swedish Wiktionary create entries for Norwegian words by magically drawing data from Wikidata? Or will this be possible a year from now? --LA2 (talk) 20:23, 4 August 2018 (UTC)

Wikidata:Lexicographical data/SPARQL

I inspect every lexeme created, so I have some observations about common behaviors of editors. I have created this page to collect proposals for queries to use once SPARQL support is introduced. Feel free to add your own proposals. KaMan (talk) 07:22, 4 August 2018 (UTC)

Excellent idea! Some of these proposals are big (1 and 2 especially) and will probably time out. We probably need another tool to fix them. Cdlt, VIGNERON (talk) 10:20, 4 August 2018 (UTC)
Should maybe be merged with Wikidata:Lexicographical data/Ideas of queries? --Lydia Pintscher (WMDE) (talk) 16:26, 4 August 2018 (UTC)
@Lydia Pintscher (WMDE): My proposal is about maintenance and repair queries while "Ideas of queries" is rather about cool queries. But I merged it anyway so as not to multiply subpages. Thanks for pointing it out KaMan (talk) 16:39, 4 August 2018 (UTC)

d'ailleurs à merveille

Should we have lexemes for these? I just made the two above. If not, what alternative do you suggest?
--- Jura 17:08, 2 August 2018 (UTC)

I linked the parts with derived from (P5191). Maybe there is a better property.
--- Jura 15:22, 3 August 2018 (UTC)
@Jura1: Yes, I see no reason why we shouldn't/couldn't have these lexemes. I just replaced derived from (P5191) by combines (P5238). Cdlt, VIGNERON (talk) 09:59, 4 August 2018 (UTC)
Sounds good. I will update the others and have added a sample on P5238.
--- Jura 10:34, 4 August 2018 (UTC)
  • Here is the start of a list of types: Wikidata:Lists/locutions/types. To avoid it getting mixed up with potentially confusing terminology, I switched it to a Romance language. Andreasmperu fixed some of the items yesterday.
    --- Jura 11:53, 5 August 2018 (UTC)

Read lexeme entities?

It would be good if the above worked with the following:

  • {{#property:P5185|from=L9316}}
  • {{#invoke:Wikidata|formatStatementsE|item=L9316|property=p5185}}

Could it be enabled at least on Wikidata?
--- Jura 10:43, 4 August 2018 (UTC)

If you could help flesh out phabricator:T195895 we would get closer to it, yes. For both of your examples we'd need clarification what they actually mean. What happens when there are statements on the Form level with that property? --Lydia Pintscher (WMDE) (talk) 16:29, 4 August 2018 (UTC)
  • Not sure if there are that many statements on form level yet. I think the form feature still needs some work, both from development and editors. So it might be too early to fully specify that.
    The main problem will be the selection of a form. It should be done bearing in mind Wiktionary use cases (whatever they may be). If one follows the "subentity" approach you mentioned before, one could imagine using them as such.
    For current Wikidata use, displaying statements that are directly on lexemes should be sufficient (sample question: does a lexeme have "conjugation class" set?), without taking into account statements on forms.
    --- Jura 11:29, 5 August 2018 (UTC)

Wikifying with lexemes?

Going through https://fr.wikisource.org/wiki/Lettres_(Musset)/01 , I tried to find which lexemes were still missing. A partial list is at Talk:Q55867126#Lexemes. It would be interesting to have a tool that combined them and found the missing ones directly.
--- Jura 13:18, 4 August 2018 (UTC)

@Jura1: good idea, can you add it to Wikidata:Lexicographical_data/Ideas_of_tools (it's different but not far from my idea written in the #Spell-checker section). Cheers, VIGNERON (talk) 11:19, 5 August 2018 (UTC)
✓ Done at Wikidata:Lexicographical_data/Ideas_of_tools#Wikify_with_lexemes.
--- Jura 12:18, 5 August 2018 (UTC)

Form lookup limitations

Other than the problem mentioned above #Filter_values_by_language?, a few minor tweaks:

  • It's sub-optimal that the form isn't displayed when it's shorter than the lemma, e.g. typing "cherche" finds chercher (L2243), but it doesn't display F1, only "chercher".
  • It works when the form is different, e.g. typing "chercha" finds chercher (L2243), displaying "chercher (chercha)".
  • It also displays nothing if the difference is merely in diacritical marks. "rencontré" finds rencontrer (L9147), but only displays "rencontrer" even though it's defined there.
  • "rencontrè" finds the lemma too, even though it's not defined on L9147 and shouldn't be.

That's it for now.
--- Jura 16:13, 5 August 2018 (UTC)

  • I just noticed that it happens when doing look-ups on a lexeme-datatype property, not a form-datatype one. So, maybe the above are expected.
    --- Jura 16:20, 5 August 2018 (UTC)


Special:NewLexeme and F1

I think I brought this up before, but I don't know where it's in the pipeline: when creating a new lexeme, the special page should add the lemma as form F1. Or at least suggest to do that by default.
--- Jura 17:11, 5 August 2018 (UTC)

Is Form a data type?

I consider that Form is a data type embedded in Lexeme, but seeing that the constraint of grammatical gender (P5185) gives a "Bad parameter" error, I am no longer that sure of the definition. Shouldn't form (Q54285143) be a possible allowed entity types constraint (Q52004125)?--Micru (talk) 17:13, 2 August 2018 (UTC)

  • In terms of entity types, lexemes are opposed to property, items and media entities. Forms are more like special statements.
    --- Jura 17:15, 2 August 2018 (UTC)
  • Forms (and later Senses) can be seen as sub-entities. Lexemes, Items and Properties are "proper" entities. Datatypes are the ones listed at Special:ListDatatypes. --Lydia Pintscher (WMDE) (talk) 17:28, 2 August 2018 (UTC)
    • And yes Forms should be allowed as an entity type there. Can you open a ticket on Phabricator please? Then we'll investigate why it doesn't work. --Lydia Pintscher (WMDE) (talk) 17:30, 2 August 2018 (UTC)
      • Looking into it further form (Q54285143) is the wrong item. It is the datatype of a property and not the entity type. The former is what is listed on Special:ListDatatypes. You want the item for the entity type. I can't find it right now though. --Lydia Pintscher (WMDE) (talk) 17:33, 2 August 2018 (UTC)
        • I have created a task on Phabricator. Maybe it is just the wrong item.--Micru (talk) 19:22, 2 August 2018 (UTC)
@Lydia Pintscher (WMDE): hm, the situation looks a bit messed up to me TBH. Wikibase item (Q29934200) and lexeme (Q51885771) are currently both data types and entities, according to their instance of (P31) statements, while Wikibase property (Q29934218) is just a datatype. Those are the three items we currently use, so adding form (Q54285143) and sense (Q54285715), both of which are datatypes, seems consistent to me, even though it might not make perfect sense. --Lucas Werkmeister (WMDE) (talk) 12:11, 6 August 2018 (UTC)

Order of L-template

Japanese Lexeme Lexeme:L8439, for example, has two lemmas; "一月" (ja) and "いちがつ" (ja-x-Japanese written in hiragana (Q53979341)). "一月" (ja) is more general than "いちがつ" (ja-x-Q53979341), so I added "一月" (ja) first. However, the template {{L|8439}} shows 一月/いちがつ (L8439) ("いちがつ/一月 (L8439)", ja-x-Q53979341 → ja). Is it possible to change its order to "一月/いちがつ" (ja → ja-x-Q53979341)? Thanks, --Okkn (talk) 14:00, 4 August 2018 (UTC)

@Okkn: I have the same problem with akowski/AK-owski (L3529) and akowiec/AK-owiec (L3530). I already showed this problem to @GZWDer: who made the code but without response. KaMan (talk) 14:15, 4 August 2018 (UTC)
@KaMan: We may have to fix the getLemma() function in Module:Lexeme to sort the lemmas by language code. --Okkn (talk) 14:38, 4 August 2018 (UTC)
@KaMan, GZWDer: I have fixed Module:Lexeme. Is it OK? I'm not familiar with Lua script... --Okkn (talk) 15:48, 4 August 2018 (UTC)
Temporarily it works for me, but it's still a hack because languages don't have to be in alphabetical order. KaMan (talk) 15:56, 4 August 2018 (UTC)
@KaMan: Yeah, ideally they should reflect the order in the Lexeme entity. But as a matter of practice, do you have any trouble with alphabetical order? --Okkn (talk) 16:11, 4 August 2018 (UTC)
Works for me, thanks. KaMan (talk) 16:14, 4 August 2018 (UTC)
Not at all. @Lea Lacroix (WMDE): Is it possible to get the actual order of lemmas in Lexeme entities? --Okkn (talk) 16:23, 4 August 2018 (UTC)
The order in which the editor enters the Lemmas is not stored in the system. Therefore it's not possible to get it. I think sorting the Lemmas via the module is the best way to get the order you want. Lea Lacroix (WMDE) (talk) 11:34, 6 August 2018 (UTC)
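
The sorting workaround discussed in this thread can be illustrated as follows. Module:Lexeme itself is written in Lua; this Python sketch only shows the idea, and the function name `format_lemmas` and the plain dictionary of language code to lemma are assumptions made for the example.

```python
def format_lemmas(lemmas):
    """Join lemma values with '/', ordered by language code.

    `lemmas` maps language codes to lemma text (an assumed, simplified shape).
    """
    ordered = sorted(lemmas.items(), key=lambda pair: pair[0])
    return "/".join(value for _code, value in ordered)


# Insertion order is deliberately reversed; sorting by code puts "ja" first.
lemmas = {"ja-x-Q53979341": "いちがつ", "ja": "一月"}
print(format_lemmas(lemmas))  # 一月/いちがつ
```

Because "ja" is a prefix of "ja-x-Q53979341", it sorts first, which gives the order requested here. As noted above, alphabetical order is only a workaround and will not match the desired order for every combination of language codes.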

(se) figurer

For the above and similar, I think it's probably better to create two lexemes, e.g.

The second links to the first with combines (P5238).
--- Jura 11:57, 6 August 2018 (UTC)

"à" and "au"

Currently I have added "au" as a form to à (L9107). Should it remain there or be on a separate lexeme? Or should it be both there and on a separate lexeme?
--- Jura 15:30, 5 August 2018 (UTC)

Prefix/suffix

Should these be handled with combines (P5238)? Sample: excessivement (L9266).

Initially I started out with derived from (P5191), but switched it to the above.
--- Jura 04:54, 6 August 2018 (UTC)

I suppose that a prefix/suffix is not a stem, and that compounds are defined as words with multiple stems, so combines (P5238) should not be used for prefix/suffix. — Finn Årup Nielsen (fnielsen) (talk) 15:05, 6 August 2018 (UTC)

Data model for senses

I thought senses would be independent entities, like lexemes and items; but the test site and the documentation show senses being defined in the lexeme entities. In that case, how are synonyms and lexemes from different languages that can have the same meaning meant to be dealt with? --Njardarlogar (talk) 18:43, 9 August 2018 (UTC)

As I understand it, the lexemes would all have a sense which means the same, without sharing the senses. While that does lead to duplication, it actually makes things much more straightforward for editors, because you are always describing how a particular word is used in a particular language, so you can update a sense without having to make sure that it still works for all the other words the sense is linked to. - Nikki (talk) 21:09, 9 August 2018 (UTC)
What Nikki says. And it will be possible to connect a Sense to another, with a property "synonym" or "antonym". Lea Lacroix (WMDE) (talk) 06:21, 10 August 2018 (UTC)
Some questions remain about the current plan. If I want to create a list of all lexemes from all languages with the meaning 'blue', how do I do that? What connects the senses for 'blue' in language 1, language 2, language 3, and so on?
If lexeme 1 and lexeme 2 have synonymous senses, does the sense on lexeme 1 link to the sense on lexeme 2 and the sense on lexeme 2 link to the sense on lexeme 1 (i.e. bidirectional linkage)?
For sense entities, changed definitions could be handled programmatically. If a user changes the definition of a sense in a language, they could tick off 'definition changed' (as opposed to fixing a spelling error or improving grammar) and the definitions in the other languages will be marked as needing verification. This would work best if each sense had a source language whose definition the definitions in the other languages should correspond to, but operating with up-to-date and outdated languages should also do the trick (to reduce the amount of unwanted definition updates, the ability to update the definitions (but not changing spelling errors) could be restricted to a new group of users, e.g. 'sense editors'). --Njardarlogar (talk) 08:31, 10 August 2018 (UTC)
Regarding your first point, you can link all senses to the item "blue" with item for this sense (P5137), then you can infer automatically what are the synonyms and the translations. I agree that there should be a way to make sense-translations outdated, but it is perhaps too hard to implement.--Micru (talk) 09:56, 10 August 2018 (UTC)
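The inference described above could look roughly like the following Python sketch. It is only an illustration under assumptions: the sense IDs, the record layout and the item placeholders (`Q_BLUE`, `Q_RED`) are invented, and senses did not exist yet when this was discussed.

```python
# Hypothetical sense records:
# (sense ID, language code, lemma, item linked via "item for this sense").
SENSES = [
    ("L1001-S1", "en", "blue", "Q_BLUE"),
    ("L1002-S1", "fr", "bleu", "Q_BLUE"),
    ("L1003-S1", "de", "blau", "Q_BLUE"),
    ("L1004-S1", "nb", "blå", "Q_BLUE"),
    ("L1005-S1", "en", "red", "Q_RED"),
]


def related_senses(sense_id, senses):
    """Split the lemmas sharing an item with `sense_id` into
    same-language synonyms and other-language translations."""
    by_id = {record[0]: record for record in senses}
    _, language, _, item = by_id[sense_id]
    synonyms, translations = [], []
    for other_id, other_language, lemma, other_item in senses:
        if other_id == sense_id or other_item != item:
            continue
        if other_language == language:
            synonyms.append(lemma)
        else:
            translations.append(lemma)
    return synonyms, translations


print(related_senses("L1001-S1", SENSES))
```

This only works for senses that actually have a shared "hub item", which is exactly the open question in the replies that follow.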
Should every sense have a corresponding item? In other words, as Norwegian has a lexeme with the sense 'at home' (adverb), I should then create an item for the adverbial sense 'at home' and link it with item for this sense (P5137)? Are we not then effectively making items also act as sense entities, anyway? --Njardarlogar (talk) 10:51, 10 August 2018 (UTC)
It depends. In some cases it will make sense to have a "hub item", and in others it won't. --Micru (talk) 11:14, 10 August 2018 (UTC)
For "blue", you could query for item for this sense (P5137), as Micru says, but in general I would expect there to be a property linking translations. It's hard to say exactly which properties we'll have right now, because we don't have senses yet to be able to create properties between them. You can see what people have currently proposed at Wikidata:Property proposal/Lexemes.
A property like "synonym" would be used to link both ways, but it doesn't automatically create a link both ways. That's annoying for things like this, but that's how properties always work.
I don't think there's much point considering a different way to implement senses at this point because the developers have already put a lot of effort into the existing plan (the original proposals date back to 2013!) and have already started implementing it. If you're curious about a model where senses are separate entities, you might want to look at OmegaWiki (Q154436).
To respond to the points you made anyway, in case it helps clarify why the model is the way it is: Having a source language would create its own set of problems: Who decides what the source language is? What do you do when you want to add or edit a sense but don't speak the source language? More importantly, we want Wikidata to be usable for monolingual speakers trying to document words in their own language, so they must be able to add and edit senses without needing any knowledge of another language. That means the source language of the sense should always be the same as the language of the lexeme, but you can't do that if you share senses between languages. Technically you could have a model where senses can only be shared between lexemes of the same language, but then it's a question of whether the benefits of sharing senses between exact synonyms outweigh the increased complexity. As for having a group of editors who can edit senses, that sounds very restrictive. I would expect there to be a lot of senses.
- Nikki (talk) 13:04, 10 August 2018 (UTC)
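Since a "synonym" statement would not create its reverse automatically, a maintenance check for one-way links is easy to sketch. The Python below is a hypothetical illustration only: the property does not exist yet, and the sense IDs are invented.

```python
def missing_inverse_links(statements):
    """Return the (source, target) pairs that would have to be added
    so that every link also exists in the opposite direction."""
    present = set(statements)
    return sorted((b, a) for a, b in present if (b, a) not in present)


# Hypothetical synonym statements between senses: (from sense, to sense).
synonym_statements = [
    ("L1-S1", "L2-S1"),
    ("L2-S1", "L1-S1"),  # already linked both ways
    ("L3-S1", "L4-S1"),  # one-way only
]

print(missing_inverse_links(synonym_statements))
```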
I want to make clear that having a source language for sense entities was not my preferred solution, only that it would provide greater robustness to the definition of senses. With the programmatic support for marking definitions in different languages as outdated, the up-to-date definitions will collectively act as source languages for the sense.
With sense entities, monolingual users could create their own sense entities without checking whether the sense has already been created. If it is a duplicate, the sense can be merged once someone identifies it as a duplicate. If a monolingual speaker later wants to change the definition of this sense, they could attempt to reach out to other users in an appropriate avenue if they think the sense needs an update in order to be more useful, otherwise they could just create a new sense with the changed definition and use that one instead.
The planned solution also introduces similar issues of its own: if you have x monolingual speakers of x different languages who know the meaning of a lexeme, you will end up with x copies of the same sense on the lexeme. Also, if a user decides to split or otherwise make significant changes to a sense that has glosses in multiple languages, the user will be unable to update the gloss in any language they do not speak.
But the big issue is duplication of effort: the vast majority of lexemes will have neither glosses nor translations for the vast majority of languages (not good for monolingual speakers, either). You could use bots to say that if x is a translation of y and y is a translation of z, then x is also a translation of z and vice versa; but this would introduce its own set of nontrivial issues.
It is late in the process, but I only now became aware of the intended data model, so it is the best I can do ... If nothing else, reviewing the model here should be enlightening for other users who are not that familiar with it. Personally, beyond linking senses to items, I don't feel very inclined to contribute to senses with the planned data model, e.g. because of the risk of accidental duplication of effort, and the duplication of effort that is actually necessary.
I also wonder if it wouldn't have been better to have sense entities even if each language has its own version of a sense. This would make it possible to set the same sense for synonymous lexemes, and also make it easier to discuss and keep track of senses since they have their own pages and are not duplicated within the same language. To make it clearer that each sense belongs just to one language, language codes could be used as part of the ID, e.g. S-xx-1, S-xx-2, and so on, for the language code xx (though just requiring that each sense, like each lexeme, has a language, would be adequate). --Njardarlogar (talk)
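The bot idea mentioned above (if x translates y and y translates z, then x translates z) amounts to completing each connected group of translation links. A minimal Python sketch, with invented sense names and no claim that any bot works this way:

```python
from collections import defaultdict


def infer_translations(links):
    """Return the pairs that are connected through a chain of
    translation links but not yet linked directly."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    inferred, seen = set(), set()
    for start in graph:
        if start in seen:
            continue
        # Collect everything reachable from `start`.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        # Every unlinked pair inside the component is an inferred link.
        for a in component:
            for b in component:
                if a < b and b not in graph[a]:
                    inferred.add((a, b))
    return inferred


print(infer_translations([("x", "y"), ("y", "z")]))
```

The sketch also shows where the nontrivial issues come from: one inexact link merges two groups and produces wrong pairs for every member of both.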
I think your approach has great benefits. But if I understand correctly how it would work, I believe it would not fit well with the Wikidata workflow. It could work if (1) some group of linguists set a bunch of rules for creating senses, (2) with a clear idea of how the data would be used in the end, and (3) everyone obeyed those rules. While we could fulfill the first condition, the other two would in my opinion be impossible here. There are so many slight differences that languages can have between senses that might seem identical at first sight.
Let's think about your example of wanting all lexemes with the meaning "blue". We will create a sense page "blue" that will be linked by all words meaning blue. Everything works great. Well, but there are many shades of blue and some of them have their own name. They can link to the sense page "blue", why not, it is still the colour blue. But then someone comes and says "Hey, it is confusing, those two words are not synonymous, we should create a special sense page for them!" ... In the end you get either something close to having a special sense page for every sense of a given lexeme (where your search would not work), or a system where you have a link to the "blue" sense page along with numerous clarifying statements (search works very well, but very few editors know how to do it and probably almost none of the data users would understand what we mean).
I think we can have both, human-understandable senses and idea-based searchable data, standing side by side in the planned data model. Duplication is another story; I agree there will be so much duplication that it will lead to incompleteness and in some cases to errors. But what we get in the end depends on the developers weighing many different priorities. I think duplication of information has not been considered a critical problem for Wikidata since the very beginning of its existence. --Lexicolover (talk) 20:32, 11 August 2018 (UTC)
I don't see the same issues with a sense entity that you do. I envision an organic process where whenever it is discovered that two languages have slightly different definitions of a similar concept, we create a new sense. If either of the senses is wholly contained by the other, we set one as a subclass (sub-sense) of the other ('light blue' is a sub-sense of 'blue', for example). If they merely overlap, we should be able to link them either directly or indirectly. When you query for the translations of 'blue', you can then set up the query such that it does or does not return lexemes with sub-senses and overlapping senses.
If we are worried that a large number of similar senses will make the senses difficult to navigate, we could modify the search function so that if I search for 'blue' in the sense namespace, all senses used on lexemes with the form 'blue' will show up. We could also use properties for example lexemes (e.g. "examples of lexemes with this sense") or labels on the senses (distinct from their definitions). --Njardarlogar (talk) 14:47, 12 August 2018 (UTC)

Forms defined on other lexeme?

How much repetition do we want?

For a Q10976085 like avoir besoin de (L9441), it could be sufficient to point to avoir (L1886) to get the conjugation.

How should this be done?

  • We could add a qualifier to whatever statement links them. Currently this is combines (P5238).
  • or create a new property.
  • conjugation class (P5186) seems suboptimal, as it isn't actually a conjugation class and doesn't link a lexeme.
    --- Jura 14:10, 11 August 2018 (UTC)
I think this question could be generalized a little as how to work with phrasemes. Those will definitely have senses, but should there be forms or not? What properties should be used with them? --Lexicolover (talk) 19:17, 11 August 2018 (UTC)
Not sure if it could be generalized further ;) Maybe there is a solution that will work in any case, but it might be easier to try a first step and see how it goes (the usual Wikidata approach?).
--- Jura 20:02, 12 August 2018 (UTC)

Ancient Greek?

I just tried to use Special:NewLexeme to create an entry for μόνος as an adjective in Ancient Greek (Q35497) but I kept getting "There are problems with some of your input." messages, so I only created μόνος (L11816), the corresponding entry for Greek (Q9129). What is the expected workflow for entering Ancient Greek lexemes, and what conditions have to be met by a language in order for lexemes in it to be creatable this way? --Daniel Mietchen (talk) 11:23, 12 August 2018 (UTC)

@Daniel Mietchen: After you select Ancient Greek, a second language field should appear before the original one. You need to pick "other (mis)" (at the very end) there. I haven't got a clue why it's done that way though. - Nikki (talk) 20:55, 12 August 2018 (UTC)
Actually it seems that you can select Ancient Greek from the second field that appears. Since the list is ordered by language code, you'll find it between Gothic and Swiss German. - Nikki (talk) 15:16, 13 August 2018 (UTC)
Thanks — that worked: μόνος (L12250). Very, very unintuitive. --Daniel Mietchen (talk) 19:44, 13 August 2018 (UTC)

Search box on list of lexemes

There is now a box on Wikidata:Lists/lexemes.
--- Jura 09:28, 13 August 2018 (UTC)

Improvements on search features

Hello all,

I'm very happy to announce that the search features for Lexemes have been widely improved over the past days :)

It was already possible to look for a Lexeme or a Form in the value field when editing a statement, here's what you can do now, from Special:Search or the search box on any page:

  • look for a lexeme by its L-number
    • by typing "Lexeme:L123"
    • by typing "L123" and selecting the Lexeme namespace
  • look for a Lexeme by the name of its lemma
    • by typing "Lexeme:sandbox"
    • by typing "sandbox" and selecting the Lexeme namespace
  • use the L shortcut: "L:L123" or "L:sandbox"
  • look for a Form: (eg "Lexeme:mangeant") with any of the methods described above

Note that the selector (drop-down menu popping up to suggest results) is not working yet. But if you press Enter or search after typing your keyword, you'll access the results.
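For anyone who wants to script the searches listed above, here is a small Python sketch that builds the corresponding Special:Search URLs. The `index.php?title=Special:Search&search=...` pattern is the usual MediaWiki one; the helper name and the default base URL are assumptions made for the example.

```python
from urllib.parse import urlencode


def lexeme_search_url(term, base="https://www.wikidata.org/w/index.php"):
    """Build a full-text search URL using the "Lexeme:" prefix,
    e.g. for an L-number, a lemma or a form."""
    query = {"title": "Special:Search", "search": f"Lexeme:{term}"}
    return f"{base}?{urlencode(query)}"


for term in ("L123", "sandbox", "mangeant"):
    print(lexeme_search_url(term))
```

Opening such a URL should be equivalent to typing the prefixed keyword and pressing Enter, which sidesteps the suggestion drop-down that is not working yet.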

Feel free to try it with your favorite Lexemes! If you encounter any issue, or get a result that seems incorrect to you, please leave a message on this page, describing what you looked for, what you got, and why this result is wrong.

Thanks, Lea Lacroix (WMDE) (talk) 07:51, 9 August 2018 (UTC)

@Lea Lacroix (WMDE): Works great, many thanks. The only minor issue I see is that matching results are not bolded in result links. For example, if I search for "test" in the main space then the word "test" is bolded in the results, while if I search for "L:to" then "to" is not bolded in the results. KaMan (talk) 08:30, 9 August 2018 (UTC)
@Lea Lacroix (WMDE): Thank you so much, this is great! One question: I notice there isn't a separate namespace for Forms, and sometimes it returns a Form and sometimes the Lexeme? For example searching for Lexeme:destroying returns the lexeme destroy (L4571), but searching for Lexeme:hottest returns the form hottest (L3299-F3). Something to do with stemming perhaps? Should that be turned off for lexemes? ArthurPSmith (talk) 14:48, 9 August 2018 (UTC)
There's no separate namespace for Forms. Fulltext search searches both forms and lexemes, but completion search does specific search depending on what particular property requires. Not sure why Lexeme:destroying does not return form, I will check on that. One suggestion is that since Forms belong to Lexeme (i.e. they are all in one document) if the search matches both Lexeme and Form, only one of them will be returned (as search always returns documents). And "destroying" may match "destroy" due to English stemming. Smalyshev (WMF) (talk) 05:24, 15 August 2018 (UTC)
  • Just noticed that the result with a lexeme looks better when search is limited to lexeme namespace than when using lexeme and item namespace.
    --- Jura 13:39, 13 August 2018 (UTC)
That's right, currently the Lexeme-specific search only works for Lexeme namespace alone. If you combine namespaces, it uses the generic search which is much weaker. Task https://phabricator.wikimedia.org/T194968 is the extension of this. Currently, the search works best if only one type of things (articles, items, lexemes) are searched for. This may be improved in the future. Smalyshev (WMF) (talk) 00:50, 14 August 2018 (UTC)

Page to request deletion of lexeme ?

Wikidata:Requests for deletions is for Q
Wikidata:Properties for deletion is for P
Where is the page for L ?
I made a mistake and created Lexeme:L12375, a duplicate of Lexeme:L1183. V!v£ l@ Rosière /Murmurer…/ 23:41, 13 August 2018 (UTC)

@Vive la Rosière: The name "Requests for deletions" for that page makes no reference to Q items specifically; in fact there is a pending request for lexeme deletion there already (which is on hold because we lack a merge capability for lexemes at present). Feel free to list L12375 there if you wish. Mahir256 (talk) 02:25, 14 August 2018 (UTC)
  • Maybe a separate deletion page could be useful. BTW, I completed L12375 as "jaune".
    --- Jura 04:20, 14 August 2018 (UTC)
I have just requested deletion of Lexeme:L6164 at Wikidata:Requests for deletions (error in printed dictionary) and got it removed immediately, so I don't think there is need for new separate page. KaMan (talk) 07:34, 14 August 2018 (UTC)
OK, thank you. I don't think Wikidata needs a new page for that either; but the current introductory help section of Wikidata:Requests for deletions and the specialised template Template:Q should be updated to make this clear for casual contributors from the Wiktionaries, who will mainly deal with lexemes. V!v£ l@ Rosière /Murmurer…/ 08:33, 14 August 2018 (UTC)
@Vive la Rosière: Further duplicates should be merged instead. --Infovarius (talk) 21:04, 14 August 2018 (UTC)

Sources of lexemes

How can I add a source to a lexeme? I would for example like to add the source of a standard dictionary where the lexeme could be found. Wellparp (talk) 08:53, 7 August 2018 (UTC)

You can use described by source (P1343), see вода (L189) and gepard (L5573) for examples. You can also propose new property dedicated to your dictionary if it is online. KaMan (talk) 09:41, 7 August 2018 (UTC)
Thanks! I will use described by source (P1343). Wellparp (talk) 19:09, 7 August 2018 (UTC)
  • Would this work for citations of actual use of forms? I hesitate between this and "present in work".
    --- Jura 20:04, 12 August 2018 (UTC)

Form occurs in combination with

At in- (L10465), I'd like to indicate that the form "im" occurs before b, m, p. What would be the best way to do that? I hesitate between a new property with lexeme- or item-datatype (both for letters).
--- Jura 09:58, 11 August 2018 (UTC)

  • If we use items, one could do one for "b,m,p" combined.
    --- Jura 10:24, 11 August 2018 (UTC)
  • An item in "grammatical features" could be sufficient.
    --- Jura 04:31, 16 August 2018 (UTC)

Showcase Lexeme

Hey folks :)

For my talk at Wikimania it would be nice to show a Lexeme that is pretty well described with Forms and statements. Any suggestions which Lexeme would be a good candidate? --Lydia Pintscher (WMDE) (talk) 10:03, 15 July 2018 (UTC)

@Lydia Pintscher (WMDE): Try any of these:
Note that all of the above were created with the Wikidata Lexeme Forms tool, so they are a good showcase for the available tools too.
KaMan (talk) 10:31, 15 July 2018 (UTC)
Awesome! Thank you. I'll use one of these. --Lydia Pintscher (WMDE) (talk) 18:35, 15 July 2018 (UTC)
What about the biggest and the fullest (and not Latinic) вода (L189)? Sense water (Q283). It also has the biggest graph, however no homographs... --Infovarius (talk) 23:32, 15 July 2018 (UTC)
@Infovarius: вода is no longer the biggest lexeme, gepard (L5573) is bigger :) KaMan (talk) 07:30, 26 July 2018 (UTC)
@KaMan: I suppose вода (L189) is bigger again :) --Infovarius (talk) 13:39, 3 August 2018 (UTC)
@Infovarius: mango (L7565) is bigger :) KaMan (talk) 09:38, 16 August 2018 (UTC)

Filter values by language?

Lookups for forms and lexemes work fine. As the number of available values grows, it might eventually be needed to allow selection or limitation of output by language.
--- Jura 17:31, 2 August 2018 (UTC)

    • Sorry I'm not sure what you're asking for. Do you mean the case when you link to a Form or Lexeme in a statement? And you want to restrict that by language? If so then yes this seems like something we should somehow make possible but I have no idea how yet. --Lydia Pintscher (WMDE) (talk) 19:50, 17 August 2018 (UTC)

Basic grammar ?

How shall we express that some conjunctions are generally followed by a specific mood?

Sample: bien que (L10102) -> subjunctive (Q473746)
--- Jura 12:52, 11 August 2018 (UTC)

  • To stick with the terminology in the interface, we could go with "requires grammatical features".
    --- Jura 04:29, 16 August 2018 (UTC)

wesh

I wonder how to create wesh and its multiple variants wèche, ouech, ouaich, ouaiche, etc.; it's one lexeme, but how does Wikidata treat the variants? V!v£ l@ Rosière /Murmurer…/ 11:07, 14 August 2018 (UTC)

The way to structure this kind of variation hasn't really been settled yet, but you can look, for example, at ama/𒂼 (L1) or quatre-vingt-quinze/quatrevingt-quinze (L95), which have several lemmas as headwords. For the forms, either we apply the same method (see the current version of septiembre/setiembre (L8645)) or we separate them as distinct forms, cf. L8645 (old version). Cdlt, VIGNERON (talk) 13:56, 14 August 2018 (UTC)
OK. The two approaches have different advantages and constraints. The first would make it possible to consolidate and group the edits in one place. However, I see a problem that is far from minor: the references and other attestations, and the etymologies linked to the lexemes, are also grouped in one entity, even though they might concern only one of the variants. Take, for example, the etymology of the written forms of "wesh", which have been Frenchified to varying degrees (very clearly so between the spelling variants "wesh" and "ouaiche"). I have absolutely nothing against the lexeme project on Wikidata; on the contrary, it could prove very useful. However, if from the start things go in all directions without a minimum of harmonisation between contributions, or if contentious cases are not identified or dealt with up front, I think that in the long run some data/properties will simply be chaotic and therefore unusable. V!v£ l@ Rosière /Murmurer…/ 05:55, 16 August 2018 (UTC)
@Vive la Rosière: your first sentence is quite right; that is precisely why the question has not been settled yet (and perhaps it doesn't need to be settled at all).
If a piece of information concerns only one of the variants, it clearly has to go on the corresponding form. See for example soul (L10007), where the attestation obviously concerns only the form sou. Then, if there are many differences, perhaps these are two distinct lexemes rather than two variants of a single lexeme (that doesn't seem to be the case here, but it is always a good question to ask; I like to cite the somewhat extreme example of tour (L2330), tour (L2332), tour (L2331), which, a little counter-intuitively, are three different lexemes).
Yes, inhomogeneity can be a problem, which is why nobody should hesitate to raise questions (even ones that may seem trivial). That said, for now the overall structure seems to me to be falling into place naturally (the basic system having been well thought out ;) ); it is more when we get into the details of the complexity of languages that competing structures appear (but that is normal: the system has to be flexible so as not to block anyone, and it is this flexibility that sometimes produces a slight vagueness that is annoying but necessary). Finally, the most interesting part (for me at least): the ability to query the data will make it easier to monitor and manipulate it.
Cdlt, VIGNERON (talk) 11:12, 16 August 2018 (UTC)
@VIGNERON: Thanks, I was struggling to find similar cases already present to see how they are currently handled. I also note late/l8 (L8), which has the leet spelling variant "l8". But then, wouldn't it be better to add a kind of sub-parameter to lexemes, like forms, but with something like V1/V2/V3 instead of F1/F2/F3? Or else create a variant property (more or less precise when the variant comes from an identified language/formalism: SMS, leet, common abbreviation, etc.). The case of wesh is indeclinable, but I imagine some lexemes with variants will also have inflections (I'm already thinking of all the pairs from before and after the 1990 spelling reform, for example). V!v£ l@ Rosière /Murmurer…/ 13:09, 16 August 2018 (UTC)
@Vive la Rosière: form, variant, inflection, etc.: for me it's all the same thing. To indicate the language variation, just put the corresponding variant in the Spelling variant field, as I have just done for late/l8 (L8); for register variation, either put it in the same place, or request the creation of a specific property (or both). Cdlt, VIGNERON (talk) 15:02, 16 August 2018 (UTC)
OK, that works, I'll go by that. Thanks. V!v£ l@ Rosière /Murmurer…/ 16:04, 16 August 2018 (UTC)
@VIGNERON: I'm bothering you with another case. "01" as a form of janvier (L1183): I wanted to add it, but I thought it was surely a spelling variant for dates (one common to many languages, in fact). This variant exists, but I don't know whether the writing sub-system it belongs to has a specific name in linguistic usage. Would fr-x-ordinal number (Q191780) be enough, or should a specific "date writing" property be created instead? V!v£ l@ Rosière /Murmurer…/ 16:48, 16 August 2018 (UTC)
I admit I'm not sure about that one; I'd say a language tag is enough, but without certainty, and I would rather have used something like fr-x-numerical digit (Q82990). @Jura1: what do you think? Cdlt, VIGNERON (talk) 16:54, 16 August 2018 (UTC)
In my opinion, it would be fine on Q108. What is a bit odd is that janvier is described there as "first month of the year", which is not necessarily correct. Just for my understanding, which lexeme did you end up creating for the word that started this discussion?
--- Jura 06:32, 17 August 2018 (UTC)
@Jura1: For janvier, I specified "first month of the calendar year", but we could put "civil year". The lexeme "wesh" hasn't been created yet. I was waiting to see the different ways lexemes are presented and variants are handled, and to get a better grasp of the "standard" content to attach to a lexeme, before getting started. I'm testing the waters first, and then I'll dive in. :) PS: By the way, is there a property for linking janvier (L1183) to January (Q108) (and vice versa)? V!v£ l@ Rosière /Murmurer…/ 10:25, 17 August 2018 (UTC)
@Vive la Rosière: thanks! Linking to Q items is not possible yet, but it is planned; it is part of the whole Senses feature, which should be coming (@Lea Lacroix (WMDE): do we have a tentative release date?). Cdlt, VIGNERON (talk) 16:52, 19 August 2018 (UTC)
If all goes well™, by October :) Lea Lacroix (WMDE) (talk) 14:00, 20 August 2018 (UTC)

How to deal with lexemes with excessive number of forms?

Hi, I have noticed a small problem with the current representation of lexeme forms. For adjectives in Czech we have 7 cases (nominative, genitive, dative, accusative, vocative, locative, instrumental) × generally two numbers (singular, plural) × 4 grammatical genders (masculine animate, masculine inanimate, feminine, neuter), which gives a total of at least 56 forms (there might be more than one form for a given combination). Assuming that the degrees of comparison are also forms of the same lexeme, that gives at least 168 (but possibly far more) forms. In my opinion that is a bit too much for any human to fill in, and it would make any Wikidata page messy.

On the other hand, most of these forms are not unique. If I count correctly, the adjective "mladý", for example, has only 11 unique forms among those 56 (not counting degrees of comparison). In my opinion there should be a way to say that the same form is used in, for example, two different cases, but I don't see a way to do it without ruining the data. Am I wrong, or is filling in the forms repeatedly the correct way?

--Lexicolover (talk) 16:29, 8 August 2018 (UTC)
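The arithmetic above can be checked by enumerating the feature combinations. A short Python sketch, using only the categories named in the post (the constant names are made up):

```python
from itertools import product

CASES = ["nominative", "genitive", "dative", "accusative",
         "vocative", "locative", "instrumental"]
NUMBERS = ["singular", "plural"]
GENDERS = ["masculine animate", "masculine inanimate", "feminine", "neuter"]
DEGREES = ["positive", "comparative", "superlative"]

# One slot per combination of case, number and gender.
slots = list(product(CASES, NUMBERS, GENDERS))
print(len(slots))  # 7 * 2 * 4

# Treating the three degrees of comparison as forms of the same lexeme.
all_slots = list(product(DEGREES, CASES, NUMBERS, GENDERS))
print(len(all_slots))  # 3 * 56
```

These are lower bounds on the number of forms, since a slot can have more than one form.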

  • The problem comes up once in a while. I don't think we have come up with a definite answer yet. Aspects: fill or not fill? reference a template or define on lexeme? improve layout? fill through tool or form-tool? do now or wait for later?
    --- Jura 18:42, 8 August 2018 (UTC)
    • I have built a module to generate a Lexeme object based on a given word (Module:Lexeme-en). I'm going to write a script to automatically create a lexeme based on an object.

--GZWDer (talk) 08:42, 9 August 2018 (UTC)

Hi @GZWDer:, I'm looking forward to your script because I really want to fill empty adjectives and verbs in Polish. In the meantime I created Module:Lexeme-pl, but it returns a "can't convert function to JSON" error (see below)
Can you help? KaMan (talk) 11:55, 9 August 2018 (UTC)
@KaMan: fixed, but this is not how the workflow is intended to be; the forms should be calculated automatically, like {{#invoke:Lexeme-pl|noun|te|st|u}}.--GZWDer (talk) 15:37, 9 August 2018 (UTC)
@GZWDer: Thanks. Yes, I understand that workflow is for calculated flexions. It is just that nouns in Polish are hardly calculated in general so first I created general use for nouns. Now I can prepare calculable specific ones on top of it. KaMan (talk) 07:22, 10 August 2018 (UTC)
Hello, here is a bit more information from our side:
Automated generation of Forms is definitely planned on our roadmap. But it's more of a long-term plan: we have many other things to fix first, and we need to investigate the best way to do it. We're currently thinking about using Lua, but that's all I can tell for now.
Considering this, I'd suggest not to wait for us to fill the forms, you can start manually, or build your own tools. While doing that, two suggestions:
  • please be aware that the APIs are unstable and may change in the future
  • if you're writing your models and code in Lua, we may be able to build up on it later, so your work will be useful in the future
I stay available to talk more about it if needed :) Cheers, Lea Lacroix (WMDE) (talk) 09:12, 9 August 2018 (UTC)
@Lea Lacroix (WMDE): I've prepared a Lua script, Module:Lexeme-pl, to build the full inflection of Polish adjectives. It is able to produce 84 forms from just four input parameters. Hope it helps and brings us closer to the tool you mentioned. KaMan (talk) 15:43, 18 August 2018 (UTC)
  • What are the plans for the layout? The way the countless forms on aimer (L47) are presented isn't really helpful for people looking at the entity. Besides, it took a month to find that 3 were missing.
    --- Jura 11:44, 9 August 2018 (UTC)
  • It would be nice to have a filter to show only a group of statements based on their shared grammatical features, like all "present tense" or "indicative" forms.--Micru (talk) 12:27, 9 August 2018 (UTC)
  • There is the stub ticket at phabricator:T200885. It links to the mockups that we have from the beginning of the work. --Lydia Pintscher (WMDE) (talk) 15:39, 9 August 2018 (UTC)
    • Ok, I see tasks 11 and 13. If the list is followed in sequence, we are currently at #5. So it would be another 2 years?
      Maybe an alternate approach could be to generate a summary for the forms section in Lua and display that below the header. It could show if a form is available and link to existing ones. The summary can vary based on language, lexical category and some properties. It might also reduce the burden on the development team as the module could be edited onwiki.
      --- Jura 15:52, 9 August 2018 (UTC)
    • Definitely not 2 years ;-) Senses for example are making pretty quick progress. (You can test the current state on https://wikidata.beta.wmflabs.org/wiki/Lexeme:L1) --Lydia Pintscher (WMDE) (talk) 15:57, 9 August 2018 (UTC)
      • @Lydia Pintscher (WMDE): Nice, last time I tried it didn't save. BTW, could we insert a MediaWiki message that could allow such a summary? Initially, it would be just an empty string inserted below the header. This would be similar to the one available for the bottom of Wikipedia articles.
        --- Jura 10:32, 11 August 2018 (UTC)
  • @Lexicolover: can you explain how comparatives increased the number of forms by a factor of 3? Do you mean that each comparative form of an adjective can also be declined in cases, numbers and genders? (because in Russian comparatives add at most 2 forms to the overall count...) --Infovarius (talk) 14:31, 9 August 2018 (UTC)
  • That is exactly right. Comparative and superlative forms actually act like any other adjective. As far as I can see, Czech Wiktionary even has separate pages for positive, comparative and superlative forms (e.g. dobrý - lepší - nejlepší), but if I understand it correctly (I might be wrong, I am not a linguist) all those forms are still one lexeme and thus should be together on Wikidata.
  • My point is that there are a lot of combinations but few unique forms (as seen in the above examples). So it would be nice to have the ability to "nest" forms where applicable. So instead of the current:
L999-F1 | foo (lang)
Grammatical features singular nominative
Statements about L999-F1 

L999-F2 | foo (lang)
Grammatical features singular accusative
Statements about L999-F2 

...
  • to have something like
L999-F1 | foo (lang)
Grammatical features * singular nominative
                     * singular accusative
                     * ...
Statements about L999-F1 
  • while sustaining the ability to datamine it as it is now. I don't see any way to do it now (correct me if I am wrong). Well, I don't know whether it is doable or even whether it is desirable, as I have no idea what consequences it might have for the Wikidata data model. But in my opinion this approach could save lots of redundancy in some languages. --Lexicolover (talk) 19:24, 9 August 2018 (UTC)
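The "nesting" idea above can be illustrated with a small Python sketch. It is not how Wikidata stores forms; the dictionaries, the field names (`representation`, `features`, `feature_sets`) and the third form `foos` are all hypothetical, chosen only to mirror the L999 example. The point it tries to show is that forms sharing one spelling could be grouped under several feature sets and still be expanded back into the flat list for data mining.

```python
from collections import defaultdict

# Hypothetical flat forms, mirroring the L999 example above.
# Field names and the "foos" form are assumptions made for illustration.
flat_forms = [
    {"representation": "foo", "features": ("singular", "nominative")},
    {"representation": "foo", "features": ("singular", "accusative")},
    {"representation": "foos", "features": ("plural", "nominative")},
]


def nest_forms(forms):
    """Group forms that share a representation under one nested entry."""
    grouped = defaultdict(list)
    for form in forms:
        grouped[form["representation"]].append(form["features"])
    return [
        {"representation": rep, "feature_sets": sets}
        for rep, sets in grouped.items()
    ]


def flatten_forms(nested):
    """Expand nested entries back into one flat entry per feature set."""
    return [
        {"representation": entry["representation"], "features": features}
        for entry in nested
        for features in entry["feature_sets"]
    ]


nested = nest_forms(flat_forms)
```

Because `flatten_forms(nest_forms(...))` gives back the original list, the nested display would lose no information in this toy model. Whether statements attached to individual forms could survive such grouping is a separate question that the sketch does not address.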
  • It would be interesting to differentiate adjectives/adverbs with "formes synthétiques" ("better") from others ("more good"). Depending on the language, it may be the exception (or the rule).
    --- Jura 13:35, 11 August 2018 (UTC)

Hello @GZWDer, KaMan: and thank you very much for building these scripts! They will help us move forward with autogeneration of Forms. I just created a ticket for this, listing the existing scripts (if some more are created, we can add them here). We can also track discussions in this ticket. What about adding your scripts in Wikidata:Tools/Lexicographical data? Lea Lacroix (WMDE) (talk) 13:51, 20 August 2018 (UTC)

@Lea Lacroix (WMDE): thank you for creating the ticket. You are right, I've added both modules to the tools page. KaMan (talk) 11:26, 21 August 2018 (UTC)

Download Lexemes

How can I download the available Lexeme data (i.e. https://www.wikidata.org/wiki/Lexeme:L7111)? I have been lately working with wikidata json dumps (i.e. latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/) but the lexemes seem not to be included there. Thanks! --Motagirl2 (talk) 10:57, 17 August 2018 (UTC)

Hello @Motagirl2: this is not possible for now :) The lexemes are not included in the dumps yet. It will be part of enabling the Query Service for Lexemes, which will hopefully happen in the next months. Lea Lacroix (WMDE) (talk) 13:54, 20 August 2018 (UTC)
@Lea Lacroix (WMDE): Ok, thanks! I'll stay tuned :) --Motagirl2 (talk) 07:36, 21 August 2018 (UTC)

Lexemes about months

Hi,

Here is a (maybe crazy) idea: could everyone create the lexemes for the twelve months in their language? The idea is to have these words in as many languages as possible so we can compare them and use them as examples. Ideally, I would love to make these showcase lexemes.

They could also be used to build a common fundamental structure for lexemes. Or at least, to solve questions together. The main question (that we have had several times) is: how to deal with variation in general?

  • For instance, how to deal with Ianuarius/Januarius/ianuarius/januarius for Januarius (L8160) in Latin?
  • Or with plurals Januaries and Januarys for January (L701) in English?

Cdlt, VIGNERON (talk) 16:57, 31 July 2018 (UTC)

List

Here are the months we already have (ordered by the Qid of the language):

  1. Esperanto (Q143) : januaro (L8772), februaro (L8773), marto (L8774), aprilo (L8775), majo (L8776), junio (L8777), julio (L8778), aŭgusto (L8779), septembro (L8780), oktobro (L8781), novembro (L8782), decembro (L8783)
  2. French (Q150) : janvier (L1183), février (L1184), mars (L1185), avril (L1186), mai (L1187), juin (L1188), juillet (L1189), août (L1190), septembre (L1191), octobre (L1192), novembre (L1193), décembre (L1194)
  3. German (Q188) : Januar (L8174) / Jänner (L8882), Februar (L8175), März (L8182), April (L8188), Mai (L8190), Juni (L8191), Juli (L8192), August (L8193), September (L8194), Oktober (L8196), November (L8197), Dezember (L8198)
  4. Turkish (Q256) : ocak (L8736), şubat (L8737), mart (L8738), nisan (L8739), mayıs (L8740), haziran (L8741), temmuz (L8742), ağustos (L8743), eylül (L8744), ekim (L8745), kasım (L8746), aralık (L8747)
  5. Latin (Q397) : Januarius (L8160), Februarius (L8161), Martius (L8162), Aprilis (L8163), Maius (L8164), Junius (L8165), Julius (L8166)/Quintilis (L8172), Augustus (L8167)/Sextilis (L8173), September (L8168), October (L8169), November (L8170), December (L8171)
  6. Italian (Q652) : gennaio (L8269), febbraio (L8270), marzo (L8271), aprile (L8273), maggio (L8274), giugno (L8275), luglio (L8276), agosto (L8277), settembre (L8278), ottobre (L8279), novembre (L8281), dicembre (L8283)
  7. Polish (Q809) : styczeń (L1872), luty (L4487), marzec (L4562), kwiecień (L4734), maj (L5023), czerwiec (L5149), lipiec (L5184), sierpień (L5444), wrzesień (L5581), październik (L5737), listopad (L5910), grudzień (L6023)
  8. Spanish (Q1321) : enero (L8637), febrero (L8638), marzo (L8639), abril (L8640), mayo (L8641), junio (L8642), julio (L8643), agosto (L8644), septiembre/setiembre (L8645), octubre (L8646), noviembre (L8647), diciembre (L8648)
  9. Finnish (Q1412) : tammikuu (L8819), helmikuu (L8822), maaliskuu (L8823), huhtikuu (L8824), toukokuu (L8825), kesäkuu (L8826), heinäkuu (L9048), elokuu (L9049), syyskuu (L9050), lokakuu (L9051), marraskuu (L9052), joulukuu (L9054)
  10. English (Q1860) : January (L701), February (L702), March (L703), April (L704), May (L705), June (L706), July (L707), August (L708), September (L709), October (L710), November (L711), December (L712)
  11. Portuguese (Q5146) : janeiro (L8748), fevereiro (L8749), março (L8750), abril (L8751), maio (L8752), junho (L8753), julho (L8754), agosto (L8755), setembro (L8756), outubro (L8757), Novembro (L8758), dezembro (L8759)
  12. Japanese (Q5287) : 一月/いちがつ (L8439), 二月/にがつ (L8440), 三月/さんがつ (L8441), 四月/しがつ (L8443), 五月/ごがつ (L8444), 六月/ろくがつ (L8445), 七月/しちがつ (L8446), 八月/はちがつ (L8447), 九月/くがつ (L8448), 十月/じゅうがつ (L8450), 十一月/じゅういちがつ (L8451), 十二月/じゅうにがつ (L8452)
  13. Croatian (Q6654) : siječanj (L8898), veljača (L8899), ožujak (L8900), travanj (L8901), svibanj (L8902), lipanj (L8903), srpanj (L8904), kolovoz (L8905), rujan (L8906), listopad (L8907), studeni (L8908), prosinac (L8909),
  14. Catalan language (Q7026) : gener (L8195), febrer (L8199), març (L8200), abril (L8201), maig (L8202), juny (L8372), juliol (L8376), agost (L8380), setembre (L8385), octubre (L8389), novembre (L8392), desembre (L8395)
  15. Dutch (Q7411) : januari (L12576), februari (L17484), maart (L17485), april (L17486), mei (L17487), juni (L17488), juli (L17489), augustus (L17490), september (L17492), oktober (L17493), november (L17494), december (L17495)
  16. Russian (Q7737) : январь (L8841), февраль (L8842), март (L8843), апрель (L8844), май (L8845), июнь (L8846), июль (L8847), август (L8848), сентябрь (L8849), октябрь (L8850), ноябрь (L8851), декабрь (L8852)
  17. Romanian (Q7913) : ianuarie (L8829), februarie (L8830), martie (L8831), aprilie (L8832), mai (L8833), iunie (L8834), iulie (L8835), august (L8836), septembrie (L8837), octombrie (L8838), noiembrie (L8839), decembrie (L8840)
  18. Georgian (Q8108) : იანვარი (L8760), თებერვალი (L8761), მარტი (L8762), აპრილი (L8763), მაისი (L8764), ივნისი (L8765), ივლისი (L8766), აგვისტო (L8767), სექტემბერი (L8768), ოქტომბერი (L8769), ნოემბერი (L8770), დეკემბერი (L8771)
  19. Basque (Q8752) : urtarril (L8720)
  20. Ukrainian (Q8798) : січень (L8795), лютий (L8797), березень (L8799), квітень (L8800), травень (L8802), червень (L8803), липень (L8806), серпень (L8807), вересень (L8809), жовтень (L8811), листопад (L8812), грудень (L8813)
  21. Swedish (Q9027) : januari (L8457), februari (L8460), mars (L8462), april (L8463), maj (L8465), juni (L8466), juli (L8468), augusti (L8469), september (L8470), oktober (L8471), november (L8472), december (L8473)
  22. Danish (Q9035) : januar (L8223), februar (L8232), marts (L8236), april (L8238), maj (L8241), juni (L8245), juli (L8250), august (L8257), september (L8259), oktober (L8265), november (L8280), december (L8285)
  23. Czech (Q9056) : leden (L1202), únor (L1203), březen (L1204), duben (L1205), květen (L1206), červen (L1207), červenec (L1208), srpen (L1209), září (L1210), říjen (L1211), listopad (L1213), prosinec (L1214)
  24. Slovak (Q9058) : január (L10671), február (L10672), marec (L10676), apríl (L10678), máj (L10681), jún (L10683), júl (L10685), august (L10688), september (L10691), október (L10693), november (L10696), december (L10700)
  25. Hungarian (Q9067) : január (L8477), február (L8485), március (L8488), április (L8492), május (L8495), június (L8498), július (L8502), augusztus (L8503), szeptember (L8507), október (L8510), november (L8512), december (L8515)
  26. Estonian (Q9072) : jaanuar (L8723), veebruar (L8722), märts (L8724), aprill (L8725), mai (L8726), juuni (L8727), juuli (L8728), august (L8729), september (L8730), oktoober (L8731), november (L8732), detsember (L8733)
  27. Latvian (Q9078) : janvāris (L8865)
  28. Lithuanian (Q9083) : sausis (L8805), vasaris (L9137),...
  29. Greek (Q9129) : Ιανουάριος (L9291), Φεβρουάριος (L9293), Μάρτιος (L9294), Απρίλιος (L9295), Μάιος (L9296), Ιούνιος (L9297), Ιούλιος (L9298), Αύγουστος (L9299), Σεπτέμβριος (L9300), Οκτώβριος (L9301), Νοέμβριος (L9302), Δεκέμβριος (L9303)
  30. Maltese (Q9166) : Jannar (L8866), Frar (L8867), Marzu (L8868), April (L8869), Mejju (L8870), Ġunju (L8871), Lulju (L8872), Awwissu (L8873), Settembru (L8874), Ottubru (L8875), Novembru (L8876), Diċembru (L8877)
  31. Persian (Q9168) : ژانویه (L8817), ...
  32. Serbian (Q9299) : јануар (L11349), фебруар (L11350), март (L11351), април (L11352), мај (L11353), јун (L11354), јул (L11355), август (L11356), септембар (L11357), октобар (L11358), новембар (L11359), децембар (L11360)
  33. Welsh (Q9309) : Ionawr (L8433), Chwefror (L8435), Mawrth (L8436), Ebrill (L8437), Mai (L8442), Mehefin (L8449), Gorffennaf (L8453), Awst (L8428), Medi (L8427), Hydref (L8454), Tachwedd (L8429), Rhagfyr (L8455)
  34. Bengali (Q9610) : জানুয়ারি (L8672), ফেব্রুয়ারি (L8673), মার্চ (L8674), এপ্রিল (L8675), মে (L8676), জুন (L8677), জুলাই (L8678), আগস্ট (L8679), সেপ্টেম্বর (L8680), অক্টোবর (L8681), নভেম্বর (L8682), ডিসেম্বর (L8683)
  35. Breton (Q12107) : Genver (L8146), C'hwevrer (L8147), Meurzh (L8148), Ebrel (L8159), Mae (L8150), Mezheven (L8151)/Even (L8152), Gouere (L8158), Eost (L8157), Gwengolo (L8156), Here (L8155), Du (L8154), Kerzu/Keverdu (L8153)
  36. Arabic (Q13955) : كانون الثاني/كَانُونُ ٱلْثَانِي (L8660), شباط/شُبَاطُ (L8661), أذار/أَذَارُ (L8662), نيسان/نِيسَانُ (L8663), أيار/أَيَّارُ (L8664), حزيران/حَزِيرَانُ (L8665), تموز/تَمُّوزُ (L8666), آب/آبُ (L8667), أيلول/أَيْلُولُ (L8668), تشرين الأول/تَشْرِينُ ٱلْأَوَّلُ (L8669), تشرين الثاني/تَشْرِينُ ٱلْثَّانِي (L8670), كانون الأول/كَانُونُ ٱلْأَوَّلُ (L8671)
  37. Cornish (Q25289) : Genver (L8176), Whevrer (L8177), Merth (L8178), Ebrel (L8179), Me (L8180), Metheven (L8181), Gortheren (L8183), Est (L8184), Gwyngala (L8185), Hedra (L8186), Du (L8187), Kevardhu (L8189)
  38. Asturian (Q29507) : xineru (L9059), febreru (L9060), marzu (L9061), abril (L9062), mayu (L9063), xunu (L9064), xunetu (L9065), agostu (L9066), setiembre (L9067), ochobre (L9068), payares (L9058), avientu (L9057).
  39. Egyptian Arabic (Q29919) : يناير (L8408), فبراير (L8649), مارس (L8650), أبريل (L8651), مايو (L8652), يونيو (L8653), يوليو (L8654), أغسطس (L8655), سبتمبر (L8656), أكتوبر (L8657), نوفمبر (L8658), ديسمبر (L8659)
  40. Haitian Creole (Q33491) : janvye (L8853), fevriye (L8854), mas (L8855), avril (L8856), me (L8857), jen (L8858), jiyè (L8859), out (L8860), septanm (L8861), oktòb (L8862), novanm (L8863), desanm (L8864)
  41. Kashubian (Q33690) : stëcznik (L9034), gromicznik (L9035), strumiannik (L9036), łżëkwiat (L9037), môj (L9038), czerwińc (L9039), lëpinc (L9040), zélnik (L9041), séwnik (L9042), rujan (L9043), lëstopadnik (L9044), gòdnik (L9045)
  42. Tagalog (Q34057) : Enero (L8911), Pebrero (L8923), Marso (L8924), Abril (L8926), Mayo (L8927), Hunyo (L8928), Hulyo (L8929), Agosto (L8930), Setyembre (L8932), Oktubre (L8933), Nobyembre (L8935), Disyembre (L8936)
  43. Interlingua (Q35934) : januario (L9077), februario (L9078), martio (L9079), april (L9080), maio (L9081), junio (L9082), julio (L9083), augusto (L9084), septembre (L9085), octobre (L9086), novembre (L9087), decembre (L9088)
  44. Vilamovian (Q56485) : styćyń (L8708), lüty (L8709), mjeca (L8710), kwjećyń (L8711), mȧja (L8712), ćerwjyc (L8713), lipjyc (L8714), oügus (L8715), september (L8716), paźdźjernik (L8717), listopad (L8718), grüdźjyń (L8719)

add your language! (list ordered by the Qid of the language)

The same in qitems looks like:

Full table of item labels at list of month names

Discussions

@VIGNERON: Fortunately I had it already done:
Polish (Q809) : styczeń (L1872), luty (L4487), marzec (L4562), kwiecień (L4734), maj (L5023), czerwiec (L5149), lipiec (L5184), sierpień (L5444), wrzesień (L5581), październik (L5737), listopad (L5910), grudzień (L6023)
KaMan (talk) 17:12, 31 July 2018 (UTC)
@KaMan: thanks (I should have known it already existed in Polish!), I've added it directly to the list in my message; other people, don't hesitate to edit the list. Cdlt, VIGNERON (talk) 17:16, 31 July 2018 (UTC)
I'm working on German now. --LydiaPintscher (talk) 17:17, 31 July 2018 (UTC)
German added. --LydiaPintscher (talk) 17:29, 31 July 2018 (UTC)
Done for Welsh and Swedish. --Pymouss (talk) 20:03, 31 July 2018 (UTC)
Added Japanese version. --Okkn (talk) 20:49, 31 July 2018 (UTC)
Added Hungarian Dennismoore (talk) 21:02, 31 July 2018 (UTC)
Forms for German months also added :) --Lucas Werkmeister (talk) 21:36, 31 July 2018 (UTC)
Spanish added. Please check them! They are my first lexemes.--Kippelboy (talk) 21:50, 31 July 2018 (UTC)
@Kippelboy: Fixed September, but otherwise good! Mahir256 (talk) 22:13, 31 July 2018 (UTC)
@Kippelboy:@Mahir256: Both setiembre and septiembre are valid. This is the kind of thing we should solve here. -Theklan (talk) 07:38, 1 August 2018 (UTC)
@Theklan, Kippelboy, Mahir256: based on what we did on other lexemes, I tried something on septiembre/setiembre (L8645). I've used the code es-x-Q2034, for 1900 (Q2034), as I noticed on Ngrams that 1900 is the year when "septiembre" became more common than "setiembre". However, I don't know the Spanish spelling system well, so another code might be better; I'll let you decide. Cdlt, VIGNERON (talk) 10:33, 1 August 2018 (UTC)
Edit: KaMan changed the structure. I'm not really convinced: how to put statements true for "septiembre" but not for "setiembre", or specific for only one variant, like [seˈtjem.bɾe] and [sepˈtjem.bɾe] for IPA transcription (P898) ? Cdlt, VIGNERON (talk) 13:44, 3 August 2018 (UTC)
@VIGNERON: Oh, sorry if I interrupted an experiment. I just inspect every new lexeme and missed that it is linked from here. Feel free to revert me if needed. KaMan (talk) 16:02, 3 August 2018 (UTC)
@KaMan: no problem, I'm not sure, maybe your solution is best. How would you add the pronunciation of each variant? Cheers, VIGNERON (talk) 11:27, 4 August 2018 (UTC)
@VIGNERON: good question. I suppose each pronunciation would get the qualifier language of work or name (P407) set to Spanish (Q1321); the second pronunciation would additionally get another qualifier giving an end date ("to date") of 1900 (Q2034). KaMan (talk) 11:48, 4 August 2018 (UTC)
@KaMan: could you try it please so I can see what your solution looks like? (I see what it could look like with separate forms but not with different forms together as one form). Cheers, VIGNERON (talk) 13:45, 15 August 2018 (UTC)
@VIGNERON: sorry, but I do not have a pronunciation to try it with. Should I take the IPA you mentioned earlier in this thread? Is it correct? KaMan (talk) 16:13, 28 August 2018 (UTC)
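The variant code used in this thread, es-x-Q2034, combines a language code with an item id after a private-use "-x-" marker. A minimal Python sketch of how such a code could be split is given here; the function name `split_variant_code` is hypothetical, and the sketch assumes every variant code follows the same "language, then -x-, then Q-id" shape as the one example above.

```python
def split_variant_code(code):
    """Split a code such as 'es-x-Q2034' into (language, item id).

    Assumes the 'language-x-Qid' shape seen in this thread; a code
    without the '-x-' marker is returned with None as the item id.
    """
    base, marker, qid = code.partition("-x-")
    if not marker:
        return code, None
    return base, qid


language, item = split_variant_code("es-x-Q2034")
```

Here `language` is the plain Spanish code and `item` is the Q-id that distinguishes the variant, which is the pairing the discussion above is trying to pick well.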
+item labels by language--- Jura 22:39, 31 July 2018 (UTC)
Added Arabic and Egyptian Arabic version. @Jura: In the list "item labels by language" Arabic = Egyptian Arabic since the Arabic month names are rarely used. --Kolja21 (talk) 00:17, 1 August 2018 (UTC)
@Kolja21: you might want to correct the labels on these items. The icon next to the label should make this easier. Aliases can include additional names.
--- Jura 11:11, 1 August 2018 (UTC)
Added Bengali. Mahir256 (talk) 00:47, 1 August 2018 (UTC)
Added Estonian, but only the main forms for now - at some point I might come back to add all 28 forms for each, but that's tons of effort :/ --Reosarevok (talk) 08:16, 1 August 2018 (UTC)
Added Portuguese. --Jcornelius (talk)
Added Turkish.-- Hakan·IST 10:31, 1 August 2018 (UTC)

@VIGNERON: so I started with urtarril (L8720) and created the 46 possible grammatical forms by hand. It's very hard work; it should be automated somehow. -Theklan (talk) 08:24, 1 August 2018 (UTC)

Theklan Thanks for your work! Maybe that's something @Lucas Werkmeister:'s Lexeme Forms tool can help with. Feel free to get in touch to find out how it can be done :) Lea Lacroix (WMDE) (talk) 08:35, 1 August 2018 (UTC)
Lea Lacroix (WMDE) I have been looking at that tool and it's not practical if I have to add 46 forms by hand. Grammatical forms are quite regular, and we have it solved with both templates and modules in different versions of Wiktionary. If I enter "urtarril", the forms should be autofilled somehow. -Theklan (talk) 09:02, 1 August 2018 (UTC)
Léa, did you try to use it?
--- Jura 18:38, 3 August 2018 (UTC)

Duplicate where only capitalisation differs: Janeiro (L8734) and janeiro (L8748). Both exist, but shouldn't there be only one lexeme? (I think so but I'm not sure) And how do we indicate this typographic variation? For info, @Jcornelius:. Cdlt, VIGNERON (talk) 13:06, 1 August 2018 (UTC)

@VIGNERON:: It should be only one lexeme, actually. --Jcornelius (talk) 13:14, 1 August 2018 (UTC)

I found some resources on English Wiktionary and the French one to add translations in other languages. I'm starting with Basque (Q8752). Djiboun (talk) 14:02, 1 August 2018 (UTC)

@Djiboun: Make sure you coordinate with Galder on Basque. Also note Jura's nice table of labels linked above. Mahir256 (talk) 14:06, 1 August 2018 (UTC)
@Mahir256: Yes you're right, I forgot his message. This language really needs a template ... Djiboun (talk) 14:09, 1 August 2018 (UTC)
@Mahir256: @Djiboun: The language itself is not a problem, but uploading all the declensions is. That's what should be automated for each language. -Theklan (talk) 15:26, 2 August 2018 (UTC)

Comment: Doesn't this also assume that months are the same part of speech in every language? In Classical Latin, the months were adjectives, not nouns. --EncycloPetey (talk) 15:33, 4 August 2018 (UTC)

@EncycloPetey: nope, I don't see why you would assume that. For Latin, I plan to create the lexemes for the adjective and the locution too, but I lack a bit of time and references to do things properly. If you can, feel free to improve them ;) Cdlt, VIGNERON (talk) 11:26, 5 August 2018 (UTC)
But you already have Januarius (L8160) as a noun, with all the noun forms instead of the adjective forms. It is true that in the middle ages and later, the names of months began to be used as nouns in Latin, but not in Classical Latin, so how will you handle a situation where a single lexeme exists as two different parts of speech, with different inflectional forms, depending on the period in the history of a language? --EncycloPetey (talk) 14:42, 5 August 2018 (UTC)
@EncycloPetey: if there are *two* different parts of speech, how can it be *one* lexeme? As I said, I planned to create lexemes for the adjectives; can I do it or do you see a problem there? Cdlt, VIGNERON (talk) 15:48, 14 August 2018 (UTC)
Because it's exactly the same meaning for the same word/lexeme. It's just a gradual shift over time in the grammar of that meaning. Modern European grammarians see a sharp divide between nouns, adjectives, and verbs, but Classical linguists didn't always see such divisions: the line between noun and adjective was often very fuzzy, which is why there are instances of nouns used as adjectives and adjectives used as nouns. --EncycloPetey (talk) 02:04, 15 August 2018 (UTC)
@EncycloPetey: true but still, I think there are two lexemes. Like "love" in English, which is a word that is both a verb and a noun: love (L1029) and love (L4471). It's the same word and the same meaning but it's two lexemes (actually, it's probably 4 lexemes according to en:wikt:love). When it comes to data, I think that if there is different data for something then we should have different entities. For instance, here we have different declensions and etymology, so why not have distinct lexemes? (PS: please ping me so I can see your answer) Cdlt, VIGNERON (talk) 13:41, 15 August 2018 (UTC)
@VIGNERON: If it's the same word and the same meaning, then how can it be two lexemes? We do not have different etymology in this case, because they are the same historical word. There aren't different declensions either, because it's the same word, same endings, same meaning, same etymology. --EncycloPetey (talk) 13:47, 15 August 2018 (UTC)
@EncycloPetey: because there are two different lexical categories (noun and adjective) and a lexeme cannot have two lexical categories. If it were possible, it would mean that this lexeme is always both a noun and an adjective, which is obviously not true (you said it yourself: « in the middle ages and later, the names of months began to be used as nouns in Latin »; it's diachronic, not synchronic, which again is data that can be indicated with different values if there are two lexemes). And the etymology is obviously different, as the noun comes from the locution, which comes from the adjective, which itself comes from something else. How would you represent that with only one lexeme? Cdlt, VIGNERON (talk) 15:55, 15 August 2018 (UTC)
@VIGNERON: Well, if you're going to decide a priori that two parts of speech automatically means that there are two lexemes, then the argument is circular and there's no point in a discussion. --EncycloPetey (talk) 03:40, 16 August 2018 (UTC)
@VIGNERON: how can you say that love (L1029) and love (L4471) have the same meaning?? Of course they are different - one is about feeling, the other is about some action (which can have little in common with the feeling). --Infovarius (talk) 13:58, 16 August 2018 (UTC)

──────────────────────────────────────────────────────────────────────────────────────────────────── Is this not something, @EncycloPetey, VIGNERON:, that could be resolved by introducing a new grammatical category specifically to refer to this noun-adjective mix? Surely it is useful to note the 'fuzziness' of the specific classification of this word directly—at least as long as only one lexical category can be assigned to an L item—while keeping all declensions and other relationships on one lexeme. Mahir256 (talk) 05:19, 16 August 2018 (UTC)

@Mahir256, EncycloPetey: interesting idea, we need to see if it fits what the references say. And if this is the case, we should correct the Wiktionaries, like en:wikt:Ianuarius or la:wikt:Ianuarius, which treat this word as two lexemes. The L&S entry Januarius and the Lewis entry Ianuarius both say it's an adjective and *also* a substantive (the Gaffiot entry Januarius is less clear, but indicates that the adjective can be used alone to designate the month).
As for having more than one lexical category on an L item, I don't want to decide, but from the linguistic literature I've read about lexemes, it seems to me that a lexeme can only have one category (but scholars don't always agree on what exactly a lexeme is; sadly, Lexical Polycategoriality: Cross-linguistic, cross-theoretical and language acquisition approaches is not available to me on Google Books :/ I've read it too many times I guess). Are there any sources saying that lexemes can have multiple categories? Cdlt, VIGNERON (talk) 06:32, 16 August 2018 (UTC)
  • To comment on some of the initial questions:
    1. the problem with the approach chosen at septiembre/setiembre (L8645) is that we may need a (new) qualifier if we want to make statements about one of the two forms. Sample: "setiembres".
    2. the form can't be linked explicitly with form datatype.
    3. a qualifier like applies to name (P5168) could work.
    4. it may be difficult to find the ideal x-Q-code.
    5. Using a separate form could be difficult if the lexical category has plenty of forms.
    6. Abbreviations should probably be a separate form too.
      --- Jura 04:17, 14 August 2018 (UTC)
In Polish some abbreviations have own inflection so they are usually separate lexemes in dictionaries. KaMan (talk) 05:55, 14 August 2018 (UTC)
I'd rather have "janv" as form on janvier (L1183), but it might not work for all languages.
--- Jura 05:01, 16 August 2018 (UTC)
And what about "01" for janvier (L1183) as a form as well? V!v£ l@ Rosière /Murmurer…/ 08:07, 16 August 2018 (UTC)

Monitoring recent changes per language is now possible

For those interested: thanks to the fix to WhatLinksHere, it is now possible to monitor lexical changes per language, for anyone interested in only one preferred language. You can use Special:RecentChangesLinked with the language's Q-item as target. For example, here are the last two weeks of changes in German: https://www.wikidata.org/wiki/Special:RecentChangesLinked?hidebots=1&hidecategorization=1&target=Q188&showlinkedto=1&namespace=146&limit=50&days=14&enhanced=1&urlversion=2 KaMan (talk) 09:16, 27 August 2018 (UTC)
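For other languages, the same link can be built by swapping the target Q-item. The Python sketch below only reassembles the query parameters already visible in the German example above; the function name `recent_lexeme_changes_url` and its default arguments are illustrative choices, not part of any existing tool.

```python
from urllib.parse import urlencode

BASE = "https://www.wikidata.org/wiki/Special:RecentChangesLinked"


def recent_lexeme_changes_url(language_qid, days=14, limit=50):
    """Build a RecentChangesLinked URL for Lexemes linking to a language item.

    The parameters are copied from the German example in this thread;
    namespace 146 is the value that example uses for Lexemes.
    """
    params = {
        "hidebots": 1,
        "hidecategorization": 1,
        "target": language_qid,   # Q-item of the language, e.g. Q188 for German
        "showlinkedto": 1,        # show changes to pages linking to the target
        "namespace": 146,
        "limit": limit,
        "days": days,
        "enhanced": 1,
        "urlversion": 2,
    }
    return BASE + "?" + urlencode(params)


german_url = recent_lexeme_changes_url("Q188")
polish_url = recent_lexeme_changes_url("Q809")
```

With the defaults, `german_url` reproduces the link given above, and `polish_url` is the same query with Polish (Q809) as target.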

Lexemes available for text

Most words appearing in Q55867115 are now available as Lexemes. Should we mark such texts somehow?
--- Jura 07:36, 30 August 2018 (UTC)

Impressive work! I'm not sure we can do a lot of things right now, but in the future, this provides a nice playground for experiments with Wikisource, like tagging words directly on the text. Lea Lacroix (WMDE) (talk) 12:16, 30 August 2018 (UTC)
I think that would require more forms in Lexemes. Currently sadly many lexemes lack forms. KaMan (talk) 12:44, 30 August 2018 (UTC)
The question is if we should have a form without a text evidencing it or not. The egg or the chicken first?
--- Jura 12:49, 30 August 2018 (UTC)
I think if possible we should have both together at the beginning :) English, Polish and German lexemes mostly follow this path (mainly using the great tool by Lucas Werkmeister). French does not. I'm not blaming you. That's the way you chose to enter lexemes. I only think that it will be harder to complete them later. I think complete lists of forms are key to a lot of functionality in the future. KaMan (talk) 12:57, 30 August 2018 (UTC)
  • I'm more interested in adding the actual quotes once some functionality is there. BTW, English verbs are fairly trivial to fill, even by hand ;)
    --- Jura 13:01, 30 August 2018 (UTC)
@Jura1: wonderful! Keep going, I love that!
Two quick remarks: could you add documentation to Template:TR lexeme fr? (at least give the legend for the colors) And wouldn't it be better to link to the specific form?
As for the creation of forms without attestation: first, here you have attestations so no problem ;) and then, if the form is obvious and trivial, we don't absolutely need to provide a citation (for instance "cats" is a plural form of "cat"). Plus, attestations can always be provided afterwards (as the wiki spirit already does for references). In any case, a lexeme should always have at least one form; that should be mandatory.
Cheers, VIGNERON (talk) 11:33, 31 August 2018 (UTC)
Well, I'd rather see some for the obvious than none at all. Somehow I doubt you'd find some for all forms on aimer (L47). I also hope that the forms eventually get filled on Lexeme creation (there is a Phabricator ticket somewhere).
--- Jura 13:35, 31 August 2018 (UTC)