Wikidata talk:Lexicographical data

Wikidata:Lexicographical data

Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2022/08.



Milestone - 200k lexemes

My bot just created spiritualistically (L200000), the 200000th lexeme, while importing Wiktionary adverbs! It means "in a way relating to being spiritual". – The preceding unsigned comment was added by SixTwoEight (talk • contribs) at 22:04, 11 October 2019 (UTC).

Different spelling or different words?

mensa (L31224) = (L278590)? --Infovarius (talk) 19:19, 15 March 2022 (UTC)

@Infovarius: clearly the same lexeme. The diacritic (here a macron) is a modern notation; it's useful for pronunciation (to indicate a long vowel), but I'm not sure how it should be stored (see also Wikidata_talk:Lexicographical_data/Archive/2018/06#Arabic_diacritics, where we mention this question). PS: the Uzielbot import had a lot of strange things that should be fixed (but I'm not sure where to start...). Cheers, VIGNERON (talk) 15:49, 9 May 2022 (UTC)
@Infovarius: merge ✓ Done. But I'm not sure what to do with the form now... @Escudero, Uziel302: VIGNERON (talk) 09:02, 5 June 2022 (UTC)
I believe they should be merged too (as different spelling variants). --Infovarius (talk) 21:15, 6 June 2022 (UTC)
VIGNERON, when I made the import there were very few Latin words, so I just imported everything that was in Whitaker's program, and the duplicates created by the import should be merged. As for the other strange things, some of them might be in the source; you can check online [1]. If there is a difference from that source, I would love to hear about it. Uziel302 (talk) 21:00, 2 July 2022 (UTC)
@Uziel302: I know, and at that time, in the context of starting the lexemes, it was not a bad import, but now the standards are a bit higher and these are bad lexemes. For the cleaning, see below #Cleaning of Latin lexemes. And yes, I haven't seen any difference from the source (not yet), but William Whitaker's Words (Q533803) is known to be a not bad but not so good source. Cheers, VIGNERON (talk) 09:00, 3 July 2022 (UTC)
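For anyone wanting to look for more pairs like the one above, here is a minimal Python sketch of one possible way to spot candidate duplicates whose lemmas differ only by a diacritic such as the macron. It is only an illustration: the function names and the sample IDs are made up, and a real check would also have to compare language and lexical category before merging anything.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Return text with combining marks (for example the macron) removed."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidate_duplicates(lemmas):
    """Group lexeme IDs whose lemmas are equal once diacritics are ignored."""
    groups = defaultdict(list)
    for lexeme_id, lemma in lemmas.items():
        groups[strip_diacritics(lemma).casefold()].append(lexeme_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical sample data: the IDs are placeholders, not real lexemes.
sample = {"L-A": "mensa", "L-B": "mēnsa", "L-C": "bellum"}
print(candidate_duplicates(sample))
```

The output is only a list of candidates for human review; it says nothing about whether the spelling with the macron should be stored as a representation or dropped.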

Childhood language

I see several language-register items for childhood language. They seem like duplicates, but maybe they are not.

Babbling is the sounds babies make before they can speak. I think the French Wikipedia page is on the wrong item. - Nikki (talk) 16:51, 19 June 2022 (UTC)
@TomT0m: +1 with Nikki. I moved the interwiki link and changed the French label on the first item to make it clearer. So for childhood language, the third one is probably the best one. Cheers, VIGNERON (talk) 17:14, 19 June 2022 (UTC)

The Grammatical Person category and its representation

I'm looking at the representation of lexemes from the perspective of using them in the Abstract Wikipedia project. Here I want to raise some qualms I have with the Grammatical Person category.

  • First, contrary to other grammatical categories, there is no corresponding Grammatical Person property. This means that one cannot add statements regarding the grammatical person of pronouns, for instance. Can I simply create such a property?
  • Second, from a linguistic perspective, as a grammatical category the person category has (typically) three possible values: first person, second person and third person. Yet these items are not marked as instances of the category, but rather as subclasses of it. As instances of the category we find composite features such as third-person masculine plural and similar beasts. Linguistically speaking, these are not person categories, but rather descriptions of pronominal elements, thus combining several atomic grammatical features. In fact, it was agreed in a previous discussion not to use these combined features, but only the atomic ones. Is there a way to systematically clean this up? I would suggest the following steps:
    • Make the atomic person categories instances of Grammatical Person.
    • Either delete the composite person categories completely, or remove their instance of relation to Grammatical Person. In the latter case, the composite person categories can still be annotated as subclasses of the atomic person categories, if we understand the subclass of relation as a subsumption relation (but that's probably another discussion).

Does this seem reasonable? Is there some easy way to make this bulk change? AGutman-WMF (talk) 20:46, 9 June 2022 (UTC)
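The two proposed steps can be sketched on an in-memory model, just to make the intended end state concrete. This is a minimal Python sketch, not a bot: the item keys are plain labels rather than real Q-IDs, the helper names are hypothetical, and it implements the second option for composite items (removing the instance of statement rather than deleting the item). Whether the existing subclass of statements on the atomic items should also be dropped is left open, so the sketch does not touch them.

```python
from dataclasses import dataclass, field

P31 = "P31"    # instance of
P279 = "P279"  # subclass of


@dataclass
class Item:
    label: str
    # Maps a property ID to the set of target item keys.
    statements: dict = field(default_factory=dict)


def add_value(item, prop, target):
    item.statements.setdefault(prop, set()).add(target)


def remove_value(item, prop, target):
    item.statements.get(prop, set()).discard(target)


def clean_up_person_items(items, category, atomic, composite):
    """Apply the two proposed steps to an in-memory copy of the items."""
    for key in atomic:
        # Step 1: atomic categories become instances of the category.
        add_value(items[key], P31, category)
    for key in composite:
        # Step 2 (second option): composite items stop being instances.
        remove_value(items[key], P31, category)


# Hypothetical sample: keys are labels standing in for real items.
CATEGORY = "grammatical person"
items = {
    "first person": Item("first person", {P279: {CATEGORY}}),
    "third person": Item("third person", {P279: {CATEGORY}}),
    "third-person masculine plural": Item(
        "third-person masculine plural", {P31: {CATEGORY}}
    ),
}
clean_up_person_items(
    items,
    CATEGORY,
    atomic=["first person", "third person"],
    composite=["third-person masculine plural"],
)
```

On real data the lists of atomic and composite items would have to come from a query and be reviewed by hand before any edit is made.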

@AGutman-WMF: I'm not sure what you are referring to regarding the property issue - perhaps you can provide more detail on this and the existing properties you see in this area (I wasn't aware we had them)? Please don't create a new property without going through the property proposal process - see Wikidata:Property proposal. On the second point, yes, this makes sense to me. I haven't really looked into it, but it sounds like cleanup would be helpful. ArthurPSmith (talk) 13:48, 10 June 2022 (UTC)
Thanks for the pointer!
Some grammatical categories are modeled both as items and as properties:
The properties are useful as they allow specifying an entry-level statement for a lexeme (not tied to a specific form). For instance, the lexeme she could have a statement Grammatical Person: third-person, instead of repeating this feature once for every form. The same is actually also true for the Grammatical Number category, which doesn't have a corresponding property either.
AGutman-WMF (talk) 11:51, 13 June 2022 (UTC)
Update: Looking into this, I see that there has already been a proposal for creating a Grammatical Person property, but it was rejected on the grounds that the person property is specific to forms and not to lexemes. While this is true for verbs, it is certainly not true for pronouns (where the person property is a property of the lexeme). Should I create a new proposal or revive the old one?
The same question is also relevant for the rejected proposal to create a Grammatical Number property. There too it was rejected on the grounds that it is only relevant to forms, and not lexemes, while in fact there are lexemes (such as pronouns, but also pluralia tantum or mass nouns) which have a fixed number category, which should IMHO be stated on the lexeme level.
AGutman-WMF (talk) 12:25, 13 June 2022 (UTC)
@AGutman-WMF: thanks a lot for these comments! I'm not entirely sure why you can't use the grammatical persons already indicated as grammatical features, but feel free to make a proposal to explain your idea in more detail (with examples and ideally also some external references). Especially as the last proposal was from 2018, *before* the creation of the first lexemes, a lot has changed since.
For the atomic person categories, I think it's mostly clean on lexemes themselves (no systematic cleaning that I know of, and I'm not sure it's needed; maybe we could have constraints for grammatical features?). And we can't delete the composite ones, as it seems they are needed outside lexemes; removal as P31 might be a good idea (but indeed we probably need another discussion).
For the « bulk change », I'm not sure we have the exact tool, but we have some tools (did you already take a look at Wikidata:Tools/Lexicographical data?).
Cheers, VIGNERON (talk) 18:29, 13 June 2022 (UTC)
Thanks! I've now created a property proposal for Grammatical Person. The logic behind it is that if a grammatical feature applies to the lexeme itself (i.e. all forms), it should be stated on the lexeme rather than on each form individually.
Every now and then I run into English verbs that use the third-person singular feature, so I think there is an opportunity for a systematic clean-up here. Maybe this is something we could tackle in the upcoming Wikidata Quality Days.
In general, I think constraints on grammatical features would be very welcome. Logically, even features of forms, appearing as grammatical features, should be tied to grammatical properties and obey these constraints (as if they were statements), but apparently this is difficult to achieve in the existing data model. AGutman-WMF (talk) 15:42, 17 June 2022 (UTC)
@VIGNERON It seems the proposal now has some support and no objections. Who takes it from here to actually create the property? AGutman-WMF (talk) 10:44, 4 July 2022 (UTC)

United States Postal Service abbreviation (Q30619513) as a grammatical feature

A few lexemes use this item as a grammatical feature. I'm not sure this is what forms are for – Loominade (talk) 10:58, 16 June 2022 (UTC)

I'd think the abbreviation should have its own lexeme --Loominade (talk) 12:35, 16 June 2022 (UTC)

@mxn: --Loominade (talk) 12:37, 16 June 2022 (UTC)

I'm not sure that it should be a "grammatical feature", but this is a written form used for the word in English. ArthurPSmith (talk) 13:27, 16 June 2022 (UTC)
It is certainly not a "grammatical feature" in any common sense of this term. It is rather an orthographic variant of certain forms. I would instead create a corresponding property for United States Postal Service abbreviation (Q30619513), and put the abbreviation as part of a statement. If the abbreviation is the same for both singular and plural forms, I would create a corresponding statement on the lexeme itself, otherwise one statement per form. AGutman-WMF (talk) 14:13, 17 June 2022 (UTC)
Ah, that's a reasonable solution. What do you think about abbreviation (Q102786), which is used as a grammatical feature similarly in some cases? For example for mademoiselle (L11884). ArthurPSmith (talk) 15:46, 17 June 2022 (UTC)
@ArthurPSmith I think here too it would be better to have a statement with the abbreviation rather than using abbreviation (Q102786) as a grammatical feature. As far as I understand the data model, the "forms" of the lexeme should really only be grammatically inflected forms (also in accordance with Lexical Masks), and not any orthographic (or other) variation of it. AGutman-WMF (talk) 10:14, 20 June 2022 (UTC)
@AGutman-WMF: Restricting forms to inflection is currently impossible in some languages anyway... Minh Nguyễn 💬 08:28, 21 June 2022 (UTC)
Here is an example: Pkw (L678783) Loominade (talk) 10:28, 21 June 2022 (UTC)
@Loominade, mxn, ArthurPSmith, AGutman-WMF: shouldn't we use variety of form (P7481) for this? And for me, an abbreviation absolutely shouldn't have its own separate lexeme: it has no meaning in itself, it's just a variation of the form (just like "center" and "centre" or "organisation" and "organization"). Cdlt, VIGNERON (talk) 11:05, 19 June 2022 (UTC)
@VIGNERON: My observation is that lexeme databases have abbreviations as lexemes. Why not? --Loominade (talk) 08:40, 20 June 2022 (UTC)
@VIGNERON As far as I can see, using variety of form (P7481) would still entail having a separate form for the abbreviations, which is, IMHO, not desirable. As I wrote above, my understanding is that forms should only reflect grammatically inflected forms, and not other types of variation. So I would still advocate linking the abbreviation (using a statement) to the form it abbreviates.
As for the question whether an abbreviation should have its own lexeme entry: I think it would be good here to distinguish two kinds of abbreviations. Most abbreviations are only orthographic shorthand notations of the full lexeme. In these cases I wouldn't list them separately, since they have no independent spoken existence. However, sometimes an abbreviation becomes a word of its own, and in that case it can be listed as a separate lexeme. (This is very common in Hebrew, e.g. תנ"ך (L68688)). AGutman-WMF (talk) 10:32, 20 June 2022 (UTC)
@AGutman-WMF: « forms should only reflect grammatically inflected forms »? Form is used for any forms, grammatical or otherwise (orthographical, dialectal, etc.). Abbreviations are a bit specific and maybe deserve separate lexemes, but forms will always contain inflections that are not purely grammatical (i.e. if a tool wants to select a form, it cannot look solely at the grammatical features). Cdlt, VIGNERON (talk) 19:10, 23 June 2022 (UTC)
@VIGNERON what you suggest runs, as far as I understand, counter to the lexicographical data model. The forms are clearly defined (and distinguished) in terms of their grammatical features, which are therefore just listed after the form. If you consider Lexical Masks, which are intended to be used as validators of lexical data, you'll see that they define for each part of speech a fixed number of forms, all distinguished by purely grammatical features.
An abbreviation which has no independent spoken form, but is merely read out as the full word it abbreviates, is generally speaking not a lexeme on its own, and not even a form of that lexeme. It is simply an orthographic representation of a form (or the entire lexeme), and it should be linked to that, IMHO, either through a statement, or possibly as a "variant spelling" (e.g. using the code en-x-Q102786). Just as you wouldn't consider the orthography 10 to be a distinct form of the lexeme ten (L338), there is no reason to consider min a form of minute (L2500).
As for dialectal variation, here the situation is different, because a dialectal form is primarily a spoken form, so it is not merely a spelling variant. It may in fact exhibit different inflection patterns or grammatical features (e.g. Swiss German nouns sometimes have a different gender than the corresponding High German noun). So for these I think the ideal solution would be to represent them as different lexemes with a language code corresponding to the relevant dialect (e.g. gsw-x-Q248682 for Zurich German (Q248682)). If the differences are only in pronunciation/spelling, then it may be simpler to list the dialectal forms as spelling variants of the standard form (with the appropriate language code), but I think this would need discussion for each dialect.
(This discussion goes much farther than the original scope... Maybe we should start a distinct thread to discuss this. I will also discuss these questions in my upcoming presentation at the Wikidata Quality Days.) Cheers, AGutman-WMF (talk) 07:28, 24 June 2022 (UTC)
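The codes mentioned above (en-x-Q102786, gsw-x-Q248682) all follow the same visible pattern: a base language code, the private-use marker `-x-`, and an item ID. A minimal Python sketch of building and splitting such codes follows; the function names and the exact validation rules are assumptions made for illustration, not the rules Wikibase itself applies.

```python
import re

# Assumed shape: base code, then "-x-", then a Q-ID.
_QID = re.compile(r"Q[1-9][0-9]*")
_CODE = re.compile(
    r"^(?P<base>[a-z]{2,3}(?:-[A-Za-z0-9]+)*?)-x-(?P<qid>Q[1-9][0-9]*)$"
)


def make_variant_code(base, qid):
    """Build a code such as 'en-x-Q102786' from a base code and an item ID."""
    if not _QID.fullmatch(qid):
        raise ValueError(f"not an item ID: {qid!r}")
    return f"{base}-x-{qid}"


def split_variant_code(code):
    """Return (base, qid) for a private-use code, or None if it has no -x- part."""
    match = _CODE.match(code)
    if match is None:
        return None
    return match.group("base"), match.group("qid")


print(make_variant_code("en", "Q102786"))
print(split_variant_code("gsw-x-Q248682"))
```

A consumer could use the split form to tell a plain spelling in the base language apart from a variant qualified by an item, whichever modelling is finally agreed on.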
@Loominade @VIGNERON: I don't have a strong opinion on how to model these abbreviations, but the USPS abbreviations should be modeled consistently somehow so that they're easy to query for. Maybe a monolingual text statement on a form? There was some previous discussion about a better approach in Wikidata talk:Lexicographical data/Archive/2020/10#Abbreviations. Minh Nguyễn 💬 08:27, 21 June 2022 (UTC)

Splitting of L1131

Hi y'all,

Looking at the example above, do we all agree that key (L1131) should be split, as (at least) L1131-S5 belongs to a different lexeme? @Mxn: WDYT? VIGNERON (talk) 11:08, 19 June 2022 (UTC)

Yes it looks like it - wikt:key has a separate etymology for that sense. ArthurPSmith (talk) 14:12, 20 June 2022 (UTC)
I was not aware we were supposed to split lexemes by etymology. At first thought it seems like a potentially very big headache, as we don't always really know the etymology for each sense. And splitting items is not an easy task, as you have to move the references to the senses/forms as well, since they have no identifiers of their own… author  TomT0m / talk page 16:36, 20 June 2022 (UTC)
Yeah, unfortunately the Move gadget doesn't support moving senses and forms yet (and my plea was left without answer...). --Infovarius (talk) 20:01, 20 June 2022 (UTC)
@TomT0m: There's no need to split if the etymology is not known, but if you have two distinct ones, then how do you represent that in our current lexeme model? The etymology is at the top level of the lexeme, but if you have several and each applies only to some senses, then that gets complicated. Better to split, I think. ArthurPSmith (talk) 13:52, 23 June 2022 (UTC)
@ArthurPSmith: thanks, I'll create a new item for the split soon (S5 and F3-F4).
@TomT0m: you're absolutely not « supposed to split lexemes by etymology »; etymology is just a clue that these are two lexemes and not one (counter-example: in some rare cases, one lexeme can have two etymologies, and then we obviously don't split). But yes, this could be a headache, both lexicographically (identifying two lexemes with the same lemma, lexical category and language is not always easy) and technically (see Infovarius's idea). That said, they do have identifiers here: L1131-S5, L1131-F3 and L1131-F4.
@Infovarius: good idea, I supported your request.
Cheers, VIGNERON (talk) 21:24, 22 June 2022 (UTC)
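Since the identifiers quoted above (L1131-S5, L1131-F3, L1131-F4) embed the lexeme ID, a tool helping with a split would first need to take them apart. Here is a minimal Python sketch of that parsing step only; the class and function names are hypothetical, and nothing here moves any data.

```python
import re
from typing import NamedTuple, Optional


class SubEntityId(NamedTuple):
    lexeme: str  # e.g. "L1131"
    kind: str    # "sense" or "form"
    index: int   # e.g. 5


_PATTERN = re.compile(r"^(L[1-9][0-9]*)-([SF])([1-9][0-9]*)$")


def parse_sub_entity_id(value: str) -> Optional[SubEntityId]:
    """Parse IDs such as 'L1131-S5'; return None for anything else."""
    match = _PATTERN.match(value)
    if match is None:
        return None
    kind = "sense" if match.group(2) == "S" else "form"
    return SubEntityId(match.group(1), kind, int(match.group(3)))


# The three sub-entities named in this thread as belonging to the other lexeme.
to_move = [parse_sub_entity_id(s) for s in ("L1131-S5", "L1131-F3", "L1131-F4")]
print(to_move)
```

Because the lexeme ID is part of each identifier, anything moved to a new lexeme cannot keep its old ID, which is one reason existing references to those senses and forms need to be updated after a split.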
@VIGNERON: how about a possibility to create an (almost) duplicate Lexeme (when most of the content is similar)? Please discuss here (or here). --Infovarius (talk) 07:30, 23 June 2022 (UTC)


@ArthurPSmith, TomT0m, Infovarius: It's done (painfully manually), the second lexeme is key (L684194). I also re-created the United States Postal Service abbreviation (Q30619513); it's not optimal, but until we have a solution, it's better than nothing and at least the information is not lost. Cheers, VIGNERON (talk) 09:42, 14 July 2022 (UTC)

Cleaning of Latin lexemes

Hi,

I have talked about it a few times, but we really need to clean the lexemes in Latin (Latin (Q397), Ordia). There are a lot of things to do, but here is a start with some easy suggestions:

To give a clearer view, I did it by hand on bellum (L260469): before and after.

What do you all think? Is it okay, do you see anything more to clean or things to clean differently?

PS: after this 'easy' part (in the sense that it can be automated), there will be a lot more work to do, like checking and correcting all the strange things from Whitaker (e.g. belle as L:L260469#F4 and F5?, where the regular form is bella). VIGNERON (talk) 12:19, 19 June 2022 (UTC)

 Support Looks good to me. ArthurPSmith (talk) 14:15, 20 June 2022 (UTC)
 Support, I basically imported the data close to the source (Whitaker); I accept that some of the data is not needed or not accurate in the context of Wikidata. Uziel302 (talk) 19:20, 13 July 2022 (UTC)

You can now reuse Wikidata Lexemes on all wikis

Hello all,

In 2018, the Wikidata development team enabled Lexemes, Forms and Senses on Wikidata, allowing everyone across the Wikimedia projects to gather structured data about words and languages. Lexicographical data has been growing, thanks to the effort of the community who added, up to this point, more than 661K Lexemes in 846 different languages on Wikidata (13 languages have more than 1500 Lexemes; see statistics here).

In order to make this data usable and useful, one missing feature was the ability to access and use the data from Wikidata’s Lexicographical Data on the other Wikimedia projects via Lua modules. This feature has been requested for a long time by editors, and after a test phase on a few Wiktionaries, we are happy to announce that Lua access to Wikidata Lexeme will be enabled on June 21st on all Wikimedia projects.

Practically, Lua access means that we created some new Lua functions that will allow you to integrate Lexemes, Forms and Senses from Wikidata on any of the pages of any Wikimedia wiki. Among the many possibilities that this feature offers, you will be able to create, for example, conjugation or declension tables, stubs of Wiktionary entries, tools displaying the meaning of a word on Wikisource, and many other things, depending on what your project needs. Until someone on your project writes a Lua module that makes use of these new functions and then uses this module on a page, nothing changes for your project.

In order to use it, people with experience with Lua modules and templates can look at the documentation listing the available functions. You can also have a look at a simple example showing the singular and plural forms of an English noun: the template, the module, the result.
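For readers who want to see the underlying idea without the Lua specifics, the singular/plural example can be illustrated with a short Python sketch that works on a lexeme's JSON. This is not the Lua API announced here: the field names (`forms`, `representations`, `grammaticalFeatures`) are my understanding of the lexeme JSON format, the two Q-IDs are assumed to be the items for singular and plural, and the sample lexeme is invented.

```python
SINGULAR = "Q110786"  # assumed item ID for "singular"
PLURAL = "Q146786"    # assumed item ID for "plural"


def forms_with_feature(lexeme, feature, lang):
    """Return the representations in `lang` of every form carrying `feature`."""
    result = []
    for form in lexeme.get("forms", []):
        if feature in form.get("grammaticalFeatures", []):
            representation = form.get("representations", {}).get(lang)
            if representation is not None:
                result.append(representation["value"])
    return result


# Invented sample lexeme; the IDs are placeholders, not real entities.
sample_lexeme = {
    "lemmas": {"en": {"language": "en", "value": "cat"}},
    "forms": [
        {
            "id": "L0-F1",
            "representations": {"en": {"language": "en", "value": "cat"}},
            "grammaticalFeatures": [SINGULAR],
        },
        {
            "id": "L0-F2",
            "representations": {"en": {"language": "en", "value": "cats"}},
            "grammaticalFeatures": [PLURAL],
        },
    ],
}

print(forms_with_feature(sample_lexeme, SINGULAR, "en"))
print(forms_with_feature(sample_lexeme, PLURAL, "en"))
```

A template built on the real Lua functions would do the same kind of filtering; the linked documentation lists the actual function names.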

Following the deployment of this feature, we are confident that several editors will start creating their own modules for Wikisource or Wiktionary - we invite you to share your experiments on this talk page, so other people can discover what you have been doing and get inspired.

If you’re involved on a Wiktionary or Wikisource, feel free to share this announcement around and to try the feature with your community.

If you have any questions or feedback, or if you want to discuss with other editors, feel free to use this talk page or the related Telegram group. To report a technical issue, feel free to use this Phabricator ticket.

Thanks for your attention and have fun with Lexemes! Lea Lacroix (WMDE) (talk) 12:19, 20 June 2022 (UTC)

I'm delighted to learn that the French Wiktionary will finally have access to Lexemes. There's a lot of investigation and experimentation to be done to find out what could be the best integration, but I might have some use cases for this feature for Lorrain (Q671198).
Do you have any examples of what the Wiktionaries in the test phase managed to do with that? Poslovitch (talk) 16:03, 20 June 2022 (UTC)
@Poslovitch: I'm happy to hear some interest and I'd love to follow what you manage to do with Lorrain :)
As far as I know, only @Mahir256: implemented modules on Bengali Wiktionary. I found this link, but I guess the best explanation of what it does will come from Mahir :) We're also preparing a livestream in the next few weeks to show use cases and create a template live; I'll keep you updated here. Lea Lacroix (WMDE) (talk) 16:33, 20 June 2022 (UTC)
@Lea Lacroix (WMDE) Thank you for your answer. Keep me posted about the livestream. The main issue I can find, though, is that unlike Wikipedias, we have no "implicit item" for Wiktionaries - infoboxes on Wikipedias do not need us to specify a Wikidata item in them. For Wiktionaries, we'll have to do that everywhere, i.e. in templates for wikt:fr:ba#Lorrain, I'll have to put {{MyTemplate|L678668}}... This is definitely going to hinder re-use of Lexemes on Wiktionaries (let alone the... grudges that may subsist between both projects). So... Any workarounds for this? Poslovitch (talk) 07:58, 21 June 2022 (UTC)
@Poslovitch: Nothing I can think of, unfortunately. Since there is no 1:1 connection between a Lexeme and a Wiktionary page, you will have to enter the Lexeme ID as a parameter. Lea Lacroix (WMDE) (talk) 12:40, 21 June 2022 (UTC)
Reposted to the Vietnamese Wiktionary with some local context. Minh Nguyễn 💬 01:04, 22 June 2022 (UTC)
Update: I also created Wikidata:Wiktionary#Lua_access_to_Lexemes with a bit of documentation about the feature. Improvements and more examples are welcome on this page. Lea Lacroix (WMDE) (talk) 09:30, 22 June 2022 (UTC)

Forms in Vietnamese

Lexemes in Vietnamese use forms very differently than described in Wikidata:Lexicographical data/Documentation#Data Model. Vietnamese has no grammatical feature in need of multiple forms per lexeme, but a typical lexeme has at least two forms, one for Vietnamese alphabet (Q622712) and the other for chữ Nôm (Q875344). Many lexemes have multiple forms because of different orthographic styles in Vietnamese alphabet (Q622712) or because multiple Nôm character (Q15100640) correspond to a single Vietnamese alphabet (Q622712) word. pronunciation audio (P443) statements are always duplicated among all the forms. Some examples:

In other languages, such as English, these purely orthographic variants would be modeled as representations of a single form. However, Wikibase only allows one representation per locale code per form. This is impossible in Vietnamese, because there's a many-to-many relationship between Vietnamese alphabet (Q622712) words and Nôm character (Q15100640), both by design and because chữ Nôm (Q875344) was never standardized. A single author may use 吧 and 𡝕 interchangeably for (L619034). Moreover, a Han character in this lexeme (P5425) statement only makes sense in the context of a particular representation; it makes no sense when paired with a different character or an alphabetic word.
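The restriction described above can be pictured as a mapping keyed by locale code: once a code is taken, there is no slot for a second spelling. A minimal Python sketch of that constraint follows. It models the rule as stated in this thread, not Wikibase's actual implementation, and the code `vi-Hani` is only an assumed example of a code under which Nôm characters might be stored.

```python
class DuplicateCodeError(ValueError):
    """Raised when a second representation is added under an existing code."""


def add_representation(representations, code, text):
    """Add a spelling to a form, enforcing one representation per code."""
    if code in representations:
        raise DuplicateCodeError(
            f"{code!r} already holds {representations[code]!r}"
        )
    representations[code] = text


form_representations = {}
add_representation(form_representations, "vi-Hani", "吧")  # assumed code

try:
    # The interchangeable second character has nowhere to go.
    add_representation(form_representations, "vi-Hani", "𡝕")
    second_accepted = True
except DuplicateCodeError:
    second_accepted = False

print(form_representations, second_accepted)
```

This is why the extra characters currently end up as separate forms: the only way to store both spellings in one form would be to give them different codes.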

Hopefully this atypical usage of forms won't cause too many problems for software consuming lexemes. Should we mention the possibility of non-grammatical forms in Wikidata:Lexicographical data/Documentation#Data, to raise awareness among editors and developers?

 – Minh Nguyễn 💬 08:21, 21 June 2022 (UTC)

Thanks for bringing this up!
In principle, as far as I understand, spelling variants should be part of the base lexeme in this case. As you mentioned, there is a problem that you cannot currently add multiple variants of the same code. I would suggest championing the removal of this restriction (or alleviating it, see my suggestion on the Phabricator bug) rather than working around it. AGutman-WMF (talk) 09:26, 21 June 2022 (UTC)
It isn’t ideal, but the workaround is at least entrenched enough with alternative form (P8530) that maybe it deserves a mention in the documentation. Until the issue is resolved, we shouldn’t simply exclude entire languages from Wikidata lexemes over this issue. I’m sure Wiktionary will find a use for lexemes despite this suboptimal modeling. Fortunately, forms aren’t being used for any other purpose in Vietnamese, so there’s not as much ambiguity. Minh Nguyễn 💬 16:27, 21 June 2022 (UTC)
As far as I understand, the alternative form (P8530) property only works well when you have pairs of alternative spellings (and indeed, you don't use it for the Vietnamese lexemes, as far as I can see), but in fact, if there is only a single alternative form, one could simply use the language code vi-x-Q59342809 using alternative form (Q59342809), or another Q-qualifier, instead of adding non-inflectional forms. I think this suits the data model better. AGutman-WMF (talk) 09:26, 22 June 2022 (UTC)
(By way of an update, I've restructured the Vietnamese lexemes so that they don't have multiple forms. Instead, each chữ Nôm form is in a separate lexeme linked by translation (P5972).) – Minh Nguyễn 💬 20:20, 28 June 2022 (UTC)

Work in progress: Wiktionary interwikis

Hi all, I started a gadget to add links to Wiktionary pages: User:TomT0m/LexToWiktionary.js

It’s still a work in progress, but I think it is already useful, so I'm announcing it here to gather input. It adds a « wiktionary » button that you can click to get a list of Wiktionary pages with relevant titles (the lemmas of the lexeme). At first it only shows a few Wiktionaries (those corresponding to your user language, your Babel languages and the language of the lexeme), but there is a link to load every Wiktionary.
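The selection logic described above can be sketched in a few lines. The gadget itself is JavaScript; this Python sketch only illustrates the idea, with made-up function names, and it assumes the usual `https://<lang>.wiktionary.org/wiki/<title>` URL pattern rather than whatever the gadget does internally.

```python
from urllib.parse import quote


def pick_wikis(user_lang, babel_langs, lexeme_lang):
    """Return the default language editions, in priority order, without repeats."""
    ordered = []
    for lang in [user_lang, *babel_langs, lexeme_lang]:
        if lang and lang not in ordered:
            ordered.append(lang)
    return ordered


def wiktionary_links(lemmas, wiki_langs):
    """Build one candidate page URL per (language edition, lemma) pair."""
    links = []
    for lang in wiki_langs:
        for lemma in lemmas:
            title = quote(lemma.replace(" ", "_"), safe="")
            links.append(f"https://{lang}.wiktionary.org/wiki/{title}")
    return links


wikis = pick_wikis("en", ["fr", "en"], "fr")
print(wiktionary_links(["chat"], wikis))
```

The links are only candidates: a page with that title may not exist on a given Wiktionary, so a real tool would still need to check for existence before showing it.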

Things to do:

  • Better view,
    • in some skins the popup does not work well at the moment (in Minerva, the button is hidden in the « plus » menu).
    • in vector2022 the button is hidden when the window is too narrow
  • add a feature to query all forms and not only the main lexemes (asked by @VIGNERON)

Maybe add a search bar for when you are looking for a specific language and the list is long?

Please tell me if there is any blocker that keeps you from using it. author  TomT0m / talk page 17:22, 21 June 2022 (UTC)

Hi TomT0m, thanks for this gadget. Could you explain how to install it and then how to use it, with an example if possible? For now I have added a line to my common.js, but I haven't noticed anything. Also, I'm not sure I have understood what it is for, so an example would probably help me understand the gadget's usefulness. Thanks, Pamputt (talk) 21:52, 21 June 2022 (UTC)
Hi @Pamputt, which skin are you using? The gadget adds a button that should be clearly visible, roughly where the usual interwiki links are, for each skin. In Minerva it is tucked away in a menu.
I have the same line in my common.js, so in principle it should work.
I haven't written much documentation yet because this is just a test phase and it should in principle be fairly self-explanatory. There is a button; if it works, you just click on it. author  TomT0m / talk page 05:58, 22 June 2022 (UTC)
As for the functionality, it lets you go easily from chat (L511) to en:wikt:Chat or fr:wikt:Chat, and so on. author  TomT0m / talk page 06:00, 22 June 2022 (UTC)
Ah OK, it works; « importScript » wasn't doing the job in common.js. Pamputt (talk) 11:18, 22 June 2022 (UTC)
That's odd, it works for me... but while reading up on importScript I came across this discussion: it would be better to use the other form, for dynamic-loading reasons. Sometimes the script could be loaded before the page is, and then it wouldn't work. I'll change the line that loads my script to take that into account. author  TomT0m / talk page 15:11, 22 June 2022 (UTC)

Japanese words

In Japanese, and probably other languages, some kanji can be read in several ways. For example, « 四 » in the sense of « 4 » can be read « yon » or « shi », and we have two lexemes for the two cases: 四/よん (L625228) and 四/し (L641752). They have the same sense and are usually just one entry in dictionaries, as far as I can tell. Is it a good idea to have two lexemes, or should we find another solution?

It’s impossible to put the two variants in the same lexeme as lemmas because they share the same language code, ja-hira, and you can have only one lemma per code. Maybe there is another language code for variants? I see that in Chinese languages there can be up to 5 in Wikidata now (母語/bó-gí/bó-gí (L305218) for example). It can be added as a form, of course, but is that the right way to do it? author  TomT0m / talk page 19:58, 21 June 2022 (UTC)

@TomT0m: This sounds similar to the problems we're facing in Vietnamese. Maybe the same workaround can be used for Japanese for now. Minh Nguyễn 💬 01:02, 22 June 2022 (UTC)
@TomT0m: That problem is common in Japanese. 私/わたし (L676) is written "わたくし" as well as "わたし" in Hiragana. I do not know how best to resolve this. Afaz (talk) 02:06, 22 June 2022 (UTC)
This example is a sound change. The original form is "わたくし (watakushi)", and "わたし (watashi)" is a truncated form. However, "watashi" is often used in modern Japanese, and "watakushi" has the added nuance of being polite. In 2008, the Ministry of Education established that there are two kun-yomi of "私": "わたくし" and "わたし". Afaz (talk) 22:46, 22 June 2022 (UTC)
@TomT0m: I'm not entirely sure, it's a weird case here (at first, I thought it was an onyomi vs. kunyomi problem, but that does not seem to be the case, and anyway I'm not sure that the readings alone are enough to make it two different lexemes). Loominade and Shisma, could you tell us why you created two lexemes here? And more importantly, why is there no indication to distinguish these two lexemes? Indeed, we really need to think more about how to deal with Asian languages. Cheers, VIGNERON (talk) 21:49, 22 June 2022 (UTC)
I'm not an expert, but as far as I understand よん and し have different etymologies, which should make them distinct lexemes. So if you ask me, they are homograph lexemes that happen to have the same meaning. – Shisma (talk) 05:53, 23 June 2022 (UTC)
here is what I think is going on:

四輪車/よんりんしゃ (L678963) is derived from 四/よん (L625228) is derived from 🤷

四季/しき (L678968) is derived from 四/し (L641752) is derived from (L656234)

but please correct me – Shisma (talk) 07:09, 23 June 2022 (UTC)

@Afaz, TomT0m: do you disagree? – Shisma (talk) 16:51, 26 June 2022 (UTC)

@Shisma: There is no Japanese dictionary that lists them as homographs. However, it is correct to say that they have different etymologies. "よん" is from the Japanese native word "よ" and this is called kun-yomi. The word "し" is from the Chinese sound of the Chinese word "四", and this is called the on-yomi. The problem is that both words with multiple kun-yomi and words with multiple on-yomi exist. Afaz (talk) 18:07, 26 June 2022 (UTC)Reply[reply]
I think this is precisely where we differ: I don't think of lexemes as having a reading. For me, a reading is a property that a lexeme doesn't have. The kanji has a reading: in this case (Q3594955), where the lexeme(s) are the subject of each reading: the actual word. The kanji is merely a representation of the word. For a Japanese-only dictionary it might be sensible to dismiss this layer of abstraction, but if we want to use Wikidata to map etymology across languages, I guess we need it. But ultimately I'm not an expert, neither in Japanese nor in linguistics. 🤷 @AGutman-WMF: do you have thoughts on this? Is there an expert in both? – Shisma (talk) 18:15, 27 June 2022 (UTC)
@Shisma "if we want to use Wikidata to map etymology across languages, I guess we need it": I don’t think so. For this we need to reference senses as the etymology, not lexemes. A lexeme typically has several senses, and not all of them carry over when it is borrowed into a new language. author  TomT0m / talk page 18:18, 27 June 2022 (UTC)
assuming that each etymology has exactly one sense? – Shisma (talk) 06:30, 28 June 2022 (UTC)
@Shisma I’m not sure it’s worth trying to align our whole data model to such a constraint … In a lot of cases the etymology will be missing, unknown or incomplete. So lexeme IDs, senses and forms may all have to change each time we discover a new etymology, or even during etymology disputes (forms would be duplicated, which duplicates data, maybe more than if we duplicated the etymology in each sense). That is a potentially big disruption for data users, even Wiktionaries. (How do we handle conflicting etymology data? One lexeme for each hypothesis?)
Lexeme ID stability seems much more important to me than the ease of use of etymology data. Etymologies are an important part of a dictionary of course, and important for our data, an opportunity to structure etymology … but they do not seem to me to be the most important use of lexeme data. So we had better weigh the problems that splitting lexemes may cause for the main reusers.
If you want to avoid duplication we may find other solutions, like etymology statements placed not in the senses but as main lexeme statements, with qualifiers like « apply to sense » to make them correspond to senses. Or creating items for etymologies and referencing those items in the senses. Or putting the etymology in one sense and adding a property « has the same etymology as sense […] » to the other senses. author  TomT0m / talk page 07:59, 28 June 2022 (UTC)
Often the origin of a word influences its grammatical gender (P5185) and paradigm class (P5911), which in turn also influences its forms. That's why all these properties belong to the lexeme rather than the sense, because that's not what senses are for. Lexemes that look identical (in at least one representation) use homograph lexeme (P5402), and we already have ~15000 of them, about ~4000 even in the same lexical category; check out some French ones. Would you merge those too? – Shisma (talk) 10:00, 28 June 2022 (UTC)
@Shisma For example, for « tour » I would not merge those with different grammatical genders (un tour vs. une tour). Gender is an important feature of French lexemes that can change the forms when they are inflected.
For those with the same gender, maybe (tour: a tool to make circular objects, and tour: a full turn; I’m not sure why there are two lexemes).
I checked several examples from your query and it seems to be the case in most of them.
As far as I can see most don’t really have etymology information yet, so how do you guarantee they won’t be split further in the future? Or would some of them be merged because, after all, they all derive from a closely related historical word sense? author  TomT0m / talk page 10:36, 28 June 2022 (UTC)
The « poisson » example seems more compelling to me, poisson (L11978) poisson (L455419), in that it has two entries in an important French lexicographical resource, cf. https://www.cnrtl.fr/definition/poisson/1 and https://www.cnrtl.fr/definition/poisson/0 .
But can what works for French, a language that has been well documented for quite a long time, work for other languages? I’m not sure. And considering the goals of the lexicographical data, my conviction is that this should be a point to take into account. author  TomT0m / talk page 11:27, 28 June 2022 (UTC)
Let me give you an example from a language I actually speak 😅. At first sight Ausdruck (L296956) and Ausdruck (L296957) look like they are the same. One is apparently derived from a translation of French expression (L12883) and means, among other things, expression, language style or swear word. The other is a nominalisation of the verb ausdrucken (L680080) and means printout or process of printing. Both have the same grammatical gender but all plural forms are different: Ausdrucke/Ausdrücke. That's why dictionaries list them as different lexemes: Duden 1 & Duden 2; dwds 1 & dwds 2. These words are different in essence. So are よん and し. It's just even less obvious because they don't have multiple forms (as far as I know). If hiragana didn't exist and we didn't know how to pronounce these words, we would have to consider them a single lexeme – Shisma (talk) 15:35, 28 June 2022 (UTC)
This question goes to the core of what we understand as "Lexeme". In my opinion, when we are dealing with living (oral) languages, the basic units of language are the spoken forms and not their written representation. So, if we have two completely different spoken forms, as in this case, they should be represented as two different (though synonymous) lexemes. The fact that the two words can be written using the same Kanji character (and indeed, the fact that this is due to the history of how Kanji characters were introduced and read) should not confuse us. So I am in favor of keeping the current representation as two distinct lexemes. Of course, to enrich the data you can link the two lexemes as synonyms and also add statements clarifying the etymology of each lexeme (derived as On'yomi or Kun'yomi), using mode of derivation (P5886). AGutman-WMF (talk) 11:31, 28 June 2022 (UTC)Reply[reply]
@AGutman-WMF « Lexeme » is used in some linguistic schools as a semantic unit. If you are talking about the « form » here, am I correct that you take a more « lemmatic » definition of lexeme (a lemma is, in Wikidata, a chosen representation of a lexeme)?
Compared with the « etymological » definition we discussed, this is a completely different viewpoint, am I correct? For example, if we adopt the convention of the cnrtl, poisson (L11978) and poisson (L455419) are pronounced exactly the same but have unrelated meanings.
Note that the current documentation (see the last table in the section Wikidata:Lexicographical_data/Documentation#Data_Model) suggests all of these should be decided language by language. It does not seem to set out criteria at this point? author  TomT0m / talk page 11:53, 28 June 2022 (UTC)
Definitions of the term "Lexeme" may differ between linguistic schools of thought, but in general I think it is agreed that the Lexeme is a sub-case of the Saussurean Linguistic Sign, i.e. a linking between a sound pattern (the signifier) and a concept (the signified). Thus, if we change either of these two, the pronunciation or the meaning, we get a different lexeme. In the case of Poisson we clearly have two different meanings, so it warrants two lexemes, while in the Japanese case discussed here we have two different pronunciations, thus two lexemes. The difficult cases are when two pronunciations or senses are quite close to each other, in which case they may be seen as variation within the same lexeme.
The etymology itself does not play a crucial role: one lexeme may be (in rare cases) issued from a merger of two etymologies, while one etymological item may lead to the emergence of two different lexemes. On the other hand, differing grammatical features (e.g. gender, part-of-speech) may be a reason to favor multiple lexemes, but these are typically also coupled to a difference in meaning.
As for the table you mentioned, I think it is a bit misguided, since it allows entering differing pronunciations as different forms, which goes, in my opinion, against the idea that the lexeme forms should represent the inflection paradigm of the lexeme. AGutman-WMF (talk) 14:05, 28 June 2022 (UTC)Reply[reply]
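The criterion described in this comment (a lexeme is a pairing of a sound pattern with a concept, so changing either one gives a different lexeme) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Usage` record, the `group_into_lexemes` helper and the placeholder values are hypothetical and are not part of the Wikibase data model.

```python
from collections import defaultdict
from typing import NamedTuple


class Usage(NamedTuple):
    """One attested pairing of spelling, sound and meaning (hypothetical record)."""
    spelling: str        # written representation, e.g. a kanji
    pronunciation: str   # the signifier (sound pattern)
    meaning: str         # the signified (concept)


def group_into_lexemes(usages):
    """Treat every distinct (pronunciation, meaning) pair as one lexeme.

    The spelling is only collected as a representation; it never decides
    whether two usages belong to the same lexeme.
    """
    lexemes = defaultdict(set)
    for usage in usages:
        lexemes[(usage.pronunciation, usage.meaning)].add(usage.spelling)
    return dict(lexemes)


usages = [
    # Same spelling and meaning, different pronunciation: two lexemes.
    Usage("四", "よん", "four"),
    Usage("四", "し", "four"),
    # Same spelling and pronunciation, different meaning: two lexemes.
    # The pronunciation and meanings are placeholders, not real data.
    Usage("poisson", "same-sound", "meaning A"),
    Usage("poisson", "same-sound", "meaning B"),
]

lexemes = group_into_lexemes(usages)
for (pronunciation, meaning), spellings in lexemes.items():
    print(pronunciation, meaning, sorted(spellings))
```

Under this rule the four usages yield four lexemes. The sketch deliberately ignores the hard case mentioned in the comment, where two pronunciations or senses are close enough to count as variation within one lexeme.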
@AGutman-WMF: Some months ago, I had tried this approach for Vietnamese, but the use of synonym (P5973) proved problematic, as it became impossible to distinguish between senses of two lexemes that were semantically synonymous from those that were differentiated only by the transcription method. It would not have been possible to tease out instances of the latter from the larger group of synonyms on the basis of pronunciation alone, as written dialectal variations and spoken dialectal variations were often only partially coincident. In my second attempt at your suggested approach the other day, I wound up using translation (P5972) instead, expanding the meaning of that property to include transcriptions. However, I remain interested in introducing more nuanced properties specific to transcriptions. Minh Nguyễn 💬 04:38, 29 June 2022 (UTC)Reply[reply]
I do not agree with the idea of keeping 四/よん (L625228) and 四/し (L641752) as separate lexemes. They are the same lexicon, even if pronounced differently. Afaz (talk) 05:12, 29 June 2022 (UTC)
I honestly want to understand, why you think so. I have an assumption based on what you said so far. Please tell me if I am right or wrong:

When you look at a japanese dictionary intended for real people (not nerds like us 😅), you see an entry along the lines of

四【し ;よん ; よ 】4番目の正の整数

You wouldn't expect multiple entries like:

四【し】 4番目の正の整数
四【よん】 4番目の正の整数
四【よ】 4番目の正の整数

Because that's obvious and redundant.
Does this summarise your problem? Or is it something else? – Shisma (talk) 15:04, 29 June 2022 (UTC)
I have come to believe that the problem is that there is more than one lemma.
  • Stop using ja-Hira, ja-Kana, and ja-Hrkt.
    • →Use only one Lemma (ja).
  • Use name in kana (P1814) to describe the reading of the lexemes.
  • Describe hiragana and katakana word forms in Forms instead of Lemma.
Example
The Japanese dictionary called UniDic is divided into three levels: Lemma, Forms, and Orthographic. Since Japanese is written in three scripts, the Orthographic level describes all the kanji, hiragana, and katakana spellings of each word form. If we want to realize this in Wikidata, it would be enough to describe all the orthographic variants in Forms. Afaz (talk) 17:33, 29 June 2022 (UTC)
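As a rough illustration of the proposal above (one lemma under "ja", readings recorded separately, kana spellings listed as forms), here is a small Python sketch. The `Lexeme` and `Form` classes, the `forms_in_script` helper and the katakana entries are assumptions made for this example; they are neither the Wikibase schema nor the UniDic format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    representation: str
    script: str  # "kanji", "hiragana" or "katakana" (labels chosen for this sketch)


@dataclass
class Lexeme:
    lemma: str  # single lemma, stored under the language code "ja"
    readings: List[str] = field(default_factory=list)  # as with name in kana (P1814)
    forms: List[Form] = field(default_factory=list)    # every spelling goes here


# Hypothetical single lexeme covering both readings of 四.
four = Lexeme(
    lemma="四",
    readings=["よん", "し"],
    forms=[
        Form("四", "kanji"),
        Form("よん", "hiragana"),
        Form("し", "hiragana"),
        Form("ヨン", "katakana"),
        Form("シ", "katakana"),
    ],
)


def forms_in_script(lexeme: Lexeme, script: str) -> List[str]:
    """Return the representations of a lexeme written in the given script."""
    return [f.representation for f in lexeme.forms if f.script == script]


print(forms_in_script(four, "hiragana"))
```

The sketch makes the trade-off visible: the two readings live side by side in one lexeme, so nothing in the structure says which form goes with which etymology unless further statements are added.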
that's interesting. @AGutman-WMF: why is UniDic structured like this? And why isn't Wikidata? – Shisma (talk) 18:29, 29 June 2022 (UTC)
I'm not very knowledgeable about the UniDic representation, but from what I can gather it is especially geared towards written Japanese corpus linguistics. If the basic unit is understood to be the written form, then one may collapse different spoken lexemes into a single written lexeme, if they have the same orthography (here a kanji character). However, if we take the basic units of language to be the spoken form, this makes less sense. AGutman-WMF (talk) 12:28, 30 June 2022 (UTC)
@AGutman-WMF It has never been formalized which kind of form we are supposed to take as the master in Wikidata, has it? As far as I understand, the main applications will be textual in the foreseeable future.
Is it worth pondering whether some representation would be easier to handle, considering the main uses of the data we can envision?
Interestingly, as far as I can tell there is actually very little phonological information on Wikidata.
My intuition is that as long as the information is linked correctly it does not matter much if we conflate several « lexemes » on the same page? But it seems important, for a structured project, that the structure is consistent. Unfortunately, I don’t think we have much idea of how consumers would use the data to guide us … author  TomT0m / talk page 14:13, 30 June 2022 (UTC)

However, if we take the basic units of language to be the spoken form, this makes less sense.

which is what Wikibase is designed to do, and what all other languages in Wikidata currently do, isn't it? I assume it would be unwise to make an exception for Japanese. @Afaz, TomT0m, AGutman-WMF: Can we agree on that? – Shisma (talk) 17:15, 30 June 2022 (UTC)
@Shisma Not so sure about the practice adopted by each language on Wikidata, as it’s not really documented anywhere and there are more than 200 languages. The page Wikidata:Lexicographical data/Documentation even suggests there may be different ways of doing it: "In some cases or languages, there may be multiple entities for related words, in others just one. The below table provides an overview how they may be linked:" author  TomT0m / talk page 17:46, 30 June 2022 (UTC)
On the other hand, there are only 42 languages (you couldn't make it up) with more than 1000 lexemes and 359 with just 1 lexeme … (full list of lexemes by language) author  TomT0m / talk page 19:25, 30 June 2022 (UTC)
@TomT0m, @Shisma
The main use case I'm aware of for Wikidata's lexicographical knowledge is for use in Abstract Wikipedia. Admittedly, this use-case is currently for generation of written language (though it may change in the future). Still, I would prefer organizing the lexemes according to spoken representations, since it is linguistically more sound. It is also easier to lump together related spoken lexemes (using some property) rather than splitting written lexemes into different spoken representations. Another possible use-case may be exporting lexemes from Wikidata to Wiktionary; the latter seems to represent both the two readings of the Kanji and the Kanji character itself as distinct entries (though they are of course interrelated). This also directs us into the direction of using distinct lexemes for each reading. AGutman-WMF (talk) 13:07, 1 July 2022 (UTC)Reply[reply]
Not all Wiktionaries do: the Japanese one, for example, has one entry for both yon and shi: https://ja.wiktionary.org/wiki/%E5%9B%9B#%E5%90%8D%E8%A9%9E author  TomT0m / talk page 14:22, 1 July 2022 (UTC)
@AGutman-WMF: In any case, I think the point of the story is that it’s hazardous to assume all languages will follow the same organisation at this point?
Why: I’m not sure every contributor will be aware of the guidelines, the contributor communities may be too segmented by language, the guidelines are not very clear at this point, and if they remain comments on a talk page in English that will certainly not be enough to ensure strong coordination. Especially if the community really starts to grow.
Maybe at some point we need a more formal discussion like an RfC, involving as much language diversity as possible in the drafting process? author  TomT0m / talk page 14:28, 1 July 2022 (UTC)
Wiktionary's use case is not to be underestimated: there's a real need for structured representation of those wikis' contents, and a tantalizingly close solution in Wikidata's lexicographical data, the name of which suggests a focus on dictionary-making. The Wiktionary community's collective experience defining a wide variety of languages will be an asset to this project. Linguistic soundness is important, but so is some connection to longstanding convention. Minh Nguyễn 💬 02:22, 2 July 2022 (UTC)Reply[reply]
If there is a lesson to be learned from the Wiktionaries, I think it’s that having several entry points seems useful. Jawikt and enwikt both have pages for wikt:en:四 wikt:ja:四 wikt:en:よん wikt:ja:よん wikt:en:し wikt:ja:し.
How do we get to the data? From a search point of view, I guess the interface used to get in is at least as important as the structure. I think not much work has been done on this for the lexicographical data at the moment; it may co-evolve with the structuring of the data and help our decisions. author  TomT0m / talk page 09:02, 2 July 2022 (UTC)
Since the Wiktionary creates a page for each word form, it makes no sense to map a page to a lexeme.
There are pages for "wikt:en:やっぱり" (yappari), "wikt:en:やっぱし" (yappashi), and "wikt:en:やっぱ" (yappa), but these are only variant forms of "wikt:en:やはり" (yahari). Afaz (talk) 16:04, 2 July 2022 (UTC)Reply[reply]
One interesting thing about the current modelling is that all the lemmas are shown, for example when we use the {{L}} template, so in the Japanese case those who read kana can immediately get an idea of how the word is pronounced, even if incomplete. This is some help, as there is no chance of guessing that from the kanji alone. author  TomT0m / talk page 18:42, 29 June 2022 (UTC)
@Afaz: how does UniDic handle lexemes that have no kanji representation? – Shisma (talk) 18:56, 29 June 2022 (UTC)
UniDic's lemma are not limited to kanji. If only kana characters are available, they will be in kana characters. Afaz (talk) 14:32, 30 June 2022 (UTC)
This is a web service to search UniDic. https://cradle.ninjal.ac.jp/. All the forms 4 and IV are also grouped together under the lemma "四". Afaz (talk) 14:50, 30 June 2022 (UTC)Reply[reply]
Maybe I misunderstand something: Yes, both have the lemma "四" but they seem to be two distinct entities: 四/よん and 四/し. Can you link to the entity that includes both readings? – Shisma (talk) 16:38, 4 July 2022 (UTC)Reply[reply]
@Shisma There is a discussion above on whether words with different etymologies deserve different lexemes, so I’m not sure it’s a sufficient reason to have two lexemes. author  TomT0m / talk page 17:52, 27 June 2022 (UTC)
Where is this discussion? – Shisma (talk) 18:16, 27 June 2022 (UTC)
See #Splitting of L1131 above. author  TomT0m / talk page 18:19, 27 June 2022 (UTC)


Multiple Lexical categories per lexeme?

In English, German and Russian (and probably others), homograph lexemes can have different lexical categories: for instance there is

these are considered to be individual lexemes. They also come with their own set of forms:

  • sounds (plural of noun)
  • sounder (comparative of adjective)
  • sounded (simple past of verb)

This also occurs in Japanese. A good example might be:

(Please fix my translations in case they are off 😅) Only the prefix form actually usually uses the kanji representation 又 (I'm not sure what that means, please enlighten me 🙂). Now wikidata assumes that each lexeme should have only one lexical category and I guess that's alright 🤷. But it also collides with the assumption that all Japanese homographs should be treated as a single lexeme.

I'm sorry I had to create a subthread but I couldn't handle the indentations anymore 😭 – Shisma (talk) 16:18, 3 July 2022 (UTC)

nominal phrase (Q29888377) for translations

German has a lot of words that are compound (Q245423). Often but not always, the English equivalent is simply multiple words. In an English-only dictionary, entries like (L678907) would look funny. when provided as a translation for Freibier (L678904), they would make sense. Do you think (L678907) is a valid lexeme? If not, what makes black hole (L2890) a lexeme? are both nominal phrases in the first place? maybe @AGutman-WMF: – Loominade (talk) 09:34, 23 June 2022 (UTC)Reply[reply]

Not sure on (L678907) - possibly as it represents a concept (a certain category of freedom) beyond the meaning represented simply by the two words in combination. ArthurPSmith (talk) 13:49, 23 June 2022 (UTC)Reply[reply]
In general I think it makes sense to represent in a dictionary (and thus in Wikidata) idioms, which are multi-word expressions whose meaning cannot be compositionally deduced from their sub-components. Some linguists use the term lexeme in a stricter sense (as a single word), but I think most definitions would accept non-compositional multi-word expressions as lexemes. This also applies conversely: while a compound is (orthographically) a single word, if its meaning is compositionally derived from its components, there is no need to list it in a dictionary, unless it has some special pragmatic flavor to it, in which case it is not strictly compositional. So insofar as (L678907) and (L678907) simply mean beer which is free (of cost), I don't think they merit representation in Wikidata, but as @ArthurPSmith mentioned, these expressions do have a special pragmatic usage of representing a category of freedom (gratis), so with that sense they could be retained, but the sense should be added to their definitions, IMHO. AGutman-WMF (talk) 15:41, 23 June 2022 (UTC)
and with definition, you mean gloss or sense properties? – Loominade (talk) 07:59, 24 June 2022 (UTC)
following your advice I redirected the English free beer lexeme to the German word. – Loominade (talk) 09:52, 29 June 2022 (UTC)

Japanese する (suru)-form of nouns

For the lexeme challenge I created some lexemes that are listed by enwikt as verbs associated with some nouns, for example for division: https://en.wiktionary.org/wiki/%E9%99%A4%E7%AE%97 . However, I have a doubt about whether I should have done that. They seem to rarely, if ever, have a page on jawikt, for example.

Is it just akin to transforming a noun like « division » into « doing a division » in English, for example, and as such something that does not really deserve a lexeme page, except in some particular cases? Should they just be removed from enwikt?

If they do not deserve a lexeme on Wikidata, do they deserve a form on the noun lexeme's page? It would be weird to have a form that is not in the same grammatical class as the main lexeme, I think … author  TomT0m / talk page 16:44, 23 June 2022 (UTC)

@TomT0m: I'd say those lexemes of the sort you question which exist in databases such as JMdictDB (and thus have at least one external identifier) or in dictionaries with definitions given in Japanese (such that they may be properly referenced via described by source (P1343) qualified with a page number) can stay. Mahir256 (talk) 17:36, 23 June 2022 (UTC)Reply[reply]
@Mahir256 That doesn’t really answer my question, I guess. My understanding is that in other languages, French for example, we can sometimes change the kind of word by adding a suffix.
For example, we can go from the adjective "possible" to the adverb "possiblement" by adding the suffix "ment". Or in English, "true => truly". In Japanese there seems to be a similar mechanism: "本当" can mean "true", and "truly" can be said "本当に", by adding a "ni" particle. Whereas in French or in English it seems clear that both forms have their own pages, that does not seem to be the case in Japanese. My point is that some of these adverbs, like « 上手に », do not seem to have an entry in, for example, JMdictDB and by your rule should not be lexemes, yet they seem to be used in real expressions. That would not occur in languages like French or English, I guess … as far as I know both words of a similar pair like « habile / habilement » in French always have entries in French dictionaries.
Of course it’s different in Japanese, but dictionaries in Japanese usually seem to consider that the added particles do not make a new lexeme and are just forms.
The problem is that on Wikidata, I think, only one main grammatical category is associated with a lexeme. As I said, in that model it seems weird for a form to have a different category from the main lexeme.
How can we account in Wikidata for these differences, which seem mainly cultural and not totally « linguistic »? Is it really a problem to have a lexeme for them? author  TomT0m / talk page 19:30, 23 June 2022 (UTC)
we currently have
本当/ほんとう (L680773)
Lexical category: adjectival noun (Q1091269)
本当に/ほんとうに (L671856)
Lexical category: adverb (Q380057)
and
元気/げんき (L2454)
Lexical category: adjectival noun (Q1091269)
元気に/げんきに (L2454-F4)
Grammatical features: conjunctive form (Q2888577)
for some reason, this is even reflected in JMdictDB. My intuition is nouns and adjectives should always be separate lexemes. maybe because in my language they are easy to tell apart. But in japanese the line seems to be blurry 🤷. Maybe again @AGutman-WMF: can help us here? – Shisma (talk) 13:35, 9 July 2022 (UTC)Reply[reply]
In many languages there is no clear distinction between nouns and adjectives, in that adjectives can often be used as nouns, e.g. Hebrew חכם/חָכָם (L65269) and חכם/חָכָם (L210912) are in fact the same word, which can be used either as the adjective "wise" or as the noun "wise man". It seems that in Japanese there is a whole class of such words, in which the distinction is only marked by means of the syntactic particle -na, so it is reasonable to collapse the adjective and noun lexemes together. As for the adverbial forms, insofar they are completely productive and regular, there is no need to list them at all. However, if there is some idiosyncrasy, I would list them. AGutman-WMF (talk) 09:22, 11 July 2022 (UTC)Reply[reply]

Some senses have both « item for this sense » and « predicate for ». How are their values linked ?

I wanted to toy with that question, so here is a query that lists them and finds all predicates that link them: https://w.wiki/5Mhk

There are relatively few pairs of items that match these criteria on Wikidata yet.

Surprisingly, we find that walk (Q25443024) and walking (Q6537379) are currently unlinked on Wikidata.


Here is a list of the predicates used, thanks to listeriaBot :

This list is periodically updated by a bot. Manual changes to the list will be removed on the next update!

Query: select (?pitem as ?item) ?property ?propertyLabel (count(?property) as ?count){ ?sens wdt:P5137 ?pratic ; wdt:P9970 ?action . ?lexeme ontolex:sense ?sens ; dct:language ?lang ; wikibase:lemma ?lemma . SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } {?pratic ?pred ?action . } union {?action ?pred ?pratic } ?property wikibase:directClaim ?pred optional { ?property wdt:P1629 ?pitem . } } group by ?property ?pitem ?propertyLabel order by desc(?count)
?item | ?property | ?propertyLabel | ?count
cause | has cause; has effect; has immediate cause | has cause; has effect; has immediate cause | 17; 1
effect | has effect | has effect | 17
product | product or material produced | product or material produced | 4
eponym | named after | named after | 1
memorial society | named after | named after | 1
namesakes | named after | named after | 1
research | studies | studies | 1
has part of class | has part(s) of the class | has part(s) of the class | 1
subclass of | subclass of | subclass of | 1
facet of | facet of | facet of | 1
academic discipline | studied by | studied by | 1
End of automatically generated list.

author  TomT0m / talk page 17:36, 27 June 2022 (UTC)

walk (Q25443024) and walking (Q6537379) should probably be merged. We need people who speak Czech to sort that out. In my opinion predicate for (P9970) should only be used on verbs and item for this sense (P5137) should only be used on nouns. – Finn Årup Nielsen (fnielsen) (talk) 20:04, 4 July 2022 (UTC)
What about go (Q5574688)? --Infovarius (talk) 09:42, 6 July 2022 (UTC)
go (Q5574688) is more like a « word/lexeme item »: it’s an article about the word, its conjugation, etymology and so on, and not really about its meaning. Just as a lexeme has forms for the conjugation and plural, and statements for the etymology.
If there is a relationship between a lexeme and such a « word-item » it should be « item for this lexeme » as a top level statement. author  TomT0m / talk page 11:36, 6 July 2022 (UTC)

Multifaceted language variants in representations

I've been using ad hoc language codes to represent language variants of Vietnamese, for which Wikibase lacks recognized codes. For example, in hủy bỏ/huỷ bỏ (L679211), vi-x-Q55856374 "bánh mỳ" is for Northern Vietnamese (Q55856374) and vi-x-Q11994045 "bánh-mì" is for syllabification (Q11994045), namely the dated practice of spelling compound words with hyphens instead of spaces. But I would also like to indicate that "bánh-mỳ" is the result of combining the two. There are other combinations too, such as final-vowel tone mark placement (Q112681980) with syllabification (Q11994045). Unfortunately, it doesn't seem possible to string together two QIDs in a language code, as phab:T236593#5610378 seemed to suggest was the original intent. How do folks currently work around this limitation? Create items for combinations of variants? Minh Nguyễn 💬 20:41, 28 June 2022 (UTC)Reply[reply]
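To make the limitation concrete, here is a minimal Python sketch of the code shape used in the examples above (a language subtag, then `x`, then a single QID). The regular expression is an assumption inferred from those examples only; the thread does not state the actual validation rule Wikibase applies, and `parse_variant_code` is a hypothetical helper.

```python
import re

# Assumed shape: <language>-x-<one QID>, e.g. "vi-x-Q55856374".
# This pattern is inferred from the examples in this thread and is
# not taken from the Wikibase source.
VARIANT_CODE = re.compile(r"^(?P<lang>[a-z]{2,3})-x-(?P<qid>Q[1-9][0-9]*)$")


def parse_variant_code(code):
    """Return (language, qid) for a single-QID variant code, else None."""
    match = VARIANT_CODE.match(code)
    if match is None:
        return None
    return match.group("lang"), match.group("qid")


# One variant per code fits the assumed shape.
print(parse_variant_code("vi-x-Q55856374"))

# Two QIDs strung together do not fit it, which is the limitation described above.
print(parse_variant_code("vi-x-Q55856374-Q11994045"))
```

Under this assumed shape, the only way to express "Northern Vietnamese plus syllabification" in a single code is to have one item that stands for the combination, which is the workaround asked about above.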

@Mxn: I'm pretty sure I didn't understand the specifics, but wouldn't a general solution be to use statements instead of trying to put everything in the language code? Also, wouldn't hyphenation (P5279) be useful here? Cheers, VIGNERON (talk) 09:48, 14 July 2022 (UTC)

Looking for input re: a sense specific to a plural form

I've started adding lexemes related to street furniture in Punjabi, and have run into some senses which only apply to the plural form of a word with multiple senses. For example, Lexeme:L680584 can mean light source(s), generically. The plural form can be used to specifically describe traffic lights. What would make more sense?:

  • sense for traffic lights, with subject form linking to the plural
  • separate "plurale tantum" lexeme with the traffic lights sense

I am leaning to the former, in which case I am also wondering if "plurale tantum" can be added as a statement to the specific sense rather than at the lexeme level.

Any thoughts are appreciated --Middle river exports (talk) 15:36, 3 July 2022 (UTC)

I see similar situations in Russian. I am too lazy to create a separate lexeme, so I just mark the specific sense as "pluralia tantum". But the separate lexeme approach is probably cleaner. --Infovarius (talk) 18:59, 4 July 2022 (UTC)
I'm in favour of the first proposition, separate lexemes will just create redundant data and a lot of problems. Cheers, VIGNERON (talk) 13:59, 9 July 2022 (UTC)

Characterizing lexemes and forms

I am unsure how to best annotate about the use of syncope (Q1136950) in a derivation or an inflection/conjugations. syncope (Q1136950) can appear during the derivation from another word, e.g., hygiejnisk (L37305) is derived with a syncope (Q1136950). Currently I have added a has quality (P1552) => syncope (Q1136950) on the lexeme level. But it could perhaps also make use of mode of derivation (P5886). enkel (L36451) has inflections where there are syncope (Q1136950). Currently I use has quality (P1552) => syncope (Q1136950) on the form level. Is that an appropriate use? — Finn Årup Nielsen (fnielsen) (talk) 20:10, 4 July 2022 (UTC)Reply[reply]

I have a few more examples of the problems with consonant reduction (Q112915481) (consonant disappearance) and consonant doubling (Q112915196), see, e.g., Danish lexemes and forms skæg (L235661) and datter (L36822). — Finn Årup Nielsen (fnielsen) (talk) 12:35, 5 July 2022 (UTC)
I believe that these are (sub)properties of combines lexemes (P5238), so they should not be main properties of the lexeme but qualifiers on P:P5238. --Infovarius (talk) 13:47, 6 July 2022 (UTC)
Hmm, I don't exactly know this case, but here are some remarks that I hope are useful.
The lexeme-level statement L37305 has quality (P1552) syncope (Q1136950) feels wrong: the lexeme itself doesn't "have the quality", only its formation does. The form-level statement L:L31494#F2 also feels wrong (for the same reason).
Infovarius tried L37305 combines lexemes (P5238) L37304 / uses (P2283) syncope (Q1136950). I'm not entirely sure about uses (P2283), but it feels way better and is at least in the right direction.
Cheers, VIGNERON (talk) 09:23, 14 July 2022 (UTC)
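For readers who want to see what "qualifier rather than main statement" looks like structurally, here is a rough sketch of the modelling Infovarius tried, written as a plain Python dictionary. It follows the general shape of Wikibase statement JSON as I understand it; the helper names are hypothetical, and field details should be checked against the Wikibase JSON documentation before being used in a real edit.

```python
def lexeme_snak(prop, lexeme_id):
    """Build a value snak pointing at a lexeme (helper name is hypothetical)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datatype": "wikibase-lexeme",
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "lexeme", "id": lexeme_id},
        },
    }


def item_snak(prop, item_id):
    """Build a value snak pointing at an item (helper name is hypothetical)."""
    return {
        "snaktype": "value",
        "property": prop,
        "datatype": "wikibase-item",
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "id": item_id},
        },
    }


# Statement on hygiejnisk (L37305): combines lexemes (P5238) L37304,
# with the process recorded as a qualifier: uses (P2283) syncope (Q1136950).
statement = {
    "type": "statement",
    "rank": "normal",
    "mainsnak": lexeme_snak("P5238", "L37304"),
    "qualifiers": {
        "P2283": [item_snak("P2283", "Q1136950")],
    },
}
```

The point of this layout is that syncope is attached to the derivation step (the P5238 statement) rather than asserted as a quality of the lexeme as a whole.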

spelling variants should be forms

I think woman fashion (L203929) should be a form of woman-fashion (L203930). Change my mind 😬 –Shisma (talk) 09:04, 9 July 2022 (UTC)

@Shisma: I won't change your mind, I agree with you. If there is no objection (@SixTwoEight:), I'll merge them next week. Cheers, VIGNERON (talk) 17:55, 13 July 2022 (UTC)
@VIGNERON: I likewise agree that they should be merged. My bot should've checked for alternate forms on Wiktionary before creating Lexemes. SixTwoEight (talk) 17:59, 13 July 2022 (UTC)

New Lexeme creation page available for testing

Hi everyone,

The lexicographical data part of Wikidata is still in need of some love. Over the last few weeks we have worked on this in 2 areas. The first one was Lua access to Lexemes. We have rolled this out and all Wikimedia projects can now access not just Item data but also Lexemes. See the announcement for more details. The second one is coming today. We have reworked the Special:NewLexeme page. Lexicographical data is still hard to understand for people not familiar with lexicography. The new Special:NewLexeme page has a number of tweaks that we hope will make it more understandable and easier to use. This includes an information panel that gives a bit of context about what Lexemes are as well as a lexical category selector that ranks appropriate Items higher. (Better ranking of the Items in the language selector will come soon as well.) Additionally we have put the page on a better technical base.

The information panel on the new special page includes an example Lexeme, which uses live data of a real Lexeme on Wikidata. That Lexeme is selected by the wikibaselexeme-newlexeme-info-panel-example-lexeme-id interface message in the current user interface language. The idea is that you can override this message on Wikidata to select suitable example Lexemes for various languages (e.g., set the German version of the message to the ID of some suitable German example Lexeme).
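As a small illustration of the per-language override described above, the sketch below builds the title of the on-wiki message page for a given interface language and a MediaWiki API URL (`action=query`, `meta=allmessages`) that should return the current value of the message in that language. The function names are hypothetical, and nothing here performs a network request; it only constructs strings.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"
MESSAGE_KEY = "wikibaselexeme-newlexeme-info-panel-example-lexeme-id"


def message_page_title(lang):
    """Title of the on-wiki page that overrides the message for one language."""
    # MediaWiki capitalises the first letter of the message key in page titles.
    return "MediaWiki:" + MESSAGE_KEY[0].upper() + MESSAGE_KEY[1:] + "/" + lang


def message_lookup_url(lang):
    """API URL that asks for the message text in the given interface language."""
    params = {
        "action": "query",
        "meta": "allmessages",
        "ammessages": MESSAGE_KEY,
        "amlang": lang,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


german_title = message_page_title("de")
german_url = message_lookup_url("de")
```

Editing the page named by `message_page_title` requires interface-editing rights, which is why the change is normally requested through an edit request.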

Today we’d love for you to test the new page and give feedback.

Test it: Special:NewLexemeAlpha

This page will be there in parallel to the current version during this testing period. It creates proper Lexemes and you can use it for your regular Lexeme creation work. We currently plan to replace Special:NewLexeme with this new version on August 3rd. At that point we also plan to turn off the temporary Special:NewLexemeAlpha page.

If you have feedback or questions please let us know here. Additionally Lydia is looking for a few people for some short calls to get individual feedback from you. If you are up for that please let me know and we’ll schedule something.

Cheers, -Mohammed Sadat (WMDE) (talk) 13:43, 14 July 2022 (UTC)

The "Lemma" and "Lexeme's language" fields should show an example in the user's interface language. I have no idea what "ama" means, and as a beginner I can't tell whether it's a full word, since the field only says it's the "base word". Lectrician1 (talk) 14:09, 14 July 2022 (UTC)
@Lectrician1: I agree with you, but from what I understand this is exactly what is stated in the message above! Could someone change MediaWiki:Wikibaselexeme-newlexeme-info-panel-example-lexeme-id/br to L62? And for other languages, I guess the smallest Lid in the language is probably a good choice to start with (caveat: the languages of the interface are not exactly the same as the languages of the lexemes). Cheers, VIGNERON (talk) 16:22, 14 July 2022 (UTC)
Yes exactly! That's what we meant with that in the announcement. For English we recommend L344. Lydia Pintscher (WMDE) (talk) 16:23, 14 July 2022 (UTC)
I created edit requests for the English and German version of the message (en talk, de talk). @VIGNERON, I suggest you create an edit request for the Breton version of the message as well, so it shows up in the tracking category (though I’m not sure how many people watch that category, to be honest). Lucas Werkmeister (talk) 15:54, 18 July 2022 (UTC)
Why not use translatewiki.net? Afaz (talk) 20:21, 18 July 2022 (UTC)
I can't seem to find much difference between this and the current version. Also, the absence of the * in the fields makes it look like filling those fields is optional.
A gist about what lexemes are is also a plus. Musahfm (talk) 14:56, 14 July 2022 (UTC)Reply[reply]
@Musahfm: I see a lot of small but great improvements. Agreed on the mandatory field indication (both the *, which I never saw, and the red highlight when you leave a field empty). @Mohammed Sadat (WMDE): is this absent because of the test? If not, could it be added? And yes, a definition of what a lexeme is could be useful, but so far we don't have one (that's the downside of lexicographers: most of us already know what it is but can't really define it). Regards, VIGNERON (talk) 16:22, 14 July 2022 (UTC)
@Musahfm, VIGNERON, Thanks for letting us know that the mandatory field indicators are useful for editors. I created a ticket so we can add them. Regarding a definition for Lexemes, can the community come up with one? -Mohammed Sadat (WMDE) (talk) 08:03, 15 July 2022 (UTC)
Why is it using a different font from the rest of the site?
The help link for spelling variants should not point to Help:Monolingual text languages. That's nothing to do with lexeme languages and won't help people understand what a spelling variant is.
When the spelling variant field is shown, it sometimes shows a warning saying "This Item has an unrecognized language code is. Please select one below."... and sometimes doesn't. (I know what's going on, because I know the internals of how it's implemented, but I don't expect it to make any sense to most people)
- Nikki (talk) 18:16, 14 July 2022 (UTC)
I filed phab:T313166 for the font issue.
The link we can definitely change. Do we have a good page to point it to?
For the spelling variant issue: Can you give me an example code please? Thanks! --Lydia Pintscher (WMDE) (talk) 17:32, 16 July 2022 (UTC)
  1. I can't see one needed feature: checking whether the lexeme already exists. It would be super useful (especially for newer editors) to know if such a lexeme is already there.
  2. Could "real" languages (e.g. with P31=Q34770 or P31/P279* =Q34770 if possible) be preferred in the drop-down list? E.g. one needs to enter at least 5 Cyrillic characters to select Q7737. --Infovarius (talk) 14:35, 15 July 2022 (UTC)
 Support on the drop-down list, maybe it should filter on anything that has one of the ISO language code properties (P218, P219, etc.)? ArthurPSmith (talk) 16:20, 15 July 2022 (UTC)
Yes, languages will be prioritized soon. This should happen with one of the next deployments, as Mohammed wrote in the initial announcement.
As for checking for existence: Noted. We have phab:T195469 for it but we were not able to include it yet. I hope we can get to it in the next iteration. --Lydia Pintscher (WMDE) (talk) 17:32, 16 July 2022 (UTC)
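The ranking idea from the two suggestions above (prefer items that are an instance of Q34770, or that carry an ISO language code property such as P218 or P219) can be sketched as follows. This is only an illustration of the heuristic, not how the selector is actually implemented: the candidate records, their simplified `claims` layout, and the non-language item ID are hypothetical, and the P279* subclass traversal is left out.

```python
LANGUAGE_CLASS = "Q34770"               # "language", as mentioned in the thread
ISO_CODE_PROPERTIES = ("P218", "P219")  # ISO code properties named in the thread


def looks_like_language(candidate):
    """True if the candidate is an instance of Q34770 or has an ISO code claim."""
    claims = candidate.get("claims", {})
    if LANGUAGE_CLASS in claims.get("P31", []):
        return True
    return any(claims.get(prop) for prop in ISO_CODE_PROPERTIES)


def rank_candidates(candidates):
    """Put language-like items first, keeping the original order otherwise."""
    # sorted() is stable, and False sorts before True, so language-like
    # candidates move to the front without reshuffling the rest.
    return sorted(candidates, key=lambda c: not looks_like_language(c))


# Hypothetical search hits in the order a plain text search might return them.
candidates = [
    {"id": "Q_HYPOTHETICAL_1", "claims": {"P31": ["Q_HYPOTHETICAL_CLASS"]}},
    {"id": "Q7737", "claims": {"P31": ["Q34770"], "P218": ["ru"]}},
]
ranked = rank_candidates(candidates)
```

With such a ranking, typing a short prefix would already surface the language item ahead of unrelated items that merely share the label.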

Ping project for Lexicographical data

Hi everyone,

I'm manually notifying some of the most prolific users for Lexemes @Fnielsen, Jon Harald Søby, Mahir256, Bodhisattwa, Nikki: @KaMan, Jsamwrites, 白布飘扬, So9q, Hameryko: (according to this SQL query).

There is a very useful template called {{Ping project}}. Why don't we create one for Lexicographical data (it's not exactly a project per se, but close enough)? Since this template is limited (on purpose) to 50 people, I'm not sure what the best way to go is: a list per language? And/or maybe a list for the most involved people (for modelling the data themselves, regardless of the language)?

What do you think?

PS: there is also the {{User LexData}} template for those who want to put it on their user page.

Cheers, VIGNERON (talk) 11:20, 14 August 2022 (UTC)

Hello! Cool idea! As you say, several lists: one general for people who are interested in the whole lexicographical data project, and one per language. By the way, your SQL query is quite wrong. Here is a fixed one. Cheers, — Envlh (talk) 12:20, 14 August 2022 (UTC)