Wikidata talk:Lexicographical data

Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2020/05.







Milestone - 200k lexemes

My bot just created spiritualistically (L200000), the 200000th lexeme, while importing Wiktionary adverbs! It means "in a way relating to being spiritual".  – The preceding unsigned comment was added by SixTwoEight (talk • contribs) at 22:04, 11 October 2019 (UTC).

Property suggestions for Lexeme/Form/Sense

On blank items, currently instance of (P31) and subclass of (P279) are suggested (see Help:Suggesters_and_selectors#Property_suggester about the feature). The idea is that the user would input a statement with one of these properties (not all).

I think it would be good to have some default suggestions for Lexemes/Form/Sense as well, e.g.

Are there other properties that could be included? --- Jura 11:01, 12 March 2020 (UTC)

For Form the main property is pronunciation (P7243) and two above as qualifiers. According to new model. --Infovarius (talk) 18:25, 12 March 2020 (UTC)
It seems to be in use mainly for Russian and was added by bot, but why not. Latin could probably use it too. Only one of the properties needs to be chosen and one can always use others. I added it above and expanded others.--- Jura 21:54, 12 March 2020 (UTC)
If it's mainly added by bot and in Russian, maybe we should skip that. Also, maybe the property should have string datatype (but this is not really relevant to this discussion). --- Jura 16:37, 26 March 2020 (UTC)
Why only Russian? Or other languages have no variations in pronunciation of forms? --Infovarius (talk) 21:45, 27 March 2020 (UTC)
It was just an observation from current uses. --- Jura 13:58, 24 April 2020 (UTC)
For forms there could also be hyphenation (P5279). And language style (P6191) for both forms and senses. --Lexicolover (talk) 11:04, 13 March 2020 (UTC)
Sure. I added them. --- Jura 16:37, 26 March 2020 (UTC)

@Lea Lacroix (WMDE): is there a way to activate this? --- Jura 13:58, 24 April 2020 (UTC)

We would need to estimate how much work would be required for such a feature. I can't promise that it could happen any time soon, as our roadmap is already packed. Lea Lacroix (WMDE) (talk) 14:58, 27 April 2020 (UTC)

Common misspellings data

Hi. I recently found and really liked that they provide suggestions for correct spelling. I'm wondering if it would be a fruitful endeavour to collect common misspellings under our lexeme forms? This would enable us to easily make something like that. If yes, then I suggest we team up and ask LibreOffice for help with collecting the data we need. WDYT?--So9q (talk) 10:50, 19 March 2020 (UTC)

I personally do not think misspellings should be listed as forms. Perhaps they could be listed with regular statements using a special property. ArthurPSmith (talk) 17:34, 19 March 2020 (UTC)
I do think listing misspellings as forms is a great idea, especially if they are common misspellings. Also, the line between misspellings and variants can be blurry, especially for poorly documented or mostly oral languages. And even for well documented languages, it is still useful, for example in French "occurence" (with only one "r" instead of "occurrence" with two "r") is (sadly) very common, almost 20% in the 80s according to Google Ngrams. Cheers, VIGNERON (talk) 20:05, 19 March 2020 (UTC)
Hmm, I guess that's a good point, and we do allow archaic forms... English also has some very common misspellings - for example Nobody can spell “fuchsia”. ArthurPSmith (talk) 17:06, 20 March 2020 (UTC)
Interesting statistic @VIGNERON:. Here is the equivalent for fuchsia which is actually only half as bad as your example.
I added a misspelled form to fuchsia (L3280), with has quality (P1552) misspelling (Q1984758) / of (P642) fuchsia (Q5005364), but is it better to create a lexeme and add the form there? Also, which is best: to refer to fuchsia (Q5005364) or to the correct form (L3280-F1)? (Right now the latter seems impossible, as form support is missing in the input field.) WDYT?--So9q (talk) 09:51, 21 March 2020 (UTC)
Oh, I actually meant what Arthur suggested: list with regular statements using a special property, e.g. common misspellings: misspelling 1, misspelling 2. But having them as forms with a property indicating a misspelling might be a better way to model it. Obviously this also begs for annotating non-misspellings with at least a reference to an authoritative source, e.g. the Danish correct spelling dictionary [1]. @fnielsen: What do you think about misspellings and modeling them?--So9q (talk) 12:54, 20 March 2020 (UTC)
I'm not 100% sure about the best way to model it; fuchsia (L3280) looks good though. The ideal would be a way to do all sorts of annotation, not only "misspelling" but also "variant", "archaic form", and so on. Cdlt, VIGNERON (talk) 10:08, 21 March 2020 (UTC)
I agree, I created a new property proposal for misspellings.--So9q (talk) 21:29, 22 March 2020 (UTC)
Misspelling lexeme forms design mockup
I just added two misspellings to but seeing there are already 8 forms and imagining 8 forms for each wrong spelling we end up with a mess with a total of 24 forms on one lexeme. It might be better to create a misspelling-lexeme like [2]. Creating these would really be helped by e.g. updating Lexeme forms to help map from the misspelled form to the correct one, side by side/drag and drop or intelligently by analyzing grammatical features. The tool could also visually enable adding one new correct lexeme and one or multiple misspelling-lexemes side-by-side with propagation to minimize typing, see the mockup to the right. WDYT @Lucas Werkmeister:?--So9q (talk) 21:41, 22 March 2020 (UTC)
In Wikipedia there are already a lot of misspellings documented: Swedish, English. If we could find a way to import these we would have a really good start IMO; they might be covered by copyright though.--So9q (talk) 22:28, 22 March 2020 (UTC)
Previously it was proposed that misspellings be stored as individual lexemes: see Wikidata:Property proposal/correct form. Alternatively, misspellings could be stored as statements.--GZWDer (talk) 00:49, 23 March 2020 (UTC)

I thought about it and I think the best approach is to have misspellings as statements for forms:

Correct Form (Lxxxx-Fx) --> Common misspelling (Pxxxx) :  misspelled word

IMO misspellings are neither lexemes nor forms. There is one thing I am not sure of: what actually is a common misspelling? How do we define it? We all probably have some idea of what it is, and that idea is probably different for each of us. There would be stricter opinions and looser opinions. Would it differ language by language? Should it always be attested by linguistic science? It is probably not a good thing to have future arguments about the number of Google search hits. --Lexicolover (talk) 20:58, 23 March 2020 (UTC)

This looks good to me. I changed my proposal to reflect this. --So9q (talk) 22:19, 23 March 2020 (UTC)
I agree we should define "common misspelling". Here are my suggestions: the misspelling has to appear in one of the following sources:
  1. an authoritative source such as e.g. Retskrivningsordbogen (Q3398246)
  2. appears in one of Wikipedia's lists of misspellings, e.g. [3]
  3. appears in WD with P31 misspelling (Q1984758) e.g. Rzehakinacea (Q33188867)
  4. appears in a corpus approved explicitly by this community with an occurrence over a certain threshold. (yet to be created, decided)  – The preceding unsigned comment was added by So9q (talk • contribs).
  • I think the forms should be included as forms, not in some random string property. Accordingly, I prefer the initial version of the proposal. Wikidata is generally descriptive, not prescriptive. --- Jura 16:31, 26 March 2020 (UTC)
    @Jura1: In my understanding misspellings are not forms. Forms have some function in the language, misspellings don't. There might be misspellings that match other forms of the word (different form, archaic form, colloquial form etc.) but there is no form that has the "quality" of being a misspelling - such a thing has no use for the language.--Lexicolover (talk) 21:59, 27 March 2020 (UTC)
    • So you prefer the normative view? Some languages have a language board with a fairly authoritative view of its role or in some countries, there is an official list of first names to use. --- Jura 12:33, 1 April 2020 (UTC)
@So9q: Thank you for your effort. 1) is okay for me; 2) and 3) are IMO not good - it's basically self-referencing. The link at 2) IMO combines misspellings and mistypings (there's a bit of a difference for me); the example at 3) is not even a misspelling, it's an incorrect name. 4) might be good but I am afraid most corpora would not be annotated to decide which word is a real misspelling.--Lexicolover (talk) 21:59, 27 March 2020 (UTC)
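
Editorial sketch of criterion 4 from the list above (an occurrence threshold in an approved corpus). Nothing here was agreed in the discussion: the function names, the 5% default threshold and the toy counts are all hypothetical, and a real corpus would first need the annotation caveat raised in the last comment to be addressed.

```python
from collections import Counter


def misspelling_share(counts, correct, variant):
    """Share of the variant among all occurrences of the two spellings."""
    total = counts[correct] + counts[variant]
    return counts[variant] / total if total else 0.0


def is_common_misspelling(counts, correct, variant, threshold=0.05):
    """True when the variant reaches the (hypothetical) community threshold."""
    return misspelling_share(counts, correct, variant) >= threshold


# Toy counts invented for illustration only; they are not corpus data.
toy_counts = Counter({"occurrence": 80, "occurence": 20, "ocurrence": 1})
```

With these toy numbers, "occurence" passes a 5% threshold while "ocurrence" does not. Using a relative share rather than a raw hit count is one way to avoid arguments about absolute search-hit numbers.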

item for this sense (P5137) for non-nouns

Hello, I know I am probably not the first to ask but still ... could someone explain to me (as simply as possible) how the senses for adjectives, adverbs and verbs are modelled using item for this sense (P5137), please? As it is now and what I understand from property talk page, it is not about senses of words but more about "it has something to do with X". I don't see what benefit it has like this. For example I see no way how to take synonyms or translations properly from this (it works quite well for nouns). --Lexicolover (talk) 21:42, 23 March 2020 (UTC)

@Lexicolover: I'm not sure there is a wide consensus but indeed Property talk:P5137 is probably the best place to sum up the current status quo. The benefit here is: one same property for the same thing on all senses of all lexemes (as lexical category does not impact the meaning) and because there is currently no other property (and creating properties for each lexical category seems like a very bad idea to me).
« it works quite well for nouns » does it? A noun in one language is often not translated by a noun in another language. To take a simple example: « J'ai faim » (noun) in French (word for word « I have hunger ») is « I am hungry » (adjective) in English (notice also how the verb "to have" is replaced by "to be"). I wouldn't use Lexemes for translation, not alone at least. That said, if you really want it, it's trivial to filter by lexical category.
Cheers, VIGNERON (talk) 09:14, 26 March 2020 (UTC)
@VIGNERON: Thank you for your reply but it does not answer my question of how the senses should be modelled. What is that property for if not for expressing the sense of the word? Really just to say "it has something to do with X"? (I am not asking for a special property for each lexical category at all.) And yes, it works for nouns quite well (not perfectly but quite well) because nouns by definition express concepts. On the other hand, adjectives, for example, modify other concepts - just linking an adjective to the concept does not express what that adjective actually means (in some cases it might, in other cases it doesn't). Two adjectives linking to the same concept might be synonyms or antonyms or they might have some completely different relation between them, thus filtering by lexical category solves nothing. Your example is good, it explains why one can't translate word by word between different languages, but it is more of a syntactic or phraseological issue, not a lexical one. At this point I would not use this property for translation either (because of the above-mentioned issues) but it somehow kills the whole good idea of indirect synonyms and translations. --Lexicolover (talk) 16:05, 26 March 2020 (UTC)
@Lexicolover: yes, it is « for expressing sense of the word », emphasis on word: meaning any words and not just nouns.
« Two adjectives linking to the same concept might be synonyms or antonyms », they shouldn't. Do you have any example? Antonymic concepts should have distinct items (this is why there is opposite of (P461)). "peaceful" should link to peace (Q454) and "warring" to war (Q198) for instance. Am I missing something?
The status quo may not be perfect but I think this is the best we have right now. Obviously, if you have other ideas, I'll be glad to hear them and talk about it.
Cheers, VIGNERON (talk) 16:50, 26 March 2020 (UTC)
@VIGNERON: Okay, some examples of what I have in mind:
  1. heavy × light
    Polar opposites. Both are sure to fall under the concept of mass (Q11423). For the sake of it we could create (sub)concepts of heaviness and lightness (or do we have them already?)
  2. beautiful × nice × so-so × ugly
    Concept of beauty (Q7242), different points on the scale. Again we can create (sub)concepts but it starts to feel a little stretched. We don't know how many points that scale has, and different languages might differ in understanding where those points lie (thus we are getting a little messy here). Do we really expect that everyone is able to find the correct Q-item or create it?
  3. kamenný@cs × kamenitý@cs
    The first means 1) made of stone (as in 'kamenná socha' ~ 'stone statue'); 2) which evokes stone (as in 'kamenná tvář' ~ 'stone face'); the second means 1) which contains a lot of stones (as in 'kamenitá cesta' ~ ?'stony road'). Different senses, not interchangeable, both under stone (Q22731). I can't imagine having a Q-item kamenitost (the word itself sounds weird); even if we can have such Q-items, we are getting to the point of having special Q-items to deal with subtleties of a language which might or might not have a counterpart in any other language. And we are also getting to the point where we have special items just for the sake of one adjective (what should those items look like?), which is probably not the direction we want (because we want to use concepts not words, because concepts are broader).
  4. válečný@cs (as in 'válečný zločin' ~ 'war crime') × válčící@cs (verbal adjective; as in 'válčící státy' ~ 'warring states')
    Analogous to the previous example. Not interchangeable, and even stranger to create a special Q-item.
I came to the solution how to deal with cases like actor × actress (eg. námořnice (L290307)) which is not consensual but it works for me until we have something better but it does not seem applicable for adjectives or verbs (not to mention we should have something generally accepted and working). --Lexicolover (talk) 18:25, 26 March 2020 (UTC)
@Lexicolover: your examples are a bit strange, are these true examples? "heavy" or "light" should *not* link to mass (Q11423) (except when "heavy" is used in the sense of "having mass, massive, weighty" but that's not the point here), which is obviously too general, or at least not alone. Same for 2): use the possibility of Wikidata to create subitems, use qualifiers, multiple values and so on. Points 3) and 4) are more interesting but again seem to be the same as the previous one. Indeed if an item would be used by one and only one lexeme, it's a bad idea (both for items and lexemes), but whatever precision you would put in the item could go in the lexeme.
Yes, a qualifier like in námořnice (L290307) (and the 15 other lexemes, see query, including fyr (L33928) in Danish) is what should be done! (if "námořnice" is indeed just for female and not just for feminine; unlike in French). I don't see how it's « not consensual » (that's done on a big number of lexemes with other qualifiers), to me it seems that this is what we have been doing on Wikidata for more than 7 years now ;) And it shouldn't be limited only to adjectives by the way, it would be useful for all words, including nouns.
Cheers, VIGNERON (talk) 09:11, 27 March 2020 (UTC)
@VIGNERON: Thank you for all your time but I am back to my original question - How should I do it? I know I can use qualifiers or multiple values, but I don't know the desired syntax, I don't know how I should put it together so it would be understandable, usable and useful. If I use multiple values, is the relation between them logical AND or logical OR (I personally think of it as logical OR and qualifiers as logical AND, but I don't know if that is correct, I have seen people proposing otherwise)? How far can I go with subitems? Do we have properties I could use as qualifiers to deal with the above-mentioned examples (it is not always easy to find properties matching something one has in mind)? Is there any reference material (internal or external) on what we want to achieve and best practices for how to achieve it? This whole time I haven't wanted to change the property used, this whole time I have wanted to know how I should do it so it will be useful. Right now I only have that mentioned property talk page and it has a really simplified approach to the issue.
I don't know what you mean by your question whether it is a true example. All of those words exist and I tried my best to describe my thought process of what issue I see with them. Where should "heavy" link to then? (I could use an example of "long" and "short" that fall under the concept of "length") Creating a subitem like "heaviness" for item "mass" (or whatever better item) really feels to me the same as creating a special occupation item for the opposite gender (which I think is not desired). I am not a veteran Wikidata editor who went through dozens of discussions to be able to say this item is okay and this one is not, or easily find reference material on how to do something. I have to ask, it really isn't meant to offend anyone.
Using the "gender" qualifier is not consensual. When I asked about this way of doing it before there were voices that preferred different approaches. I've just chosen this one. And since 15 out of 16 lexemes in that query were created by me it shows it is not generally accepted (or that everyone else is working with languages where this is not an issue or they don't see it as an issue). I am happy to see someone say it is correct.
Modelling sense of the word is IMO not comparable to common statements. I could come up with some way myself but if it is not accepted by others it would be useless. Thank you for all of your time and effort with me. --Lexicolover (talk) 21:18, 27 March 2020 (UTC)

Gadget to link Wiktionary and Wikidata

Hi y'all,

For people who, like me, edit both Wiktionary and Wikidata Lexemes, here is a little gadget that adds links to the lexeme(s) with the same lemma as the Wiktionary entry title of the page where you are: fr:wikt:Utilisateur:VIGNERON/LienLex.js (which is the gadget itself; you need to call it on your personal js page, like this fr:wikt:Utilisateur:VIGNERON/common.js). The links are added to the "Tools" panel on the left of the screen.

It's a bit crude and slow (it's going through a SPARQL query, maybe the query could be improved or maybe there is a whole other way) but I think it could be useful so I'm sharing it here.

Every question, comment and remark is obviously welcome.

Cheers, VIGNERON (talk) 16:57, 26 March 2020 (UTC)

PS: special thanks to @Abbe98: (I stole the code from User:Abbe98/osm.js to begin) and @Darmo117: (for the JS debugging and improvements).

@VIGNERON: I get this message from the web console: The resource from “” was blocked due to MIME type (“text/html”) mismatch (X-Content-Type-Options: nosniff). --Vriullop (talk) 14:08, 27 March 2020 (UTC)
@Vriullop: thanks for the feedback but I'm not sure I understand: I've got no error in my console (neither in Firefox nor in Chrome). I'll try to look into it. What web browser do you use? Could it be a conflict with some other gadget? My bad, I'm a bit dumb! Obviously, the URL should be "" and not just "/wiki/Utilisateur:VIGNERON/LienLex.js?action=raw&ctype=text/javascript" (which only works locally on the French Wiktionary as the script is on the same project). Cheers, VIGNERON (talk) 14:36, 27 March 2020 (UTC)
@VIGNERON: I tried with wikt:ca:Special:Permalink/1539621. I have removed all gadgets in my preferences and it doesn't work for me with either Firefox or Chrome. I still get the same message in the web console. --Vriullop (talk) 11:49, 28 March 2020 (UTC)
@Vriullop: and if you put explicitly the prefix https:// instead of just // does it work? Cheers, VIGNERON (talk) 12:44, 28 March 2020 (UTC)
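
Editorial sketch of the kind of SPARQL lookup the gadget is described as performing (finding lexemes whose lemma equals a page title). This is not the gadget's actual query, which is not quoted in the thread. The helper names are hypothetical, and the `ontolex:` and `wikibase:` prefixes are assumed to be predefined by the Wikidata Query Service.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def lexemes_by_lemma_query(lemma):
    """Build a SPARQL query matching lexemes whose lemma string equals `lemma`."""
    # Escape backslashes and double quotes so the title is a safe string literal.
    escaped = lemma.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        "  ?lexeme a ontolex:LexicalEntry ;\n"
        "          wikibase:lemma ?lemma .\n"
        f'  FILTER(STR(?lemma) = "{escaped}")\n'
        "}"
    )


def query_url(lemma):
    """Absolute URL (scheme included) that asks the endpoint for JSON results."""
    params = {"query": lexemes_by_lemma_query(lemma), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)
```

Filtering on `STR(?lemma)` matches the lemma in any language, which is probably slow; if the language code of the Wiktionary entry is known, matching a language-tagged literal directly is likely to be faster.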

User box for the Lexicographical data project?


The title is the question :) Is there already a user box? If not, could we create one? --Hsarrazin (talk) 12:55, 1 April 2020 (UTC)

I don't think there is one yet. Feel free to create one :) Lea Lacroix (WMDE) (talk) 12:27, 3 April 2020 (UTC)
✓ Done {{User LexData}} (see also #Wikidata:WikiProject Lexicographical Data ?). Cheers, VIGNERON (talk) 09:28, 24 May 2020 (UTC)

How to model the two different comparisons of Germanic languages?

Hi, I discussed briefly with Lucas W about how to model this and we did not reach a solution so now I would like your input. My example where both syntactic and morphological comparison applies was this:

1) happy happier happiest

2) happy, more happy, most happy

So "the most happy day of my life" == "the happiest day of my life"?

If this is correct then I believe we should have a total of 5 forms on happy (L1349) but today we only have the -ier -iest forms. I think that is an omission we should decide if we want to correct. WDYT?--So9q (talk) 17:05, 1 April 2020 (UTC)

@So9q: I don't know about other languages, but in English I would view the "more" and "most" variants as applicable to any adjective or adverb; "more" and "most" are lexemes in themselves that have those comparative meanings, and there's no need to add them as forms. Most English adjectives do not have "-er" and "-est" ending forms, but require "more" and "most" to create those comparatives. So I think it's most useful just to add those single-word forms where they do exist, and not add extra forms with "more" and "most". ArthurPSmith (talk) 19:37, 1 April 2020 (UTC)

Huge increase of number of Forms

Hello all,

We noticed a huge increase of the number of Forms these past few days (+100K) (see Grafana board). Is one of you working on an import project?

Thanks, Lea Lacroix (WMDE) (talk) 12:39, 3 April 2020 (UTC)

I'm guessing that at least some of that increase is due to @Uziel302: who is working on adding forms in Latin (and with Latin declension the numbers of forms go up very quickly!) Cdlt, VIGNERON (talk) 14:49, 3 April 2020 (UTC)
Lea Lacroix (WMDE), indeed I've uploaded around 1M forms with their lexicographical analysis based on Whitaker's WORDS. The graph is confusing, mixing all types on the same graph. Let me know if any issue was found in the import. Uziel302 (talk) 20:47, 3 April 2020 (UTC)
The Elhuyar Fundazioa bot has also been adding some more Basque lexemes lately. Mahir256 (talk) 20:52, 3 April 2020 (UTC)
User:Elhuyar Fundazioa is indeed uploading forms but since he uploads each form in a separate edit, it takes more than an hour for him to upload a thousand forms. It also affects load on the system and we reach high maxlag. I think he should group the forms and upload them all in one edit like I did in my last upload. Code is here. Uziel302 (talk) 21:04, 3 April 2020 (UTC)
I've finished uploading Latin forms now, about 1.1M were uploaded in total. Uziel302 (talk) 13:41, 6 April 2020 (UTC)
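
Editorial sketch of the grouping suggested above: adding all forms of a lexeme in a single `wbeditentity` edit instead of one edit per form. This is not the code linked in the thread. The payload layout reflects my understanding of the WikibaseLexeme JSON format and should be checked against the API documentation; the lexeme ID is a placeholder, the feature item IDs are assumptions to verify, and the login and CSRF token steps are left out.

```python
import json

# Assumed item IDs for grammatical features; verify them before any real use.
NOMINATIVE = "Q131105"
GENITIVE = "Q146233"
SINGULAR = "Q110786"


def form_entry(representation, language_code, feature_qids):
    """One new form; the empty "add" key marks it as an addition."""
    return {
        "add": "",
        "representations": {
            language_code: {"language": language_code, "value": representation}
        },
        "grammaticalFeatures": list(feature_qids),
        "claims": [],
    }


def grouped_forms_payload(lexeme_id, forms, language_code="la"):
    """Parameters for ONE wbeditentity call that adds every form in `forms`."""
    data = {"forms": [form_entry(rep, language_code, feats) for rep, feats in forms]}
    return {
        "action": "wbeditentity",
        "id": lexeme_id,
        "data": json.dumps(data, ensure_ascii=False),
        "maxlag": "5",  # ask the server to refuse the edit when replication lags
        "format": "json",
    }


# "L123" is a placeholder lexeme ID, not a real target.
payload = grouped_forms_payload(
    "L123",
    [("rosa", [NOMINATIVE, SINGULAR]), ("rosae", [GENITIVE, SINGULAR])],
)
```

One request per lexeme rather than per form reduces the number of edits, which is the load concern raised in the comment above.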

Adding Kurdish data


Sorry in advance, if it's not the best place to ask this question.

I am wondering if there is a way to import data automatically and merge them with the existing entries or create new ones. We have developed a few dictionaries for three dialects of the Kurdish languages, namely Hawrami, Sorani and Kurmanji. The datasets (available here) are originally created in Ontolex-Lemon, particularly using the Lexicog module.

Any help would be greatly appreciated.

--Sina.ahm (talk) 23:41, 5 April 2020 (UTC)

Hi Sina.ahm, this is the right place to ask for help with this kind of question/request. The first point to solve, before talking about the technical side, is the licence. On the Github repository, I see that your data are released under CC by-nc-sa. This licence is not compatible with Wikidata nor any Wikimedia project. Wikidata uses a CC-0 licence which means, in short, that it is very close to the public domain. Other Wikimedia projects mostly use the CC by-sa licence, which is similar to the licence you have chosen but also authorises reuse of the data for commercial purposes. So, before going further, you should look into changing the licence of your data. Changing to CC-0 will allow you to import your data into Wikidata. Moving to CC by-sa will allow you to import your lexicographical data into any Wiktionary project. Pamputt (talk) 06:44, 6 April 2020 (UTC)
Sina.ahm, Pamputt, this is true for copyrightable content, grammatical information is not copyrightable so you may upload it and leave the copyrightable data aside (definitions, encyclopedic information etc.). Uziel302 (talk) 13:46, 6 April 2020 (UTC)
Thanks very much, Pamputt and Uziel302. As the owner of the data, I will change the license to the appropriate one. Assuming that the problem with the license is solved, is there a bot or a tool to import the data automatically? Where should I start importing? --Sina.ahm (talk) 13:49, 6 April 2020 (UTC)
Sina.ahm, you can upload the data using the wikibase API, this is the code I used to upload Latin lexemes. You should do some testing, and when your script is ready and has some valid example edits, you should request permission to run as a bot. Uziel302 (talk) 14:09, 6 April 2020 (UTC)
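
Editorial sketch of what creating a new lexeme through the Wikibase API might look like, to make the advice above concrete. It is not the code linked in the thread. The parameter layout (`new=lexeme` plus a JSON `data` blob) is my understanding of `wbeditentity` for lexemes and should be verified. The language item ID, the language code and the sample lemma are illustrative placeholders, and authentication and the CSRF token are omitted.

```python
import json


def new_lexeme_payload(lemma, language_code, language_qid, category_qid):
    """Parameters for a wbeditentity call that creates one lexeme."""
    data = {
        "type": "lexeme",
        "lemmas": {language_code: {"language": language_code, "value": lemma}},
        "language": language_qid,
        "lexicalCategory": category_qid,
    }
    return {
        "action": "wbeditentity",
        "new": "lexeme",
        "data": json.dumps(data, ensure_ascii=False),
        "maxlag": "5",
        "format": "json",
    }


# "Q-LANGUAGE" is a deliberate placeholder for the language item;
# "ku" and the lemma are illustrative; "Q1084" is assumed to be the item for noun.
example = new_lexeme_payload("av", "ku", "Q-LANGUAGE", "Q1084")
```

Before creating anything, a real import would also need to search for an existing lexeme with the same lemma, language and lexical category, since the original question asks about merging with existing entries.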

New LexData version 0.2

I just released version 0.2 of LexData – my Python library to create and edit Lexicographical data. The changes include:

  • Support for bot-passwords
  • Respecting and properly handling maxlag
  • Claims can be added to existing Lexemes, Forms and Senses
  • Claims of arbitrary data types can be created
  • Updates via JSON are possible
  • Improved search
  • I added tests, so there should hopefully be considerably fewer bugs and regressions
  • Consistent logging
  • and more…

As before you can get it from Github or PyPI and the documentation can be found here. There are some minor incompatible changes but I tried to keep it compatible with the 0.1 version branch and only deprecate the old APIs. In most cases there should be no changes necessary. Happy Hacking! -- MichaelSchoenitzer (talk) 22:39, 17 April 2020 (UTC)

Very nice! I hope it will be put to good use in our tooling going forward.--So9q (talk) 22:52, 23 April 2020 (UTC)
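
Editorial sketch of what "respecting maxlag" can mean for a bot in general. This is not LexData's implementation, which is not shown in the thread. The function name, the injected `send` callable and the fixed wait time are all hypothetical; the only behaviour assumed from the MediaWiki API is that a lagged server answers with an error whose code is `maxlag`.

```python
import time


def call_with_maxlag(send, params, maxlag=5, max_retries=5,
                     wait_seconds=5.0, sleep=time.sleep):
    """Send an API request, backing off and retrying while the server is lagged.

    `send` is any callable that takes the parameter dict and returns the
    decoded JSON response as a dict (the HTTP layer is deliberately left out).
    """
    request = dict(params, maxlag=str(maxlag))
    for attempt in range(max_retries + 1):
        response = send(request)
        if response.get("error", {}).get("code") != "maxlag":
            return response  # success, or an unrelated error for the caller
        if attempt < max_retries:
            sleep(wait_seconds)
    raise RuntimeError("server still lagged after all retries")
```

Injecting `send` and `sleep` keeps the retry logic testable without network access; a real client would more likely wait for the duration the server suggests than for a fixed interval.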

Described by Wiktionaries


It seems complicated to use wikibase:sitelinks to a Wiktionary, so I suggest using P1343 to indicate when a form is described in a Wiktionary and when a sense is described. I made a test to say that the sequence of letters chat is described in French Wiktionary, and that the meaning associated with a pet is also described in French Wiktionary. Do you think it is acceptable? Another strategy may be to create a series of properties similar to P7829 (for Wiktionaries instead of Vikidia) to add links to the dedicated pages. Noé (talk) 07:52, 18 April 2020 (UTC)

@Noé: This seems reasonable to me, but it would be nice to also include a link to the specific Wiktionary page referred to, perhaps with a reference URL (P854) reference statement, or maybe there's a qualifier that would work? ArthurPSmith (talk) 15:16, 20 April 2020 (UTC)
I don't have any preference, but I'll be pleased to also include links to the Wiktionary pages and to the meanings (with an anchor to a definition when available). If a Sense is also connected to a Qid with a Wikipedia page via P5137, we may be able to have one or more links to Wiktionary in the side menu on the Wikipedia page, next to the other projects. That may definitively prove that Lexemes answer the initial goal of Wikidata:Wiktionary, helping Wiktionaries. Noé (talk) 15:34, 20 April 2020 (UTC)
I love the idea of linking Lexemes to Wiktionary (this was indeed one of the original goals after all), and vice-versa (that's exactly why I built the gadget mentioned here: #Gadget to link Wiktionary and Wikidata). This is definitely an idea worth looking deeper into and doing right. So for me, it's a big YES, this is acceptable and even desirable!
In practice, I'm not sure what the best way is: it could be described by source (P1343) (alone or with a qualifier, e.g. reference URL (P854)), it could be a linking property, or anything else. I'm also wondering what the best place is: described by source (P1343) is often used on the main Lexeme level (again alone or with qualifiers, for instance demonstrates form (P5830)); in the test, it is on both sense and form. It makes sense but I'm not sure it's really optimal.
If we put it on the form level and/or indicate the form as a qualifier, then there is no need to explicitly store the link, as the link is trivial to build (it could probably even be done by JavaScript inside the Wikidata interface itself), and even without this precision, most of the time the main lemma can be used by default. In fact, it seems to me a more reliable and elegant way to do it than using reference URL (P854), and even more so than a solution like English Vikidia ID (P7829). Caveat: it may depend both on the granularity wished (linking to the page itself or to an anchor in the page) and on the Wiktionary targeted, as each Wiktionary has a different structure, but on the other hand, it allows reusers to build links as they wish ;).
Cdlt, VIGNERON (talk) 16:59, 20 April 2020 (UTC)
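
Editorial sketch of why the link is "trivial to build" from a form representation or lemma, as argued above. The function name is hypothetical, and the URL pattern assumes the usual `https://<code>.wiktionary.org/wiki/<title>` layout; the optional anchor is only meaningful if the target Wiktionary provides such anchors, which varies per project.

```python
from urllib.parse import quote


def wiktionary_url(wiktionary_code, representation, anchor=None):
    """Build a Wiktionary page URL from a form representation or lemma."""
    # MediaWiki titles use underscores where the displayed text has spaces.
    title = representation.replace(" ", "_")
    url = f"https://{wiktionary_code}.wiktionary.org/wiki/{quote(title, safe='')}"
    if anchor:
        url += "#" + quote(anchor, safe="")
    return url
```

For example, `wiktionary_url("fr", "chat")` yields the page for the "chat" entry on French Wiktionary. Because the link is derived from data already on the lexeme, nothing extra has to be stored or kept in sync.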
Hey, thanks to VIGNERON's live contribution yesterday, I was able to write a query to get all French entries that have a connection to a Qid associated with a Wikipedia page:
I am quite happy to see 500+ pages. I am wondering now if it could be a good sample to do a proof of concept for the idea presented above. So, it would be great if someone has an idea on how to check if those pages exist in French Wiktionary and then automatically add this statement to those entries. Then, when the last column of the query is filled (now it is only filled for "chat", you can search for it in the results), it could be possible to develop a gadget to indicate in Wiktionary the related Wikipedia page, and in Wikipedia to indicate the related French Wiktionary pages. It could be nice! Noé (talk) 08:45, 13 May 2020 (UTC)
@Noé: the gadget I spoke of earlier (#Gadget to link Wiktionary and Wikidata) could be adapted for your purpose. You should contact @Darmo117: who helped me build this gadget for the tech part, and obviously I can help with the SPARQL part. Cheers, VIGNERON (talk) 09:28, 13 May 2020 (UTC)
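
Editorial sketch of a query in the spirit of the one described above (lexemes of one language whose senses point, via P5137, to an item that has a Wikipedia article). The original query is not reproduced in the thread, so this is a reconstruction under assumptions rather than Noé's query. The function name is hypothetical and the prefixes are assumed to be predefined by the Wikidata Query Service.

```python
def senses_with_article_query(language_qid="Q150",
                              wiki_url="https://fr.wikipedia.org/"):
    """SPARQL for lexemes whose senses link to an item with an article.

    Q150 (French) and the French Wikipedia URL are the defaults used for
    illustration; both can be swapped for another language and wiki.
    """
    return f"""SELECT ?lexeme ?lemma ?sense ?item ?article WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lemma ?lemma ;
          ontolex:sense ?sense .
  ?sense wdt:P5137 ?item .
  ?article schema:about ?item ;
           schema:isPartOf <{wiki_url}> .
}}"""
```

The result rows pair each lemma with an article URL, which is the raw material a gadget would need to link in both directions.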

Mapping of Lexicographical data model

I was wondering if the data model itself can have statements that apply external mapping? I am doing most of the mapping as some of you know and I noticed that the Lexicographical data model could possibly be mapped to other ontologies where it made sense. For example, see where I recently made a small edit to this page where I added a equivalent to: note about skos:definition here to the Gloss explanation: "Enter a gloss (very short phrase defining the meaning)(equivalent to: skos:definition) — Gloss" but ideally this would be applied in the data model itself somehow? Thadguidry (talk) 17:05, 20 April 2020 (UTC)

@Thadguidry: good idea, I think this has at least partly already been done for the query service (where you indeed use skos:definition to get the gloss). Not sure exactly where it has been documented though; I used mw:Extension:WikibaseLexeme/RDF mapping when I started to do SPARQL on Lexemes, not sure if there is more or better documentation somewhere @Lea Lacroix (WMDE): ? Anyway, integrating it into the main documentation page sounds like a really good idea. Cdlt, VIGNERON (talk) 17:33, 20 April 2020 (UTC)
Hey @Thadguidry:,
I don't fully understand what you mean. Do you want to improve the documentation? Or the descriptions on Lexemes themselves? Can you provide some examples so I can understand what the request is? Thanks! Lea Lacroix (WMDE) (talk) 07:08, 21 April 2020 (UTC)
Hi @Lea Lacroix (WMDE):!,
Sure, we already have a few properties that allow mapping both Properties and Entities via external ontologies, such as P2888 exact match and P2236 external subproperty and others. Here's an example of the mapping being done inside Wikidata itself: P2561 name. So the idea is that we would be able to apply a few statements or assertions on the various fields (like Gloss) in the Wikibase Lexeme RDF mapping itself showing relations to external ontologies, as well as on individual Senses. For example: Gloss is already mapped, but hardcoded, so I think it's best to expose this somehow to allow edits to be made, as there are other ontologies (now as shown in LOV and future) that should be aligned (not just Lemon and SKOS). If the Wikibase Lexeme does allow writing this kind of information, then that is the documentation that I am missing. Thadguidry (talk) 12:29, 21 April 2020 (UTC)
UPDATE: It looks like it IS POSSIBLE to directly apply the mapping however!! (this wasn't working for me initially several months ago and don't know why). Take a look at the "exact match" I applied here for the concept/Sense of "first". So, I think the only thing that is left to do is documentation showing how to apply mapping to the Lex data model itself? How was Gloss linked to skos:definition in the RDF? And as a user, I would probably expect to see that information on Wikidata:Lexicographical_data/Documentation Thadguidry (talk) 12:29, 21 April 2020 (UTC)
Thanks a lot @Thadguidry:! Yes, I think Wikidata:Lexicographical_data/Documentation would be a good place to mention the mapping of the Lexeme data model to a few other models. Feel free to continue improving the page :) Lea Lacroix (WMDE) (talk) 16:01, 21 April 2020 (UTC)
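To make the Gloss to skos:definition mapping discussed in this thread concrete, here is a minimal sketch that builds a SPARQL query reading the glosses of one lexeme through `ontolex:sense` and `skos:definition`. The function name `gloss_query` is hypothetical, and the query only illustrates how the mapping is used on the query service; it is not taken from the documentation page.

```python
import re

# Standard namespace declarations, spelled out so the query is self-contained.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>\n"
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
)


def gloss_query(lexeme_id: str) -> str:
    """Return a SPARQL query listing the senses and glosses of one lexeme.

    Hypothetical helper: glosses are read via skos:definition on each sense.
    """
    if not re.fullmatch(r"L[1-9][0-9]*", lexeme_id):
        raise ValueError(f"not a lexeme ID: {lexeme_id!r}")
    return PREFIXES + (
        "SELECT ?sense ?gloss WHERE {\n"
        f"  wd:{lexeme_id} ontolex:sense ?sense .\n"
        "  ?sense skos:definition ?gloss .\n"
        "}\n"
    )


print(gloss_query("L19811"))
```

The resulting text can be pasted into the query service; `L19811` is simply a lexeme ID mentioned elsewhere on this page.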

Abbreviation as separate lexeme or not?

See kilomètre (L19811) (has "km" as form). --So9q (talk) 10:19, 28 April 2020 (UTC)

  • In some languages (pl?), they are separate lexemes to allow addition of the usual 100? forms present on any lexeme. --- Jura 10:36, 28 April 2020 (UTC)
I think this would strongly depend on language. When I was thinking how would I deal with abbreviations in Czech language I did not make any final conclusion. Generally I would follow some self-imposed rules with reasoned exceptions: 1) Abbreviations that act and are read as any other word (=acronyms?) should be separate lexemes; 2) abbreviations that are meant for writing only and are usually read unabbreviated (ex.; b.; etc.) should not be considered separate lexemes, IMO some kind of (new?) property would be best for them but listing them as forms might work as well; 3) Symbols should not be considered separate lexemes and some kind of (new?) property would be best for them (and I would not list them as forms). --Lexicolover (talk) 20:57, 1 May 2020 (UTC)

New Version of MachtSinn

Many of you might already know my tool "MachtSinn", which allows users to easily add senses to lexemes, generated from Wikidata items. In the last few weeks I have again improved it significantly. You might want to take another look at it.

Compared with the first version announced here some time ago, there are a lot of improvements:

  • You can add glosses not only in the lexeme's language, but in arbitrary languages at the same time!
  • There are now keyboard shortcuts
  • New address: – thanks to the Cloud Services team at WMF
  • More potential matches especially for languages with smaller Wikipedias
  • Fewer false positives – especially for English
  • To have fewer false positives it's possible to add separate optimized queries for each language (currently only English).
  • Improved, responsive design – should work on mobile devices
  • English verbs are prefixed with to, German nouns are prefixed with Der/Die/Das. You can add prefixes for other languages.
  • You can edit the gloss, if it is not an appropriate description
  • Your browser language is used as default language
  • The grammatical gender and lexicographical category are displayed
  • The database is regularly updated
  • There is a help text and improved statistics
  • The amount of network traffic is minimized so that it should work well even with a slow connection
  • Fixed lots of bugs

Big thanks to @Incabell, @DDuplinszki, @Ainali, @so9q and @dzarmola for their help. And thanks to everyone using it to add senses – already 15'000 senses have been added. 35'000 potential matches are currently waiting for you. ;) The code is available on GitHub – pull requests are welcome. Have fun and stay safe. -- MichaelSchoenitzer (talk) 23:26, 30 April 2020 (UTC)

CEFR language competence level for lexeme

Hi! For many languages, there are published so-called "word lists" that a person whose competence corresponds to a particular level (say, A2) is expected to know. I think the community would benefit a lot if those lists could be imported into Wikidata Lexemes as well, so that on the lexeme page we could see which CEFR (or any other scale) level a word corresponds to. But I could not find any suitable property. What do we think about this? Could a property be added, and what should the ontology be?  – The preceding unsigned comment was added by 62mkv (talk • contribs).

I have been wondering how best to mark that a word appears in the Goethe Institute's word lists for German. I think rather than a specific property for CEFR or level, I would use something more general, such as "on word list", with a link to an item like "Goethe Institute B1 word list" which would have information about who published it, when and where. That way we could link to non-CEFR word lists too. - Nikki (talk) 10:11, 5 May 2020 (UTC)
Sounds good to me.--So9q (talk) 16:15, 8 May 2020 (UTC)

How broad should the senses reach?

Hi, I stumbled upon this lexeme today. It has senses covering not only the sense in the language of the lexeme but also the similar concept in other countries. I imagine we could continue down this road and add senses for the similar concept in all countries that have it. But is that really what we want?--So9q (talk) 03:59, 9 May 2020 (UTC)

  • @So9q: I have been generally NOT adding country-specific senses even where Wikidata has items - military ranks are a very common case, for example lieutenant has both general items and specific items for "British Army and Royal Marines", "french military", "Canadian Armed Forces", "Royal Navy", "Starfleet" etc. I would advocate for only adding the general meanings as senses, the specific ones really don't add anything significant to the meaning. ArthurPSmith (talk) 14:38, 11 May 2020 (UTC)
    • Anyone else have an opinion on this? I agree with @ArthurPSmith:.--So9q (talk) 18:52, 13 May 2020 (UTC)
  • @So9q: When adding senses to Russian lexemes I run into the same problem (especially with military ranks too). I tend to do the same way as ArthurPSmith. --Infovarius (talk) 02:57, 15 May 2020 (UTC)
  • Sorry to break the consensus but I disagree a bit. For me, specific senses seem both useful and necessary in a lot of cases. Despite having the same name and etymology, some senses can cover very different realities (maybe they should even be separate Lexemes? probably not but I wonder...). For instance canton (L18778): in Switzerland a "canton" is a very big administrative unit (similar to a region or a state in the United States) while in France, a "canton" is a very small administrative unit (similar to a county or more often to a quarter of a city). That said, the structure in place on socken (L58286) is maybe not the best (and by the way, I'm notifying @Vesihiisi:, who is the best person to talk about this lexeme she created) and maybe we can find a better one, as these are indeed not truly separate senses but more "derived" senses. Cheers, VIGNERON (talk) 14:42, 15 May 2020 (UTC)

Wikidata:WikiProject Lexicographical Data ?


Is there a team of lexicographers hidden somewhere? Have the people adding lexicographical data already gathered around a place where initiatives and personal projects are discussed? And finally, is there a logo for this team/group/bunch of people, or for Lexicolovers in general? I was looking for a userbox icon showing interest in Lexeme data but I haven't found one. Is it really too early to create a team spirit here? Noé (talk) 06:34, 16 May 2020 (UTC)

@Noé: the « team » is hidden here in plain sight ;)
Yes, again it's here. Welcome!
Not that I know of, but feel free to propose one!
This is the second time it has been asked (at least, after #User box for the Lexicographical data project ?), so I created one: {{User LexData}} (with the glyph of ama/𒂼 (L1) as the image in the meantime; can someone activate the translation tag?). Team spirit exists without symbols, but symbols are indeed useful for team spirit.
Cheers, VIGNERON (talk) 10:05, 16 May 2020 (UTC)
So, if it's a WikiProject, I recategorised it into the category for WikiProjects. For the team, there is no list of participants, as there is for other projects, but that is maybe more a Wikipedian habit than a Wikidata one. Great for the userbox! I thought, for the logo, of an L made of Wikidata lines, with the same colors, but it could be hard to read at a small size. The ama sign is pretty! Noé (talk) 10:29, 16 May 2020 (UTC)

Gender in French

@Lepticed7: Does Lexeme:L241 really have 2 different genders? I'd propose to separate this into 2 lexemes: "chien" and "chienne" with definite genders. --Infovarius (talk) 00:20, 17 May 2020 (UTC)

@Infovarius: Hi! The fact is that in French, two things characterize nouns: gender and number. For gender, we have both masculine and feminine, and (almost?) every noun in French is either masculine, feminine, or sometimes both (it can change depending on multiple factors; one word in this case is « chips »). We even have nouns that are masculine when singular, but feminine when plural (like « amour » (love)). For the classic pet animals (dog, cat) or for the farm animals (cow, pig, chicken, etc.), we have a version of the word to identify male animals (« chien », « chat », « canard », « verrat »), and these words are masculine; and we have the feminine words to identify female animals (« chienne », « chatte », « cane », « truie »). Some are "just" inflections (is this the right word?) using suffixes, generally « -e », and some are not (like « verrat » and « truie »). It would be great to have more points of view on this topic, but because gender and number characterize a form, and not a lexeme, and words in French (nouns or adjectives) are presented by giving this pair, I think we should not present this information on the lexeme, but on the forms. And we should not separate these two pieces of information. Lepticed7 (talk) 02:04, 17 May 2020 (UTC)
flexion in French is inflection in English, désinence is verbal inflection. It sounds like a heavy metal band to me! Noé (talk) 08:55, 17 May 2020 (UTC)
My opinion is that "verrat"/"truie" are not different forms of 1 word but are different words (linked with some relation). We have the same in Russian: кабан/свинья. Nouns are not inflected by genders (like adjectives)! They have gender! --Infovarius (talk) 23:52, 18 May 2020 (UTC)
It’s okay for me this way. I modified it in the first place because the lexeme presented both the masculine and feminine genders. But if, for nouns, we make separate lexemes for genders, I agree :D Lepticed7 (talk) 07:42, 19 May 2020 (UTC)
  • It's already appropriately separated: there is chienne (L29225) for the female. We just need to find a good way to link the two entities. --- Jura 09:01, 17 May 2020 (UTC)
Whether or not to separate lexemes based on gender has been an open question since the beginning of Lexemes (and even before, as it was also an example on the test platform); see Wikidata_talk:Lexicographical_data/Archive/2019/05#Lexemes_and_gender_of_noun for instance (where I list some cases).
I don't have a strong opinion but I must say I'm not really convinced by the "two lexemes based on gender" solution by default. "chien" and "chienne" are the same lexeme: same lexical category, same etymology, same morphology, almost the same everything, except for gender. @Infovarius: « Nouns are not inflected by genders », really? Cases like chien/chienne (in French), perro/perra (in Spanish) or Lehrer/Lehrerin (in German) are clearly inflections (and this is in fact the most common case; most nouns have forms depending on gender). This is also not what the sources seem to say on en:Grammatical gender, or look for "gender inflection" on Google Books, which gives many results. « They have gender! » Yes, but this gender is not always unique or even existent: they can have 0, 1, 2, 3 or more genders. Having one gender is maybe the most common case but there are a lot of exceptions (especially if you consider diachronic or dialectal data).
In the end, the situation is *very* complicated. Maybe two lexemes can help to be more precise, but it is also more complicated and raises many new questions. Should we create a duplicate for each gender even when the gender is unmarked, like ministre (L19816)? And what about suppletion (Q324982) (when the inflected forms are not related, like "verrat"/"truie" for gender, but there is also the same phenomenon for number, like "ki"/"chas" ki (L69) in Breton)?
Cheers, VIGNERON (talk) 11:09, 20 May 2020 (UTC)
True, but gender behaves more or less the same in most languages, and no language has ever had a "one lexeme has only one gender" iron law; there are always exceptions. At least for French, many words don't have only one gender, so we should talk in depth about how to deal with that. To start, here is the query for the (currently) 43 Lexemes with both masculine and feminine gender.
I didn't notice that you created this table {{Single or multiple lexeme}}. It seems interesting but I'm a bit confused: why is it in the data model? Was it announced somewhere? Where does it come from? (You put "from talk archive" in the edit summary but that is very vague; which talk archive?) And how should it be read?
Cdlt, VIGNERON (talk) 18:12, 22 May 2020 (UTC)
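For readers who want to reproduce a list like the one mentioned above, here is a minimal sketch that builds a SPARQL query for lexemes of one language carrying several grammatical gender (P5185) values at once. It is not the query linked in the thread: the function name is hypothetical, and the item IDs used in the example (Q150 for French, Q499327 for masculine, Q1775415 for feminine) are assumptions that should be checked before use.

```python
import re
from typing import List

PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
    "PREFIX dct: <http://purl.org/dc/terms/>\n"
    "PREFIX wikibase: <http://wikiba.se/ontology#>\n"
)


def _check_qid(qid: str) -> str:
    """Reject anything that does not look like an item ID."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not an item ID: {qid!r}")
    return qid


def lexemes_with_genders_query(language_qid: str, gender_qids: List[str]) -> str:
    """Return a query for lexemes that have all the given P5185 values.

    Hypothetical helper; the comma-separated object list means the lexeme
    must carry every listed gender on the lexeme itself.
    """
    language = _check_qid(language_qid)
    genders = ", ".join(f"wd:{_check_qid(q)}" for q in gender_qids)
    return PREFIXES + (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  ?lexeme dct:language wd:{language} ;\n"
        "          wikibase:lemma ?lemma ;\n"
        f"          wdt:P5185 {genders} .\n"
        "}\n"
    )


# Assumed IDs: Q150 = French, Q499327 = masculine, Q1775415 = feminine.
print(lexemes_with_genders_query("Q150", ["Q499327", "Q1775415"]))
```

Note that this only finds genders stated on the lexeme; genders recorded as grammatical features of forms would need a different pattern.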

Some help to use LexData

Hi, I'm trying to use LexData to create lexemes, but I get a message like: INFO:root:Maxlag hit, waiting for 5.0 seconds. If the lexeme already exists, everything is okay, but when the lexeme doesn’t exist, I get this message. What am I supposed to do? Thanks! Lepticed7 (talk) 08:58, 18 May 2020 (UTC)

@Lepticed7: You are very likely not doing anything wrong; Wikidata disallows "bot" edits when the "maxlag" value is too high, due to too many backlogged edits that need to be processed. This happens quite often, I suggest you just wait a few minutes and try again (maybe several times). There are also grafana charts that can show you the current maxlag value so you know when it won't work, this one in particular. ArthurPSmith (talk) 17:39, 19 May 2020 (UTC)
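The "wait and try again" advice can be sketched as a small retry loop. This is not LexData's own code: `call_with_maxlag_retry` and its parameters are hypothetical, and the sketch only assumes that the API reports lag as a JSON error whose `code` is `maxlag`, as the MediaWiki API does when the `maxlag` parameter is exceeded.

```python
import time
from typing import Callable


def call_with_maxlag_retry(
    request: Callable[[], dict],
    max_attempts: int = 5,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `request` until it stops reporting a maxlag error.

    Hypothetical helper: `request` is any function that performs one API
    call and returns the decoded JSON response. The wait doubles after
    each lagged attempt (5 s, 10 s, 20 s, ... with the defaults).
    """
    for attempt in range(max_attempts):
        response = request()
        if response.get("error", {}).get("code") != "maxlag":
            return response  # success, or an unrelated error for the caller
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"server still lagged after {max_attempts} attempts")


# Simulated responses instead of real API calls: lagged twice, then accepted.
fake_responses = [
    {"error": {"code": "maxlag"}},
    {"error": {"code": "maxlag"}},
    {"success": 1},
]
waits = []
result = call_with_maxlag_retry(lambda: fake_responses.pop(0), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example quick to run; in real use the default `time.sleep` would apply, and a long-lasting lag would still end in an error rather than an endless loop.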

Please double check this lexeme

Hi, please check låne (L300647) and tell me if it is wrong. I'm new at this. Iwan.Aucamp (talk) 00:29, 20 May 2020 (UTC)

@Jon Harald Søby: could you take a look? Cheers, VIGNERON (talk) 10:28, 20 May 2020 (UTC)
@Iwan.Aucamp: It looks alright to me, except I don't understand why there are grammatical gender (P5185) and requires grammatical feature (P5713) statements on the forms, that seems redundant to the grammatical features listed. Also the S2 is – AFAIK – only in the phrase "låne tid", not for the base form "låne". Jon Harald Søby (talk) 14:26, 20 May 2020 (UTC)
@Jon Harald Søby: Thanks for the feedback, I adjusted it and removed the sense. I'm not very familiar with best practices and just trying to get the feel for it so the input is much appreciated. Iwan.Aucamp (talk) 14:53, 20 May 2020 (UTC)

Model lexemes and language communities

Is there a place where we can define model lexemes for languages? Maybe a new property is needed for that similar to model item (P5869)? I think if we have some model lexemes for each language it will make it easier to manage.

Also is there some approach for community coordination around specific languages similar to wikiprojects?

Iwan.Aucamp (talk) 14:54, 20 May 2020 (UTC)

@Iwan.Aucamp: some people started pages for specific languages, listed on Wikidata:Lexicographical data/Documentation/Languages. Some of these pages are not bad (I worked a lot on Breton: Wikidata:Lexicographical data/Documentation/Languages/br) but most are still stubs with only basic information. Feel free to create one. Also, to all lexical lovers, it would be nice to have feedback to improve them and to give them a coherent structure (not to be strongly enforced, but as a suggestion most sections could be similar). Any comments are welcome on Wikidata talk:Lexicographical data/Documentation/Languages. Cheers, VIGNERON (talk) 15:56, 20 May 2020 (UTC)

Modeling etymologies

Hello! We now have a lot of lexemes in many languages, including Latin, and this helps us build etymologies. I have been talking about this with a friend of mine who works in this area and he has given me some advice on how to proceed, but I would need your opinion on how to model it.

The derived from (P5191) logic would be this:

But he proposes to use something like this:

1 This word is still in use in some Basque dialects

How could we model the cf. (Q1048501) items? Is that a property? Or do we have a model for that?

@VIGNERON, Uziel302: -Theklan (talk) 09:06, 22 May 2020 (UTC)

@Theklan: very interesting suggestion (etymology is indeed more than just a straight line).
For « eleiza1 », I would simply put it as a form.
Indeed, a "confer" property could be useful. Maybe we could simply use an existing property (is there a fitting one? Not see also (P1659), which is only for properties), but a new property may be cleaner. And would we use it as a qualifier of derived from (P5191) or as a direct property?
Cdlt, VIGNERON (talk) 09:52, 22 May 2020 (UTC)
@VIGNERON: The property "confer" is problematic, because... should it be another lexeme, or could be any string? -Theklan (talk) 09:59, 22 May 2020 (UTC)
@Theklan: definitely not a string, but it could be either a lexeme or a form of a lexeme. I would lean toward the first, a lexeme: in most cases it's precise enough and then you can use the qualifier demonstrates form (P5830) (which means that in this case, "confer" must be a direct property and not a qualifier, since qualifiers cannot get qualifiers themselves). Plus, a lexeme can have (and often has) multiple homographic forms, so again in this configuration a lexeme is better.
Here is a suggestion for the real and simple example of bara (L2283):
bara (L2284) (normal rank)
1 reference: stated in (P248) Q50915490

As for the name, "confer" is probably a bit too general; maybe "cognate" would be a better name for this (but maybe it's too narrow and too pedantic, and not exactly what you mean…), or maybe "related lexeme" (or as an alias?).
Cheers, VIGNERON (talk) 10:29, 22 May 2020 (UTC)
I'm not an etymologist, but I think that confer is a way to compare something. If I say that eleiza (L300882) ↔ *egleisa (old Occitan) I can add there as qualifier "confer" Old Gascon gléisa.
derived from: *egleisa (normal rank)
confer: gléisa (Old Gascon)
0 references
-Theklan (talk) 10:41, 22 May 2020 (UTC)

I have shown him this conversation and we propose that both cognate and confer should be created. In most cases the two may be interchangeable (gléisa is a cognate of eleiza and both come from *egleisa), but in some cases the confer property may demonstrate how a word can change by analogy, and wouldn't be related to the word itself. With confer you can point to a well-attested vowel change that should be noted for a poorly attested change in another word, as a process. Cognate would then be used in most cases. -Theklan (talk) 11:35, 22 May 2020 (UTC)

@VIGNERON: There is this proposal by Fnielsen (talk • contribs • logs). Theklan (talk) 13:04, 22 May 2020 (UTC)
Here is an example of what could be done: -Theklan (talk) 17:41, 22 May 2020 (UTC)
Hello! "Confer" or "derived" is not good, IMHO. I have been doing etymology on French Wiktionary for a long time, and if you want to properly link two lexemes with an etymological link, you need to explain four angles of analysis: morphology, historical phonetics, historical semantics and contextual analysis (like the history of the peoples who created the word). If you link a word with an etymon, you need to explain why in the relation. And pay attention: a great many words can’t simply be linked to another (especially because of the etymological structures of the lexicon). "Derived" is a specific linguistic term in etymology used only for certain cases. Lyokoï (talk) 11:35, 24 May 2020 (UTC)
@Theklan: if tools (like the Wikidata lexeme graph builder, SPARQL queries and many others) can already give you the cognates, why store them explicitly? That said, a "confer" (or whatever name) qualifier (your solution above is indeed better than mine) can be useful to point to a specific cognate relevant for the etymology.
@Lyokoï: very interesting, could you tell us more? How would you model that? (With direct properties or qualifiers? I imagine the latter; and how does it compare to the current model, where we already have two properties for etymology and morphology?) And ideally, do you have any references about that?
Cdlt, VIGNERON (talk) 12:54, 24 May 2020 (UTC)
My 2 cents: Wikidata is a secondary database. I uploaded Latin here based on Whitaker's WORDS and I think any relation between words should be based on sources. If we have a source claiming that one word is derived from another word, we should be able to link those forms, not just the lexemes. We can have multiple sources offering different etymologies. We shouldn't force any systematic relations between words unless there is consensus in the academic sources, let alone guess etymologies based on forms and sound; it's very easy to get a false etymology (Q17013103). Uziel302 (talk) 18:50, 24 May 2020 (UTC)
Some terms in this proposal are reconstructed words (Q55074511) and based on prior discussions, I feel it is still unclear if those are Lexeme or not and how to source them. Is it the right time to restart this conversation? Noé (talk) 10:13, 25 May 2020 (UTC)

I think Wikidata would greatly benefit if we had a clear vision of how it should look in the end. At this point we rely on derived from (P5191) and combines (P5238) (and it is not always done correctly). A confer property might give us some interesting views on etymological data. But there are other issues we should deal with. For example my dictionary is full of entries saying something like "origin is unclear, maybe it has something to do with X, there are opinions it is unlikely" or "probably onomatopoetic origin". Sometimes we can't be sure what the most direct predecessor is (for example whether the relation between lexemes A, B, C is A → B → C or B ← A → C or A → C → B, ...). @Theklan: Do you think you could, along with your friend, come up with some best practices for common etymology issues on Wikidata? --Lexicolover (talk) 10:19, 26 May 2020 (UTC)
@Lexicolover: I don't think we can model something universal, and that's why we need to discuss good practice. In Basque, etymologies are very unclear, so only really clear etymologies are added. I would go with that, and that's why confer may give interesting information to the reader on how an etymology has been worked out. This example is a good example of what can be done: .
@Noé: I think we need to restart that conversation, if it was closed. -Theklan (talk) 11:05, 26 May 2020 (UTC)