Wikidata talk:Lexicographical data

Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.


On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2018/07.


New tool: graph builder

Yesterday evening I spent some time assembling the etymology of L129, and then I wanted to see the result graphically, so I hacked together a version of the Wikidata Graph Builder that works for lexemes: the Wikidata Lexeme Graph Builder. (It actually supports items and properties as well, but for those you might as well use the Wikidata Graph Builder, with its extra features and all.) The website is really rudimentary (I’ll add a title, instructions, etc. later), and it only supports forward searching, but it’s enough for simple graphs (start from one or more entities and follow statements of a certain property). --Lucas Werkmeister (talk) 10:23, 29 May 2018 (UTC)

@Lucas Werkmeister: what about a graph of inverse properties? (If I want to know all words derived from some Proto-European lexeme...) --Infovarius (talk) 14:35, 9 June 2018 (UTC)
@Infovarius: You can find those lexemes using Special:WhatLinksHere, but I don’t plan to integrate that into the tool. --Lucas Werkmeister (talk) 20:05, 10 June 2018 (UTC)
@Lucas Werkmeister: It's up to you, of course. But generally it seems more interesting to see a tree than a line (German nouns like L129 are an exception, I suppose). And yes, I ask for an upgrade of the tool :) --Infovarius (talk) 09:55, 12 June 2018 (UTC)
@Infovarius: ✓ Done, turned out to be not so difficult to implement after all :) here are the words derived from *uber, for example: https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L2087&predicate=P5191 --Lucas Werkmeister (talk) 17:54, 15 June 2018 (UTC)
Great! Thank you! --Infovarius (talk) 12:46, 16 June 2018 (UTC)
@Lucas Werkmeister: would it be possible to have several properties at the same time? For instance, both derived from (P5191) and compound of (P5238): since Aberystwyth (L4730) is derived from (P5191) Aberystwyth (L4729), and Aberystwyth (L4729) is a compound of (P5238) aber (L4732) + Ystwyth (L4731), it would make a nice graph on https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L4730&predicate=P5191P5238. Cdlt, VIGNERON (talk) 07:06, 27 June 2018 (UTC)
@VIGNERON: sure, should be possible. I’ve filed #7, but I won’t have the time to implement this for a few days at least, unfortunately. --Lucas Werkmeister (talk) 20:58, 29 June 2018 (UTC)
@VIGNERON: Done, you just have to separate the property IDs with a comma (just like for the entity IDs), so the correct link for your example is https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L4730&predicates=P5191,P5238. --Lucas Werkmeister (talk) 13:02, 4 July 2018 (UTC)
@Lucas Werkmeister: wonderful, a small suggestion, could it have different colours and/or textures for different properties? Cdlt, VIGNERON (talk) 11:45, 5 July 2018 (UTC)
@VIGNERON: better now? --Lucas Werkmeister (talk) 15:30, 8 July 2018 (UTC)
@Lucas Werkmeister: yessss ! now I just have to add more lexemes to have a nice graph. Cdlt, VIGNERON (talk) 17:06, 8 July 2018 (UTC)
@VIGNERON: try this: https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L5573&predicates=P5191%2CP5238 KaMan (talk) 17:16, 8 July 2018 (UTC)
Minor update: support for forms should be much better now – see e. g. https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L123&predicates=P5188. --Lucas Werkmeister (talk) 14:50, 4 July 2018 (UTC)
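
The link format discussed above (comma-separated entity IDs in subjects, comma-separated property IDs in predicates) can be sketched in a few lines of Python. This is only an illustration built from the URLs quoted in this thread; the function name graph_builder_url is made up for the example and is not part of the tool.

```python
from urllib.parse import urlencode

# Base URL as quoted in the discussion above.
BASE_URL = "https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/"


def graph_builder_url(subjects, predicates):
    """Build a graph builder link (hypothetical helper, not part of the tool).

    Both lists are joined with commas, as described in the thread;
    urlencode percent-encodes the comma as %2C, which the tool accepts
    according to the example link posted above.
    """
    query = urlencode({
        "subjects": ",".join(subjects),
        "predicates": ",".join(predicates),
    })
    return f"{BASE_URL}?{query}"


url = graph_builder_url(["L5573"], ["P5191", "P5238"])
print(url)
```

The result has the same shape as the %2C link posted above for L5573.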

@Lucas Werkmeister: would it be possible to automatically scale the graph so it fits the screen? I've just created https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L6291&predicates=P5191 based on wiktionaries and cannot fit it in the view area. KaMan (talk) 12:38, 10 July 2018 (UTC)

@KaMan: sorry, I’m not sure what the problem is – can you perhaps take a screenshot? (And does it help if you zoom out and then reload the page?) --Lucas Werkmeister (talk) 14:02, 10 July 2018 (UTC)
@Lucas Werkmeister: yes, zoom out and reload solved the problem, thanks. KaMan (talk) 07:48, 11 July 2018 (UTC)

New tool: Wikidata Lexeme Forms

I created another tool for lexicographical data: Wikidata Lexeme Forms. It shows you a form to create a new lexeme with a standard set of forms, e. g. the declensions of a German or Latin noun. It should be possible to support most languages and lexeme types – if the data model for your language is already clear, please let me know! (English verbs don’t seem to be ready yet, as discussed above, so I didn’t add those yet. For English nouns, the tool probably doesn’t make sense anyways, since English lost case forms in nouns.) --Lucas Werkmeister (talk) 11:38, 13 June 2018 (UTC)

Yes, this is the way it should be done. But what to do with the ~hundred Russian declensions? And I thought that German nouns have many more paradigms... --Infovarius (talk) 13:34, 13 June 2018 (UTC)
@Infovarius: well, different declensions don’t really matter to the tool, I think, since it asks you to enter all the different forms manually. The tool only needs enough information to form a grammatically correct sentence around the form, that’s why it’s split up by grammatical gender in German and Latin. (Other languages might need other criteria here.) Does that help with the situation in Russian as well? --Lucas Werkmeister (talk) 14:41, 13 June 2018 (UTC)
  • An interesting start. Looks like we lack a way to state the applicable declension in Wikidata and store (or retrieve) the relevant parts.
    --- Jura 13:46, 13 June 2018 (UTC)

@Lucas Werkmeister: I've tried it to create Lexeme:L2879. It works quite well. Just one question: why have 3 entries (masculinum, femininum, neutrum)? One entry with a selector for the gender would be easier to use, no? PS: among the Latin cases, the vocative case (Q185077) and locative case (Q202142) are missing (not the most used Latin cases but it would be nice to have them). Cdlt, VIGNERON (talk) 15:11, 13 June 2018 (UTC)

@VIGNERON: regarding three entries: to some extent, development so far has been governed by what is better to implement, not to use ;) but I’m not sure what this selector would look like. A dropdown selector on the page where you enter the forms directly would require a lot of work, I think (to dynamically update the page each time – currently, the tool doesn’t use JavaScript at all), and I’m not sure if it would make much sense: if you started entering some forms and then switch the gender, what should happen? But I could probably group the entries on the index page (one entry for “German/Latin noun” and then three sub-entries for the genders) – would that be an improvement?
regarding cases: hm, yeah, I’m not sure about that… should all lexemes have forms for those cases? If I understand correctly, vocative and locative don’t really apply to most nouns. --Lucas Werkmeister (talk) 15:56, 13 June 2018 (UTC)
@Lucas Werkmeister: Ok, I understand. Another, simpler solution is indeed to group; for example, you could list the entries like this:
  • deutsches Substantiv [without link]
    • Maskulinum
    • Femininum
    • Neutrum
Locative is indeed rare and it could be skipped but (as far as I know) the vocative always exists (very often the same as the nominative but not always; identical or not, the vocative form should be stored in the lexeme).
Cdlt, VIGNERON (talk) 18:42, 13 June 2018 (UTC)
@VIGNERON: Okay, vocative added. (You can use the new “advanced” mode, where forms may be left empty, for words that really don’t have a vocative form.) --Lucas Werkmeister (talk) 23:00, 13 June 2018 (UTC)
@VIGNERON: I’ve now grouped the entries on the index page by their language codes, which is the simplest solution for now. Perhaps I’ll improve it later. --Lucas Werkmeister (talk) 13:41, 17 June 2018 (UTC)
@Lucas Werkmeister: Singular and Plural are capitalized in German. -- IvanP (talk) 21:02, 13 June 2018 (UTC)
@IvanP: WTF, indeed… I thought it must be used as an adjective in „Nominativ [Ss]ingular“ etc., because otherwise I don’t understand how that construction works (just two nouns next to each other?). Thanks! --Lucas Werkmeister (talk) 23:04, 13 June 2018 (UTC)

It would be nice to enhance the form so that alternative forms (for a German noun, typically the genitive singular of masculine and neuter nouns with the endings "-s" or "-es") can be entered simultaneously. Surely they can be added afterwards, though I don't expect too many tool-using contributors to do it, considering the tool is intended (and expected) to make the job easier, not harder ;) --Shlomo (talk) 06:07, 14 June 2018 (UTC)

@Shlomo: Hm, I’m not sure what a good user interface for this would look like, to be honest :/ --Lucas Werkmeister (talk) 15:34, 14 June 2018 (UTC)
@Lucas Werkmeister: As for German, I suppose it would be enough to add another input field for Genitiv Singular and for Dativ Singular; something like:
Genitiv Singular
 Das Eigentum des [Hundes_____].
 Das Eigentum des [Hunds______].
Dativ Singular
 Das gehört dem   [Hund_______].
 Das gehört dem   [Hunde______].
In the long run, it would surely be better to find a more general solution. There are languages with more than two variant forms, languages where variant forms can appear in different cases, and languages where variants are not 100 % equivalent, so that the hint sentence has to vary as well.--Shlomo (talk) 07:48, 18 June 2018 (UTC)
@Shlomo: I finally found a good user interface for this :) it’s really simple: you enter „Hund/Hunde“ in the input fields, and two forms will be created. (I’ve updated the placeholders to indicate this where it makes sense.) --Lucas Werkmeister (talk) 20:41, 21 July 2018 (UTC)
@Lucas Werkmeister: I've tested this new feature on aronia (L7313) and it works great, it made work very easy, thank You. KaMan (talk) 09:03, 22 July 2018 (UTC)
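
The slash convention described above ("Hund/Hunde" in one input field yields two forms) could work roughly as follows. This is a minimal sketch of the idea only, not the tool's actual code; the function names and the dictionary layout are invented for illustration.

```python
def split_variants(field_value):
    """Split an input such as "Hund/Hunde" into its variant spellings.

    Empty pieces (e.g. from a trailing slash or an empty field) are dropped.
    """
    variants = [part.strip() for part in field_value.split("/")]
    return [variant for variant in variants if variant]


def forms_for_field(field_value, grammatical_features):
    """Create one form per variant, all sharing the same grammatical features.

    The dict keys are hypothetical; a real tool would map them to the
    Wikibase form data model.
    """
    return [
        {"representation": variant, "grammatical_features": list(grammatical_features)}
        for variant in split_variants(field_value)
    ]


dative_forms = forms_for_field("Hund/Hunde", ["Dativ", "Singular"])
for form in dative_forms:
    print(form["representation"], form["grammatical_features"])
```

One consequence of this design is that a slash can no longer appear inside a single representation, which is presumably acceptable for the languages supported so far.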

Update: the tool now supports an “advanced” mode (click the corresponding button next to the “submit” one) where you can leave out some forms (e. g. for words that only have singular forms) and also specify a lexeme ID so that the forms are added to that lexeme instead of a new lexeme being created. --Lucas Werkmeister (talk)

@Lucas Werkmeister: I suggest an option to create a Lexeme without stating a grammatical gender for pluralia tantum such as Großeltern (cf. Genus von Pluraliatantum). -- IvanP (talk) 18:50, 15 June 2018 (UTC)
@IvanP: good point, does the proposal at User:Lucas Werkmeister/Wikidata Lexeme Forms/German#New template: deutsches Substantiv (Pluraletantum, kein Genus) look okay to you? --Lucas Werkmeister (talk) 12:24, 17 June 2018 (UTC)
@Lucas Werkmeister: Yes. -- IvanP (talk) 12:30, 17 June 2018 (UTC)
@IvanP: ✓ Done, thanks for the suggestion! --Lucas Werkmeister (talk) 13:02, 17 June 2018 (UTC)

I’ve started sketching out a process to add more templates to the tool. The general page about the tool (for now) is at User:Lucas Werkmeister/Wikidata Lexeme Forms, and to add support for your language, you take everything in the “adding support for a new language” section and copy it into a new subpage of the page, and then you replace all the explanations and examples (the “definition” parts of the definition lists – : in Wikitext) with the appropriate values you fill in the input box there and follow the instructions. (And at some point in this process please ping me as well, of course!) See /English and /German for two examples (discussed below and above this message, respectively). --Lucas Werkmeister (talk) 12:24, 17 June 2018 (UTC)

Contrary to what I said above, I think it would make sense to add support for some English templates after all, so that people can at least look at the tool in a language they’re likely to understand :) I’ve sketched out a very basic model for English nouns on User:Lucas Werkmeister/Wikidata Lexeme Forms/English#New template: nouns, based on some noun lexemes I’ve seen (e. g. lemon (L1921)) – does that look okay to you? --Lucas Werkmeister (talk) 12:24, 17 June 2018 (UTC)

@Lucas Werkmeister: I've finished /Polish with a first basic Polish template. Please let me know if there is anything to fix. Before the next Polish template, it would be valuable to be able to add claims to forms, which is not possible with the current template format; only claims on lexemes are allowed. KaMan (talk) 08:22, 20 June 2018 (UTC)
@KaMan: Thank you very much! Adding claims to forms makes a lot of sense, I’ve added that now – do you want to edit the Polish template or is it only necessary for future templates? --Lucas Werkmeister (talk) 13:18, 21 June 2018 (UTC)
@Lucas Werkmeister: for future templates only KaMan (talk) 13:25, 21 June 2018 (UTC)
@KaMan: alright, thanks! The template is live at toolforge:lexeme-forms/template/polish-noun/, can you try it out? --Lucas Werkmeister (talk) 13:48, 21 June 2018 (UTC)
@Lucas Werkmeister: I've tried it out and it works great both in easy and advanced mode. Thank You. I'll let you know when I prepare next Polish templates. KaMan (talk) 07:48, 22 June 2018 (UTC)
@KaMan: great, thank you! --Lucas Werkmeister (talk) 12:32, 22 June 2018 (UTC)
Update: I’ve added the template for English nouns, without possessive case for now (but do let me know if you have opinions on that). --Lucas Werkmeister (talk) 16:24, 12 July 2018 (UTC)

@Lucas Werkmeister: I've created the next Polish template. Please let me know whether the claims for forms are fine. KaMan (talk) 14:33, 24 June 2018 (UTC)

@KaMan: thanks! I’d like to understand what this means, though: the first “noun” template can be used for most nouns (including non-personal masculine ones), is that correct? Should its label perhaps be updated to clarify, or will someone who speaks Polish understand it anyways (e. g. because personal nouns are fairly rare)? --Lucas Werkmeister (talk) 21:41, 24 June 2018 (UTC)
@Lucas Werkmeister: You're right. I've clarified the label of the first template to say that it's the basic and simplest declension. KaMan (talk) 07:21, 25 June 2018 (UTC)
@KaMan: Thanks! Sorry that I’m so late to reply, and unfortunately I’m also going offline for a few days now – but I’ll update the tool next Tuesday or Wednesday. --Lucas Werkmeister (talk) 20:53, 29 June 2018 (UTC)
@KaMan: sorry, I need one more thing – the identifier for the template (the part that’s used in the URL). For the first one I guessed polish-noun, but now I’m not sure what the identifier for the second template should be, so I think it would be better if you provided it… (I’ve updated the first template on /Polish already.) --Lucas Werkmeister (talk) 21:04, 4 July 2018 (UTC)
@Lucas Werkmeister: identifier added. KaMan (talk) 07:24, 5 July 2018 (UTC)
@KaMan: thanks, template is live now. --Lucas Werkmeister (talk) 14:31, 8 July 2018 (UTC)

@Lucas Werkmeister: On Special:Preferences#mw-prefsection-watchlist I have enabled adding pages I create to my Special:Watchlist. But pages I create with your tool are not added to my watched pages. Could this be fixed, or is it an effect of using an external tool? KaMan (talk) 08:38, 4 July 2018 (UTC)

@KaMan: hm, that’s odd – I tried creating a lexeme via the wbeditentity API on Special:ApiSandbox and it ended up on my watchlist, but when created via the tool (which uses the same API) it didn’t. Not sure what could be the reason… perhaps OAuth? I’ll look into this, thanks for reporting! --Lucas Werkmeister (talk) 21:04, 4 July 2018 (UTC)
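
For readers who want to repeat the Special:ApiSandbox experiment mentioned above, the request parameters for creating a lexeme via wbeditentity look roughly like the sketch below. The JSON field names reflect my understanding of the WikibaseLexeme data format and should be checked against the API documentation; the helper name is hypothetical, the item IDs are only illustrative, and the edit token that a real request needs is deliberately left out.

```python
import json


def new_lexeme_request(lemma, language_code, language_item, category_item, forms):
    """Assemble wbeditentity parameters for a new lexeme (illustrative sketch).

    forms is a list of (representation, [grammatical feature item IDs]).
    A real request additionally needs a CSRF token and an authenticated session.
    """
    data = {
        "type": "lexeme",
        "lemmas": {language_code: {"language": language_code, "value": lemma}},
        "language": language_item,
        "lexicalCategory": category_item,
        "forms": [
            {
                "add": "",  # marks the form as new (assumed convention)
                "representations": {
                    language_code: {"language": language_code, "value": representation}
                },
                "grammaticalFeatures": features,
                "claims": [],
            }
            for representation, features in forms
        ],
    }
    return {
        "action": "wbeditentity",
        "new": "lexeme",
        "data": json.dumps(data),
        "format": "json",
    }


# Illustrative values: German (Q188), noun (Q1084), genitive case (Q146233), singular (Q110786).
params = new_lexeme_request("Hund", "de", "Q188", "Q1084", [("Hundes", ["Q146233", "Q110786"])])
print(params["data"])
```

Nothing here explains the watchlist difference; it only shows what the two code paths are expected to send in common.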

@Lucas Werkmeister: something strange happened with my new Polish template. On template page Form 2 and Form 10 both have genitive case (Q146233) in grammatical features but when noun is created there is DR Congo at the 2004 Summer Olympics (Q146223) instead. I fixed it by hand. KaMan (talk) 09:36, 9 July 2018 (UTC)

@KaMan: sorry, I typo’ed the item ID :( should be fixed now. --Lucas Werkmeister (talk) 14:18, 9 July 2018 (UTC)

@Lucas Werkmeister: I created a third Polish template. Can you add it in your spare time? Thanks in advance. KaMan (talk) 12:26, 12 July 2018 (UTC)

@KaMan: done, thank you! Hopefully I didn’t typo any item IDs this time :) --Lucas Werkmeister (talk) 16:14, 12 July 2018 (UTC)
@Lucas Werkmeister: unfortunately there are six typos this time. Here is the change I had to make to bring it in line with the template: https://www.wikidata.org/w/index.php?title=Lexeme%3AL6494&type=revision&diff=710070701&oldid=710069560 KaMan (talk) 09:17, 13 July 2018 (UTC)
@KaMan: oof, that sucks, I’m sorry. Better now? --Lucas Werkmeister (talk) 11:41, 14 July 2018 (UTC)
@Lucas Werkmeister: Nope, I had to apply again six fixes: https://www.wikidata.org/w/index.php?title=Lexeme%3AL6597&type=revision&diff=710701749&oldid=710699998 In forms F9..F14 there should be potential form (Q54944750) in claims like in form F8. KaMan (talk) 12:53, 14 July 2018 (UTC)
@KaMan: grr, I forgot to actually apply the fix before restarting the service. Try again please? --Lucas Werkmeister (talk) 13:39, 14 July 2018 (UTC)
@Lucas Werkmeister: Works fine now, thank You. KaMan (talk) 14:42, 14 July 2018 (UTC)
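
Typos like Q146223 for Q146233 are easy to make and hard to spot by eye. One possible safeguard, sketched here purely as a suggestion, is to compare every grammatical feature ID in a template against a hand-reviewed list before deploying it. The template layout and function name below are invented for the example; the item IDs are the ones mentioned on this page.

```python
# Hand-reviewed grammatical feature items (IDs and labels as mentioned on this page).
KNOWN_FEATURES = {
    "Q146233": "genitive case",
    "Q185077": "vocative case",
    "Q202142": "locative case",
    "Q838581": "comitative case",
}


def unknown_feature_ids(template_forms, known=KNOWN_FEATURES):
    """Return (form label, item ID) pairs whose ID is not in the reviewed list."""
    problems = []
    for form in template_forms:
        for item_id in form["grammatical_features"]:
            if item_id not in known:
                problems.append((form["label"], item_id))
    return problems


# Hypothetical template excerpt containing the typo reported above.
template_forms = [
    {"label": "Form 2", "grammatical_features": ["Q146223"]},   # typo
    {"label": "Form 10", "grammatical_features": ["Q146233"]},  # correct
]

for label, item_id in unknown_feature_ids(template_forms):
    print(f"{label}: {item_id} is not a reviewed grammatical feature")
```

An allow-list cannot catch a typo that happens to land on another reviewed feature, so it complements rather than replaces testing the template on a real lexeme.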

@Lucas Werkmeister: Hi, I created User:Lucas Werkmeister/Wikidata Lexeme Forms/Finnish. It includes a template for Finnish nouns. The template is still missing the comitative case (Q838581), because I don't yet know how to model it. However, due to the number of grammatical cases in Finnish, it would be helpful if the ones listed in the template could be added to the tool. Shinnin (talk) 11:28, 16 July 2018 (UTC)

@Shinnin: thank you, the template should be live now. Please try it out and check that I didn’t typo any of the item IDs! --Lucas Werkmeister (talk) 19:42, 16 July 2018 (UTC)
@Lucas Werkmeister: the item IDs seem to be correct. Thanks for adding them. On another topic, when editing an existing lexeme kirjasto (L6795) in the advanced mode, I get the warning message stating that L6795 has the same language and lemma as the one I'm 'trying to create'. This happens when I try to add 'kirjasto' as a nominative singular form to L6795. I'm editing a lexeme, not creating a new one, so it's confusing to get the error message in this situation. Shinnin (talk) 10:58, 17 July 2018 (UTC)
@Shinnin: thanks, that’s a great suggestion! Implemented. --Lucas Werkmeister (talk) 20:29, 17 July 2018 (UTC)

Lexemes should have the ability to have Wiki-links

In the current version the lexeme feature doesn't allow interwikilinks to be added to lexemes. Given that there are some Wikipedia articles that are about individual words, I think the ability to create those links is valuable. Otherwise, whenever Wikipedia has an article about a concept and another article about a name for the concept, we have a mess in our ontology. With the ability to add sitelinks to lexemes we can also clean up problems like the human (Q5)/Homo sapiens (Q15978631) duplication. ChristianKl❫ 15:48, 30 June 2018 (UTC)

@ChristianKl: items can have sitelinks, and soon it will be possible to link lexemes and items; wouldn't that solve the problem? (And more elegantly, I think: Homo sapiens (Q15978631) is not a word, it is a concept that can be represented by thousands of words.) Cdlt, VIGNERON (talk) 16:07, 30 June 2018 (UTC)
The problem is that both "human" and "homo sapiens" refer to the same concept. Currently, that means Albert Einstein (Q937) isn't instance of (P31) of Homo sapiens (Q15978631) and as a result it's not possible to infer that Albert Einstein (Q937) is a Primates (Q7380). It leads to further questions whether the taxon in which a "human bone" exists is "human" or "homo sapiens". It's messy. The ability to link lexemes with items doesn't help at all with the problem. ChristianKl❫ 11:23, 1 July 2018 (UTC)
@ChristianKl: First, "human" and "homo sapiens" are not *exactly* the same concept (if they were, we would only have one item). But, yes, you're right, they are very close and not well managed ontology-wise right now. But I don't understand how linking en:Human to the lexeme "human"@en and en:Homo sapiens to the lexeme "Homo sapiens"@en will solve the situation. Plus, if we really want links on Lexemes, they should be to wiktionaries, not to Wikipedias (we already have items for that). More importantly, I fail to see why you want to directly link en:Human to the lexeme "human"@en when there will probably be an indirect link: lexeme "human-S1"@en to human (Q5) and human (Q5) to en:Human. Could you explain a bit, please? Cdlt, VIGNERON (talk) 17:22, 1 July 2018 (UTC)
I didn't advocate linking en:Homo sapiens to the lexeme "Homo sapiens"@en but to link it to human (Q5).
Every Wikipedia article is supposed to be linked to exactly one object in Wikidata. This is necessary because it's the way Wikidata provides sitelinks for Wikipedia. I would like to keep that guarantee but say that sometimes that link is to an object in the Q-namespace and sometimes it's to an object in the L-namespace.
If we only link one of the two to Wikidata, we say that human (Q5) is instance of (P31) common name (Q502895). Saying that Albert Einstein (Q937) is instance of (P31) of something that's instance of (P31) a name, feels like it violates ontological assumptions.
To me Albert Einstein (Q937) feels like something that should be instance of (P31) of something that can have properties like a heart rate.
For the human/homo sapiens case this might seem like a hack. However, there are articles like https://en.wikipedia.org/wiki/While on Wikipedia that are clearly about individual words. It makes sense to link articles like that directly to lexemes. The relationship between house@enwiki and house@enwiktionary is not the same relationship as the relationship between while@enwiki and while@enwiktionary.
If we have items like while (Q7993606) in the Q-namespace then it's hard to explain why similar items for other words shouldn't be notable in the Q-namespace. ChristianKl❫ 18:18, 1 July 2018 (UTC)
Oh, now I understand you better and I totally agree that the P31 of Albert Einstein (Q937) (or of any individual living being, e.g. Bear JJ1 (Q492389), whose P31 is a taxon (Q16521)) can be problematic. When we do queries, we have to take detours to get correct results (which can be a bit speciesist BTW), but it's not undoable and it's possible to infer that Albert Einstein (Q937) is a Primates (Q7380) (through human (Q5) said to be the same as (P460) Homo sapiens (Q15978631) and then through parent taxon (P171)). We may need to improve this. What I don't get is how having sitelinks on lexemes would help. while (Q7993606) is an exception (in fact, AFAIK, it's the only item about a word of a specific language); we shouldn't build the entire structure around one unique case, but true, the system should take exceptions into account. Cdlt, VIGNERON (talk) 23:15, 1 July 2018 (UTC)
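
The inference path described above (instance of, then said to be the same as, then parent taxon) can be illustrated with a toy graph search. The edges below are a hand-written miniature, not real query results: the chain of taxa between Homo sapiens and Primates is collapsed into one placeholder node, and the function name is invented.

```python
from collections import deque

# Toy edge list: subject -> [(property, object)]. INTERMEDIATE_TAXA is a
# placeholder standing in for the real chain of parent taxa.
EDGES = {
    "Q937": [("P31", "Q5")],                         # Albert Einstein instance of human
    "Q5": [("P460", "Q15978631")],                   # human said to be the same as Homo sapiens
    "Q15978631": [("P171", "INTERMEDIATE_TAXA")],    # parent taxon (placeholder)
    "INTERMEDIATE_TAXA": [("P171", "Q7380")],        # ... eventually Primates
}


def find_path(start, goal, allowed=("P31", "P460", "P171")):
    """Breadth-first search following only the allowed properties."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for prop, target in EDGES.get(node, []):
            if prop in allowed and target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, prop, target)]))
    return None


path = find_path("Q937", "Q7380")
print(path)
print(find_path("Q937", "Q7380", allowed=("P31", "P171")))  # no P460 hop allowed
```

In this toy graph, dropping P460 from the allowed properties makes the search fail, which is the detour through "said to be the same as" that the comment above refers to.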
@VIGNERON: "only item about a word of a specific language"? Hm. How about these 173 (and probably many more on deeper levels):
SELECT ?item ?itemLabel ?lang ?langLabel 
WHERE 
{
  ?item wdt:P407 ?lang.
  { ?item wdt:P31 wd:Q8171. } UNION
  { ?item wdt:P31/wdt:P279 wd:Q8171. } UNION
  { ?item wdt:P31/wdt:P279/wdt:P279 wd:Q8171. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,ru,de,fr". }
}

--Infovarius (talk) 21:40, 9 July 2018 (UTC)
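
Since the query above spells out each subclass depth as a separate UNION branch, going to "deeper levels" means adding more branches by hand. A small generator can do that; the sketch below reproduces the same query text for a chosen maximum depth. The function name is made up, and whether deeper levels return sensible results has not been checked here.

```python
def word_items_query(max_depth):
    """Generate the query above with one UNION branch per subclass depth.

    max_depth=2 yields the three branches of the original query
    (P31, P31/P279, P31/P279/P279).
    """
    branches = []
    for depth in range(max_depth + 1):
        path = "/".join(["wdt:P31"] + ["wdt:P279"] * depth)
        branches.append(f"  {{ ?item {path} wd:Q8171. }}")
    body = " UNION\n".join(branches)
    return (
        "SELECT ?item ?itemLabel ?lang ?langLabel\n"
        "WHERE\n"
        "{\n"
        "  ?item wdt:P407 ?lang.\n"
        f"{body}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,ru,de,fr". }\n'
        "}"
    )


query = word_items_query(2)
print(query)
```

A SPARQL property path such as wdt:P31/wdt:P279* would cover all depths in a single pattern, at the cost of no longer being able to limit the depth.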

@ChristianKl: I believe that this discussion is out of scope of lexicographical project (because human (Q5) corresponds to many thousands of lexemes, which is much more than 2 mentioned items), and rather belongs to Wikidata:Wikiproject Taxonomy (which is more tough, I must admit). --Infovarius (talk) 21:40, 9 July 2018 (UTC)

misspelling (Q1984758)

@VIGNERON: and others. Are forms supposed to contain wrong forms like in aurochs (L5143)? KaMan (talk) 13:25, 1 July 2018 (UTC)

Hi KaMan, thank you for asking.
First in general, it depends on what we call « wrong ». An obvious mistake made just one time should not be on Lexeme but what about a very common mistake ? (so common that sometimes the error is more common than the right form) I see no reason to not have a common form, no matter if it's right or wrong.
Then, for the specific case of "aurochs", it's especially tricky: it was considered by most dictionaries to be a misspelling (Q1984758) (and still is), but because it was so common, since the orthographic corrections of French in 1990 (Q486561) it's not really a misspelling (Q1984758) any more (but the 1990 reform is kind of optional and most French-speaking people don't know about it, so "aurochs" is still the ). I have no precise idea on how to model that... (it's probably a bit too complex for the system), that's why I chose (choosed ;) ) a simple way which is not perfect. I'm open to any suggestion.
Cdlt, VIGNERON (talk) 15:09, 1 July 2018 (UTC)
@VIGNERON: what about new property for forms "commonly misspelled as" with text value? KaMan (talk) 16:08, 1 July 2018 (UTC)
@KaMan: why not, but why? I see no need for a property and I can see some disadvantages, like making it harder to query forms if not all forms are stored as forms. Cdlt, VIGNERON (talk) 17:08, 1 July 2018 (UTC)
It is an issue of broader scope than just misspelling (Q1984758). The forms can also be "wrong" due to their incorrect inflection, vocalization, accentuation, capitalization etc. Besides, as VIGNERON wrote, the "wrong" qualification is not a simple boolean value, but it has many colo(u)rs and shades. Some languages have a normative grammar that says what's correct and what's wrong. Other languages have just recommendations, and yet other ones only descriptions of what is used in which layer of the language. Also, the "wrongness" can be limited to a certain context (region, time, style, even sense) while in another context the form is considered correct. For these reasons I consider creating a separate record for the "wrong" form, and describing the type and the scope of its wrongness using appropriate statements, to be a good solution for most cases.--Shlomo (talk) 06:26, 2 July 2018 (UTC)
@VIGNERON, Shlomo: Ok, I'm convinced to use separate forms, but I feel uncomfortable that misspelling (Q1984758) is placed among the grammatical features. I think it would be better to use instance of (P31) in the form's statements, with qualifiers describing the "wrongness" and with references to external sources stating that the form is "wrong". KaMan (talk) 07:09, 2 July 2018 (UTC)
I agree that a statement in the form section is a better practice. I'm not sure about using instance of (P31), I think we need a specific property for this. Maybe several ones.--Shlomo (talk) 05:34, 11 July 2018 (UTC)

@VIGNERON: I've changed aurochs (L5143) so it now contains a reference to orthographic corrections of French in 1990 (Q486561) (though I'm unsure whether applies to part (P518) is the most suitable here). KaMan (talk) 12:43, 4 July 2018 (UTC)

@KaMan: IMHO it isn't, though I'm not sure which one is. Maybe determination method (P459), statement is subject of (P805) or start time (P580)? I'm not 100% happy with any of them. Or maybe let's make it a reference and link it through stated in (P248) to a specific publication.--Shlomo (talk) 05:34, 11 July 2018 (UTC)
@Shlomo: I agree that something better has to be chosen, but I know nothing about the nature of orthographic corrections of French in 1990 (Q486561), so I think @VIGNERON: is better placed to choose the right markup. KaMan (talk) 07:44, 11 July 2018 (UTC)

Another issue: If the misspelling (or other attribute) is related to the word's stem, should it be stated in the lexeme section, or in the form sections of every single form? Or maybe in both of them? Also, should a word with a commonly misspelled lemma have a separate lexeme, or should the "wrong" lemma be given as an additional representation (with corresponding code)? Or should we have only the "correct" lemma, and the misspelled variants just as forms?--Shlomo (talk) 05:51, 11 July 2018 (UTC)

@Shlomo: I would say it depends on how common the mistake is. But yes, in theory, the mistake should be on all forms. If the mistake is very common, it could be in a specific lexeme (just as common mistakes have their own Wiktionary entry, see fr:wikt:aréoport for aéroport). For me, the goal is having all words of the world in Lexemes, no matter how "correct" they're supposed to be (especially as "correct" can be a tricky and subjective point of view). Cdlt, VIGNERON (talk) 08:20, 11 July 2018 (UTC)

@VIGNERON, Shlomo: I've created a property proposal in relation to this discussion: Wikidata:Property proposal/correct form KaMan (talk) 15:12, 11 July 2018 (UTC)

@KaMan: thanks, a property could be useful but I'm not sure whether it is a good idea or exactly how it should be used... (to be discussed on the proposal). Cdlt, VIGNERON (talk) 15:40, 11 July 2018 (UTC)

Some thoughts on what's missing

Since I've been adding a lot of English lexemes the last few weeks, I've had some thoughts about what is still missing (perhaps we need additional properties, or something else?) - a short list (not including the obvious things like merging, searching, etc):

  • A way to indicate that two lexemes are distinct despite having the same label, language and category - this is for the French tour case: Lexeme:L2330, Lexeme:L2331 and Lexeme:L2332 but it happens in English too, for example lie: Lexeme:L4180 and Lexeme:L4181. Once we have merging, we need a way to be clear these are cases that should NOT be merged. In item space we have different from (P1889), so a similar property for Lexemes perhaps?
  • Some understanding/agreement on what it means for a single form to have multiple "Grammatical features". For Lexeme:L4180, I've indicated form L4180-F2 with the features "simple present" and "third-person singular", as it is the form when both of those conditions apply. However, for form L4180-F4 I've indicated both "simple past" and "past participle", as it is the form when either of the conditions apply. Is this ok? Can we think of this consistently as a Boolean "OR" when the grammatical features are in the same category (tense for F4) but a Boolean "AND" when in different categories (tense vs number for F2)? Should this be better formalized somehow?
  • Some way of indicating that a particular form is part of the language, but used rarely. For example the "thou" forms of English verbs are a part of the language but used only very rarely - should we add "liest" as a form for Lexeme:L4180? How should it be indicated that this is a rare form? What about lexemes that are rare as a whole, perhaps we need some kind of frequency measure attached to lexemes and/or forms?
  • It might be helpful to indicate that a lexeme can have multiple categories with essentially the same meaning. This is true for a number of comparable adjectives in English that have the same form as adverbs - can we indicate they are both adjective and adverb at once? Or adjectives which are also nouns, nouns which are also adjectives, etc. The Oxford English Dictionary often lists multiple lexical categories on a single entry. Is there some way of doing this here (maybe this has already been discussed previously here??)

That's all for now - comments appreciated! I may mull on this a bit more and propose another property or two, if I don't hear better suggestions.... ArthurPSmith (talk) 01:07, 2 July 2018 (UTC)

@ArthurPSmith:
  • as for indicating that two lexemes should not be merged, we already have the property homograph lexeme (P5402), which can partially cover this problem.
  • as for multiple grammatical features, I think there should be no "OR" relation. Separate forms should be created for such cases even if they are identical. If we gain the ability to add example sentences of usage for a form in the future, each example would need to be assigned according to its grammatical features. With "OR" this assignment could be ambiguous.
  • as for rare forms, we already have a solution. Use instance of (P31) with rare form (Q55094451) in the statements of the form; that item was created exactly for this purpose (you can find more in Template:Lexicographical properties). We can attach a frequency measure with qualifiers and add a referenced source for the claim. There is also the case where one of the senses of a lexeme is rare.
KaMan (talk) 06:48, 2 July 2018 (UTC)
@KaMan: Thanks, I hadn't noticed that line in the template (I thought it was just about properties, but that's very useful thanks!). On the "OR" question - I would think "example sentences of usage" would be for senses, not for the forms. I do think the way I've been doing it is logically consistent and sustainable. Under your approach, for the English verb "put", you would enter it 3 times then, one for simple present, one for simple past, one for past participle? Also I'm not sure what under this suggestion would be the right way to handle most English verbs that have just one form for all but third-person singular - I've been assuming that just putting "simple present" there is sufficient, with the "simple present" + "third-person singular" on the other form as an overriding condition, is that your understanding also? ArthurPSmith (talk) 14:28, 2 July 2018 (UTC)
@ArthurPSmith: yes, for "put" (put (L4464)) there should be 3 forms identical in spelling but with different grammatical features. The Wikidata Lexeme Forms tool works on this assumption. It always creates separate forms regardless of whether they are identical. As an extreme example, in Kot (L2876) the single spelling "Kot" is repeated 14 times as forms with different grammatical features. That is also how it is described in external resources, and it's not something I've come up with myself. I read about describing forms this way here, but I don't remember the thread. KaMan (talk) 07:14, 3 July 2018 (UTC)
@KaMan: I really don't believe that's the correct approach to word forms. Every source I can find, for example here suggests that by "word form" is meant a particular "shape" of a lexeme, and there's only one form per shape. If two forms are spelled the same but pronounced differently, I can see that being a reason to have two entries rather than one (for example English "read" in present tense vs past). But otherwise it seems to me it's a single word, i.e. the same set of letters between two spaces, pronounced the same, it doesn't make sense to have multiple entries. Quoting from the above reference - "The point about "crown", for example, is that as a transitive verb it would get one entry despite the existence of four different shapes in which it appears: crown, crowns, crowned, crowning. These different shapes spell out word forms that belong to the verb lexeme crown." That's 4 word forms, not 5 or more that distinguishing grammatical tenses would require. ArthurPSmith (talk) 14:31, 3 July 2018 (UTC)
@ArthurPSmith: That's four shapes; the source doesn't state how many forms they can represent. Actually, I can't see any statement or implication there that generally there's only one form per shape. On the contrary, at the end of the article the author mentions the possibility of distinct word forms that have the same shape (... while he says at the same time he won't count them for the specific purpose of finding the English word with most "forms"). I think his approach is understandable for English and other analytic and isolating languages, which have (if any) a rather limited possibility of inflection, but it is not so suitable for other types of languages.--Shlomo (talk) 06:19, 4 July 2018 (UTC)
@ArthurPSmith: ok, so take a look at this real-life example with marchew (L5595). The form "marchwi" is repeated a few times. Let's take two:
With your "OR" version of grammatical features together with a single occurrence of the form, there would be the form:
How can one query what were the original features from this "or"ed list? KaMan (talk) 11:54, 4 July 2018 (UTC)
Hmm, I suppose in a case like that I would advocate using combined features - "singular dative" OR "plural genitive". But I see there are issues here. Perhaps the better solution for the English cases with the same "simple past" and "past participle" form would be to use a combined feature there too, as it's quite common. Maybe there's something already defined for that? I'm not a linguist! If anybody has a proposal for clearly explaining how we should use the "grammatical feature" aspect of forms I'd love to see it! ArthurPSmith (talk) 14:19, 4 July 2018 (UTC)
@ArthurPSmith: another example where two identical forms are needed is when one of them is instance of (P31) rare form (Q55094451). In the "OR"ed version there is no way to attach a claim on the form to just one of the grammatical features. Today I had such an example in coś (L5916), where forms L5916-F1, L5916-F3 and L5916-F5 are identical but only form L5916-F3 needs markup as a rare form. KaMan (talk) 12:51, 5 July 2018 (UTC)
@ArthurPSmith: As for Example sentences of usage would be for senses, not for the forms: Not necessarily. It's true when we use the example to make the sense precise (and to distinguish it from a similar sense or even another lexeme). Sometimes, however, we need an example to show the difference in using different variants of an inflected form, and in that case the right place for the example would be in the forms section.--Shlomo (talk) 06:38, 4 July 2018 (UTC)
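The querying problem raised above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the form records are simplified dictionaries, not the real Wikibase data model, and the two feature pairs are the "singular dative" and "plural genitive" readings mentioned in the discussion. The helper names (merge_identical, readings) are hypothetical.

```python
from itertools import product

# Assumed feature categories for this toy example.
CASES = {"genitive", "dative"}
NUMBERS = {"singular", "plural"}

# One record per form, as in the "separate forms" approach.
# Simplified, illustrative data; not copied from marchew (L5595).
separate_forms = [
    {"representation": "marchwi", "features": {"dative", "singular"}},
    {"representation": "marchwi", "features": {"genitive", "plural"}},
]

def merge_identical(forms):
    """Collapse forms with the same spelling into one flat feature set ("OR" approach)."""
    merged = {}
    for form in forms:
        merged.setdefault(form["representation"], set()).update(form["features"])
    return merged

def readings(features):
    """List every (case, number) pair that a flat feature set is compatible with."""
    return sorted(
        (case, number)
        for case, number in product(sorted(CASES), sorted(NUMBERS))
        if case in features and number in features
    )

merged = merge_identical(separate_forms)
recorded = sorted(readings(form["features"])[0] for form in separate_forms)
recovered = readings(merged["marchwi"])
# Pairs the merged set admits although they were never entered in this toy data.
not_recorded = [pair for pair in recovered if pair not in recorded]

print(recorded)
print(recovered)
print(not_recorded)
```

With separate forms each record yields exactly one reading; after merging, the flat set is compatible with four pairs, so the two that were originally entered can no longer be told apart from the others. Combined features such as "singular dative" would avoid this, at the cost of needing one item per combination.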

derived from (P5191)

This property takes a lexeme. I think that's wrong. It should take a form of the lexeme as input. A word often comes from an inflected form rather than from the basic form. For example, I linked marchew (L5595) with derived from (P5191) to *mrchy (L5600), but in fact it should be linked to form L5600-F2 with accusative singular. Are we able to link to forms, or should this be requested on Phabricator? KaMan (talk) 12:16, 3 July 2018 (UTC)

You would have to propose a new property ("derived from form") for this case, I think. Phabricator wouldn't help, it's something we can decide here on our own. ArthurPSmith (talk) 14:19, 3 July 2018 (UTC)
@ArthurPSmith: Ok, I have proposed a new property, but what I was referring to with Phabricator is that the interface has no suggester for picking a form. There seems to be only a picker for lexemes. KaMan (talk) 12:26, 4 July 2018 (UTC)
Form selection does work for form-valued properties. If you try the sandbox item Lexeme:L123 you will see the Sandbox-Form property where you can try this out. ArthurPSmith (talk) 14:14, 4 July 2018 (UTC)
Thanks, that's exactly what I was looking for. KaMan (talk) 12:41, 5 July 2018 (UTC)
@KaMan, ArthurPSmith: I agree that forms should be involved in derivation. And going a bit further, shouldn't "derived from" be a property between form and form? At least for suppletive lexemes it seems needed, unless there is another way. I think we need to focus on the big picture here. Cdlt, VIGNERON (talk) 14:46, 19 July 2018 (UTC)

present participle (Q13923816), no label (Q24133704), and no label (Q24577575)

Is there some difference between them, or should they be merged?--Shlomo (talk) 06:01, 9 July 2018 (UTC)

I'd be in favor of merging, but the Danish labels on the last two seem to differ, so maybe a Danish speaker should look at this? ArthurPSmith (talk) 13:47, 9 July 2018 (UTC)
@Fnielsen: Mahir256 (talk) 14:13, 9 July 2018 (UTC)

I have already posted the question here. Repeating: 'It is unclear whether these two items should be merged. The Danish article talks about general "present participle" in Danish and English, while the Turkish and Russian (AFAI can read) talks about English only.' The Danish article that links to no label (Q24577575) describes mostly the Danish verb form (with -ende postfix), but also briefly mentions the English verb form (with -ing postfix). Both the Turkish and the Russian Wikipedia article linked to no label (Q24133704) seem to only describe the English form, while present participle (Q13923816) seems to be only about Dutch. I wouldn't mind merging them. I do not know if the Dutch agree with that. In Danish, "lang tillægsform" = "Nutids tillægsform" = "præsens participium" [1] and these are also similar to "lang tillægsmåde" = "nutids tillægsmåde" — Finn Årup Nielsen (fnielsen) (talk) 19:38, 9 July 2018 (UTC)

And what about this one: participe présent (Q1763348)? That seems to be about French only. — Finn Årup Nielsen (fnielsen) (talk) 19:45, 9 July 2018 (UTC)
Yet another: present active participle (Q430255). Shouldn't both present active participle (Q430255) and participe présent (Q1763348) use "subclass of" property rather than "instance of" class? — Finn Årup Nielsen (fnielsen) (talk) 19:47, 9 July 2018 (UTC)

I have now taken the liberty to merge present participle (Q13923816) and no label (Q24577575). — Finn Årup Nielsen (fnielsen) (talk) 19:48, 9 July 2018 (UTC)

@Fnielsen, Shlomo: Thanks! The Russian link was only to a redirect to an article English grammar; the Turkish article was English-specific, but I think the idea is really the same general one that the Danish and Dutch articles cover, so I have merged those two items. I also modified the French and Latin ones to be subclasses rather than instances; they do seem to be somewhat distinct concepts. However - I wonder now if we have a bot that can fix the links to no label (Q24133704) in the grammatical features of Lexemes? ArthurPSmith (talk) 13:05, 10 July 2018 (UTC)
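As a rough idea of what such a bot would have to do per lexeme, here is a minimal sketch. It works on a plain dictionary shaped like the JSON of a lexeme (a "forms" list whose entries carry a "grammaticalFeatures" list of item IDs) and does not talk to the API at all. The function name replace_feature, the sample form IDs, and the assumption that Q13923816 is the item that survived the merge are all illustrative, not confirmed in this thread.

```python
def replace_feature(lexeme, old_id, new_id):
    """Return a copy of the lexeme with old_id swapped for new_id in every form's features."""
    updated_forms = []
    for form in lexeme.get("forms", []):
        features = []
        for qid in form.get("grammaticalFeatures", []):
            target = new_id if qid == old_id else qid
            if target not in features:  # skip duplicates if both IDs were present
                features.append(target)
        updated_forms.append({**form, "grammaticalFeatures": features})
    return {**lexeme, "forms": updated_forms}

# Hypothetical lexeme with made-up form IDs, for illustration only.
sample = {
    "id": "L0",
    "forms": [
        {"id": "L0-F1", "grammaticalFeatures": ["Q24133704"]},
        {"id": "L0-F2", "grammaticalFeatures": ["Q24133704", "Q13923816"]},
    ],
}

# Assumption: Q13923816 is the merge target.
fixed = replace_feature(sample, "Q24133704", "Q13923816")
print(fixed["forms"])
```

A real bot would additionally need to find the affected lexemes and save the changed forms back; that part is left out here because it depends on the tooling used.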

Showcase Lexeme

Hey folks :)

For my talk at Wikimania it would be nice to show a Lexeme that is pretty well described with Forms and statements. Any suggestions which Lexeme would be a good candidate? --Lydia Pintscher (WMDE) (talk) 10:03, 15 July 2018 (UTC)

@Lydia Pintscher (WMDE): Try any of these:
Note that all above were created with Wikidata Lexeme Forms tool so they are good showcase for the available tools too.
KaMan (talk) 10:31, 15 July 2018 (UTC)
Awesome! Thank you. I'll use one of these. --Lydia Pintscher (WMDE) (talk) 18:35, 15 July 2018 (UTC)
What about the biggest and fullest (and non-Latin-script) one, вода (L189)? Sense: water (Q283). It also has the biggest graph, though no homographs... --Infovarius (talk) 23:32, 15 July 2018 (UTC)

Forms for German verbs

Since I think this topic could be interesting for other languages as well, I would prefer to keep it accessible to people who don’t understand German. However, I recognize that I can’t force everyone who might reply to use {{LangSwitch}}.

I’ve started drafting what the set of forms of a lexeme for a German verb could look like, in the form of a template for the Wikidata Lexeme Forms tool at User:Lucas Werkmeister/Wikidata Lexeme Forms/German#deutsches Verb. I’ve tried to keep the number of forms down, so I omitted everything that derives from adding auxiliary verbs to the infinitive or past participle („ich werde tragen“, „ich würde tragen“; „ich werde getragen“, „ich wurde getragen“, „ich hatte getragen“, „ich war getragen worden“, etc.); however, merging e. g. first and third person plural forms seems excessive to me. I’ve also completely skipped the present participle („tragend“, „tragendes“, „tragender“, etc.), as well as all the inflected forms of the past participle („getragenes“, „getragener“, etc.), because I’m not sure what to do with these yet, since they kinda move into adjective territory, and I haven’t tackled adjectives yet either.

What are your thoughts on this? If you think it looks alright, I can add the template to the Wikidata Lexeme Forms tool and we can start creating some verbs with these forms, and then see how well the model fares in practice, I guess.

--Lucas Werkmeister (talk) 22:04, 21 July 2018 (UTC)

On Special:NewLexeme, should Language of Lexeme not just allow languages?

On the page Special:NewLexeme, when the form is empty, it should not be possible to type anything other than a language in the language field. Indeed, if you enter another word there, e.g. house, the software recognizes that it is not a language, puts it automatically in the field "language of lexeme", and adds a field "language of lemma" for which only languages are available in the drop-down list. Should the field "language of lexeme" not behave the same way? Or is there something I missed here?

Duplicates

How do we deal with duplicates for now? We have a series of lexemes from the Wikimania workshop, and among them run (L7339), manger (L7395), ser (L7397) and libro (L7370) are already described in run (L279), manger (L309), ser (L5140) and libro (L317). KaMan (talk) 13:49, 22 July 2018 (UTC)

Before creating a new lexeme you should always check whether it already exists by using the auto-complete search on lexeme-valued properties (the test lexeme sandbox (L123) has a test property for this). ArthurPSmith (talk) 23:59, 22 July 2018 (UTC)
@ArthurPSmith: There is also the useful Ordia tool for detecting duplicates, but the question is not how to detect them but what to do with them once they have occurred, as shown above. KaMan (talk) 10:29, 23 July 2018 (UTC)
@KaMan: I believe we are waiting on the development team to add a "merge" capability. I have "repurposed" a few duplicates (if the same string can act as both "noun" and "verb" for example, changing the type if that is otherwise missing) but I think this is frowned upon, so we need to just leave them for now. ArthurPSmith (talk) 11:39, 23 July 2018 (UTC)
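Until merging is available, duplicates can at least be listed for later cleanup. The sketch below groups lexeme records by lemma, language and lexical category and reports the groups with more than one entry. The record layout and the function name find_duplicate_candidates are hypothetical, and the language and category values in the sample are assumed for illustration. A reported group is only a candidate: genuine homographs share all three values and must not be merged.

```python
from collections import defaultdict

def find_duplicate_candidates(lexemes):
    """Group lexeme IDs by (lemma, language, category); keep groups with more than one ID."""
    groups = defaultdict(list)
    for lexeme in lexemes:
        key = (lexeme["lemma"], lexeme["language"], lexeme["category"])
        groups[key].append(lexeme["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

# Sample records; language and category values are assumed, not taken from the lexemes.
sample = [
    {"id": "L279", "lemma": "run", "language": "en", "category": "verb"},
    {"id": "L7339", "lemma": "run", "language": "en", "category": "verb"},
    {"id": "L317", "lemma": "libro", "language": "es", "category": "noun"},
]

candidates = find_duplicate_candidates(sample)
print(candidates)
```

Each candidate group still needs a human check, for example against homograph lexeme (P5402), before anything is merged.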