Wikidata talk:Lexicographical data

Wikidata:Lexicographical data

Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.


On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2018/11.

Senses before forms

Senses are more important for identifying the word and are generally what people are more likely to look for than forms. Can they be listed before forms? This is especially important for words with dozens of forms. —Rua (mew) 12:09, 21 October 2018 (UTC)

I agree with that. Because of it as a temporary solution for myself I wrote small script to quickly jump to senses from ToC at the top of the page: User:KaMan/ToC_to_lexemes.js. KaMan (talk) 12:57, 21 October 2018 (UTC)
Information such as the grammatical type of the lexeme is nevertheless very important, as we have several lexemes for the same string (for an example of the same string with different grammatical gender, see L:L2330 and L:L2332), so this information may be important for disambiguation and to avoid mistakes. (By the way, I noticed that there do not seem to be differences in the type, gender and forms of two homographs, L:L2331 and L:L2332; why are they different lexemes?) author  TomT0m / talk page 18:41, 21 October 2018 (UTC) I later withdrew this argument, as the discussion below proved it’s not really well founded. Semantics is carried by lexemes in « Lexical semantics », so senses are essential to identify an item. author  TomT0m / talk page 11:57, 22 October 2018 (UTC)
They might differ in information that hasn't been provided yet. —Rua (mew) 19:44, 21 October 2018 (UTC)
@TomT0m: exactly as Rua said (etymology for instance; you can already see the reverse etymology for tour de Babel (L474) or tourneur (L2334)). These 3 lexemes "tour"@fr have been discussed a lot already; look at the past discussions (Special:WhatLinksHere/Lexeme:L2330). Conversely, do you know of even one source that says it's one lexeme? (*all* the dictionaries describe them as at least 2 lexemes, often 3 lexemes). Cdlt, VIGNERON (talk) 09:57, 22 October 2018 (UTC)
Sorry for the naive question, I don’t follow this page very closely :) I must admit I find articles such as fr:Lexème_(linguistique) hard to follow, as they are full of specialized terms and infinite nuances and lack examples. Maybe we need a Help:Lexeme page which sums up the current discussions, gives examples and details the definition and practices used on Wikidata for the layman? I’m just searching for information on guidelines right now, so I looked at the data model to see if there was a basic definition, and I see that the Lemma definition in mw:Extension:WikibaseLexeme/Data_Model#Lemma says:
Two distinct lexemes with the same lexical category can exist in the same language if they have different morphology, that is, different forms.
That does not mention etymology for lexemes, so it’s inconsistent with actual practice on this page. My own intuition would have suggested that etymology is tied to senses, but that’s just me :). author  TomT0m / talk page 10:36, 22 October 2018 (UTC)
@TomT0m: no problem, it's always better to ask. It is hard to explain simply what a lexeme is, just as it's hard to explain what a concept is for items. For a very simplistic approach: a lexeme = a word. A more precise approach would be: a dictionary entry (usually one lemma has only one entry, but sometimes there is something like « 1. tour and 2. tour » when one lemma is borne by several lexemes, see "tour" in the TLFi). And more technically: an entity with specific information, including but not limited to morphology.
« etymology is tied to senses », yes, but more exactly « etymology is tied to senses of a word ». Anyway, "tour"@fr is a weird exception where several homographs with similar information are different lexemes; don't focus too much on it (and just look at references ;) if dictionaries say it's two lexemes, just follow them).
Cdlt, VIGNERON (talk) 11:17, 22 October 2018 (UTC)
@VIGNERON: I understand that practical approach but that does not really answer my question :) I finally took the approach of browsing the enwp article, and I found that the en:lexeme is defined by a field called en:Lexical_semantics, which actually takes into account the meaning of the different lexical entities and carries semantics, so that explains the matter a little bit more. author  TomT0m / talk page 11:51, 22 October 2018 (UTC)
If there is general agreement that we should switch around Senses and Forms I'm happy to do that. Some more opinions please? --Lydia Pintscher (WMDE) (talk) 03:48, 26 October 2018 (UTC)
I Support putting senses before forms (but after statements) on the page for a lexeme; people looking for a particular word should be informed quickly if they've gone to the wrong place. ArthurPSmith (talk) 15:13, 26 October 2018 (UTC)
Alright. I've opened phabricator:T208592 for it. --Lydia Pintscher (WMDE) (talk) 15:06, 2 November 2018 (UTC)

It's already live, though on older lexemes it appears only after a purge of the page, or after an edit and a refresh of the page. KaMan (talk) 10:09, 8 November 2018 (UTC)

Vote: Do we allow phonemes in the Lexeme namespace?

Hello, from this discussion and this one, I understood that there is a relative consensus that we do not allow storing phonemes in the Lexeme namespace. Thus, I added this to Wikidata:Lexicographical data/Notability. However, Jura1 considers that the discussion is not over. Because I think it is (nobody has written a new message on that topic for a while), I propose a vote in order to validate this point. @KaMan, Nikki, VIGNERON, Njardarlogar, Rua, Infovarius: @Jura1, ArthurPSmith, Circeus, Lexicolover: I am pinging you because you participated in previous discussions on that topic.

Do we allow phonemes in the Lexeme namespace?

Support

Oppose

  • Phonemes have to be stored in the Q-namespace. Pamputt (talk) 07:46, 25 October 2018 (UTC)
  • Despite asking several times, there has been no explanation, justification or reason why the Q-items are not enough. VIGNERON (talk) 07:57, 25 October 2018 (UTC)
  • IIRC I already told it twice, phonemes are not lexemes. KaMan (talk) 11:26, 25 October 2018 (UTC)
  • None of the lexeme features (language, lexical category, forms with grammatical features, etc) apply to phonemes, as far as I am aware, so I don't see any benefit to putting them in Lexeme namespace at all. ArthurPSmith (talk) 15:00, 25 October 2018 (UTC)
  • Rua (mew) 15:58, 25 October 2018 (UTC)
  • Phonemes are not lexemes. Although it is theoretically possible to store non-lexeme phenomena as L-entities, such a practice is quite problematic and should only be allowed if there is a strong reason for it. The same applies to graphemes, btw.--Shlomo (talk) 06:40, 26 October 2018 (UTC)
  • --Njardarlogar (talk) 09:31, 29 October 2018 (UTC)

Discussion

  • @Pamputt: As you don't contribute actively to lexemes on Wikidata, it's not clear how this would affect you and why you'd vote on this. Maybe you could outline the problem you are trying to solve. What alternatives do you propose? What is the urgency of this point? I find your overall posts to this page rather nonconstructive (who would start topic called "L21070_should_not_exist" to seek positive input?). Please avoid breaking things. --- Jura 07:55, 25 October 2018 (UTC)
  • @Pamputt: What happens now with the results of this vote? KaMan (talk) 09:45, 9 November 2018 (UTC)
    @KaMan: I will wait until this section is archived and then add a section specifying explicitly that phonemes and graphemes are excluded from the Lexeme namespace, with a reference to this vote (that is why I need the discussion to be archived, so that I can use a permanent link). Something similar to this. Pamputt (talk) 12:25, 9 November 2018 (UTC)

Linking with Wiktionary

Now we have nearly all main features (Forms, grammatical categories, Senses, translations...) I would like to have links to Wiktionary pages (wasn't it the main idea to have Wiktionary repository in Wikidata?). When (and in which form) is it planned? --Infovarius (talk) 15:11, 25 October 2018 (UTC)

@Infovarius: Wiktionary URLs are based on lemmata, so it's trivial to generate a link (and one fitting your needs: either the main lemma or a specific form, but not senses, as Wiktionaries don't have a structure for senses, and some Wiktionaries have different structures for anchor links to languages and sections inside a page). Someone from the Lexeme team can confirm (or refute) this, but I remember that there is no plan to explicitly store links to Wiktionary on Wikidata (though some people did so with a hack). Cdlt, VIGNERON (talk) 16:26, 25 October 2018 (UTC)
That's exactly the problem. Wiktionary pages are based on lemmata, so that one page can contain several lexemes with the same lemma. Wikidata lexicographical pages are based on lexemes and one page can contain several lemmata. Meaningful linking is in this situation surely not trivial. It could be done via statements (on WD side) and wikilinks in appropriate sections of each wiktionary, but I can't imagine the maintenance of such system.--Shlomo (talk) 05:45, 26 October 2018 (UTC)
Yes, there are several lemmata in "each" Wiktionary page, so I suppose we should have some template (using Lua access to WD Lexemes) in each section of it. But inversely, most Lexemes should correspond to unique Wiktionary page so it should be simple to add such linking. --Infovarius (talk) 08:07, 26 October 2018 (UTC)
  1. Many Wikidata Lexemes (L-items) have multiple lemmas and may (or may not) correspond to several Wiktionary pages. We don't know.
  2. The fact that there is a Wiktionary page with the name corresponding to a lemma doesn't mean, the Wiktionary page contains section with a lexeme corresponding to the Wikidata L-item.
--Shlomo (talk) 09:16, 26 October 2018 (UTC)
Pardon, Shlomo? Can you please provide an example of a Lexeme which has multiple lemmas? I cannot imagine it... --Infovarius (talk) 12:12, 29 October 2018 (UTC)
Sure. Check these: color/colour (L1347), colour/color (L791), מזל/mazal (L12373), вода/voda (L2068), 大きな/おおきな (L661), מֵם/מים (L8305), ном/ᠨᠣᠮ (L7957).--Shlomo (talk) 16:45, 29 October 2018 (UTC)
And this is just for the main lemma (which has just an indicative value); each form is a different Wiktionary entry and each sense is a different section of these entries. I'll try to make a schema to make it clearer how complex linking is here. Cdlt, VIGNERON (talk) 11:51, 31 October 2018 (UTC)
We have no separate "forms" in ru-Wiktionary, so it's not a problem. Links to sections are not necessary either. But "color/colour" is the problem, yes. Maybe we could suppose they are not numerous and just ignore them? Infovarius (talk) 14:35, 1 November 2018 (UTC)
I created phab:T195411 a while back asking for a special page which could be used with Cognate to make it possible to navigate between Wikidata and Wiktionary. I get the impression the developers aren't convinced though. - Nikki (talk) 11:28, 26 October 2018 (UTC)
  • Yes, I think it's time to add them. Aren't they already all centrally stored? So one could easily display at least the ones leading to the Wiktionary in the same language. --- Jura 18:39, 25 October 2018 (UTC)
  • I've been somewhat disappointed that enwiktionary seems to have pretty much everything I look at covered already pretty well. However, I just worked on fencing (L33095) where I realized the number of senses one gets from wikidata items with that label is quite a bit more than what enwiktionary had. So there's hope that we actually can be useful beyond interlanguage linking :) ArthurPSmith (talk) 20:59, 25 October 2018 (UTC)
To answer Vigneron's question: there is no plan for a specific development to add Wiktionary links, as it can easily be covered by statements. This would also mean that we don't have to follow the 1-n rule of the Wikipedia interwikilinks. Depending on how you decide to model it, a Lexeme could link to several Wiktionary pages, and several Lexemes could link to the same Wiktionary page. Lea Lacroix (WMDE) (talk) 09:37, 26 October 2018 (UTC)
Sounds good. Thanks for the quick reply. --- Jura 10:51, 26 October 2018 (UTC)
Could you explain what you have in mind when you say it can be easily covered by statements? I can't see any sane way of doing it. - Nikki (talk) 11:28, 26 October 2018 (UTC)

In the FAQ, the very first question is "Why will this project be useful for Wiktionary editors?" and the reply talks about Indonesian Wiktionary populating Estonian words from Wikidata. Was that just a dream, or has anything like that been implemented? Maybe the reply should be changed into something that resembles actual reality? --LA2 (talk) 17:44, 26 October 2018 (UTC)

@LA2: as far as I know, nothing has been implemented yet (no surprise there, Lexemes are still at an early stage) but this section and the example still is true, it's up to the wiktionaries to decide to use Wikidata data (or not). Cdlt, VIGNERON (talk) 17:42, 27 October 2018 (UTC)
Yeah, I had it in mind! I thought that the initial plan was to have "central repository of lexicographical data" for Wiktionaries. But how can Wiktionary use it if they are not linked to each other? --Infovarius (talk) 12:12, 29 October 2018 (UTC)
@Infovarius: true, explicit links can make our life easier but per se, they are not needed to reuse Wikidata data. For proof, see all the templates on Commons who explicitly call for a specific Qid, Commons:Template:Creator or commons:Template:Artwork for the more common examples. Cheers, VIGNERON (talk) 11:51, 31 October 2018 (UTC)
  • It would be good to set up a version of Wikibase lexemes for Wiktionaries that are interested in having structured data without having to pay a high price. --- Jura 08:07, 30 October 2018 (UTC)
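
VIGNERON's remark that Wiktionary URLs follow from lemmata, and Shlomo's objection that one lexeme may carry several lemmas, can be illustrated with a small Python sketch. The function name `wiktionary_urls` is hypothetical, and the code assumes the usual `https://<language>.wiktionary.org/wiki/<title>` URL pattern; it only builds candidate URLs and makes no claim that the target page exists or contains a matching section.

```python
from urllib.parse import quote


def wiktionary_urls(lemmas, wiki_lang):
    """Return one candidate Wiktionary URL per distinct lemma spelling.

    `lemmas` are the spellings stored on a single lexeme; `wiki_lang` is the
    language code of the Wiktionary edition (both supplied by the caller).
    """
    urls = []
    for lemma in dict.fromkeys(lemmas):  # drop duplicates, keep order
        title = lemma.replace(" ", "_")  # MediaWiki titles use underscores
        urls.append(f"https://{wiki_lang}.wiktionary.org/wiki/{quote(title)}")
    return urls


# color/colour (L1347) has two lemma spellings, hence two candidate pages.
print(wiktionary_urls(["color", "colour"], "en"))
```

A lexeme with a single lemma yields exactly one URL, which is the simple case Infovarius describes; the multi-lemma case shows why a one-to-one link is not always possible.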

Extensions of Ordia

The Toolforge tool Ordia at https://tools.wmflabs.org/ordia/ has now been extended. There are overviews of languages, lexemes, forms and senses. There is also a text-to-lexeme functionality at https://tools.wmflabs.org/ordia/text-to-lexemes, though it is currently only enabled for four languages. Example: [1] Finn Årup Nielsen (fnielsen) (talk) 20:39, 1 November 2018 (UTC)

  • Nice. I had done something similar (offline). Just lacks the select lexical category and create buttons. It seemed to time-out when I tried with a Wikipedia article. --- Jura 13:18, 4 November 2018 (UTC)
    • Rather than time-out I suspect it is a CORS problem I need to look into. I hope to extend Ordia with button for input. — Finn Årup Nielsen (fnielsen) (talk) 00:20, 8 November 2018 (UTC)

Relate Form to Sense

The forms of Bank (L34723) f. in German are not identical in the sense of bench and bank (e.g. nominative plural in the sense of bench is Bänke in contrast to Banken). How should these forms be related to the senses? --Mfilot (talk) 21:30, 1 November 2018 (UTC)

They should be entered as separate lexemes. As well as different forms, they also have different etymologies. - Nikki (talk) 21:37, 1 November 2018 (UTC)
Ok, makes sense. I created a new lexeme Bank (L34791) for the financial institution and cleaned up the translation (P5972) (see Bank (L34791), banque (L15448), bank (L3354)). --Mfilot (talk) 22:16, 1 November 2018 (UTC)
  • I think we still need a way to indicate forms that may be linked to specific senses. --- Jura 13:16, 4 November 2018 (UTC)

Forms dependent on first letter of the following word

What grammatical category should we use, or how should we tag forms that are dependent on the first letter of the following word? For example: the English article 'a' has the form 'an' if followed by a word beginning with a vowel; and the English prefix 'in-' has the form 'im-' if it is followed by a base word starting with 'p' or 'b', etc. How should we / should we distinguish such forms? Liamjamesperritt (talk) 01:09, 4 November 2018 (UTC)

I have such cases in Polish too. For example nad (L14478), which can have the forms "nad" and "nade" depending on the next word. I mark them with vocalic form (Q55082724) and non-vocalic form (Q55082712), as they are tagged in the Grammatical dictionary of Polish (Q55214514), which I use as a source of grammatical forms. KaMan (talk) 07:13, 4 November 2018 (UTC)
  • Normally, we would list the required forms somewhere. Maybe with the same as for Latin/French? Wikidata:Property_proposal/requires_form. --- Jura 12:04, 4 November 2018 (UTC)
    • Interesting idea. I left some thoughts about the "requires form" property on the proposal page, as I think something like this could solve the problem. Liamjamesperritt (talk) 00:18, 5 November 2018 (UTC)
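
To make the question concrete, here is a minimal Python sketch of forms that carry a condition on the first letter of the following word. The class `ConditionedForm`, its field `before_initials` and the function `pick_form` are hypothetical names invented for this illustration; they are not part of the Wikibase data model or of the "requires form" proposal. The rule encoded is only the simplified one stated above ('im-' before 'p' or 'b'), and it works on spelling, which would not be adequate for cases such as 'a'/'an' that depend on sound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionedForm:
    representation: str
    # Initial letters of the following word that trigger this form;
    # an empty set marks the default form.
    before_initials: frozenset = frozenset()


def pick_form(forms, next_word):
    """Return the representation whose condition matches next_word."""
    initial = next_word[:1].lower()
    default = None
    for form in forms:
        if not form.before_initials:
            default = form
        elif initial in form.before_initials:
            return form.representation
    if default is None:
        raise ValueError("no default form available")
    return default.representation


# Simplified English prefix example from the discussion.
IN_PREFIX = [ConditionedForm("in-"), ConditionedForm("im-", frozenset("pb"))]
print(pick_form(IN_PREFIX, "possible"))  # im-
print(pick_form(IN_PREFIX, "active"))    # in-
```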

Translations

I have been adding translation (P5972) to Dienstag (L6818) in the sense of the day of the week. Currently (12:37, 4 November 2018 (UTC)) there are 16 translations. Each of these senses in turn has translations pointing back to the German sense and to the other senses, resulting in 272 translation entries. Of course this is a bit redundant, but this is the way it is intended to work, isn't it? I'm aware that most senses have item for this sense (P5137) pointing at Tuesday (Q127), but looking for a translation via the Q-item instead of translation (P5972) might be cumbersome. --Mfilot (talk) 13:03, 4 November 2018 (UTC)
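
One reading of the figure of 272 is that the German sense and its 16 translations form 17 senses, each carrying a translation statement to each of the other 16. The short Python sketch below checks that arithmetic; the function name `mesh_statement_count` is hypothetical and the full-mesh interpretation is an assumption, not something confirmed elsewhere in the discussion.

```python
def mesh_statement_count(n_senses: int) -> int:
    """Directed translation statements when every sense links to every other."""
    return n_senses * (n_senses - 1)


translations = 16
senses = translations + 1  # the 16 translations plus the German sense itself
print(mesh_statement_count(senses))  # 272
```

The count grows quadratically with the number of languages, which is the redundancy the comment refers to.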

  • Maybe the idea is that people focus on a limited number of language pairs for an unlimited number of lexemes rather than the opposite.
    The layout for these could be improved. The language should be visible at least as (language) code and the gloss could fall back to the language of the lexeme (sorry for the digression). --- Jura 13:16, 4 November 2018 (UTC)
  • That is a good thought to focus on limited number of languages which will happen anyway since for most lexemes it will not be that easy to identify the translations. I agree that the language of the translation (P5972) should be visible, and a more compact layout would also help. Some translations in Dienstag (L6818) display as code instead of label e.g. L34322-S1 instead of вівторок. Is this related to my language settings? --Mfilot (talk) 13:47, 4 November 2018 (UTC)
  • It happens if there is no gloss in your interface language. With "code" I had in mind the language code (rather than the id of the sense). I'm not even sure if it's a good idea to show the gloss in the interface language rather than the language of the linked lexeme. Maybe that should be a setting in preferences. --- Jura 14:01, 4 November 2018 (UTC)
Don’t we have items to represent meanings? This seems like the same situation as the interwikis on Wikipedias in the pre-Wikidata era. My understanding was that « item for this sense » played the role of a central item representing a meaning, to which each sense with that exact meaning would connect. This would avoid the translation explosion, and the exact translation pairs would be easily queryable:
< ?sense of a word in French > item for this sense < Wikidata item A >
< ?sense of a word in English > item for this sense < Wikidata item A >
. author  TomT0m / talk page 21:43, 4 November 2018 (UTC)
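
The hub idea in the comment above can be sketched in Python: if each sense records only its item for this sense (P5137), translation pairs can be derived by grouping senses that share an item. The function name `translation_pairs` and the sense labels in the sample data are hypothetical placeholders, not real sense IDs; Q127 is the Tuesday item mentioned earlier, and the second item ID is a made-up stand-in.

```python
from collections import defaultdict
from itertools import combinations


def translation_pairs(sense_to_item):
    """Derive unordered translation pairs from senses sharing the same item."""
    by_item = defaultdict(list)
    for sense_id, item_id in sense_to_item.items():
        by_item[item_id].append(sense_id)
    pairs = []
    for senses in by_item.values():
        # Every two senses attached to the same item translate each other.
        pairs.extend(combinations(sorted(senses), 2))
    return sorted(pairs)


# Placeholder sense labels; only Q127 (Tuesday) comes from the discussion.
sample = {
    "Dienstag@de": "Q127",
    "Tuesday@en": "Q127",
    "вівторок@uk": "Q127",
    "Bank@de": "Q-hypothetical",
}
for pair in translation_pairs(sample):
    print(pair)
```

With this layout each sense needs one statement instead of one per target language, at the cost of an extra hop through the item when looking up a translation.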
  • It seems redundant to have item for this sense (P5137) and translation (P5972) together in the same sense. Would it be possible to have a constraint that only one of those properties should be used for each sense? When the number of translation (P5972) statements grows beyond a certain number, a new item could be created for that sense to act as a hub for all languages. In my view a "sense" is similar to a Qitem that is embedded in a lexeme because it is convenient, but we shouldn't fear creating new Qitems when necessary.--Micru (talk) 18:44, 5 November 2018 (UTC)
  • @Micru: I've added a lot of senses (for English) - it seems relatively easy to find item for this sense (P5137) when the lexeme is a noun, but we almost never have existing Q items for other parts of speech (verbs or adjectives most commonly). Are we comfortable with adding Q items for verbs and adjectives? ArthurPSmith (talk) 20:05, 5 November 2018 (UTC)
  • @ArthurPSmith: I'm inclined to think that verbs and adjectives meet our notability criteria since they refer to clearly identifiable entities (especially if they exist in several languages) and they would fulfill a structural need. Of course, I'm open to hearing more arguments about this.--Micru (talk) 21:16, 5 November 2018 (UTC)
General comment: implementing sense entities still seems like a cleaner approach to me than putting senses for verbs, adjectives, adverbs etc. in the main namespace. It should also have a lot of benefits, like making it much easier to search through existing senses. We would also not have to worry about the notability policy for the main namespace; if one lexeme uses a sense entity, that should be good enough to keep it. We could also hard-code certain sense-specific behaviours, like making it possible to set the narrowest hyperonym (supersense, like blue for light blue) in a special field on the entity rather than using a property, and listing up all the widest hyponyms (subsenses) as well to aid navigation. --Njardarlogar (talk) 21:18, 5 November 2018 (UTC)
@Njardarlogar: How is a "sense entity" different than a regular item? --Micru (talk) 22:06, 5 November 2018 (UTC)
@Micru: I don't think of the extent to which a sense entity would be different from an item as the most important point, but rather that it is actually a separate type of entity with its own namespace. Having senses as their own entity type does mean that we can tailor them to their specific purpose, like I suggested above, and which I think potentially could get quite useful; but I don't think it is the most important reason.
Simply by keeping senses separate in their own namespace, we get
  • improved experience with the user interface: now we have to first enter a gloss, then select item for this sense (P5137) as the property to use, and only then can we select an entity. With sense entities, we would have a special field that would only accept sense entities as input and that would not require a gloss - just start searching among senses right away.
  • dedicated entities that do not contain content irrelevant for senses; for example, on a sense entity for a country, we would have no information on head of state, population, et cetera, et cetera. We could instead have a more prominent position for links to related senses, such as an inhabitant of the country, its language(s) and so on
  • easier navigation among existing senses: the main namespace is mostly composed of entries that will never be used as senses
--Njardarlogar (talk) 13:12, 6 November 2018 (UTC)
@Njardarlogar: I understand your point; however, I find that adding an additional namespace for senses would increase the complexity unnecessarily. 1) The feature that you suggest of a special field for senses could be designed to accept Q-items as input, with no additional namespace necessary. 2) Entering the information again for existing entities would mean duplication of effort for creation and maintenance. 3) There is nothing that indicates that adding an additional namespace would make navigation easier.
On the other hand, using items for senses where relevant doesn't require any additional infrastructure, we can start doing it now if there is the will.--Micru (talk) 22:14, 6 November 2018 (UTC)
@Micru: Regarding 1), you can still expect many irrelevant suggestions from the main namespace; items that will never be used on a lexeme for a sense. As long as we can e.g. set example lexemes on the sense entities (or have them generated automatically by the MediaWiki software), it must necessarily be easier to navigate a dedicated sense namespace than the main namespace because the main namespace is filled with irrelevant items. There would be duplication with sense entities, particularly for nouns; but it would likely not be 100% unless adverbs and similar concepts were included in the main namespace independently of their use by the lexicographical project. The essence of a sense should not change over time, so maintenance should mainly be about dated language in the descriptions/definitions (including altered classifications of the concept the sense corresponds to).
All that said, there is one potentially important difference between how we would use sense entities versus items: items are currently not supposed to have lengthy definitions, an item description is not meant for a dictionary definition but to be brief and act as a disambiguator. On senses entities, on the other hand, we could accommodate precise definitions and have a specific field for this purpose. Without definitions, the lexicographical project would be incomplete, surely. A property could potentially be used for this purpose on items, yes. --Njardarlogar (talk) 17:49, 7 November 2018 (UTC)
@Njardarlogar: Regarding irrelevant suggestions, that could be improved with a better suggester. From my experience it is not that bad, but if you have some examples of where the suggester was not offering you the items that you needed, then you should post them so that the developer team is informed.
As for the definitions, in the past I was under the impression that we need them. However the more I thought about it, the more I came to realize that the statements *are* the definition. For Wikidata it is not so relevant to come up with textual explanations of words (which btw normally have copyrights), that is the job of the wiktionaries, but what we can do is to transform those definitions into structured data (CC0).--Micru (talk) 12:24, 8 November 2018 (UTC)
  • @Micru: So far, all Wikidata items have been "conceptual or material entities", or some notable "thing". To start adding Q-Items for verb and adjective senses en masse would represent a substantial change to the essence of the Wikidata ontology. Since Wiktionary (and by extension, dictionary) entries are generally disallowed from entering Wikidata's Main namespace, it feels to me that entering senses of various lexical categories (verb, adjective, adverb, etc.) goes against that policy. A key reason the Lexeme namespace was created was to keep encyclopedic data separate from lexical data, and now we are coming to realise that lexemes alone are potentially insufficient to efficiently describe lexical data. Do we need to create another namespace to more fully attain what the Lexeme namespace was meant to fulfil, or do we shift the usage of the Main namespace, and potentially go against the purpose of creating a separate lexical namespace in the first place? Either way, it seems to me that the current data model is missing an important piece of the puzzle. Liamjamesperritt (talk) 21:28, 6 November 2018 (UTC)
  • @Liamjamesperritt: I have difficulty understanding why an adjective or a verb could not be considered a conceptual entity. In my view the definition of conceptual entity seems quite arbitrary and perhaps related to what can be considered encyclopedic, which generally does not apply to Wikidata. It is true that individual dictionary entries are generally disallowed from entering Wikidata's Main namespace, because for that we have the lexeme namespace; however, here we are talking about senses that are shared among a high number of languages. Even if we created an item for the sense "important", we would still need to create lexeme entities for each of the languages that have lexemes representing that concept, because each language has its own peculiarities regarding pronunciation, use, etc. By allowing the senses of verbs and adjectives in the main namespace, we are not going against the purpose of creating a separate lexical namespace, because such Q-items would be a complement to lexeme entities, not a replacement. As you say, we are missing a piece of the puzzle.--Micru (talk) 22:14, 6 November 2018 (UTC)
  • @Micru: Although I still feel that adding senses for verbs, adjectives and adverbs en masse would represent a substantial shift in the usage of the Wikidata Main namespace, if it is eventually concluded that such senses are valid Q Items, then I agree that this would be a nice solution to the problem, as we are already using Q Items to link noun senses. The next question is then: what would we make these items instances of / subclasses of? Or would we instead link them to their noun counterparts with a new property (e.g. "run" -> "running"; "beautiful" -> "beauty")? Or both? Liamjamesperritt (talk) 22:45, 6 November 2018 (UTC)
  • @Liamjamesperritt: It would definitely be a new practice to add items for senses, and as such it should be discussed thoroughly. I find your question about "instances of / subclasses of" too generic, because each one will have a different value, plus several other statements might help outline their meaning. I would say that "beautiful"<indicates quality>"beauty", about "running" I am not so sure because the item seems to conflate different concepts (sport, terrestrial locomotion), so probably it should be split.--Micru (talk) 00:00, 7 November 2018 (UTC)
So I've been reading up a bit on WordNet - https://wordnet.princeton.edu/ - what they've done for English is sort of create the conceptual items we are talking about. See the section on "Relations" on that page for how they differently handle noun, verb, and adjective relations; the hierarchies or groupings are quite different. The total number of items that might need to be created for verbs, adjectives and adverbs could be estimated from their counts - I would guess it will be well under 100,000 (most of their synsets are nouns already). So I don't think it would be in any way a big burden on Wikidata to add these concepts as items, they'll be at about the 1/1000 level of items. ArthurPSmith (talk) 16:30, 7 November 2018 (UTC)
  • @ArthurPSmith: Very interesting. This idea of a "synset" does make a lot of sense. And with the current state of the Wikidata data model, it certainly seems as though Items would be the simplest choice for representing these concepts, especially if it is determined that it would not be a burden on the Main namespace. Perhaps there should be a discussion about whether this direction should be taken? Liamjamesperritt (talk) 02:48, 8 November 2018 (UTC)
I agree, we need more input on this. I have started a thread on the project chat.--Micru (talk) 13:48, 8 November 2018 (UTC)
I would say that the synsets for nouns are already the Q-items. At least that is how I have used it. For instance, tape recorder (Q213777) is linked to WordNet via http://wordnet-rdf.princeton.edu/wn30/04392985-n and exact match (P2888). — Finn Årup Nielsen (fnielsen) (talk) 01:50, 13 November 2018 (UTC)

How to split or merge pronoun (Q36224)

Some individual pronoun (Q36224) can either get their own lexeme or be grouped into one lexeme. Consider the English we (L483) which groups we, us, our, ourselves, while in Danish (Q9035) I have split vi (L35288) (basic form) and vores (L35289) (possessive pronoun). For Danish (Q9035), the dictionary Den Danske Ordbog (Q1186741) split these words/lexemes [2] [3]. I am unsure which way is the best. If we do not split, it seems that individual forms can be attached to different word classes. The same might go for the etymology where Danish (Q9035) vi/vor is based on vár/várr. — Finn Årup Nielsen (fnielsen) (talk) 17:23, 4 November 2018 (UTC)

In the Indo-European languages, many possessive determiners can inflect by themselves, unlike genitive cases which don't have any further inflections. This makes me think that they should be treated as lexemes in their own right. —Rua (mew) 20:19, 4 November 2018 (UTC)
I was surprised when I saw that "my" wasn't entered as a lexeme of its own. That's not what I was expecting at all, so I would be in favour of splitting them for English. The online OED also has separate entries for them all. - Nikki (talk) 20:29, 4 November 2018 (UTC)
I think I took care of most of the pronouns in English; the grouping was based on my reading of how to handle lexemes in such cases, such as this question and discussion. There aren't many of them so I suppose they could be split - but then how do we link them properly to indicate they have such related meanings? ArthurPSmith (talk) 23:01, 4 November 2018 (UTC)
I don't think we have anything suitable right now, we don't have many properties for linking lexemes. - Nikki (talk) 09:37, 5 November 2018 (UTC)

Lexemes that are defined grammatically in terms of other lexemes

One way that English Wiktionary avoids having to re-define words several times is by using special definitions that refer to another lexeme. For example, a word might be defined as the verbal noun or passive of some other verb. An example is the Northern Sami pair gávdnat (L35329) "to find" and gávdnot (L35330) "to be found", where the latter is a passive derivation of the former. Is there a way to give something like "passive of gávdnat" as the sense, instead of repeating all the senses of the base verb but in passive form? —Rua (mew) 20:17, 4 November 2018 (UTC)

Why should you regard them as different lexemes and not forms of one lexeme? --Infovarius (talk) 10:39, 7 November 2018 (UTC)
Because they are full lexemes in their own right. Passive verbs are full verbs, and have an infinitive and all the forms that any other verb might have. You can even derive new lexemes from one. That latter point is important, because we can only derive lexemes from other lexemes with our properties. —Rua (mew) 11:19, 7 November 2018 (UTC)
For example, gávdnon (L35767) "occurrence" derives from the aforementioned passive verb. —Rua (mew) 11:28, 7 November 2018 (UTC)

List of properties for Lexemes and List of Lexemes by language

Where can I find a list of properties for lexemes? I think a list of properties with usage examples would be very helpful for understanding how and where to use them. It is quite difficult to understand which property should be used for what. A list of Lexemes by language would be very helpful for finding language-specific lexemes and editing them. Regards,-Nizil Shah (talk) 05:16, 6 November 2018 (UTC)

@Nizil Shah: A list of all lexeme-related properties is here: Template:Lexicographical properties. Some properties have Wikidata property example for lexemes (P5192) specified, but if you have a problem with how to use a property, just ask here. You can get a list of lexemes in your language in two ways. The easy one is to list all links to the language item in the lexeme namespace. For example, all lexemes in Gujarati (Q5137) are listed here. The second method is to run a query. Hope this helps. KaMan (talk) 11:17, 6 November 2018 (UTC)
@Nizil Shah: Besides the template that KaMan has linked, there is also Wikidata:List of properties/linguistics, or you can browse properties using Prop explorer. In any case you can also look at the showcased lexemes or already existing lexemes in your language and find inspiration there.--Micru (talk) 12:42, 6 November 2018 (UTC)
@Nizil Shah: You can also get a list of properties in Ordia: https://tools.wmflabs.org/ordia/property/ Finn Årup Nielsen (fnielsen) (talk) 01:36, 13 November 2018 (UTC)
Thank you for the lists. I will look into them and ask for clarification if I need help understanding any property. I will propose any properties that are missing.-Nizil Shah (talk) 06:25, 16 November 2018 (UTC)
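The "run a query" method mentioned above can be sketched as follows. This is only a minimal illustration, assuming the Wikidata Query Service and its predefined prefixes (dct:, wd:, wikibase:); the function name and the limit parameter are made up for this example. It only builds the query text, which can then be pasted into the query service.

```python
def lexemes_by_language_query(language_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query listing lexemes whose language is the given item.

    language_qid is the item ID of the language, e.g. "Q5137" for Gujarati.
    The function name and parameters are hypothetical, for illustration only.
    """
    if not (language_qid.startswith("Q") and language_qid[1:].isdigit()):
        raise ValueError(f"not an item ID: {language_qid!r}")
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  ?lexeme dct:language wd:{language_qid} ;\n"
        "          wikibase:lemma ?lemma .\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Gujarati (Q5137), as in the example above
    print(lexemes_by_language_query("Q5137"))
```

Swapping the item ID gives the same list for any other language that already has an item.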

Understanding properties and data model

I have been following the Lexeme project since its proposal. Over the years, the terms used in the data model and properties have become too technical for new people to understand, as well as for a person like me who has no linguistic knowledge. Sometimes I could not even understand a property or where to use it. I am working with the small Gujarati Wiktionary and other Gujarati Wiki people who were waiting for Senses in order to add 200000 words from a public domain dictionary. Now we are stuck because it has become difficult to explain the data model and which parts of a normal print dictionary should go where. Some technical terms like "Gloss" are difficult to understand or explain. A broad and simple explanation in the context of a print dictionary would be a great help to people like us. We tried to map our public domain print dictionary to Wikidata Lexemes (which thing should go where) but we are stuck. The print dictionary has a limited range of data types in it. If we can map them, we might be able to create a simple editing tool via OAuth for editing Wikidata Lexemes without being confused by too many things while editing. We could not even figure out Gujarati labels for properties and other technical labels due to the lack of a simple explanation. In short, people need a simpler explanation in the context of a print dictionary because not all editors are linguists. Properties should also be explained this way, with simple, clear examples. Can we have it? If Wikidata Lexemes wants to attract editors, it needs simplicity in explaining technical things. Regards,-Nizil Shah (talk) 05:47, 6 November 2018 (UTC)

  • Maybe it's easier if you list a sample entry and we try to figure out how to map it. The problem with properties is that many haven't been created yet. --- Jura 06:30, 6 November 2018 (UTC)
  • @Nizil Shah: Is Wikidata:Lexicographical data/Glossary in any way helpful? KaMan (talk) 11:23, 6 November 2018 (UTC)
  • The model of BhagwadGomadal Gujarati dictionary is something like this:
Word | Meaning No. (one word can have multiple meanings) | Origin: Origin Language + Origin Word in Gujarati with its Gujarati Meaning (can be multiple or single words for origin along with meaning of each origin word) | Grammar Category | (Subject of the word for this meaning e.g. Music or Computing) | Meaning: Gloss sentence? + Synonyms | More info: More info (detailed article-like info or short 2-3 sentence info) + More info related sentences etc. + Meaning of these sentences | Example: Example sentence + Example sentence Reference | Multiple or single Phrases: Phrase No. + Phrase + Its Meaning + Explanation of the Meaning
@KaMan, Jura1: It is complex and I have already drawn a flowchart to organise the information, but I could not understand which data can be handled where in Wikidata. Feel free to ask for clarification of the above model. I will try to answer. The digital non-machine-readable dictionary is available here.-Nizil Shah (talk) 07:20, 16 November 2018 (UTC)
Please explain with an example: what is a gloss and what is not?-Nizil Shah (talk) 07:38, 16 November 2018 (UTC)
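To make the print dictionary model above easier to discuss, it could be written down as a small data structure. The sketch below only restates the fields listed in the model; all class and field names are hypothetical, and the nesting (origin, example and phrases stored per meaning) is an assumption rather than something the dictionary specifies. As a tentative reading of the Lexeme data model, the word would correspond to the lemma, each numbered meaning to a Sense, and the short "gloss sentence" to that Sense's gloss; where the remaining fields should go is exactly the open question here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OriginWord:
    word: str       # origin word, written in Gujarati
    meaning: str    # Gujarati meaning of the origin word


@dataclass
class Phrase:
    number: int
    phrase: str
    meaning: str
    explanation: str = ""


@dataclass
class Meaning:
    number: int                      # "Meaning No."
    grammar_category: str            # e.g. noun, verb
    gloss: str                       # short "gloss sentence" for this meaning
    origin_language: Optional[str] = None
    origin_words: List[OriginWord] = field(default_factory=list)
    subject: Optional[str] = None    # e.g. Music or Computing
    synonyms: List[str] = field(default_factory=list)
    more_info: str = ""
    example_sentence: Optional[str] = None
    example_reference: Optional[str] = None
    phrases: List[Phrase] = field(default_factory=list)


@dataclass
class PrintEntry:
    word: str                        # headword of the print entry
    meanings: List[Meaning] = field(default_factory=list)


if __name__ == "__main__":
    # Placeholder values only; no real dictionary data is used here.
    entry = PrintEntry(
        word="<word>",
        meanings=[Meaning(number=1, grammar_category="<category>", gloss="<gloss>")],
    )
    print(len(entry.meanings), entry.meanings[0].gloss)
```

Filling in one real entry in this shape would give a concrete sample to map field by field.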

List of your lexemes that need senses

This URL lists all the lexemes you’ve created until 18 October 2018, the date senses became available. (Ever since, you’ve always added senses to your new lexemes, right? 😉) Perhaps it’s time to add some sense(s) to them? --Lucas Werkmeister (talk) 23:15, 7 November 2018 (UTC)

The other way is to query for all lexemes without any sense in your preferred language: here is an example for Esperanto. KaMan (talk) 10:00, 8 November 2018 (UTC)
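A query of that kind might look roughly like the sketch below. It is not the linked query itself, just one way to express "lexemes in a language with no sense", assuming the Wikidata Query Service prefixes (dct:, wd:, wikibase:, ontolex:). The function name is invented for the example; Q143 is the item for Esperanto.

```python
def lexemes_without_senses_query(language_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for lexemes of one language that have no sense yet.

    Hypothetical helper for illustration; it only returns the query text.
    """
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  ?lexeme dct:language wd:{language_qid} ;\n"
        "          wikibase:lemma ?lemma .\n"
        # Keep only lexemes for which no ontolex:sense triple exists
        "  FILTER NOT EXISTS { ?lexeme ontolex:sense ?sense }\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(lexemes_without_senses_query("Q143"))
```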

Forms that also have idiomatic meanings

How are cases handled where a form has acquired meanings that can't be predicted grammatically from which form it is, but are idiomatic to that particular form? An example that comes to mind is English broken, which has meanings that don't follow from it being the past participle of break. —Rua (mew) 17:30, 10 November 2018 (UTC)

I would say create separate lexemes for new set of form(s) of new meaning(s). KaMan (talk) 18:07, 10 November 2018 (UTC)
How would you define its etymology? —Rua (mew) 18:34, 10 November 2018 (UTC)
With derived from form (P5548) of derived from (P5191). KaMan (talk) 18:40, 10 November 2018 (UTC)
@Rua, KaMan: Almost(?) all English participles can act as adjectives; I'm not sure it's really worth having a separate lexeme for all of them. And almost all the senses I see in enwiktionary for "broken" are also possible to associate with the original verb. But one or two of them perhaps not, so yes a distinct lexeme for that sort of case makes sense to me. ArthurPSmith (talk) 19:11, 10 November 2018 (UTC)
According to Wikipedia, participles are adjectival or adverbial by definition, so it's not surprising to see them acting like adjectives. The question is what to do with the ones that have semantically separated from the verb and become independent words. Even if "broken" is not a good example, there are plenty of examples across languages that are. Another example I can think of is ukudla in Zulu, which is both the infinitive of -dla "to eat" and lexicalised in the meaning "food". —Rua (mew) 23:00, 10 November 2018 (UTC)

Lemmata for Latin verbs

In Wiktionary, the main lemma for most Latin verbs is the first person conjugation. However, most of the Latin verbs in Wikidata so far have the infinitive as the lemma. I want to start adding Latin verbs to Wikidata, but I'm not sure if I should set lemmata to the first person or the infinitive. Any advice? Liamjamesperritt (talk) 04:02, 11 November 2018 (UTC)

  • Just follow the existing ones, similar to other languages. --- Jura 04:16, 11 November 2018 (UTC)
  • There's no reason to have the "main" lemma in infinitive. AFAIK every serious dictionary uses first person, and so do many Wiktionaries, including the Latin one. Alas, the English and German Wiktionaries use infinitive lemmas, so you can expect a strong opposition from this side.--Shlomo (talk) 07:13, 11 November 2018 (UTC)
    Agreed with Shlomo, we should follow the way specialists and reference works on that language use. Pamputt (talk) 10:21, 11 November 2018 (UTC)
    Yeah, let's follow what specialist contributors already do. As Wiktionaries shouldn't be copied, we can't really follow them. --- Jura 12:57, 11 November 2018 (UTC)
    Many Wiktionaries follow standard dictionary protocols and contain a wealth of valuable lexical information, so why do you say that Wiktionaries should not be followed? Should we not do our best to align the Lexeme namespace with Wiktionaries in order to provide structured data support for Wiktionaries, as the Main namespace has done for Wikipedias, Wikiquotes, etc.? Liamjamesperritt (talk) 19:49, 11 November 2018 (UTC)
    This discussion is not about the information itself, but about structural issues. There are good reasons, why the data model of Wiktionaries shouldn't (and can't) be followed:
    1. The software used for powering Wiktionaries is text-based, which is an appropriate solution for Wikipedias, Wikisources, Wikibooks etc., not so much for a dictionary, which is primarily a database of lexical information. Still it can be used after introducing many workarounds and strict rules concerning the pages' structure. Wikidata software is more appropriate to process the lexical information as data and doesn't have to follow all the Wiktionaries' workarounds and limitations.
    2. There are many Wiktionaries and they are autonomous projects. Various Wiktionaries use different solutions for the problems mentioned sub (1) and Wikidata can't follow all of them. Even in the case discussed here, we can follow the en.wikt and de.wikt (etc.) and use infinitive as "main" lemma, or we can follow la.wikt and fr.wikt (etc.) and use 1st person sg. for this purpose. Or we can have multiple lemmas without saying, which one is the main one, and let the user decide — this is not possible in Wiktionaries, but it is possible here.
    --Shlomo (talk) 09:14, 12 November 2018 (UTC)
    Um, just for your information, en.wiktionary uses the first-person singular present active indicative as the lemma for Latin verbs. See wikt:WT:Lemmas. —Rua (mew) 10:43, 12 November 2018 (UTC)
    Ehm, thanks, mea culpa. It was stuck somewhere in my memory, which seems to be not so reliable any more ;)--Shlomo (talk) 16:48, 12 November 2018 (UTC)
    • As Wiktionaries aren't CC0, we can't import most of their content and model. Obviously we shouldn't otherwise these valuable projects would get aborted by another WMF project. We can still link to their pages and they can do the same. Lexemes at Wiktionary would probably have been preferable, but somehow a series of users we have hardly seen editing lexemes since wanted otherwise. Now people like you and me who actually contribute are stuck with the current situation. --- Jura 05:39, 13 November 2018 (UTC)
    • We should not be "stuck" with anything; everything can be changed. I agree Latin lemmata should follow standard practice among classicists (I am a lapsed classicist myself) and use the first-person present active indicative form for verbs. Ijon (talk) 00:07, 14 November 2018 (UTC)
  • Useful could be to include several forms in the lemma. --- Jura 12:57, 11 November 2018 (UTC)
    That's possible. The question was about the "main" lemma (whatever it is), which should be, per definition, only one. The way I understood it is, that main lemma is the one with plain language code (in this case la). The infinitive can be added as alternative lemma, e.g. with the code la-x-Q179230.--Shlomo (talk) 15:28, 11 November 2018 (UTC)
  • I don't see any problem with having multiple lemmata. I feel that the first person present tense makes most sense for Latin lemmata and I will start contributing as such, but also adding the infinitive for people who are more familiar with the infinitive as a lemma could be useful as well. Liamjamesperritt (talk) 19:24, 13 November 2018 (UTC)
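To illustrate the multiple-lemmata approach discussed in this thread, here is a rough sketch of what the lemma part of a lexeme could look like, with the main lemma under the plain code la and the infinitive under la-x-Q179230 as suggested above. The helper name is hypothetical, the dictionary shape is assumed to mirror the lemmas section of a lexeme's JSON, and the verb "amo"/"amare" is chosen only as an illustration.

```python
def build_lemmas(main_code, main_form, alternatives=None):
    """Return a lemmas mapping with one main lemma and optional variants.

    alternatives maps an item ID (used as a private-use suffix) to the
    spelling of the alternative lemma. Hypothetical helper, for illustration.
    """
    lemmas = {main_code: {"language": main_code, "value": main_form}}
    for variant_qid, form in (alternatives or {}).items():
        # Private-use code such as "la-x-Q179230", as mentioned above
        code = f"{main_code}-x-{variant_qid}"
        lemmas[code] = {"language": code, "value": form}
    return lemmas


if __name__ == "__main__":
    lemmas = build_lemmas("la", "amo", {"Q179230": "amare"})
    for code, lemma in lemmas.items():
        print(code, lemma["value"])
```

With this layout neither form has to be dropped: readers who expect the first person find it under la, and those who expect the infinitive find it under the private-use code.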

Senses are now displayed before Forms

Hello all,

Based on several requests, we now display the Senses section before the Forms one on Lexemes.

If you still see the Forms first on a Lexeme, you should purge the page or do an edit on it.

If you have any issue, the related ticket is this one; you can also ping me.

Cheers, Lea Lacroix (WMDE) (talk) 15:17, 12 November 2018 (UTC)

I noticed this change Friday or Saturday, was wondering when it would be announced - thanks! ArthurPSmith (talk) 17:11, 12 November 2018 (UTC)

Documenting dialects?

Does Lexeme support dialects yet? How can one mark a form one adds as belonging to a particular dialect, or as being a substandard (but very common) variant? Ijon (talk) 00:04, 14 November 2018 (UTC)

Or senses that only exist in a particular dialect. —Rua (mew) 00:26, 14 November 2018 (UTC)
There is a new property location of sense usage (P6084) that kind of addresses it (not so much dialect, but location), but I don't think there is anything like that for forms yet. Liamjamesperritt (talk) 01:07, 14 November 2018 (UTC)

Additional ISO 639-3 languages

What needs to happen for Lexeme to accept additional languages? I am going to be presenting to some speakers of the (indigenous) Noongar (or Nyungar) language of Western Australia, but Lexeme does not accept 'nys' as a valid language code. Ijon (talk) 00:11, 14 November 2018 (UTC)

There are still many language codes that have not yet been added to Wikidata. You can request to add language codes here. Liamjamesperritt (talk) 08:38, 14 November 2018 (UTC)
That page describes the process specifically for monolingual text properties. That's not what lexemes use. Perhaps @Lea Lacroix (WMDE): can tell us what the process for lexemes is. - Nikki (talk) 09:07, 14 November 2018 (UTC)
Could you describe what you're doing? nys is in the list of available languages, so it should be possible to use it. - Nikki (talk) 09:07, 14 November 2018 (UTC)
@Ijon: I tested the following process:
  • Enter the lemma
  • in the field Language of Lexeme, type "Noongar" and select the item
  • in the field Spelling variant of the lemma that appears, type "nys" and select the language
  • enter the lexical category
It seems to work. At which stage do you encounter an issue? Lea Lacroix (WMDE) (talk) 09:52, 14 November 2018 (UTC)
It was in adding a 'nys' gloss to an existing lexeme in another language. 'nys' was not accepted as a language there. Ijon (talk) 12:28, 15 November 2018 (UTC)
That's weird, I just tested here and it works. What kind of error do you get? Lea Lacroix (WMDE) (talk) 14:18, 15 November 2018 (UTC)
@Lea Lacroix (WMDE): I notice you didn't answer Ijon's original question. How do we get missing languages added? Noongar might already be there, but there are already plenty of others we're missing. - Nikki (talk) 21:46, 15 November 2018 (UTC)

I've just created balga (L37468) in Nyungar without problem. Meanwhile, it is always possible to use the general code "mis" (alone or with a more precise private-use suffix: mis-x-Qid) as a (more or less) temporary solution if no ISO code is working/available. Cdlt, VIGNERON (talk) 07:38, 16 November 2018 (UTC)

Marking of language genre/style/variety

In Danish dictionaries, a word may be associated with an indication of the style of usage of the word. For instance, vovhund (L194) would be associated with children's language (Q1741898) and fucking (L37283) associated with vulgarism (Q1521634). Other possibilities could be slang (Q8102), technical jargon, informal, etc. For vovse (L128), I have been using has quality (P1552) setting it to children's language (Q1741898). Is that property the best way or have we another property or way? — Finn Årup Nielsen (fnielsen) (talk) 17:05, 15 November 2018 (UTC)

has quality (P1552) doesn't seem quite right; maybe part of (P361) or a (new?) subproperty would be better? In some cases instance of (P31) might also work. Also this seems like it would be more associated with specific senses than the lexeme as a whole, generally, no? ArthurPSmith (talk) 18:53, 15 November 2018 (UTC)
Of the properties already available, I would agree that part of (P361) is the most appropriate. It should also definitely be placed on specific senses, not on the lexeme. Liamjamesperritt (talk) 08:08, 16 November 2018 (UTC)
I don't find part of (P361) appropriate for vulgarism (Q1521634), euphemism (Q83464) or humorous sense (Q58233068). I prefer instance of (P31) and I have already used it a lot. In Template:Lexicographical properties there is a section for categorisation of senses. KaMan (talk) 08:19, 16 November 2018 (UTC)