Wikidata talk:Lexicographical data

Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2020/05.

Milestone - 200k lexemes

My bot just created spiritualistically (L200000), the 200,000th lexeme, while importing Wiktionary adverbs! It means "in a way relating to being spiritual".  – The preceding unsigned comment was added by SixTwoEight (talk • contribs) at 22:04, 11 October 2019 (UTC).

Property suggestions for Lexeme/Form/Sense

On blank items, currently instance of (P31) and subclass of (P279) are suggested (see Help:Suggesters_and_selectors#Property_suggester about the feature). The idea is that the user would input a statement with one of these properties (not all).

I think it would be good to have some default suggestions for Lexemes/Form/Sense as well, e.g.

Are there other properties that could be included? --- Jura 11:01, 12 March 2020 (UTC)

For Form the main property is pronunciation (P7243) and two above as qualifiers. According to new model. --Infovarius (talk) 18:25, 12 March 2020 (UTC)
It seems to be used mainly for Russian and was added by bot, but why not. Latin could probably use it too. Only one of the properties needs to be chosen and one can always use others. I added it above and expanded others.--- Jura 21:54, 12 March 2020 (UTC)
If it's mainly added by bot and in Russian, maybe we should skip that. Also, maybe the property should have string datatype (but this is not really relevant to this discussion). --- Jura 16:37, 26 March 2020 (UTC)
Why only Russian? Or other languages have no variations in pronunciation of forms? --Infovarius (talk) 21:45, 27 March 2020 (UTC)
It was just an observation from current uses. --- Jura 13:58, 24 April 2020 (UTC)
For forms there could also be hyphenation (P5279). And language style (P6191) for both forms and senses. --Lexicolover (talk) 11:04, 13 March 2020 (UTC)
Sure. I added them. --- Jura 16:37, 26 March 2020 (UTC)

@Lea Lacroix (WMDE): is there a way to activate this? --- Jura 13:58, 24 April 2020 (UTC)

We would need to estimate how much work would be required for such a feature. I can't promise that it could happen any time soon, as our roadmap is already packed. Lea Lacroix (WMDE) (talk) 14:58, 27 April 2020 (UTC)

Common misspellings data

Hi. I recently found http://petnoga.se/ and really liked that they provide suggestions for correct spelling. I'm wondering if it would be a fruitful endeavour to collect common misspellings under our lexeme forms? This would enable us to easily make something like that. If yes, then I suggest we team up and ask LibreOffice for help with collecting the data we need. WDYT?--So9q (talk) 10:50, 19 March 2020 (UTC)

I personally do not think misspellings should be listed as forms. Perhaps they could be listed with regular statements using a special property. ArthurPSmith (talk) 17:34, 19 March 2020 (UTC)
I do think listing misspellings as forms is a great idea, especially for common misspellings. Also, the line between misspellings and variants can be blurry, especially for poorly documented or mostly oral languages. And even for well documented languages it is still useful: for example, in French "occurence" (with only one "r", instead of "occurrence" with two) is (sadly) very common, almost 20% in the 80s according to Google Ngrams. Cheers, VIGNERON (talk) 20:05, 19 March 2020 (UTC)
Hmm, I guess that's a good point, and we do allow archaic forms... English also has some very common misspellings - for example Nobody can spell “fuchsia”. ArthurPSmith (talk) 17:06, 20 March 2020 (UTC)
Interesting statistic @VIGNERON:. Here is the equivalent for fuchsia which is actually only half as bad as your example.
I added a misspelled form to fuchsia (L3280), with has quality (P1552) misspelling (Q1984758) / of (P642) fuchsia (Q5005364), but is it better to create a separate lexeme and add the form there? Also, which is best: to refer to fuchsia (Q5005364) or to the correct form (L3280-F1)? (Right now the latter seems impossible, as form support is missing in the input field.) WDYT?--So9q (talk) 09:51, 21 March 2020 (UTC)
Oh, I actually meant what Arthur suggested: list them with regular statements using a special property, e.g. common misspellings: misspelling 1, misspelling 2. But having them as forms with a property indicating a misspelling might be a better way to model it. Obviously this also calls for annotating non-misspellings with at least a reference to an authoritative source, e.g. the Danish correct spelling dictionary [1]. @fnielsen: What do you think about misspellings and modeling them?--So9q (talk) 12:54, 20 March 2020 (UTC)
I'm not 100% sure about the best way to model it; fuchsia (L3280) looks good though. The ideal would be a way to do all sorts of annotation, not only "misspelling" but also "variant", "archaic form", and so on. Cdlt, VIGNERON (talk) 10:08, 21 March 2020 (UTC)
I agree, I created a new property proposal for misspellings.--So9q (talk) 21:29, 22 March 2020 (UTC)
Misspelling lexeme forms design mockup
I just added two misspellings to https://www.wikidata.org/wiki/Lexeme:L36116 but, seeing there are already 8 forms and imagining 8 forms for each wrong spelling, we would end up with a mess of 24 forms on one lexeme. It might be better to create a misspelling-lexeme like [2]. Creating these would really be helped by e.g. updating Lexeme forms to help map from the misspelled form to the correct one, side by side/drag and drop or intelligently by analyzing grammatical features. The tool could also visually enable adding one new correct lexeme and one or multiple misspelling-lexemes side by side, with propagation to minimize typing; see the mockup to the right. WDYT @Lucas Werkmeister:?--So9q (talk) 21:41, 22 March 2020 (UTC)
In Wikipedia there are already a lot of misspellings documented: Swedish, English. If we could find a way to import these we would have a really good start IMO; they might be covered by copyright though.--So9q (talk) 22:28, 22 March 2020 (UTC)
It was previously proposed that misspellings be stored as individual lexemes: see Wikidata:Property proposal/correct form. Alternatively, misspellings could be stored as statements.--GZWDer (talk) 00:49, 23 March 2020 (UTC)

I thought about it and I think the best approach is to have misspellings as statements for forms:

Correct Form (Lxxxx-Fx) --> Common misspelling (Pxxxx) :  misspelled word

IMO misspellings are neither lexemes nor forms. There is one thing I am not sure of: what actually is a common misspelling? How do we define it? We all probably have some idea of what it is, and that idea is probably different for each of us. There would be stricter opinions and looser opinions. Would it differ language by language? Should it always be attested by linguistic science? It is probably not a good thing to have future arguments about the number of Google search hits. --Lexicolover (talk) 20:58, 23 March 2020 (UTC)
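
As a rough illustration of the statement-based model above, here is a minimal Python sketch of what such a statement on a form could look like in the usual Wikibase JSON layout. The property ID P0000 is a hypothetical placeholder (no such property exists as of this discussion), the value "fuschia" is only an example, and the helper name is made up for this note.

```python
import json

# Hypothetical placeholder: the "common misspelling" property does not exist yet.
COMMON_MISSPELLING_PROPERTY = "P0000"


def misspelling_statement(misspelled_word, property_id=COMMON_MISSPELLING_PROPERTY):
    """Build a string-valued statement in the usual Wikibase JSON layout."""
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {"value": misspelled_word, "type": "string"},
        },
    }


# Statements keyed by the form they would be attached to (form ID from the thread).
statements_by_form = {"L3280-F1": [misspelling_statement("fuschia")]}
print(json.dumps(statements_by_form, indent=2))
```

With this shape the misspelling stays a plain string on the correct form, so the form list of the lexeme itself does not grow.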

This looks good to me. I changed my proposal to reflect this. https://www.wikidata.org/wiki/Wikidata:Property_proposal/correct_spelling --So9q (talk) 22:19, 23 March 2020 (UTC)
I agree we should define "common misspelling". Here are my suggestions: the misspelling has to appear in one of the following sources:
  1. an authoritative source such as e.g. Retskrivningsordbogen (Q3398246)
  2. one of Wikipedia's lists of misspellings, e.g. [3]
  3. WD itself, with P31 misspelling (Q1984758), e.g. Rzehakinacea (Q33188867)
  4. a corpus approved explicitly by this community, with an occurrence over a certain threshold (yet to be created and decided).  – The preceding unsigned comment was added by So9q (talk • contribs).
  • I think the forms should be included as forms, not in some random string property. Accordingly, I prefer the initial version of the proposal. Wikidata is generally descriptive, not prescriptive. --- Jura 16:31, 26 March 2020 (UTC)
    @Jura1: In my understanding misspellings are not forms. Forms have some function in the language, misspellings don't. There might be misspellings that match other forms of the word (different form, archaic form, colloquial form etc.) but there is no form that has the "quality" of being a misspelling - such a thing has no use for the language.--Lexicolover (talk) 21:59, 27 March 2020 (UTC)
    • So you prefer the normative view? Some languages have a language board with a fairly authoritative view of its role or in some countries, there is an official list of first names to use. --- Jura 12:33, 1 April 2020 (UTC)
@So9q: Thank you for your effort. 1) is okay for me; 2) and 3) are IMO not good - it's basically self-referencing. The link at 2) IMO combines misspellings and mistypings (there's a bit of a difference for me); the example at 3) is not even a misspelling, it's an incorrect name. 4) might be good but I am afraid most corpora would not be annotated in a way that lets us decide which word is a real misspelling.--Lexicolover (talk) 21:59, 27 March 2020 (UTC)

User box for the Lexicographical data project ?

Hi,

the title is the question :) is there already a user box? If not, could we create one? --Hsarrazin (talk) 12:55, 1 April 2020 (UTC)

I don't think there is one yet. Feel free to create one :) Lea Lacroix (WMDE) (talk) 12:27, 3 April 2020 (UTC)
✓ Done {{User LexData}} (see also #Wikidata:WikiProject Lexicographical Data ?). Cheers, VIGNERON (talk) 09:28, 24 May 2020 (UTC)

How to model the two different comparisons of Germanic languages?

Hi, I discussed briefly with Lucas W about how to model this and we did not reach a solution so now I would like your input. My example where both syntactic and morphological comparison applies was this:

1) happy happier happiest

2) happy, more happy, most happy

So "the most happy day of my life" == "the happiest day of my life"?

If this is correct then I believe we should have a total of 5 forms on happy (L1349) but today we only have the -ier -iest forms. I think that is an omission we should decide if we want to correct. WDYT?--So9q (talk) 17:05, 1 April 2020 (UTC)

@So9q: I don't know about other languages, but in English I would view the "more" and "most" variants as applicable to any adjective or adverb; "more" and "most" are lexemes in themselves that have those comparative meanings, and there's no need to add them as forms. Most English adjectives do not have "-er" and "-est" ending forms, but require "more" and "most" to create those comparatives. So I think it's most useful just to add those single-word forms where they do exist, and not add extra forms with "more" and "most". ArthurPSmith (talk) 19:37, 1 April 2020 (UTC)

Huge increase of number of Forms

Hello all,

We noticed a huge increase of the number of Forms these past few days (+100K) (see Grafana board). Is one of you working on an import project?

Thanks, Lea Lacroix (WMDE) (talk) 12:39, 3 April 2020 (UTC)

I'm guessing that at least some of that increase is due to @Uziel302: who is working on adding forms in Latin (and with Latin declension the numbers of forms go up very quickly!) Cdlt, VIGNERON (talk) 14:49, 3 April 2020 (UTC)
Lea Lacroix (WMDE), indeed I've uploaded around 1M forms with their lexicographical analysis based on Whitaker's WORDS. The graph is confusing, mixing all types on the same graph. Let me know if any issue was found in the import. Uziel302 (talk) 20:47, 3 April 2020 (UTC)
The Elhuyar Fundazioa bot has also been adding some more Basque lexemes lately. Mahir256 (talk) 20:52, 3 April 2020 (UTC)
User:Elhuyar Fundazioa is indeed uploading forms but since he uploads each form in a separate edit, it takes more than an hour for him to upload a thousand forms. It also affects load on the system and we reach high maxlag. I think he should group the forms and upload them all in one edit like I did in my last upload. Code is here. Uziel302 (talk) 21:04, 3 April 2020 (UTC)
I've finished uploading Latin forms now, about 1.1M were uploaded in total.Uziel302 (talk) 13:41, 6 April 2020 (UTC)
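
For readers wondering what "upload them all in one edit" can look like, here is a minimal sketch (not the code linked above). It assumes the `wbeditentity` action with a `forms` array in which each new form carries `"add": ""`; the lexeme ID and the grammatical feature IDs are placeholders to be replaced with real ones, and the function name is made up for this note.

```python
import json


def build_forms_data(forms, language_code):
    """Return the JSON `data` string for one edit that adds every form at once.

    `forms` is a list of (representation, [grammatical feature item IDs]) pairs.
    """
    return json.dumps(
        {
            "forms": [
                {
                    "add": "",  # marks the form as new
                    "representations": {
                        language_code: {"language": language_code, "value": text}
                    },
                    "grammaticalFeatures": list(features),
                    "claims": [],
                }
                for text, features in forms
            ]
        },
        ensure_ascii=False,
    )


# Illustrative input; the feature IDs and the lexeme ID are placeholders.
latin_forms = [
    ("rosa", ["Q_NOMINATIVE", "Q_SINGULAR"]),
    ("rosae", ["Q_NOMINATIVE", "Q_PLURAL"]),
]
data = build_forms_data(latin_forms, "la")
params = {"action": "wbeditentity", "id": "L_EXAMPLE", "data": data}
```

A real request would additionally need an edit token and, ideally, the `maxlag` parameter; the point here is only that a thousand forms can become one request instead of a thousand.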

Adding Kurdish data

Hi,

Sorry in advance, if it's not the best place to ask this question.

I am wondering if there is a way to import data automatically and merge them with the existing entries or create new ones. We have developed a few dictionaries for three dialects of the Kurdish languages, namely Hawrami, Sorani and Kurmanji. The datasets (available here) are originally created in Ontolex-Lemon, particularly using the Lexicog module.

Any help would be greatly appreciated.

--Sina.ahm (talk) 23:41, 5 April 2020 (UTC)

Hi Sina.ahm, this is the right place to ask for help with this kind of question/request. The first point to solve, before talking about the technical side, is the licence. On the Github repository, I see that your data are released under CC by-nc-sa. This licence is not compatible with Wikidata or any other Wikimedia project. Wikidata uses the CC-0 licence which, in short, is very close to the public domain. Other Wikimedia projects mostly use the CC by-sa licence, which is similar to the licence you have chosen but also authorises reuse of the data for commercial purposes. So, before going further, you should look into changing the licence of your data. Changing to CC-0 will allow you to import your data into Wikidata. Moving to CC by-sa will allow you to import your lexicographical data into any Wiktionary project. Pamputt (talk) 06:44, 6 April 2020 (UTC)
Sina.ahm, Pamputt, this is true for copyrightable content, grammatical information is not copyrightable so you may upload it and leave the copyrightable data aside (definitions, encyclopedic information etc.). Uziel302 (talk) 13:46, 6 April 2020 (UTC)
Thanks very much, Pamputt and Uziel302. As the owner of the data, I will change the license to the appropriate one. Assuming that the problem with the license is solved, is there a bot or a tool to import the data automatically? Where should I start importing? --Sina.ahm (talk) 13:49, 6 April 2020 (UTC)
Sina.ahm, you can upload the data using the Wikibase API; this is the code I used to upload Latin lexemes. You should do some testing and, when your script is ready and has some valid example edits, you should request permission to run as a bot. Uziel302 (talk) 14:09, 6 April 2020 (UTC)
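
To give an idea of the starting point, here is a minimal sketch of the request parameters for creating one lexeme through the Wikibase API. It is not the linked script. It assumes the `wbeditentity` action with `new=lexeme`; the lemma "av", the language code "kmr" and the language item `Q_LANGUAGE_ITEM` are illustrative placeholders that must be checked against Wikidata, and the function name is made up for this note.

```python
import json


def new_lexeme_params(lemma, lemma_language_code, language_item, category_item):
    """Build request parameters for creating one lexeme via wbeditentity."""
    data = {
        "type": "lexeme",
        "lemmas": {
            lemma_language_code: {"language": lemma_language_code, "value": lemma}
        },
        "language": language_item,
        "lexicalCategory": category_item,
    }
    return {
        "action": "wbeditentity",
        "new": "lexeme",
        "format": "json",
        "data": json.dumps(data, ensure_ascii=False),
        # A real request also needs a CSRF token from an authenticated session.
    }


# Q1084 is the item for "noun"; the language item below is a placeholder.
params = new_lexeme_params("av", "kmr", "Q_LANGUAGE_ITEM", "Q1084")
print(params["data"])
```

Converting each Ontolex-Lemon entry into such a dictionary first, and only then sending requests, makes it easy to review a handful of test edits before asking for bot permission.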

New LexData version 0.2

I just released version 0.2 of LexData – my Python library to create and edit Lexicographical data. The changes include:

  • Support for bot-passwords
  • Respecting and properly handling maxlag
  • Claims can be added to existing Lexemes, Forms and Senses
  • Claims of arbitrary data types can be created
  • Updates via JSON are possible
  • Improved search
  • I added tests, so there should hopefully be considerably fewer bugs and regressions
  • Consistent logging
  • and more…
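
For readers unfamiliar with maxlag, the general idea of "respecting" it can be sketched as follows. This is not LexData's actual implementation, only an illustration under stated assumptions: the API answers with an error whose code is `maxlag` and suggests a delay in a `Retry-After` header. The function name and its parameters are made up for this note.

```python
import time


def call_with_maxlag_retry(send, max_attempts=5, default_wait=5.0, sleep=time.sleep):
    """Call `send()` until the API stops answering with a maxlag error.

    `send` must return (body, headers): the decoded JSON and the response headers.
    """
    for _ in range(max_attempts):
        body, headers = send()
        if body.get("error", {}).get("code") != "maxlag":
            return body
        # The API suggests a delay in the Retry-After header; fall back if absent.
        sleep(float(headers.get("Retry-After", default_wait)))
    raise RuntimeError(f"still lagged after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.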

As before you can get it from Github or PyPI and the documentation can be found here. There are some minor incompatible changes but I tried to keep it compatible with the 0.1 version branch and only deprecate the old APIs. In most cases no changes should be necessary. Happy Hacking! -- MichaelSchoenitzer (talk) 22:39, 17 April 2020 (UTC)

Very nice! I hope it will be put to good use in our tooling going forward.--So9q (talk) 22:52, 23 April 2020 (UTC)

Described by Wiktionaries

Hello,

It seems complicated to use wikibase:sitelinks to a Wiktionary, so I suggest using P1343 to indicate when a form is described in a Wiktionary and when a sense is described. I made a test to say that the sequence of letters chat is described in French Wiktionary, and that the meaning associated with a pet is also described in French Wiktionary. Do you think it is acceptable? Another strategy may be to create a series of properties similar to P7829 (for Wiktionaries instead of Vikidia) to add links to the dedicated pages. Noé (talk) 07:52, 18 April 2020 (UTC)

@Noé: This seems reasonable to me, but it would be nice to also include a link to the specific Wiktionary page referred to, perhaps with a reference URL (P854) reference statement, or maybe there's a qualifier that would work? ArthurPSmith (talk) 15:16, 20 April 2020 (UTC)
I don't have any preference, but I'll be pleased to also include links to the Wiktionary pages and to the meanings (with an anchor to a definition when available). If a Sense is also connected via P5137 to a Qid with a Wikipedia page, we may be able to have one or more links to Wiktionary in the side menu on the Wikipedia page, next to the other projects. That may definitively prove that Lexemes answer the initial goal of Wikidata:Wiktionary, helping Wiktionaries. Noé (talk) 15:34, 20 April 2020 (UTC)
I love the idea of linking Lexemes to Wiktionary (this was indeed one of the original goals after all), and vice versa (that's exactly why I built the gadget mentioned here: #Gadget to link Wiktionary and Wikidata). This is definitely an idea worth looking deeper into and doing right. So for me, it's a big YES, this is acceptable and even desirable!
In practice, I'm not sure what the best way is: it could be described by source (P1343) (alone or with a qualifier, e.g. reference URL (P854)), it could be a linking property, or anything else. I'm also wondering what the best place is: described by source (P1343) is often used at the main Lexeme level (again alone or with qualifiers, for instance demonstrates form (P5830)); in the test, it is on both sense and form. It makes sense but I'm not sure if it's really optimal.
If we put it at form level and/or indicate the form as a qualifier, then there is no need to explicitly store the link, as the link is trivial to build (it could probably even be done by JavaScript inside the Wikidata interface itself), and even without this precision, most of the time the main lemma can be used by default. In fact, it seems to me a more reliable and elegant way to do it than using reference URL (P854), and even more so than a solution like English Vikidia ID (P7829). Caveat: it may depend both on the granularity wished for (linking to the page itself or to an anchor in the page) and on the Wiktionary targeted, as each Wiktionary has a different structure; on the other hand, it allows reusers to build links as they wish ;).
Cdlt, VIGNERON (talk) 16:59, 20 April 2020 (UTC)
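
To show how little is needed to build such a link from a form representation or lemma, here is a minimal Python sketch. It only covers page-level links; anchors to a specific definition depend on each Wiktionary's structure and are left out. The function name is made up for this note.

```python
from urllib.parse import quote


def wiktionary_url(representation, wiki_language="fr"):
    """Build the URL of the Wiktionary page named after a lemma or form."""
    title = representation.replace(" ", "_")  # MediaWiki titles use underscores
    return f"https://{wiki_language}.wiktionary.org/wiki/{quote(title, safe='')}"


print(wiktionary_url("chat"))       # https://fr.wiktionary.org/wiki/chat
print(wiktionary_url("kilomètre"))  # https://fr.wiktionary.org/wiki/kilom%C3%A8tre
```

Whether the target page actually exists still has to be checked separately.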
Hey, thanks to VIGNERON's live contribution yesterday, I was able to write a query to get all French entries that have a connection to a Qid associated with a Wikipedia page: https://w.wiki/Qhc
I am quite happy to see 500+ pages. I am wondering now if it could be a good sample for a proof of concept of the idea set out above. So, it would be great if someone had an idea on how to check whether those pages exist in French Wiktionary and then automatically add this statement to those entries. Then, when the last column of the query is filled in (for now it is only filled for "chat"; you can search for it in the results), it would be possible to develop a gadget to indicate in Wiktionary the related Wikipedia page, and in Wikipedia the related French Wiktionary pages. It could be nice! Noé (talk) 08:45, 13 May 2020 (UTC)
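
For readers who cannot open the short link, a query of this kind could look roughly like the sketch below. This is an assumption about its general shape, not the actual query behind https://w.wiki/Qhc: it selects French (Q150) lexemes whose senses point via P5137 to an item that has a French Wikipedia article. The variable and function names are made up for this note.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed shape of the query; the prefixes used are predefined by the query service.
QUERY = """
SELECT ?lexeme ?lemma ?sense ?item ?article WHERE {
  ?lexeme dct:language wd:Q150 ;
          wikibase:lemma ?lemma ;
          ontolex:sense ?sense .
  ?sense wdt:P5137 ?item .
  ?article schema:about ?item ;
           schema:isPartOf <https://fr.wikipedia.org/> .
}
"""


def query_url(query=QUERY):
    """Return a GET URL that asks the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


print(query_url())
```

The lemmas returned could then be checked against French Wiktionary page titles before any statement is added.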
@Noé: the gadget I mentioned earlier, #Gadget to link Wiktionary and Wikidata, could be adapted for your purpose. You should contact @Darmo117:, who helped me build this gadget, for the tech part, and obviously I can help with the SPARQL part. Cheers, VIGNERON (talk) 09:28, 13 May 2020 (UTC)

Mapping of Lexicographical data model

I was wondering if the data model itself can have statements that apply external mapping? I am doing most of the Schema.org/Wikidata mapping, as some of you know, and I noticed that the Lexicographical data model could possibly be mapped to other ontologies where it makes sense. For example, I recently made a small edit to this page, adding an "equivalent to: skos:definition" note to the Gloss explanation: "Enter a gloss (very short phrase defining the meaning)(equivalent to: skos:definition) — Gloss", but ideally this would be applied in the data model itself somehow? Thadguidry (talk) 17:05, 20 April 2020 (UTC)

@Thadguidry: good idea, I think this has at least partly already been done for the query service (where you indeed use skos:definition to get the gloss). I'm not sure exactly where it has been documented though; I used mw:Extension:WikibaseLexeme/RDF mapping when I started to do SPARQL on Lexemes, and I'm not sure if there is more or better documentation somewhere, @Lea Lacroix (WMDE):? Anyway, integrating it into the main documentation page sounds like a really good idea. Cdlt, VIGNERON (talk) 17:33, 20 April 2020 (UTC)
Hey @Thadguidry:,
I don't fully understand what you mean. Do you want to improve the documentation? Or the descriptions on Lexemes themselves? Can you provide some examples so I can understand what the request is? Thanks! Lea Lacroix (WMDE) (talk) 07:08, 21 April 2020 (UTC)
Hi @Lea Lacroix (WMDE):!,
Sure, we already have a few properties that allow mapping both Properties and Entities via external ontologies, such as P2888 exact match and P2236 external subproperty and others. Here's an example of the mapping being done inside Wikidata itself: P2561 name. So the idea is that we should be able to apply a few statements or assertions on the various fields (like Gloss) in the Wikibase Lexeme RDF mapping itself, showing relations to external ontologies, as well as on individual Senses. For example: Gloss is already mapped, but hardcoded, so I think it's best to expose this somehow to allow edits to be made, as there are other ontologies (now as shown in LOV, and in future) that should be aligned (not just Lemon and SKOS). If Wikibase Lexeme does allow writing this kind of information, then that is the documentation that I am missing. Thadguidry (talk) 12:29, 21 April 2020 (UTC)
UPDATE: It looks like it IS POSSIBLE to directly apply the mapping however!! (This wasn't working for me initially several months ago and I don't know why.) Take a look at the "exact match" I applied here for the concept/Sense of "first". So, I think the only thing that is left to do is documentation showing how to apply mapping to the Lex data model itself? How was Gloss linked to skos:definition in the RDF? And as a user, I would probably expect to see that information on Wikidata:Lexicographical_data/Documentation Thadguidry (talk) 12:29, 21 April 2020 (UTC)
Thanks a lot @Thadguidry:! Yes, I think Wikidata:Lexicographical_data/Documentation would be a good place to mention the mapping of the Lexeme data model to a few other models. Feel free to continue improving the page :) Lea Lacroix (WMDE) (talk) 16:01, 21 April 2020 (UTC)
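
As a concrete illustration of the mapping mentioned in this thread (glosses being exposed as skos:definition in the query service), here is a minimal Python sketch that builds a SPARQL query for the glosses of one lexeme. It assumes that senses are attached with ontolex:sense, as in the RDF mapping page cited above; the function name is made up for this note.

```python
def gloss_query(lexeme_id):
    """Return a SPARQL query listing the senses and glosses of one lexeme."""
    if not (lexeme_id.startswith("L") and lexeme_id[1:].isdigit()):
        raise ValueError(f"not a lexeme ID: {lexeme_id!r}")
    return (
        "SELECT ?sense ?gloss WHERE {\n"
        f"  wd:{lexeme_id} ontolex:sense ?sense .\n"
        "  ?sense skos:definition ?gloss .\n"
        "}"
    )


print(gloss_query("L3280"))
```

A short example of this kind could sit next to the Gloss explanation on the documentation page, so that the mapping is visible where users look for it.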

Abbreviation as separate lexeme or not?

See kilomètre (L19811) (has "km" as form). --So9q (talk) 10:19, 28 April 2020 (UTC)

  • In some languages (pl?), they are separate lexemes to allow addition of the usual 100? forms present on any lexeme. --- Jura 10:36, 28 April 2020 (UTC)
I think this would strongly depend on language. When I was thinking how would I deal with abbreviations in Czech language I did not make any final conclusion. Generally I would follow some self-imposed rules with reasoned exceptions: 1) Abbreviations that act and are read as any other word (=acronyms?) should be separate lexemes; 2) abbreviations that are meant for writing only and are usually read unabbreviated (ex.; b.; etc.) should not be considered separate lexemes, IMO some kind of (new?) property would be best for them but listing them as forms might work as well; 3) Symbols should not be considered separate lexemes and some kind of (new?) property would be best for them (and I would not list them as forms). --Lexicolover (talk) 20:57, 1 May 2020 (UTC)

New Version of MachtSinn

Many of you might already know my tool "MachtSinn", that allows users to easily add senses to lexemes, generated from wikidata items. In the last weeks I have again improved it significantly. You might want to take another look at it.

Compared with the first version announced here some time ago, there are a lot of improvements:

  • You can add glosses not only in the lexeme's language, but in arbitrary languages at the same time!
  • There are now Keyboard shortcuts
  • New address: Machtsinn.toolforge.org – thanks to the Cloud Services team at WMF
  • More potential matches especially for languages with smaller Wikipedias
  • Fewer false positives – especially for English
  • To have fewer false-positives it's possible to add separate optimized queries for each language (Currently only English).
  • Improved, responsive design – should work on mobile devices
  • English verbs are prefixed with to, German nouns are prefixed with Der/Die/Das. You can add prefixes for other languages.
  • You can edit the gloss, if it is not an appropriate description
  • Your browser language is used as default language
  • The grammatical gender and lexicographical category is displayed
  • The database is regularly updated
  • There is a help text and improved statistics
  • The amount of network traffic is minimized so that it should work well even with a slow connection
  • Fixed lots of bugs

Big thanks to @Incabell, @DDuplinszki, @Ainali, @so9q and @dzarmola for their help. And thanks to everyone using it to add senses – already 15'000 senses have been added. 35'000 potential matches are currently waiting for you. ;) The code is available on github – Pull Requests are welcome. Have fun and stay safe. -- MichaelSchoenitzer (talk) 23:26, 30 April 2020 (UTC)

CEFR language competence level for lexeme

Hi! For many languages, there are so-called "word lists" published, which a person whose competence corresponds to a particular level (say, A2) is expected to know. I think the community would benefit a lot if those lists could be imported into Wikidata Lexemes as well, so that on the lexeme page we could see which CEFR (or any other scale) level the word corresponds to. But I could not find any suitable property. What do we think about this? Could a property be added, and what should the ontology be?  – The preceding unsigned comment was added by 62mkv (talk • contribs).

I have been wondering how best to mark that a word appears in the Goethe Institute's word lists for German. I think rather than a specific property for CEFR or level, I would use something more general, such as "on word list", with a link to an item like "Goethe Institute B1 word list" which would have information about who published it, when and where. That way we could link to non-CEFR word lists too. - Nikki (talk) 10:11, 5 May 2020 (UTC)
Sounds good to me.--So9q (talk) 16:15, 8 May 2020 (UTC)

How broad should the senses reach?

Hi, I stumbled upon this lexeme today https://www.wikidata.org/wiki/Lexeme:L58286. It has senses not only covering the sense in the language of the lexeme but the similar concept in other countries. I imagine we could continue down this road and add senses for the similar concept in all countries that have it. But is that really what we want?--So9q (talk) 03:59, 9 May 2020 (UTC)

  • @So9q: I have been generally NOT adding country-specific senses even where Wikidata has items - military ranks are a very common case, for example lieutenant has both general items and specific items for "British Army and Royal Marines", "french military", "Canadian Armed Forces", "Royal Navy", "Starfleet" etc. I would advocate for only adding the general meanings as senses, the specific ones really don't add anything significant to the meaning. ArthurPSmith (talk) 14:38, 11 May 2020 (UTC)
    • Anyone else have an opinion on this? I agree with @ArthurPSmith:.--So9q (talk) 18:52, 13 May 2020 (UTC)
  • @So9q: When adding senses to Russian lexemes I run into the same problem (especially with military ranks too). I tend to do the same way as ArthurPSmith. --Infovarius (talk) 02:57, 15 May 2020 (UTC)
  • Sorry to break the consensus but I disagree a bit ;). For me, specific senses seem both useful and necessary in a lot of cases. Despite having the same name and etymology, some senses can cover very different realities (maybe they should even be separate Lexemes? probably not but I wonder...). For instance canton (L18778): in Switzerland a "canton" is a very big administrative unit (similar to a region or a state in the United States) while in France, a "canton" is a very small administrative unit (similar to a county or more often to a quarter of a city). That said, the structure in place on socken (L58286) is maybe not the best (and by the way, I'm notifying @Vesihiisi:, who is the best person to talk about this lexeme she created ;)) and maybe we can find a better one, as these are indeed not truly separate senses but more "derived" senses. Cheers, VIGNERON (talk) 14:42, 15 May 2020 (UTC)

Wikidata:WikiProject Lexicographical Data ?

Hello,

Is there a team of lexicographers hidden somewhere? Have the people adding lexicographical data already gathered around a place where initiatives and personal projects are discussed? And finally, is there a logo for this team/group/bunch of people or for Lexicolovers in general? I was looking for a userbox icon showing interest in Lexeme data but I haven't found one. Is it really too early to create a team spirit here? Noé (talk) 06:34, 16 May 2020 (UTC)

@Noé: the « team » is hidden here in plain sight ;)
Yes, again it's here. Welcome!
Not that I know of, but feel free to propose one!
This is the second time (at least, after #User box for the Lexicographical data project ?) it has been asked, so I created one: {{User LexData}} (with the glyph of ama/𒂼 (L1) as the image in the meantime; can someone activate the translation tag?). Team spirit exists without symbols, but symbols are indeed useful for team spirit.
Cheers, VIGNERON (talk) 10:05, 16 May 2020 (UTC)
So, if it's a WikiProject, I recategorized it into the category for WikiProjects. For the team, there is no list of participants, like for other projects, but that is maybe more a Wikipedian habit than a Wikidata one. Great for the userbox! I thought, for the logo, of an L made of Wikidata lines, with the same colors, but it could be hard to read at a small size. The ama sign is pretty! Noé (talk) 10:29, 16 May 2020 (UTC)

Gender in French

@Lepticed7: Does Lexeme:L241 really have 2 different genders? I'd propose to separate this into 2 lexemes: "chien" and "chienne" with definite genders. --Infovarius (talk) 00:20, 17 May 2020 (UTC)

@Infovarius: Hi! The fact is that in French, two things characterize nouns: gender and number. For gender, we have both masculine and feminine, and (almost?) every noun in French is either masculine, feminine, or sometimes both (it can change depending on multiple factors; one word in this case is « chips »). We even have nouns that are masculine when singular, but feminine when plural (like « amour » (love)). For the classic pet animals (dog, cat) or for farm animals (cow, pig, chicken, etc.), we have a version of the word to identify male animals (« chien », « chat », « canard », « verrat »), and these words are masculine; and we have feminine words to identify female animals (« chienne », « chatte », « cane », « truie »). Some are "just" inflections (is this the right word?) using suffixes, generally « -e », and some are not (like « verrat » and « truie »). It would be great to have more points of view on this topic, but because gender and number characterize a form, not a lexeme, and words in French (nouns or adjectives) are presented by giving this pair, I think we should present this information not on the lexeme but on the forms. And we should not separate these two pieces of information. Lepticed7 (talk) 02:04, 17 May 2020 (UTC)
flexion in French is inflection in English, désinence is verbal inflection. It sounds like a heavy metal band to me! Noé (talk) 08:55, 17 May 2020 (UTC)
My opinion is that "verrat"/"truie" are not different forms of one word but different words (linked by some relation). We have the same in Russian: кабан/свинья. Nouns are not inflected by gender (like adjectives)! They have gender! --Infovarius (talk) 23:52, 18 May 2020 (UTC)
It's okay for me this way. I modified it in the first place because the lexeme listed both masculine and feminine gender. But if we create separate lexemes per gender for nouns, I agree :D Lepticed7 (talk) 07:42, 19 May 2020 (UTC)
  • It's already appropriately separated: there is chienne (L29225) for the female. We just need to find a good way to link the two entities. --- Jura 09:01, 17 May 2020 (UTC)
Whether or not to separate lexemes based on gender has been an open question since the beginning of Lexemes (and even before, as it was already an example on the test platform); see Wikidata_talk:Lexicographical_data/Archive/2019/05#Lexemes_and_gender_of_noun for instance (where I list some cases).
I don't have a strong opinion, but I must say I'm not really convinced by the "two lexemes based on gender" solution by default. "chien" and "chienne" are the same lexeme: same lexical category, same etymology, same morphology, almost the same everything, except for gender. @Infovarius: « Nouns are not inflected by genders », really? Cases like chien/chienne (in French), perro/perra (in Spanish) or Lehrer/Lehrerin (in German) are clearly inflections (and this is in fact the most common case; most nouns have forms depending on gender). It is also not what the sources seem to say on en:Grammatical gender, and a search for "gender inflection" on Google Books gives many results. « They have gender! » Yes, but this gender is not always unique or even existent: they can have 0, 1, 2, 3 or more genders. Having one gender is maybe the most common case, but there are a lot of exceptions (especially if you consider diachronic or dialectal data).
In the end, the situation is *very* complicated; maybe two lexemes can help to be more precise, but it is also more complicated and raises many new questions. Should we create a duplicate for each gender even when the gender is unmarked, as in ministre (L19816)? And what about suppletion (Q324982) (when the inflected forms are not related, like "verrat"/"truie" for gender; the same phenomenon also exists for number, like "ki"/"chas" for ki (L69) in Breton)?
Cheers, VIGNERON (talk) 11:09, 20 May 2020 (UTC)
True, but gender behaves more or less the same in most languages, and no language has ever had a "one lexeme has only one gender" iron law; there are always exceptions. At least for French, many words don't have only one gender, so we should talk in depth about how to deal with that. To start, here is the query for the (currently) 43 Lexemes with both masculine and feminine gender.
I didn't notice that you created this table, {{Single or multiple lexeme}}. It seems interesting, but I'm a bit confused: why is it in the data model? Was it announced somewhere? Where does it come from? (You put "from talk archive" in the edit summary, but that's very vague; which talk archive?) And how should it be read?
Cdlt, VIGNERON (talk) 18:12, 22 May 2020 (UTC)
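For readers who cannot follow the link, a query along the lines of the one mentioned above (French lexemes carrying both masculine and feminine gender) might be built like this. It is only a sketch, not the original linked query: the item IDs for French (Q150), masculine (Q499327) and feminine (Q1775415) are assumptions to be checked against Wikidata, and `build_gender_query` is a hypothetical helper name. grammatical gender (P5185) is the property discussed elsewhere on this page.

```python
def build_gender_query(language_qid="Q150", gender_qids=("Q499327", "Q1775415")):
    """Return a SPARQL query string listing lexemes of one language
    that carry *all* of the given grammatical genders (P5185).

    The default QIDs (French, masculine, feminine) are assumptions;
    verify them before running the query on the Wikidata Query Service.
    """
    # One triple pattern per gender: the lexeme must match every one of them.
    gender_lines = "\n".join(
        f"  ?lexeme wdt:P5185 wd:{qid} ." for qid in gender_qids
    )
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  ?lexeme dct:language wd:{language_qid} ;\n"
        "          wikibase:lemma ?lemma .\n"
        f"{gender_lines}\n"
        "}"
    )


if __name__ == "__main__":
    print(build_gender_query())
```

The printed text can be pasted into the query service. Passing other QIDs to the function would adapt it to another language or set of genders.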

Some help to use LexData

Hi, I'm trying to use LexData to create lexemes, but I get a message like: INFO:root:Maxlag hit, waiting for 5.0 seconds. If the lexeme already exists, everything is okay, but when the lexeme doesn't exist, I get this message. What am I supposed to do? Thanks! Lepticed7 (talk) 08:58, 18 May 2020 (UTC)

@Lepticed7: You are very likely not doing anything wrong; Wikidata disallows "bot" edits when the "maxlag" value is too high, due to too many backlogged edits that need to be processed. This happens quite often, I suggest you just wait a few minutes and try again (maybe several times). There are also grafana charts that can show you the current maxlag value so you know when it won't work, this one in particular. ArthurPSmith (talk) 17:39, 19 May 2020 (UTC)
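The "wait and try again" advice above can also be automated in a script. The following is a minimal sketch under stated assumptions, not LexData's actual behaviour: `MaxlagError` and `retry_on_maxlag` are hypothetical names, and the caller is assumed to raise `MaxlagError` whenever the API answers with a maxlag error.

```python
import time


class MaxlagError(Exception):
    """Hypothetical exception: raised by the caller's own code when the
    API reports that the servers are lagged (maxlag error)."""


def retry_on_maxlag(action, max_attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call action() and retry with exponential backoff on MaxlagError.

    Waits base_delay, then 2 * base_delay, 4 * base_delay, ... between
    attempts; re-raises the error once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except MaxlagError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Usage would be something like `retry_on_maxlag(lambda: create_my_lexeme())`, where `create_my_lexeme` stands for whatever edit function the script uses. The `sleep` parameter exists only so the waiting can be replaced in tests.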

Please double check this lexeme

Hi, please check låne (L300647) and tell me if it is wrong. I'm new at this. Iwan.Aucamp (talk) 00:29, 20 May 2020 (UTC)

@Jon Harald Søby: could you take a look? Cheers, VIGNERON (talk) 10:28, 20 May 2020 (UTC)
@Iwan.Aucamp: It looks alright to me, except I don't understand why there are grammatical gender (P5185) and requires grammatical feature (P5713) statements on the forms, that seems redundant to the grammatical features listed. Also the S2 is – AFAIK – only in the phrase "låne tid", not for the base form "låne". Jon Harald Søby (talk) 14:26, 20 May 2020 (UTC)
@Jon Harald Søby: Thanks for the feedback, I adjusted it and removed the sense. I'm not very familiar with best practices and just trying to get the feel for it so the input is much appreciated. Iwan.Aucamp (talk) 14:53, 20 May 2020 (UTC)

Model lexemes and language communities

Is there a place where we can define model lexemes for languages? Maybe a new property is needed for that similar to model item (P5869)? I think if we have some model lexemes for each language it will make it easier to manage.

Also is there some approach for community coordination around specific languages similar to wikiprojects?

Iwan.Aucamp (talk) 14:54, 20 May 2020 (UTC)

@Iwan.Aucamp: some people have started pages for specific languages, listed on Wikidata:Lexicographical data/Documentation/Languages. Some of these pages are not bad (I worked a lot on Breton: Wikidata:Lexicographical data/Documentation/Languages/br), but most are still stubs with only basic information. Feel free to create one. Also, to all lexical lovers: it would be nice to have feedback to improve them and to give them a coherent structure (not to be strictly enforced, but as a suggestion most sections could be similar). Any comments are welcome on Wikidata talk:Lexicographical data/Documentation/Languages. Cheers, VIGNERON (talk) 15:56, 20 May 2020 (UTC)

Modeling etymologies

Hello! We now have a lot of lexemes in many languages, including Latin, and this helps us build etymologies. I have been talking about this with a friend of mine who works in this area and he has given me some advice on how to proceed, but I would need your opinion on how to model it.

The derived from (P5191) logic would be this:

But he proposes to use something like this:

1 This word is still in use in some Basque dialects

How could we model the cf. (Q1048501) items? Is that a property? Or do we have a model for that?

@VIGNERON, Uziel302: -Theklan (talk) 09:06, 22 May 2020 (UTC)

@Theklan: very interesting suggestion (etymology is indeed more than just a straight line).
For « eleiza1 », I would simply put it as a form.
Indeed, a "confer" property could be useful. Maybe we could simply use an existing property (is there a fitting one? Not see also (P1659), which is only for properties), but a new property may be cleaner. And would we use it as a qualifier of derived from (P5191) or as a direct property?
Cdlt, VIGNERON (talk) 09:52, 22 May 2020 (UTC)
@VIGNERON: The property "confer" is problematic, because... should it be another lexeme, or could it be any string? -Theklan (talk) 09:59, 22 May 2020 (UTC)
@Theklan: definitely not a string, but it could be either a lexeme or a form of a lexeme. I would lean toward the first, a lexeme: in most cases it's precise enough, and then you can use the qualifier demonstrates form (P5830) (which means that in this case "confer" must be a direct property and not a qualifier, since qualifiers cannot get qualifiers themselves). Plus, a lexeme can have (and often has) multiple homographic forms, so again in this configuration a lexeme is better.
Here is a suggestion for the real and simple example of bara (L2283):
confer: bara (L2284) (normal rank; 1 reference: stated in (P248) Q50915490)

As for the name, "confer" is probably a bit too general; maybe "cognate" would be a better name for this (but maybe it's too narrow and too pedantic, and not exactly what you mean…), or maybe "related lexeme" (or as an alias?).
Cheers, VIGNERON (talk) 10:29, 22 May 2020 (UTC)
I'm not an etymologist, but I think that confer is a way to compare something. If I say that eleiza (L300882) ↔ *egleisa (old Occitan), I can add there, as a qualifier, "confer" Old Gascon gléisa.
derived from: *egleisa (normal rank; 0 references)
    qualifier confer: gléisa (Old Gascon)
-Theklan (talk) 10:41, 22 May 2020 (UTC)

I have shown him this conversation and we propose that both cognate and confer should be created. In most cases the two may be interchangeable (gléisa is a cognate of eleiza and both come from *egleisa), but in some cases the confer property may demonstrate how a word can change by analogy, without being related to the word itself. With confer you can point to a well-attested vowel change that should be noted as a parallel process for a less well-attested change in another word. Cognate would then be used in most cases. -Theklan (talk) 11:35, 22 May 2020 (UTC)

@VIGNERON: There is this proposal by Fnielsen. Theklan (talk) 13:04, 22 May 2020 (UTC)
Here is an example of what could be done: https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L49331&predicates=P5191 -Theklan (talk) 17:41, 22 May 2020 (UTC)
Hello! Neither "confer" nor "derived" is good, IMHO. I have been doing etymology on French Wiktionary for a long time, and if you want to properly link two lexemes with an etymological link, you need to explain four angles of analysis: morphology, historical phonetics, historical semantics and contextual analysis (such as the history of the people who created the word). If you link a word with an etymon, you need to explain why, within the relation. Also note that a great many words can't simply be linked to others (especially because of the etymological structure of the lexicon). "Derived" is a specific linguistic term in etymology, used only in certain cases. Lyokoï (talk) 11:35, 24 May 2020 (UTC)
@Theklan: if tools (like the Wikidata lexeme graph builder, SPARQL queries and many others) can already give you the cognates, why store them explicitly? That said, a "confer" (or whatever name) as a qualifier (your solution above is indeed better than mine) can be useful to point to a specific cognate relevant for the etymology.
@Lyokoï: very interesting, could you tell us more? How would you model that? (With direct properties or qualifiers? I imagine the latter. And how does it compare to the current model, where we already have two properties for etymology and morphology?) And ideally, do you have any references about that?
Cdlt, VIGNERON (talk) 12:54, 24 May 2020 (UTC)
My 2 cents: Wikidata is a secondary database. I uploaded Latin here based on Whitaker's WORDS, and I think any relation between words should be based on sources. If we have a source claiming that one word is derived from another word, we should be able to link those forms, not just the lexemes. We can have multiple sources offering different etymologies. We shouldn't force any systematic relations between words unless there is consensus in the academic sources, let alone guess etymologies based on forms and sound; it's very easy to end up with a false etymology (Q17013103). Uziel302 (talk) 18:50, 24 May 2020 (UTC)
Some terms in this proposal are reconstructed words (Q55074511), and based on prior discussions, I feel it is still unclear whether those are Lexemes or not and how to source them. Is it the right time to restart this conversation? Noé (talk) 10:13, 25 May 2020 (UTC)


I think Wikidata would greatly benefit if we had a clear vision of how it should look in the end. At this point we rely on derived from (P5191) and combines (P5238) (and it is not always done correctly). A confer property might give us some interesting views on etymological data. But there are other issues we should deal with. For example, my dictionary is full of entries saying something like "origin is unclear, maybe it has something to do with X, there are opinions that it is unlikely" or "probably onomatopoeic origin". Sometimes we can't be sure what the most direct predecessor is (for example, whether the relation between lexemes A, B, C is A → B → C or B ← A → C or A → C → B, ...). @Theklan: Do you think you could, along with your friend, come up with some best practices for common etymology issues on Wikidata? --Lexicolover (talk) 10:19, 26 May 2020 (UTC)
@Lexicolover: I don't think we can model something universal, and that's why we need to discuss good practice. In Basque, etymologies are very unclear, so only really clear etymologies are added. I would go with that, and that's why confer may give the reader interesting information on how an etymology has been inferred. This is a good example of what can be done: https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L49331&predicates=P5191 .
@Noé: I think we need to restart that conversation, if it was closed. -Theklan (talk) 11:05, 26 May 2020 (UTC)
@Noé: yes, we absolutely need to (re-)start this discussion. A bot request is not really a good place for a discussion; there were many different aspects, and from what I understand the core problem was the lack of references rather than the reconstructed lexemes themselves. Cheers, VIGNERON (talk) 12:53, 28 May 2020 (UTC)
@Lexicolover: yes, we need a clearer vision both of etymology in general (Theklan and his friend can help there) and of how to use the existing properties (here it is a job for us on the Lexemes side). For the other issues you raised, can't they simply be dealt with using qualifiers? sourcing circumstances (P1480) was created exactly for the purpose of saying things like unlikely, probably, maybe, etc. Cdlt, VIGNERON (talk) 13:08, 28 May 2020 (UTC)
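To make the qualifier idea above concrete, here is a small illustrative model of a derived from (P5191) statement hedged with sourcing circumstances (P1480). It is a sketch only and not the Wikibase JSON format: the `Statement` class and `is_hedged` helper are hypothetical, and the qualifier value is written as a plain label because the exact item to use is not given in this discussion.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Simplified, hypothetical stand-in for a Wikibase statement."""
    prop: str                                        # main property, e.g. "P5191"
    value: str                                       # target, e.g. a lexeme ID
    qualifiers: dict = field(default_factory=dict)   # property ID -> list of values
    references: list = field(default_factory=list)   # sources backing the claim


def is_hedged(statement):
    """True if the statement carries a sourcing circumstances (P1480) qualifier."""
    return bool(statement.qualifiers.get("P1480"))


# An uncertain etymology: the etymon is a placeholder, and "probably" is a
# plain label standing in for whichever item is chosen as the P1480 value.
etymology = Statement(
    prop="P5191",
    value="<lexeme ID of the proposed etymon>",
    qualifiers={"P1480": ["probably"]},
    references=["<dictionary stating the etymology>"],
)

print(is_hedged(etymology))
```

Several such statements with different references could sit side by side on one lexeme, which would also cover the case of sources that disagree.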

Notability of languages

Do we have any policy about the notability of languages? See e.g. Q63449899. Or can I invent my own language and add Lexemes in it? --Infovarius (talk) 18:42, 29 May 2020 (UTC)

There is Wikidata:Lexicographical_data/Notability#Languages. Pamputt (talk) 21:40, 29 May 2020 (UTC)
Q63449899 has a sitelink, so it's notable enough per the general Wikidata rules (WD:N). For the notability of Lexemes, we need to make this draft page a rule (and maybe talk about it before validating it; I think it's mostly good, but at least the introduction needs some rework). Cheers, VIGNERON (talk) 08:55, 30 May 2020 (UTC)