Wikidata talk:Lexicographical data


Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2023/11.



Import single-word senses from Wiktionary

To increase the number of senses in Wikidata, I would like to import senses from Wiktionary. Because the licenses are incompatible, this can only be done for items below the threshold of copyrightability. In the Telegram chat, it was assumed that limiting the import to senses consisting of a single word would be sufficient to meet this criterion. Additionally, senses should only be imported for lexemes without any existing senses, to avoid duplication. My current import suggestions can be found at https://static.karl.berlin/wikidata/ and I plan to use https://gitlab.wikimedia.org/toolforge-repos/twofivesixlex to execute the import.

  • Do you agree on the license situation?
  • Does anyone want to proof-read the suggestion list?
  • Is there anything else I should be aware of in this context? This is my first import.

Karlb42 (talk) 15:08, 1 October 2023 (UTC)

@Karlb42: I don't think there's a copyright issue here, other than possibly the one associated with databases which we routinely ignore. I did look through your English list and I'm not sure these will be super useful. A sense of "chirp" is "insects"? But some of them will be helpful. It's also not all that long a list (915 for English) - is this all the single-word definitions in enwikt (for lexemes with no senses)? Oh - one other thing is how do you expect to handle cases where we have multiple lexemes (for different lexical categories for example - nouns and verbs etc.) that have the same string value? ArthurPSmith (talk) 17:57, 2 October 2023 (UTC)
> A sense of "chirp" is "insects"?
These terms make sense to disambiguate the word before translating it into different languages (birds and insects making sounds are different words in other languages), but I agree that they are not a good sense description. Limiting the glosses to single words probably highly overrepresents cases like this compared to the total data set. I'll reconsider if the approach is viable.
> is this all the single-word definitions in enwikt (for lexemes with no senses)?
@ArthurPSmith I restricted it to nouns for now to reduce the scope and limit the number of different problems. But the code works on other parts of speech, too. I also only included glosses with at least one translation in my subset of the Wiktionary dataset (the one used in www.wikdict.com). Apart from these limitations and potential bugs, that is all of them.
> do you expect to handle cases where we have multiple lexemes
Yes, I match by part of speech. Karlb42 (talk) 16:40, 3 October 2023 (UTC)
Ok, well it seems to me it wouldn't hurt to do this, especially limiting it to nouns; at least this should be better than no sense at all on the lexemes. ArthurPSmith (talk) 19:16, 3 October 2023 (UTC)
Can you please generate a list for Russian words too? --Infovarius (talk) 20:46, 5 October 2023 (UTC)
Unfortunately, there are hardly any single-word senses in the Russian Wiktionary, if I read the data correctly. Karlb42 (talk) 12:41, 8 October 2023 (UTC)
Hm, I thought you were extracting all languages from en-wikt? I suppose there are single-word senses of Russian words in en-wikt. --Infovarius (talk) 20:12, 8 October 2023 (UTC)
I'm working with each language's own Wiktionary, so ru.wiktionary.org for Russian (with potential limitations due to the extraction process from the wiki markup done by the dbnary project). Karlb42 (talk) 12:08, 15 October 2023 (UTC)
So you take German words from German Wiktionary, English words from English Wiktionary, etc.? Then I suppose your approach is not useful. Such glosses are often not illustrative at all. It is probably better to take the foreign words from each Wiktionary together with their one-word translations than native words with their one-word explanations. Objections? Infovarius (talk) 08:19, 26 October 2023 (UTC)
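
For readers following the thread, the selection rule discussed above (single-word glosses only, only lexemes that have no senses yet, matching by lemma and part of speech) could be sketched roughly as follows. This is an illustration only, not the code of the twofivesixlex tool: the `Gloss` and `Lexeme` types, the `select_imports` function, the placeholder lexeme IDs, and the choice to skip ambiguous matches are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gloss:
    lemma: str
    pos: str   # part of speech, e.g. "noun"
    text: str  # gloss text as extracted from Wiktionary


@dataclass(frozen=True)
class Lexeme:
    lexeme_id: str    # placeholder IDs in this sketch, not real Wikidata IDs
    lemma: str
    pos: str
    sense_count: int  # number of senses the lexeme already has


def select_imports(glosses, lexemes, allowed_pos=("noun",)):
    """Pair single-word glosses with sense-less lexemes of the same lemma and part of speech."""
    # Index only lexemes without senses, keyed by (lemma, part of speech).
    targets = {}
    for lex in lexemes:
        if lex.sense_count == 0:
            targets.setdefault((lex.lemma, lex.pos), []).append(lex)

    selected = []
    for gloss in glosses:
        if gloss.pos not in allowed_pos:
            continue
        # "Single word" is approximated as exactly one whitespace-separated token.
        if len(gloss.text.split()) != 1:
            continue
        matches = targets.get((gloss.lemma, gloss.pos), [])
        # Assumption: skip cases where several lexemes share lemma and part of speech.
        if len(matches) == 1:
            selected.append((matches[0].lexeme_id, gloss.text))
    return selected


glosses = [
    Gloss("chirp", "noun", "insects"),
    Gloss("chirp", "verb", "sing"),
    Gloss("bank", "noun", "financial institution"),
]
lexemes = [
    Lexeme("L-chirp-noun", "chirp", "noun", 0),
    Lexeme("L-chirp-verb", "chirp", "verb", 0),
    Lexeme("L-bank-noun", "bank", "noun", 0),
]
print(select_imports(glosses, lexemes))
```

With the default `allowed_pos`, the verb gloss is dropped by the noun restriction and the two-word gloss by the single-word rule, which mirrors the limits described in the thread.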

Mapping toponyms, grouping them in their linguistic family (using lexicographical data)

Hello, in a previous post in the Wikidata main discussion (https://www.wikidata.org/wiki/Wikidata:Request_a_query#Mapping_toponyms_organized_in_their_linguistic_family) I asked whether it is possible to link each municipality of Colombia (Q2555896) with some linguistic information, specifically its language family (Q25295). That way I can organize them and map them. Apparently there is no etymological or toponym (Q7884789) information in Wikidata that links, for example, a city name (Chía (Q1093102)) to its language family. There is a property called native label (P1705) that may work for this, but it is not used consistently. What confuses me the most is that I can't find a connection between the toponym and any data that says something about its language family.

WD's Lexicographical data, I think, can be useful here, and as I was told in the other discussion: "[...] almost certainly this is where toponym to linguistic family information should be found." My concrete question is: Is it possible to link each municipality of Colombia, or each toponym, to something inside the Lexicographical data to identify each of them with a "linguistic root" or language family?

I am new to all this Wikimedia world. Thank you for reading! Duityors (talk) 16:39, 25 October 2023 (UTC)

Question about combines lexemes P5238

How do I correctly define combines lexemes (P5238) for a word where the prefix is only part of a noun? I'm looking at the Swedish noun abborrfena (L1206838) with the prefix (L1211366), which is not a word in itself, it is only part of the whole word for perch, i.e. abborre (L235116). Should the prefix be abborr or abborre? If abborr, how do I define it, and how do I correctly link it to abborre?

I checked another case of combines lexemes (P5238) where an s is needed between two words, but that does not help in this case: havsbotten (L242830) consists of hav, -s- and botten, the -s- being an ekstraudstyr (L1153504). Robert (talk) 08:24, 2 November 2023 (UTC)

@Robertsilen: One solution that has been adopted for some German lexemes (and even some Swedish lexemes) is to define a new form with the grammatical feature combining form (Q107614077) and use that in any relevant 'combines' statements: e.g. Krawatten (L36557-F9) or användar (L33166-F9). Mahir256 (talk) 15:57, 2 November 2023 (UTC)
Thanks, I implemented your suggestion. Makes sense. Robert (talk) 20:32, 2 November 2023 (UTC)
See Wikidata:Lexicographical data/Documentation/Languages/de#Form_used_in_compounds for an explanation of why I do it this way for German. - Nikki (talk) 09:52, 28 November 2023 (UTC)
You might also be interested in this discussion we had some time ago :) Vesihiisi (talk) 07:34, 3 November 2023 (UTC)
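
To make the combining-form suggestion concrete, here is a small in-memory sketch of the idea: a lexeme gets an extra form tagged with combining form (Q107614077), and a compound is assembled from that form instead of the lemma. The `Form` and `Lexeme` classes, the `assemble` function, and the placeholder IDs `L235116-Fx` and `L-fena` are hypothetical and exist only for this example. It does not use the Wikibase API and does not reflect the actual form IDs on Wikidata.

```python
from dataclasses import dataclass, field

# Grammatical feature "combining form", as named in the thread.
COMBINING_FORM = "Q107614077"


@dataclass
class Form:
    form_id: str
    representation: str
    features: set = field(default_factory=set)


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    forms: list = field(default_factory=list)

    def compound_part(self):
        """Return the combining form's text if the lexeme has one, else the lemma."""
        for form in self.forms:
            if COMBINING_FORM in form.features:
                return form.representation
        return self.lemma


def assemble(parts):
    """Concatenate the parts of a compound in order."""
    return "".join(part.compound_part() for part in parts)


# "L235116-Fx" and "L-fena" are placeholders, not real Wikidata IDs.
abborre = Lexeme(
    "L235116",
    "abborre",
    [Form("L235116-Fx", "abborr", {COMBINING_FORM})],
)
fena = Lexeme("L-fena", "fena")

print(assemble([abborre, fena]))  # abborrfena
```

The point of the design is that the 'combines' statement can reference the form (abborr) while the lexeme itself stays abborre, so no separate prefix lexeme is needed.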

language to bearers

This issue is not quite related to the project, but it does concern languages. Recently I found a relation like

⟨ Standard Malay (Q123474569) ⟩ spoken by (P10894) ⟨ Malaysian Malays (Q3543292) ⟩

and I don't think P10894 is appropriate here. The alternatives are indigenous to (P2341), practiced by (P3095), and used by (P1535), but all are suboptimal. Can we discuss which property fits best? Infovarius (talk) 20:00, 18 November 2023 (UTC)

lexicographic items

Some items are quite dictionary-like (e.g. Q21121474), so we have subject lexeme (P6254) on them, linking to an appropriate lexeme. But do we have a means to link back from the lexeme to the item? I am not talking about item for this sense (P5137) on senses, because a word can have different meanings with different corresponding items and, at the same time, an item about the word itself (not its meaning). Infovarius (talk) 12:00, 20 November 2023 (UTC)

Could you use said to be the same as (P460)? ArthurPSmith (talk) 18:48, 28 November 2023 (UTC)
@Infovarius: of course that would violate some constraints but maybe they should be altered for this case? ArthurPSmith (talk) 18:48, 28 November 2023 (UTC)
Thanks, that would work until there is a better solution. --Infovarius (talk) 20:52, 29 November 2023 (UTC)

Standardized corpus of open data sentences

The idea is to create a new JSON-based standard for sharing open data with sentences that can be referenced and used in tools related to Wikidata lexemes. This would allow for tools to easily incorporate sentences from various sources such as the Swedish Parliament, the Swedish Public Employment Service, EU documents, historical documents, and more. The goal is to standardize this process, making it straightforward to support multiple languages where CC0 data is available. I'm currently working on it, see https://github.com/dpriskorn/riksdagen_sentences for details.

I welcome suggestions for improvement so we can start building a multi-million-token dataset that we can use to help our users add example phrases to lexemes. @Fnielsen:--So9q (talk) 07:12, 28 November 2023 (UTC)

There is a similar initiative for Russian: opencorpora.org, though it seems half-dead. Infovarius (talk) 20:59, 29 November 2023 (UTC)
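
As a rough illustration of what a JSON-based sentence record and a lookup for example phrases might look like, here is a minimal sketch. The field names (`text`, `language`, `source`, `license`), the `SentenceRecord` class, the two helper functions, and the sample sentence are all assumptions made for this example. They are not taken from the riksdagen_sentences repository, whose actual format may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SentenceRecord:
    text: str
    language: str             # language code, e.g. "sv"
    source: str               # human-readable name of the source
    license: str = "CC0-1.0"  # assumed default, since the proposal targets CC0 data


def to_json_lines(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(
        json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)
        for record in records
    )


def find_examples(records, word, language):
    """Return sentences in the given language that contain the word as a whole token."""
    hits = []
    for record in records:
        if record.language != language:
            continue
        # Naive tokenization: split on whitespace and strip common punctuation.
        tokens = [t.strip(".,;:!?\"'()").lower() for t in record.text.split()]
        if word.lower() in tokens:
            hits.append(record.text)
    return hits


records = [
    SentenceRecord("Abborren simmar i sjön.", "sv", "example source"),
    SentenceRecord("The perch swims in the lake.", "en", "example source"),
]
print(to_json_lines(records))
print(find_examples(records, "sjön", "sv"))
```

A tool that suggests example phrases for a lexeme form could call something like `find_examples` with the form's representation and the lexeme's language.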