Wikidata talk:Lexicographical data

Lexicographical data
Place used to discuss any and all aspects of lexicographical data: the project itself, policy and proposals, individual lexicographical items, technical issues, etc.
On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2022/05.


Milestone - 200k lexemes

My bot just created spiritualistically (L200000), the 200000th lexeme, while importing Wiktionary adverbs! It means "in a way relating to being spiritual".  – The preceding unsigned comment was added by SixTwoEight (talk • contribs) at 22:04, 11 October 2019 (UTC).

Same (spelling) forms, different pronunciation

I want to discuss an interesting case: wikt:ru:яр. There are 2 main meanings: one (II) has the simplest declension paradigm (1a), the other (I) can be declined in the same way but also in the "1c" way, which differs only in pronunciation (stress on different syllables). Can we (and should we) model both cases in one lexeme (яр (L184179))? And even if not, how best to model the 1c/1a variation? Should we create pairs of forms with the same spelling and different pronunciations, or collect both pronunciations in single forms? --Infovarius (talk) 21:29, 31 January 2022 (UTC)

I'd say having two lexemes might be the cleaner approach in this case, given that half the forms (and not just one or two) are affected by the choice of meaning for that word. As for modeling variation due to choice of declension paradigm, perhaps "applies to part" (class 1a/1c) might qualify "pronunciation" on the affected forms? Mahir256 (talk) 00:51, 1 February 2022 (UTC)
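
To make the two options from the question concrete, here is a minimal Python sketch. It is not the Wikibase data model: the class and field names are hypothetical, and the stressed nominative plural spellings are only my illustration of what a 1a/1c contrast could look like, not taken from the Wiktionary entry. Option A keeps one form per spelling and qualifies each pronunciation with its paradigm (roughly the "applies to part" idea); option B creates a pair of forms with the same spelling.

```python
from dataclasses import dataclass, field


@dataclass
class Pronunciation:
    stressed: str   # spelling with a combining acute marking the stress
    paradigm: str   # qualifier: "1a" or "1c" (hypothetical field)


@dataclass
class Form:
    representation: str
    features: tuple
    pronunciations: list = field(default_factory=list)


# Illustrative stressed spellings (assumption, see the note above).
STEM_STRESS = "я\u0301ры"    # stress on the stem
ENDING_STRESS = "яры\u0301"  # stress on the ending

# Option A: a single form, both pronunciations qualified by paradigm.
option_a = [
    Form("яры", ("nominative", "plural"), [
        Pronunciation(STEM_STRESS, "1a"),
        Pronunciation(ENDING_STRESS, "1c"),
    ]),
]

# Option B: two forms with the same spelling, one pronunciation each.
option_b = [
    Form("яры", ("nominative", "plural"), [Pronunciation(STEM_STRESS, "1a")]),
    Form("яры", ("nominative", "plural"), [Pronunciation(ENDING_STRESS, "1c")]),
]


def spellings(forms):
    """Set of distinct written representations."""
    return {f.representation for f in forms}


def pronunciation_count(forms):
    """Total number of pronunciations recorded across all forms."""
    return sum(len(f.pronunciations) for f in forms)
```

Both options record the same information; they differ only in how many form entries carry it, which is the trade-off being asked about.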

maps to other lexical resources

What are the plans to support/include/map to other resources like:

 – The preceding unsigned comment was added by Arademaker (talk • contribs) at 19:15, February 16, 2022 (UTC).

But was all PWN 3.1 data 'imported' into Wikidata? I am looking at https://www.wikidata.org/wiki/Q144, which points to http://wordnet-rdf.princeton.edu/pwn30/02084071-n, so is this considered the official endpoint for PWN? How do we decide about other wordnet endpoints? For instance, the Portuguese Wordnet entry would be http://wn.mybluemix.net/synset?id=02084071-n, which maps to PWN 3.0. How should I properly use the Wikidata infrastructure to connect to our Portuguese Wordnet?
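
As a small illustration of why the identifier matters more than the endpoint: both URLs above carry the same synset ID, 02084071-n. The Python sketch below pulls that ID out of either URL shape. The ID pattern (eight digits, a hyphen, a part-of-speech letter) and the function name are my assumptions based only on these two URLs, not a documented scheme of either service.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed shape of a synset ID: 8-digit offset, hyphen, POS letter.
SYNSET_ID = re.compile(r"^\d{8}-[nvars]$")


def synset_id_from_url(url):
    """Return the synset ID found in the last path segment or in an
    'id' query parameter, or None if neither looks like a synset ID."""
    parsed = urlparse(url)
    candidates = [parsed.path.rsplit("/", 1)[-1]]
    candidates += parse_qs(parsed.query).get("id", [])
    for candidate in candidates:
        if SYNSET_ID.match(candidate):
            return candidate
    return None


pwn = synset_id_from_url("http://wordnet-rdf.princeton.edu/pwn30/02084071-n")
pt = synset_id_from_url("http://wn.mybluemix.net/synset?id=02084071-n")
print(pwn == pt)
```

If the two resources share the ID, a single identifier value stored in Wikidata could in principle be resolved against either endpoint.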

For VerbNet, could we eventually use links like https://uvi.colorado.edu/verbnet/meander-47.7-1? But once more, would that force the ingestion of VerbNet data into Wikidata?

I am testing the modeling of PropBank data in the Wikidata Lexemes. See https://www.wikidata.org/wiki/Lexeme:L116. But what do you (Mahir) mean by 'entirely developed at the moment'? What is missing? Instead of only replicating PropBank data (thematic roles) in Wikidata, it would also be interesting to have some reference to the PropBank frame. For instance, for https://www.wikidata.org/wiki/Lexeme:L116 we could have `propbank roleset ID` with value locate.01. So the property `propbank roleset ID` would be similar to WordNet 3.1 Synset ID (P8814). That is, we would not need a PropBank endpoint, just use its identifiers for now. (by --Arademaker (talk) 20:05, 15 March 2022 (UTC))
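
A rough Python sketch of what treating the roleset ID as a plain external identifier could look like: the value is validated and split into lemma and sense number, with no endpoint involved. The format check (a lowercase lemma, a dot, two digits) is an assumption generalized from the single example locate.01, and the property label in the dictionary is hypothetical since no such property exists yet.

```python
import re

# Assumed roleset ID shape, generalized from "locate.01".
ROLESET_ID = re.compile(r"^(?P<lemma>[a-z_]+)\.(?P<number>\d{2})$")


def parse_roleset_id(value):
    """Split a roleset ID into (lemma, sense number), or return None
    if the value does not match the assumed shape."""
    match = ROLESET_ID.match(value)
    if match is None:
        return None
    return match.group("lemma"), int(match.group("number"))


def roleset_statement(lexeme_id, value):
    """Build a plain dict standing in for an identifier statement.
    The property label is hypothetical (no P-number assigned)."""
    if parse_roleset_id(value) is None:
        raise ValueError(f"not a roleset ID: {value!r}")
    return {
        "lexeme": lexeme_id,
        "property": "propbank roleset ID",
        "value": value,
    }


statement = roleset_statement("L116", "locate.01")
```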

Please advise - place names

Hello, I'm preparing an import of municipality names in Czech to Wikidata. I have a few questions based on a study of existing lexemes:

  • in senses, should there be more than one language? See Třebíč (L437) which states that the word has (the same) meaning in both Czech and English. Is this desirable?
  • Aberdeen (L494798) has sense S1 for "city in Scotland" and sense S2 for other places of the same name. Is this what is expected? Vojtěch Dostál (talk) 11:52, 24 February 2022 (UTC)
@Vojtěch Dostál: The issue with L437 is a mistake by the user who added it, which I have fixed. As for L494798, the approach taken is something I would be comfortable with if it were applied more broadly (@ArthurPSmith: might be able to explain more), but at the moment that is not the case (see e.g. Guhkesjávri (L633165) and @Jon Harald Søby: who created it). Mahir256 (talk) 14:58, 24 February 2022 (UTC)
@Mahir256 Thank you for your help on this. I'll be using the current version of Lexeme:L437 for further work. Happy to hear details from others. Vojtěch Dostál (talk) 15:13, 24 February 2022 (UTC)
I was long reluctant to add proper nouns as Wikidata lexemes because that would mean every organization, location, person, etc. could have a lexeme in every language as well as an item. However, some proper nouns are very widely used, either because they represent a very prominent real-world entity, or because they have many different real-world entities with that name, and it does seem useful to have them included here. "Aberdeen" is an example of both. Practically it seems silly to create a separate sense for each real-world entity with that name (how would we even do that with given names?) but maybe it's not a big burden for Wikidata, I don't really know. No strong feeling on that from me. ArthurPSmith (talk) 17:17, 24 February 2022 (UTC)

Different spelling or different words?

mēnsa (L31224) = mensa (L278590)? --Infovarius (talk) 19:19, 15 March 2022 (UTC)

@Infovarius: clearly the same lexeme. The diacritic (here a macron) is a modern notation; it's useful for pronunciation (to indicate a long vowel), but I'm not sure how it should be stored (see also Wikidata_talk:Lexicographical_data/Archive/2018/06#Arabic_diacritics where we mention this question). PS: the Uzielbot import had a lot of strange things that should be fixed (but I'm not sure where to start...). Cheers, VIGNERON (talk) 15:49, 9 May 2022 (UTC)
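
For spotting such pairs, a minimal Python sketch: decompose each lemma with Unicode normalization, drop the combining macron, and compare what is left. The function names are hypothetical, and this is only a way to find candidate duplicates, not a statement about how Wikidata stores or should store the macron.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text):
    """Remove macrons only, leaving any other diacritics in place."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch != COMBINING_MACRON)
    return unicodedata.normalize("NFC", kept)


def same_ignoring_macrons(a, b):
    """True if two lemmas differ at most by macrons."""
    return strip_macrons(a) == strip_macrons(b)


print(same_ignoring_macrons("mēnsa", "mensa"))
```

A match here only flags a candidate; whether the two entries really are one lexeme still needs a human check, as in the case above.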

How to merge lexemes?

Hello. In the case of two lexemes that are duplicates, what would be the procedure to merge them? --Hameryko (talk) 10:11, 24 March 2022 (UTC)