Wikidata talk:Lexicographical data/Archive/2020/04

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Common misspellings data

Hi. I recently found http://petnoga.se/ and really liked that they provide suggestions for correct spelling. I'm wondering if it would be a fruitful endeavour to collect common misspellings under our lexeme forms? This would enable us to easily make something like that. If yes, then I suggest we team up and ask LibreOffice for help with collecting the data we need. WDYT?--So9q (talk) 10:50, 19 March 2020 (UTC)

I personally do not think misspellings should be listed as forms. Perhaps they could be listed with regular statements using a special property. ArthurPSmith (talk) 17:34, 19 March 2020 (UTC)
I do think listing misspellings as forms is a great idea, especially if they are common misspellings. Also, the line between misspellings and variants can be blurry, especially for poorly documented or mostly oral languages. And even for well-documented languages it is still useful; for example, in French "occurence" (with only one "r", instead of "occurrence" with two) is (sadly) very common, almost 20% in the 80s according to Google Ngrams. Cheers, VIGNERON (talk) 20:05, 19 March 2020 (UTC)
Hmm, I guess that's a good point, and we do allow archaic forms... English also has some very common misspellings - for example Nobody can spell “fuchsia”. ArthurPSmith (talk) 17:06, 20 March 2020 (UTC)
Interesting statistic @VIGNERON:. Here is the equivalent for fuchsia, which is actually only half as bad as your example.
I added a misspelled form to fuchsia (L3280), as a form carrying has characteristic (P1552) → misspelling (Q1984758) qualified with of (P642) → fuchsia (Q5005364), but is it better to create a lexeme and add the form there? Also, what is best: to refer to fuchsia (Q5005364) or to the correct form fuchsia (L3280-F1) (right now the latter seems impossible, as form support is missing in the input field)? WDYT?--So9q (talk) 09:51, 21 March 2020 (UTC)
Oh, I actually meant what Arthur suggested: list them with regular statements using a special property, e.g. common misspellings → misspelling 1, misspelling 2. But having them as forms with a property indicating a misspelling might be a better way to model it. Obviously this also begs for annotating non-misspellings with at least a reference to an authoritative source, e.g. the Danish correct-spelling dictionary [1]. @fnielsen: What do you think about misspellings and modeling them?--So9q (talk) 12:54, 20 March 2020 (UTC)
I'm not 100% sure about the best way to model it; fuchsia (L3280) looks good though. The ideal would be a way to do all sorts of annotation, not only "misspelling" but also "variant", "archaic form", and so on. Cdlt, VIGNERON (talk) 10:08, 21 March 2020 (UTC)
I agree, I created a new property proposal for misspellings.--So9q (talk) 21:29, 22 March 2020 (UTC)
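For reference, if misspelled forms are modelled the way fuchsia (L3280) currently is (a has characteristic (P1552) → misspelling (Q1984758) statement on the form), they can already be listed from the query service. The sketch below is only an illustration under that assumption, not a settled convention, and the property proposal above may end up with a different structure.

  # Sketch: list forms carrying has characteristic (P1552) = misspelling (Q1984758),
  # i.e. the modelling currently used on fuchsia (L3280); assumes that modelling holds.
  import requests

  QUERY = """
  SELECT ?form ?spelling WHERE {
    ?form ontolex:representation ?spelling ;
          p:P1552/ps:P1552 wd:Q1984758 .
  }
  LIMIT 100
  """

  r = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "misspelled-forms-example/0.1 (example only)"},
  )
  for row in r.json()["results"]["bindings"]:
      print(row["form"]["value"], row["spelling"]["value"])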
Misspelling lexeme forms design mockup
I just added two misspellings to https://www.wikidata.org/wiki/Lexeme:L36116 but, seeing there are already 8 forms and imagining 8 forms for each wrong spelling, we end up with a mess with a total of 24 forms on one lexeme. It might be better to create a misspelling-lexeme like [2]; creating these would really be helped by e.g. updating Lexeme Forms to map the misspelled form to the correct one side by side / via drag and drop, or intelligently by analyzing grammatical features. The tool could also visually enable adding one new correct lexeme and one or more misspelling-lexemes side by side, with propagation to minimize typing; see the mockup to the right. WDYT @Lucas Werkmeister:?--So9q (talk) 21:41, 22 March 2020 (UTC)
On Wikipedia a lot of misspellings are already documented: Swedish, English. If we could find a way to import these we would have a really good start IMO; they might be covered by copyright though.--So9q (talk) 22:28, 22 March 2020 (UTC)
It was previously proposed that misspellings be stored as individual lexemes: see Wikidata:Property proposal/correct form. Alternatively, misspellings could also be stored as statements.--GZWDer (talk) 00:49, 23 March 2020 (UTC)

I thought about it and I think the best approach is to have misspellings as statements for forms:

Correct Form (Lxxxx-Fx) --> Common misspelling (Pxxxx) :  misspelled word

IMO misspellings are neither lexemes nor forms. There is one thing I am not sure of - what a common misspelling actually is. How do we define it? We all probably have some idea of what it is, and that idea is probably different for each of us. There would be stricter opinions and looser opinions. Would it differ language by language? Should it always be attested by linguistic science? It is probably not a good thing to have future arguments about the number of Google search hits. --Lexicolover (talk) 20:58, 23 March 2020 (UTC)

This looks good to me. I changed my proposal to reflect this. https://www.wikidata.org/wiki/Wikidata:Property_proposal/correct_spelling --So9q (talk) 22:19, 23 March 2020 (UTC)
I agree we should define "common misspelling". Here are my suggestions: the misspelling has to appear in one of the following sources:
  1. an authoritative source such as e.g. Retskrivningsordbogen (Q3398246)
  2. appears in one of Wikipedia's lists of misspellings, e.g. [3]
  3. appears in WD with P31 → misspelling (Q1984758), e.g. Rzehakinacea (Q33188867)
  4. appears in a corpus approved explicitly by this community with an occurrence over a certain threshold. (yet to be created, decided)  – The preceding unsigned comment was added by So9q (talk • contribs).
  • I think the forms should be included as forms, not in some random string property. Accordingly, I prefer the initial version of the proposal. Wikidata is generally descriptive, not prescriptive. --- Jura 16:31, 26 March 2020 (UTC)
    @Jura1: In my understanding misspellings are not forms. Forms have some function in the language, misspellings don't. There might be misspellings that match other forms of the word (different form, archaic form, colloquial form etc.) but there is no form that has the "quality" of being a misspelling - such a thing has no use for the language.--Lexicolover (talk) 21:59, 27 March 2020 (UTC)
    • So you prefer the normative view? Some languages have a language board with a fairly authoritative view of its role, and in some countries there is an official list of first names to use. --- Jura 12:33, 1 April 2020 (UTC)
@So9q: Thank you for your effort. 1) is okay for me; 2) and 3) are IMO not good - it's basically self-referencing. The link at 2) IMO combines misspellings and mistypings (there's a bit of a difference for me); the example at 3) is not even a misspelling, it's an incorrect name. 4) might be good but I am afraid most corpora would not be annotated well enough to decide which word is a real misspelling.--Lexicolover (talk) 21:59, 27 March 2020 (UTC)

How to model the two different comparisons of Germanic languages?

Hi, I briefly discussed how to model this with Lucas W and we did not reach a solution, so now I would like your input. My example, where both the syntactic and the morphological comparison apply, was this:

1) happy, happier, happiest

2) happy, more happy, most happy

So "the most happy day of my life" == "the happiest day of my life"?

If this is correct then I believe we should have a total of 5 forms on happy (L1349), but today we only have the -ier/-iest forms. I think that is an omission we should decide whether we want to correct. WDYT?--So9q (talk) 17:05, 1 April 2020 (UTC)

@So9q: I don't know about other languages, but in English I would view the "more" and "most" variants as applicable to any adjective or adverb; "more" and "most" are lexemes in themselves that have those comparative meanings, and there's no need to add them as forms. Most English adjectives do not have "-er" and "-est" ending forms, but require "more" and "most" to create those comparatives. So I think it's most useful just to add those single-word forms where they do exist, and not add extra forms with "more" and "most". ArthurPSmith (talk) 19:37, 1 April 2020 (UTC)

Huge increase of number of Forms

Hello all,

We noticed a huge increase in the number of Forms over the past few days (+100K) (see Grafana board). Is one of you working on an import project?

Thanks, Lea Lacroix (WMDE) (talk) 12:39, 3 April 2020 (UTC)

I'm guessing that at least some of that increase is due to @Uziel302: who is working on adding forms in Latin (and with Latin declension the number of forms goes up very quickly!) Cdlt, VIGNERON (talk) 14:49, 3 April 2020 (UTC)
Lea Lacroix (WMDE), indeed, I've uploaded around 1M forms with their lexicographical analysis based on Whitaker's WORDS. The graph is confusing as it mixes all entity types on the same graph. Let me know if any issue was found in the import. Uziel302 (talk) 20:47, 3 April 2020 (UTC)
The Elhuyar Fundazioa bot has also been adding some more Basque lexemes lately. Mahir256 (talk) 20:52, 3 April 2020 (UTC)
User:Elhuyar Fundazioa is indeed uploading forms, but since he uploads each form in a separate edit, it takes more than an hour for him to upload a thousand forms. It also adds load on the system and we reach high maxlag. I think he should group the forms and upload them all in one edit, like I did in my last upload. Code is here. Uziel302 (talk) 21:04, 3 April 2020 (UTC)
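For illustration, grouping several new forms into a single wbeditentity call could look roughly like the sketch below. This is not the script linked above; the lexeme ID, login handling and values are placeholders, and the empty "add" marker is, as far as I understand the API, how new forms are flagged in the edit payload.

  # Rough sketch of one batched edit: several new Latin forms added to one
  # lexeme in a single wbeditentity call. Placeholders: lexeme ID (L123),
  # login/session handling, CSRF token and grammatical features.
  import json
  import requests

  API = "https://www.wikidata.org/w/api.php"
  session = requests.Session()   # assumed to be logged in already
  csrf_token = "..."             # obtain via action=query&meta=tokens

  new_forms = [
      {"add": "",
       "representations": {"la": {"language": "la", "value": "rosae"}},
       "grammaticalFeatures": []},   # fill in the feature items (case, number, ...)
      {"add": "",
       "representations": {"la": {"language": "la", "value": "rosam"}},
       "grammaticalFeatures": []},
  ]

  resp = session.post(API, data={
      "action": "wbeditentity",
      "id": "L123",                        # hypothetical lexeme
      "data": json.dumps({"forms": new_forms}),
      "token": csrf_token,
      "maxlag": 5,                         # back off when the servers are lagged
      "bot": 1,
      "format": "json",
  })
  print(resp.json())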
I've finished uploading Latin forms now, about 1.1M were uploaded in total.Uziel302 (talk) 13:41, 6 April 2020 (UTC)

Adding Kurdish data

Hi,

Sorry in advance, if it's not the best place to ask this question.

I am wondering if there is a way to import data automatically and merge it with the existing entries or create new ones. We have developed a few dictionaries for three dialects of the Kurdish languages, namely Hawrami, Sorani and Kurmanji. The datasets (available here) were originally created in Ontolex-Lemon, particularly using the Lexicog module.

Any help would be greatly appreciated.

--Sina.ahm (talk) 23:41, 5 April 2020 (UTC)

Hi Sina.ahm, this is the right place to ask for help with this kind of question/request. The first point to solve, before talking about the technical side, is the licence. On the GitHub repository, I see that your data are released under CC by-nc-sa. This licence is not compatible with Wikidata or any other Wikimedia project. Wikidata uses a CC-0 licence, which means, in short, that it is very close to the public domain. Other Wikimedia projects mostly use the CC by-sa licence, which is similar to the licence you have chosen but also authorises reuse of the data for commercial purposes. So, before going further, you should look into changing the licence of your data. Changing to CC-0 will allow you to import your data into Wikidata. Moving to CC by-sa will allow you to import your lexicographical data into any Wiktionary project. Pamputt (talk) 06:44, 6 April 2020 (UTC)
Sina.ahm, Pamputt, this is true for copyrightable content; grammatical information is not copyrightable, so you may upload it and leave the copyrightable data aside (definitions, encyclopedic information, etc.). Uziel302 (talk) 13:46, 6 April 2020 (UTC)
Thanks very much, Pamputt and Uziel302. As the owner of the data, I will change the license to the appropriate one. Assuming that the problem with the license is solved, is there a bot or a tool to import the data automatically? Where should I start importing? --Sina.ahm (talk) 13:49, 6 April 2020 (UTC)
Sina.ahm, you can upload the data using the wikibase API; this is the code I used to upload Latin lexemes. You should do some testing, and when your script is ready and has some valid example edits, request permission to run it as a bot. Uziel302 (talk) 14:09, 6 April 2020 (UTC)
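To give a rough idea of what such a script does (this is only a sketch, not the code linked above): creating a new lexeme is a single wbeditentity call with the lemma, the language item and the lexical category. The lemma value, the language item QID and the login/token below are placeholders that would come from your Ontolex-Lemon data; noun (Q1084) is just an example category.

  # Minimal sketch of creating one lexeme through the wikibase API.
  # Placeholders: lemma value, language item QID, login/session and token.
  import json
  import requests

  API = "https://www.wikidata.org/w/api.php"
  session = requests.Session()   # assumed to be logged in already
  csrf_token = "..."             # obtain via action=query&meta=tokens

  lexeme = {
      "lemmas": {"ckb": {"language": "ckb", "value": "..."}},  # e.g. a Sorani lemma
      "language": "Q...",            # item for the language/dialect (look up the QID)
      "lexicalCategory": "Q1084",    # noun (Q1084), as an example
  }

  resp = session.post(API, data={
      "action": "wbeditentity",
      "new": "lexeme",
      "data": json.dumps(lexeme),
      "token": csrf_token,
      "format": "json",
  })
  print(resp.json())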

New LexData version 0.2

I just released version 0.2 of LexData – my Python library for creating and editing lexicographical data. The changes include:

  • Support for bot-passwords
  • Respecting and properly handling maxlag
  • Claims can be added to existing Lexemes, Forms and Senses
  • Claims of arbitrary data types can be created
  • Updates via JSON are possible
  • Improved search
  • I added tests, so there should hopefully be considerably fewer bugs and regressions
  • Consistent logging
  • and more…

As before, you can get it from GitHub or PyPI, and the documentation can be found here. There are some minor incompatible changes, but I tried to keep it compatible with the 0.1 version branch and only deprecate the old APIs. In most cases no changes should be necessary. Happy Hacking! -- MichaelSchoenitzer (talk) 22:39, 17 April 2020 (UTC)
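Not an official example – the names below are from memory of the README, and the documentation linked above remains authoritative – but basic use looks roughly like this:

  # Rough usage sketch from memory of the LexData README; check the linked
  # documentation for the exact, current API.
  import LexData

  # Log in; 0.2 adds support for bot passwords ("User@botname").
  login = LexData.WikidataSession("MyUser@MyBot", "bot-password")

  # Fetch an existing lexeme and look at what is already on it.
  L2 = LexData.Lexeme(login, "L2")
  print(L2.forms)
  print(L2.senses)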

Very nice! I hope it will be put to good use in our tooling going forward.--So9q (talk) 22:52, 23 April 2020 (UTC)

Mapping of Lexicographical data model

I was wondering if the data model itself can carry statements that provide external mappings. As some of you know, I am doing most of the Schema.org/Wikidata mapping, and I noticed that the lexicographical data model could possibly be mapped to other ontologies where it makes sense. For example, I recently made a small edit to this page, adding an "equivalent to: skos:definition" note to the Gloss explanation: "Enter a gloss (very short phrase defining the meaning) (equivalent to: skos:definition) — Gloss". But ideally this would be applied in the data model itself somehow? Thadguidry (talk) 17:05, 20 April 2020 (UTC)

@Thadguidry: good idea, I think this has at least partly already been done for the query service (where you indeed use skos:definition to get the gloss). Not sure exactly where it is documented though; I used mw:Extension:WikibaseLexeme/RDF mapping when I started to do SPARQL on Lexemes, and I'm not sure whether there is more or better documentation somewhere @Lea Lacroix (WMDE): ? Anyway, integrating this into the main documentation page sounds like a really good idea. Cdlt, VIGNERON (talk) 17:33, 20 April 2020 (UTC)
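For example, the gloss-as-skos:definition part of that mapping can be seen directly on the query service; a small sketch (the language filter, limit and User-Agent string are arbitrary choices, not part of the mapping):

  # Sketch: senses and their glosses, which the RDF mapping exposes as skos:definition.
  import requests

  QUERY = """
  SELECT ?lexeme ?sense ?gloss WHERE {
    ?lexeme ontolex:sense ?sense .
    ?sense skos:definition ?gloss .
    FILTER(LANG(?gloss) = "en")
  }
  LIMIT 10
  """

  r = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "lexeme-gloss-example/0.1 (example only)"},
  )
  for row in r.json()["results"]["bindings"]:
      print(row["lexeme"]["value"], row["gloss"]["value"])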
Hey @Thadguidry:,
I don't fully understand what you mean. Do you want to improve the documentation? Or the descriptions on Lexemes themselves? Can you provide some examples so I can understand what the request is? Thanks! Lea Lacroix (WMDE) (talk) 07:08, 21 April 2020 (UTC)
Hi @Lea Lacroix (WMDE):!,
Sure, we already have a few properties that allow mapping both Properties and Entities to external ontologies, such as P2888 exact match, P2236 external subproperty and others. Here's an example of the mapping being done inside Wikidata itself: P2561 name. So the idea is that we could apply a few statements or assertions to the various fields (like Gloss) in the Wikibase Lexeme RDF mapping itself, showing relations to external ontologies, as well as on individual Senses. For example, Gloss is already mapped, but hardcoded, so I think it's best to expose this somehow to allow edits to be made, as there are other ontologies (current ones, as shown in LOV, and future ones) that should be aligned (not just Lemon and SKOS). If Wikibase Lexeme does allow writing this kind of information, then that is the documentation I am missing. Thadguidry (talk) 12:29, 21 April 2020 (UTC)
UPDATE: It looks like it IS POSSIBLE to directly apply the mapping after all!! (This wasn't working for me several months ago and I don't know why.) Take a look at the "exact match" I applied here for the concept/Sense of "first". So, I think the only thing left to do is documentation showing how to apply mappings to the Lexeme data model itself? How was Gloss linked to skos:definition in the RDF? And as a user, I would probably expect to see that information on Wikidata:Lexicographical_data/Documentation Thadguidry (talk) 12:29, 21 April 2020 (UTC)
Thanks a lot @Thadguidry:! Yes, I think Wikidata:Lexicographical_data/Documentation would be a good place to mention the mapping of the Lexeme data model to a few other models. Feel free to continue improving the page :) Lea Lacroix (WMDE) (talk) 16:01, 21 April 2020 (UTC)

Property suggestions for Lexeme/Form/Sense

On blank items, currently instance of (P31) and subclass of (P279) are suggested (see Help:Suggesters_and_selectors#Property_suggester about the feature). The idea is that the user would input a statement with one of these properties (not all).

I think it would be good to have some default suggestions for Lexemes/Form/Sense as well, e.g.

Are there other properties that could be included? --- Jura 11:01, 12 March 2020 (UTC)

For Form, the main property is pronunciation (P7243), with the two above as qualifiers, according to the new model. --Infovarius (talk) 18:25, 12 March 2020 (UTC)
It seems to be used mainly for Russian and to have been added by bot, but why not. Latin could probably use it too. Only one of the properties needs to be chosen and one can always use others. I added it above and expanded the others.--- Jura 21:54, 12 March 2020 (UTC)
If it's mainly added by bot and in Russian, maybe we should skip that. Also, maybe the property should have string datatype (but this is not really relevant to this discussion). --- Jura 16:37, 26 March 2020 (UTC)
Why only Russian? Or do other languages have no variation in the pronunciation of forms? --Infovarius (talk) 21:45, 27 March 2020 (UTC)
It was just an observation from current uses. --- Jura 13:58, 24 April 2020 (UTC)
For forms there could also be hyphenation (P5279). And language style (P6191) for both forms and senses. --Lexicolover (talk) 11:04, 13 March 2020 (UTC)
Sure. I added them. --- Jura 16:37, 26 March 2020 (UTC)

@Lea Lacroix (WMDE): is there a way to activate this? --- Jura 13:58, 24 April 2020 (UTC)

We would need to estimate how much work would be required for such a feature. I can't promise that it could happen any time soon, as our roadmap is already packed. Lea Lacroix (WMDE) (talk) 14:58, 27 April 2020 (UTC)

New Version of MachtSinn

Many of you might already know my tool "MachtSinn", which allows users to easily add senses to lexemes, generated from Wikidata items. In the last few weeks I have again improved it significantly. You might want to take another look at it.

Compared with the first version announced here some time ago, there are a lot of improvements:

  • You can add glosses not only in the lexeme's language, but in arbitrary other languages at the same time!
  • There are now keyboard shortcuts
  • New address: Machtsinn.toolforge.org – thanks to the Cloud Services team at WMF
  • More potential matches especially for languages with smaller Wikipedias
  • Fewer false positives – especially for English
  • To have fewer false positives, it's possible to add separate optimized queries for each language (currently only English).
  • Improved, responsive design – should work on mobile devices
  • English verbs are prefixed with "to", German nouns are prefixed with "Der/Die/Das". You can add prefixes for other languages.
  • You can edit the gloss, if it is not an appropriate description
  • Your browser language is used as the default language
  • The grammatical gender and lexicographical category are displayed
  • The database is regularly updated
  • There is a help text and improved statistics
  • The amount of network traffic is minimized so that it should work well even with a slow connection
  • Fixed lots of bugs

Big thanks to @Incabell, @DDuplinszki, @Ainali, @so9q and @dzarmola for their help. And thanks to everyone using it to add senses – already 15'000 senses have been added. 35'000 potential matches are currently waiting for you. ;) The code is available on GitHub – pull requests are welcome. Have fun and stay safe. -- MichaelSchoenitzer (talk) 23:26, 30 April 2020 (UTC)