Wikidata talk:Lexicographical coverage


Current state of PAWS notebook[edit]

The PAWS notebook is still a little incomplete: not all languages are covered yet. This is because my first version of the code was written in Python 2, and Unicode handling works differently in Python 3. Hopefully that can be improved in the notebook.

The other thing is that I think Japanese and Chinese just look wrong. Not sure if the tool makes any sense for these two languages. Input and suggestions are very welcome. --DVrandecic (WMF) (talk) 17:04, 10 February 2021 (UTC)[reply]

@DVrandecic (WMF): For Chinese:
  • Install [1]
  • Run jieba.cut("text without wiki markup", cut_all=True)
  • Remove anything not matching .*[一-龯㐀-䶵].*

--GZWDer (talk) 17:27, 10 February 2021 (UTC)[reply]
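GZWDer's filtering step can be sketched in Python. This is only an illustration, not the notebook's actual code: the segmentation itself would come from `jieba.cut(...)` as described above (jieba is a third-party package), so the token list here is hand-made, and the function name `filter_cjk_tokens` is mine.

```python
import re

# Keeps tokens containing at least one CJK ideograph, per the pattern above.
CJK = re.compile(r".*[一-龯㐀-䶵].*")

def filter_cjk_tokens(tokens):
    """Drop tokens with no CJK character (punctuation, Latin fragments, digits)."""
    return [t for t in tokens if CJK.match(t)]

# With jieba installed, the tokens would come from:
#   import jieba
#   tokens = jieba.cut("text without wiki markup", cut_all=True)
# Here a hand-segmented example stands in for that call.
tokens = ["维基", "数据", "，", "2021", "wiki", "词典"]
print(filter_cjk_tokens(tokens))  # → ['维基', '数据', '词典']
```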

@DVrandecic (WMF): Also: per Wikidata:Lexicographical_coverage/Missing/en, please use case-insensitive matching against the current lexemes.--GZWDer (talk) 17:29, 10 February 2021 (UTC)[reply]
I am setting up the case-insensitivity fix. For Chinese, please feel free to set that up and send me a patch if you don't want to wait until I get around to doing it. That would be great! --DVrandecic (WMF) (talk) 01:29, 12 February 2021 (UTC)[reply]
The case-sensitivity issue is fixed now. --DVrandecic (WMF) (talk) 02:31, 12 February 2021 (UTC)[reply]
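The case-insensitive matching GZWDer requested can be sketched like this. The data and the function name `is_covered` are hypothetical; the real notebook compares corpus forms against forms from the lexeme dump:

```python
# Hypothetical data: forms present in the lexeme dump.
lexeme_forms = {"Berlin", "the", "The"}

def is_covered(corpus_form, forms):
    """Case-insensitive coverage check: 'THE' matches 'the' or 'The'."""
    lowered = {f.lower() for f in forms}
    return corpus_form.lower() in lowered

print(is_covered("THE", lexeme_forms))     # True
print(is_covered("berlin", lexeme_forms))  # True
print(is_covered("Paris", lexeme_forms))   # False
```

Without the lowercasing step, sentence-initial capitalized tokens in a corpus ("The") would wrongly count as missing even when the lowercase form ("the") exists as a lexeme.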
@DVrandecic (WMF): It seems that Japanese tokenization has failed. You may want to use konoha or something similar.--Afaz (talk) 09:18, 20 February 2021 (UTC)[reply]

@Afaz, GZWDer: I removed Japanese and Chinese from the list for now as the tokenizer completely fails. I'll add them once that is fixed. Happy to receive patches. Thanks for the suggestions, they look good! --DVrandecic (WMF) (talk) 15:15, 20 February 2021 (UTC)[reply]

bn[edit]

@DVrandecic (WMF):, I am interested in getting the lexicographical coverage data for Bengali. Can you please generate that? Thanks in advance. -- Bodhisattwa (talk) 12:33, 11 February 2021 (UTC)[reply]

I'd love to. I need a reasonably clean text corpus, preferably created from Bengali Wikipedia. If that's somewhere, I'm happy to include it here. --DVrandecic (WMF) (talk) 01:26, 12 February 2021 (UTC)[reply]
@Bodhisattwa: The 'Extracted page abstracts for Yahoo' on the dump page of bnwiki might be a good candidate. If you know anything about the quality of this extraction, I'd be happy to hear before I try. --DVrandecic (WMF) (talk) 15:09, 20 February 2021 (UTC)[reply]
@Bodhisattwa: Does this look sensible to you (in terms of most frequent forms in bn, not in terms of missing forms)? This is based on the Abstracts, which are really supershort: Most frequent bn forms --DVrandecic (WMF) (talk) 22:31, 20 February 2021 (UTC)[reply]
@DVrandecic (WMF): A better corpus to use might be the one linked here. (There was another one, which for the life of me I cannot remember the name of, which was part of a larger release of multiple Indic language Wikipedia corpora; I thought the release was by Google, but I can't seem to find it.) Mahir256 (talk) 03:19, 21 February 2021 (UTC)[reply]
@DVrandecic (WMF), Bodhisattwa: So I ran an analysis of my own using the corpus I linked (adjusted the link to that data) and got some statistics, which I've added to the main coverage page and the missing list. Mahir256 (talk) 04:14, 21 February 2021 (UTC)[reply]

Authentication on PAWS[edit]

For some reason, I am not able to get to the PAWS notebook. I am logged in on PAWS, but authentication for the coverage notebook fails. I get "403 : Forbidden Authorization form must be sent from authorization page". Perhaps this is a PAWS issue? — Finn Årup Nielsen (fnielsen) (talk) 19:37, 11 February 2021 (UTC)[reply]

@Fnielsen: Fixed! (Working link) It had used the private instead of public subsubdomain. Thanks for the heads-up. :) Quiddity (WMF) (talk) 21:13, 11 February 2021 (UTC)[reply]

Token[edit]

What is it? Is it like an initial form (lexeme)? --Infovarius (talk) 20:53, 11 February 2021 (UTC)[reply]

Ah, sorry, it is explained at m:Abstract Wikipedia/Updates/2021-02-10. A token is an occurrence of the form. --Infovarius (talk) 21:05, 11 February 2021 (UTC)[reply]
Correct! Sorry for using jargon. --DVrandecic (WMF) (talk) 01:24, 12 February 2021 (UTC)[reply]
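A toy illustration of the distinction, with a hypothetical sentence (this is not the notebook's code): a single form can occur as many tokens in running text.

```python
from collections import Counter

text = "the cat saw the dog and the bird"
tokens = text.split()      # every occurrence in the text is a token
forms = Counter(tokens)    # distinct forms, each with its token count

print(len(tokens))   # 8 tokens in total
print(forms["the"])  # the one form "the" accounts for 3 of those tokens
```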

Updates to the missing pages[edit]

The /Missing/<language code> pages (e.g. hi, ko) are quite useful for prioritizing curation. Will they be updated in regular intervals or will there be some other mechanism to assist with keeping track of the backlog? --Daniel Mietchen (talk) 00:40, 12 February 2021 (UTC)[reply]

I plan to update them for a while, but I hope someone else will take this up (volunteers, let me know!). Or set up a bot to automatically update them every time a new lexemes dump is available, or something like that, so this doesn't depend on me. --DVrandecic (WMF) (talk) 01:24, 12 February 2021 (UTC)[reply]

Duplicate forms?[edit]

Looking at the Swedish missing list, I see some duplicates, like "the" at positions 11 and 20, "a" at 39 and 42, and "södra" at 35 and 164. What makes these different forms? If they really are different, it would be helpful to have the base lexeme mentioned as well, to make it easier to improve. (Oh, and thanks for this list!) Ainali (talk)

@Ainali: I forgot to normalize some forms re capitalization (but then did so for display), which is why they appeared as seemingly different forms. This is now fixed, I hope. --DVrandecic (WMF) (talk) 15:11, 20 February 2021 (UTC)[reply]

br[edit]

Hi @DVrandecic (WMF):

Would it be possible to add Breton (Breton (Q12107)), from https://dumps.wikimedia.org/brwiki/20210201/?

A tricky part: there is no letter "c" in Breton, but there is a letter "c'h". The apostrophe ' usually separates different lexemes (as in most languages), *except* between c and h. For instance, from "Ur c'hazh m'eus" (I have a cat) you should extract "ur" (L35220) + "c'hazh" (L458) + "m" + "eus" (L3395). Not sure how you can deal with that... When in doubt, don't use ' as a separator ("m'eus" would be a strange lexeme proposition, but "c" + "hazh" would be even weirder).

Cheers, VIGNERON (talk) 09:13, 21 February 2021 (UTC)[reply]
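The rule above ("an apostrophe separates tokens, except inside c'h") can be expressed with lookaround assertions in a regular expression. This is only a sketch under that stated rule; `tokenize_breton` is a hypothetical name, not part of the notebook:

```python
import re

# A letter, or an apostrophe that sits between c and h (the trigraph c'h).
# Elsewhere the apostrophe ends the token, so m'eus → "m" + "eus".
TOKEN = re.compile(r"(?:[^\W\d_]|(?<=[cC])'(?=[hH]))+")

def tokenize_breton(text):
    return [t.lower() for t in TOKEN.findall(text)]

print(tokenize_breton("Ur c'hazh m'eus"))  # → ['ur', "c'hazh", 'm', 'eus']
```

The `[^\W\d_]` class matches any Unicode letter, so accented Breton characters such as "ñ" and "ê" stay inside tokens as well.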


fr[edit]

Wikidata:Lexicographical_coverage#fr: good to see we found a measure to calculate coverage for French ;)

Looking at Wikidata:Lexicographical coverage/Missing/fr, it shows lots of things I consider covered, but maybe you expect "l'" and "d'" to be matched differently (Lexeme:L2770#F5 etc.) --- Jura 01:10, 2 March 2021 (UTC)[reply]
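One way to let elided clitics such as "l'" and "d'" match lexeme forms like Lexeme:L2770#F5 is to split them off as their own tokens, keeping the apostrophe attached. A hypothetical sketch (the clitic list and the function name `tokenize_fr` are mine, not the notebook's, and typographic apostrophes ’ would need normalizing first):

```python
import re

# Elided French clitics: c', d', j', l', m', n', s', t', qu'.
ELISION = re.compile(r"([cdjlmnst]|qu)'", re.IGNORECASE)

def tokenize_fr(text):
    """Split whitespace-separated words, detaching a leading elided clitic."""
    tokens = []
    for word in text.split():
        m = ELISION.match(word)
        if m:
            tokens.append(m.group(0))      # e.g. "l'"
            tokens.append(word[m.end():])  # e.g. "arbre"
        else:
            tokens.append(word)
    return tokens

print(tokenize_fr("l'arbre d'or"))  # → ["l'", 'arbre', "d'", 'or']
```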

Is this kept up-to-date?[edit]

@Quiddity: When is the data correct for? The page does not seem to explicitly say. Asaf Bartov (talk) 21:21, 28 September 2022 (UTC)[reply]

@Ijon: I've added a tentative note about the bot-update timing. It seems to be running weekly at the moment, but I'm not sure if that's on a timer, or triggered manually when @Nikki has time. Hope that helps! Quiddity (WMF) (talk) 00:28, 29 September 2022 (UTC)[reply]
thank you very much! Asaf Bartov (talk) 07:37, 29 September 2022 (UTC)[reply]