Wikidata:Lexicographical coverage


This page presents the coverage of Wikidata's lexicographical data, measured against a corpus for each language. Unless the entry for a language says otherwise, the corpus is based on Wikipedia (available here).
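The per-language figures below can be read as a comparison between a corpus frequency list and the set of forms known to Wikidata. The following is a minimal sketch of that calculation, not the bot's actual code: the function name `coverage`, the dictionary keys, and the toy data are hypothetical, and it assumes plain exact string matching, which may differ from how the bot tokenises or normalises words.

```python
from collections import Counter


def coverage(corpus_counts, wikidata_forms):
    """Return form and token coverage of a corpus by a set of known forms.

    corpus_counts: mapping of word form -> number of occurrences (tokens).
    wikidata_forms: set of form strings assumed to exist in Wikidata.
    """
    total_forms = len(corpus_counts)
    total_tokens = sum(corpus_counts.values())
    # A form is "covered" if the exact same string is a known form.
    covered_forms = sum(1 for form in corpus_counts if form in wikidata_forms)
    covered_tokens = sum(
        count for form, count in corpus_counts.items() if form in wikidata_forms
    )
    return {
        "forms_in_corpus": total_forms,
        "tokens": total_tokens,
        "covered_forms": covered_forms,
        "missing_forms": total_forms - covered_forms,
        "covered_tokens": covered_tokens,
        "missing_tokens": total_tokens - covered_tokens,
        "form_coverage_pct": round(100 * covered_forms / total_forms, 1)
        if total_forms else 0.0,
        "token_coverage_pct": round(100 * covered_tokens / total_tokens, 1)
        if total_tokens else 0.0,
    }


# Toy data: 6 distinct forms, 8 tokens.
corpus_counts = Counter("the cat sat on the mat the end".split())
wikidata_forms = {"the", "cat", "dog"}  # "dog" never occurs in the corpus
stats = coverage(corpus_counts, wikidata_forms)
print(stats["form_coverage_pct"], stats["token_coverage_pct"])  # 33.3 50.0
```

Because frequent words account for most tokens, token coverage is usually much higher than form coverage. The English entry on this page, for example, covers 7.9% of forms but 93.0% of tokens.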

These pages are updated weekly on Wednesdays by NikkiBot, although no edit will be made if nothing has changed.

The code for the bot is at https://github.com/nikkiwd/lexcover and is based on the original PAWS notebook by Denny. Report issues to Nikki (either on User talk:Nikki or on Telegram). Requests for additional languages, improvements and suggestions are also welcome.

Words can be filtered out by adding them to the "Filter" subpage for the language (e.g. Wikidata:Lexicographical coverage/nb/Filter). The entries in the list can be customised, e.g. to add search links, by editing the "Missing/row" subpage (e.g. Wikidata:Lexicographical coverage/nb/Missing/row). Content can also be added before and after the list, e.g. to turn the output into a table, by editing the "Missing/head" and "Missing/foot" subpages.
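As a rough illustration of how a filter list and head/row/foot pieces could combine into a "most frequent missing forms" list, here is a small sketch. It is an assumption-laden model rather than the bot's implementation: the function names, the `limit` parameter, and the Python format strings standing in for the "Missing/head", "Missing/row" and "Missing/foot" subpages are all hypothetical, and the real subpages contain wikitext rather than Python strings.

```python
def most_frequent_missing(corpus_counts, wikidata_forms, filtered=frozenset(), limit=10):
    """Return the most frequent corpus forms that are neither known nor filtered."""
    missing = [
        (form, count)
        for form, count in corpus_counts.items()
        if form not in wikidata_forms and form not in filtered
    ]
    # Highest count first; ties are broken alphabetically for a stable order.
    missing.sort(key=lambda item: (-item[1], item[0]))
    return missing[:limit]


def render(missing, head="", row="{form} ({count})", foot=""):
    """Join an optional head, one formatted row per form, and an optional foot."""
    lines = [head] if head else []
    lines += [row.format(form=form, count=count) for form, count in missing]
    if foot:
        lines.append(foot)
    return "\n".join(lines)


# Toy data; "on" plays the role of a word listed on the Filter subpage.
corpus_counts = {"the": 3, "cat": 1, "sat": 1, "on": 1, "mat": 1, "end": 1}
top = most_frequent_missing(corpus_counts, {"the", "cat"}, filtered={"on"}, limit=2)
print(render(top, head="Missing:", foot="--"))
```

Keeping the row format separate from the ranking step mirrors the idea that the list's appearance can be changed without touching how the missing forms are selected.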

ar

  • Forms in Wikidata: 1,408
  • Forms in Wikipedia: 246,598
  • Tokens: 69,840,956
  • Covered forms: 214 (0.1%)
  • Missing forms: 246,384 (99.9%)
  • Covered tokens: 1,350,476 (1.9%)
  • Missing tokens: 68,490,480 (98.1%)
  • Most frequent missing forms

bg

  • Forms in Wikidata: 233
  • Forms in Wikipedia: 118,514
  • Tokens: 33,132,887
  • Covered forms: 200 (0.2%)
  • Missing forms: 118,314 (99.8%)
  • Covered tokens: 775,767 (2.3%)
  • Missing tokens: 32,357,120 (97.7%)
  • Most frequent missing forms

br

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 8,850
  • Forms in Wikipedia: 9,552
  • Tokens: 1,459,030
  • Covered forms: 1,736 (18.2%)
  • Missing forms: 7,816 (81.8%)
  • Covered tokens: 1,021,824 (70.0%)
  • Missing tokens: 437,206 (30.0%)
  • Most frequent missing forms

bn

(This analysis was performed separately from all the others on this page, using the corpus linked here and custom counting code.)

  • Forms in Wikidata: 46,504
  • Forms in Wikipedia: 534,894
  • Tokens: 13,306,025
  • Covered forms: 13,900 (2.60%)
  • Missing forms: 520,994 (97.40%)
  • Covered tokens: 5,079,352 (38.17%)
  • Missing tokens: 8,226,673 (61.83%)
  • Most frequent missing forms

bs

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 7
  • Forms in Wikipedia: 35,431
  • Tokens: 3,876,195
  • Covered forms: 4 (0.0%)
  • Missing forms: 35,427 (100.0%)
  • Covered tokens: 392 (0.0%)
  • Missing tokens: 3,875,803 (100.0%)
  • Most frequent missing forms

ca

  • Forms in Wikidata: 178
  • Forms in Wikipedia: 176,311
  • Tokens: 108,297,498
  • Covered forms: 130 (0.1%)
  • Missing forms: 176,181 (99.9%)
  • Covered tokens: 14,815,847 (13.7%)
  • Missing tokens: 93,481,651 (86.3%)
  • Most frequent missing forms

cs

  • Forms in Wikidata: 193,734
  • Forms in Wikipedia: 261,374
  • Tokens: 74,084,890
  • Covered forms: 46,193 (17.7%)
  • Missing forms: 215,181 (82.3%)
  • Covered tokens: 46,861,260 (63.3%)
  • Missing tokens: 27,223,630 (36.7%)
  • Most frequent missing forms

cy

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 125
  • Forms in Wikipedia: 10,844
  • Tokens: 1,442,683
  • Covered forms: 79 (0.7%)
  • Missing forms: 10,765 (99.3%)
  • Covered tokens: 27,189 (1.9%)
  • Missing tokens: 1,415,494 (98.1%)
  • Most frequent missing forms

da

  • Forms in Wikidata: 535,107
  • Forms in Wikipedia: 111,139
  • Tokens: 30,879,404
  • Covered forms: 56,921 (51.2%)
  • Missing forms: 54,218 (48.8%)
  • Covered tokens: 28,085,126 (91.0%)
  • Missing tokens: 2,794,278 (9.0%)
  • Most frequent missing forms

de

  • Forms in Wikidata: 204,978
  • Forms in Wikipedia: 1,008,036
  • Tokens: 596,433,479
  • Covered forms: 109,058 (10.8%)
  • Missing forms: 898,978 (89.2%)
  • Covered tokens: 474,411,777 (79.5%)
  • Missing tokens: 122,021,702 (20.5%)
  • Most frequent missing forms

el

  • Forms in Wikidata: 38,902
  • Forms in Wikipedia: 129,276
  • Tokens: 40,452,744
  • Covered forms: 16,793 (13.0%)
  • Missing forms: 112,483 (87.0%)
  • Covered tokens: 18,421,537 (45.5%)
  • Missing tokens: 22,031,207 (54.5%)
  • Most frequent missing forms

en

  • Forms in Wikidata: 111,881
  • Forms in Wikipedia: 965,225
  • Tokens: 1,508,248,447
  • Covered forms: 76,593 (7.9%)
  • Missing forms: 888,632 (92.1%)
  • Covered tokens: 1,402,094,006 (93.0%)
  • Missing tokens: 106,154,441 (7.0%)
  • Most frequent missing forms

eo

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 5,759
  • Forms in Wikipedia: 27,201
  • Tokens: 4,222,541
  • Covered forms: 2,516 (9.2%)
  • Missing forms: 24,685 (90.8%)
  • Covered tokens: 2,458,658 (58.2%)
  • Missing tokens: 1,763,883 (41.8%)
  • Most frequent missing forms

es

  • Forms in Wikidata: 529,475
  • Forms in Wikipedia: 372,589
  • Tokens: 405,914,020
  • Covered forms: 88,404 (23.7%)
  • Missing forms: 284,185 (76.3%)
  • Covered tokens: 371,817,272 (91.6%)
  • Missing tokens: 34,096,748 (8.4%)
  • Most frequent missing forms

et

  • Forms in Wikidata: 2,637,161
  • Forms in Wikipedia: 123,073
  • Tokens: 16,832,892
  • Covered forms: 72,704 (59.1%)
  • Missing forms: 50,369 (40.9%)
  • Covered tokens: 13,682,782 (81.3%)
  • Missing tokens: 3,150,110 (18.7%)
  • Most frequent missing forms

eu

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 1,002,593
  • Forms in Wikipedia: 26,466
  • Tokens: 3,138,442
  • Covered forms: 16,173 (61.1%)
  • Missing forms: 10,293 (38.9%)
  • Covered tokens: 2,381,172 (75.9%)
  • Missing tokens: 757,270 (24.1%)
  • Most frequent missing forms

fa

  • Forms in Wikidata: 38,248
  • Forms in Wikipedia: 100,251
  • Tokens: 44,426,012
  • Covered forms: 8,436 (8.4%)
  • Missing forms: 91,815 (91.6%)
  • Covered tokens: 15,336,866 (34.5%)
  • Missing tokens: 29,089,146 (65.5%)
  • Most frequent missing forms

fi

  • Forms in Wikidata: 9,044
  • Forms in Wikipedia: 276,898
  • Tokens: 46,847,582
  • Covered forms: 5,134 (1.9%)
  • Missing forms: 271,764 (98.1%)
  • Covered tokens: 12,720,934 (27.2%)
  • Missing tokens: 34,126,648 (72.8%)
  • Most frequent missing forms

fr

  • Forms in Wikidata: 252,369
  • Forms in Wikipedia: 465,138
  • Tokens: 474,988,250
  • Covered forms: 54,715 (11.8%)
  • Missing forms: 410,423 (88.2%)
  • Covered tokens: 415,474,466 (87.5%)
  • Missing tokens: 59,513,784 (12.5%)
  • Most frequent missing forms

ha

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 1,275
  • Forms in Wikipedia: 4,816
  • Tokens: 859,259
  • Covered forms: 476 (9.9%)
  • Missing forms: 4,340 (90.1%)
  • Covered tokens: 310,590 (36.1%)
  • Missing tokens: 548,669 (63.9%)
  • Most frequent missing forms

he

  • Forms in Wikidata: 328,203
  • Forms in Wikipedia: 249,890
  • Tokens: 76,643,376
  • Covered forms: 54,465 (21.8%)
  • Missing forms: 195,425 (78.2%)
  • Covered tokens: 42,008,286 (54.8%)
  • Missing tokens: 34,635,090 (45.2%)
  • Most frequent missing forms

hi

  • Forms in Wikidata: 7,630
  • Forms in Wikipedia: 54,443
  • Tokens: 18,734,831
  • Covered forms: 3,059 (5.6%)
  • Missing forms: 51,384 (94.4%)
  • Covered tokens: 12,434,459 (66.4%)
  • Missing tokens: 6,300,372 (33.6%)
  • Most frequent missing forms

hr

  • Forms in Wikidata: 4,895
  • Forms in Wikipedia: 135,627
  • Tokens: 28,543,040
  • Covered forms: 2,831 (2.1%)
  • Missing forms: 132,796 (97.9%)
  • Covered tokens: 13,444,820 (47.1%)
  • Missing tokens: 15,098,220 (52.9%)
  • Most frequent missing forms

hu

  • Forms in Wikidata: 154
  • Forms in Wikipedia: 274,652
  • Tokens: 64,674,851
  • Covered forms: 100 (0.0%)
  • Missing forms: 274,552 (100.0%)
  • Covered tokens: 268,172 (0.4%)
  • Missing tokens: 64,406,679 (99.6%)
  • Most frequent missing forms

id

  • Forms in Wikidata: 392,890
  • Forms in Wikipedia: 100,137
  • Tokens: 40,049,055
  • Covered forms: 16,578 (16.6%)
  • Missing forms: 83,559 (83.4%)
  • Covered tokens: 22,931,252 (57.3%)
  • Missing tokens: 17,117,803 (42.7%)
  • Most frequent missing forms

ig

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 3,360
  • Forms in Wikipedia: 1,153
  • Tokens: 113,878
  • Covered forms: 431 (37.4%)
  • Missing forms: 722 (62.6%)
  • Covered tokens: 75,472 (66.3%)
  • Missing tokens: 38,406 (33.7%)
  • Most frequent missing forms

it

  • Forms in Wikidata: 405,627
  • Forms in Wikipedia: 341,080
  • Tokens: 284,500,580
  • Covered forms: 95,434 (28.0%)
  • Missing forms: 245,646 (72.0%)
  • Covered tokens: 260,559,963 (91.6%)
  • Missing tokens: 23,940,617 (8.4%)
  • Most frequent missing forms

ko

  • Forms in Wikidata: 523
  • Forms in Wikipedia: 290,844
  • Tokens: 34,282,183
  • Covered forms: 426 (0.1%)
  • Missing forms: 290,418 (99.9%)
  • Covered tokens: 2,463,793 (7.2%)
  • Missing tokens: 31,818,390 (92.8%)
  • Most frequent missing forms

la

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 778,661
  • Forms in Wikipedia: 11,551
  • Tokens: 1,031,544
  • Covered forms: 8,317 (72.0%)
  • Missing forms: 3,234 (28.0%)
  • Covered tokens: 884,683 (85.8%)
  • Missing tokens: 146,861 (14.2%)
  • Most frequent missing forms

lb

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 874
  • Forms in Wikipedia: 10,240
  • Tokens: 1,365,293
  • Covered forms: 369 (3.6%)
  • Missing forms: 9,871 (96.4%)
  • Covered tokens: 488,116 (35.8%)
  • Missing tokens: 877,177 (64.2%)
  • Most frequent missing forms

lt

  • Forms in Wikidata: 84
  • Forms in Wikipedia: 92,063
  • Tokens: 13,288,668
  • Covered forms: 39 (0.0%)
  • Missing forms: 92,024 (100.0%)
  • Covered tokens: 62,019 (0.5%)
  • Missing tokens: 13,226,649 (99.5%)
  • Most frequent missing forms

lv

  • Forms in Wikidata: 1,929
  • Forms in Wikipedia: 60,189
  • Tokens: 8,004,635
  • Covered forms: 1,145 (1.9%)
  • Missing forms: 59,044 (98.1%)
  • Covered tokens: 2,356,774 (29.4%)
  • Missing tokens: 5,647,861 (70.6%)
  • Most frequent missing forms

ml

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 748,804
  • Forms in Wikipedia: 28,789
  • Tokens: 1,992,352
  • Covered forms: 8,837 (30.7%)
  • Missing forms: 19,952 (69.3%)
  • Covered tokens: 1,074,456 (53.9%)
  • Missing tokens: 917,896 (46.1%)
  • Most frequent missing forms

ms

  • Forms in Wikidata: 4,481
  • Forms in Wikipedia: 51,515
  • Tokens: 16,143,659
  • Covered forms: 3,465 (6.7%)
  • Missing forms: 48,050 (93.3%)
  • Covered tokens: 11,948,528 (74.0%)
  • Missing tokens: 4,195,131 (26.0%)
  • Most frequent missing forms

mt

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 608
  • Forms in Wikipedia: 5,941
  • Tokens: 371,515
  • Covered forms: 297 (5.0%)
  • Missing forms: 5,644 (95.0%)
  • Covered tokens: 127,796 (34.4%)
  • Missing tokens: 243,719 (65.6%)
  • Most frequent missing forms

nan

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 42
  • Forms in Wikipedia: 5,791
  • Tokens: 1,119,898
  • Covered forms: 2 (0.0%)
  • Missing forms: 5,789 (100.0%)
  • Covered tokens: 63 (0.0%)
  • Missing tokens: 1,119,835 (100.0%)
  • Most frequent missing forms

nb

  • Forms in Wikidata: 156,114
  • Forms in Wikipedia: 153,555
  • Tokens: 49,620,256
  • Covered forms: 47,929 (31.2%)
  • Missing forms: 105,626 (68.8%)
  • Covered tokens: 44,362,797 (89.4%)
  • Missing tokens: 5,257,459 (10.6%)
  • Most frequent missing forms

nl

  • Forms in Wikidata: 1,023
  • Forms in Wikipedia: 260,266
  • Tokens: 130,343,371
  • Covered forms: 696 (0.3%)
  • Missing forms: 259,570 (99.7%)
  • Covered tokens: 38,084,248 (29.2%)
  • Missing tokens: 92,259,123 (70.8%)
  • Most frequent missing forms

nn

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 86,457
  • Forms in Wikipedia: 23,956
  • Tokens: 4,198,152
  • Covered forms: 8,738 (36.5%)
  • Missing forms: 15,218 (63.5%)
  • Covered tokens: 3,429,149 (81.7%)
  • Missing tokens: 769,003 (18.3%)
  • Most frequent missing forms

pa

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 15,677
  • Forms in Wikipedia: 21,156
  • Tokens: 4,611,923
  • Covered forms: 3,329 (15.7%)
  • Missing forms: 17,827 (84.3%)
  • Covered tokens: 3,381,437 (73.3%)
  • Missing tokens: 1,230,486 (26.7%)
  • Most frequent missing forms

pl

  • Forms in Wikidata: 19,057
  • Forms in Wikipedia: 333,225
  • Tokens: 117,356,732
  • Covered forms: 7,464 (2.2%)
  • Missing forms: 325,761 (97.8%)
  • Covered tokens: 43,258,583 (36.9%)
  • Missing tokens: 74,098,149 (63.1%)
  • Most frequent missing forms

pnb

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 15,731
  • Forms in Wikipedia: 21,465
  • Tokens: 5,029,117
  • Covered forms: 1,857 (8.7%)
  • Missing forms: 19,608 (91.3%)
  • Covered tokens: 2,439,087 (48.5%)
  • Missing tokens: 2,590,030 (51.5%)
  • Most frequent missing forms

pt

  • Forms in Wikidata: 34,532
  • Forms in Wikipedia: 214,847
  • Tokens: 158,056,230
  • Covered forms: 13,770 (6.4%)
  • Missing forms: 201,077 (93.6%)
  • Covered tokens: 118,281,244 (74.8%)
  • Missing tokens: 39,774,986 (25.2%)
  • Most frequent missing forms

ro

  • Forms in Wikidata: 58
  • Forms in Wikipedia: 119,245
  • Tokens: 40,889,103
  • Covered forms: 48 (0.0%)
  • Missing forms: 119,197 (100.0%)
  • Covered tokens: 370,902 (0.9%)
  • Missing tokens: 40,518,201 (99.1%)
  • Most frequent missing forms

ru

  • Forms in Wikidata: 912,948
  • Forms in Wikipedia: 651,825
  • Tokens: 290,067,562
  • Covered forms: 139,801 (21.4%)
  • Missing forms: 512,024 (78.6%)
  • Covered tokens: 177,479,843 (61.2%)
  • Missing tokens: 112,587,719 (38.8%)
  • Most frequent missing forms

sd

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 972
  • Forms in Wikipedia: 11,146
  • Tokens: 1,533,326
  • Covered forms: 82 (0.7%)
  • Missing forms: 11,064 (99.3%)
  • Covered tokens: 396,270 (25.8%)
  • Missing tokens: 1,137,056 (74.2%)
  • Most frequent missing forms

se

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 57,028
  • Forms in Wikipedia: 1,705
  • Tokens: 95,453
  • Covered forms: 50 (2.9%)
  • Missing forms: 1,655 (97.1%)
  • Covered tokens: 3,279 (3.4%)
  • Missing tokens: 92,174 (96.6%)
  • Most frequent missing forms

sk

  • Forms in Wikidata: 128,106
  • Forms in Wikipedia: 109,573
  • Tokens: 18,366,700
  • Covered forms: 46,034 (42.0%)
  • Missing forms: 63,539 (58.0%)
  • Covered tokens: 12,451,506 (67.8%)
  • Missing tokens: 5,915,194 (32.2%)
  • Most frequent missing forms

sl

  • Forms in Wikidata: 103
  • Forms in Wikipedia: 106,577
  • Tokens: 19,924,659
  • Covered forms: 76 (0.1%)
  • Missing forms: 106,501 (99.9%)
  • Covered tokens: 114,079 (0.6%)
  • Missing tokens: 19,810,580 (99.4%)
  • Most frequent missing forms

sr

  • Forms in Wikidata: 32
  • Forms in Wikipedia: 183,777
  • Tokens: 42,439,136
  • Covered forms: 23 (0.0%)
  • Missing forms: 183,754 (100.0%)
  • Covered tokens: 127,781 (0.3%)
  • Missing tokens: 42,311,355 (99.7%)
  • Most frequent missing forms

sv

  • Forms in Wikidata: 260,821
  • Forms in Wikipedia: 219,718
  • Tokens: 72,173,155
  • Covered forms: 67,041 (30.5%)
  • Missing forms: 152,677 (69.5%)
  • Covered tokens: 64,276,644 (89.1%)
  • Missing tokens: 7,896,511 (10.9%)
  • Most frequent missing forms

ta

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 6,528
  • Forms in Wikipedia: 31,721
  • Tokens: 2,539,025
  • Covered forms: 1,095 (3.5%)
  • Missing forms: 30,626 (96.5%)
  • Covered tokens: 283,535 (11.2%)
  • Missing tokens: 2,255,490 (88.8%)
  • Most frequent missing forms

tg

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 100
  • Forms in Wikipedia: 9,793
  • Tokens: 1,252,518
  • Covered forms: 29 (0.3%)
  • Missing forms: 9,764 (99.7%)
  • Covered tokens: 4,210 (0.3%)
  • Missing tokens: 1,248,308 (99.7%)
  • Most frequent missing forms

th

  • Forms in Wikidata: 15
  • Forms in Wikipedia: 27,089
  • Tokens: 2,068,858
  • Covered forms: 13 (0.0%)
  • Missing forms: 27,076 (100.0%)
  • Covered tokens: 4,870 (0.2%)
  • Missing tokens: 2,063,988 (99.8%)
  • Most frequent missing forms

tl

  • Forms in Wikidata: 27
  • Forms in Wikipedia: 20,893
  • Tokens: 3,583,109
  • Covered forms: 19 (0.1%)
  • Missing forms: 20,874 (99.9%)
  • Covered tokens: 21,296 (0.6%)
  • Missing tokens: 3,561,813 (99.4%)
  • Most frequent missing forms

tr

  • Forms in Wikidata: 2,125
  • Forms in Wikipedia: 151,341
  • Tokens: 30,211,406
  • Covered forms: 1,333 (0.9%)
  • Missing forms: 150,008 (99.1%)
  • Covered tokens: 6,791,830 (22.5%)
  • Missing tokens: 23,419,576 (77.5%)
  • Most frequent missing forms

uk

  • Forms in Wikidata: 238,621
  • Forms in Wikipedia: 356,409
  • Tokens: 114,386,141
  • Covered forms: 26,786 (7.5%)
  • Missing forms: 329,623 (92.5%)
  • Covered tokens: 16,779,308 (14.7%)
  • Missing tokens: 97,606,833 (85.3%)
  • Most frequent missing forms

ur

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 7,712
  • Forms in Wikipedia: 17,576
  • Tokens: 4,872,849
  • Covered forms: 1,357 (7.7%)
  • Missing forms: 16,219 (92.3%)
  • Covered tokens: 2,276,525 (46.7%)
  • Missing tokens: 2,596,324 (53.3%)
  • Most frequent missing forms

vi

  • Forms in Wikidata: 47
  • Forms in Wikipedia: 60,377
  • Tokens: 75,656,151
  • Covered forms: 30 (0.0%)
  • Missing forms: 60,347 (100.0%)
  • Covered tokens: 3,181,916 (4.2%)
  • Missing tokens: 72,474,235 (95.8%)
  • Most frequent missing forms