Wikidata:Lexicographical coverage

This page presents how well the Wikidata lexicographical data covers the word forms found in the Wikipedia corpus of each language.

The scripts that calculate these statistics are in a PAWS notebook. Improvements are very welcome.
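
As a rough illustration of how the per-language figures below relate to each other, here is a minimal sketch in Python. It is not the code from the PAWS notebook: the function name `coverage`, the returned dictionary keys, and the assumption that the corpus is already split into tokens are all hypothetical, and the real scripts may tokenize and normalize text differently. The sketch counts distinct forms and total tokens in a corpus, then checks each form against a set of forms known to Wikidata.

```python
from collections import Counter


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal (0.0 if whole is 0)."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def coverage(corpus_tokens, wikidata_forms, top_n=3):
    """Hypothetical helper: compare corpus forms against a set of Wikidata forms.

    corpus_tokens  -- iterable of already-tokenized word forms (assumption)
    wikidata_forms -- set of form strings taken from Wikidata lexemes (assumption)
    """
    counts = Counter(corpus_tokens)            # form -> number of occurrences
    total_forms = len(counts)                  # distinct forms in the corpus
    total_tokens = sum(counts.values())        # all occurrences

    # Forms that occur in the corpus and are also known to Wikidata.
    covered = {form: n for form, n in counts.items() if form in wikidata_forms}
    covered_forms = len(covered)
    covered_tokens = sum(covered.values())

    # Everything else is missing; keep counts so the list can be ranked.
    missing = Counter({form: n for form, n in counts.items()
                       if form not in wikidata_forms})

    return {
        "forms_in_wikidata": len(wikidata_forms),
        "forms_in_corpus": total_forms,
        "tokens": total_tokens,
        "covered_forms": (covered_forms, pct(covered_forms, total_forms)),
        "missing_forms": (total_forms - covered_forms,
                          pct(total_forms - covered_forms, total_forms)),
        "covered_tokens": (covered_tokens, pct(covered_tokens, total_tokens)),
        "missing_tokens": (total_tokens - covered_tokens,
                           pct(total_tokens - covered_tokens, total_tokens)),
        "most_frequent_missing": [form for form, _ in missing.most_common(top_n)],
    }


# Toy example: 8 tokens, 5 distinct forms, 2 of which are "in Wikidata".
stats = coverage("the cat saw the dog and the cat".split(), {"the", "cat", "bird"})
print(stats["covered_forms"], stats["covered_tokens"])
```

Under these assumptions, covered and missing forms always sum to the number of forms in the corpus, and covered and missing tokens sum to the token total. Because a few very frequent forms account for many tokens, token coverage can be far higher than form coverage, as several of the languages below show.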

More information on this prototype tool: m:Abstract Wikipedia/Updates/2021-02-10#Corpus coverage dashboard

See also Wikidata:Lexicographical data/Statistics.

ar

  • Forms in Wikidata: 274
  • Forms in Wikipedia: 246,598
  • Tokens: 69,840,956
  • Covered forms: 56 (0.0%)
  • Missing forms: 246,542 (100.0%)
  • Covered tokens: 265,073 (0.4%)
  • Missing tokens: 69,575,883 (99.6%)
  • Most frequent missing forms

bg

  • Forms in Wikidata: 170
  • Forms in Wikipedia: 118,514
  • Tokens: 33,132,887
  • Covered forms: 155 (0.1%)
  • Missing forms: 118,359 (99.9%)
  • Covered tokens: 350,276 (1.1%)
  • Missing tokens: 32,782,611 (98.9%)
  • Most frequent missing forms

br

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 1,098
  • Forms in Wikipedia: 9,552
  • Tokens: 1,459,030
  • Covered forms: 417 (4.4%)
  • Missing forms: 9,135 (95.6%)
  • Covered tokens: 617,324 (42.3%)
  • Missing tokens: 841,706 (57.7%)
  • Most frequent missing forms

bn

(This analysis was performed separately from all the others on this page, using the corpus linked here and custom counting code.)

  • Forms in Wikidata: 43,276
  • Forms in Wikipedia: 534,894
  • Tokens: 13,306,025
  • Covered forms: 12,603 (2.36%)
  • Missing forms: 522,291 (97.64%)
  • Covered tokens: 3,786,986 (28.46%)
  • Missing tokens: 9,519,039 (71.54%)
  • Most frequent missing forms

ca

  • Forms in Wikidata: 116
  • Forms in Wikipedia: 176,311
  • Tokens: 108,297,498
  • Covered forms: 91 (0.1%)
  • Missing forms: 176,220 (99.9%)
  • Covered tokens: 11,022,513 (10.2%)
  • Missing tokens: 97,274,985 (89.8%)
  • Most frequent missing forms

cs

  • Forms in Wikidata: 122,145
  • Forms in Wikipedia: 261,374
  • Tokens: 74,084,890
  • Covered forms: 32,443 (12.4%)
  • Missing forms: 228,931 (87.6%)
  • Covered tokens: 43,377,524 (58.6%)
  • Missing tokens: 30,707,366 (41.4%)
  • Most frequent missing forms

cy

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 80
  • Forms in Wikipedia: 10,844
  • Tokens: 1,442,683
  • Covered forms: 49 (0.5%)
  • Missing forms: 10,795 (99.5%)
  • Covered tokens: 16,393 (1.1%)
  • Missing tokens: 1,426,290 (98.9%)
  • Most frequent missing forms

da

  • Forms in Wikidata: 40,184
  • Forms in Wikipedia: 111,139
  • Tokens: 30,879,404
  • Covered forms: 20,394 (18.3%)
  • Missing forms: 90,745 (81.7%)
  • Covered tokens: 25,041,218 (81.1%)
  • Missing tokens: 5,838,186 (18.9%)
  • Most frequent missing forms

de

  • Forms in Wikidata: 63,299
  • Forms in Wikipedia: 1,008,036
  • Tokens: 596,433,479
  • Covered forms: 32,527 (3.2%)
  • Missing forms: 975,509 (96.8%)
  • Covered tokens: 438,105,672 (73.5%)
  • Missing tokens: 158,327,807 (26.5%)
  • Most frequent missing forms

el

  • Forms in Wikidata: 14
  • Forms in Wikipedia: 129,276
  • Tokens: 40,452,744
  • Covered forms: 13 (0.0%)
  • Missing forms: 129,263 (100.0%)
  • Covered tokens: 6,716 (0.0%)
  • Missing tokens: 40,446,028 (100.0%)
  • Most frequent missing forms

en

  • Forms in Wikidata: 106,983
  • Forms in Wikipedia: 965,225
  • Tokens: 1,508,248,447
  • Covered forms: 74,703 (7.7%)
  • Missing forms: 890,522 (92.3%)
  • Covered tokens: 1,399,211,834 (92.8%)
  • Missing tokens: 109,036,613 (7.2%)
  • Most frequent missing forms

eo

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 3,238
  • Forms in Wikipedia: 27,201
  • Tokens: 4,222,541
  • Covered forms: 1,203 (4.4%)
  • Missing forms: 25,998 (95.6%)
  • Covered tokens: 2,117,658 (50.2%)
  • Missing tokens: 2,104,883 (49.8%)
  • Most frequent missing forms

es

  • Forms in Wikidata: 4,651
  • Forms in Wikipedia: 372,589
  • Tokens: 405,914,020
  • Covered forms: 3,847 (1.0%)
  • Missing forms: 368,742 (99.0%)
  • Covered tokens: 194,810,287 (48.0%)
  • Missing tokens: 211,103,733 (52.0%)
  • Most frequent missing forms

et

  • Forms in Wikidata: 2,728,486
  • Forms in Wikipedia: 123,073
  • Tokens: 16,832,892
  • Covered forms: 72,698 (59.1%)
  • Missing forms: 50,375 (40.9%)
  • Covered tokens: 13,667,980 (81.2%)
  • Missing tokens: 3,164,912 (18.8%)
  • Most frequent missing forms

eu

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 1,001,970
  • Forms in Wikipedia: 26,466
  • Tokens: 3,138,442
  • Covered forms: 16,138 (61.0%)
  • Missing forms: 10,328 (39.0%)
  • Covered tokens: 2,339,560 (74.5%)
  • Missing tokens: 798,882 (25.5%)
  • Most frequent missing forms

fa

  • Forms in Wikidata: 45
  • Forms in Wikipedia: 100,251
  • Tokens: 44,426,012
  • Covered forms: 28 (0.0%)
  • Missing forms: 100,223 (100.0%)
  • Covered tokens: 113,535 (0.3%)
  • Missing tokens: 44,312,477 (99.7%)
  • Most frequent missing forms

fi

  • Forms in Wikidata: 8,217
  • Forms in Wikipedia: 276,898
  • Tokens: 46,847,582
  • Covered forms: 4,890 (1.8%)
  • Missing forms: 272,008 (98.2%)
  • Covered tokens: 11,528,995 (24.6%)
  • Missing tokens: 35,318,587 (75.4%)
  • Most frequent missing forms

fr

  • Forms in Wikidata: 18,219
  • Forms in Wikipedia: 465,138
  • Tokens: 474,988,250
  • Covered forms: 16,013 (3.4%)
  • Missing forms: 449,125 (96.6%)
  • Covered tokens: 356,240,742 (75.0%)
  • Missing tokens: 118,747,508 (25.0%)
  • Most frequent missing forms

ha

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 7
  • Forms in Wikipedia: 4,816
  • Tokens: 859,259
  • Covered forms: 5 (0.1%)
  • Missing forms: 4,811 (99.9%)
  • Covered tokens: 1,088 (0.1%)
  • Missing tokens: 858,171 (99.9%)
  • Most frequent missing forms

he

  • Forms in Wikidata: 348,296
  • Forms in Wikipedia: 249,890
  • Tokens: 76,643,376
  • Covered forms: 54,538 (21.8%)
  • Missing forms: 195,352 (78.2%)
  • Covered tokens: 41,683,644 (54.4%)
  • Missing tokens: 34,959,732 (45.6%)
  • Most frequent missing forms

hi

  • Forms in Wikidata: 345
  • Forms in Wikipedia: 54,631
  • Tokens: 18,693,672
  • Covered forms: 116 (0.2%)
  • Missing forms: 54,515 (99.8%)
  • Covered tokens: 2,830,494 (15.1%)
  • Missing tokens: 15,863,178 (84.9%)
  • Most frequent missing forms

hr

  • Forms in Wikidata: 55
  • Forms in Wikipedia: 135,627
  • Tokens: 28,543,040
  • Covered forms: 51 (0.0%)
  • Missing forms: 135,576 (100.0%)
  • Covered tokens: 1,318,994 (4.6%)
  • Missing tokens: 27,224,046 (95.4%)
  • Most frequent missing forms

hu

  • Forms in Wikidata: 131
  • Forms in Wikipedia: 274,652
  • Tokens: 64,674,851
  • Covered forms: 77 (0.0%)
  • Missing forms: 274,575 (100.0%)
  • Covered tokens: 117,390 (0.2%)
  • Missing tokens: 64,557,461 (99.8%)
  • Most frequent missing forms

id

  • Forms in Wikidata: 81
  • Forms in Wikipedia: 100,137
  • Tokens: 40,049,055
  • Covered forms: 45 (0.0%)
  • Missing forms: 100,092 (100.0%)
  • Covered tokens: 222,981 (0.6%)
  • Missing tokens: 39,826,074 (99.4%)
  • Most frequent missing forms

ig

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 0
  • Forms in Wikipedia: 1,153
  • Tokens: 113,878
  • Covered forms: 0 (0.0%)
  • Missing forms: 1,153 (100.0%)
  • Covered tokens: 0 (0.0%)
  • Missing tokens: 113,878 (100.0%)
  • Most frequent missing forms

it

  • Forms in Wikidata: 655
  • Forms in Wikipedia: 341,080
  • Tokens: 284,500,580
  • Covered forms: 523 (0.2%)
  • Missing forms: 340,557 (99.8%)
  • Covered tokens: 101,821,968 (35.8%)
  • Missing tokens: 182,678,612 (64.2%)
  • Most frequent missing forms

ko

  • Forms in Wikidata: 152
  • Forms in Wikipedia: 290,844
  • Tokens: 34,282,183
  • Covered forms: 113 (0.0%)
  • Missing forms: 290,731 (100.0%)
  • Covered tokens: 1,330,113 (3.9%)
  • Missing tokens: 32,952,070 (96.1%)
  • Most frequent missing forms

la

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 778,173
  • Forms in Wikipedia: 11,551
  • Tokens: 1,031,544
  • Covered forms: 8,303 (71.9%)
  • Missing forms: 3,248 (28.1%)
  • Covered tokens: 884,399 (85.7%)
  • Missing tokens: 147,145 (14.3%)
  • Most frequent missing forms

lb

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 863
  • Forms in Wikipedia: 10,240
  • Tokens: 1,365,293
  • Covered forms: 360 (3.5%)
  • Missing forms: 9,880 (96.5%)
  • Covered tokens: 486,090 (35.6%)
  • Missing tokens: 879,203 (64.4%)
  • Most frequent missing forms

lt

  • Forms in Wikidata: 34
  • Forms in Wikipedia: 92,063
  • Tokens: 13,288,668
  • Covered forms: 13 (0.0%)
  • Missing forms: 92,050 (100.0%)
  • Covered tokens: 16,458 (0.1%)
  • Missing tokens: 13,272,210 (99.9%)
  • Most frequent missing forms

lv

  • Forms in Wikidata: 586
  • Forms in Wikipedia: 60,189
  • Tokens: 8,004,635
  • Covered forms: 370 (0.6%)
  • Missing forms: 59,819 (99.4%)
  • Covered tokens: 1,076,761 (13.5%)
  • Missing tokens: 6,927,874 (86.5%)
  • Most frequent missing forms

ml

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 170,590
  • Forms in Wikipedia: 28,789
  • Tokens: 1,992,352
  • Covered forms: 5,119 (17.8%)
  • Missing forms: 23,670 (82.2%)
  • Covered tokens: 580,767 (29.1%)
  • Missing tokens: 1,411,585 (70.9%)
  • Most frequent missing forms

ms

  • Forms in Wikidata: 2,399
  • Forms in Wikipedia: 51,515
  • Tokens: 16,143,659
  • Covered forms: 1,158 (2.2%)
  • Missing forms: 50,357 (97.8%)
  • Covered tokens: 9,297,531 (57.6%)
  • Missing tokens: 6,846,128 (42.4%)
  • Most frequent missing forms

mt

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 397
  • Forms in Wikipedia: 5,941
  • Tokens: 371,515
  • Covered forms: 259 (4.4%)
  • Missing forms: 5,682 (95.6%)
  • Covered tokens: 123,548 (33.3%)
  • Missing tokens: 247,967 (66.7%)
  • Most frequent missing forms

nan

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 79
  • Forms in Wikipedia: 5,791
  • Tokens: 1,119,898
  • Covered forms: 3 (0.1%)
  • Missing forms: 5,788 (99.9%)
  • Covered tokens: 179,094 (16.0%)
  • Missing tokens: 940,804 (84.0%)
  • Most frequent missing forms

nb

  • Forms in Wikidata: 28,485
  • Forms in Wikipedia: 153,555
  • Tokens: 49,620,256
  • Covered forms: 13,557 (8.8%)
  • Missing forms: 139,998 (91.2%)
  • Covered tokens: 39,126,764 (78.9%)
  • Missing tokens: 10,493,492 (21.1%)
  • Most frequent missing forms

nl

  • Forms in Wikidata: 190
  • Forms in Wikipedia: 260,266
  • Tokens: 130,343,371
  • Covered forms: 151 (0.1%)
  • Missing forms: 260,115 (99.9%)
  • Covered tokens: 26,106,915 (20.0%)
  • Missing tokens: 104,236,456 (80.0%)
  • Most frequent missing forms

nn

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 238
  • Forms in Wikipedia: 23,956
  • Tokens: 4,198,152
  • Covered forms: 116 (0.5%)
  • Missing forms: 23,840 (99.5%)
  • Covered tokens: 511,133 (12.2%)
  • Missing tokens: 3,687,019 (87.8%)
  • Most frequent missing forms

pl

  • Forms in Wikidata: 15,231
  • Forms in Wikipedia: 333,225
  • Tokens: 117,356,732
  • Covered forms: 5,528 (1.7%)
  • Missing forms: 327,697 (98.3%)
  • Covered tokens: 37,736,974 (32.2%)
  • Missing tokens: 79,619,758 (67.8%)
  • Most frequent missing forms

pt

  • Forms in Wikidata: 11,564
  • Forms in Wikipedia: 214,847
  • Tokens: 158,056,230
  • Covered forms: 6,338 (3.0%)
  • Missing forms: 208,509 (97.0%)
  • Covered tokens: 90,762,356 (57.4%)
  • Missing tokens: 67,293,874 (42.6%)
  • Most frequent missing forms

ro

  • Forms in Wikidata: 31
  • Forms in Wikipedia: 119,245
  • Tokens: 40,889,103
  • Covered forms: 27 (0.0%)
  • Missing forms: 119,218 (100.0%)
  • Covered tokens: 341,692 (0.8%)
  • Missing tokens: 40,547,411 (99.2%)
  • Most frequent missing forms

ru

  • Forms in Wikidata: 910,202
  • Forms in Wikipedia: 651,825
  • Tokens: 290,067,562
  • Covered forms: 138,184 (21.2%)
  • Missing forms: 513,641 (78.8%)
  • Covered tokens: 160,336,226 (55.3%)
  • Missing tokens: 129,731,336 (44.7%)
  • Most frequent missing forms

se

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 211
  • Forms in Wikipedia: 1,705
  • Tokens: 95,453
  • Covered forms: 27 (1.6%)
  • Missing forms: 1,678 (98.4%)
  • Covered tokens: 2,318 (2.4%)
  • Missing tokens: 93,135 (97.6%)
  • Most frequent missing forms

sk

  • Forms in Wikidata: 125,962
  • Forms in Wikipedia: 109,573
  • Tokens: 18,366,700
  • Covered forms: 45,475 (41.5%)
  • Missing forms: 64,098 (58.5%)
  • Covered tokens: 12,421,405 (67.6%)
  • Missing tokens: 5,945,295 (32.4%)
  • Most frequent missing forms

sl

  • Forms in Wikidata: 2
  • Forms in Wikipedia: 106,577
  • Tokens: 19,924,659
  • Covered forms: 2 (0.0%)
  • Missing forms: 106,575 (100.0%)
  • Covered tokens: 2,148 (0.0%)
  • Missing tokens: 19,922,511 (100.0%)
  • Most frequent missing forms

sr

  • Forms in Wikidata: 27
  • Forms in Wikipedia: 183,777
  • Tokens: 42,439,136
  • Covered forms: 17 (0.0%)
  • Missing forms: 183,760 (100.0%)
  • Covered tokens: 71,064 (0.2%)
  • Missing tokens: 42,368,072 (99.8%)
  • Most frequent missing forms

sv

  • Forms in Wikidata: 191,804
  • Forms in Wikipedia: 219,718
  • Tokens: 72,173,155
  • Covered forms: 52,077 (23.7%)
  • Missing forms: 167,641 (76.3%)
  • Covered tokens: 62,324,164 (86.4%)
  • Missing tokens: 9,848,991 (13.6%)
  • Most frequent missing forms

ta

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 1,914
  • Forms in Wikipedia: 31,721
  • Tokens: 2,539,025
  • Covered forms: 315 (1.0%)
  • Missing forms: 31,406 (99.0%)
  • Covered tokens: 104,354 (4.1%)
  • Missing tokens: 2,434,671 (95.9%)
  • Most frequent missing forms

tg

These statistics use corpus data from the Leipzig Corpora Collection.

  • Forms in Wikidata: 520
  • Forms in Wikipedia: 9,793
  • Tokens: 1,252,518
  • Covered forms: 135 (1.4%)
  • Missing forms: 9,658 (98.6%)
  • Covered tokens: 79,365 (6.3%)
  • Missing tokens: 1,173,153 (93.7%)
  • Most frequent missing forms

th

  • Forms in Wikidata: 9
  • Forms in Wikipedia: 27,089
  • Tokens: 2,068,858
  • Covered forms: 8 (0.0%)
  • Missing forms: 27,081 (100.0%)
  • Covered tokens: 4,204 (0.2%)
  • Missing tokens: 2,064,654 (99.8%)
  • Most frequent missing forms

tl

  • Forms in Wikidata: 1
  • Forms in Wikipedia: 20,893
  • Tokens: 3,583,109
  • Covered forms: 1 (0.0%)
  • Missing forms: 20,892 (100.0%)
  • Covered tokens: 1,100 (0.0%)
  • Missing tokens: 3,582,009 (100.0%)
  • Most frequent missing forms

tr

  • Forms in Wikidata: 36
  • Forms in Wikipedia: 151,341
  • Tokens: 30,211,406
  • Covered forms: 32 (0.0%)
  • Missing forms: 151,309 (100.0%)
  • Covered tokens: 142,166 (0.5%)
  • Missing tokens: 30,069,240 (99.5%)
  • Most frequent missing forms

uk

  • Forms in Wikidata: 759
  • Forms in Wikipedia: 356,409
  • Tokens: 114,386,141
  • Covered forms: 526 (0.1%)
  • Missing forms: 355,883 (99.9%)
  • Covered tokens: 6,577,831 (5.8%)
  • Missing tokens: 107,808,310 (94.2%)
  • Most frequent missing forms

vi

  • Forms in Wikidata: 8
  • Forms in Wikipedia: 60,377
  • Tokens: 75,656,151
  • Covered forms: 5 (0.0%)
  • Missing forms: 60,372 (100.0%)
  • Covered tokens: 46,662 (0.1%)
  • Missing tokens: 75,609,489 (99.9%)
  • Most frequent missing forms