Wikidata:Lexicographical data/Ideas of queries

This page lists ideas for maintenance queries and other interesting queries that could be built with lexicographical data on Wikidata.

Consider adding your signature to queries you insert to help users contact the author with questions, etc.

Maintenance and repairing

all lexemes where the language is not a language (or similar)
- query
all lexemes where the language of the lexeme is not the same as the language of the lemma
- query
empty lexemes (no statements, no forms, no senses) in all languages
- SPARQL query
- SQL query on Quarry
for all lexemes find every lexeme without any form (by language)

for all lexemes find every lexeme without any sense (by language)

for all lexemes find senses without item for this sense (P5137)
- query for all languages
- query for English
- query for Hindustani
for all languages with has grammatical gender (P5109) feature find all nouns without grammatical gender (P5185) specified
- query for Bulgarian
- query for Hindustani
- query for Sanskrit
for all verbs and nouns find lexemes with forms with empty grammatical features
- query for English nouns
for all lexemes find those with IPA transcription (P898) and pronunciation audio (P443) specified at the level of the lexeme instead of the level of the form (pronunciation is a feature of the form)
- query for IPA
- query for audio
for all lexemes find those with image (P18) specified at the level of lexeme instead of level of sense
- query
find all lexemes where lemma does not occur in form representations
- query for English
for all usages of hyphenation (P5279) unify separator from "|" or "-" or "•" or anything other to "‧" (U+2027 HYPHENATION POINT)
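The separator cleanup described above could be sketched in Python as follows, assuming the hyphenation (P5279) values have already been exported as plain strings. The function name `normalize_hyphenation` and the separator set are illustrative, not part of any existing tool. A plain "-" can also be a genuine hyphen in the word itself, so the output would need review before any edit is made.

```python
import re

HYPHENATION_POINT = "\u2027"  # ‧ HYPHENATION POINT

# Separators named in the idea above; extend the set as other variants turn up.
_SEPARATORS = re.compile(r"[|\-•]")

def normalize_hyphenation(value: str) -> str:
    """Replace '|', '-' and '•' with U+2027; leave everything else unchanged."""
    return _SEPARATORS.sub(HYPHENATION_POINT, value)

print(normalize_hyphenation("hy|phen-a•tion"))
```

Values that already use U+2027 pass through unchanged, so the function can be run repeatedly over the same data.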
find all cases where audio (P51) can be replaced by pronunciation audio (P443)
for all languages with has conjugation class (P5206) find verbs without conjugation class (P5186) specified
find all usages of item for this sense (P5137) outside senses
- query for lexemes
- query for forms
find all usages of usage example (P5831) without subject form (P5830) or subject sense (P6072)
- query ordered by lang --So9q (talk) 14:13, 27 November 2019 (UTC)
find all lexemes with usage example (P5831) where subject form (P5830) points to form of different lexeme
find all usages of IPA transcription (P898) and Slavistic Phonetic Alphabet transcription (P5276) without phonemic (with //) or phonetic (with []) markup
- Please clarify further. This query shows all lexemes with IPA not containing "//" --So9q (talk) 14:06, 27 November 2019 (UTC)
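One possible reading of the markup check, as a small Python sketch: a transcription counts as marked up only if the whole value is wrapped in /.../ (phonemic) or [...] (phonetic). The function name `has_transcription_markup` is hypothetical, and the rule itself is an assumption about what the idea intends.

```python
def has_transcription_markup(value: str) -> bool:
    """True if the value is wrapped in /.../ (phonemic) or [...] (phonetic)."""
    v = value.strip()
    if len(v) < 3:  # "//" or "[]" alone carries no transcription
        return False
    return (v[0] == "/" and v[-1] == "/") or (v[0] == "[" and v[-1] == "]")

samples = ["/ˈwɔːtə/", "[ˈwɔːɾɚ]", "ˈwɔːtə", "/ˈwɔːtə"]
print([s for s in samples if not has_transcription_markup(s)])
```

This is stricter than testing whether the string merely contains "//", which is what the comment above describes.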
list lexemes that could use lexemes for parts of them
find all usages of derived from lexeme (P5191) within single language and check if they can be replaced by suffixation with combines lexemes (P5238)
for all lexemes find those with creates lexeme type (P5923) specified at the level of lexeme instead of level of sense
- query with optional filter --So9q (talk) 14:13, 27 November 2019 (UTC)
Find all lexemes with the same lemma, language and lexical category (candidates for merging or other action)
- query to do this for all languages and categories
- query for Danish
find all parts of speech used in particular language (with usage statistics)
sorted by usage: all lexical categories, languages, grammatical features, lemma language codes and representation language codes
find lexemes with sense definitions that contain "heraldric" (these exist because of a bug in the query of MachtSinn)
- English, Swedish
find lexemes with forms that are missing grammatical features (can be improved via Wikidata Lexeme Forms by dragging and dropping the link of the form into the right input box):
- Swedish
find lexemes with senses that refer to heraldry (should be deleted)
- all instance or subclass of blazonry
find all lexemes with the same spelling
- Swedish
Lexemes with described by source (P1343) on the lexeme instead of on the forms and senses that the source describes
- Swedish total count
forms without pronunciation audio (P443) (affixes excluded)
- Danish, English
Wikidata:Report on potential duplicate Lexemes
Potential duplicate lexemes with the same lemma spelling, language and lexical category
- query for English
- query for Russian (without marked homographs)
Homographs marked with homograph lexeme (P5402) without reverse homograph lexeme (P5402):
- from any part of speech (short URL: https://w.wiki/fS4)
- within the same part of speech (short URL: https://w.wiki/gbJ)
Lexemes with a sense that has more than one P5137 value
- https://query.wikidata.org/#SELECT%20distinct%20%3Fl%20%3Flemma%20%3Fsense%20%28COUNT%28%3Fvalue%29%20AS%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fl%20a%20ontolex%3ALexicalEntry%20%3B%0A%20%20%20%20%20%20%20wikibase%3Alemma%20%3Flemma%3B%0A%20%20%20%20%20%20%20ontolex%3Asense%20%3Fsense%20.%0A%20%20%20%20%3Fsense%20wdt%3AP5137%20%3Fvalue.%0A%20%20%23filter%28%3Fcount%3E1%29%0A%7D%0Agroup%20by%20%3Fl%20%3Flemma%20%3Fsense%20%3Fcount%0Aorder%20by%20desc%28%3Fcount%29%0Alimit%2050
Lexemes with unreferenced derived from lexeme (P5191) statements
- Punjabi
- Hindustani

Statistics

For more lexeme statistics, see Wikidata:Lexicographical_data/Statistics

number of forms missing grammatical features
- total (times out), English (times out), Swedish
number of usage examples that demonstrate both a form and a sense
- ordered by language --So9q (talk) 18:48, 8 January 2021 (UTC)

properties used with senses (with counts): Listeria list on Wikidata, query on WDQS
qualifiers used with senses, with their main properties: Listeria list on Wikidata, query on WDQS

LinguaLibre audio recording

You can use a Wikidata query URL that returns form IDs in the External Tools option in LinguaLibre; User:Lingua Libre Bot will then automatically add the pronunciation audio file to the respective form of the lexeme entry. The query should return both ?id and ?label, representing the form ID and the form label respectively. You have to copy and paste the URL of the query (from the "</> Code" button or your address bar) in the given form. Example queries:

All forms of a language missing pronunciation audio: query (replace with your language Q-id)
All lemma forms missing pronunciation: query (replace with your language Q-id)
Latest lexeme forms by a user: query (Replace with your own username)
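As a rough illustration of the first example, the Python sketch below builds a query that returns ?id and ?label for forms without pronunciation audio (P443) and turns it into a query service URL. The helper names are hypothetical, Q8752 (Basque) is only a sample language, and whether Lingua Libre expects ?id as a bare form ID (as produced here) or as a full entity URI is an assumption that should be checked against the linked example queries.

```python
from urllib.parse import quote

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def missing_audio_query(language_qid: str, limit: int = 100) -> str:
    """SPARQL returning ?id (form ID) and ?label for forms lacking P443."""
    return f"""SELECT ?id ?label WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?label .
  FILTER NOT EXISTS {{ ?form wdt:P443 ?audio }}
  BIND(STRAFTER(STR(?form), "http://www.wikidata.org/entity/") AS ?id)
}}
LIMIT {limit}"""

def query_url(sparql: str) -> str:
    """Percent-encode the query and append it to the SPARQL endpoint."""
    return f"{WDQS_ENDPOINT}?query={quote(sparql, safe='')}"

url = query_url(missing_audio_query("Q8752"))  # replace with your language Q-id
print(url)
```

The query relies on the prefixes that the Wikidata Query Service predefines (wd:, wdt:, dct:, ontolex:), so it carries no PREFIX declarations of its own.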

Language-independent

All nouns in a given language:
- Nouns in Basque
Length of the words of a language
- Longest words of a language
- Shortest words of a language
  - query for English verbs
Average number of grammatical features per language (probably very expensive)
Something about word endings (inspiration)
Where the word for “tea” is derived from “te” or from “cha” (inspiration)
- Works for other words, e.g. beer
ISO standard (Q15087423)(s) or other standards/texts which define a term
number of phoneme (Q8183)
- Words with the highest number of phoneme (Q8183) / of distinct phoneme (Q8183)
- Words with the lowest number of phoneme (Q8183) / of distinct phoneme (Q8183)
- Average number of phoneme (Q8183) / of distinct phoneme (Q8183)
Number of syllable (Q8188)
- Words with the highest number of syllable (Q8188)
- Words with the lowest number of syllable (Q8188)
- Average number of syllable (Q8188)
The above queries for forms that lack IPA
Histogram of words by the number of antonym (Q131779)
Histogram of words by the number of synonym (Q42106)
Words used the most number of times in compound (Q245423)
Words used the most number of times in multiword expression (Q6935164)
Words that rhyme with a given word
Words that have the most/least other words rhyming with them in a given language
- Words with no rhymes in a given language
Words that are the same in the most languages
Words that are the same in two languages, with the greatest geographical separation
The words with the most meanings, in different languages
Shortest words that are unique to a single language
Longest words that are shared between two or more languages
Map of the most masculine/feminine country (scale from blue to red based on counting gender for each country)
List of false friend (Q202961) (similar words in two languages but with different meanings)
- Using the false friend (P5976) property
List of false cognate (Q2285656) (similar words in two languages with same meaning but different etymologies)
Words containing all five vowels (a, e, i, o, u) in a given language (bonus for finding the shortest ones): murciélago@es or oiseau@fr (guerroyai@fr if y is also counted)
- in French (without y) and in French (with y)
- search in Basque language including all grammatical forms
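Once lemmas or forms have been exported, the five-vowel test can be sketched offline in Python. The helper names are illustrative, and stripping diacritics (so that é counts as e) is an assumption about how accented vowels should be treated.

```python
import unicodedata

VOWELS = set("aeiou")

def strip_diacritics(word: str) -> str:
    """Decompose accented letters and drop the combining marks (é -> e)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def has_all_vowels(word: str, extra: str = "") -> bool:
    """True if the word contains a, e, i, o, u and every letter in `extra`."""
    return (VOWELS | set(extra)) <= set(strip_diacritics(word))

words = ["murciélago", "oiseau", "guerroyai", "maison"]
# Shortest matches first, for the "bonus" part of the idea.
print(sorted((w for w in words if has_all_vowels(w)), key=len))
```

Passing `extra="y"` gives the French variant that also requires y.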
finding lipogram (Q836165) in a given language
frequency of a letter in a given language (letter frequency (Q520562), etaoin shrdlu (Q670443) for English), could be displayed as a histogram
- number of words by first letter in English
- frequency of a sequence of letters
- the shortest non-existent sequence ('cbi' is very rare, only three entries in French Wiktionary)
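The counting behind a letter-frequency histogram is simple once a word list is available. A minimal Python sketch, assuming the lemmas have been downloaded as plain strings (both function names are made up for this example):

```python
from collections import Counter

def letter_frequencies(words):
    """Relative frequency of each letter over all given words, most common first."""
    counts = Counter(ch for w in words for ch in w.lower() if ch.isalpha())
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.most_common()}

def first_letter_counts(words):
    """Number of words starting with each letter."""
    return Counter(w[0].lower() for w in words if w)

sample = ["tea", "Tin", "eat"]
print(letter_frequencies(sample))
print(first_letter_counts(sample))
```

Counting over lemmas measures frequency in the dictionary, not in running text, so the ranking may differ from the familiar "etaoin shrdlu" order.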
Generating crosswords (and, probably too expensive, generating the biggest crossword without black squares; File:MC-record-9x8-JCM.svg is the biggest known so far in French)
- There is some software for crossword generation: for example. The idea would be adding a random set of lexemes with senses. There are also web services where you can upload csv files: example
- We can also get random words with senses to populate that crossword: query
Histogram of etymon (Q992080) by the number of derived words
Word (in each language and totally) with the longest chain of loan history
Words which are etymon (Q992080) of the most number of language (Q34770)
Percentage of words in a given language (Q34770) which were derived from other language (Q34770) (grouped by language (Q34770)) (example)
Words added to a dictionary for the first time for a given year (example source)
Words with the highest number of word sense (Q1570700)
Histogram of words in a language (Q34770) by the number of part of speech (Q82042) (lexical classes) the word sense (Q1570700) belong to
Biggest Levenshtein distance (Q496939) between two inflected forms of a word (is it calculable? my money is on the very irregular and suppletive verb "aller@fr", see fr:wikt:Annexe:Conjugaison en français/aller)
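For a fixed list of forms the distance is straightforward to compute outside the query service. The Python sketch below uses the standard dynamic-programming edit distance and compares every pair of forms of one lexeme; the form list is a hand-picked sample, not an export from Wikidata, and `most_distant_forms` is a hypothetical helper.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def most_distant_forms(forms):
    """Return (distance, form_a, form_b) for the most dissimilar pair."""
    return max((levenshtein(a, b), a, b) for a, b in combinations(forms, 2))

forms = ["aller", "vais", "allons", "irai", "va"]  # a few forms of French "aller"
print(most_distant_forms(forms))
```

Comparing all pairs is quadratic in the number of forms per lexeme, which is manageable for one lexeme but may be costly across a whole language.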
query normalized IPA
Words with senses with images (image (P18) sampled) see also this report
- in English
- in Breton
Finding anagrams
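Anagrams can be found by grouping words under their sorted letters. A minimal Python sketch over an exported word list (the function name is illustrative):

```python
from collections import defaultdict

def anagram_groups(words):
    """Group words that use exactly the same letters; drop singletons."""
    groups = defaultdict(set)
    for word in words:
        # Words with identical sorted letters are anagrams of each other.
        groups["".join(sorted(word.lower()))].add(word)
    return [sorted(group) for group in groups.values() if len(group) > 1]

print(anagram_groups(["listen", "silent", "enlist", "tea", "eat", "dog"]))
```

The sketch treats accented letters as distinct from their base letters; whether that is desirable depends on the language.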
Lexemes and forms with only one (repeated or not) vowel.
- Basque
- Filtered by vowel (a)
All homographs in all languages: [1]
Finding homographic forms of different non-homographic lemmas
- ru
Finding homophones (same IPA, different spelling)
Words having a different gender in different languages
- Animals having a different gender in French and German
Languages by number of "item for this sense" statements
Illustrated word of the day, for children
lexemes with sense but no item for this sense (P5137) by language

Specific languages

Breton

A dynamic map of all the variations and pronunciation of the word "ki" (dog) depending on the geographical zone in Brittany (similar to this)

Chinese

Antonyms which differ only in tone

Dagbani

English

peas. To quote Wiktionary (wikt:en:pease#English, CC BY-SA 3.0): “The original singular was pease, and the plural was peasen. Over the centuries, pease became used as the plural, peasen was dropped, pea was created as a new singular, and finally pease was respelled peas.” I have no clue how this will be modeled, but whatever it is I’m sure there’ll be some interesting queries to write about this and similar words :)
Acronyms and abbreviations deriving from Latin words, like e.g., i.e., cf., etc.
Noun lexemes which do not have a plural form
- this query should work according to my understanding of the specification, but unfortunately it does not
Noun lexemes which have an invalid plural form (see en:English plurals and [2] for rules)
Verb lexemes which do not have a simple form (see [3])
- query
Verb lexemes which do not have a 3rd person singular present tense form (see [4])
- query
Verb lexemes which do not have a present participle and gerund form (see [5])
- query
Regular verb lexemes which do not have a past tense and past participle form (see [6])

French

Out of all feminine French nouns, how many end in "e"?
- query
- Reverse: how many words ending in "e" are feminine?
  - query
Words ending in "-té" are usually feminine (but "un comité"@fr) and words ending in "-age" are usually masculine (but "une page"@fr, although "un page" for the person). What are the exceptions and how many are there? Maybe search for a link with etymology (most words in -té are abstract and come from Latin words in -tas; exceptions probably come from elsewhere)
- French words ending in "-té" with gender (on lemma only)

German

Plurals ending in just “n” (without an “e” before the “n”), like „Triumvirn“.
- query, now annotated with comments, including in its scope all varieties of German, limited to nominative-case forms, and excluding words whose plural forms are not obtained by appending “n” but just happen to end in “n” (like Pokemon)
Verbs ending in just “n”, not “en”
- query
Acronyms that don’t adopt the grammatical gender of the last word in the expanded form.
Word triples with the same pronunciation and different genders, like „der Coup“, „die Kuh“, „das Q“ (title of a book by CUS (Q1024556))
Word quintuples with one vowel exchanged, like „Wart/Wert/Wirt/Wort/Wurt“ (from that same book) or „Zacken/Zecken/Zicken/Zocken/Zucken“ (inspiration) (probably very expensive)
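Outside the query service, the vowel-exchange search need not be expensive: blanking out one vowel at a time and grouping by the resulting pattern takes a single pass over the word list. A Python sketch under that assumption (the function name is hypothetical, and umlauts are deliberately not treated as exchangeable vowels here):

```python
from collections import defaultdict

VOWELS = "aeiou"

def vowel_quintuples(words):
    """Find sets of five words that differ only by a/e/i/o/u in one position."""
    patterns = defaultdict(dict)
    for word in words:
        lower = word.lower()
        for i, ch in enumerate(lower):
            if ch in VOWELS:
                # Key: the word with this one vowel blanked out.
                patterns[lower[:i] + "_" + lower[i + 1:]][ch] = word
    return [
        [by_vowel[v] for v in VOWELS]
        for by_vowel in patterns.values()
        if len(by_vowel) == len(VOWELS)
    ]

print(vowel_quintuples(["Wart", "Wert", "Wirt", "Wort", "Wurt", "Wald"]))
```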

Malayalam

See Wikidata:Lexicographical data/Malayalam/Queries

Russian

Distribution of nouns by declension class, with examples

Spanish

Lexemes without forms
- query
Lexemes with forms without hyphenation
- query
