Wikidata:Lexicographical data/Ideas of queries
This page is a list of maintenance and cool queries that could be built with lexicographical data on Wikidata.
Consider adding your signature to queries you insert to help users contact the author with questions, etc.
Maintenance and repairing
- all lexemes where the language is not a language (or similar)
- all lexemes where the language of the lexeme is not the same as the language of the lemma
- empty lexemes (no statements, no forms, no senses) in all languages
- all lexemes without any form (by language)
- all lexemes without any sense (by language)
- for all lexemes find senses without item for this sense (P5137)
- for all languages with has grammatical gender (P5109) feature find all nouns without grammatical gender (P5185) specified
- for all verbs and nouns find lexemes with forms with empty grammatical features
- for all lexemes find those with IPA transcription (P898) and pronunciation audio (P443) specified at the level of lexeme instead of level of form (pronunciation is feature of form)
- for all lexemes find those with image (P18) specified at the level of lexeme instead of level of sense
- find all lexemes where lemma does not occur in form representations
- for all usages of hyphenation (P5279) unify the separator from "|" or "-" or "•" or anything else to "‧" (U+2027 HYPHENATION POINT)
- find all cases where audio (P51) can be replaced by pronunciation audio (P443)
- for all languages with has conjugation class (P5206) find verbs without conjugation class (P5186) specified
- find all usages of item for this sense (P5137) outside senses
- find all usages of usage example (P5831) without subject lexeme form (P5830) or subject sense (P6072)
- find all lexemes with usage example (P5831) where subject lexeme form (P5830) points to form of different lexeme
- find all usages of IPA transcription (P898) and Slavistic Phonetic Alphabet transcription (P5276) without phonemic (with //) or phonetic (with []) markup
- Please clarify further. This query shows all lexemes with IPA not containing "//"--So9q (talk) 14:06, 27 November 2019 (UTC)
- list lexemes that could use lexemes for parts of them
- find all usages of derived from lexeme (P5191) within single language and check if they can be replaced by suffixation with combines lexemes (P5238)
- for all lexemes find those with creates lexeme type (P5923) specified at the level of lexeme instead of level of sense
- Find all lexemes with the same lemma, language and lexical category (candidates for merging or other action)
- find all parts of speech used in particular language (with usage statistics)
- sorted by usage: all lexical categories, languages, grammatical features, lemmas language codes and representations language codes
- find lexemes with sense definitions that contain "heraldric" (these exist because of a bug in the query of MachtSinn)
- find lexemes with forms that are missing grammatical features (can be improved via Wikidata Lexeme Forms by dragging and dropping the link of the form into the right input box):
- find lexemes with senses that refer to heraldry (should be deleted)
- find all lexemes with the same spelling
- Lexemes with described by source (P1343) at the lexeme level instead of on the forms and senses that the source describes
- forms without pronunciation audio (P443) (affixes excluded)
- Wikidata:Report on potential duplicate Lexemes
- Potential duplicate lexemes with the same spelling, language and lexical category
- Homographs marked with homograph lexeme (P5402) without reverse homograph lexeme (P5402):
- from any part of speech (short URL: https://w.wiki/fS4)
- within the same part of speech (short URL: https://w.wiki/gbJ)
- Lexemes with a sense that has more than one P5137 value
- Lexemes with unreferenced derived from lexeme (P5191) statements
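For the hyphenation (P5279) item in the list above, the clean-up step itself could be sketched in Python roughly as follows. This is only a minimal sketch: the function name `normalize_hyphenation` is hypothetical, the separator set covers just the three characters named in the list item, and it assumes the values have already been fetched by some query.

```python
import re

HYPHENATION_POINT = "\u2027"  # U+2027 HYPHENATION POINT

# Separators named in the list item: "|", "-" and "•" (U+2022).
# Assumption: other variants would be added here as they turn up.
_SEPARATORS = re.compile(r"[|\-\u2022]")


def normalize_hyphenation(value: str) -> str:
    """Return the hyphenation string with known separators replaced by U+2027."""
    return _SEPARATORS.sub(HYPHENATION_POINT, value)


if __name__ == "__main__":
    print(normalize_hyphenation("hy|phen|a|tion"))
```

Words that contain a genuine hyphen would also be rewritten by this sketch, so the results probably need a manual check before any edit is saved.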
Statistics
For more lexeme statistics, see Wikidata:Lexicographical_data/Statistics
- number of forms missing grammatical features
- number of usage examples that demonstrate both a form and a sense
- properties used with senses (with counts) Listeria list on Wikidata — query on WDQS
- qualifiers used with senses, with their main properties Listeria list on Wikidata — query on WDQS
LinguaLibre audio recording
You can use a Wikidata query URL that returns form IDs in the External Tools option in LinguaLibre; User:Lingua Libre Bot will then automatically add the pronunciation audio file to the respective form of the lexeme entry. The query should return both ?id and ?label, representing the form ID and the form label respectively. Copy and paste the URL of the query (from the "</> Code" button or your address bar) into the given form. Example queries:
- All forms of a language missing pronunciation audio: query (replace with your language Q-id)
- All lemma forms missing pronunciation: query (replace with your language Q-id)
- Latest lexeme forms by a user: query (Replace with your own username)
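A query of the first kind could be generated with a small Python helper like the one below. This is a hedged sketch, not the query linked above: the function names are hypothetical, the SPARQL uses the standard Wikidata Query Service prefixes for lexemes, and whether LinguaLibre accepts the form as a full entity URI in ?id is an assumption that should be checked against the linked examples.

```python
import re
from urllib.parse import quote

WDQS_UI = "https://query.wikidata.org/#"


def forms_missing_audio_query(language_qid: str, limit: int = 500) -> str:
    """Build a SPARQL query listing forms in one language without P443."""
    if not re.fullmatch(r"Q[1-9][0-9]*", language_qid):
        raise ValueError(f"not a Q-id: {language_qid!r}")
    return f"""SELECT ?id ?label WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          ontolex:lexicalForm ?id .
  ?id ontolex:representation ?label .
  # keep only forms with no pronunciation audio (P443)
  FILTER NOT EXISTS {{ ?id wdt:P443 ?audio }}
}}
LIMIT {limit}"""


def query_url(sparql: str) -> str:
    """Return the query service address-bar URL for the given query."""
    return WDQS_UI + quote(sparql, safe="")


if __name__ == "__main__":
    # Q150 is French; replace with your language Q-id.
    print(query_url(forms_missing_audio_query("Q150")))
```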
Language-independent
- All nouns in a given language:
- Length of the words of a language
- Longest words of a language
- Shortest words of a language
- Average number of grammatical features per language (probably very expensive)
- Something about word endings (inspiration)
- Where the word for “tea” is derived from “te” or from “cha” (inspiration)
- Works for other words, eg. beer
- ISO standard (Q15087423)(s) or other standards/texts which define a term
- number of phoneme (Q8183)
- Words with the highest number of phoneme (Q8183) / of distinct phoneme (Q8183)
- Words with the lowest number of phoneme (Q8183) / of distinct phoneme (Q8183)
- Average number of phoneme (Q8183) / of distinct phoneme (Q8183)
- Number of syllable (Q8188)
- Words with the highest number of syllable (Q8188)
- Words with the lowest number of syllable (Q8188)
- Average number of syllable (Q8188)
- The above queries for forms that lack IPA
- Histogram of words by the number of antonym (Q131779)
- Histogram of words by the number of synonym (Q42106)
- Words used most often in compound (Q245423)
- Words used most often in multiword expression (Q6935164)
- Words that rhyme with a given word
- Words that have the most/least other words rhyming with them in a given language
- Words with no rhymes in a given language
- Words that are the same in the most languages
- Words that are the same in two languages, with the greatest geographical separation
- The words with the most meanings, in different languages
- Shortest words that are unique to a single language
- Longest words that are shared between two or more languages
- Map of the most masculine/feminine country (scale from blue to red based on counting gender for each country)
- List of false friend (Q202961) (similar words in two languages but with different meanings)
- Using the false friend (P5976) property
- List of false cognate (Q2285656) (similar words in two languages with same meaning but different etymologies)
- Word with five vowels (a, e, i, o, u) in a given language (bonus for finding the shortest ones) : murciélago@es or oiseau@fr (guerroyai@fr with the y)
- finding lipogram (Q836165) in a given language
- frequency of a letter in a given language (letter frequency (Q520562), etaoin shrdlu (Q670443) for English), could be displayed as a histogram
- number of words by first letter in English
- frequency of a sequence of letters
- the shortest non-existent sequence ('cbi' is very rare, only three entries in French Wiktionary)
- Generating crosswords (and probably too expensive: generating the biggest crosswords without black squares, File:MC-record-9x8-JCM.svg this is the biggest known yet in French)
- There is some software for crossword generation: for example. The idea would be adding a random set of lexemes with senses. There are also web services where you can upload csv files: example
- We can also get random words with senses to populate that crossword: query
- Histogram of etymon (Q992080) by the number of derived words
- Word (in each language and overall) with the longest chain of loan history
- Words which are etymon (Q992080) of the most number of language (Q34770)
- Percentage of words in a given language (Q34770) which were derived from other language (Q34770) (grouped by language (Q34770)) (example)
- Words added to a dictionary for the first time for a given year (example source)
- Words with the highest number of word sense (Q1570700)
- Histogram of words in a language (Q34770) by the number of part of speech (Q82042) (lexical classes) the word sense (Q1570700) belong to
- Biggest Levenshtein distance (Q496939) between two inflected forms of a word (is it calculable? my money is on the very irregular and suppletive verb "aller@fr", see fr:wikt:Annexe:Conjugaison en français/aller)
- query normalized IPA
- Words with senses with images (image (P18) sampled) see also this report
- Finding anagrams
- Lexemes and forms with only one (repeated or not) vowel.
- All homographs in all languages: [1]
- Finding homographic forms of different non-homographic lemmas
- Finding homophones (same IPA, different spelling)
- Words having a different gender in different languages
- Languages by number of "item for this sense" statements
- Illustrated word of the day, for children
- lexemes with sense but no item for this sense (P5137) by language
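On the question in the Levenshtein distance item above: the distance is probably easiest to compute on the client side, after a query has returned the form representations of a lexeme. The following Python sketch assumes such a list of forms is already available; the function names are hypothetical and the three forms of "aller" are only sample input.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def most_distant_forms(forms):
    """Return (distance, form_a, form_b) for the most distant pair of forms."""
    if len(forms) < 2:
        raise ValueError("need at least two forms")
    return max(
        ((levenshtein(a, b), a, b) for a, b in combinations(forms, 2)),
        key=lambda item: item[0],
    )


if __name__ == "__main__":
    print(most_distant_forms(["aller", "allons", "vais"]))
```

The pairwise comparison is quadratic in the number of forms, which should be fine per lexeme but would be expensive across a whole language.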
Specific languages
Breton
- A dynamic map of all the variations and pronunciation of the word "ki" (dog) depending on the geographical zone in Brittany (similar to this)
Chinese
- Antonyms which differ only in tone
Dagbani
- Lexeme pairs between English and Dagbani
- Dagbani lexemes that are nouns and end with the letters "aa"
- Longest lexemes in Dagbani
English
- peas. To quote Wiktionary (wikt:en:pease#English, CC BY-SA 3.0): “The original singular was pease, and the plural was peasen. Over the centuries, pease became used as the plural, peasen was dropped, pea was created as a new singular, and finally pease was respelled peas.” I have no clue how this will be modeled, but whatever it is I’m sure there’ll be some interesting queries to write about this and similar words :)
- Acronyms deriving from Latin words, like e. g., i. e., cf., etc.
- Noun lexemes which do not have a plural form
- Noun lexemes which have an invalid plural form (see en:English plurals and [2] for rules)
- Verb lexemes which do not have a simple form (see [3])
- Verb lexemes which do not have a 3rd person singular present tense form (see [4])
- Verb lexemes which do not have a present participle and gerund form (see [5])
- Regular verb lexemes which do not have a past tense and past participle form (see [6])
French
- Out of all feminine French nouns, how many end in an "e"?
- Words ending in "-té" are usually feminine (but "un comité"@fr) and words ending in "-age" are usually masculine (but "une page"@fr, *but* "un page" for the person). What are the exceptions, and how many are there? Maybe search for a link with etymology (most words in -té are abstract and come from a Latin word in -tas; exceptions probably come from elsewhere)
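Once a query has returned (lemma, grammatical gender) pairs for French nouns, both questions above come down to simple counting. The Python sketch below assumes that such pairs are already available; the function names and the tiny `SAMPLE` list are hypothetical and only illustrate the input shape.

```python
import unicodedata


def _nfc(text: str) -> str:
    """Normalize so that "é" compares equal however it was encoded."""
    return unicodedata.normalize("NFC", text)


def share_ending_in(nouns, gender: str, ending: str) -> float:
    """Fraction of nouns of the given gender whose lemma ends in `ending`."""
    lemmas = [_nfc(lemma) for lemma, g in nouns if g == gender]
    if not lemmas:
        return 0.0
    return sum(lemma.endswith(_nfc(ending)) for lemma in lemmas) / len(lemmas)


def gender_exceptions(nouns, ending: str, expected_gender: str):
    """Lemmas with the given ending whose gender is not the expected one."""
    ending = _nfc(ending)
    return sorted({
        _nfc(lemma)
        for lemma, g in nouns
        if _nfc(lemma).endswith(ending) and g != expected_gender
    })


# Illustrative input only; real data would come from a lexeme query.
SAMPLE = [
    ("beauté", "feminine"),
    ("comité", "masculine"),
    ("village", "masculine"),
    ("page", "feminine"),
    ("page", "masculine"),
]

if __name__ == "__main__":
    print(share_ending_in(SAMPLE, "feminine", "e"))
    print(gender_exceptions(SAMPLE, "té", "feminine"))
    print(gender_exceptions(SAMPLE, "age", "masculine"))
```

Note that "e" and "é" are different letters here, so "beauté" does not count as ending in "e"; whether it should is a modelling decision.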
German
- Plurals ending in just “n” (without an “e” before the “n”), like „Triumvirn“.
- query, now annotated with comments, including in its scope all varieties of German, limited to nominative-case forms, and excluding words whose plural forms are not obtained by appending “n” but just happen to end in “n” (like Pokemon)
- Verbs ending in just “n”, not “en”
- Acronyms that don’t adopt the grammatical gender of the last word in the expanded form.
- Word triples with the same pronunciation and different genders, like „der Coup“, „die Kuh“, „das Q“ (title of a book by CUS (Q1024556))
- Word quintuples with one vowel exchanged, like „Wart/Wert/Wirt/Wort/Wurt“ (from that same book) or „Zacken/Zecken/Zicken/Zocken/Zucken“ (inspiration) (probably very expensive)
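The vowel-exchange quintuples are probably too expensive to find inside the query service, but they are cheap to find once the lemmas have been downloaded. A minimal Python sketch follows, assuming a plain list of lemmas as input; the function name is hypothetical, and lowercasing the input is a simplification that merges nouns with other parts of speech.

```python
VOWELS = "aeiou"


def vowel_quintuples(words):
    """Find groups of five words that differ only by one exchanged vowel."""
    lexicon = {word.lower() for word in words}
    found = set()
    for word in lexicon:
        for i, ch in enumerate(word):
            if ch not in VOWELS:
                continue
            # Substitute each of the five vowels at position i.
            variants = tuple(word[:i] + v + word[i + 1:] for v in VOWELS)
            if all(variant in lexicon for variant in variants):
                found.add(variants)
    return sorted(found)


if __name__ == "__main__":
    sample = ["Wart", "Wert", "Wirt", "Wort", "Wurt", "Wand"]
    print(vowel_quintuples(sample))
```

Umlauts are not treated as vowels in this sketch; adding them to `VOWELS` would change the task from quintuples to larger groups.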
Malayalam
See Wikidata:Lexicographical data/Malayalam/Queries
Russian
- Distribution of nouns by declension class, with examples