Wikidata:Lexicographical data/Documentation/Languages/hi

Hindustani
natural language, modern language, common language
Subclass of: Western Hindi
Native label: ہندوستانی, हिन्दुस्तानी
Located in the administrative territorial entity: Delhi, Pakistan
Linguistic typology: subject–object–verb, syllabic language, fusional language
Has grammatical case: obliquus in Hindi
Has grammatical gender: feminine, masculine
Writing system: Devanagari, Urdu orthography
Language regulatory body: Central Hindi Directorate
Entry in abbreviations table: H., ਹਿੰ., ہ, ҳ.

Hindustani (Q11051), or Hindi-Urdu, is a language spoken in India and Pakistan. This is the documentation page for the Hindustani (Hindi-Urdu) language under WikiProject Wikidata:Lexicographical data, intended for coordinating contributions to Hindustani (Hindi and Urdu) lexeme content and related discussions. WikiProject India is a related WikiProject that covers all Hindustani topics.

Example Hindustani lexeme entries:

Sample Lexemes by Lexical Category
  • verb: आना/آنا (L33485)
  • noun: चूल्हा/چُولھا (L1011246)
  • pronoun: तुम/تُم (L580418)
  • adjective: बहरा/بہرا (L640865)
  • adverb: आगे/آگے (L580431)
  • postposition: तक/تک (L409543)
  • conjunction: लेकिन/لیکِن (L580024)
  • interjection: नमस्ते/نمستے (L579679)
  • determiner: सब/سب (L620518)
  • grammatical particle: भी/بھی (L580358)

Wikidata:Lexemes aims to provide CC0-licensed structured lexicographical data for everyone to use for different purposes, including for Wiktionary and the upcoming Abstract Wikipedia.


Layout

Every lexeme entry has the following layout:

Lexeme-level

The lemma of the lexeme can be considered a title or headword, generally the dictionary form of the word. It is to be written in both hi (Hindi, Devanagari script) and ur (Urdu, Arabic script) spelling variants for the Hindustani language entries. See उठना/اُٹھنا (L1071943) for example.

Every lexeme entry will have a lexeme ID (beginning with "L").

The language of the lexeme should be Hindustani (Q11051) in all cases (that is, not Hindi (Q1568) and not Urdu (Q1617)).

The lexical category should also be specified as broadly as possible, based on the Hindustani linguistic ontology.

Senses

Senses represent different meanings of the same word.

Some statements that may be added to senses include image, item for this sense, translation, synonym, antonym, usage example, and more (see list). Note that for the translation, antonym, and synonym properties, the sense ID (LXXXXX-S1) of the target lexeme has to be copied and pasted, not its lexeme ID.
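
The distinction between a lexeme ID and a sense ID can be checked mechanically before a value is pasted into a translation, synonym, or antonym statement. The following is a minimal sketch in Python; the function names are hypothetical, and the ID shapes are assumed from the examples on this page ("L" plus digits, with "-S" plus digits appended for a sense).

```python
import re

# Shape assumed from the IDs shown on this page: a lexeme ID ("L" plus digits)
# followed by "-S" plus digits for the sense.
SENSE_ID = re.compile(r"(L\d+)-S\d+")


def is_sense_id(value: str) -> bool:
    """Return True only for a full sense ID such as "L580418-S1"."""
    return SENSE_ID.fullmatch(value) is not None


def parent_lexeme(sense_id: str) -> str:
    """Return the lexeme ID that a sense ID belongs to."""
    match = SENSE_ID.fullmatch(sense_id)
    if match is None:
        raise ValueError(f"not a sense ID: {sense_id!r}")
    return match.group(1)
```

A bare lexeme ID such as "L580418" is rejected by `is_sense_id`, which is the mistake the note above warns about.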

Forms

Forms represent different inflections (cases for nouns/adjectives, conjugations for verbs) of the lexeme (in both Hindi and Urdu spelling variants).

Each noun typically has four forms, one for each combination of number (singular (Q110786)/plural (Q146786)) and case (direct case (Q1751855)/oblique case (Q1233197)). A small number of nouns, often but not always animate, also have vocative inflections (vocative case (Q185077)). These are governed by the senses on the lexeme and should not be added without certainty that they are used.
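
The four expected noun forms are the product of two numbers and two cases. As a rough illustration, the Python sketch below enumerates them and reports which combinations a noun still lacks. The function names and the input shape (one collection of grammatical-feature item IDs per form) are assumptions made for this example; vocative forms are deliberately left out, since they apply only to some nouns.

```python
from itertools import product

# Item IDs as given in the guideline above.
NUMBERS = {"singular": "Q110786", "plural": "Q146786"}
CASES = {"direct": "Q1751855", "oblique": "Q1233197"}


def expected_noun_forms():
    """Return the four (number, case) label pairs a typical noun should cover."""
    return list(product(NUMBERS, CASES))


def missing_noun_forms(existing_feature_sets):
    """List the (number, case) pairs not covered by any existing form.

    existing_feature_sets: one collection of grammatical-feature item IDs per form.
    """
    existing = [set(features) for features in existing_feature_sets]
    missing = []
    for number, case in expected_noun_forms():
        wanted = {NUMBERS[number], CASES[case]}
        # A form covers the pair if it carries both feature IDs.
        if not any(wanted <= features for features in existing):
            missing.append((number, case))
    return missing
```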

Structure and properties

Common properties to be added for lexeme entries are given below:

Statements

Identifiers

  • Urdu Lughat ID (P11350) – aggregate online dictionary maintained by the Urdu Dictionary Board, a Karachi-based Pakistani government body
Provided below is a key to some of the part-of-speech abbreviations used in the headings of Urdu Lughat entries. A key to those used in the footer for etymologies may be found in the menu on the dictionary's Advanced Search page.
  • صف = صفت (adjective)
  • امث = اسمِ مؤنث (feminine noun)
  • امذ = اسمِ مذکر (masculine noun)
  • ف ل = فعل لازمی (intransitive verb)
  • ف م = فعل متعدی (transitive verb)
  • م ف = متعلق فعل (adverb)

Senses

See sense properties by usage

Forms

See forms by grammatical feature

Spelling

Below are some guidelines for resolving some irregularities in spellings between the two writing systems, particularly for words which may be poorly attested in one register or the other.

  • ष — in Sanskritized words borrowed via Bengali this is ش, otherwise it is کھ. Most words spelled with this letter post-partition are Bengali borrowings.
  • ज्ञ — in practice always گی. Some Urdu dictionaries contain spellings with نج under the assumption this cluster represents an independent sound in Hindi, but this does not reflect actual usage.
  • ऋ — is always رِ.
  • ण — is always ن.
  • पुर — word-finally, this is پور rather than پُر.
  • ऑ — this vowel is purely decorative and is best ignored even in Devanagari spellings. Most of its use is confined to distinguishing the abbreviation डाॅ॰ “Dr.”.
  • आँड़ — this sequence of a nasal and a flap is typically written as نڈ in Urdu dictionaries, and it is acceptable to pair these spellings because the consonants represented by ڑ and ڈ are allophones in native Hindustani words. In English loanwords only ڈ is realized in all positions, and in vocabulary loaned from Punjabi the positions of ڈ and ڑ are maintained in Urdu spellings, as these sounds are not allophones in Punjabi.
  • त् — words spelled with this ending in standard Hindi are borrowings from Bengali words ending in ৎ. Although the virama/halant is retained when not followed by a suffix, it is removed in the oblique plural as in तों rather than त्ओं.
  • आँव — although انو may be found for this sequence in older Urdu writing, this is now more commonly written as اؤں.
  • य — word-finally, spelled with یہ in borrowings from Bengali, otherwise spelled with ے.
  • ژ — the value of this letter is always simply ज.
  • ہ — the use of this letter word-finally is often arbitrary and unetymological. The word commonly spelled پتہ in Urdu is from Punjabi پتا rather than a Persian *پته. If variants with both this letter and ا exist, they do not need separate lexemes. The lemma can follow whichever spelling is treated as the primary one in Urdu Lughat.
  • ق — some of the words spelled with this letter are native words which have been given pseudo-Arabic spellings, such as قُلی. The nukta form क़ is not necessary to represent this consonant which already had an ambiguous status in Persian. The /q/ phoneme represented by ق does not have phonemic status in Pashto either, and the spelling in Pashto onomatopoeic formations used in Hindustani like تڑق is an emphatic affect.
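
Some of the guidelines above are unconditional pairings ("always", or tied only to word-final position), so they can be written as a lookup table. The Python sketch below flags which of those pairings apply to a Devanagari spelling. It is not a transliterator: the function and table names are hypothetical, and the context-dependent rules (for example those that hinge on a Bengali or Punjabi origin) are left out because they cannot be decided from the spelling alone.

```python
# Only the unconditional pairings from the list above are included.
FIXED_ANYWHERE = {
    "ज्ञ": "گی",
    "ऋ": "رِ",
    "ण": "ن",
}

# Pairings that the list above restricts to the end of a word.
FIXED_WORD_FINAL = {
    "पुर": "پور",
}


def fixed_correspondences(devanagari_word):
    """Return (Devanagari sequence, Urdu spelling) pairs that apply to the word."""
    found = [
        (seq, urdu)
        for seq, urdu in FIXED_ANYWHERE.items()
        if seq in devanagari_word
    ]
    found += [
        (seq, urdu)
        for seq, urdu in FIXED_WORD_FINAL.items()
        if devanagari_word.endswith(seq)
    ]
    return found
```

The result is a checklist for a human editor comparing the two lemma spellings, not a generated Urdu spelling.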

Maintenance

To do

Lexicographical Coverage

See also: WD:Lexicographical data/Statistics
  • The lexeme form coverage figures for the Hindustani language are given below.
  • Forms in Wikidata: 7,585
  • Forms in Wikipedia: 54,443
  • Tokens: 18,734,831
  • Covered forms: 3,050 (5.6%)
  • Missing forms: 51,393 (94.4%)
  • Covered tokens: 12,433,553 (66.4%)
  • Missing tokens: 6,301,278 (33.6%)
  • Most frequent missing forms
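
The percentages above follow directly from the raw counts. As a quick arithmetic check, here is a small Python sketch; the variable and function names are made up for this example, and the counts are the ones listed above.

```python
def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


forms_total = 54_443        # forms in Wikipedia
forms_covered = 3_050
tokens_total = 18_734_831
tokens_covered = 12_433_553

# Missing counts are the totals minus the covered counts.
forms_missing = forms_total - forms_covered
tokens_missing = tokens_total - tokens_covered

print(share(forms_covered, forms_total), share(forms_missing, forms_total))
print(share(tokens_covered, tokens_total), share(tokens_missing, tokens_total))
```

The gap between form coverage and token coverage reflects that the covered forms are the frequent ones: a small share of distinct forms accounts for about two thirds of all tokens.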

Queries

Main page: WD:Lexicographical data/Ideas of queries

1) Get all existing lexemes in Hindustani: query result

The following query uses these:

  • Items: Hindustani (Q11051)
    SELECT ?lexeme ?lemma WHERE {
      ?lexeme dct:language wd:Q11051; 
              wikibase:lemma ?lemma.
    }
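
Outside the query service's web interface, the same query can be sent over HTTP. The following is a minimal Python sketch using only the standard library. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql, which predefines the prefixes used in the query; the function names and the User-Agent string are placeholders chosen for this example.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumed here).
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q11051;
          wikibase:lemma ?lemma.
}
"""


def build_url(query):
    """Build a GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return ENDPOINT + "?" + params


def run(query, user_agent="hindustani-lexeme-docs-example/0.1"):
    """Send the query and return the list of result bindings."""
    request = urllib.request.Request(
        build_url(query), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run(QUERY)` performs a network request. As the number of Hindustani lexemes grows, adding a `LIMIT` clause to the query may be prudent for exploratory use.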
    

2) Get the count of lexemes in Hindustani belonging to different lexical categories: https://w.wiki/3$cf

3) Get all Hindustani nouns missing a direct-case form: query

The following query uses these:

  • Items: Hindustani (Q11051), noun (Q1084), direct case (Q1751855)
    SELECT DISTINCT ?l ?lemma WHERE {
      # Hindustani nouns that have at least one form.
      ?l a ontolex:LexicalEntry ;
         dct:language wd:Q11051 ;
         wikibase:lexicalCategory wd:Q1084 ;
         wikibase:lemma ?lemma ;
         ontolex:lexicalForm ?form .
      ?form ontolex:representation ?word .
      # Drop nouns on which some form is already marked as direct case.
      MINUS {
        ?l ontolex:lexicalForm/wikibase:grammaticalFeature wd:Q1751855 .
      }
    }
    

Resources

Some resources, in addition to the ones listed below, may be found at Commons:Category:Books about the Hindustani language.

Dictionaries

Quotable dictionaries

Public domain dictionaries may be quoted using gloss quote (P8394), referenced with the claims stated in (P248) (appropriate dictionary item), page(s) (P304) (appropriate page number), and reference URL (P854) if applicable.
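
To make the relationship between these properties concrete, the Python sketch below lays out a gloss quote and its reference as a plain dictionary keyed by property ID. This is only an illustrative outline, not the Wikibase JSON format or any library's API; the function name and the argument values in the usage line are hypothetical.

```python
def gloss_quote_outline(quote, dictionary_item, pages, url=None):
    """Outline a gloss quote statement and its reference as a plain dict.

    Illustrative layout keyed by property ID; not the Wikibase JSON format.
    """
    reference = {
        "P248": dictionary_item,  # stated in: item for the dictionary
        "P304": pages,            # page(s) where the gloss appears
    }
    if url is not None:
        reference["P854"] = url   # reference URL, only if applicable
    # P8394 (gloss quote) is the statement; the rest is its reference.
    return {"P8394": quote, "reference": reference}


# Hypothetical values, for illustration only.
outline = gloss_quote_outline("<quoted gloss>", "<dictionary item ID>", "<page>")
```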

Public domain monolingual dictionaries (preferred):

Public domain bilingual dictionaries (those in other regional languages preferred):

More may be found here.

Citable, but non-quotable, dictionaries

Other dictionaries that may be cited but not quoted include the following (those glossed in regional languages are likewise preferable):

Phrases

Grammars

Orthography

Regional Context

Tools

WD:Tools/Lexicographical data

Contact


  1. "In Hindi we have Shakespear and Forbes, but neither of these works is more than a very copious vocabulary, and both are derived almost exclusively from the written language."