Wikidata:Requests for permissions/Bot/بوٹ دا عثمان


بوٹ دا عثمان

بوٹ دا عثمان (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Middle river exports (talk • contribs • logs)

Task/s:

  • Dual-script support for Punjabi Wikidata labels and descriptions in Gurmukhi (pa) and Shahmukhi (pnb), taking a conservative approach (only making "obvious" transliterations).
  • Addition of labels and descriptions consisting entirely of information that can be inferred from labels in other languages (again, only for "obvious" cases).

Code:

Function details:

  • Gurmukhi to Shahmukhi: The Gurmukhi script for Punjabi represents enough information that the vast majority of strings can be transliterated to Shahmukhi by applying a set of standard rules. For example, word-initial ਆ should always be آ, and ਨ੍ਹ should be نھ. The exceptions are largely concentrated in the most commonly used words in the language, such as ਮੂੰਹ (face/mouth), which is مونہہ. For this reason, I intend to limit these transliterations to labels and descriptions consisting only of the most common words, as defined by the EMILLE Punjabi Corpus, which maintains this data. The exceptions will be substituted first, before the general rules are applied.
  • Shahmukhi to Gurmukhi: The Perso-Arabic script allows the omission of information that Gurmukhi does not, and has more one-to-many conversions, so transliterations in this direction will unfortunately have to be more limited for the time being. Despite these differences, some common letter combinations always correspond to the same letters in Gurmukhi, and for labels and descriptions consisting entirely of these, a Gurmukhi representation can be derived. For example, ب is always ਬ so long as it is not followed by ھ, and ٹ not followed by ھ is always ਟ. Some more sophisticated conversions are also possible. For example, و can represent the consonant ਵ or the vowels ਉ, ਊ, ਓ, and ਔ. However, a vowel at the beginning of a word is always represented by ا, so we know word-initial و is always ਵ. So if we see وِکی ("wiki"), we can transliterate this as ਵਿਕੀ with 100% confidence.
  • Tidying / Standardizing: Some forms always have a standard, expected alternate spelling, so aliases will be added for these where applicable. For example, any word containing ਲ਼ may be written with ਲ instead, and any word written with ݨ may be written with ن instead. These letters are in use in the Punjabi Wikipedia editions, but are not yet broadly supported across keyboard layouts or used by all writers, so aliases containing the alternate spelling should be added wherever possible. The other aspect of tidying I intend the bot to perform is the correction of spelling errors that have an "obvious" correction which can be made with 100% confidence. For example, the shad on کّ should be moved after the ھ when one follows, as in کھّ. Wherever کّھ appears, it is always an error which should be corrected this way. For Gurmukhi, if we see ਅ + ਾ as in ਅਾ, this should always be replaced with ਆ.
  • Deriving labels and descriptions from other languages: This functionality is intended to be similar to what the existing Mr. Ibrahembot does. For items that lack descriptions but represent common object types, Punjabi descriptions will be provided. Initially this will be limited to geographic entities (e.g. "city in Punjab, Pakistan"), books (e.g. "1995 book"), and scholarly articles (e.g. "1995 scholarly article"). Pakistani toponyms that have only English labels in the format Chak ### XY, as in Chak 251 GB, will be given standard Punjabi labels, as in چاک 251 گ ب. The other component is that, for any item with instance of: human, the Urdu label, if one is present, may be copied exactly to the Punjabi Shahmukhi label field. The reason is that although Punjabi personal names have their own pronunciations, they are all orthographia conservadum; that is, they are supposed to be spelled exactly as in Urdu. Even for human items labeled with pseudonyms or other monikers, rendering them in Punjabi would follow exactly the same conventions as in Urdu.
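
As a rough illustration of the "only when certain" approach described in the function details, here is a minimal Python sketch. It is not the bot's code. The function names (fix_spelling, spelling_alias, shahmukhi_word_to_gurmukhi) are hypothetical, and the rule table is a tiny fragment: the ب, ٹ and word-initial و rules come from the description above, while the entries for zer, ک and ی are read off the single وِکی example and are not claimed to hold in general. Characters are written as Unicode escapes so that the code points are unambiguous.

```python
from typing import Optional


def fix_spelling(text: str) -> str:
    """Apply the two corrections described as always safe."""
    # Gurmukhi: U+0A05 (letter A) + U+0A3E (vowel sign AA) becomes U+0A06 (letter AA).
    text = text.replace("\u0A05\u0A3E", "\u0A06")
    # Shahmukhi: keheh + shadda + do-chashmi he becomes keheh + do-chashmi he + shadda.
    text = text.replace("\u06A9\u0651\u06BE", "\u06A9\u06BE\u0651")
    return text


def spelling_alias(label: str) -> Optional[str]:
    """Return the alternate spelling to add as an alias, or None if there is none."""
    alias = (
        label.replace("\u0A32\u0A3C", "\u0A32")  # LA + nukta (decomposed LLA) -> LA
        .replace("\u0A33", "\u0A32")  # precomposed LLA -> LA
        .replace("\u0768", "\u0646")  # Shahmukhi retroflex noon -> noon
    )
    return alias if alias != label else None


ASPIRATE = "\u06BE"  # do-chashmi he
WAW = "\u0648"

# Illustrative fragment only; see the caveats in the paragraph above.
ONE_TO_ONE = {
    "\u0628": "\u0A2C",  # beh -> BA (stated in the request)
    "\u0679": "\u0A1F",  # tteh -> TTA (stated in the request)
    "\u0650": "\u0A3F",  # zer -> vowel sign I (assumed from the wiki example)
    "\u06A9": "\u0A15",  # keheh -> KA (assumed from the wiki example)
    "\u06CC": "\u0A40",  # yeh -> vowel sign II (assumed from the wiki example)
}


def shahmukhi_word_to_gurmukhi(word: str) -> Optional[str]:
    """Transliterate one word, or return None as soon as anything is ambiguous."""
    out = []
    for i, ch in enumerate(word):
        if word[i + 1 : i + 2] == ASPIRATE:
            return None  # aspirated combinations are outside this fragment
        if ch == WAW:
            if i != 0:
                return None  # non-initial waw may be a vowel, so give up
            out.append("\u0A35")  # word-initial waw is always VA
        elif ch in ONE_TO_ONE:
            out.append(ONE_TO_ONE[ch])
        else:
            return None  # unknown character: skip the whole word
    return "".join(out) if out else None
```

The transliteration function is deliberately all-or-nothing: a word is converted only if every character is covered by a rule, so a label containing a single ambiguous letter is left untouched rather than partially converted.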

--Middle river exports (talk) 19:32, 17 September 2022 (UTC)

Please make some test edits. Ymblanter (talk) 19:10, 27 September 2022 (UTC)