Wikidata:Property proposal/pronunciation



Originally proposed at Wikidata:Property proposal/Lexemes

   Done: pronunciation (P7243) (Talk and documentation)
Description: Pronunciation of the lexeme's form, e.g. with ◌́ marking the position of the word's stress. Attach other pronunciation-related information, such as IPA and sound files, as qualifiers.
Data type: Monolingual text
Example 1: Lexeme:L63567#F1 → "попе́рчивший" + Qualifiers: IPA transcription (P898) pɐˈpʲert͡ɕɪfʂɨɪ̯, pronunciation audio (P443), etc.
Example 2: Lexeme:L63567#F1 → "поперчи́вший" + Qualifiers: IPA transcription (P898) pəpʲɪrˈt͡ɕifʂɨɪ̯, pronunciation audio (P443), etc.
Example 3: Lexeme:L63570#F1 → "сессия" + Qualifiers: IPA transcription (P898) ˈsɛsʲ(ː)ɪɪ̯ə, pronunciation audio (P443) File:Ru-сессия (ɛ).ogg
Example 4: Lexeme:L63570#F1 → "сессия" + Qualifiers: IPA transcription (P898) ˈsʲesʲ(ː)ɪɪ̯ə, pronunciation audio (P443) File:Ru-сессия (e).ogg
Planned use: Upload Russian Wiktionary data, starting with nouns
See also: IPA transcription (P898), UPA transcription (P6798), X-SAMPA Code (P2859), Slavic phonetic alphabet (P5276)


There are often cases when the same written word form has more than one pronunciation. Each pronunciation has a number of properties itself, such as pronunciation audio (P443), IPA transcription (P898), region in which it is used, or the references to scholarly works about it. I would like to propose that we put all pronunciation-related values as qualifiers to a single property. The property itself would have the written form of the word, possibly repeated, with the linguistic stress marks applied to it.

English example
Lexeme = potato
 Form F1 = potato (singular)
   pronunciation = "potato" (en)  -- TBD: here we could use the tick symbol to indicate where the stress should be, i.e. "potáto"
     IPA = "pəˈteɪtəʊ"
     Sound = En-uk-potato.ogg
     Region = UK
   pronunciation = "potato" (en)  -- same value as above
     IPA = "pəˈteɪtoʊ"
     Sound = en-us-potato.ogg
     Region = US
   pronunciation = "potato" (en)  -- same value as above
     IPA = "pəˈteɪtə"
     Sound = LL-Q1860 (eng)-Nattes à chat-potato.wav
     Region = US (?)
 Form F2 = potatoes (plural)
Russian example
Lexeme = "поперчивший"   (<he> peppered ...)
  Form F1 = "поперчивший"          -- form word has no stress symbol, and its primary form (F1) is the same as for the whole lexeme
    pronunciation = "попе́рчивший"  -- one common usage, with the stress mark on the 2nd syllable
      IPA = "pɐˈpʲert͡ɕɪfʂɨɪ̯"
      Sound = sound1.ogg
    pronunciation = "поперчи́вший"  -- another common usage, stress on the 3rd syllable instead
      IPA = "pəpʲɪrˈt͡ɕifʂɨɪ̯"
      Sound = sound2.ogg
      Region = ...
      References = ...
  Form F2 = ...
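
The layout in the two examples above can be sketched as a small data model. This is only an illustration under assumptions: the class and field names (`PronunciationStatement`, `Form`, `ipa`, `audio`, `region`) are hypothetical and are not part of any Wikibase API. Only the property IDs P898 and P443 and the sample values come from the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PronunciationStatement:
    """One value of the proposed pronunciation property on a form (hypothetical model)."""
    text: str                     # written form, optionally carrying a stress mark
    language: str                 # language code of the monolingual text
    ipa: Optional[str] = None     # qualifier: IPA transcription (P898)
    audio: Optional[str] = None   # qualifier: pronunciation audio (P443)
    region: Optional[str] = None  # qualifier: region of use (illustrative)


@dataclass
class Form:
    """A grammatical form holding any number of pronunciation statements."""
    form_id: str
    representation: str
    pronunciations: List[PronunciationStatement] = field(default_factory=list)


# The first two pronunciations of "potato" from the English example.
# The statement value repeats; only the qualifiers differ.
potato = Form("F1", "potato", [
    PronunciationStatement("potato", "en", ipa="pəˈteɪtəʊ",
                           audio="En-uk-potato.ogg", region="UK"),
    PronunciationStatement("potato", "en", ipa="pəˈteɪtoʊ",
                           audio="en-us-potato.ogg", region="US"),
])


def ipa_by_region(form: Form) -> dict:
    """Map each known region to its IPA transcription, skipping incomplete statements."""
    return {p.region: p.ipa for p in form.pronunciations if p.region and p.ipa}
```

Because every qualifier field is optional, a statement can start with only the written text and gain IPA or audio later, which is the property the proposal relies on.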

In many cases, the difference is just the location of the stressed syllable, as in the first two examples, indicated by the Unicode stress symbol ◌́ (used in some languages as an occasional reading aid, but dictionaries usually include it). In other cases, the stress is on the same syllable, but the word is still read in more than one way. In examples 3 and 4, the word сессия (session) has two pronunciations: [ˈsɛsʲ(ː)ɪɪ̯ə] and [ˈsʲesʲ(ː)ɪɪ̯ə]. In this case we simply have two identical values for the pronunciation property, but with different sets of qualifier values. BTW, some extreme cases could even have four different pronunciations, e.g. моветон (it has both IPA and sound files for all 4 variants, 2 backed by sources). --Yurik (talk) 01:58, 28 August 2019 (UTC)
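
As a rough sketch of how the stress mark relates the pronunciation value to the written form, the following assumes the mark is the combining acute accent (U+0301) and that syllables can be counted by counting Russian vowel letters. The helper names `strip_stress` and `stressed_syllable` are hypothetical and were not part of the proposal.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # the stress mark shown as ◌́ in the proposal
# Assumption: one syllable per vowel letter, Russian only.
RUSSIAN_VOWELS = set(unicodedata.normalize("NFC", "аеёиоуыэюя"))


def strip_stress(text: str) -> str:
    """Return the written form with combining acute accents removed."""
    decomposed = unicodedata.normalize("NFD", text)
    cleaned = decomposed.replace(COMBINING_ACUTE, "")
    # Recompose so letters such as й and ё return to their usual single code point.
    return unicodedata.normalize("NFC", cleaned)


def stressed_syllable(text: str):
    """Return the 1-based index of the stressed syllable, or None if unmarked."""
    syllable = 0
    for ch in unicodedata.normalize("NFC", text.lower()):
        if ch in RUSSIAN_VOWELS:
            syllable += 1
        elif ch == COMBINING_ACUTE:
            return syllable or None
    return None


variant_a = "попе\u0301рчивший"  # stress on the 2nd syllable
variant_b = "поперчи\u0301вший"  # stress on the 3rd syllable
written = unicodedata.normalize("NFC", "поперчивший")

print(strip_stress(variant_a) == written, stressed_syllable(variant_a))
print(strip_stress(variant_b) == written, stressed_syllable(variant_b))
```

Both variants reduce to the same written form, so the pronunciation value can be checked against the form's representation. Stripping U+0301 is only safe for languages where the acute accent is not part of ordinary spelling. For сессия, where both pronunciations share the same stress, the values are identical and only the qualifiers tell them apart.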


  • Support Looks logical --Ghuron (talk) 03:30, 28 August 2019 (UTC)
  • I do not see why it is not possible to do this without adding a new property. I have the feeling that adding a new property will make some cases easier, like the one you take as an example, but I think it will complicate all the other cases. Pamputt (talk) 05:46, 28 August 2019 (UTC)
    A number of other structures have been analyzed. This one was proposed by smalyshev - WDQS Dev, and seems to fit the best. Others were to store words with stresses as the primary form text, and create multiple forms, one per pronunciation. And yet another was to use IPA as the primary identifier (too bad no one can read it). Luckily, the above proposal makes it very simple even for basic cases -- just have one value for the new property, with the stress mark, and attach IPA and sound to it. We still need to store that text form, so an identical proposal is needed even if IPA and sound are stored as form statements rather than qualifiers. --Yurik (talk) 05:59, 28 August 2019 (UTC)
    I am still not convinced by this approach. What about using two forms instead of one; one for "попе́рчивший" and another one for "поперчи́вший". I do not know enough about the Russian language so I cannot be sure it is a correct approach. Pamputt (talk) 11:05, 28 August 2019 (UTC)
    Moreover, are we sure that all the properties that will be used as qualifiers do not already use qualifiers themselves? Pamputt (talk) 11:06, 28 August 2019 (UTC)
    @Pamputt: that was my other idea initially - to treat "form" not as a grammatical form, but rather as a pronunciation form. That would create an entirely different set of issues:
    • people that assume that "a form" implies grammatical form will be very confused why there are 4 singular nominative forms, 4 singular genitive cases, etc (Russian often has 27 grammatical forms for adjectives, so multiplied by pronunciations - the number here would be ... large and hard to manage).
    • there are many form statements that are not pronunciation related, such as hyphenation, syllable breaks, and links to the form-specific word components. We wouldn't want to keep these duplicated and in sync across multiple pronunciations if we can set up a more efficient system.
    You are right that this would preclude using qualifiers for those properties, and that's the potential cost of this solution, but after discussing it online with several people, this approach seems to create the fewest complications compared to the other ones listed above. --Yurik (talk) 14:43, 28 August 2019 (UTC)
    The main problem with this approach is that different pronunciations are not always attached to a given form. Take English as an example: for the same spelling, you may have different pronunciations from Canada, the US, the UK, Australia, etc. To describe such cases, I would take the IPA as the main information and I would add an audio file, the location, etc. I am afraid that this new property makes things more complicated and does not take into account all the cases for every language. Or maybe we should use this property only for Russian and maybe other Slavic languages. Pamputt (talk) 17:18, 28 August 2019 (UTC)
    @Pamputt: this proposal accounts for that. For example, the word potato (wiktionary) / potato (L3784) has four variants in Wiktionary, and we can record them as follows: (example moved to the top). P.S. note that this proposal accounts for the fact that none of the 3 qualifiers above is mandatory - you may not know how to write IPA, or you might not have the sound file, but you can still use this approach and someone else can add them later. If we used IPA as the key, it would require everyone to know how to write it.
--Yurik (talk) 17:52, 28 August 2019 (UTC)
  • Support --Cinemantique (talk) 10:45, 28 August 2019 (UTC)
  • Support. Nomen ad hoc (talk) 16:47, 28 August 2019 (UTC).
  • Comment what about using the location of usage/region as the value for this property, with any cases where there's only one pronunciation getting maybe Earth (Q2) or something else indicating universality...? ArthurPSmith (talk) 18:31, 28 August 2019 (UTC)
    @ArthurPSmith: What if I do not know where it is used? What if it is used everywhere except one area? Also, often the split is not geographical but rather "educational" - e.g. there is a more "patrician" way to say a word, and a more "plebs" way to say it (historically, the way one said a word indicated belonging to a certain group). The only thing you know for certain is the spelling of the word itself. In the case of a language with variable stress, e.g. Russian or English, but not French or Mandarin, we would still need to record the position of the stress - and this would be the perfect place because stress is often part of what differentiates pronunciations. --Yurik (talk) 18:44, 28 August 2019 (UTC)
  • Support. Yep, I like this way for forms pronunciation. Iniquity (talk) 00:21, 29 August 2019 (UTC)
  • Comment could you add lexemes to the samples (i.e. provide full sample statements by replacing "Lexeme:L1234#F4" with the applicable lexeme/form)? --- Jura 08:06, 29 August 2019 (UTC)
    @Jura1: done. --Yurik (talk) 19:13, 29 August 2019 (UTC)
  • Support Seems reasonable and useful. --Sintakso (talk) 11:28, 30 August 2019 (UTC)
  • Created, @Yurik, Ghuron, Pamputt, Cinemantique, Nomen ad hoc, ArthurPSmith:--Ymblanter (talk) 06:44, 4 September 2019 (UTC)
  • @Iniquity, Jura1, Sintakso:--Ymblanter (talk) 06:44, 4 September 2019 (UTC)