Shortcut: WD:PP/L

Wikidata:Property proposal/Lexemes



This page is for the proposal of new properties.

Before proposing a property

  1. Check if the property already exists by looking at Wikidata:List of properties (research on manual list) and Special:ListProperties.
  2. Check if the property was previously proposed or is on the pending list.
  3. Check if you can give a similar label and definition as an existing Wikipedia infobox parameter, or if it can be matched to an infobox, to or from which data can be transferred automatically.
  4. Select the right datatype for the property.
  5. Start writing the documentation based on the preload form below and add it in the appropriate section.

Creating the property

  1. Once consensus is reached, change status=ready on the template, to attract the attention of a property creator.
  2. Creation can be done 1 week after the proposal, by a property creator or an administrator.
  3. See steps when creating properties.

On this page, old discussions are archived. An overview of all archives can be found at this page's archive index. The current archive is located at 2021/06.


Kielitoimiston sanakirja ID

Status: Under discussion
Description: identifier for an entry in Kielitoimiston sanakirja
Data type: External identifier
Domain: Finnish lexemes
Example 1: sarja (L29370) → sarja
Example 2: talo (L29606) → talo
Example 3: vyö (L29685) → vyö
Planned use: linking existing Wikidata lexemes
Number of IDs in source: over 100,000
Expected completeness: always incomplete (Q21873886)
Formatter URL: $1
Applicable "stated in"-value: Kielitoimiston sanakirja (Q54855316)


Kielitoimiston sanakirja (Q54855316) is the online version of the official Finnish dictionary, and it's commonly linked to in Wiktionary (Q151). As with other dictionaries, this is an external identifier for a headword, which multiple lexemes can share. Kriomet (talk) 11:13, 26 April 2021 (UTC)


in sense

Status: Under discussion
Description: qualifier for lexeme statements that apply to some senses but not others
Data type: Sense
Example 1: džús (L402822): instance of (P31): mass noun (Q489168) → L402822-S1
Example 2: džús (L402822): instance of (P31): count noun (Q1520033) → L402822-S2
Example 3: bežec (L409973): grammatical gender (P5185) = masculine personal (Q27918551) → L409973-S1
Example 4: ucho (L299083): L299083-F7 → L299083-S1 (after merge with ucho (L249402))
Planned use: 400+ Slovak lexemes with ambiguous massness, 50+ Slovak lexemes with ambiguous gender, potentially millions of uses across all languages
See also: demonstrates sense (P6072), derived from sense (P5980)


The idea for this property originally came up when discussing noun massness. It turns out such a qualifier would be useful on several statements:

  1. Some nouns are countable in some senses and uncountable in others. See water in Wiktionary for an example. To represent sense-dependent massness, a Wikidata lexeme would have two instance of (P31) statements, one for mass noun (Q489168) and one for count noun (Q1520033), each of which would have one or more "in sense" qualifiers.
  2. Verbs can be transitive or intransitive depending on sense. This is currently modeled by adding grammatical aspect (P7486): transitive verb (Q1774805)/intransitive verb (Q1166153) statements to the lexeme, which could be disambiguated with "in sense" qualifiers in the same way as massness above.
  3. Some nouns have sense-dependent gender. For example, bežec in Slovak can be a running person, a running animal, or a moving part of a machine. These three senses have genders masculine personal (Q27918551), masculine animate non-personal (Q52943193), and masculine inanimate (Q52943434) respectively. At the moment, every sense gets its own "homograph" lexeme, but such lexemes are not true homographs, because they share etymology and the senses are closely related. With an "in sense" qualifier on grammatical gender (P5185), they could be merged into a single lexeme.
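
The massness case in point 1 can be pictured with a minimal Python sketch. It is not the Wikibase JSON format: the `Statement` and `Lexeme` classes, the `in_sense` field, and the rule that an unqualified statement applies to every sense are all assumptions made for illustration. Only the IDs come from Examples 1 and 2 of the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    prop: str    # property ID, e.g. "P31"
    value: str   # item ID, e.g. "Q489168"
    # Proposed "in sense" qualifier values (sense IDs).
    # Assumption: an empty list means the statement applies to all senses.
    in_sense: list[str] = field(default_factory=list)


@dataclass
class Lexeme:
    lexeme_id: str
    senses: list[str]
    statements: list[Statement]


def values_for_sense(lexeme: Lexeme, prop: str, sense_id: str) -> list[str]:
    """Return the values of `prop` that apply to the given sense."""
    return [
        s.value
        for s in lexeme.statements
        if s.prop == prop and (not s.in_sense or sense_id in s.in_sense)
    ]


# džús (L402822): mass noun in sense S1, count noun in sense S2.
dzus = Lexeme(
    lexeme_id="L402822",
    senses=["L402822-S1", "L402822-S2"],
    statements=[
        Statement("P31", "Q489168", in_sense=["L402822-S1"]),   # mass noun
        Statement("P31", "Q1520033", in_sense=["L402822-S2"]),  # count noun
    ],
)

print(values_for_sense(dzus, "P31", "L402822-S1"))
```

A consumer that ignores the qualifier would still see both P31 values on the lexeme, which matches the proposal's intent that the lexeme, not the sense, carries the statement.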

The following cases are NOT considered valid applications:

  • Usage examples have their own demonstrates sense (P6072) property.
  • Etymology differences create true homographs. True homographs should not be merged.


Alternatives considered:

  • Aggressively splitting lexemes works, but it creates false "homographs" and makes the data harder to understand for both users and editors.
  • Statements can be placed on the sense level too, but adding an instance of (P31): mass noun (Q489168) statement to a sense feels wrong, because the sense is not an instance of mass noun; the lexeme is.

Robert Važan (talk) 11:37, 20 May 2021 (UTC)

May 26 update: The property would be also used on sense-specific forms. — Robert Važan (talk) 14:50, 26 May 2021 (UTC)

June 5 update: I am retracting the May 26 change. Use of this property on forms is problematic and requires separate discussion. — Robert Važan (talk) 09:28, 5 June 2021 (UTC)


  •  Support Nice proposal, this definitely solves an issue with our current data model for lexemes. ArthurPSmith (talk) 17:05, 20 May 2021 (UTC)
  • Comment @Fnielsen, VIGNERON: I am requesting your comment since you participated in the original discussion. Thanks. Robert Važan (talk) 10:20, 26 May 2021 (UTC)
  • Comment This seems to be a good proposal. In Danish (Q9035) we have the special lexeme øl (L39743), which has two senses, "a beer" and "some beer", where the grammatical gender is different and where such a property would be wanted. Currently, P31 is used. In this case, we also seem to miss a connection between form and senses. So either the form should have a link to the sense, or the sense should have a link to the form. For the first case, the "in sense" proposal could be used if the scope is expanded beyond "just" being a qualifier on lexeme level? Perhaps the name should be changed to something like "relates to sense"? Finn Årup Nielsen (fnielsen) (talk) 11:59, 26 May 2021 (UTC)
    @Fnielsen: I have encountered a similar case in Slovak: ucho (L249402) and ucho (L299083) could be merged with "in sense" on the differing forms. So yes, I would be in favor of using the property on forms too, unless a dedicated property would be preferable to allow for tighter constraints on both properties. Robert Važan (talk) 14:40, 26 May 2021 (UTC)
    @Fnielsen: Naming the property "relates to sense" would encourage its use to express soft statements like "usually in sense". The wording "in sense" implies a hard association, in line with translation (P5972) and synonym (P5973). Robert Važan (talk) 14:43, 26 May 2021 (UTC)
    @Fnielsen: I have meanwhile found several issues with using "in sense" on forms. Annotation of sense-specific forms requires its own discussion and most likely a separate property. Briefly, "in sense" on forms is often better expressed via additional grammatical features (e.g. gender, massness), "in sense" on related lexeme statements (e.g. singulare/plurale tantum), or possible future form grouping (via shared stem or otherwise). Having a separate property for forms will also allow tighter property constraints on both properties. Robert Važan (talk) 09:28, 5 June 2021 (UTC)

role of component

Status: Under discussion
Description: qualifier on combines (P5238) to indicate the role that one part of a multi-term lexeme plays in the combination
Data type: Item
Example 1: (umsetzen (L518929) combines (P5238) setzen (L485935)) → "conjugated portion"
Example 2: (umsetzen (L518929) combines (P5238) um (L6722)) → "separable prefix"
Example 3: (agir kirin (L20381) combines (P5238) kirin (L60024)) → "conjugated portion"
Example 4: (agir kirin (L20381) combines (P5238) agir (L1700)) → "verbal noun"


This qualifier would help clarify the role (whether syntactic, semantic, or something else) of a component of a multi-term lexeme—such as a Germanic separable verb, the productive result of the combination of a "to do" verb with a noun in many languages, or other kinds of compounds whose meanings are not just the sum of their parts. It can certainly be added multiple times to the same claim with different values so that roles of different sorts can be simultaneously marked. Mahir256 (talk) 21:19, 16 June 2021 (UTC)
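
As a rough illustration of how such a qualifier might be consumed, the four examples above can be laid out as a small Python structure. The dictionary layout and the `parts_with_role` helper are hypothetical, and the roles are kept as plain label strings because the proposal does not give item IDs for them. Each part carries a list of roles, reflecting the note that the qualifier may be added several times to one claim.

```python
# combines (P5238) claims per lexeme, each with "role of component" qualifiers.
# Hypothetical layout for illustration; not the Wikibase data model.
COMBINES = {
    "L518929": [  # umsetzen
        {"part": "L485935", "roles": ["conjugated portion"]},  # setzen
        {"part": "L6722", "roles": ["separable prefix"]},      # um
    ],
    "L20381": [  # agir kirin
        {"part": "L60024", "roles": ["conjugated portion"]},   # kirin
        {"part": "L1700", "roles": ["verbal noun"]},           # agir
    ],
}


def parts_with_role(lexeme_id: str, role: str) -> list[str]:
    """Return the component lexemes of `lexeme_id` that play the given role."""
    return [
        claim["part"]
        for claim in COMBINES.get(lexeme_id, [])
        if role in claim["roles"]
    ]


print(parts_with_role("L518929", "conjugated portion"))
print(parts_with_role("L20381", "verbal noun"))
```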




CantoDict identifiers

CantoDict word ID

Status: Under discussion
Description: identifier for a word in CantoDict
Data type: External identifier
Allowed values: [1-9][0-9]*
Example 1: 但係/但系 (L315711) → 645
Example 2: 已經/已经 (L315712) → 865
Example 3: 我哋 (L400825) → 287
Example 4: 你哋 (L400826) → 288
Example 5: 佢哋 (L400827) → 289
Number of IDs in source: 60702
Expected completeness: always incomplete (Q21873886)
Formatter URL: $1/
Single-value constraint: yes

CantoDict character ID

Status: Under discussion
Description: identifier for a single character word in CantoDict
Data type: External identifier
Allowed values: [1-9][0-9]*
Example 1: (L31492) → 406
Example 2: (L230480) → 28
Example 3: (L400376) → 589
Example 4: (L400814) → 468
Example 5: (L400823) → 1
Number of IDs in source: 5368
Expected completeness: always incomplete (Q21873886)
Formatter URL: $1/
Single-value constraint: yes


CantoDict is one of the few freely accessible English-Cantonese dictionaries and the only one I'm aware of which has example sentences. - Nikki (talk) 14:36, 4 January 2021 (UTC)
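
Both identifiers above share the allowed-values pattern `[1-9][0-9]*`, i.e. a positive integer without leading zeros. A minimal Python sketch of checking a candidate value against it follows; the function name is made up for illustration, and requiring a whole-string match is an assumption about how the pattern is meant to be applied.

```python
import re

# Allowed-values pattern taken from the proposal.
CANTODICT_ID = re.compile(r"[1-9][0-9]*")


def is_valid_cantodict_id(value: str) -> bool:
    """True if the whole string matches the allowed-values pattern."""
    return CANTODICT_ID.fullmatch(value) is not None


for candidate in ["645", "1", "0", "007", "12a", ""]:
    print(repr(candidate), is_valid_cantodict_id(candidate))
```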


  • CantoDict entries are about words (not lexemes), but Wikidata splits terms with different etymology into different lexemes. So Weak oppose for CantoDict word ID;  Support for CantoDict character ID but they should not be used on lexemes but instead on items such as (Q54366215).--GZWDer (talk) 15:05, 5 January 2021 (UTC)
    • Ok, I've changed it to not have a distinct values constraint. We can instead add a complex constraint expecting lexemes with the same ID to have matching lemmas. I don't think we can expect every external resource for lexemes to assign different IDs for different origins of the same spelling, especially languages with limited resources. We could force people to use described at URL (P973) instead but what would be the benefit of that? - Nikki (talk) 17:17, 6 January 2021 (UTC)
 Support For both. OED Online ID (P5275) has the same issue but is treated as an external ID. You can list exceptions to constraints if necessary; would there really be so many duplicate cases for Chinese characters? ArthurPSmith (talk) 18:16, 11 January 2021 (UTC)
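
The check Nikki describes above (lexemes sharing an ID should have matching lemmas) would presumably be written as a query in an actual complex constraint. The Python sketch below only illustrates the logic on in-memory data: the record layout, the function name, and the two lexeme IDs marked as hypothetical are assumptions, not real Wikidata entries.

```python
from collections import defaultdict


def ids_with_mismatched_lemmas(records):
    """Return external IDs that are used on lexemes with differing lemmas.

    `records` is an iterable of (lexeme_id, lemma, external_id) tuples.
    """
    lemmas_by_id = defaultdict(set)
    for _lexeme_id, lemma, external_id in records:
        lemmas_by_id[external_id].add(lemma)
    return sorted(eid for eid, lemmas in lemmas_by_id.items() if len(lemmas) > 1)


records = [
    ("L315711", "但係", "645"),
    ("L-hypothetical-1", "但係", "645"),  # same spelling, different origin: fine
    ("L400825", "我哋", "287"),
    ("L-hypothetical-2", "你哋", "287"),  # different lemma, same ID: flagged
]

print(ids_with_mismatched_lemmas(records))
```

With this approach, sharing an ID across homographs passes the check, while reusing an ID on a differently spelled lemma is reported.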


notable misspellings

Description: misspelling that appears in an authoritative list of misspellings (use only on forms)
Represents: misspelled word
Data type: String
Example 1: L3280-F1 → "fuscia" (incorrect for fuchsia (L3280))
Example 2: L3280-F1 → "fuschia" (incorrect for fuchsia (L3280))
Example 3: L36116-F1 → "abbonnemang" (incorrect for abonnemang (L36116))
Example 4: L36116-F1 → "abbonemang" (incorrect for abonnemang (L36116))
See also: Wikidata:Property proposal/correct form


This makes it possible to easily create e.g. a spell checker that recommends a correction.--So9q (talk) 21:23, 22 March 2020 (UTC)

We should define "notable misspelling". Here is my suggestion: the misspelling has to appear in one of the following sources:

  1. an authoritative source such as Retskrivningsordbogen (Q3398246)
  2. articles like [1] from an authoritative source, in this case Oxford University Press EL, the official global blog for Oxford University Press English Language Teaching
  3. one of Wikipedia's lists of misspellings, e.g. [2]
  4. Wikidata, with P31 misspelling (Q1984758), e.g. Rzehakinacea (Q33188867)
  5. a corpus explicitly approved by this community, with an occurrence over a certain threshold (yet to be created and decided).--So9q (talk) 18:45, 24 September 2020 (UTC)


 Support I support this proposal in this form (more in the linked discussion) with the condition we have applicable definition of common misspelling. --Lexicolover (talk) 12:17, 24 March 2020 (UTC)

See discussion here: Wikidata_talk:Lexicographical_data#Common_misspellings_data--So9q (talk) 19:55, 24 March 2020 (UTC)
Lexicolover stated that they suggest only using misspellings from an authoritative source.--So9q (talk) 18:49, 24 September 2020 (UTC)

Neutral we need something to solve this problem, but I'm not sure a simple property is the simplest solution here. A broader system for all sorts of variants would be more difficult but better in the long run, as correct/incorrect spelling is often not a binary situation (see "colour"/"color" in English; correctness is contextual here). Cheers, VIGNERON (talk) 20:44, 25 March 2020 (UTC)

@VIGNERON: I agree that there can be situations where it's more about style/culture than a clear misspelling. In that case I guess we would avoid marking it as a misspelling. Have you thought of a better way to handle the complexities of misspellings than what I have proposed?--So9q (talk) 19:04, 24 September 2020 (UTC)

Neutral: what is your definition of "common"? It sounds a bit arbitrary... Nomen ad hoc (talk) 07:30, 26 March 2020 (UTC).

@Nomen ad hoc: that point can easily be defined objectively by frequency. If a misspelling is over a threshold, let's say 5%, then it's "common". We can use a tool like Google Books Ngram Viewer to see the frequency. We can also rely on sources: dictionaries (especially the descriptivist ones) often give the common misspellings. Cheers, VIGNERON (talk) 08:55, 26 March 2020 (UTC)
@Nomen ad hoc: see proposed definition above.--So9q (talk) 10:41, 26 March 2020 (UTC)
@Nomen ad hoc: How would you define a common/notable misspelling?--So9q (talk) 18:51, 24 September 2020 (UTC)
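
VIGNERON's frequency criterion above can be sketched in a few lines of Python. The comment does not say what the 5% is measured against, so the sketch assumes it is the misspelling's share of all occurrences of the word (misspelled plus correct) in some corpus; the function name and the counts passed in are hypothetical.

```python
def is_common_misspelling(misspelled_count: int, correct_count: int,
                          threshold: float = 0.05) -> bool:
    """True if the misspelling's share of all occurrences exceeds `threshold`.

    Assumption: the share is misspelled / (misspelled + correct).
    """
    total = misspelled_count + correct_count
    if total == 0:
        return False  # no corpus evidence either way
    return misspelled_count / total > threshold


print(is_common_misspelling(6, 94))  # share 0.06
print(is_common_misspelling(1, 99))  # share 0.01
```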

 Support Finn Årup Nielsen (fnielsen) (talk) 11:25, 26 March 2020 (UTC)

@Fnielsen: WDYT about the definition of misspelling above?--So9q (talk) 19:04, 24 September 2020 (UTC)
  • Comment I preferred the initial version of this proposal [3] or the earlier proposal (correct form) using the form datatype. --- Jura 15:49, 26 March 2020 (UTC)
    • Actually, the direction of the earlier proposal seems preferable (correct form). If the form is only known as a misspelling, "grammatical feature" could include that too. If it's also something else, the "grammatical feature" would just include that "something else". --- Jura 13:28, 1 April 2020 (UTC)
  •  Oppose per above. --- Jura 13:59, 24 April 2020 (UTC)
  •  Oppose per Jura. - Premeditated (talk) 16:10, 14 November 2020 (UTC)
  • Comment I like the proposal in general. A thought: would it be good to specify the sense with a qualifier? I am sure there are examples where only specific senses of a lexeme get misspelled (but I struggle to come up with an example for now). Ainali (talk) 08:44, 28 November 2020 (UTC)
  • How about "notable misspelling" with description "misspelling that is in an authoritative list of misspellings"? ChristianKl❫ 22:31, 11 December 2020 (UTC)

I changed according to the suggestion from ChristianKl. New voting started. Please vote again below. @ChristianKl, vigneron, fnielsen, jura1, Premeditated, ainali:@Nomen ad hoc:--So9q (talk) 08:17, 17 December 2020 (UTC)

Always the same: what's your definition of "authoritative"? Nomen ad hoc (talk) 09:08, 17 December 2020 (UTC).
@ChristianKl: got any input on this? I would say "an individual or organization working professionally with dictionaries or language teaching in the language in question". WDYT?— 09:57, 17 December 2020 (UTC)
My input would be that it makes sense to define the term further. ChristianKl❫ 13:18, 17 December 2020 (UTC)
  • I still prefer the inverse approach. BTW, for users to see what you mean by "notable misspellings" from an "authoritative list": can you add corresponding references to the samples? Would autocorrects from OO qualify? --- Jura 10:26, 17 December 2020 (UTC)
  • Comment I previously supported this proposal, but I am now uncertain where I stand. It seems to me that language is not as fixed as a structured knowledge graph can represent. I think there is a gradualness to formness. While some forms are definitely forms, there are some written words that most would say are not forms but just plain misspelt, and then there are those in between. In Danish, there are some forms that have official alternative forms, which we can interlink with alternative form (P8530). For example, I recently added pizzeria (L348857) and accidentally added it as pizzaria. The issue is what "pizzaria" is. The form is not mentioned in the official Danish spelling, but it is listed in another important Danish dictionary. The official form is pizzeria, while "pizzaria" is an "unofficial, but common form".  – The preceding unsigned comment was added by Fnielsen (talk • contribs) at 17:04, 18 December 2020 (UTC).
  • Comment @So9q: Isn't this property completely redundant? If a form is instance of (P31) misspelling (Q1984758), then its correct form is any other form of the lexeme with the same set of grammatical features that is not itself a misspelling. If notability is defined by an authoritative source, it is already covered by references on the instance of (P31) misspelling (Q1984758) statement. If notability is defined by frequency, it can be inferred from a suitable corpus (or perhaps someday from more general frequency statements on forms). Robert Važan (talk) 17:51, 1 May 2021 (UTC)
@Robert Važan: Thanks for the comment. I agree with your points. Furthermore, I thought about what a misspelling really is. In my view it is intrinsically linked to 2 properties:
  1. cultural setting (what is a misspelling in one region might be accepted in another)
  2. time. Over time misspelled words can become appropriated and accepted.
These two properties increase the complexity of misspellings a lot, and if we add instance of (P31) misspelling (Q1984758) on forms I think we should also add point in time (P585) (ideally we would add start time and end time, but that is probably practically impossible to determine and find references for) and indigenous to (P2341) as qualifiers. I'm guessing we will have a hard time finding good references for misspellings. People seem more interested in correctly spelled words. An inverse approach of regarding all forms without a reference to an authoritative source as a misspelling might be more fruitful. I marked the proposal as abandoned.--So9q (talk) 06:10, 2 May 2021 (UTC)
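
The inference Robert Važan describes in his comment (the correct counterparts of a misspelled form are the other forms of the same lexeme that share its grammatical features and are not themselves misspellings) can be sketched as below. The `Form` class, the feature labels, and the form IDs marked as hypothetical are assumptions for illustration; only L3280-F1 and the spellings come from the proposal's examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    form_id: str
    representation: str
    features: frozenset          # grammatical features of the form
    is_misspelling: bool = False  # stands in for P31 misspelling (Q1984758)


def correct_forms(forms, misspelled):
    """Forms with the same grammatical features that are not misspellings."""
    return [
        f for f in forms
        if not f.is_misspelling and f.features == misspelled.features
    ]


singular = frozenset({"singular"})
plural = frozenset({"plural"})

# Forms of fuchsia (L3280); the "Fx"/"Fy"/"Fz" IDs are hypothetical.
forms = [
    Form("L3280-F1", "fuchsia", singular),
    Form("L3280-Fx", "fuchsias", plural),
    Form("L3280-Fy", "fuscia", singular, is_misspelling=True),
    Form("L3280-Fz", "fuschia", singular, is_misspelling=True),
]

fuscia = forms[2]
print([f.representation for f in correct_forms(forms, fuscia)])
```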

form of property constraint

Status: Under discussion
Description: qualifier to define a form constraint in combination with "property constraint" (P2302)
Data type: Form
Example 1: generational suffix (P8017): property constraint (P2302) → one-of constraint (Q21510859) → THIS property → Junior (L252247) (#F2)
Example 2: generational suffix (P8017): property constraint (P2302) → one-of constraint (Q21510859) → THIS property → Senior (L252248) (#F2)
Example 3: MISSING
Planned use: property constraint (P2302) for forms properties
See also: item of property constraint (P2305)


I have not worked much with lexeme-related properties, but I think this could be useful. It should work somewhat like item of property constraint (P2305). Premeditated (talk) 10:11, 23 September 2020 (UTC)


  • Comment It seems more like a "one-of-constraint". Not sure if that works for generational suffix (P8017) as it can have plenty of values. --- Jura 10:43, 23 September 2020 (UTC)
That is true. It could be added with reason for deprecation (P2241): constraint provides suggestions for manual input (Q99460987). The property can also be used as a qualifier for none-of constraint (Q52558054). There are not many properties that have data type form, yet. - Premeditated (talk) 13:06, 23 September 2020 (UTC)
Lexemes are associated with languages; it seems to me you want an item value here that is language-independent? Which the existing one-of-constraint should handle? ArthurPSmith (talk) 17:23, 23 September 2020 (UTC)
Ah, but I see that the property (ies?) in question is (are) form-valued. And I see the analogy with the item-valued one. Ok,  Support I guess. ArthurPSmith (talk) 17:28, 24 September 2020 (UTC)

Lucas Werkmeister (WMDE)
Jarekt - mostly interested in properties related to Commons
John Samuel
Yair rand
Jon Harald Søby
Was a bee
Peter F. Patel-Schneider
ZI Jony
Notified participants of WikiProject property constraints

  • I removed the "ready" flag. Proposing new property constraint properties should happen in discussion with the Wikiproject for property constraints given that it needs additional programming work.
@Lucas Werkmeister (WMDE): Are there any issues with adding this property into the constraint system? ChristianKl❫ 16:15, 20 November 2020 (UTC)
@ChristianKl: Shouldn’t be too hard to implement as far as I can tell, but it might take a while due to other things being prioritized above it. --Lucas Werkmeister (WMDE) (talk) 17:38, 20 November 2020 (UTC)
I don't think the other entity types need an "item of property constraint (P2305)". But that is only my thought. -Premeditated (talk) 20:24, 29 November 2020 (UTC)
Wikidata:WikiProject Properties/Reports/Datatypes has a column with some of the constraints on values. Format and range constraints frequently have similar functions.
If the same code would work for all of them, maybe it's worth adding them. --- Jura 09:12, 30 November 2020 (UTC)


location of lexeme usage

Status: Under discussion
Description: for lexemes which are considered regional, the main locations where the lexeme is used
Data type: Item
Allowed values: geographical locations
Example 1: carer (L290514) → caregiver (Q553079) (British word for caregiver)
Example 2: MISSING
Example 3: MISSING
See also: location of sense usage (P6084)


To be able to mark lexemes which are regional. Uziel302 (talk) 10:02, 26 March 2020 (UTC)


  • this seems very vague Nepalicoi (talk) 18:25, 19 May 2020 (UTC)
  • I question the need for this, given the existence of location of sense usage (P6084). It would only be useful when the same regional lexeme has multiple senses, and the exact same regional distribution applies to each one of those senses. How common is that in reality? Multiple senses typically don't have identical origins. SM5POR (talk) 12:34, 9 August 2020 (UTC)
  •  Oppose caregiver (Q553079) seems like a very bad example for a location. ChristianKl❫ 19:58, 14 December 2020 (UTC)
  •  Oppose See SM5POR's comment. Even if a word is currently only used in one region, it may start being used elsewhere with a different sense in other regions, so it's not actually a property of the lexeme. For example, the word "cookie". That was originally used in the US to mean a variety of flat round sweet items, some of which are soft and chewy, some of which are firm and crumbly. Only the latter are biscuits in British English, so the word "cookie" has since been adopted to refer to the former. It's also now used worldwide to refer to browser cookies. - Nikki (talk) 15:46, 15 February 2021 (UTC)

radical (P5280) for non-CJVK languages

What's the status of this? special:ListProperties?datatype=wikibase-lexeme gives no hints but something mentioned at lexeme talk:L1 made me wonder. Arlo Barnes (talk) 00:18, 30 October 2020 (UTC)

@Arlo Barnes: what is your question or proposal exactly? radical (P5280) is not meant for lexemes. The talk on L1 was old; two months later combines (P5238) was created, which can be used for that, I think. Cheers, VIGNERON (talk) 07:46, 1 March 2021 (UTC)
@VIGNERON: Thank you, I think that answers any questions I had. Arlo Barnes (talk) 19:58, 1 March 2021 (UTC)

The Language Council of Norway Termwiki ID

Description: Identifier for a term in the Language Council of Norway's termwiki
Represents: The Norwegian Language Council term wiki (Q55404177)
Data type: External identifier
Template parameter: no:Mal:Autoritetsdata
Domain: term (Q1969448)
Allowed values: [a-åA-Å]+[ [a-åA-ÅZ]+]*
Example 1: ocean current (Q129558) → Havstrømøm
Example 2: concept (Q151885) → Begrep
Example 3: recreational vehicle (Q746448) → Bobil
Example 4: elliptical galaxy (Q184348) → Elliptisk galakse galakse
Planned use: As identifier for terms
Number of IDs in source: 1000+
Expected completeness: always incomplete (Q21873886)
Formatter URL: $1
Robot and gadget jobs: Mix'n'Match
Wikidata project: Terms


The Language Council of Norway runs a termwiki where designated term groups within their working areas define their terms and get them approved by the governmental Norwegian Language Council (Q1348705). The content is released under the Creative Commons CC0 1.0 license. The term in the English language is also included. The content can be imported from Pmt (talk) 19:12, 15 January 2021 (UTC)

  •  Support external identifier John Samuel (talk) 18:02, 27 January 2021 (UTC)


By chance! This property already exists as Language Council of Norways termwiki ID (P5445): Identifier for a term in the Norwegian language, as given by an entry in the Language Council of Norways termwiki. This is a terminology database for academic disciplines. The terms are usually given as Bokmål, Nynorsk, and English variants. So kindly stop/delete this proposal. Really sorry for the inconvenience! Pmt (talk) 18:57, 11 February 2021 (UTC)