Property talk:P487

Documentation

Unicode character
Unicode character representing the item
Description: Unicode character of the item
Represents: Unicode character (Q29654788)
Associated item: Unicode Consortium (Q1572774)
Data type: String
Template parameter: tbd
Domain: All items that a Unicode character represents (note: this should be moved to the property statements)
Allowed values
According to this template: Unicode symbol
According to statements in the property:
(0[0-9A-F]{3}|1(0[0-9A-F]{2,4}|[1-9A-F][0-9A-F]{2,3})|[2-9A-F][0-9A-F]{3,4})|[^ \t\n\r"#%&'*+,/;<=>?\[\\\]\{\|\}]
When possible, data should only be stored as statements
Example: ſ (Q484140)ſ
da capo (Q1138573)𝄊
Latin cross (Q200674)
Vesta (Q178710)
basketball (Q5372)🏀
flag of Friesland (Q1004161)🏴󠁮󠁬󠁦󠁲󠁿
Source: Unicode specification (note: this information should be moved to a property statement; use property source website for the property (P1896))
Formatter URL: https://util.unicode.org/UnicodeJsps/character.jsp?a=$1
Tracking: usage: Category:Pages using Wikidata property P487 (Q26250051)
See also: Unicode code point (P4213), Unicode range (P5949), Unicode block (P5522)
Lists
Proposal discussion
Current uses
Total: 164,062
Main statement: 163,972 (>99.9% of uses)
Qualifier: 90 (<0.1% of uses)
Search for values
Format “(0[0-9A-F]{3}|1(0[0-9A-F]{2,4}|[1-9A-F][0-9A-F]{2,3})|[2-9A-F][0-9A-F]{3,4})|[^ \t\n\r"#%&'*+,/;<=>?\[\\\]\{\|\}]”: value must be formatted using this pattern (PCRE syntax). (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P487#Format, SPARQL
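A minimal Python sketch of how a value could be tested against the format pattern above. The pattern is copied from the constraint; the helper name `satisfies_format` is hypothetical, and matching the whole value with `fullmatch` (and using Python's `re` rather than PCRE) is an assumption about how the check is applied.

```python
import re

# Pattern copied from the format constraint. Applying it to the whole value
# with fullmatch is an assumption, not something the constraint text states.
FORMAT = re.compile(
    r"""(0[0-9A-F]{3}|1(0[0-9A-F]{2,4}|[1-9A-F][0-9A-F]{2,3})|[2-9A-F][0-9A-F]{3,4})|[^ \t\n\r"#%&'*+,/;<=>?\[\\\]\{\|\}]"""
)


def satisfies_format(value: str) -> bool:
    """Return True if the whole value matches the format pattern."""
    return FORMAT.fullmatch(value) is not None


if __name__ == "__main__":
    # A literal character, two hexadecimal values, a reserved character,
    # and a two-code-point string (male sign plus variation selector).
    for sample in ["\U0001F3C0", "1F3C0", "FA74", "#", "\u2642\uFE0E"]:
        print(repr(sample), satisfies_format(sample))
```

Under these assumptions a value passes either as an uppercase hexadecimal code point or as a single code point that is not in the excluded set.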
Allowed entity types are Wikibase item (Q29934200): the property may only be used on a certain entity type (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P487#Entity types
Scope is as main value (Q54828448): the property must be used only in the specified way (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P487#Scope, SPARQL
This property is being used by:

Please notify projects that use this property before big changes (renaming, deletion, merge with another property, etc.)

Values matching the pattern ^([Uu]\+)?([A-F0-9]{4,5})$ will be automatically replaced with \2 and moved to the Unicode code point (P4213) property.
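A short Python sketch of the substitution described above, assuming it behaves like an ordinary regular-expression replace. The helper name `strip_prefix` is hypothetical; moving the value to Unicode code point (P4213) is not modelled.

```python
import re

# Pattern and replacement (\2) are taken from the note above.
CODE_POINT = re.compile(r"^([Uu]\+)?([A-F0-9]{4,5})$")


def strip_prefix(value: str) -> str:
    """Drop an optional U+ prefix from a 4-5 digit hexadecimal value.

    Values that do not match the pattern are returned unchanged.
    """
    return CODE_POINT.sub(r"\2", value)


if __name__ == "__main__":
    for sample in ["U+1F3C0", "u+00E9", "00E9", "\U0001F3C0"]:
        print(repr(sample), "->", repr(strip_prefix(sample)))
```

Note that the pattern only covers 4 or 5 hexadecimal digits, so a six-digit value such as U+10FFFF would be left as it is.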
Testing: TODO list

Constraints

As Wikidata:Database reports/Constraint violations/P487 shows, both constraints do not make sense. Infovarius (talk) 08:45, 15 June 2013 (UTC)

Format violations

I don't see how all these items violate the format "." - could it be that the regex is being misinterpreted and the bot doesn't allow arbitrary Unicode characters for '.'? --DSGalaktos (talk) 18:50, 25 June 2013 (UTC)

Apparently, this got fixed. --DSGalaktos (talk) 21:34, 5 October 2013 (UTC)

Dutch translation

The Dutch translation of this property is wrong, both in spelling and in wording. I think it should be 'Unicode-symbool'. Can I change this? Bever (talk) 01:14, 25 July 2013 (UTC)

External identifier

Isn't this an identifier for the item within the Unicode character table? --- Jura 10:46, 14 October 2015 (UTC)

More format violations

Where is the format actually defined? I added some links to Unicode characters which got flagged as format violations. They actually use 8 bytes to encode, which is unusual but valid UTF-8. – Jberkel (talk) 23:01, 15 August 2016 (UTC)

Distinct values constraint bug

It seems that something is broken in the distinct values constraint check. For example, the articles for these similar Cyrillic characters (Q5809477, Q4914472, Q5299849, Q6901498) variably tag each other as violating the distinct values constraint, even though they visibly have distinct values. I suspect something is up with Unicode support in the software, but have no idea what would cause this bug. Mathmitch7 (talk) 18:27, 3 August 2018 (UTC)

I would imagine that this is caused by Unicode equivalence, which means that it's this property which needs to be implemented in a different way rather than something inherently wrong with the distinct values constraint. If the values are simply entered as a character, any comparison will find equivalent characters - even if they have different code points. This is an inherent aspect of the Wikimedia software, as far as I know, as 99% of the time it's a desirable feature. If you put ꙮ, Ꙫ, Ꙭ or Ꙩ into Ctrl+F, you'll notice that your browser considers them equivalent, too. My inclination is that it would help if we could enter the relevant HTML entity directly into the page source, which here would be &#xa66e;, &#xa66a;, &#xa66c; and &#xa668; (respectively). Even though they'll display as the actual characters, any code accessing the page source to do a comparison should hopefully be comparing the entity strings (and therefore should be perceiving them as different). Not sure how this could be implemented on Wikidata, however. Theknightwho (talk) 22:34, 15 November 2021 (UTC)
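One way to probe the equivalence hypothesis is to check whether the four values collapse into one another under a Unicode normalization form. The following is a minimal Python sketch using the standard `unicodedata` module; the helper name `distinct_after` is hypothetical. It only tests normalization and does not reproduce whatever comparison the constraint checker or a browser search actually performs (for example a database collation), which this thread does not describe.

```python
import unicodedata

# The four letters named in the comment above, by code point:
# U+A66E, U+A66A, U+A66C, U+A668.
LETTERS = ["\uA66E", "\uA66A", "\uA66C", "\uA668"]


def distinct_after(form: str, values: list) -> bool:
    """Return True if all values remain pairwise distinct after normalization."""
    normalized = {unicodedata.normalize(form, value) for value in values}
    return len(normalized) == len(values)


if __name__ == "__main__":
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, distinct_after(form, LETTERS))
```

If the values stay distinct under every form, normalization alone would not explain the reported violations, and the comparison step would be the next thing to look at.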

Multi-character strings

I am about to remove [1] – a string of six code points. Incnis Mrsi (talk) 07:15, 13 July 2019 (UTC)

[2] is wrong. Control characters (those having General_Category = Cx) are counted separately. These are modifiers (General_Category = Mx), which are appended to the preceding character. Incnis Mrsi (talk) 18:25, 14 July 2019 (UTC)
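The General_Category of each code point in a value can be inspected with a few lines of Python. This is a sketch using the standard `unicodedata` module; the helper name `categories` is hypothetical, and the sample string is an illustration rather than one of the values discussed in [1] or [2].

```python
import unicodedata


def categories(text: str) -> list:
    """Return (code point, General_Category) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


if __name__ == "__main__":
    # Male sign followed by a variation selector: a symbol plus a mark.
    for code_point, category in categories("\u2642\uFE0E"):
        print(code_point, category)
```

Categories starting with M are marks attached to the preceding character, while those starting with C include controls and format characters.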

Formatter URL

This broke when the Consortium updated their sites; I'm now checking whether there's a replacement link. Arlo Barnes (talk) 03:43, 3 June 2020 (UTC)

Limit to special Unicode-character items only?

User:Theknightwho has redefined the constraints so that the property is now obviously expected to be used only on specialized items representing Unicode characters (e.g. (Q87526993) or (Q87505717)). While I can see some logic in that, and even agree it might be useful,

  1. It contradicts the original proposal and even the examples shown in the property definition.
  2. There are thousands of other uses of the property currently (e.g. Mars (Q111)Unicode character (P487)♂︎) which are correct according to the original definition.
  3. We have no good property to replace these uses currently, IIANM. (I.e. being able to state “there is a Unicode character representing this concept” (“Mars can be represented by ♂︎”) is useful. If this usage would be outlawed, how are we supposed to represent that?)

So, I am reverting the newly added restrictions, at least pending further discussion. --Mormegil (talk) 15:30, 24 November 2021 (UTC)

Mormegil: So I agree this does need discussion, and I probably did jump the gun. My perspective is:
  1. This property should only be used on Unicode characters.
  2. We should use properties such as depicted by (P1299), notation (P913) or icon (P2910), depending on which is the most applicable, with a link to the relevant item (in this case (Q87526785)). This therefore also covers symbols which aren't yet encoded in the standard.
As an aside, the items for Unicode characters are currently a bit misleading, as they're not supposed to be specific to Unicode. For example, (Q87526785) is not "the male sign as encoded in Unicode". It's "the male sign", and any other information associated with that character should also be added, whether or not it's specific to Unicode (e.g. background info, other encodings etc.). The descriptions seem to be a quirk of the fact that whoever did the batch upload just gave all of them the description "Unicode character". There is a long road ahead of item merges, but that's not a conversation for here. Theknightwho (talk) 15:58, 9 December 2021 (UTC)
OK, the redefinition seems quite logical to me (even though, process-wise… it’s interesting the usage would go so far from the original definition… never mind). icon (P2910) has the datatype “Commons media file”, i.e. it is used for image files, so that is obviously out. depicted by (P1299) is used for creative works depicting the subject, e.g. Daniel (Q171724)depicted by (P1299)Book of Daniel (Q80115) (its values are restricted to work (Q386724), artwork series (Q15709879), performing artist (Q713200), artistic theme (Q1406161), caricature (Q482919)), so I’d say that does not fit very well. On the other hand, notation (P913) seems to be quite apt, even though the current description (“mathematical notation or another symbol”) seems to be a bit mathematically inclined (also, it’s an instance of Wikidata property related to mathematics (Q22988631)); but still, just a small tweak of the description and it would be a good fit, I’d say (probably discuss there first?).
So, the best way forward would be 1. agree on P913 to be used for the current usage and tweak its definition, 2. migrate all current non-canonical usage to P913, 3. change the constraints here?
--Mormegil (talk) 16:38, 13 December 2021 (UTC)
So I mostly agree, but I actually think that depicted by (P1299) does apply in a few instances. For example, the character exclamation mark (Q166764) (an exclamation mark) is depicted by the characters (Q87527388), (Q87527394), ! (Q87544533), (Q87544151) and (Q87544008), as they are all specific types of exclamation mark. This is conceptually different from notation (P913), which I agree needs to be broadened to mean "symbolically represented by" (but that's a really awkward way of putting it). Theknightwho (talk) 17:25, 13 December 2021 (UTC)

Regexp constraint

Note: the Unicode tool only accepts a single code point, not an arbitrary single character.

  • If we pass a "character", it must not be a special character reserved by the URL syntax for query strings or a forbidden one (like controls), and it must not be a "compatibility character" (it must be in NFC form);
  • So the Unicode tool also accepts a hexadecimal code point value to bypass all these limitations. The hexadecimal value must then have at least 2 digits: passing "00" will query the NULL control U+0000, but passing "0" will query the DIGIT ZERO U+0030. This makes it possible to query compatibility characters like U+FA74 (whose canonical decomposition mapping is U+5145) by passing "FA74"; passing the single character "充" will not work, as Wikidata stores it in NFC form, converting U+FA74 to U+5145, so you'll get a URL querying the CJK unified character instead of the expected CJK compatibility character.
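The NFC conversion mentioned in the second point can be reproduced with Python's standard `unicodedata` module. This is a small sketch; the helper name `nfc_code_points` is hypothetical, and it only illustrates the normalization step, not how Wikidata stores values internally.

```python
import unicodedata


def nfc_code_points(text: str) -> list:
    """Return the hexadecimal code points of text after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    return [f"{ord(ch):04X}" for ch in normalized]


if __name__ == "__main__":
    # U+FA74 is the CJK compatibility ideograph used as the example above.
    print(nfc_code_points("\uFA74"))
```

Because the compatibility ideograph does not survive NFC, only the hexadecimal form "FA74" can refer to it unambiguously.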

So I had to change the constraint regexp to (0[0-9A-F]{3}|1(0[0-9A-F]{2,4}|[1-9A-F][0-9A-F]{2,3})|[2-9A-F][0-9A-F]{3,4})|[^ \t\n\r"#%&'*+,/;<=>?\[\\\]\{\|\}], where the second alternative (passing the character directly) is limited, but the first alternative (passing the hexadecimal code point with 4-6 digits) always works (and it is the internal recommended format, as you can see on the Unicode tool: use the "+" and "-" buttons to navigate to the next/previous code points and look at the generated query string in the browser's visited URL).

I could have added other restrictions to the regexp for the valid hexadecimal range, or excluded more characters (notably compatibility characters, but Wikidata will not let you enter them in its online edit form, which automatically converts the submitted input to NFC form).
The Unicode tool itself tests the length of the text in its own search form: 2 characters or more must be a single valid hexadecimal value; otherwise it will query only the first character submitted in its search form and will display a warning if there are more.
This new regexp in the Wikidata constraint now accepts querying everything useful on arbitrary valid code points (including controls, combining marks, non-characters, and surrogates!), but won't accept values from Wikidata that contain multiple code points (after their conversion to NFC form by the Wikidata editor itself). If the value is not accepted by the regexp, it's because it contains, for example, some invisible controls or combining marks (including variation selectors), or because it becomes a combining sequence once converted to NFC (you can't pass it any unencoded non-characters, isolated surrogates or most controls).
In that case you need to check that the string in Wikidata is effectively a single code point. Retry using the hexadecimal code point, visit the link, and verify the hexadecimal code point value displayed in the result on the Unicode tool. Then copy the character displayed on that result page, go back to Wikidata, paste it into the Wikidata editor and retry. If this does not work, just keep the hexadecimal code point in Wikidata (and make sure it has at least four digits, even if the Unicode online tool does not care about missing or extra right-padded zeroes when you just pass 2 or 3 hex digits, or more than 6).
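The single-code-point check and the hexadecimal fallback described above could be sketched in Python as follows. The URL template is the property's formatter URL; the helper names `hex_value` and `tool_url` are hypothetical, and normalizing to NFC before counting code points is an assumption based on the description of the Wikidata editor.

```python
import unicodedata

# Formatter URL as listed for this property; $1 is the value placeholder.
FORMATTER_URL = "https://util.unicode.org/UnicodeJsps/character.jsp?a=$1"


def hex_value(text: str):
    """Return the code point as at least four uppercase hex digits.

    Returns None when the NFC form of text is not exactly one code point.
    """
    normalized = unicodedata.normalize("NFC", text)
    if len(normalized) != 1:
        return None
    return f"{ord(normalized):04X}"


def tool_url(text: str):
    """Build the Unicode tool URL for a single-code-point value, else None."""
    value = hex_value(text)
    return None if value is None else FORMATTER_URL.replace("$1", value)


if __name__ == "__main__":
    print(tool_url("\U0001F3C0"))    # single code point
    print(tool_url("\u2642\uFE0E"))  # two code points, so None
```

Storing the hexadecimal form sidesteps both the URL-reserved characters and the NFC conversion, at the cost of the value no longer being the character itself.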

Hope this will help. Verdy p (talk) 21:30, 26 April 2024 (UTC)