Wikidata:Property proposal/ID pattern
Id pattern
Originally proposed at Wikidata:Property proposal/Generic
Description | A replacement pattern used to form an external ID, to be used together with applies if regular expression matches (P8460) |
---|---|
Data type | String |
Domain | property |
Allowed values | A valid replacement pattern, with $1 for the first capture group, $2 for the second, and so forth |
Example 1 | X username (P2002) → [id build pattern] → $1 applies if regular expression matches (P8460) → /^https?:\/\/(?:mobile\.)?twitter\.com\/(?:intent\/user\?screen_name\=)?(?!hashtag|home|explore|notifications|messages|i)([0-9A-Za-z_]{1,15})/ |
Example 2 | subreddit (P3984) → [id build pattern] → $1 applies if regular expression matches (P8460) → /^https?:\/\/www\.reddit\.com\/r\/([^\/?#]+)\/ |
Example 3 | MusicBrainz artist ID (P434) → [id build pattern] → $1 applies if regular expression matches (P8460) → /^https?:\/\/musicbrainz\.org\/artist\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/ |
Example 4 | MusicBrainz artist ID (P434) → [id build pattern] → $1 applies if regular expression matches (P8460) → /^https?:\/\/www\.bbc\.co\.uk\/music\/artists\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/ |
Example 5 | Fandom article ID (P6262) → [id build pattern] → $1:$2 applies if regular expression matches (P8460) → /https?:\/\/([a-z0-9\.-]+)\.fandom\.com\/wiki\/([^\s#\?]+)/ |
Example 6 | Fandom article ID (P6262) → [id build pattern] → $2.$1:$3 applies if regular expression matches (P8460) → /https?:\/\/([a-z0-9\.-]+)\.fandom\.com\/([\w]+)\/wiki\/([^\s#\?]+)/ |
Source | list |
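To illustrate how Examples 5 and 6 are meant to be read, here is a minimal Python sketch that applies an ID pattern to the capture groups of the matching regular expression. The function name `build_id` and the sample URLs are assumptions made for illustration only; the regular expressions follow the examples above, with the literal dot before `fandom` escaped.

```python
import re

def build_id(url, regex, id_pattern):
    """Return the ID built from id_pattern, or None if regex does not match url.

    Hypothetical helper: id_pattern uses $1, $2, ... for capture groups.
    """
    match = re.search(regex, url)
    if match is None:
        return None
    # Replace each $n in the pattern with the text of capture group n.
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), id_pattern)

# Regular expressions from Examples 5 and 6 (dot before "fandom" escaped).
FANDOM = r"https?:\/\/([a-z0-9\.-]+)\.fandom\.com\/wiki\/([^\s#\?]+)"
FANDOM_LANG = r"https?:\/\/([a-z0-9\.-]+)\.fandom\.com\/([\w]+)\/wiki\/([^\s#\?]+)"

# Sample URLs are illustrative only.
print(build_id("https://starwars.fandom.com/wiki/Yoda", FANDOM, "$1:$2"))
print(build_id("https://starwars.fandom.com/de/wiki/Yoda", FANDOM_LANG, "$2.$1:$3"))
```

Under these assumptions the first call yields `starwars:Yoda` and the second `de.starwars:Yoda`.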
Motivation
I am currently working on a browser extension that, among other things, displays Wikidata entities for websites the user visits. To do that, it must know which websites are associated with which external identifier on Wikidata. For instance:
The URL https://twitter.com/timberners_lee
contains the Twitter handle timberners_lee
which in Wikidata is associated with Tim Berners-Lee (Q80).
Currently, the extension uses a static list of regular expressions that only a git contributor is able to expand. A Wikidata property would make it much easier to contribute entries to this list. Plus, other extensions could certainly use it too.
It is crucial that the expression returns only a single capture group, which contains only the ID. Other groups must be non-capturing. --Shisma (talk) 16:57, 17 November 2020 (UTC)
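The lookup described above could be sketched roughly as follows in Python. This is not the extension's actual code: the `MATCHERS` list and the function `find_external_id` are hypothetical names, and the regular expressions are copied from the examples in this proposal.

```python
import re

# Hypothetical static list: (property ID, regex with exactly one capturing group).
MATCHERS = [
    ("P2002", r"^https?:\/\/(?:mobile\.)?twitter\.com\/(?:intent\/user\?screen_name\=)?"
              r"(?!hashtag|home|explore|notifications|messages|i)([0-9A-Za-z_]{1,15})"),
    ("P3984", r"^https?:\/\/www\.reddit\.com\/r\/([^\/?#]+)\/"),
    ("P434", r"^https?:\/\/musicbrainz\.org\/artist\/"
             r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"),
]

def find_external_id(url):
    """Return (property ID, external ID) for the first matching regex, else None."""
    for prop, regex in MATCHERS:
        match = re.match(regex, url)
        if match:
            # Only one capturing group exists, so group 1 is the ID.
            return prop, match.group(1)
    return None

print(find_external_id("https://twitter.com/timberners_lee"))
print(find_external_id("https://twitter.com/hashtag/wikidata"))
```

Because every other group is non-capturing, the caller never has to know which group holds the ID. Note that the negative lookahead in the Twitter expression has no trailing boundary, so as written it also rejects any handle that merely starts with one of the listed words, including every handle starting with `i`.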
Discussion
- Support--Trade (talk) 19:10, 17 November 2020 (UTC)
- Support this would be helpful for programs I am currently working on. BrokenSegue (talk) 20:05, 17 November 2020 (UTC)
- Comment Isn't this something that external users can infer from the URL formatter property (plus the regex property if it is also present)? (I think @99of9: might have some interesting thoughts on this proposal.) Mahir256 (talk) 01:54, 18 November 2020 (UTC)
- Mahir256 Here is one example where this approach wouldn't work. Let's take X username (P2002):
  - we convert https://twitter.com/$1 into a regular expression → https:\/\/twitter\.com\/$1
  - and replace $1 with [0-9A-Za-z_]{1,15} wrapped in a capture group → https:\/\/twitter\.com\/([0-9A-Za-z_]{1,15})
- Now this looks like a sensible regular expression until you notice that it believes hashtag is a Twitter user. I have tried to do that, but I found that neither formatter URL (P1630) nor format as a regular expression (P1793) is designed or used to produce a useful regular expression by that scheme. If you wish, I can give you more examples. --Shisma (talk) 07:59, 18 November 2020 (UTC)
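The failure mode in this example can be reproduced with a short Python sketch. The function `naive_url_regex` is a hypothetical name for the scheme discussed here (derive a matcher from the formatter URL and the ID format); it is not taken from any existing tool.

```python
import re

def naive_url_regex(formatter_url, id_format):
    """Escape the formatter URL and replace $1 with the ID format in a capture group."""
    escaped = re.escape(formatter_url)  # the "$1" placeholder becomes "\$1"
    return escaped.replace(re.escape("$1"), "(" + id_format + ")")

regex = naive_url_regex("https://twitter.com/$1", "[0-9A-Za-z_]{1,15}")

# A real profile URL is matched as intended ...
print(re.match(regex, "https://twitter.com/timberners_lee").group(1))
# ... but so is a hashtag page, whose first path segment is not a username.
print(re.match(regex, "https://twitter.com/hashtag/wikidata").group(1))
```

The second call extracts `hashtag` as if it were a username, which is the false positive described above.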
- Comment Entity Explosion (Q98398855) is not that far from what you are trying to do. Thierry Caro (talk) 16:25, 18 November 2020 (UTC)
- Looking at the source of that extension, it seems to do what Mahir256 suggests, which is known to be buggy and unreliable (but probably works 90% of the time). BrokenSegue (talk) 18:26, 18 November 2020 (UTC)
- There are similar proposals at Wikidata:Property_proposal/URL_match_pattern. --- Jura 19:27, 18 November 2020 (UTC)
- Hmmm, that proposal seems better in that it allows multiple matchers. But it seems worse because it can't handle multiple regexes per property. It's disheartening that it has been stuck for months... BrokenSegue (talk) 20:26, 18 November 2020 (UTC)
- I don't see why it couldn't have multiple regexes .. if it lingers there, it's probably that its proposer lost interest. --- Jura 02:09, 19 November 2020 (UTC)
- @Jura1: Oh, I misunderstood the proposal. Yeah that proposal seems strictly better than this one now. I'll move my support there. BrokenSegue (talk) 03:53, 19 November 2020 (UTC)
- @BrokenSegue: I think the template on that page should be updated if it's to take into account applies if regular expression matches (P8460), created in the meantime. --- Jura 04:55, 19 November 2020 (UTC)
- @Jura1: Yes I agree. Seems like a trivial change but I don't want to unilaterally alter that submission and @GZWDer: is no longer active it seems. Maybe @Shisma: will alter this proposal to match? BrokenSegue (talk) 15:16, 19 November 2020 (UTC)
- @BrokenSegue: you may alter this proposal. --Shisma (talk) 18:17, 19 November 2020 (UTC)
- sorry I don't understand how applies if regular expression matches (P8460) relates to this --Shisma (talk) 18:57, 19 November 2020 (UTC)
- @Shisma: the proposal is that we make a new property that explains how to use the output of regex capture groups. So this proposal would change to a property that would take the value \1:\3 and that would have a qualifier applies if regular expression matches (P8460) → https:\/\/([a-z0-9\.-]+)\.(wikia|fandom)\.com\/wiki\/(.*). So if that regex matches, then you take the capture groups and plug them into the value to produce the identifier. BrokenSegue (talk) 19:29, 19 November 2020 (UTC)
- @BrokenSegue: that would even solve some edge cases 👍. But most properties will be set to \1, right? --Shisma (talk)
- @Shisma: yeah that's my understanding. BrokenSegue (talk) 16:49, 20 November 2020 (UTC)
- well, it seems counter-intuitive but it's actually better --Shisma (talk) 17:26, 20 November 2020 (UTC)
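As a rough illustration of the backslash notation used in this exchange: Python's `re.Match.expand` accepts `\1`-style references directly, so the value `\1:\3` can be applied to the qualifier regex as sketched below. The sample URL is an assumption for illustration.

```python
import re

# Qualifier regex and value as given in the comment above.
REGEX = r"https:\/\/([a-z0-9\.-]+)\.(wikia|fandom)\.com\/wiki\/(.*)"
VALUE = r"\1:\3"

match = re.match(REGEX, "https://starwars.fandom.com/wiki/Yoda")
# Group 2 (wikia|fandom) is captured but simply not referenced by the value.
identifier = match.expand(VALUE) if match else None
print(identifier)
```

A property whose regex has a single capture group would use the value `\1`, which returns that group unchanged.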
- @BrokenSegue: and @Trade: I updated the proposal. Please check if you still support it. Feel free to make changes --Shisma (talk) 17:44, 20 November 2020 (UTC)
- Sorry for the back and forth, but looking at it now, I actually prefer the initial version. It makes the use case clear even if the format is similar to P8460. Supposedly we could have an optional ID pattern, but I find it dubious to make it the main value, especially as the only use case is a Wikidata property that IMHO shouldn't have been defined that way. The initial version also seems to make it clearer how to include formatting variations. BTW, I think Krbot's Autofixes don't use the leading "/" and seem to add "^" directly. To make a long story short, I will support the one at Wikidata:Property_proposal/URL_match_pattern. --- Jura 12:30, 21 November 2020 (UTC)
- @Jura1: I'd say the initial version is almost identical (without the replacement pattern) to Wikidata:Property proposal/URL match pattern, isn't it? I don't mind either way and support both proposals, but only one of them should pass. --Shisma (talk) 13:25, 21 November 2020 (UTC)
- @Jura1: Can you clarify? You're supporting the version of that proposal that uses <replacement value> as a delimiter to stuff two pieces of data into one field? Or one of the modified proposals? BrokenSegue (talk) 17:17, 21 November 2020 (UTC)
@BrokenSegue, Jura1: Are there any advantages or disadvantages that one proposal might have over the other? To me it appears like this: --Loominade (talk) 10:22, 23 November 2020 (UTC)
Match pattern | ID pattern |
---|---|
Pro: more intuitive | Pro: re-uses existing property |
Pro: has a default value | |
- @Loominade: that is also my understanding. The match pattern one also allows for a "default" case of "\1" but I'm not sure that matters very much. I really don't care which we go with. BrokenSegue (talk) 17:44, 23 November 2020 (UTC)
- Actually, I don't like re-using applies if regular expression matches (P8460) for this. The other proposal was marked as ready yesterday. Perhaps I should withdraw. --Shisma (talk) 18:15, 24 November 2020 (UTC)