Wikidata:Property proposal/urn formatter

urn formatter

Originally proposed at Wikidata:Property proposal/Generic

   Done: URN formatter (P7470) (Talk and documentation)
Description: formatter to generate URN from property value. Include $1 to be replaced with property value.
Represents: URN (Q76497)
Data type: String
Domain: Wikidata properties
Allowed values: urn:.+. Including variables ($1). Default capturing regex on property values is ([.]+)
Example 1: ISBN-13 (P212) → urn:isbn:$1
Example 2: ISSN (P236) → urn:ISSN:$1
Example 3: RfC ID (P892) → urn:ietf:rfc:$1
Example 4: URN-NBN (P4109) → $1
Source: IETF RFC 8141: Uniform Resource Names (URNs), IANA: Uniform Resource Names (URN) Namespaces
Planned use: Add formatters to ISBN-13 (P212), ISSN (P236), RfC ID (P892) and in future use this for rendering.
Expected completeness: always incomplete (Q21873886)
Robot and gadget jobs: Bots can use this to cross reference entities between Wikidata and other databases.
See also: formatter URL (P1630), format as a regular expression (P1793)
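
As an illustration of how the formatter strings in the examples above might be applied, here is a minimal Python sketch. The formatter strings are copied from the table; the function name `format_urn`, the lookup table `URN_FORMATTERS`, and the sample identifier values are assumptions made for this example and are not part of the proposal.

```python
# Formatter strings copied from Examples 1-4 of the proposal table.
# The dictionary itself is a hypothetical stand-in for reading the
# formatter statement from each property.
URN_FORMATTERS = {
    "P212": "urn:isbn:$1",      # ISBN-13
    "P236": "urn:ISSN:$1",      # ISSN
    "P892": "urn:ietf:rfc:$1",  # RfC ID
    "P4109": "$1",              # URN-NBN: the value is used unchanged
}


def format_urn(property_id: str, value: str) -> str:
    """Substitute a property value for the $1 placeholder."""
    formatter = URN_FORMATTERS[property_id]
    # Plain string replacement: only the single $1 variable is handled.
    return formatter.replace("$1", value)


if __name__ == "__main__":
    # Sample values are illustrative only.
    print(format_urn("P892", "8141"))
    print(format_urn("P212", "978-3-16-148410-0"))
```

Plain string replacement is enough here because every example in the table uses only `$1`.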

Motivation

Following discussion on project chat. Please add more samples. --- Jura 08:22, 5 October 2019 (UTC)

Discussion

One wrinkle is that not all downstream uses of P1630 may be implementing it the way above. For instance, given the current limitations, queries can get away with implementing P1630 using
BIND(IRI(REPLACE(?fmt, '\\$1', ?id)) AS ?url)
Does that matter? For queries I would say not. Existing formatters for properties that only use $1 would continue to work, however the transformation has been coded. They would (silently) give bad transformations for new property formatters that want to use $2 and $3 -- but new code would be needed anyway for these, to recognise and implement the qualifier regex.
Yes, it will need an update to the Wikidata GUI code for the qualifier regex to be supported; but it seems to be a direction we ought to be moving in. So, Support this property, and Support the idea for a new qualifier property, to be able to specify regexes to be used in conjunction with it and with P1630. The qualifier property will have to be 'on hold' until the GUI can support it, but that's no reason not to suggest it, approve it, and thus indicate to the developers that this is something that the community would like.
But one thing I would say is that this property and P1630 need to work exactly the same way, or this will get very confusing. If this property is to be allowed to take a regex qualifier, then P1630 needs to be able to take it too; if P1630 won't take a regex qualifier, then this property shouldn't either. But in my view it makes sense to support both this property, and any qualifier regex proposal, for use both on this and P1630. Jheald (talk) 17:16, 7 October 2019 (UTC)
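
To make the point about $1-only implementations concrete, here is a minimal Python sketch. It assumes a hypothetical formatter `urn:example:$1:$2` and a hypothetical capture regex `(\w+)-(\w+)`; neither exists on any property, and the function names are invented for this example. The first function mirrors the `REPLACE(?fmt, '\\$1', ?id)` approach, the second shows roughly what new code for a capture regex would have to do.

```python
import re


def naive_format(formatter: str, value: str) -> str:
    """$1-only substitution, comparable to REPLACE(?fmt, '\\$1', ?id)."""
    return formatter.replace("$1", value)


def capture_format(formatter: str, value: str, capture_regex: str) -> str:
    """Fill $1, $2, ... from the capturing groups of capture_regex."""
    match = re.fullmatch(capture_regex, value)
    if match is None:
        raise ValueError(f"{value!r} does not match {capture_regex!r}")

    def substitute(placeholder: re.Match) -> str:
        # "$2" in the formatter refers to capturing group 2, and so on.
        return match.group(int(placeholder.group(1)))

    return re.sub(r"\$(\d)", substitute, formatter)


if __name__ == "__main__":
    # Hypothetical formatter and value, for illustration only.
    print(naive_format("urn:example:$1:$2", "ab-cd"))
    print(capture_format("urn:example:$1:$2", "ab-cd", r"(\w+)-(\w+)"))
```

With the hypothetical two-variable formatter, the naive version leaves `$2` in the output without raising any error, which is the silent bad transformation described above. Formatters that only use `$1` give the same result either way.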
@Jheald: Maybe we should have a regex and non-regex variant so that the non regex one can be used in cases where it is sufficient and the regex one can be used where it will not be sufficient? @ArthurPSmith: Iwan.Aucamp (talk) 05:28, 8 October 2019 (UTC)
Development team asked for thoughts re "capture regex" qualifier suggestion. Jheald (talk) 08:01, 8 October 2019 (UTC)
  • Oh .. I didn't think this would get that much feedback. Thanks for your input.
    I had in mind that format as a regular expression (P1793) should be read if present for one capturing group, but I suppose it could work for several capturing groups too. The alternative for these would be that the property doesn't work (or another would be needed).
    While some support by Wikibase might be interesting, I wouldn't rely on it and I think this should primarily work when retrieved from query server. Jheald outlined some of the possible queries.
    I don't think this needs to work the same way as formatter URL (P1630), even though there is a phab ticket to do that there too and the gadget for formatter supports some of it.--- Jura 08:35, 8 October 2019 (UTC)
  • BTW, I thought to use format as a regular expression (P1793), but maybe a new property is needed. --- Jura 08:42, 8 October 2019 (UTC)
  • Lucas Werkmeister (WMDE) has got back about this at WD:DEV (thread). Essentially, the team don't know a way to securely evaluate user-supplied regexes in PHP. (See eg this Q on StackOverflow for some discussion of how allowing users to submit their own regexes for execution can lead to Denial of Service risks, sometimes brought on not even intentionally, just through coding errors). At the moment the system uses SPARQL to process constraints involving format as a regular expression (P1793), in part because that's then safely in a sandbox that will time queries out if they take too long. But it's apparently not possible to securely/efficiently do such sandboxing in PHP.
For the moment, it may be best to get on with passing the basic proposal here for a urn-formatter, without the qualifier. I do think the ability to be able to specify a "capture regex" would be very useful, and would be worth proposing separately. But it may be quite a challenge to find ways to implement it safely, and/or to require that only pre-approved regexes would be allowed. I suspect that probably there are ways that could be found to make it work; but it's going to need some thought, which would probably be better in its own discussion, split off from this basic property proposal (which in itself ought to be a sure thing). Jheald (talk) 13:02, 8 October 2019 (UTC)
I think let's remove the regex mention from description then, if that is okay with you @Jura1: Iwan.Aucamp (talk) 23:04, 8 October 2019 (UTC)
@Jura1: @Tinker_Bell: @ArthurPSmith: @Jheald: I removed reference to regex from the proposal - notifying in case any of you want to change your votes. Iwan.Aucamp (talk) 21:08, 9 October 2019 (UTC)
  • The current samples don't seem to require it. So I suppose one can. Default regex would probably be "(.*)". If we find some requiring it, maybe we have to re-add it.
    BTW I don't see how the discussion with Lucas is relevant. I think he writes about format constraints that are supported by Wikibase, but need to be processed in Query Server. This property proposal isn't supported by a special Wikidata feature and one would need to use Query Server anyways. Maybe I missed the part when PHP gets relevant for this proposal. --- Jura 19:00, 10 October 2019 (UTC)
  • A quirk of the SPARQL engine is that the default regex has to be '(^.*)', otherwise the substitute string seems to get put in twice. Probably a Blazegraph implementation bug, but adding the caret makes the issue go away (as well as making the regex very slightly more efficient).
It's an interesting point that as a URN isn't a clickable link, the GUI doesn't have to process the identifier string, so maybe a "capture regex" qualifier would be more acceptable. On the other hand, in my view we owe a duty of care to downstream reusers not to potentially hose their systems by proffering a regex that could cause a denial of service. So, at the very least, if we were to go ahead with such a field, IMO it ought to be locked to be edited only by an admin, under instructions not to make such an edit unless they were sure it could not be harmful (and competent to make such an assessment). Jheald (talk) 21:22, 10 October 2019 (UTC)
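
The doubled substitution described for '(.*)' can be reproduced outside SPARQL. The minimal Python sketch below (Python 3.7 or later) uses the standard `re` module, where a global replace with `(.*)` matches the whole value and then matches once more on the empty string at the end, so the replacement is emitted twice. Whether Blazegraph's behaviour has the same cause is an assumption that has not been checked here; the sample value and variable names are illustrative.

```python
import re

value = "8141"  # illustrative RfC ID value

# Unanchored: one match on "8141", then a second, empty match at the end
# of the string, so the replacement text is written twice.
doubled = re.sub(r"(.*)", r"urn:ietf:rfc:\1", value)

# Anchored: the caret only matches at the start, so the trailing empty
# match is rejected and the replacement is written once.
anchored = re.sub(r"(^.*)", r"urn:ietf:rfc:\1", value)

if __name__ == "__main__":
    print(doubled)   # urn:ietf:rfc:8141urn:ietf:rfc:
    print(anchored)  # urn:ietf:rfc:8141
```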
So you want to change the way format as a regular expression (P1793) is currently used? --- Jura 09:42, 11 October 2019 (UTC)
@Jura1: I am less worried about P1793, as that is used to specify a regex for constraint checking, which is essentially a process internal to Wikidata, which according to the dev team is sandboxed through the SPARQL query engine.
The suggested "capture regex" qualifier would be different, as this is something we would be offering to third parties, encouraging them to use it. That comes with responsibilities. Jheald (talk) 11:12, 11 October 2019 (UTC)
Actually P1793 is already used for more than that and even when external parties use P1793 for constraint checking, they have the same issues to consider. --- Jura 11:17, 11 October 2019 (UTC)
@Jura1: OK then, start a property proposal for the new qualifier, and let's see how it goes. Jheald (talk) 19:58, 11 October 2019 (UTC)
You are the one who thinks we shouldn't be using P1793 in the way it's defined. --- Jura 07:09, 12 October 2019 (UTC)
@Jura1: P1793 says it is a "regex describing an identifier or a Wikidata property" -- ie a regex that values of the identifier or Wikidata property are expected to conform to, ie a constraint specification, even if the constraint is not (currently) actively monitored. If you know of uses other than that, please give examples.
The "capture regex" role would be other than that, and we would want to limit it to qualifying a very limited set of properties. It is appropriate to (propose) a different property for this rather different job. Jheald (talk) 23:07, 13 October 2019 (UTC)
Interesting thoughts. We may agree with them or not. Anyways, I meant this as a proposal for URN only, not formatters or regexes in general. --- Jura 06:35, 19 October 2019 (UTC)