Wikidata:Property proposal/urn formatter


formatter URN

Return to Wikidata:Property proposal/Generic

Status: Under discussion
Description: formatter to generate URN from property value. Include $1 to be replaced with property value.
Represents: Uniform Resource Name (Q76497)
Data type: String
Domain: Wikidata properties
Allowed values: urn:.+. Including variables ($1). Default capturing regex on property values is ([.]+)
Example 1: ISBN-13 (P212) → urn:isbn:$1
Example 2: ISSN (P236) → urn:ISSN:$1
Example 3: RfC ID (P892) → urn:ietf:rfc:$1
Example 4: URN-NBN (P4109) → $1
Source: IETF RFC 8141: Uniform Resource Names (URNs), IANA: Uniform Resource Names (URN) Namespaces
Planned use: Add formatters to ISBN-13 (P212), ISSN (P236), RfC ID (P892) and in future use this for rendering.
Expected completeness: always incomplete (Q21873886)
Robot and gadget jobs: Bots can use this to cross reference entities between wikidata and other databases.
See also: formatter URL (P1630), format as a regular expression (P1793)
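
As a minimal sketch of how a consumer might apply such a formatter, the following Python substitutes a property value for $1 in the formatter strings from the examples above. The function name format_urn and the sample identifier values are illustrative assumptions, not part of the proposal.

```python
def format_urn(formatter: str, value: str) -> str:
    """Replace every literal "$1" in the formatter with the property value."""
    return formatter.replace("$1", value)


# Formatter strings come from the proposal's examples; the identifier
# values are hypothetical sample inputs chosen only for illustration.
EXAMPLES = [
    ("urn:isbn:$1", "978-3-16-148410-0"),   # ISBN-13 (P212)
    ("urn:ISSN:$1", "0028-0836"),           # ISSN (P236)
    ("urn:ietf:rfc:$1", "8141"),            # RfC ID (P892)
    ("$1", "urn:nbn:xx:example-1"),         # URN-NBN (P4109): value is already a URN
]

for formatter, value in EXAMPLES:
    print(format_urn(formatter, value))
```

This is plain string substitution with no regex involved, which is all the four examples need.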

Motivation

Following discussion on project chat. Please add more samples. --- Jura 08:22, 5 October 2019 (UTC)

For properties with an associated URN it would be useful to define how their values can be converted to URNs, so that Wikidata items can be easily cross-referenced to other systems or exported with URN identifiers. Iwan.Aucamp (talk) 07:54, 7 October 2019 (UTC)

Discussion

  • Strong support. I had been thinking about this property for a while, but I wasn't sure. --Tinker Bell 17:27, 6 October 2019 (UTC)
  • Strong support. Iwan.Aucamp (talk) 07:20, 7 October 2019 (UTC)
    • Comment: I am not sure exactly how we would deal with cases where transformation of data is required, e.g. if the Wikidata property value is A-B-C and the URN should be ABC instead. But in general I think this is useful. Iwan.Aucamp (talk) 07:34, 7 October 2019 (UTC)
  • Support. Makes sense to have this; however I'm a little unclear on the description statement regarding "regex" and "capturing" - is there some intention to have this implemented by the Wikidata UI developers? The "$1" interpretation is what is already done with formatter URLs, so that seems fine, but if something more than that is wanted I think it needs to be clearly specified (or just remove that from the description). ArthurPSmith (talk) 12:33, 7 October 2019 (UTC)
  • Comment/support. The regex idea is interesting. The idea is that if by default we think of P1630 doing something like
    BIND(IRI(REPLACE(?id, '(^.*)'  , ?fmt)) AS ?url)
    
    -- one way one can implement it in a SPARQL query -- then it would be a nice feature to be able to add a qualifier on P1630 or the present property, to specify to use something like '(^([^\-]+)\-([^\-]+)\-([^\-]+))' instead of the default '(^.*)', allowing the P1630 formatter string to then include additional capture groups, in this case $2 and $3.
One wrinkle is that not all downstream uses of P1630 may be implementing it the way above. For instance, given the current limitations, queries can get away with implementing P1630 using
BIND(IRI(REPLACE(?fmt, '\\$1'  , ?id)) AS ?url)
Does that matter? For queries I would say not. Existing formatters for properties that only use $1 would continue to work, however the transformation has been coded. They would (silently) give bad transformations for new property formatters that want to use $2 and $3 -- but new code would be needed anyway for these, to recognise and implement the qualifier regex.
Yes, it will need an update to the Wikidata GUI code for the qualifier regex to be supported; but it seems to be a direction we ought to be moving in. So, Support this property, and Support the idea for a new qualifier property, to be able to specify regexes to be used in conjunction with it and with P1630. The qualifier property will have to be 'on hold' until the GUI can support it, but that's no reason not to suggest it, approve it, and thus indicate to the developers that this is something that the community would like.
But one thing I would say is that this property and P1630 need to work exactly the same way, or this will get very confusing. If this property is to be allowed to take a regex qualifier, then P1630 needs to be able to take it too; if P1630 won't take a regex qualifier, then this property shouldn't either. But in my view it makes sense to support both this property, and any qualifier regex proposal, for use both on this and P1630. Jheald (talk) 17:16, 7 October 2019 (UTC)
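
The "capture regex" idea discussed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name apply_capture_formatter, the urn:example: namespace and the A-B-C input are hypothetical, and Python's re module stands in for the SPARQL REPLACE function. Note that with the suggested regex the outer parentheses are group 1, so the three hyphen-separated parts are groups 2, 3 and 4.

```python
import re


def apply_capture_formatter(value: str, formatter: str,
                            capture_regex: str = r"(^.*)") -> str:
    """Apply a formatter containing $1, $2, ... to a property value.

    The capture regex is matched against the value, and each "$N" in the
    formatter is filled with capture group N.
    """
    # Escape backslashes so they are not read as template escapes, then
    # translate "$N" placeholders into Python's "\g<N>" group references.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", formatter.replace("\\", r"\\"))
    # count=1 replaces only the first match, so the formatter is applied once.
    return re.sub(capture_regex, template, value, count=1)


# Default regex: the whole value becomes $1.
print(apply_capture_formatter("978-3-16-148410-0", "urn:isbn:$1"))

# Hypothetical case from the discussion: value A-B-C, desired URN part ABC.
three_parts = r"(^([^\-]+)\-([^\-]+)\-([^\-]+))"
print(apply_capture_formatter("A-B-C", "urn:example:$2$3$4", three_parts))
```

As in the SPARQL version, a value that does not match the capture regex is returned unchanged rather than raising an error; a stricter consumer might prefer to reject such values.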
@Jheald: Maybe we should have a regex and non-regex variant so that the non regex one can be used in cases where it is sufficient and the regex one can be used where it will not be sufficient? @ArthurPSmith: Iwan.Aucamp (talk) 05:28, 8 October 2019 (UTC)
Development team asked for thoughts re "capture regex" qualifier suggestion. Jheald (talk) 08:01, 8 October 2019 (UTC)
  • Oh .. I didn't think this would get that much feedback. Thanks for your input.
    I had in mind that format as a regular expression (P1793) should be read if present for one capturing group, but I suppose it could work for several capturing groups too. The alternative for these would be that the property doesn't work (or another would be needed).
    While some support by Wikibase might be interesting, I wouldn't rely on it, and I think this should primarily work when retrieved from the query server. Jheald outlined some of the possible queries.
    I don't think this needs to work the same way as formatter URL (P1630), even though there is a phab ticket to do that there too and the gadget for formatter supports some of it.--- Jura 08:35, 8 October 2019 (UTC)
  • BTW, I thought to use format as a regular expression (P1793), but maybe a new property is needed. --- Jura 08:42, 8 October 2019 (UTC)
  • Lucas Werkmeister (WMDE) has got back about this at WD:DEV (thread). Essentially, the team don't know a way to securely evaluate user-supplied regexes in PHP. (See e.g. this Q on StackOverflow for some discussion of how allowing users to submit their own regexes for execution can lead to Denial of Service risks, sometimes brought on not even intentionally, just through coding errors.) At the moment the system uses SPARQL to process constraints involving format as a regular expression (P1793), in part because that's then safely in a sandbox that will time queries out if they take too long. But it's apparently not possible to securely/efficiently do such sandboxing in PHP.
For the moment, it may be best to get on with passing the basic proposal here for a urn-formatter, without the qualifier. I do think the ability to be able to specify a "capture regex" would be very useful, and would be worth proposing separately. But it may be quite a challenge to find ways to implement it safely, and/or to require that only pre-approved regexes would be allowed. I suspect that probably there are ways that could be found to make it work; but it's going to need some thought, which would probably be better in its own discussion, split off from this basic property proposal (which in itself ought to be a sure thing). Jheald (talk) 13:02, 8 October 2019 (UTC)
I think let's remove the regex mention from description then, if that is okay with you @Jura1: Iwan.Aucamp (talk) 23:04, 8 October 2019 (UTC)
@Jura1: @Tinker_Bell: @ArthurPSmith: @Jheald: I removed reference to regex from the proposal - notifying in case any of you want to change your votes. Iwan.Aucamp (talk) 21:08, 9 October 2019 (UTC)
  • The current samples don't seem to require it. So I suppose one can. Default regex would probably be "(.*)". If we find some requiring it, maybe we have to re-add it.
    BTW I don't see how the discussion with Lucas is relevant. I think he writes about format constraints that are supported by Wikibase, but need to be processed in Query Server. This property proposal isn't supported by any special Wikidata feature and one would need to use Query Server anyway. Maybe I missed the part where PHP becomes relevant for this proposal. --- Jura 19:00, 10 October 2019 (UTC)
  • A quirk of the SPARQL engine is that the default regex has to be '(^.*)', otherwise the substitute string seems to get put in twice. Probably a Blazegraph implementation bug, but adding the caret makes the issue go away (as well as making the regex very slightly more efficient).
It's an interesting point that as a URN isn't a clickable link, the GUI doesn't have to process the identifier string, so maybe a "capture regex" qualifier would be more acceptable. On the other hand, in my view we owe a duty of care to downstream reusers not to potentially hose their systems by proffering a regex that could cause a denial of service. So, at the very least, if we were to go ahead with such a field, IMO it ought to be locked to be edited only by an admin, under instructions not to make such an edit unless they were sure it could not be harmful (and competent to make such an assessment). Jheald (talk) 21:22, 10 October 2019 (UTC)
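
The doubled substitution mentioned above for a default regex without the caret can be reproduced in other regex engines that replace every match, because (.*) also matches the empty string left at the end of the input. The sketch below shows this with Python's re module (version 3.7 or later); whether Blazegraph's behaviour has the same cause is an assumption, not something established in this discussion. The urn:example: namespace and the sample value are hypothetical.

```python
import re

value = "12345"
template = r"urn:example:\g<1>"  # Python spelling of "urn:example:$1"

# Without the caret, a second, empty match is found at the end of the
# string, so the template is emitted a second time with an empty group.
without_caret = re.sub(r"(.*)", template, value)

# With the caret, a match can only start at position 0, so the template
# is applied exactly once.
with_caret = re.sub(r"(^.*)", template, value)

print(without_caret)
print(with_caret)
```

Anchoring the pattern, or limiting the replacement to the first match, avoids the duplicate.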
So you want to change the way format as a regular expression (P1793) is currently used? --- Jura 09:42, 11 October 2019 (UTC)
@Jura1: I am less worried about P1793, as that is used to specify a regex for constraint checking, which is essentially a process internal to Wikidata, which according to the dev team is sandboxed through the SPARQL query engine.
The suggested "capture regex" qualifier would be different, as this is something we would be offering to third parties, encouraging them to use it. That comes with responsibilities. Jheald (talk) 11:12, 11 October 2019 (UTC)
Actually P1793 is already used for more than that and even when external parties use P1793 for constraint checking, they have the same issues to consider. --- Jura 11:17, 11 October 2019 (UTC)
@Jura1: OK then, start a property proposal for the new qualifier, and let's see how it goes. Jheald (talk) 19:58, 11 October 2019 (UTC)
You are the one who thinks we shouldn't be using P1793 in the way it's defined. --- Jura 07:09, 12 October 2019 (UTC)
@Jura1: P1793 says it is a "regex describing an identifier or a Wikidata property" -- i.e. a regex giving the form that values of the identifier or Wikidata property are expected to conform to, i.e. a constraint specification, even if the constraint is not (currently) actively monitored. If you know of uses other than that, please give examples.
The "capture regex" role would be other than that, and we would want to limit it to qualifying a very limited set of properties. It is appropriate to (propose) a different property for this rather different job. Jheald (talk) 23:07, 13 October 2019 (UTC)