Property talk:P304


Documentation

page(s)
page number of source referenced for statement. Note "column(s)" (P3903) and "folio(s)" (P7416) for other numbering systems
Description: page number of source referenced for statement. May be a range of pages. To be used in the "sources" field.
Represents: page (Q1069725)
Data type: String
Domain: any statement of any item (note: this should be moved to the property statements)
Allowed values: [\dA-Za-z\x{2160}-\x{217F}]+(?:\s?[\-\x{2010}-\x{2013}\x{2E17}\x{2E1A}\x{2E40}\x{301C}\x{3030}\x{30A0}\x{FE58}\x{FE63}\x{FF0D}\x{10EAD}]\s?[\dA-Za-z\x{2160}-\x{217F}]+)?(?:\s?[,;]\s?[\dA-Za-z\x{2160}-\x{217F}]+(?:\s?[\-\x{2010}-\x{2013}\x{2E17}\x{2E1A}\x{2E40}\x{301C}\x{3030}\x{30A0}\x{FE58}\x{FE63}\x{FF0D}\x{10EAD}]\s?[\dA-Za-z\x{2160}-\x{217F}]+)?)* (allowed: (1) a number with digits (decimal or Roman) or Latin letters; (2) a range of two numbers separated by a short hyphen or dash; (3) a list of numbers or ranges separated by a comma or a semicolon (with an optional space before or after each separator))
Example
According to this template: 12 ; XII ; A2ii ; 16-17
When possible, data should only be stored as statements
Tracking: usage: Category:Pages using Wikidata property P304 (Q98107903)
See also: number of pages (P1104), section, verse, paragraph, or clause (P958), chapter (P792), volume (P478), issue (P433), measure number (P7141), folio(s) (P7416), file page (P7668), line(s) (P7421)
Lists
Proposal discussion
Current uses
Total: 35,655,754
Main statement: 35,097,341 (98.4% of uses)
Qualifier: 443,081 (1.2% of uses)
Reference: 115,332 (0.3% of uses)
Format “[\dA-Za-z\x{2160}-\x{217F}]+(?:\s?[\-\x{2010}-\x{2013}\x{2E17}\x{2E1A}\x{2E40}\x{301C}\x{3030}\x{30A0}\x{FE58}\x{FE63}\x{FF0D}\x{10EAD}]\s?[\dA-Za-z\x{2160}-\x{217F}]+)?(?:\s?[,;]\s?[\dA-Za-z\x{2160}-\x{217F}]+(?:\s?[\-\x{2010}-\x{2013}\x{2E17}\x{2E1A}\x{2E40}\x{301C}\x{3030}\x{30A0}\x{FE58}\x{FE63}\x{FF0D}\x{10EAD}]\s?[\dA-Za-z\x{2160}-\x{217F}]+)?)*”: value must be formatted using this pattern (PCRE syntax). (Help)
Exceptions are possible as rare values may exist.
List of this constraint violations: Database reports/Constraint violations/P304#Format, SPARQL
Scope is as main value (Q54828448), as qualifier (Q54828449), as reference (Q54828450): the property must be used only in the specified ways (Help)
Exceptions are possible as rare values may exist.
List of this constraint violations: Database reports/Constraint violations/P304#scope, SPARQL
Exceptions are possible as rare values may exist.
List of this constraint violations: Database reports/Constraint violations/P304#allowed entity types
Pattern ^(\d+)(\?\?\?|--| - |‑|‒|—|−)(\d+)$ will be automatically replaced with \1-\3.
Testing: TODO list
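
For readers who want to try values against the format constraint outside Wikidata, here is a minimal sketch in Python. It assumes Python's `re` module as a stand-in for PCRE, so the `\x{...}` escapes from the documented pattern are rewritten as `\u....` / `\U........`, and `re.ASCII` is set so that `\d` and `\s` stay ASCII-only. The names `TOKEN`, `DASH`, `RANGE`, `P304_FORMAT` and `is_valid_p304` are made up for this illustration and are not part of any Wikidata tool.

```python
import re

# One page token: ASCII digits/letters or Roman-numeral characters U+2160..U+217F.
TOKEN = r"[\dA-Za-z\u2160-\u217F]+"
# Range separators listed in the constraint: ASCII hyphen plus several dash-like characters.
DASH = (r"[\-\u2010-\u2013\u2E17\u2E1A\u2E40\u301C\u3030\u30A0"
        r"\uFE58\uFE63\uFF0D\U00010EAD]")
# A single page or a range of two pages, with an optional space around the separator.
RANGE = rf"{TOKEN}(?:\s?{DASH}\s?{TOKEN})?"
# A comma- or semicolon-separated list of pages or ranges.
P304_FORMAT = re.compile(rf"{RANGE}(?:\s?[,;]\s?{RANGE})*", re.ASCII)


def is_valid_p304(value: str) -> bool:
    """Return True if the whole value matches the sketched format pattern."""
    return P304_FORMAT.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["12", "XII", "A2ii", "16-17", "110\u2013121, 124\u2013127", "1-1-1-36"]:
        print(repr(sample), is_valid_p304(sample))
```

The template examples (12, XII, A2ii, 16-17) pass under this sketch; a value with two separators in one range, such as 1-1-1-36, does not.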

Why string

This property is defined as a string because page numbers can be written in different ways, such as A-1, or as an interval of pages such as 1234-1256. Snipre (talk) 20:06, 20 March 2013 (UTC)

Claim vs statement in description

A reference is used in a statement; it is not available in a claim. Because of this, I changed the description to say "statement" rather than "claim", as it did before. If this "page" property is to be used as a general property, the description should name neither claim nor statement. Jeblad (talk) 15:22, 7 June 2013 (UTC)

Article IDs in electronic-only journals

The page concept does not always fit with electronic-only publications, and a common way around that is to assign article IDs to individual articles, rather than start and end pages. In bibliographic contexts, article IDs are typically entered in what otherwise is the "page" field. Should we follow this pattern or clearly distinguish between the two? For an example article, see doi:10.1371/journal.pcbi.1002727, for which I just started A Future Vision for PLOS Computational Biology (Q19370790). Its page ID is e1002727, which does not fit with the current constraints on allowable strings for P304. --Daniel Mietchen (talk) 04:02, 2 March 2015 (UTC)[reply]

Hi Daniel, I think we should propose a new property for this case. --Succu (talk) 08:00, 2 March 2015 (UTC)[reply]
No, the concept is the same; a new property will just be more confusing for people who create an item about an article only once. We have an exception, and that is not a problem to manage with the constraints system. The question is rather whether this exception is specific to this journal or whether there are more format problems with other journals. If so, we have to see if we can modify the constraints. Snipre (talk) 16:24, 2 March 2015 (UTC)
If you have to cite a certain page within an article of an issue you'll need both a page and the article id. --Succu (talk) 16:43, 2 March 2015 (UTC)[reply]
I mulled this over a bit more and think now that a new property for article ID is the best way to go. --Daniel Mietchen (talk) 20:46, 11 May 2015 (UTC)[reply]
I proposed it. --Daniel Mietchen (talk) 04:07, 2 August 2015 (UTC)[reply]

Which dash?

The format constraint prescribes an en dash for page ranges, whereas the example uses a hyphen, as do many uses in practice, which therefore produce constraint violations. For English, enwiki states that both variants are used; enwiki itself uses an en dash. It could be useful to settle on one variant here to simplify data reuse. Possibly a hyphen is easier to input for most people? --Marsupium (talk) 11:50, 28 July 2015 (UTC)

✓ extended --- Jura 12:55, 26 November 2015 (UTC)[reply]
Would it not make more sense to set the constraint to allow only the en dash, and then have a bot patrol the reports and correct any hyphens or em dashes? This is how e.g. underscores in P18 filenames are handled. /Lokal Profil (talk) 19:50, 1 August 2018 (UTC)
Current usage of different symbols:
{{Autofix}} can be used for the replacement. The same rule is already used at Property talk:P555. -- Ivan A. Krestinin (talk) 09:47, 7 August 2021 (UTC)
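
To make the effect of the autofix rule concrete, here is a rough sketch that applies the pattern from the documentation box with Python's `re.sub`. It only illustrates the substitution on plain strings; how {{Autofix}} is actually executed on Wikidata is not shown here, and the function name `autofix_page_range` is an assumption for this example.

```python
import re

# Pattern and replacement copied from the documentation box:
# two runs of digits joined by "???", "--", " - " or one of several dash characters.
AUTOFIX = re.compile(r"^(\d+)(\?\?\?|--| - |‑|‒|—|−)(\d+)$")


def autofix_page_range(value: str) -> str:
    """Normalise the separator of a purely numeric range to an ASCII hyphen."""
    # Group 2 is the separator, so the replacement keeps groups 1 and 3 only.
    return AUTOFIX.sub(r"\1-\3", value)


if __name__ == "__main__":
    for sample in ["12--15", "12 - 15", "12???15", "12-15", "ix--xix"]:
        print(repr(sample), "->", repr(autofix_page_range(sample)))
```

Values that do not consist of exactly two digit runs, such as Roman-numeral ranges or comma-separated lists, are left untouched by this rule.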

Non-source use

Numerous users have been using this property for claims (as opposed to sources/references). Examples: Q15097304, Q14624399, Q14624838, Q21003828. What do these statements mean, and should the property be used in this manner?

(Also, was some automated tool involved here? The series of edits look similar enough that I thought it was a bot at first, until I saw the different user names...)

--Yair rand (talk) 13:06, 13 October 2015 (UTC)[reply]

Well, Q15097304 is a source; there are no sitelinks there. It appears to describe an article, "The crystal structure of argentojarosite, AgFe₃(SO₄)₂(OH)₆", which was published on pages 921–928 of "The Canadian Mineralogist". If there is a sentence on page 923 which confirms the statement, you may add that in the source if you like. -- Innocent bystander (talk) 13:55, 13 October 2015 (UTC)

Split into start and end page?

Many uses of the property actually give a range of pages, e.g. "921–928" in The crystal structure of argentojarosite, AgFe₃(SO₄)₂(OH)₆ (Q15097304). Wouldn't it be better to split that somehow into "starting page" and "end page", either by suitable qualifiers to page(s) (P304) or by dedicated properties? Some sources (particularly on paper) actually may have multiple start and end pages (e.g. Some Thoughts About Publishing Results in Our Field [Publication Activities] (Q24090833) currently states "22-41", even though the article is less than 1.5 pages long — it had just been printed in part on p. 22 and 41), and this would be easy to model with such a split, as would be cases where start and end page are identical, for which P304 works best. --Daniel Mietchen (talk) 20:36, 18 May 2016 (UTC)[reply]

I am not sure that is a good idea. If you have the statement "page:9-16, 47-93", that would become "startpage:9, 47" and "endpage:16, 93", which is tricky to put back together correctly. It could be interpreted as "9-93, 47-16", and that would be wrong. Software can probably interpret numbers like 9 and 47 correctly, but I am not as sure about page numbers like "CIV v". -- Innocent bystander (talk) 06:59, 19 May 2016 (UTC)
@Daniel Mietchen, Innocent bystander: not sure, but it sounds like a good idea to me. Right now the constraint accepts two forms: 921-928 and 921–928 (see #Which dash? above); that can be confusing too. A dedicated property would be more explicit. For « 9-16, 47-93 », wouldn't a simple solution be to enter it twice: « startpage:9 endpage:16 » and « startpage:47 endpage:93 »?
BTW, there are a lot of strange constraint violations on Wikidata:Database reports/Constraint violations/P304. Daniel, you seem to be responsible for at least some of them; could you explain and maybe change/adapt the constraint? For instance, Q21147030#P304 is quite confusing. Cdlt, VIGNERON (talk) 12:21, 27 May 2016 (UTC)
RE:VIGNERON: The order of two claims is not a failsafe way to add things here (at least not yet). -- Innocent bystander (talk) 12:24, 27 May 2016 (UTC)[reply]
True, but separating them is a failsafe way to avoid confusion like the "9-93, 47-16" interpretation. Does the order really matter? Isn't this order obvious? Cdlt, VIGNERON (talk) 12:31, 27 May 2016 (UTC)
Yes, it is obvious as long as it is a human who reads it, and as long as the numbers used are easy for a computer to read. It's all the strange exceptions I am worried about. The constraints report has probably found many of them. -- Innocent bystander (talk) 12:37, 27 May 2016 (UTC)
@VIGNERON, Innocent bystander: I think we should avoid things like "startpage:9, 47" and "endpage:16, 93" and go for « startpage:9 » « endpage:16 » and « startpage:47 » « endpage:93 » (etc.) instead.
As for adapting the constraint violations, I see the need, but I am not comfortable enough with Perl Compatible Regular Expressions (Q125267) to fix the constraints myself. Would be happy to learn more about that, though, since there are other properties where there is a similar misfit between the constraints and the actual values. --Daniel Mietchen (talk) 18:26, 27 May 2016 (UTC)[reply]
@Daniel Mietchen: What happens if you try to add « startpage:9 » « endpage:16 » and « startpage:47 » « endpage:93 » to an item or the source part of a claim? From what I know, it is not technically possible. I would love such a solution, but I do not think it works. -- Innocent bystander (talk) 19:47, 27 May 2016 (UTC)
@Innocent bystander: Not sure I get your question. I think right now, nothing happens, i.e. such edits cannot be saved. In case you are talking about
« startpage:9 endpage:16 » and « startpage:47 endpage:93 »
vs.
« startpage:9 » « endpage:16 » and « startpage:47 » « endpage:93 »
I can imagine it both ways but am not aware of a technical way to implement the former (i.e. bundling the start and end page of a given blob of text), which is why I wrote the latter. --Daniel Mietchen (talk) 21:27, 27 May 2016 (UTC)[reply]
@VIGNERON: I made some edits to the constraints — review/ comment appreciated. --Daniel Mietchen (talk) 21:30, 27 May 2016 (UTC)[reply]
What about articles spanning multiple issue (P433)? --Succu (talk) 21:28, 27 May 2016 (UTC)[reply]
How common is/ was that? Got an example? Perhaps we could treat them as if these were separate articles? --Daniel Mietchen (talk) 21:45, 27 May 2016 (UTC)[reply]
I have never heard of such a thing in a scientific paper, but I can imagine other kinds of articles published in other kinds of media having such problems. I know that we have handled such articles on sv.Wikisource. The article in question was published as serialized fiction (Q1347298) in a 19th-century newspaper. On Wikisource, the different parts were merged into one article for convenience. -- Innocent bystander (talk) 04:07, 28 May 2016 (UTC)
The Water-Babies, A Fairy Tale for a Land Baby (Q14914646) is an example of serialized fiction (Q1347298). In older publications like Allgemeine Gartenzeitung (Q5669289) it was a common practice. --Succu (talk) 07:49, 28 May 2016 (UTC)
@Succu: What would a reference to this text look like in Wikipedia? -- Innocent bystander (talk) 13:25, 28 May 2016 (UTC)
See de:Die Wasserkinder#Erstveröffentlichung (It's my approach). --Succu (talk) 21:01, 28 May 2016 (UTC)[reply]
Thanks! I would probably have done the same. And the equivalent here would probably be to put it in 8 different items or 8 different references. -- Innocent bystander (talk) 06:16, 29 May 2016 (UTC)
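
The pairing concern raised earlier in this thread ("9-16, 47-93" being misread as "9-93, 47-16") can be sidestepped by a consumer that keeps each start and end together while parsing. The following is a minimal sketch under that assumption; `parse_page_ranges` is a hypothetical helper, it only knows the ASCII hyphen and the en dash as range separators, and it treats page numbers as opaque strings so that values like "CIV v" survive unchanged.

```python
import re
from typing import List, Tuple

# Range separators handled by this sketch: ASCII hyphen and en dash (U+2013).
RANGE_SEPARATOR = re.compile(r"\s*[-\u2013]\s*")
LIST_SEPARATOR = re.compile(r"\s*[,;]\s*")


def parse_page_ranges(value: str) -> List[Tuple[str, str]]:
    """Split a page(s) string into (start, end) pairs, one pair per list entry."""
    pairs = []
    for part in LIST_SEPARATOR.split(value.strip()):
        bounds = RANGE_SEPARATOR.split(part, maxsplit=1)
        start = bounds[0]
        # A single page is represented as a range whose start and end coincide.
        end = bounds[1] if len(bounds) == 2 else start
        pairs.append((start, end))
    return pairs


if __name__ == "__main__":
    print(parse_page_ranges("9-16, 47-93"))
    print(parse_page_ranges("22, 41"))
```

Because each pair is produced from one list entry, the start and end pages cannot be recombined across entries, which is the property the separate « startpage » / « endpage » proposal was trying to obtain.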

Constraint violation for use on version, edition, or translation (Q3331189) able to be coded?

Where this property is used on items that have P31 -> Q3331189, would it be possible for this to be marked as a violation, as we would be expected to use number of pages (P1104) instead? Thanks. -- billinghurst sDrewth 04:54, 18 August 2017 (UTC)

@billinghurst: let's try. I added the constraint; there are 100 violations, and most do not seem to be confusion with number of pages (P1104). For instance, Q19127415 seems to be correct (or at the very least not confused with number of pages (P1104)). Should we keep this constraint or not? Cdlt, VIGNERON (talk) 16:24, 31 January 2019 (UTC)

existing constraint variation required

Hi. Looking at some of the current variations, we have articles that are split ie. a run of pages, a gap, then another page or run, eg. Perspectives in disease prevention and health promotion fatal occupational injuries - Texas, 1982 (Q26342600). Would it be possible to look to vary the constraint to adapt to that variation? Thanks.  — billinghurst sDrewth 05:05, 18 August 2017 (UTC)[reply]

I agree. The current regex is too restrictive and doesn't allow for common situations when a story or article is split into two sections in the same issue, e.g. Alice Munro's The Bear Came Over the Mountain (Q60664117), published in the New Yorker on pages "110–121, 124–127" -- LesserJerome (talk) 17:19, 18 January 2019 (UTC)[reply]

@billinghurst, LesserJerome, Harej, Daniel Mietchen: we can change and expand the regex, but we should list all the values we do and don't want. For instance, yesterday I changed the constraint to allow "67, 69" (because pages can be non-consecutive); is that OK? The current regex allows 3 types of hyphen (-, – and —), which is quite open, but at the same time it doesn't allow spaces on either side of these hyphens, which is quite closed… So 10-13, 10–13 and 10—13 are allowed but not 10 - 13. Is that really what we want? (Personally, I would only allow 10-13 and use a conversion script for all other formats, but I'm open to discussion.) The [A-Z] feels strange too; if this is for Roman numerals, it should be limited to [IVXLCDM]. Cdlt, VIGNERON (talk) 17:23, 31 January 2019 (UTC)

@VIGNERON: I'm satisfied with the allowance of commas, and I think it is a good idea to normalize hyphens/dashes with a conversion script. For Roman numerals, perhaps [iIvVxXlLdCcDmM] to allow for numerals that appear in lowercase as well? Case might not be incredibly important to capture, but this would at least allow for some variation. Thanks for your help! LesserJerome (talk) 17:39, 31 January 2019 (UTC)
Pretty much agree; we need PAGE/PAGERANGE(, repeat)?(, repeat)? and maybe another repeat. I agree about a limited subset of lower-case Roman numerals, though I hesitate as I only have experience with western European books and am not certain what early eastern European books did. That said, I had a later thought that if it was too hard to cover the range in one set, we could just use multiple number sets, but that forces the ingestion and tying together to happen at the using site, and that could be problematic. -- billinghurst sDrewth 22:08, 31 January 2019 (UTC)
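
To see what the stricter variant floated in this thread would accept and reject, here is a small sketch. It assumes the combination discussed above (ASCII hyphen only, no spaces around it, decimal numbers or upper/lower-case Roman numerals, comma-separated lists); it is not the constraint actually deployed on P304, and the names `NUM`, `STRICT` and `is_strict_page_value` are invented for the example.

```python
import re

# A page number is either decimal digits or Roman-numeral letters (either case).
NUM = r"(?:\d+|[IVXLCDMivxlcdm]+)"
# PAGE or PAGE-PAGE, repeated in a comma-separated list; only the ASCII hyphen is allowed.
STRICT = re.compile(rf"{NUM}(?:-{NUM})?(?:, ?{NUM}(?:-{NUM})?)*")


def is_strict_page_value(value: str) -> bool:
    """Return True if the whole value fits the stricter format sketched above."""
    return STRICT.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["10-13", "10 - 13", "10\u201313", "ix-xix", "67, 69", "A2ii"]:
        print(repr(sample), is_strict_page_value(sample))
```

The trade-off is visible in the last sample: restricting letters to Roman numerals rejects mixed values such as A2ii, which the documentation box lists as a valid example.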

warning about page(s) for version, edition or translation

I am getting a warning because apparently, if it is a version it does not have page numbers?--RaboKarbakian (talk) 17:13, 2 February 2019 (UTC)[reply]

@RaboKarbakian: see the two discussions above. This constraint is probably a good idea in appearance only. There are too many false positives; we should probably remove it. @Billinghurst: what do you think? Cdlt, VIGNERON (talk) 16:24, 5 February 2019 (UTC)
@RaboKarbakian: constraint removed by @Hsarrazin:. Cdlt, VIGNERON (talk) 09:11, 13 February 2019 (UTC)[reply]
Hello! Sorry for not checking before removing it... :)
I did not know about this discussion, but this constraint was, for me, an obvious error, since I have thousands of items about editions of short stories, poems, etc., published in periodicals or in anthologies (linked with published in (P1433)). How could I catalog them without page(s) (P304)?
I don't know if it is feasible to tie the use of this property on editions to the simultaneous use of published in (P1433)... What do you think, @Billinghurst, VIGNERON? --Hsarrazin (talk) 09:17, 13 February 2019 (UTC)
No problem, and honestly as soon as I added this constraint I had doubts (cf. the previous discussion above). Cheers, VIGNERON (talk) 09:33, 13 February 2019 (UTC)[reply]

Locations in e-books

For e-books that don’t have the “real page number” feature, do we use this property for the “location” in references? - PKM (talk) 03:51, 29 March 2019 (UTC)[reply]

Description

I don't understand the item description. My native language is Czech and I also speak English, but I don't understand it in either language. Could we rephrase it? --Juandev (talk) 18:59, 25 April 2019 (UTC)

Hyphenated page numbers

How should we enter the range when page numbers have hyphens? Consider the chapters in Metallogenesis and Tectonics of Northeast Asia (Q57842373): Chapter 1, Introduction (Q63249120) has pages 1-1 through 1-36. I think pages(2) 1-1-1-36 would be very confusing, so maybe an em dash somewhere would be better? Trilotat (talk) 13:27, 27 September 2019 (UTC)

Not the em-dash, but only the en-dash. See below!! Verdy p (talk) 13:49, 6 November 2020 (UTC)[reply]
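
As an illustration of why the en dash helps here, the sketch below assumes the convention suggested in the reply: hyphens belong to the page numbers themselves and a single en dash (U+2013) separates the two bounds. The helper name `split_en_dash_range` is hypothetical. Note also that the format pattern quoted in the documentation box admits only one separator per range, so a value such as 1-1–1-36 would presumably still be reported as a format violation.

```python
from typing import Tuple


def split_en_dash_range(value: str) -> Tuple[str, str]:
    """Split a range on its first en dash; a value without one is a single page."""
    start, separator, end = value.partition("\u2013")
    if not separator:
        return (value, value)
    return (start, end)


if __name__ == "__main__":
    # Chapter 1 of Q57842373 runs from page 1-1 through page 1-36.
    print(split_en_dash_range("1-1\u20131-36"))
```

With hyphens alone ("1-1-1-36") there is no reliable way to tell which hyphen is the range separator, which is the confusion raised in the question.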

Format regular expression warning on "roman-roman" numbering

Looking at Q69168855 there is a warning for the page numbering in the form "ix-xix". Pretty certain that this has worked previously without a warning.  — billinghurst sDrewth 22:54, 29 September 2019 (UTC)[reply]

✓ Done @billinghurst: RegEx changed, no more warning —Eihel (talk) 09:46, 4 December 2019 (UTC)[reply]

Page vs folio

There is a discussion about P304 vs folio(s) (P7416) at Topic:Vc0m1yeldangkxl4. You're all welcome to come and contribute your opinions. Deryck Chan (talk) 13:10, 2 December 2019 (UTC)[reply]

א-ת

Please add the Hebrew letters א-ת (the Hebrew alphabet) to the regular expression, because pages in Hebrew-language books are often numbered with Hebrew letters. אבגד (talk) 02:25, 20 April 2020 (UTC)

I add the following characters :

אבגדהוזחטיכךלמםנןסעפףצץקרשת

Would that be okay with you ? @אבגד: —Eihel (talk) 11:31, 11 August 2021 (UTC)[reply]
yes! אבגד (talk) 21:19, 3 October 2021 (UTC)[reply]
@Eihel:. אבגד (talk) 06:06, 6 October 2021 (UTC)[reply]

Possibly broken regex

Every constraint report I look at that has a page number on it appears to be unable to parse the regex. I'm not too familiar with the regex, so if someone could look at it and find the issue, that would be really useful. For an example, see Special:ConstraintReport/Q94816281. Cdo256 (talk) 08:11, 6 September 2020 (UTC)

People from OpenRefine confirm this problem: the regular expression has a bug and cannot be verified with checking tools. In this case the bug prevents OpenRefine from using this property in any way. Olea (talk) 18:00, 5 November 2020 (UTC)
Maybe @Verdy p: could help? Olea (talk) 18:00, 5 November 2020 (UTC)
Apparently the local regexp parser is not Unicode-aware and does not accept the en-dash (U+2013, "–") inside a character class because it is encoded as multiple bytes in UTF-8. Note that it was valid before and did not cause the "invalid Regexp" error, but this has visibly changed recently, with character classes restricted to single bytes only (i.e. accepting only 8-bit bytes, no longer real characters). However, this is easy to fix by using an alternation inside a non-capturing group, (?-|–), instead of a character class [-–]. I'll try that (maybe that's enough, or there's another problem). It was supposed to be valid in PCRE (and PCRE was supposed to be UTF-8-aware before). Verdy p (talk) 11:57, 6 November 2020 (UTC)
✓ Done I found something that changed in PCRE: for non-capturing groups, the syntax (? ... | ... ) is no longer accepted, the question mark must be followed now by a required colon (?: ... | ... ) because there may be new flags after it before the colon; it was not necessary before, when the character immediately after the ? could not be an optional flag and the colon was implied. Just adding the explicit colon was enough to fix it. So this is not related to UTF-8: character classes can contain any non-ASCII Unicode character correctly encoded as multiple bytes in UTF-8. This should work now (at least it works now for the sample given above). Verdy p (talk) 12:09, 6 November 2020 (UTC)[reply]
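
The missing-colon problem described above can be reproduced in other engines too. The sketch below uses Python's `re` module rather than PCRE, so it is only an analogy: Python likewise reads the character after `(?` as the start of an option and rejects `(?-|–)`, while `(?:-|–)` compiles and matches either character. The helper `compiles` is defined only for this example.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the engine accepts the pattern, False if it raises a syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Group written without the colon: "-" is taken as the start of an option setting.
WITHOUT_COLON = "(?-|\u2013)"
# Proper non-capturing group: hyphen or en dash (U+2013).
WITH_COLON = "(?:-|\u2013)"
# Equivalent character class, which handles the multi-byte en dash without trouble.
CHARACTER_CLASS = "[-\u2013]"

if __name__ == "__main__":
    for pattern in (WITHOUT_COLON, WITH_COLON, CHARACTER_CLASS):
        print(repr(pattern), "compiles:", compiles(pattern))
```

This is consistent with the conclusion in the comment above: the failure comes from the group syntax, not from the en dash being a non-ASCII character.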
For clarity, here is the expanded syntax, with non-significant whitespaces and newlines added:
        (?:               \d+ [A-Za-z]*
        |   [A-Za-z]+ (?: \d+ [A-Za-z]*
                      )?
        )
    (?: [-–]
        (?:               \d+ [A-Za-z]*
        |   [A-Za-z]+ (?: \d+ [A-Za-z]*
                      )?
        )
    )?
(?: [,;]\s?
        (?:               \d+ [A-Za-z]*
        |   [A-Za-z]+ (?: \d+ [A-Za-z]*
                      )?
        )
    (?: [-–]
        (?:               \d+ [A-Za-z]*
        |   [A-Za-z]+ (?: \d+ [A-Za-z]*
                      )?
        )
    )?
)*
  • Note that there are some reports of page ranges using em-dashes instead of en-dashes. ASCII hyphens (U+002D, ‘-’) are accepted, but en-dashes (U+2013, ‘–’) are recommended as range separators (because the ASCII hyphen is ambiguous, as it could be interpreted as an arithmetic operation). In my opinion, em-dashes (U+2014, ‘—’) are wrong (they do not occur in many elements) and should be replaced by en-dashes.
  • Also, the new regexp now uses non-capturing groups everywhere (capturing groups are costly in the regexp engine and are not necessary here for any purpose).
  • Note that this regexp could be compressed using "subroutine call groups" (as in a(?'x'b\g'x'?y)z, which matches "abyz", "abbyyz", "abbbyyyz", etc.; where (?'name' ...) defines a named subroutine that will be matched, and \g'name' references a subroutine call, but which will be matched separately; this is like a macro substitution). For example with 'r' naming a range, and 'p' naming a single page reference, 'd' a decimal integer, and 'w' a word:
(?'r' (?'p'           (?'d' \d+ ) (?'w' [A-Za-z]+ )?
      |     \g'w' (?: \g'd'       \g'w'?
                  )?
      )
      (?:[-–] \g'p')?
)(?:[,;]\s? \g'r' )*
  • you get the short expression:
(?'r'(?'p'(?'d'\d+)(?'w'[A-Za-z]+)?|\g'w'(?:\g'd'\g'w'?)?)(?:[-–]\g'p')?)(?:[,;]\s?\g'r')*
  • Now, we can expand the set of letters admitted in a word ('w', defined for now only as one or more ASCII letters), or decimal integer ('d', defined for now with only ASCII digits: do we admit Indic, Eastern Arabic, or Han/Kanji digits, or wide digits? or even Hebrew letters like suggested in a previous topic above?); this is defined in only one central location of the regexp, thanks to these subroutines (supported in PCRE v1 and v2 syntaxes; subroutines can also be used with quantifiers since PCRE 7.7).
  • As for matching the comma-or-semicolon separation (?:[,;]\s?\g'r')*: do we need it? Shouldn't we restrict values to a single page or single range, and force people to use multiple qualifiers instead? Note that in such a case there is no need to add support for semicolons or Arabic commas: these separators are actually language-specific and should not even be part of the data value. It is still simple to parse a CSV value and format it in a language-specific way; however, digits/numbers should remain as they are in the original edition, notably when they are mixed with letters like "A1b" or consist only of letters like "IIIa": translating or transliterating these numbers would break them and could return confusing results.
  • Also, I discovered in some properties that people wanted to be more liberal after a single page or range, by allowing an edition to be given in parentheses, such as "12 (printed edition), 14 (digital edition)". However, I'm not convinced this relates to the same described elements: separate editions that have separate pagination should have separate elements. But some people may argue we should tolerate them, so we could allow such an edition (between correctly paired parentheses) at the end of the definition of 'r' above, before its repetition in the comma-separated list, or just at the end of the whole regexp, or inside a separate qualifier (which would be translatable separately). In the latter case we'd need separate "page(s)" properties, one for each edition, but it's not evident how to pair the "page(s)" qualifiers with the "edition" qualifiers for the same element, as they are unordered, unless the elements are separated.
Verdy p (talk) 12:37, 6 November 2020 (UTC)[reply]
@Verdy p: Awesome explanation! Olea (talk) 14:06, 6 November 2020 (UTC)[reply]
@Verdy p: Nice work. I think a typo has crept into the last iteration of the reg ex - I think the "\g'n'" should be "\g'd'" - I've corrected in the documentation and the P1793 and this now validates correctly I believe
Note: there's a debate about which syntax to use for "named capture groups". Technically, however, they are NOT named capture groups but "named subroutines", which do not need to generate any capture group in the output when there's a match; they just describe common subpatterns within the engine itself and have NO effect on the output. They behave just like anonymous non-capturing groups (?: ...) whose effect is local, but the syntax for named ones allows their definition to be reused without duplicating it in each anonymous non-capturing group (?: ...).
Some people complained that my syntax was invalid (because it breaks on some of their client applications that can't parse it), but it was not invalid for the tools currently used on this wiki, and it was also tested as valid on regexp101.com, which properly recognizes it as perfectly valid PCRE (since v1! Those saying this was a .Net syntax were wrong, as .Net in fact later borrowed its support from PCRE). Equivalents for other, older engines may be:
  • to replace ?'x' by the more verbose (?P<x>) for the prefix defining a subroutine named x, and
  • to replace \g'x' by the more verbose (?P>x) for calling the same named subroutine (note that (?P=x) is a backreference, not a subroutine call).
  • Perl5 uses a similar syntax using curly braces instead of single quotes (but here again it created named capture groups in output, something that is not needed for named subroutines).
  • The syntax given above is the most compact and has the lowest cost at runtime, as it does not have to create and track capture groups (which adds several complications for backtracking, with complex buffer management in a stack).
  • If the regexp engine in a Wikidata client does not support named subroutines, which are convenient for compacting regexps under the limits of Wikidata (which is why the shortest syntax, using quotes, was displayed and fully described above), they can be converted to named capture groups (an older feature from Perl, but additional capture groups will then be created in the output for matches); otherwise, named subroutines can just be expanded as if they were macros (in which case there will once again be no additional capture groups). It's quite trivial to do that: those tools should be able to parse the compact PCRE syntax and adapt/translate it appropriately for the regexp engines they support. There are many regexp dialects, and Wikidata cannot support them all, notably for its data validators. Remember that these validators are just tools to help Wikidata users work on, modify and validate Wikidata content; the data from Wikidata can be used without those validators. Validators are not needed for read-only clients, and any client submitting data to Wikidata should work not just with bots but with human reviewers in the Wikidata community.
However, there's no standard for the various clients that could use Wikidata. Wikidata is NOT developed in favor of one external tool or another. I just focused on tools that are integrated in Wikidata itself; these tools are based on PCRE and have no problem at all with this short syntax. My description given above should have been sufficient to explain everything to those few who complained (below) with unfair statements. All was documented, but of course this can still be discussed. It is well known (and documented) that core Java regexps are more limited (though there are common Java libraries extending its legacy support), and the same is true of Perl5 (still more limited than PCRE, even if it's more extensive than core Java). .Net has a decent library; the most limited regexp engines are Lua patterns and legacy BSD shells and tools like "vi/ed/sed" and "lex" without the GNU extensions. Verdy p (talk) 20:58, 27 April 2021 (UTC)
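
For engines without named subroutines, the "expand them as if they were macros" route mentioned above can be sketched with ordinary string composition. The example below uses Python's `re` module (which has no subroutine calls) and builds the pattern from the same four pieces named in the compact expression: 'd' (decimal integer), 'w' (word), 'p' (single page reference) and 'r' (range). It is an illustration under those assumptions, not the constraint currently stored on P304; the constant names are chosen only to mirror the subroutine names.

```python
import re

D = r"\d+"          # 'd': a decimal integer
W = r"[A-Za-z]+"    # 'w': a word of ASCII letters
# 'p': a page reference, either digits with an optional letter suffix,
# or letters optionally followed by digits and another letter suffix.
P = rf"(?:{D}(?:{W})?|{W}(?:{D}(?:{W})?)?)"
# 'r': a page reference, optionally followed by a hyphen or en dash and a second one.
R = rf"{P}(?:[-\u2013]{P})?"
# Whole value: one range, then any number of comma- or semicolon-separated ranges.
PAGES = re.compile(rf"{R}(?:[,;]\s?{R})*")

if __name__ == "__main__":
    # The expanded pattern contains only non-capturing groups, as in the original.
    print(PAGES.pattern)
    for sample in ["12", "A2ii", "16-17", "110\u2013121, 124\u2013127", "10 - 13"]:
        print(repr(sample), PAGES.fullmatch(sample) is not None)
```

Each piece is still defined in one place, so widening 'w' or 'd' (for example to other scripts) only requires changing a single line, which is the maintenance benefit the subroutine version was after.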

Property still doesn’t work with OpenRefine

The problem with OpenRefine still persists and was discussed on the Telegram channel on February 14 without result. @Lydia Pintscher (WMDE): Did you have a chance to have a look at it? The OpenRefine Github suspects that the “regexp validator is still not correct in Wikidata”. I’m not tech-savvy enough to check on this, maybe you can have a look? Thank you! --Emu (talk) 10:55, 18 February 2021 (UTC)[reply]

Thanks. I'll have a look and reply in the ticket there. My assumption is that this is a problem in Blazegraph as we're using that for regular expression checking. --Lydia Pintscher (WMDE) (talk) 16:32, 18 February 2021 (UTC)[reply]
You should have pinged me if you expected a response. I was never notified after the previous thread. I had made most fixes, but the robot has not run for months to update the list of violations. All we can do for now is use the "SPARQL (new)" validator running live, whose link is above: it runs instantly and finds very few entries (in the past, before my last thread, it found zillions of violations).
@Emu: Note that nobody commented on the need to use a comma-separated list (and I've found various occurrences using semicolons instead of commas). I had suggested removing it and forcing properties that have those lists to be split into multiple properties, but this can be automated, as these commas or semicolons are easily parsable (as long as they validate with the current regexp, which now also accepts semicolons). Verdy p (talk) 06:40, 19 February 2021 (UTC)
@Verdy p, Lydia Pintscher (WMDE): the problem with OpenRefine and the RegEx on P304 still persists. --Mfchris84 (talk) 06:47, 7 April 2021 (UTC)[reply]
oh, as it is mentioned here the beta versions of open refine already ignore invalid regex from wikibase, so the problem is in a certain way solved for the moment. --Mfchris84 (talk) 07:40, 7 April 2021 (UTC)[reply]
Nonsense: regexps used in Wikidata constraints MUST conform to PCRE, as explicitly stated by the definition of P304 itself, though unfortunately without specifying a minimum version; however, the issue reported above was about a feature (named subroutines) that is part of both PCRE v1 and v2. Other regexp types exist where it makes sense to specify the syntax, but this is not the case for P304 (even if some tools like SPARQL-based reports, OpenRefine, or Python-based tools may be using other engines with more reduced capabilities, until these tools are upgraded with suitable support for at least the PCRE v1 syntax).
The problem you have in the OpenRefine project is that it only uses the imported "java.util.regex.Pattern" engine from the very basic Java Runtime (JRE), without supporting any other engine (this JRE regexp engine is very basic, I'd say "prehistoric", as it was defined decades ago). So this is actually a problem of OpenRefine, not of Wikidata (other regexp engines do exist for Java; using one would require OpenRefine to define another dependency in their open-source project, but the builtin regexp engine of the standard JRE is not even compatible for use with SPARQL or JavaScript).
It may be useful for you to look at this support page, https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html (for Java 7) and look at the documented differences with Perl 5, notably:
Comparison to Perl 5
The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
Perl constructs not supported by this class:
  • [...]
  • The backreference constructs, \g{n} for the nth capturing group and \g{name} for named-capturing group.
(These differences also evolve over time depending on the Java version; you would get even less support with an older but still common JRE v4!)
In some cases, it may be possible to convert a PCRE v1 regexp to a version compatible with JRE v7, but Wikidata provides no such mechanism and just documents PCRE for its defined constraints.
Java supports named capturing groups (with the custom syntax (?<name> ... ) or (?<?name> ... ) for not counting them), but not as back references (which are identical repetitions of the capture) or subroutines (more generic; subroutines are more like a macro substitution, and in fact they were used here, as indicated above, for this "macro" capability to reduce the size of the regexp, as we don't actually need any capturing groups in the output of matches just to validate constraints in Wikidata). Verdy p (talk) 13:19, 7 April 2021 (UTC)
So don't attempt to use "back references" (or "named captures"): they do NOT work. They would validate "78-78", a nonsensical range, but not "78-85". Note that there's still NO way in a regexp to validate ranges, i.e. to check that the first bound is lower than the second bound and that there are no overlapping or repeated ranges or isolated numbers; so you should write ranges and isolated numbers in ascending order.
If your parser does not understand PCRE subroutines, the only way is to manually expand their definition, which gives a longer regexp. We cannot support all possible regexp engines in Wikidata. Many engines are defective or have bugs or severe limits. If you don't use a PCRE-compliant regexp engine, it's up to you to adapt the regexp syntax for your external engine! And read the PCRE documentation more carefully: I made NO error! Verdy p (talk) 00:09, 13 August 2021 (UTC)
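The difference between a backreference and a manually expanded pattern can be sketched in Python. This is only an illustration: Python's standard `re` module has no PCRE subroutine calls, so the sketch contrasts a backreference with the expanded form, and the two patterns are simplified assumptions (ASCII-style digits and a plain hyphen), not the actual P304 constraint.

```python
import re

# Backreference: \1 must repeat the exact text captured by group 1.
BACKREF = re.compile(r"(\d+)-\1")

# Manual expansion: the sub-pattern is written out twice, so each side of
# the hyphen is matched independently (the effect a subroutine call gives).
EXPANDED = re.compile(r"\d+-\d+")


def matches(pattern, text):
    """Return True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


print(matches(BACKREF, "78-78"))   # True: both bounds are identical
print(matches(BACKREF, "78-85"))   # False: the second bound differs
print(matches(EXPANDED, "78-85"))  # True
```

Neither pattern checks that the first bound is lower than the second; that still needs code outside the regexp.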
OpenRefine still uses the old Java regexp engine (it does not support any PCRE or Perl construct). SPARQL uses a "REGEXP()" function in "FILTER", which is based on XQuery 1.0 and XPath 2.0 (https://www.w3.org/TR/xpath-functions/#regex-syntax: it is old, but there are pending changes to make it support PCRE extensions); it also does not support these constructs, and it does not support hexadecimal escapes for encoding Unicode characters (but it does support Unicode character classes and properties; Unicode characters must be literals, and only a few ASCII punctuation characters need a backslash to escape them from being interpreted as metacharacters).
This wiki uses PCRE: the constraint validator uses PCRE for reports (it is evaluated by KrBot), and the online validator in this wiki now uses PCRE as well. The "new SPARQL" links on this wiki should no longer use the SPARQL FILTER REGEXP() but should look up the attribute set by KrBot's reports in Wikidata. Because the online validator now uses PCRE and is capable of setting this attribute directly and instantly at each edit, KrBot will soon no longer be needed (except rarely, e.g. once a month or less often, to revalidate data if the online validator ever had bugs when editing items and failed to process some of them, if there are changes in PCRE, or to refresh report pages, which are already out of date). OpenRefine should then no longer need its own regexp engine and should use Wikidata's constraint validation attribute. The Wikidata interface already uses instant validation when loading data in an item or when submitting changes: this is much better because it puts much less work on the server, as everything is done incrementally.
External engines that want to revalidate data themselves should use PCRE, not the legacy SPARQL REGEXP() filter based on XQuery/XPath. But if they want to do that for specific Wikidata properties, they should tune their own regexp patterns or implement additional custom tests (e.g. to validate the bounds of ranges, to check that ranges do not overlap, or to check normalization of attribute values of Wikidata items before processing them with their own regexps).
PCRE is everywhere in Wikidata's supported tools. It's up to OpenRefine to fix their code to either use Wikidata's validation attributes or use their own custom regexps.
The old SPARQL engine still visible in Wikidata still works for very simple regexps but is already deprecated (it has many bugs, was based on Java regexps, is extremely slow, and frequently fails with timeouts). Verdy p (talk) 08:21, 13 August 2021 (UTC)
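As an illustration of an external tool tuning the pattern for its own engine, here is a minimal Python sketch. It assumes the format constraint currently shown on the property page and rewrites only its PCRE `\x{...}` escapes into the `\UXXXXXXXX` form that Python's `re` module accepts. The helper names `pcre_to_python` and `is_valid` are hypothetical, and the semantics of `\d` and `\s` may differ between PCRE and Python, so this is not an exact re-implementation of the constraint check.

```python
import re

# Building blocks of the P304 format constraint, in PCRE syntax.
NUM = r"[\dA-Za-z\x{2160}-\x{217F}]+"
DASH = (r"[\-\x{2010}-\x{2013}\x{2E17}\x{2E1A}\x{2E40}\x{301C}\x{3030}"
        r"\x{30A0}\x{FE58}\x{FE63}\x{FF0D}\x{10EAD}]")
ITEM = NUM + r"(?:\s?" + DASH + r"\s?" + NUM + r")?"
PCRE_PATTERN = ITEM + r"(?:\s?[,;]\s?" + ITEM + r")*"


def pcre_to_python(pattern):
    """Rewrite PCRE \\x{HHHH} escapes as Python \\UXXXXXXXX escapes."""
    return re.sub(
        r"\\x\{([0-9A-Fa-f]+)\}",
        lambda m: "\\U%08X" % int(m.group(1), 16),
        pattern,
    )


P304 = re.compile(pcre_to_python(PCRE_PATTERN))


def is_valid(value):
    """Return True if the whole value matches the converted pattern."""
    return P304.fullmatch(value) is not None


for value in ["37-72", "12, 14\u201316", "XII", "A2ii", "378-4", "12--14"]:
    print(repr(value), is_valid(value))
```

Note that "378-4" passes: the pattern checks only the syntax, so reversed or abbreviated ranges need a separate test.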

Simple pagination getting regex violation

I am seeing regex violations on very simple pagination such as that on The chemical basis of morphogenesis (Q769913). The pages are given there as 37-72, with a hyphen from the keyboard used. This shouldn't be a violation. Can someone please fix it? UWashPrincipalCataloger (talk) 21:12, 10 August 2021 (UTC)

✓ Done @UWashPrincipalCataloger: Eihel (talk) 10:40, 11 August 2021 (UTC)
The error was caused by the bogus attempts made by YOU, Eihel (when you modified my correct regexp)... You did not read the discussion above and made false assumptions! Please reread the PCRE documentation carefully, because you still don't know what a "subroutine call" is (even though I fully described and documented above how it was used to build a more "compact" regexp). You made other nonsensical changes and broke languages not using the Latin script (or, later, Devanagari) for writing numbers (because you dropped "\d", which included all decimal digits, not just "0-9" as you incorrectly assumed). Verdy p (talk) 00:19, 13 August 2021 (UTC)
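The gap between "\d" and "0-9" can be illustrated with Python's `re` module, used here only as a stand-in: for `str` patterns its `\d` matches any Unicode decimal digit, while in PCRE itself the behaviour depends on how the engine is configured. The sample strings are assumptions chosen for illustration.

```python
import re

# The number 123 written in several decimal digit systems.
SAMPLES = {
    "ASCII": "123",
    "Arabic-Indic": "\u0661\u0662\u0663",
    "Devanagari": "\u0967\u0968\u0969",
    "Fullwidth": "\uFF11\uFF12\uFF13",
}

UNICODE_DIGITS = re.compile(r"\d+")   # any Unicode decimal digit (str patterns)
ASCII_DIGITS = re.compile(r"[0-9]+")  # ASCII digits only

for name, text in SAMPLES.items():
    by_d = UNICODE_DIGITS.fullmatch(text) is not None
    by_range = ASCII_DIGITS.fullmatch(text) is not None
    print(f"{name}: \\d -> {by_d}, [0-9] -> {by_range}")
```

Only the ASCII sample matches `[0-9]+`; all four match `\d+`.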

Additional allowed entity types constraint

@Eihel: I've reverted the changes you just made because they were causing constraint violations on clearly correct uses of the property; it seems that the constraint validation system is looking only at the updated allowed-entity-types constraint (Q52004125) rather than both (is this documented somewhere?). Vahurzpu (talk) 03:07, 11 August 2021 (UTC)

I specifically noticed this on items where "page" was used in the references. Vahurzpu (talk) 03:08, 11 August 2021 (UTC)
Reproduced the issue on Test Wikidata; see testwikidata:Property:P84533 and testwikidata:Q188584 Vahurzpu (talk) 03:20, 11 August 2021 (UTC)
Indeed @Vahurzpu:, other users also remove this feature, instead of being left on the props while the problem is resolved on Phab. See T269724. Cordially. —Eihel (talk) 04:46, 11 August 2021 (UTC)
@Eihel: you were wrong on your talk page. I did NOT remove any Roman digit; you removed all decimal digits other than basic ASCII decimals, Devanagari, and a limited range of Roman numerals, and now we've got errors with numbers used in Arabic, Persian, Urdu, Chinese/Japanese/Korean ("wide" variants of digits), and traditional Hebrew. And you did not read the detailed description above when you confused "subroutine call groups" with "named groups". There are big differences: a "subroutine group" repeats the pattern it represents, like a "macro substitution", NOT what it matched in the first occurrence, so multiple occurrences of a subroutine don't have to match the same substring. All that was explained above. You should read the PCRE documentation more carefully. There was no error in what I made and documented above.
As I said, we have PCRE-based syntax used on this wiki, and we cannot support all the many external regexp engines that may exist (many have bugs or severe limitations). And all was working properly on this wiki. You changed it because some external tools do not (or do not want to) support PCRE syntax.
If you had read the above scrupulously, you would have seen that the "named subroutine" was used only as a simple macro (without any recursion needing complex exit conditions), just to avoid exceeding the maximum allowed length of a Wikidata property. Now your regexp is very long and has new problems that did NOT exist before: first you changed subroutines into named groups, so that "78-85" no longer matched, and finally you replaced "\d" with "0-9", so we have lost many decimal systems!
You have just added a few new hyphen-like characters. I had already allowed the ASCII dash and the en dash U+2013; you added incorrect dashes that should never be used for ranges, such as macron-like or underscore dashes, which are normally used as group separators within the *same* number, and only in very large numbers: "12_345" is like "12345", a single integer, not a range between the two integers 12 and 345. U+2010 (hyphen) must NOT be used between numbers; it occurs in the middle of words as a "breaking hyphen". U+2011 is the same but non-breaking. U+2012 is a "numeric hyphen": it should have the same width as digits, it is a placeholder for missing digits where we don't want extra digits to occur before or after a number, and it was also used to "overstrike" a number (to cancel an entry in accounting reports); it is for legacy compatibility only. U+2013 is the normal dash (usually wider than digits, and it may have extra padding on both sides). U+2014 is an "em dash", really too wide but admissible (though I don't see any reason for accepting it); in fact it is preferably used as a sentence separator, to start and end a parenthetical note or translator's side note, or to denote a change of speaker in a conversation. U+2015 is overlong and joining; it is just used to create a long stroke denoting the absence of information in a table cell. U+23AF is only used to create large symbols by joining it with other parts (e.g. to draw a horizontal line separating the numerator and denominator of a division in maths formulas, or to create long horizontal braces or parentheses, broken into joined fragments for the middle and the two ends). You chose those characters arbitrarily, without even using the definition of the "dash" property in Unicode:
  • U+002D (-) Hyphen-Minus
  • U+058A (֊) Armenian Hyphen
  • U+05BE (־) Hebrew Punctuation Maqaf
  • U+1400 (᐀) Canadian Syllabics Hyphen
  • U+1806 (᠆) Mongolian Todo Soft Hyphen
  • U+2010 (‐) Hyphen
  • U+2011 (‑) Non-Breaking Hyphen
  • U+2012 (‒) Figure Dash
  • U+2013 (–) En Dash
  • U+2014 (—) Em Dash
  • U+2015 (―) Horizontal Bar
  • U+2E17 (⸗) Double Oblique Hyphen
  • U+2E1A (⸚) Hyphen with Diaeresis
  • U+2E3A (⸺) Two-Em Dash
  • U+2E3B (⸻) Three-Em Dash
  • U+2E40 (⹀) Double Hyphen
  • U+301C (〜) Wave Dash (used in East Asia, notably in China and Japan, for ranges)
  • U+3030 (〰) Wavy Dash (same remark)
  • U+30A0 (゠) Katakana-Hiragana Double Hyphen (used in Japan, where the single horizontal dash is confusable with traditional digit 1)
  • U+FE31 (︱) Presentation Form For Vertical Em Dash
  • U+FE32 (︲) Presentation Form For Vertical En Dash (note the extra padding at top and bottom to avoid confusion with digit one)
  • U+FE58 (﹘) Small Em Dash (note the extra padding at left and right to avoid confusion with digit one)
  • U+FE63 (﹣) Small Hyphen-Minus (note the extra padding at left and right to avoid confusion with digit one)
  • U+FF0D (-) Fullwidth Hyphen-Minus (appropriate but like the ASCII hyphen-minus could be confused with a minus sign, not confusable with digit 1 in Asia due to internal padding and the stroke is too narrow)
  • U+10EAD (𐺭) Yezidi Hyphenation Mark
You did not read what was above at all, and made false assumptions. Maybe you don't know PCRE, fine, but you did not read the explanation above, where I gave all the hints needed to understand it (because the PCRE documentation can be difficult to read, sometimes we need another formulation based on examples, and the examples were documented above). Verdy p (talk) 22:25, 12 August 2021 (UTC)
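A list like the one above can be approximated with Python's standard `unicodedata` module. This is a sketch under a stated assumption: it selects code points by the general category Pd (Dash_Punctuation), because `unicodedata` does not expose Unicode's separate binary "Dash" property, and the two sets are not identical. The output also depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata


def dash_punctuation():
    """Return all code points whose general category is Pd."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]


for cp in dash_punctuation():
    # Characters without a name in this Unicode version get a placeholder.
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Underscore (U+005F) and the soft hyphen (U+00AD) do not appear in the result, since they belong to other general categories.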
And now I can affirm that the MORE THAN 30 MILLION violations last counted on this wiki came from YOUR edit, Eihel, when you replaced "subroutine calls" with "named groups", which of course invalidated ALL page ranges, because ranges are normally between different page numbers. Note that there's still no check of page ranges, i.e. that the first bound is lower than the second bound and that there are no overlapping ranges or single page numbers; it's not possible to capture that in a regexp without custom calls into code. An external parser could do that and canonicalize the page numbers: use the preferred en dash, a comma rather than a semicolon, and exactly one space after the comma and none before, with no space anywhere else; leave digits and letters unchanged; sort the numbers or ranges and check for overlaps if their numeric format is simple and consistent within the same decimal system; then detect invalid ranges with reversed bounds, like "378-4", an abbreviation sometimes used in old printed books in a very compact format to save paper, and now used only very informally, which should be "378-384" in Wikidata.
I **DID NOT** use any "named group" like you did. And there were NOT over 30 million violations listed before *your* change, which came from not understanding the concept of "subroutine calls" and confusing them with "named groups". Now we have to wait for the next run of the violation counters. Verdy p (talk) 23:25, 12 August 2021 (UTC)
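The kind of external check described above could look roughly like the following Python sketch. It is a hypothetical helper, not an existing Wikidata or KrBot tool: the name `canonicalize`, the restriction to ASCII decimal page numbers, and the accepted separators (ASCII hyphen and U+2010 to U+2013) are all assumptions made for illustration.

```python
import re

# One page number or one range of two page numbers (ASCII digits only).
ITEM = re.compile(r"([0-9]+)(?:\s?[\-\u2010-\u2013]\s?([0-9]+))?")


def canonicalize(value):
    """Return (canonical_text, problems) for a page(s) value.

    Values that are not plain lists of ASCII decimal numbers and ranges
    are returned unchanged with a single 'unsupported format' problem.
    """
    items, problems = [], []
    for part in re.split(r"\s?[,;]\s?", value.strip()):
        m = ITEM.fullmatch(part)
        if m is None:
            return value, [f"unsupported format: {part!r}"]
        lo, hi = m.group(1), m.group(2)
        if hi is not None and int(hi) < int(lo):
            problems.append(f"reversed or abbreviated range: {lo}-{hi}")
        # Keep the original digit strings so they are emitted unchanged.
        items.append((int(lo), int(hi or lo), lo, hi))
    items.sort(key=lambda item: item[:2])
    for prev, cur in zip(items, items[1:]):
        if cur[0] <= prev[1]:
            problems.append(f"overlap or repeat: {prev[2]} and {cur[2]}")
    text = ", ".join(
        lo if hi is None else f"{lo}\u2013{hi}" for _, _, lo, hi in items
    )
    return text, problems


print(canonicalize("16-17;12 , 14"))
print(canonicalize("378-4"))
```

The sketch only reports a reversed range such as "378-4"; expanding it to "378-384" is left to a human, since the intended second bound cannot always be inferred safely.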
Note: I have reverted only your change from "\d" to "0-9", as it does not work for all decimal numbers (even though you had manually added Devanagari digits, which were already included in "\d"). I've kept the additional Unicode Roman digits, which are not decimal (in the first description, Roman digits were allowed but had to be written with Basic Latin letters rather than the dedicated Unicode Roman digits). I've replaced the "subroutine calls" with their expanded form (which was first described above), for the external parsers (or users like you) that don't understand what a "subroutine call group" is, because the regexp can now be longer than the older limit. I've ordered all character classes correctly; I've dropped some "dummy" hyphens you incorrectly added; I've not added the vertical presentation forms listed just above, because vertical presentation requires other complex changes for numbers and letters; I've added the double or wavy hyphens used in some Asian languages, where long hyphens can be confused with the CJK digit one; and I've excluded "soft hyphens", e.g. in Mongolian, as they can occur in the middle of a single long integer or single word. If we allowed such characters, only for large integers with 5 digits or more, then we'd also need the middle dot used in Catalan, apostrophes in Swiss German, NNBSP in Romance languages, and underscores/overscores in some English contexts: they are separators for groups of digits, not suitable as range separators. Verdy p (talk) 23:55, 12 August 2021 (UTC)

How to model "after page 306"?

How to model the position in a work relative to a page number like "after page 306"? Example is the illustration File:The Chaldean Account of Genesis (1876) - illustration - page after 306.png. Maybe something like / sourcing circumstances (P1480) after (Q79030284), so in this case M108413715 page(s) (P304) "306" / sourcing circumstances (P1480) after (Q79030284), but using a value that is not after (Q79030284), because the item is for temporal "after"? Inductiveload, FYI. :-) Thanks for any input, best, --Marsupium (talk) 09:48, 12 January 2022 (UTC)

Thanks for bringing it up! For some background, this is extremely common and is almost universal in printed works with separate image "plate" pages. There are a few different ways this can appear:
  • Pages are inserted with the images between normally-numbered pages
    • In older books which couldn't duplex plates: e.g. "40, 41, 42, Plate, [blank, reverse of plate], 43"
    • In more modern printing: "40, 41, 42, Plate, Plate, 43"
  • Sometimes they come in groups:
    • With the images always facing text: "40, 41, 42, Plate, [blank, reverse of plate], [blank, reverse of plate], Plate, 43" (e.g. p172 onwards here)
    • With the images always on the same side of the paper: "40, 41, 42, Plate, [blank, reverse of plate], Plate, [blank, reverse of plate], 43"
  • A "frontispiece" plate comes before the title page, usually facing it. The title page is often, but not always, implicitly numbered as "1" or "i" (i.e. it doesn't have a marked page number, but a following page implies it would be so numbered if the numbering were extended down).
  • The image occurs on the title page (e.g. a printer's mark)
Very often, again not always, the images are referred to as "facing" another page. So the above image might also be said to be "facing" page 306. Sometimes it can be hard to tell which way the images face, since the scans are page-by-page, but you can usually tell by looking at the side of the page numbers on surrounding pages.
Also there are a lot of publications where "auxiliary" content like adverts contains images, but the pages themselves aren't actually numbered. Inductiveload (talk) 10:14, 12 January 2022 (UTC)