Difference between revisions of "Property talk:P244"

From Wikidata
m (Undid revision 59469669 by 174.103.190.11 (talk))
Line 23:
  }}
+ {{ExternalUse|
+ *[[:it:Template:Controllo di autorità]]
+ }}

  ==Format Summary==

Revision as of 03:40, 28 August 2013

Distinct values: this property likely contains a value that is different from all other items.
Exceptions are possible as rare values may exist.
List of this constraint violations: SPARQL (every item), SPARQL (by value)
Single value: this property generally contains a single value.
Exceptions are possible as rare values may exist.
List of this constraint violations: Database reports/Constraint violations/P244#Single value, SPARQL
Format: value must be formatted using this pattern (PCRE syntax).
([a-z]|[0-9]{2}|[a-z]{2}|[a-z][0-9]{2}|[a-z]{3}|[a-z]{2}[0-9]{2})?[0-9]{8}
Exceptions are possible as rare values may exist.
List of this constraint violations: Database reports/Constraint violations/P244#Format, SPARQL

Documentation

Library of Congress authority ID
identifier for Library of Congress ID for persons, works, organizations and subject headings [Format: 1-2 specific letters followed by 8-10 digits (see regex). For book editions, use P1144]
Description: Library of Congress Control Number as "normalized LCCN" (this is a different format from the one currently used in Wikipedia templates; see here for formatting guidelines)
Represents: Library of Congress Authorities (Q13219454)
Associated item: Library of Congress (Q131454)
Has quality: VIAF component (Q26921380), prefix (Q23585486)
Data type
According to this template: String
According to statements in the property:
External identifier
When possible, data should only be stored as statements
Allowed values
According to this template:

The rightmost eight characters are always digits.

    • The length may be 8.
    • If the length is 9, then the first character must be alphabetic.
    • If the length is 10, then the first two characters must be either both digits or both alphabetic.
    • If the length is 11, then the first character must be alphabetic and the next two characters must be either both digits or both alphabetic.
    • If the length is 12, then the first two characters must be alphabetic and the remaining characters digits.
According to statements in the property:
(n|nb|nr|no|ns|sh)([4-9][0-9]|00|20[0-1][0-9])[0-9]{6} (string of 1 or 2 lowercase letters, a 2- or 4-digit year and a sequence of 6 digits)
When possible, data should only be stored as statements
Example
According to this template: Vincent van Gogh (Q5582) => n79022935
According to statements in the property:
Vincent van Gogh (Q5582) => n79022935 (RDF)
dialect (Q33384) => sh85037527 (RDF)
When possible, data should only be stored as statements
Source: http://id.loc.gov
Formatter URL: https://id.loc.gov/authorities/$1
Robot and gadget jobs: DeltaBot does the following jobs:
Tracking: usage: Category:Pages using Wikidata property P244 (Q30243273)
Related to country: United States of America (Q30) (See 312 others)
See also: Library of Congress Control Number (LCCN) (bibliographic) (P1144), Library of Congress Demographic Group Terms ID (P4946), Library of Congress Genre/Form Terms ID (P4953)
Lists
  • Count of items by number of statements (chart)
  • Count of items by number of sitelinks (chart)
  • Items with the most identifier properties
  • Items with no other external identifier
  • Items with no other statements
  • Items with novalue claims
  • Items with unknown value claims
  • Usage history
  • Mix'n'match (Report) and Mix'n'match (Report)
  • Database reports/Constraint violations/P244
  • Proposal discussion
    Current uses: 635,791
    Search for values
    Format “(|((n|nb|nr|no|ns|sh|gf)([4-9][0-9]|00|20[0-1][0-9])[0-9]{6}))”: value must be formatted using this pattern (PCRE syntax). (Help)
    List of this constraint violations: Database reports/Constraint violations/P244#Format, hourly updated report, SPARQL, SPARQL (new)
    Single value: this property generally contains a single value. (Help)
    Exceptions are possible as rare values may exist. Known exceptions: 50 Cent (Q6060), Isaac Asimov (Q34981), David Gross (Q40262), Daniel Defoe (Q40946), Günter Wallraff (Q76529), Cat Stevens (Q154216), Kingsley Amis (Q220078), Robin Hobb (Q234403), Anne Desclos (Q253288), Catherine Robbe-Grillet (Q274608), Josiah Gilbert Holland (Q283094), Dr. Seuss (Q298685), J. T. McIntosh (Q349508), Kir Bulychov (Q360292), José Martínez Ruiz (Q443403), Noel Streatfeild (Q467058), Lawrence Block (Q505123), Anthony Berkeley Cox (Q519357), Natasha Cooper (Q544595), Lilith Saintcrow (Q598052), Willard Huntington Wright (Q630454), Blak Twang (Q881225), Arnon Grunberg (Q983233), Daniel Handler (Q1060636), Michael Marshall Smith (Q1372312), Johan Adriaan Heuff (Q1691912), John Sandford (Q1701679), Hannemieke Stamperius (Q1977977), Ronald Breugelmans (Q2331635), Anna van Gogh-Kaulbach (Q2531329), George C. Chesbro (Q2998315), H. C. McNeile (Q3134064), Lester Dent (Q3236788), W.H. van Eemlandt (Q4312410), Daniel Lyons (Q5217984), Philip Turner (Q5962373), Nelson Coral Nye (Q6990401), Robert Henry Newell (Q7345438), Ministry of Health of Germany (Q491566), TU Dortmund (Q685557), The Hong Kong Polytechnic University (Q1187271), Tokyo Metropolitan Police Department (Q1312559), Geospatial Information Authority of Japan (Q2986578), Association for Women in Communications (Q4809537), Geological Survey of Japan (Q11424703), Tokyo Institute of Psychiatry (Q11525667), Decoration Bureau (Q11635062), The University Museum, The University of Tokyo (Q15524927), Battle Creek Sanitarium (Q3472282), Frogtown (Q5505308), Loring Park (Q6681270), Marathon Oil (Q1577587), gendered advertisement (Q5530953), Federal Signal Corporation (Q1400073), AT&T Corporation (Q2843047), Toronto Public Library (Q2901516), Ramada (Q1502859), Globeville (Q5571034), Bank of Montreal (Q806693)
    List of this constraint violations: Database reports/Constraint violations/P244#Single value, SPARQL, SPARQL (new)
    Distinct values: this property likely contains a value that is different from all other items. (Help)
    Exceptions are possible as rare values may exist. Known exceptions: catering (Q777754), caterer (Q28869945)
    List of this constraint violations: Database reports/Constraint violations/P244#Unique value, SPARQL (every item), SPARQL (by value), SPARQL (new)
    Conflicts with “instance of (P31): Wikimedia disambiguation page (Q4167410), Wikimedia category (Q4167836)”: this property must not be used with the listed properties and values. (Help)
    List of this constraint violations: Database reports/Constraint violations/P244#Conflicts with P31, hourly updated report, SPARQL, SPARQL (new)
    Qualifiers “named as (P1810), reason for deprecation (P2241), pseudonym (P742), alternate names (P4970), retrieved (P813)”: this property should be used only with the listed qualifiers. (Help)
    Exceptions are possible as rare values may exist.
    List of this constraint violations: Database reports/Constraint violations/P244#Allowed qualifiers, SPARQL, SPARQL (new)
    Conflicts with “Library of Congress Control Number (LCCN) (bibliographic) (P1144)”: this property must not be used with the listed properties and values. (Help)
    Exceptions are possible as rare values may exist. Known exceptions: Sel'skiy Vestnik (Q4414024)
    List of this constraint violations: Database reports/Constraint violations/P244#Conflicts with P1144, search, SPARQL, SPARQL (new)
    This property is being used by:

    Please notify projects that use this property before big changes (renaming, deletion, merge with another property, etc.)

    Format Summary

    Competing Formats

    The format in which to store the LCCN has not been decided yet. The LCCN has been in use since the 1800s, and there are several competing standards for storing and displaying it. The few Wikidata pages that use this property so far use two distinct formats.

    Possible formats are:

    1. "Normalized LCCN" format used by Library of Congress and MARC
      This format is described at www.loc.gov. In the case of Julius Caesar it would be "n79021400". Used by:
    2. "Normalized LCCN" format with space(s) separating the leading letter and the number, used by MADS and VIAF
      In the case of Julius Caesar it would be "n 79021400". Used by:
    3. Format with "/" separating the 3 parts of the number, used by Wikipedia Authority control templates in a few dozen languages
      In the case of Julius Caesar it would be "n/79/21400" (note that the third segment has no leading "0"). Used by:
      • Authority control templates, like at de:Gaius Iulius Caesar. There are probably over 300k identifiers. The format was first introduced by de:Vorlage:Normdaten in 2009 and allows greater flexibility in creating identifiers in different formats. The symbol "/" is used because it is the only string separator that can easily be used in the Wikipedia template "language", via the titleparts parser function. At this point it is unclear whether we need that flexibility, and if we do, Lua has freed us from the constraints of the template "language".
      • WorldCat http://www.worldcat.org/identities/lccn-n79-21400 URL. One of the formats accepted by worldcat.org requires a non-normalized form that is easily constructed from the "n/79/21400" format.

    Votes for format #1 "n79021400"

    Support Format #3 would be the easiest for us to keep, but I think that in the long term we should abandon that home-brewed format, which is not used or recognized by anybody else. Technical limitations of the past made format #3 the only option in 2009, but I think we can safely abandon it now. My only concern is how to avoid confusing the users who will be filling in that property: both bot-drivers and individual users. --Jarekt (talk) 15:16, 11 March 2013 (UTC)

    Support With respect to Gymel's explanations below. --Kolja21 (talk) 17:34, 11 March 2013 (UTC)

    Support I second the arguments mentioned above. --Monsieurbecker (talk) 18:28, 11 March 2013 (UTC)

    Support a) #2 sucks; b) the slashes of #3 conflict with some elements of the printed card form; c) providing a pattern for syntax checking on entry should be feasible (some time); d) URL usage seems to converge to this form and construction of differing links should still be possible; e) most displays use a contiguous block of digits (8 or 10 depending on type 'A' or 'B') and therefore the problem of inserting the right number of zeroes at the right position within the number shouldn't pose itself too often. -- Gymel (talk) 14:10, 12 March 2013 (UTC)

    Comment Since there are no objections, I'll change the description in the Wikidata:List of properties. --Kolja21 (talk) 02:04, 14 March 2013 (UTC)

    OK --Jarekt (talk) 14:34, 14 March 2013 (UTC)

    Support, too. --Ricordisamoa 02:29, 14 March 2013 (UTC)

    Votes for format #2 "n 79021400"

    Votes for format #3 "n/79/21400"



    How does the LCCN work?

    copied from: Wikidata:Project chat#Proposal: Allow string (and may be other future types) properties to be displayed with formatting defined by a template. --Kolja21 (talk) 21:57, 10 March 2013 (UTC)

    According to www.loc.gov "LCCNs have three components: prefix, year, and serial number. The prefix is optional; if present, it has one to three lowercase alphabetic characters. (Prefixes are maintained in a controlled list.) The year is two or four digits. (For 2000 and earlier the year is two digits, for 2001 and later, four digits.) The serial number (after normalization) is six digits. A normalized LCCN is a character string eight to twelve characters in length." The site then lists the algorithmic rules for normalizing various forms of LCCNs. The current Library of Congress site uses the normalized LCCN, see en:Template:Authority control/LCCN. However, the WORLDCAT database, which also uses the LCCN, uses a different form, see en:Template:Authority control/WORLDCAT-LCCN. That is why, the last time this was discussed, the community decided to keep the LCCN as a triplet of codes (separated by "/"), normalize it on the fly to access www.loc.gov, and assemble it in other combinations to access other sites that use the LCCN as a key. That does not mean that we need to decide the same. --Jarekt (talk) 16:19, 9 March 2013 (UTC)
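The three components described in the quotation can be pulled out of a normalized LCCN with a short pattern. The following is only a minimal Python sketch of that description: the names `LccnParts` and `split_lccn` are made up for illustration, and the pattern checks the shape of the string only, not the controlled list of prefixes.

```python
import re
from typing import NamedTuple, Optional


class LccnParts(NamedTuple):
    """Hypothetical container for the three components of an LCCN."""
    prefix: str  # zero to three lowercase letters
    year: str    # two or four digits
    serial: str  # six digits (after normalization)


# Shape taken from the loc.gov description quoted above; this is an
# illustration, not an official validation pattern.
_NORMALIZED = re.compile(r"([a-z]{0,3})([0-9]{2}|[0-9]{4})([0-9]{6})")


def split_lccn(value: str) -> Optional[LccnParts]:
    """Split a normalized LCCN into prefix, year and serial number.

    Returns None when the string does not have the expected shape.
    """
    match = _NORMALIZED.fullmatch(value)
    if match is None:
        return None
    return LccnParts(*match.groups())


if __name__ == "__main__":
    print(split_lccn("n79021400"))    # Julius Caesar example from this page
    print(split_lccn("n/79/21400"))   # slashed form is not normalized
```

Because the serial number always has six digits, the two-digit versus four-digit year can be told apart by total length alone, which is why no separator is needed in the normalized form.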

    copy end


    For the record, the conversation above has been selectively copied here. I have argued that slashes should not be stored in this property. Ironically the selective copying does not even support Kolja21's desire to have the slashes included. (After all, the above paragraph concludes, "That does not mean that we need to decide the same.")

    Kolja21, in response to your talk page message saying that you reverted my example at WD:P, here are some URLs to sites of importance that "work" without slashes, hyphens, or anything. That is, they work with the identifier given by LCCN (for Van Gogh in this example, n79022935) with no extra "punctuation": http://worldcat.org/identities/lccn-n79022935, http://errol.oclc.org/laf/n79022935.html, http://id.loc.gov/authorities/names/n79022935.html. Moreover, does this URL with slashes "work"?: http://id.loc.gov/authorities/names/n/79/22935.html. No.
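For illustration, the three URLs listed above can all be produced from the same stored string by plain substitution. This is a small Python sketch; the dictionary keys and the function name `lccn_urls` are invented here, the URL shapes are copied from the comment above, and whether the services still answer at these addresses is not checked.

```python
# URL shapes copied from the comment above; the keys are arbitrary labels.
URL_TEMPLATES = {
    "worldcat": "http://worldcat.org/identities/lccn-{lccn}",
    "errol": "http://errol.oclc.org/laf/{lccn}.html",
    "id.loc.gov": "http://id.loc.gov/authorities/names/{lccn}.html",
}


def lccn_urls(lccn: str) -> dict:
    """Build one URL per service from an LCCN without extra punctuation."""
    return {name: template.format(lccn=lccn)
            for name, template in URL_TEMPLATES.items()}


if __name__ == "__main__":
    for name, url in lccn_urls("n79022935").items():
        print(name, url)
```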

    Here is a document titled Structure of the LC Control Number. There is absolutely no reference to slashes. Here is a document on the "normalization" of LCCN identifiers [1], which doesn't even include non-normalized examples with slashes (except in a different context, where they are also removed).

    Here is an excerpt from the English Wikipedia article on LCCN: "The hyphen that is often seen separating the year and serial number is optional. More recently, the Library of Congress has instructed publishers not to include a hyphen." One of the slashes that you are supporting is equivalent to the hyphen that is mentioned in that excerpt. You will note that even the hyphen is a) "optional" and b) now discouraged.

    I will not argue this triviality further, but if I'm wrong, I need something more than arguments from appeal to how Wikipedia stores this string for its own use, to convince me. It is not advisable to re-format canonical identifiers to suit one particular use (Wikipedia templates). Thank you. Espeso (talk) 22:41, 10 March 2013 (UTC)

    For the record: Your interpretation ("Kolja21's desire to have the slashes included") is wrong. I'm happy with both formats, but I want to make sure that editors know how to add the LCCN. If Wikipedia and Wikidata are using the LCCN in a different way, that is an important issue. If you are making errors (changing LCCN n/79/22935 to n7922935), of course others might do the same. Since there was no consensus in the discussion mentioned above, we can hold a poll here. --Kolja21 (talk) 00:32, 11 March 2013 (UTC)

    Some more pointers

    The document The LCCN Namespace from 2003 cited above IMHO has to be seen in the context of the lccn.info URI scheme (cf. http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:lccn/ ) which - like all .info schemes - nowadays should be considered as only of historic interest.

    Part of the official MARC21 documentation are the following documents dealing extensively with the form of the LCCN ("CN" used to stand for "Call Number" but is now "Control Number"): MARC Bibliographic and MARC Authority. The document Structure of the LC Control Number (last revision 2006) is linked from the MARC Documentation page and appears to combine these.

    This means that the official LCCNs still are exclusively of the form ('#' depicts blanks) "n##79051955#", i.e. in comparison to the "printed card form" "n79-51955", the different components are blank-padded to the right (alphabetic prefix, supplement number) and zero-padded to the left (serial number), and one has to know the distinction between the types 'A' (years up to and including 2000) and 'B' (years from and including 2001) to get it right (the result is always a string of 12 characters). Clearly the double spaces and trailing spaces of the official form of LCCNs will be a problem not only for human editors but also for generic software processing the numbers.
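The padding described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are invented, the supplement position of type 'A' is assumed to be always blank, and the type 'B' layout (two-character prefix, four-digit year, six-digit serial) is assumed from the MARC documentation mentioned above rather than stated on this page.

```python
import re

# Shape of a normalized LCCN: optional prefix, 2- or 4-digit year, 6-digit serial.
_NORMALIZED = re.compile(r"([a-z]{0,3})([0-9]{2}|[0-9]{4})([0-9]{6})")


def to_marc_form(normalized: str) -> str:
    """Expand a normalized LCCN into the 12-character blank-padded form."""
    match = _NORMALIZED.fullmatch(normalized)
    if match is None:
        raise ValueError(f"not a normalized LCCN: {normalized!r}")
    prefix, year, serial = match.groups()
    if len(year) == 2:
        # Type 'A': 3-character prefix field, 2-digit year, serial,
        # then one supplement position (assumed blank here).
        return prefix.ljust(3) + year + serial + " "
    if len(prefix) > 2:
        raise ValueError("type 'B' is assumed to allow at most two prefix letters")
    # Type 'B' (assumed layout): 2-character prefix field, 4-digit year, serial.
    return prefix.ljust(2) + year + serial


def from_marc_form(marc: str) -> str:
    """Drop the padding blanks; the serial number is already zero-padded."""
    return marc.replace(" ", "")


if __name__ == "__main__":
    padded = to_marc_form("n79051955")
    print(repr(padded), len(padded))
    print(from_marc_form(padded))
```

Going from the padded form back to the compact one only removes blanks, which is one reason the compact form is easier to copy and paste.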

    I checked the XML schema and crosswalks for MARCXML, MADS, and MODS and did not find any constraints, just the instruction "copy the number". Walt Whitman at id.loc.gov demonstrates this:

     <mads:identifier type="lccn">n  79081476 </mads:identifier>
    

    Fortunately, the "sane" form of the LCCN as expressed in the Namespace document and utilized on VIAF (for linking: the display delivers a zero-padded form with one intermediate blank), id.loc.gov (avoids presenting the number), lccn.loc.gov (shows the official form "n##79081476#" with three blanks for the Whitman example), and worldcat.org(?) seems to be becoming ever more widely accepted for linking and URL construction. At this point in time, however, it does not seem to be an official form. This may change suddenly as soon as the Library of Congress redeclares id.loc.gov or lccn.loc.gov URLs from "convenience URLs for services" to official URIs for authority records (or even better: introduces the documented format as an official "web-friendly alternative"), but this has yet to happen.

    For the time being I consider the form "n79081476" not different from the form with parsing hints "n/79/81476" known from the various templates: Both are convenient and friendly to processing, and both are neither official nor a direct reproduction of official presentations of the authority number.  – The preceding unsigned comment was added by Gymel (talk • contribs). 12:14, 11 March 2013 (UTC)

    Gymel, thanks for this great summary. --Jarekt (talk) 13:47, 11 March 2013 (UTC)

    None of the above? :)

    From a data modeling perspective, an LCCN is a complex data type, not an atomic one. As detailed above, it consists of 3 elements. As such, it should be stored as a complex data type--serializing it into a string is basically a hack and it hides semantics. And isn't semantics what we're all about here? :) I realize Wikidata presently doesn't support complex data types, and as far as I've seen from the data model, data types won't be extensible by users, only by developers. On the one hand, that's a shame, but on the other, perhaps what we should use as user-created data types are Wikidata items themselves? So, basically, I am suggesting that LCCNs be stored as items with their components specified by properties, at least for the time being. (This seems to be an issue common to all complex data types, there should probably be a central discussion about it somewhere.) Silver hr (talk) 23:00, 11 March 2013 (UTC)

    Well, LCCNs here are used as identifiers and in semantics as in Semantic Web there is a strong opinion for them to be opaque. The official form differs from the form commonly used in URLs and URIs and only in order to convert one form into the other one has to know the internal structure. The official form has so many issues with respect to whitespace normalization (you can't even reliably copy&paste it!) that make it completely impractical to use. The convenience form for usage in links or URIs is usually not presented to users, therefore it has to be extracted from URLs or can be deduced from the form presented on the web site at hand by applying the proper algorithm. Depending on what form you are presented the most naive conversion ("simply omit all punctuation and blanks") may get an illegal number and the variant "LCCN encoding with structural markup" known from the various authority control templates can be considered as a kind of captcha and transports the meaning: "This entry was made by an LCCN-internal-structure-aware human or process and therefore has a slightly higher probability to be syntactically correct". -- Gymel (talk) 08:03, 12 March 2013 (UTC)
    I feel like for simplicity's sake we should use the simplest usable form, unless there is some reason to believe that we might have a need for atomic pieces. If string form is a hack, it is Library of Congress's hack since they developed it and are using it as their internal identifier. --Jarekt (talk) 13:50, 12 March 2013 (UTC)
    If the purpose of Wikidata is to be a backing data store for Wikipedia, then it makes sense to treat LCCNs as opaque identifiers and simply store them as strings, just as with any other identifier. But from what I gather, the long-term goal of Wikidata is to be a global semantic data repository in its own right. With that in mind, LCCNs and other complex data types have to be acknowledged as such. For better or worse, an LCCN is by design not a mere serial number, it has internal semantics, which to some future user of Wikidata might be relevant, even if it is not relevant for the purpose of storing data from Wikipedia infoboxes.
    As a side note, from what I gather, the closest thing to an official LCCN format is the normalized/canonical form. What you refer to as official form seen here seems to me to be the format for a MARC record, of which an LCCN is only one part.
    Silver hr (talk) 00:16, 13 March 2013 (UTC)
    As of yesterday, templates can now call functions written in Lua. It would be quite simple to write a Lua parser to split a normalized LCCN into smaller components. --Jarekt (talk) 17:56, 14 March 2013 (UTC)

    Alternative

    Why not make that two or three or four separate properties ("LCCN with slashes", "LCCN normalized" etc.), and let Bots do the adding of the "other" forms once one is entered? (As long as Wikidata doesn't have proper handling of more complex formats). --89.244.173.70 06:53, 13 March 2013 (UTC)

    I was thinking about it, but each time you store exactly the same data in multiple forms, you have to define what to do in cases where there are incompatible entries. Then you need a whole infrastructure for synchronizing them and detecting conflicts. The best approach is to keep a minimal amount of data and use algorithms to extract the other forms, and you can derive "LCCN with slashes" from "LCCN normalized" and vice versa. --Jarekt (talk) 11:52, 13 March 2013 (UTC)
    I agree with Jarekt, that's a bad idea from a data modeling perspective. Any time you have a single datum recorded in multiple places you're setting yourself up for trouble, such as the potential for one copy to change and thus be inconsistent with the others. Plus, it's more maintenance work, even if it's done by bots, and there isn't anything to be gained really. (Further reading: w:Data redundancy.) Silver hr (talk) 14:58, 14 March 2013 (UTC)
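Deriving one form from the other, as suggested in this thread, could look roughly like the Python sketch below. The function names are invented, and the slashed form is assumed to follow the "n/79/21400" convention described under Competing Formats (leading zeros of the serial number dropped). The last lines show why simply deleting the slashes is not enough.

```python
import re

# Shape of a normalized LCCN: optional prefix, 2- or 4-digit year, 6-digit serial.
_NORMALIZED = re.compile(r"([a-z]{0,3})([0-9]{2}|[0-9]{4})([0-9]{6})")


def normalized_to_slashed(value: str) -> str:
    """'n79021400' -> 'n/79/21400' (leading zeros of the serial dropped)."""
    match = _NORMALIZED.fullmatch(value)
    if match is None:
        raise ValueError(f"not a normalized LCCN: {value!r}")
    prefix, year, serial = match.groups()
    return f"{prefix}/{year}/{int(serial)}"


def slashed_to_normalized(value: str) -> str:
    """'n/79/21400' -> 'n79021400' (serial zero-padded back to six digits)."""
    parts = value.split("/")
    if len(parts) != 3:
        raise ValueError(f"expected prefix/year/serial: {value!r}")
    prefix, year, serial = parts
    if not serial.isdigit() or len(serial) > 6:
        raise ValueError(f"bad serial number: {serial!r}")
    return f"{prefix}{year}{serial.zfill(6)}"


if __name__ == "__main__":
    print(normalized_to_slashed("n79021400"))
    print(slashed_to_normalized("n/79/22935"))
    # Naive removal of the slashes loses the zero padding:
    print("n/79/22935".replace("/", ""))
```

Since each form can be computed from the other, storing only the normalized string loses no information.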

    format template at top in plain language not regex jargon

    Can someone please do something with the formatting template at the lead. It is close to utter nonsense with its jargon. Only those familiar with regex could interpret it and our instructions should clearly not be aimed at regex-aware contributors. All that is going to do is to scare off people.  — billinghurst sDrewth 06:04, 17 May 2013 (UTC)

    It's primarily aimed at the bot. You could add a plain text description at "allowed values" in property documentation template. --  Docu  at 06:48, 17 May 2013 (UTC)
    A description is at http://www.loc.gov/marc/lccn-namespace.html. I pasted it here and changed the format constraint pattern; it failed to match 18020208 in War and Peace (Q161531). --Akkakk 17:34, 28 June 2013 (UTC)
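The plain-language length rules under "Allowed values" and the general pattern quoted in the Format constraint near the top of this page appear to describe the same set of strings. The Python sketch below checks that on a few values mentioned on this page. The function and constant names are invented, "alphabetic" is assumed to mean lowercase a-z, and the second pattern is the inner part of the authority-only pattern from the property documentation.

```python
import re

# General pattern quoted in the Format constraint near the top of the page.
GENERAL = re.compile(
    r"([a-z]|[0-9]{2}|[a-z]{2}|[a-z][0-9]{2}|[a-z]{3}|[a-z]{2}[0-9]{2})?[0-9]{8}"
)
# Inner part of the authority-only pattern from the property documentation.
AUTHORITY = re.compile(r"(n|nb|nr|no|ns|sh|gf)([4-9][0-9]|00|20[0-1][0-9])[0-9]{6}")

DIGITS = set("0123456789")
LETTERS = set("abcdefghijklmnopqrstuvwxyz")  # assumption: lowercase only


def _all_in(text: str, allowed: set) -> bool:
    return all(ch in allowed for ch in text)


def follows_length_rules(value: str) -> bool:
    """Apply the plain-language rules from 'Allowed values' literally."""
    if not 8 <= len(value) <= 12:
        return False
    head, tail = value[:-8], value[-8:]
    if not _all_in(tail, DIGITS):      # rightmost eight characters are digits
        return False
    if len(head) == 0:                 # length 8
        return True
    if len(head) == 1:                 # length 9
        return _all_in(head, LETTERS)
    if len(head) == 2:                 # length 10
        return _all_in(head, DIGITS) or _all_in(head, LETTERS)
    if len(head) == 3:                 # length 11
        rest = head[1:]
        return head[0] in LETTERS and (_all_in(rest, DIGITS) or _all_in(rest, LETTERS))
    # length 12: two letters, then digits
    return _all_in(head[:2], LETTERS) and _all_in(head[2:], DIGITS)


if __name__ == "__main__":
    for value in ["18020208", "n79022935", "sh85037527", "n7922935", "n/79/22935"]:
        print(value,
              follows_length_rules(value),
              GENERAL.fullmatch(value) is not None,
              AUTHORITY.fullmatch(value) is not None)
```

A value such as 18020208 passes the general rules but not the authority-only pattern, which fits the note in the property description that book editions belong under P1144.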