Difference between revisions of "Property talk:P244"
(used in it:Template:Controllo di autorità)
Revision as of 03:40, 28 August 2013
- 1 Documentation
- 2 Format Summary
- 3 How does the LCCN work?
- 4 Some more pointers
- 5 None of the above? :)
- 6 Alternative
- 7 format template at top in plain language not regex jargon
The format of the LCCN is not decided yet. The LCCN has been in use since the 1800s and there are several competing standards for storing and displaying it. In the few Wikidata pages that use this property, two distinct formats are in use.
Possible formats are:
- "Normalized LCCN" format used by Library of Congress and MARC
- This format is described at www.loc.gov. In the case of Julius Caesar it would be "n79021400". Used by:
- Library of Congress URL: http://id.loc.gov/authorities/names/n79021400.html
- MARC/XML standard: http://id.loc.gov/authorities/names/n79021400.marcxml.xml see "<marcxml:controlfield tag="001">n79021400</marcxml:controlfield>"
- Displayed by Wikipedia Authority control templates, like at de:Gaius Iulius Caesar
- WorldCat URL: http://worldcat.org/identities/lccn-n79022935 (one of the formats accepted by worldcat.org)
- OCLC URL: http://errol.oclc.org/laf/n79022935.html
- "Normalized LCCN" format with space(s) separating leading letter and the number, used by MADS and VIAF
- In the case of Julius Caesar it would be "n 79021400". Used by:
- MADS/XML standard http://id.loc.gov/authorities/names/n79021400.madsxml.xml see "<mads:identifier type="lccn">n 79021400 </mads:identifier>"
- VIAF http://viaf.org/viaf/sourceID/LC%7Cn+79021400 VIAF website can be accessed with LCCN (see URL). VIAF pages display LCCN as "n 79021400"
- Format with "/" separating the 3 parts of the number, used by Wikipedia Authority control templates in a few dozen languages
- In the case of Julius Caesar it would be "n/79/21400" (notice there is no leading "0" in the 3rd segment). Used by:
- Authority control templates, like at de:Gaius Iulius Caesar. There are probably over 300k identifiers. The format was first introduced by de:Vorlage:Normdaten in 2009 and allows greater flexibility in creating identifiers in different formats. The symbol "/" is used because it is the only string separator that can be easily used in the Wikipedia template "language", using the titleparts parser function. At this point it is unclear whether we need that flexibility, and if we do, then Lua has freed us from the constraints of the template "language".
- WorldCat URL: http://www.worldcat.org/identities/lccn-n79-21400. One of the formats accepted by worldcat.org requires a non-normalized format that is easily constructed from the "n/79/21400" format.
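To make the relationship between the three candidate formats concrete, here is a minimal Python sketch that builds the other forms from the slash-separated one. The function names and dictionary keys are hypothetical, chosen only for illustration; the zero-padding of the serial to six digits follows the LoC description quoted further down this page. The last entry shows what naive removal of the slashes produces.

```python
def slashed_to_parts(slashed):
    """Split e.g. 'n/79/21400' into (prefix, year, serial as written)."""
    prefix, year, serial = slashed.split("/")
    return prefix, year, serial


def format_variants(slashed):
    """Return the formats discussed above, derived from the slashed form."""
    prefix, year, serial = slashed_to_parts(slashed)
    padded = serial.zfill(6)  # serial number is six digits after normalization
    return {
        "normalized": f"{prefix}{year}{padded}",        # format #1
        "spaced": f"{prefix} {year}{padded}",           # format #2
        "slashed": slashed,                             # format #3
        "worldcat_hyphen": f"{prefix}{year}-{serial}",  # non-normalized WorldCat form
        "naive_strip": slashed.replace("/", ""),        # loses the padding zero
    }


if __name__ == "__main__":
    for name, value in format_variants("n/79/21400").items():
        print(f"{name:16} {value}")
```

For "n/79/21400" the naive result is "n7921400", which is one digit short of the normalized "n79021400".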
Votes for format #1 "n79021400"
Support Format #3 would be the easiest for us to keep, but I think that in the long term we should abandon that home-brewed format, which is not used or recognized by anybody else. Technical limitations of the past made format #3 the only option in 2009, but I think we can safely abandon it now. My only concern is how to avoid confusing the users who will be filling in that property: both bot-drivers and individual users. --Jarekt (talk) 15:16, 11 March 2013 (UTC)
Support a) #2 sucks; b) the slashes of #3 conflict with some elements of the printed card form; c) providing a pattern for syntax checking on entry should be feasible (some time); d) URL usage seems to converge to this form and construction of differing links should be still possible; e) most displays use a contiguous block of digits (8 or 10 depending on type 'A' or 'B') and therefore the problem of inserting the right amount of zeroes at the right position within the number shouldn't pose itself too often. -- Gymel (talk) 14:10, 12 March 2013 (UTC)
Votes for format #2 "n 79021400"
Votes for format #3 "n/79/21400"
How does the LCCN work?
copied from: Wikidata:Project chat#Proposal: Allow string (and may be other future types) properties to be displayed with formatting defined by a template. --Kolja21 (talk) 21:57, 10 March 2013 (UTC)
- According to www.loc.gov "LCCNs have three components: prefix, year, and serial number. The prefix is optional; if present, it has one to three lowercase alphabetic characters. (Prefixes are maintained in a controlled list.) The year is two or four digits. (For 2000 and earlier the year is two digits, for 2001 and later, four digits.) The serial number (after normalization) is six digits. A normalized LCCN is a character string eight to twelve characters in length." Then the site lists the algorithmic rules on how to normalize various forms of LCCNs. The current Library of Congress site uses the normalized LCCN, see en:Template:Authority control/LCCN. However, the WORLDCAT database, which also uses the LCCN, uses a different form, see en:Template:Authority control/WORLDCAT-LCCN. That is why, the last time this was discussed, the community decided to keep the LCCN as a triplet of codes (separated by "/"), normalize it on the fly to access www.loc.gov, and assemble it in other combinations to access other sites that use the LCCN as a key. That does not mean that we need to decide the same. --Jarekt (talk) 16:19, 9 March 2013 (UTC)
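The normalization rules mentioned in the comment above can be sketched in a few lines of Python. This is a paraphrase of the steps in the LoC namespace document as I understand them (remove blanks; drop a forward slash and everything after it; remove a hyphen and left-pad the part after it with zeros to six digits), not a verified implementation; the function name is hypothetical.

```python
def normalize_lccn(raw):
    """Normalize a printed or displayed LCCN (sketch of the LoC rules)."""
    s = raw.replace(" ", "")       # step 1: remove all blanks
    s = s.split("/", 1)[0]         # step 2: drop '/' and everything right of it
    if "-" in s:                   # step 3: remove hyphen, zero-pad the serial
        left, right = s.split("-", 1)
        s = left + right.zfill(6)
    return s


if __name__ == "__main__":
    for raw in ["n78-890351", "n 79021400 ", "85-2 ", "75-425165//r75", "n/79/21400"]:
        print(repr(raw), "->", repr(normalize_lccn(raw)))
```

Note that the slash rule targets trailing suffixes such as "//r75". Feeding the template format "n/79/21400" through these steps yields just "n", so that format needs its own conversion and cannot be treated as one of the input forms these rules cover.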
For the record, the conversation above has been selectively copied here. I have argued that slashes should not be stored in this property. Ironically the selective copying does not even support Kolja21's desire to have the slashes included. (After all, the above paragraph concludes, "That does not mean that we need to decide the same.")
Kolja21, in response to your talk page message saying that you reverted my example at WD:P, here are some URLs to sites of importance that "work" without slashes, hyphens, or anything. That is, they work with the identifier given by LCCN (for Van Gogh in this example, n79022935) with no extra "punctuation": http://worldcat.org/identities/lccn-n79022935, http://errol.oclc.org/laf/n79022935.html, http://id.loc.gov/authorities/names/n79022935.html. Moreover, does this URL with slashes "work"?: http://id.loc.gov/authorities/names/n/79/22935.html. No.
Here is a document titled Structure of the LC Control Number. There is absolutely no reference to slashes. Here is a document on the "normalization" of LCCN identifiers , which doesn't even include non-normalized examples with slashes (except in a different context, where they are also removed).
Here is an excerpt from the English Wikipedia article on LCCN: "The hyphen that is often seen separating the year and serial number is optional. More recently, the Library of Congress has instructed publishers not to include a hyphen." One of the slashes that you are supporting is equivalent to the hyphen that is mentioned in that excerpt. You will note that even the hyphen is a) "optional" and b) now discouraged.
I will not argue this triviality further, but if I'm wrong, I need something more than arguments from appeal to how Wikipedia stores this string for its own use, to convince me. It is not advisable to re-format canonical identifiers to suit one particular use (Wikipedia templates). Thank you. Espeso (talk) 22:41, 10 March 2013 (UTC)
- For the record: Your interpretation ("Kolja21's desire to have the slashes included") is wrong. I'm happy with both formats, but I want to make sure that editors know how to add the LCCN. If Wikipedia and Wikidata are using the LCCN in different ways, that is an important issue. If you are making errors (changing LCCN n/79/22935 to n7922935), of course others might do the same. Since there was no consensus in the discussion mentioned above, we can hold a poll here. --Kolja21 (talk) 00:32, 11 March 2013 (UTC)
Some more pointers
The document The LCCN Namespace from 2003 cited above IMHO has to be seen in the context of the lccn.info URI scheme (cf. http://info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:lccn/ ) which - like all .info schemes - nowadays should be considered as only of historic interest.
Part of the official MARC21 documentation are the following documents dealing extensively with the form of the LCCN ("CN" used to stand for "Call Number" but is now "Control Number"): MARC Bibliographic and MARC Authority. The document Structure of the LC Control Number (last revision 2006) is linked from the MARC Documentation page and appears to combine these.
This means that the official LCCNs are still exclusively of the form ('#' depicts blanks) "n##79051955#", i.e. in comparison to the "printed card form" "n79-51955", the different components are blank-padded to the right (alphabetic prefix, supplement number) and zero-padded to the left (serial number), and one has to know the distinction between the types 'A' (years up to and including 2000) and 'B' (years from and including 2001) to get it right (the result is always a string of 12 characters). Clearly the double spaces and trailing spaces of the official form of LCCNs will be a problem not only for human editors but also for generic software processing the numbers.
I checked the XML schema and crosswalks for MARCXML, MADS, and MODS and did not find any constraints, just the instruction "copy the number". Walt Whitman at id.loc.gov demonstrates this:
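As an illustration of the padding just described, the following Python sketch rebuilds the 12-character form from a normalized LCCN. The type 'A' layout (3-character prefix field, 2-digit year, 6-digit serial, 1 supplement position) matches the "n##79051955#" example above; the type 'B' layout (2-character prefix field, 4-digit year, 6-digit serial, no supplement position) is my reading of the MARC documentation and should be treated as an assumption. The function name and the `blank` parameter are hypothetical.

```python
import re


def to_marc_form(normalized, blank=" "):
    """Pad a normalized LCCN to the fixed 12-character MARC layout (sketch)."""
    m = re.fullmatch(r"([a-z]*)(\d{8}|\d{10})", normalized)
    if m is None:
        raise ValueError(f"not a normalized LCCN: {normalized!r}")
    prefix, digits = m.groups()
    if len(digits) == 8:
        # type 'A': 3-char prefix field + 8 digits + 1 blank supplement position
        width, tail = 3, blank
    else:
        # type 'B' (assumed layout): 2-char prefix field + 10 digits
        width, tail = 2, ""
    if len(prefix) > width:
        raise ValueError(f"prefix too long for this year type: {prefix!r}")
    return prefix.ljust(width, blank) + digits + tail


if __name__ == "__main__":
    print(to_marc_form("n79051955", blank="#"))    # blanks shown as '#'
    print(to_marc_form("n2001012345", blank="#"))
```

Passing `blank="#"` only makes the padding visible, as in the notation used above; the stored form would use real spaces.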
<mads:identifier type="lccn">n 79081476 </mads:identifier>
Fortunately, the "sane" form of the LCCN as expressed in the Namespace document and utilized on VIAF (for linking: display delivers a zero-padded form with one intermediate blank), id.loc.gov (avoids presenting the number), lccn.loc.gov (shows the official form "n##79081476#" with three blanks for the Whitman example), worldcat.org(?) seems to be becoming ever more widely accepted for linking and URL construction. At this point in time, however, it does not seem to be an official form. This may change suddenly as soon as the Library of Congress redeclares id.loc.gov or lccn.loc.gov URLs from "convenience URLs for services" to official URIs for authority records (or better still: introduces the documented format as an official "web-friendly alternative"), but this has yet to happen.
For the time being I consider the form "n79081476" no different from the form with parsing hints "n/79/81476" known from the various templates: both are convenient and friendly to processing, and both are neither official nor a direct reproduction of official presentations of the authority number. – The preceding unsigned comment was added by Gymel (talk • contribs). 12:14, 11 March 2013 (UTC)
None of the above? :)
From a data modeling perspective, an LCCN is a complex data type, not an atomic one. As detailed above, it consists of 3 elements. As such, it should be stored as a complex data type--serializing it into a string is basically a hack and it hides semantics. And isn't semantics what we're all about here? :) I realize Wikidata presently doesn't support complex data types, and as far as I've seen from the data model, data types won't be extensible by users, only by developers. On the one hand, that's a shame, but on the other, perhaps what we should use as user-created data types are Wikidata items themselves? So, basically, I am suggesting that LCCNs be stored as items with their components specified by properties, at least for the time being. (This seems to be an issue common to all complex data types, there should probably be a central discussion about it somewhere.) Silver hr (talk) 23:00, 11 March 2013 (UTC)
- Well, LCCNs here are used as identifiers, and in semantics as in Semantic Web there is a strong opinion that identifiers should be opaque. The official form differs from the form commonly used in URLs and URIs, and only in order to convert one form into the other does one have to know the internal structure. The official form has so many issues with respect to whitespace normalization (you can't even reliably copy&paste it!) that it is completely impractical to use. The convenience form for usage in links or URIs is usually not presented to users; therefore it has to be extracted from URLs or deduced from the form presented on the web site at hand by applying the proper algorithm. Depending on what form you are presented with, the most naive conversion ("simply omit all punctuation and blanks") may yield an illegal number, and the variant "LCCN encoding with structural markup" known from the various authority control templates can be considered a kind of captcha that transports the meaning: "This entry was made by an LCCN-internal-structure-aware human or process and therefore has a slightly higher probability of being syntactically correct". -- Gymel (talk) 08:03, 12 March 2013 (UTC)
- I feel that for simplicity's sake we should use the simplest usable form, unless there is some reason to believe that we might have a need for atomic pieces. If the string form is a hack, it is the Library of Congress's hack, since they developed it and are using it as their internal identifier. --Jarekt (talk) 13:50, 12 March 2013 (UTC)
- If the purpose of Wikidata is to be a backing data store for Wikipedia, then it makes sense to treat LCCNs as opaque identifiers and simply store them as strings, just as with any other identifier. But from what I gather, the long-term goal of Wikidata is to be a global semantic data repository in its own right. With that in mind, LCCNs and other complex data types have to be acknowledged as such. For better or worse, an LCCN is by design not a mere serial number, it has internal semantics, which to some future user of Wikidata might be relevant, even if it is not relevant for the purpose of storing data from Wikipedia infoboxes.
- As a side note, from what I gather, the closest thing to an official LCCN format is the normalized/canonical form. What you refer to as official form seen here seems to me to be the format for a MARC record, of which an LCCN is only one part.
- Silver hr (talk) 00:16, 13 March 2013 (UTC)
Why not make that two or three or four separate properties ("LCCN with slashes", "LCCN normalized" etc.), and let Bots do the adding of the "other" forms once one is entered? (As long as Wikidata doesn't have proper handling of more complex formats). --18.104.22.168 06:53, 13 March 2013 (UTC)
- I was thinking about it, but each time you store the same exact data in multiple forms, you have to define what to do in cases where there are incompatible entries. Then you need the whole infrastructure for synchronizing them and detecting conflicts. The best approach is to keep a minimal amount of data and use algorithms to extract other forms, and you can derive "LCCN with slashes" from "LCCN normalized" and vice versa. --Jarekt (talk) 11:52, 13 March 2013 (UTC)
- I agree with Jarekt, that's a bad idea from a data modeling perspective. Any time you have a single datum recorded in multiple places you're setting yourself up for trouble, such as the potential for one copy to change and thus be inconsistent with the others. Plus, it's more maintenance work, even if it's done by bots, and there isn't anything to be gained really. (Further reading: w:Data redundancy.) Silver hr (talk) 14:58, 14 March 2013 (UTC)
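The point that one stored form is enough can be illustrated with a small round-trip sketch in Python. It assumes the component rules quoted earlier on this page (optional lowercase prefix, two- or four-digit year, six-digit serial); both function names are hypothetical.

```python
import re


def normalized_to_slashed(normalized):
    """Derive the template form, e.g. 'n79021400' -> 'n/79/21400'."""
    m = re.fullmatch(r"([a-z]*)(\d{2}|\d{4})(\d{6})", normalized)
    if m is None:
        raise ValueError(f"not a normalized LCCN: {normalized!r}")
    prefix, year, serial = m.groups()
    # int() drops the padding zeros of the serial number
    return f"{prefix}/{year}/{int(serial)}"


def slashed_to_normalized(slashed):
    """Inverse direction, e.g. 'n/79/21400' -> 'n79021400'."""
    prefix, year, serial = slashed.split("/")
    return f"{prefix}{year}{serial.zfill(6)}"


if __name__ == "__main__":
    for lccn in ["n79021400", "n79022935", "2001000002"]:
        slashed = normalized_to_slashed(lccn)
        assert slashed_to_normalized(slashed) == lccn  # lossless round trip
        print(lccn, "<->", slashed)
```

Because the conversion is lossless in both directions, a bot or template can compute whichever form a target site needs from the single stored value.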
format template at top in plain language not regex jargon
Can someone please do something with the formatting template at the lead. It is close to utter nonsense with its jargon. Only those familiar with regex could interpret it and our instructions should clearly not be aimed at regex-aware contributors. All that is going to do is to scare off people. — billinghurst sDrewth 06:04, 17 May 2013 (UTC)
- It's primarily aimed at the bot. You could add a plain text description at "allowed values" in property documentation template. -- at 06:48, 17 May 2013 (UTC)
- A description is at http://www.loc.gov/marc/lccn-namespace.html. I pasted it here and changed the format constraint pattern; it failed to match 18020208 in --Akkakk 17:34, 28 June 2013 (UTC)
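For readers who want to experiment with a format constraint, here is a minimal Python sketch of a pattern for the normalized form. It is not the pattern actually used on this property; it is derived only from the rules quoted earlier on this page (optional one-to-three-letter lowercase prefix, two- or four-digit year, six-digit serial, eight to twelve characters in total), and the names are hypothetical.

```python
import re

# 8 digits (2-digit year + serial) allow up to 3 prefix letters;
# 10 digits (4-digit year + serial) allow at most 2, to stay within 12 characters.
NORMALIZED_LCCN = re.compile(r"[a-z]{0,3}\d{8}|[a-z]{0,2}\d{10}")


def is_normalized_lccn(value):
    """Return True if value looks like a normalized LCCN (sketch)."""
    return NORMALIZED_LCCN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["18020208", "n79021400", "n 79021400", "n/79/21400", "n7922935"]:
        print(f"{candidate!r:16} {is_normalized_lccn(candidate)}")
```

In plain language: optional lowercase letters, then either eight or ten digits, with no spaces, slashes or hyphens. The prefix-free value 18020208 mentioned above is accepted by this pattern.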