Wikidata:Identifier migration


Identifier property lists:

Properties in range P1 - P999: /0
Properties in range P1000-P1999: /1
Properties in range P2000-P2999: /2

String datatype properties not included:

/Strings

Characteristics of external identifiers

  • will remain strings (standard)
Comment The basic idea is that even if properties get converted, in the short term, this may not change anything. --- Jura 04:01, 4 February 2016 (UTC)
  • may have a formatter URL (e.g. VIAF has, WMF language code doesn't)
  • may have several formatter URLs based on criteria (e.g. VIAF has a single one, IMDB has several different ones)
  • may be unique or may not (e.g. VIAF are, car license plates aren't)
Oppose - MUST be "unique value ... assigned to a single thing" ArthurPSmith (talk) 18:30, 3 February 2016 (UTC)
Comment Another non-unique one is: MNC. I don't see why we would exclude that. --- Jura 04:01, 4 February 2016 (UTC)
Oppose I agree that it must be unique, otherwise it isn't an identifier --Srittau (talk) 10:14, 4 February 2016 (UTC)
Support Many lists of identifiers have errors and screw-ups; ISBN, e.g., is not unique, identifiers may be misreported, etc. Whereas it would be great if all identifiers were unique, reality is too messy. I'd rather leave that to bots and clean up tasks to handle this responsibly than hardcode it in the software and make certain usages impossible. --Denny (talk) 17:15, 5 February 2016 (UTC)
@Denny: I don't think the concern is with misreporting or errors (there will always be a tiny fraction of issues like that), the question is whether the identifier is in principle (designed to be) unique. What do you mean in the case of ISBN? Surely if I have an ISBN, I can look up the associated book, the same ISBN wouldn't point to several different books? ArthurPSmith (talk) 20:06, 5 February 2016 (UTC)
@ArthurPSmith: Actually, the same ISBN can point to different books, see this example discussion. The question is not whether the identifier is in principle designed to be unique. Identifiers are always designed to be in principle unique. The issue is that in reality, many identifier schemes fail to meet this standard, because humanity. And whereas I agree that we should be checking these particularly carefully, I would be highly opposed to designing our system to be so inflexible that it cannot deal with these realities. --Denny (talk) 18:19, 11 February 2016 (UTC)
@Denny: We're trying to decide what an identifier is, so the question here is whether identifiers are in principle designed to be unique. (Maybe it seems like an obvious statement, but there are things which have been added to the list which are not designed to be unique) - Nikki (talk) 18:58, 11 February 2016 (UTC)
@Nikki: If we are merely trying to define what an identifier is in theory, that is fine, then I agree with you. But if we are here specifying a software feature and how it is supposed to behave - e.g. absolutely not allow two items with the same identifier and thus make it impossible for us to state that these two books have the same ISBN - then I would argue for caring about how many identifier schemes turn out in reality. --Denny (talk) 18:54, 12 February 2016 (UTC)
@Denny, Nikki: no, we are NOT specifying a software feature, we are trying to figure out what should be collected together under the heading "external identifiers". I think the tables I have put together should help - it's the stuff that's frequently not unique that's the problem. Most identifiers are 99% to 100% unique values. Not so for many other properties being considered. That's the concern. ArthurPSmith (talk) 19:05, 12 February 2016 (UTC)
@Denny: (edit conflict) I support the phrasing: "are unique or designed to be unique" or "are unique or generally treated as unique" -- as in the ISBN number example (or in my music research world, RISM identifiers are designed to identify every music library or every music repository uniquely in the world, but there have been errors where the same identifier was accidentally assigned to two repositories). A program can assume that all ISBNs will be unique and function 99.9% of the time (but not 100%). I think that this case differs from something like "birth day" where any program that assumes each person has a unique one will fail in almost all cases. Requiring absolute uniqueness in every conceivable circumstance will rule out too many good identifiers. Mscuthbert (talk) 19:11, 12 February 2016 (UTC)
@ArthurPSmith, Mscuthbert: OK, if we are not defining how external identifiers should be implemented, then I am fine. If these are criteria for a human-curated list of what is an identifier and what is not, and NOT for how identifiers should be implemented in Wikidata software, then I agree that the aim of identifiers is, per definition, to be unique. --Denny (talk) 19:13, 12 February 2016 (UTC)
@Denny: -- agreed, definitions are different from software implementation. Mscuthbert (talk) 19:37, 12 February 2016 (UTC)
Comment The question is not whether identifiers are unique. The question is whether the relation between Wikidata items and things identified by one type of identifier is 1-to-1, 1-to-many, or even many-to-many. -- 14:41, 14 February 2016 (UTC)
  • Oppose -- argued above, phrase as "are unique or generally treated as unique" Mscuthbert (talk) 19:37, 12 February 2016 (UTC)
Support You cannot but support this at this moment. Wikidata is not an ontology, and if you do not specify in very high accuracy what something is (as in ontology-level ultra precision) then nothing much will be an identifier. The point is that requiring uniqueness implies that you define a 1-to-1 match between the concept of a type in Wikidata and that of the remote database. This is just factually hardly the case in Wikidata. It will practically rule out most of the identifiers used in the life sciences. I support this definition, as insisting on uniqueness will make you resort to ontological debates. Egon Willighagen (talk) 14:23, 5 March 2016 (UTC)
  • may have a single database or not (e.g. VIAF does, ISBN doesn't)
  • may have a single issuer or not (e.g. VIAF has a single one, MNC/ISBN several)
Oppose - MUST be a single issuer ArthurPSmith (talk) 18:30, 3 February 2016 (UTC)
ok, changed my mind on this - but it should still have a single top-level authority of some sort that may delegate the assignment ArthurPSmith (talk) 15:56, 4 February 2016 (UTC)
Comment why would we exclude ISBN or MNC? --- Jura 04:01, 4 February 2016 (UTC)
I don't want to exclude ISBN, I think that one does need to be in the list. If by MNC you mean mobile network code (P2259) then I definitely don't agree it should be on the list of identifiers. What is it identifying? If you look at its use, for example for AT&T Mobility (Q298594), there are multiple codes assigned to one entity, many times with the same qualifier. It doesn't seem like an identifier to me at all. Maybe that item is not coded the way you think it should be? But if that's the way it's used I would NOT consider that property to be an identifier. ArthurPSmith (talk) 15:56, 4 February 2016 (UTC)
Comment should be something like "must be under the control of a single entity". That is badly worded, but identifiers should be managed by a central organization, which can delegate parts of its responsibility to sub-entities. (E.g. ISBN or also domain names.) --Srittau (talk) 10:14, 4 February 2016 (UTC)
  • may have an explicit naming convention or a conventional one (e.g. VIAF are \d, car license plates may not have one)
Oppose - MUST follow a consistent pattern and have a single canonical representation ArthurPSmith (talk) 18:30, 3 February 2016 (UTC)
What about if there are multiple canonical representations but they can be identified as distinct and converted to each other. Such as phone numbers 00 1 (412) 447-1122 and +1-412-44-711-22. Mscuthbert (talk) 19:44, 12 February 2016 (UTC)
I would argue phone number as such shouldn't be considered an identifier for this and other reasons (a phone number - at least in the past - was often reused for a different person or organization within months due to account changes). But in general to be useful for correlating things (for example checking whether two entities are really the same thing as they have the same identifier) it's important there be a single string representation. So to be a good identifier the property shouldn't be just generic phone number, but phone number written in a specific standard format (if we allowed phone number at all). ArthurPSmith (talk) 20:45, 12 February 2016 (UTC)
I don't think I would consider a phone number an identifier because it serves a different purpose. An identifier is typically a way of identifying something in another dataset. While at any given time, an item's phone number will be fairly unique because of the way phones work, the purpose of the phone number isn't to identify the item in another dataset, it's for contacting the item. - Nikki (talk) 14:06, 13 February 2016 (UTC)
Comment ISNI has two. We use one. --- Jura 04:26, 4 February 2016 (UTC)
@Jura1: ISNI only has one canonical string representation - as 4 groups of digits (or X) separated by spaces. ArthurPSmith (talk) 15:48, 4 February 2016 (UTC)
Question By "naming convention" do you mean "way to write the identifier"? (E.g. with dashes or spaces, groups of four etc.) --Srittau (talk) 10:14, 4 February 2016 (UTC)


  • may have a single naming convention or not (e.g. VIAF has a single one, car license plates several)
Oppose - MUST have a consistent pattern and single canonical representation as a string ArthurPSmith (talk) 18:30, 3 February 2016 (UTC)
Comment "consistent pattern" for botanist author abbreviation (P428) seems to boil down to a unique name. Either "consistent pattern" is vague or this doesn't actually add much. --- Jura 04:01, 4 February 2016 (UTC)
  • can have current string datatype, but not monolingual string or URL
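
The "single canonical representation" point argued above can be illustrated with a minimal Python sketch for ISNI. It takes the shape described in the discussion (four groups of digits, or X, separated by spaces) and additionally assumes that each group has four characters and that "X" appears only as the final character. The function names are hypothetical, the sample value is only a format example, and no check-character validation is attempted.

```python
import re

# Assumed canonical shape: four space-separated groups of four characters,
# digits throughout, with "X" allowed only as the final character.
ISNI_CANONICAL = re.compile(r"\d{4} \d{4} \d{4} \d{3}[\dX]")


def is_canonical_isni(value: str) -> bool:
    """Return True if the string is already in the assumed canonical form."""
    return ISNI_CANONICAL.fullmatch(value) is not None


def to_canonical_isni(raw: str) -> str:
    """Regroup a 16-character ISNI written with or without spaces/dashes."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        raise ValueError(f"not a 16-character ISNI: {raw!r}")
    # Re-insert a space after every fourth character.
    return " ".join(compact[i:i + 4] for i in range(0, 16, 4))
```

Storing only the output of a normaliser like this is one way to get a single string per identifier, so that two items can be compared by plain string equality.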


Additional proposed criteria

  • property should not be primarily used as a qualifier (it should be a property directly on the item identified)
    • added by ArthurPSmith (talk) 18:36, 3 February 2016 (UTC)
    • Comment This would exclude things such as "article ID" (Property:P2322). What might work as an exclusion criterion is that it shouldn't be any identifier appended to some other identifier. --- Jura 04:01, 4 February 2016 (UTC)
      • @Jura1: I'm not sure on how 2322 is used, but I would exclude page(s) (P304) also. I view article ID (P2322) and page(s) (P304) as essentially identical kinds of properties - ways to locate a portion of a book or journal issue. Despite the use of the word "identifier" in P2322, I don't believe it satisfies the criteria we are looking for here. DOI (P356) on the other hand is definitely suitable as an identifier type. ArthurPSmith (talk) 15:48, 4 February 2016 (UTC)
    • Question Can you elaborate or give an example? I don't understand this criterion. --Srittau (talk) 10:14, 4 February 2016 (UTC)
  • It shouldn't be an identifier appended to some other identifier (see previous discussion).
    --- Jura 10:45, 7 February 2016 (UTC)
  • It shouldn't be merely a codified way of representing characteristics (sample: SMILES, TeX)
    • added by Jura 4 February 2016 (UTC)
  • It should be opaque, with no necessary meaning outside of the identification scheme. For example coordinate location (P625) would not qualify as an identifier (even if it were of string datatype) because it has the precise meaning of locating something on Earth's surface, even though it could also be used to look things up in a database. I would argue located at street address (P969) and related properties have the same problem. If part of the purpose of identifiers is to hide them away in the main wikidata display for an item, then properties that provide additional relevant meaning aside from being useful for lookups should not be among those hidden.
    • added by ArthurPSmith (talk) 14:45, 5 February 2016 (UTC)
      • So finally, you are opposing the "alternative proposal", is this correct? For clarity it would help if you would state below.
        --- Jura 01:19, 6 February 2016 (UTC)
        • No, the purpose of these "additional criteria" is to put down suggestions for what it occurs to me are important characteristics of an identifier. If there is a consensus that these are indeed helpful and important then we should add them to the proposal below. If there is no consensus or they are redundant or unhelpful, then I don't have a problem with leaving them out of the definition. But the definition does I believe need some slight refinement. If you have a specific proposal on that feel free to make it. ArthurPSmith (talk) 14:35, 8 February 2016 (UTC)
  • the identifier string should be persistent - the same string should identify the same item essentially forever; if a particular identifier becomes obsolete (due to a merger, end of life of item, etc) it should not be re-used subsequently for a different item.
    • added by ArthurPSmith (talk) 16:24, 5 February 2016 (UTC)
    • Comment I don't believe there is a property for domain name (Q32635), but if there was, it would not qualify under this rule: there is no guarantee of non-reuse of the same name by a different entity. ArthurPSmith (talk) 16:29, 5 February 2016 (UTC)

Alternative Proposal

An identifier is part of a consistent system to uniquely identify entities that are issued under the control of a single authority.

This definition is short and sweet and does not delve into technicalities, such as having formatter URLs or specific representations, which in my opinion just confuses the question of "what is an identifier".

Let's have a look at some examples:

  • Database IDs in general ✓ OK, as long as they are used as external identifiers and are guaranteed to stay consistent.
  • Car number plates Not OK -> not unique (possibly the same for multiple countries), not issued under the control of a single authority (every country has its own authority or authorities).
  • ISBN ✓ OK -> under control by a single authority that delegates responsibility for groups of numbers
    • Comment - Multiple books / works can have the same ISBN http://www.librarything.com/topic/66549 ·addshore· talk to me! 09:30, 9 February 2016 (UTC)
      • @Addshore: - but according to this FAQ from a publisher "once a title is published with an ISBN on it, the ISBN can never be used again. Even if a title goes out of print, the ISBN cannot be reused since the title continues to be catalogued by libraries and traded by used booksellers. " That there are exceptions to this rule among thousands of publishers and millions of books I have no doubt, but I would guess they are very rare. Or is there other evidence on this question? ArthurPSmith (talk) 17:10, 9 February 2016 (UTC)
  • EAN ✓ OK -> same
  • ISNI ✓ OK
  • IATA/ICAO codes ✓ OK
  • inventory number (P217) Not OK -> not part of a unified system, multiple authorities
  • postal code (P281) Not OK -> not a unified system, multiple authorities, does not identify anything except its own area
  • ISO country codes ✓ OK
  • website username (P554) Not OK -> no single authority
  • Paris city digital code (P630) ✓ OK

--Srittau (talk) 10:39, 4 February 2016 (UTC)

Support This proposal makes sense to me. I agree on the "single issuer" concern - many things that are clearly identifiers are issued via delegation rather than centrally from a single issuer but I think it's important there be a top-level single authority in some form. ArthurPSmith (talk) 15:35, 4 February 2016 (UTC)
Comment I think it would help if you would attempt to sort based on either these criteria or the other ones. If you just sort to "good to convert" based on some other criteria or some assumption, it gets even more confusing.
--- Jura 21:54, 4 February 2016 (UTC)
@Jura1: I went through all the properties with id less than 400 and tried to apply the above criteria; if I made a mistake feel free to add to the "disputed" area. If you don't like what I've done at all then SOMEBODY needs to start going through these - it's quite a bit of work to look through each property definition, through usage examples, etc. Srittau's examples have been a very helpful guide (e.g. "database IDs" in general should be OK). We may want to refine the criteria a bit to allow small fractions of exceptions in practice (for example, if a property is unique or single-valued in 90+% of cases, is that OK even though there are exceptions?). ArthurPSmith (talk) 14:27, 5 February 2016 (UTC)
Maybe you should propose to amend the proposal: "it's an identifier if in 90% of the cases it's unique".
--- Jura 01:19, 6 February 2016 (UTC)
The intention of the issuer should count. If a property is designed to be unique and single valued but there is a small fraction of exceptions in the database it should still count as identifier. --Pasleim (talk) 15:00, 8 February 2016 (UTC)
Comment Somehow this seems to tie the definition of an identifier to the creation of Wikidata properties. An identifier is only an identifier if it's used in an identifier-specific property? This doesn't actually describe identifiers.
--- Jura 01:19, 6 February 2016 (UTC)


Question what is the advantage of using this definition compared to the previous one?
--- Jura 01:19, 6 February 2016 (UTC)
because the previous definition allows anything that is string datatype to be listed as an identifier. That is not helpful in distinguishing between strings and identifiers. ArthurPSmith (talk) 14:33, 8 February 2016 (UTC)
Well, characteristics are not criteria, but what would be the problem if a car license plate were considered an identifier instead of a string?
--- Jura 18:08, 8 February 2016 (UTC)
A car license plate does not uniquely identify a particular vehicle. Aside from the multitude of issuing authorities, the same license plate can be re-used on another vehicle; number sequences may also be reused over time, there is no guarantee or expectation that it can be used as a reliable identifier. It can be used to look up a car in a database, but so can characteristics like make, model, color, and street address where it usually parks. That does not make those things identifiers. On the other hand, vehicle identification number (Q304948) seems to me to be clearly an identifier that IS suitable for vehicles; if we had a property for that I would be happy to add it to the list of external identifiers. Can you see the difference between vehicle identification number (Q304948) and "car license plate"? ArthurPSmith (talk) 19:05, 8 February 2016 (UTC)
In the proposal you supported it was excluded merely because " not unique (possibly the same for multiple countries), not issued under the control of a single authority (every country has its own authority or authorities).". Now what would be the problem if we convert such properties from "string"-string to "external identifier"-string. Are there any technical issues with this? What would be the impact of multiple authorities, if any. What of the non-uniqueness? I can't see any.
--- Jura 19:23, 8 February 2016 (UTC)
I don't know what you even mean by a technical issue - why would there be a technical issue converting anything in /Strings to an external identifier? The issue is whether or not it's really an external identifier. The main concern with external identifiers as far as I can tell is that they will be separated off from the main body of statements on an item into a special external identifiers list, and will be linked where possible via formatter url. Presumably only things that really represent external identifiers should be there. Do you see any technical issue with making for example quantity symbol (P416) into an "external identifier"-string? And yet I don't believe it should be as it's really not what I would call an external identifier. The issues are not technical, they are conceptual. ArthurPSmith (talk) 20:29, 8 February 2016 (UTC)
I think there may be a mistaken assumption. Identifiers that haven't been linked now won't be linked on conversion. This is stated above and was discussed before. Any properties on /Strings had been sorted out last year already. If it's merely a conceptual issue for you, you agree that identifiers remain identifiers even if they may have been issued by different authorities and/or are not unique? An inventory number (P217) or website username compared to P2511 or P2013?
--- Jura 05:52, 9 February 2016 (UTC)
Jura, this whole discussion is about what is the meaning of an identifier (or more specifically an "external identifier" as that is what the datatype is being called if you for example try it on test.wikidata.org). Many people have asserted they should be unique (and perhaps have certain other characteristics) as evidenced by the discussion here and previous discussions listed below. You seem to think differently. We need to reach a consensus before we can proceed. I have asked Lydia for a delay in the release of the identifier update until this can be resolved. I don't think the two of us can resolve it between us, particularly when you are wholesale reverting my changes. ArthurPSmith (talk) 15:03, 9 February 2016 (UTC)
Apparently one of the initial Wikidata developers tends to agree with this and this proposal doesn't seem to draw a lot of support here either. Obviously, if there were advantages, it might be worth adding these requirements. Oddly, you can't even state what the advantages would be. It seems that you just started out on erroneous assumptions that a current single formatter url is required. As we already sorted these properties last year, stating some possibly not helpful criteria, resorting properties based on these if they are matched in 90%+ cases, isn't going to advance us much. We might just as well go ahead with the initial list.
--- Jura 03:23, 10 February 2016 (UTC)
Jura, I have no idea what you are referring to, you have not provided a link to "one of the initial Wikidata developers", nor have you provided any link to the discussion involved in whoever "sorted these properties last year" - who did it and what criteria did they use? Nor have you answered a huge number of questions that have been asked of you here and on the linked pages (for example, Pasleim specifically asked you to explain your criteria for disputing the dozens I had reviewed as "good" here and you did not respond). As far as your claim that I "can't even state what the advantages would be" I have stated one: the previous criteria provided absolutely no guidelines on whether something of string datatype was an identifier or not. You have provided no clear guidelines yourself. You provided no answer to my question about why quantity symbol (P416) could not be an external identifier other than that it was decided "last year" - on what grounds? Where was the discussion? But let me state for the record the OTHER advantages I see for clear criteria (such as the above) for "external identifiers":
  1. External identifiers are to be grouped in a separate box, somewhat hidden from the main list of statements on an item; users would expect everything in that box to indeed be of a similar type, not vastly different sorts of things (like UUIDs vs street addresses)
  2. The name "external identifier" indicates that the string involved should "identify" something - there should be a way to look up what that thing is just from the string (and property it came from) with no other information. Having a formatter URL that provides a link to a page about the item clearly does this job, but that piece isn't essential - but there needs to be some way to go from the string to the item unambiguously (with rare exceptional cases).
  3. The name "external identifier" also indicates that there is some "external" and common authority involved that is providing this identifier for public use, not for its own internal purposes.

So on these grounds the above criteria clearly fit the name of this datatype and the main purpose for which it seems it will be used. That is a significant advantage over Jura's criteria which appear to consist of "any property with datatype string that was not excluded last year". ArthurPSmith (talk) 14:55, 10 February 2016 (UTC)

You are still confusing the characteristics enumerated above with exclusion criteria (which you don't seem to care to read).
Besides, I would assume that you would at least try to read the comments and notes on this page and not require people to provide you with links and diffs to comments made by myself and others.
If we attempt to discuss the sample of " not unique (possibly the same for multiple countries), not issued under the control of a single authority (every country has its own authority or authorities).", the digression to "quantity symbol" might be amusing, but it doesn't really help us with the license plate sample, nor with the other external identifiers you don't want to convert on that basis. These are:
  • inventory number (P217) - per ArthurPSmith: "non unique, not managed by a single authority"
  • ticker symbol (P249) - per ArthurPSmith: "non unique, not managed by a single authority (symbol depends on market)"
  • postal code (P281) - per ArthurPSmith: "not unique, not managed by a single authority (each country has its own system of postal codes)"
  • station code (P296) - per ArthurPSmith: "not unique, many different sources (every railway system can use a different code)"
  • Commons category (P373) - per ArthurPSmith: "multiple items may share the same commons category, I don't think this is suitable as an identifier"
Possibly you may want to convert them now, as they meet your new criteria (1: presentable in a separate box, 2: possibility to look it up somewhere, 3: external authority).
I suppose one could argue that Commons is not external, so we could exclude that one based on that.
--- Jura 13:18, 11 February 2016 (UTC)
Jura, this comment is clear evidence to my mind that YOU are not reading or understanding what has been discussed here. All of the properties you have just asked me about (except Commons category) plainly do not meet #2 on my "advantages" list: you can not look up the entity knowing only the string value and the property number. Where would you look up "inventory number" "INV 779"? What does that tell you? What about "station code" "820"? What railway station is that do you suppose? ArthurPSmith (talk) 14:37, 11 February 2016 (UTC)
by the way, you again have provided no references, and you introduce a term for the first time on this page - "exclusion criteria" - with no reference or explanation. What is it you want me to read? Please point to it! I've read every comment on this page and on the linked pages and I have no idea what you are talking about. ArthurPSmith (talk) 14:39, 11 February 2016 (UTC)
Looking at the first item with the first property (Q594#P217) it seems fairly obvious.
--- Jura 18:14, 11 February 2016 (UTC)
What seems fairly obvious? Jura, you are being cryptic. Q594#P217 has "inventory number" value of "DBYMU.1946/48" but additionally on the property has the qualifier collection (P195) Derby Museum and Art Gallery (Q8012). So that inventory number is in reference to a specific collection. That is additional information required for lookup - it does not meet the description I gave: " there should be a way to look up what that thing is just from the string (and property it came from) with no other information" if you need also a qualifier to look up the item. The example I gave above, "INV 779", is already in wikidata under another item. How would you look that item up externally? Where would you go to find out what it referred to? You need more information. Please try to answer questions plainly without this cryptic style. Thanks. ArthurPSmith (talk) 19:58, 11 February 2016 (UTC)
Support this proposal, mainly per ArthurPSmith --Pasleim (talk) 14:38, 8 February 2016 (UTC)


BTW, how about Alexa rank (P1661)? --Liuxinyu970226 (talk) 09:11, 9 February 2016 (UTC)
That doesn't qualify on several grounds - it's a quantity, not a string datatype for one. As a ranking it would identify a website at one point in time but it is continually changing and so not useful for identification over time. ArthurPSmith (talk) 14:59, 9 February 2016 (UTC)

Discussions

  • Oct 2015
  • Jan 2016
  • February 2016.
    • Note particularly definition from Nikki - " strings which are unique values created by a single issuer (following a consistent pattern) and assigned to a single thing can definitely be considered identifiers" @Pigsonthewing, Srittau, Nikki, Multichill: please comment on the above, which so far has been edited just by myself and the original writer Jura - thanks! ArthurPSmith (talk) 21:50, 3 February 2016 (UTC)
      • @ArthurPSmith: I'm not sure if you misunderstood what I wrote, but it wasn't intended as a set of criteria that identifiers must match. I only meant that the subset which do match those criteria can (in my opinion) definitely be considered identifiers. The idea is to move the properties which we can agree on out of the way so that we can look more carefully at the rest. - Nikki (talk) 09:57, 4 February 2016 (UTC)
        • @Nikki: - I guess it was a slight misunderstanding, but it seemed to me to capture the essence of what I consider to be an identifier nicely in a single sentence, so I ran with it. I think it's a better starting point than the alternative we have which seems to be "it's a string". ArthurPSmith (talk) 15:38, 4 February 2016 (UTC)
  • Technical information - this was posted to a mailing list Friday Feb 5 describing the technical implementation of the new identifier (and math) data types. Note in particular the plans to use the identifiers in semantic-web mode as URI's (defined I think by the formatter URL where possible) - that is, there is a real need for these to be unique (in principle) values. Relevant portion of the email: ExternalId data type, we would like to use resource URIs for external IDs (in 'direct claims'), if possible. This would only work if we know the base URI for the property (provided by a statement on the property definition). For properties with no base URI set, we would still use plain string literals. ArthurPSmith (talk) 21:08, 8 February 2016 (UTC)
    • From the development side uniqueness is not a requirement. Lydia Pintscher (WMDE) (talk) 10:15, 10 February 2016 (UTC)
      • Right, and that's a good thing, as we definitely will have exceptions. Nevertheless, the intention as stated there to "use resource URIs for external IDs" implies something important in semantic-web terms, i.e. the external ID in many (most) contexts should correspond to a resource URI, a node in the semantic web. The logic of linked data requires a given URI to always mean the same thing. Non-uniqueness will break stuff (result in incorrect inferences), and ideally we ought to want to break as little as possible... ArthurPSmith (talk) 15:35, 10 February 2016 (UTC)
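
The behaviour described in the quoted email (a resource URI when a base URI is known for the property, a plain string literal otherwise) could look roughly like the Python sketch below. Everything concrete in it is an assumption for illustration: the function name, the tuple-based return value, the simple concatenation of base URI and percent-encoded value, and the example.org base URI. It is not the actual Wikibase implementation.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def external_id_to_rdf_term(value: str, base_uri: Optional[str] = None) -> Tuple[str, str]:
    """Map an external-identifier string to a (kind, text) pair.

    kind is "uri" when a base URI is known for the property and
    "literal" otherwise, mirroring the fallback described in the email.
    """
    if base_uri is None:
        # No base URI set on the property: keep a plain string literal.
        return ("literal", value)
    # Assumed construction: append the percent-encoded value to the base URI.
    return ("uri", base_uri + quote(value, safe=""))


# Hypothetical base URI, used only to show the two branches.
print(external_id_to_rdf_term("12345", "https://example.org/id/"))
print(external_id_to_rdf_term("12345"))
```

Under this sketch, two items carrying the same value for the same property would point at the same URI, which is why non-unique values matter for linked-data inference as argued above.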

Anyway, I have created the following pages which may help this discussion move forward:

These have the same lists of properties on the pages here, but with additional statistics on number of uses, percentage of single-valued entries, and percentage of unique values within wikidata as of this morning. Properties colored red have less than 90% unique values, and probably should not be considered as identifiers, unless there's some error in their usage here in wikidata or there's some other clear reason they really should be (depending on exactly what our criteria are!). Properties colored yellow have between 90 and 99% unique, and should be looked at carefully as to whether they qualify or not. Properties with 99% or more unique (uncolored in the tables) ought to qualify on uniqueness grounds if they've been used at least 1000 times within wikidata, but may fail on other criteria, such as a large fraction of multiple-valued entries (if that's a criterion). Properties with less than 1000 uses should probably also be looked at somewhat carefully concerning the intent of the issuing authority etc. Properties that have not been used at all are identified by a blue color; there are about a dozen of those in the list, I'm not sure why they weren't at least set up with example values when created... ArthurPSmith (talk) 20:32, 11 February 2016 (UTC)
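
A minimal Python sketch of the color coding described above, assuming the number of uses and the percentage of unique values have already been computed for each property. The function names and the sample figures are hypothetical; only the thresholds (90%, 99%, 1000 uses) come from the comment.

```python
def uniqueness_color(uses: int, pct_unique: float) -> str:
    """Map a property's usage statistics to the color used in the tables."""
    if uses == 0:
        return "blue"       # property has not been used at all
    if pct_unique < 90:
        return "red"        # probably should not be considered an identifier
    if pct_unique < 99:
        return "yellow"     # needs a careful look
    return "uncolored"      # 99% or more unique


def qualifies_on_uniqueness(uses: int, pct_unique: float, min_uses: int = 1000) -> bool:
    """True if the property has enough uses and is at least 99% unique."""
    return uses >= min_uses and pct_unique >= 99


# Made-up figures, for illustration only.
for uses, pct in [(0, 0.0), (4200, 84.5), (1500, 97.2), (25000, 99.8), (300, 100.0)]:
    print(uses, pct, uniqueness_color(uses, pct), qualifies_on_uniqueness(uses, pct))
```

A property can be uncolored and still fail the second check, as in the last sample row, because it has fewer than 1000 uses.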

  • Comment I would like to apply the above as practical "good to convert" criteria: i.e. over 1000 uses in wikidata with over 99% unique values; if I don't hear otherwise then later today I plan to move these properties to the "good" lists on the Addshore pages here with an associated comment, so the developers will at least have something to work with next Tuesday & Wednesday. ArthurPSmith (talk) 15:41, 12 February 2016 (UTC)
    • Not sure if this actually tells us if they are external identifiers or not. There doesn't seem to be a consensus for the implied criteria. We might just as well go ahead with the sort done last year.
      --- Jura 16:34, 12 February 2016 (UTC)
      • Um, there is DEFINITELY no consensus to "go ahead with the sort done last year". You haven't provided any link or reference on how that was done, what criteria were used. You already agreed there were at least some mistakes in that sort (Danish Bibliometric Research Indicator level (P1240)). I'm trying to get at least a minimum consensus list here; if you don't agree with the above proposal, and I don't agree to "go ahead with the sort done last year", how do you propose we resolve this? ArthurPSmith (talk) 17:52, 12 February 2016 (UTC)
  • Comment well I am going to at least move everything colored in my charts to "disputed" as they definitely are not meeting some of the criteria we have discussed. ArthurPSmith (talk) 14:13, 15 February 2016 (UTC)

Question to sort out (1)

The main question seems to be whether the value "DBYMU.1946/48" for Q594 is an external identifier or not. There seem to be three opinions:
  • (A) It is an external identifier because it can identify the object and it is used in a string property.
  • (B) It would be an external identifier if it were used in a property called "Derby Museum and Art Gallery inventory number", but it is not because "inventory number" could be an identifier of a museum in Manchester.
  • (C) It is an external identifier because the applicable inventory can be deduced from the statement with the inventory number or another property (Q594#P217 has a qualifier).
Did I miss any options?
--- Jura 16:34, 12 February 2016 (UTC)
  • So obviously I am of the opinion (B). This is not the only question though - see my above question on IATA airline designator (P229). The same string is frequently used for multiple different airlines (either over time or if they are well separated in operations) - is it an external identifier or not? ArthurPSmith (talk) 17:13, 12 February 2016 (UTC) - Also I would note that "DBYMU.1946/48" is not the only sort of value provided by the property inventory number (P217) - there are 7 items in wikidata with the simple value "705" for that property. And thousands of other duplicates of that sort. So my main complaint about inventory number (P217) is essentially the same one as for IATA airline designator (P229) - it is not (necessarily) a distinct-enough string to be used to identify anything. ArthurPSmith (talk) 17:18, 12 February 2016 (UTC)
    • I'm not sure what fourth option to provide that could link this to IATA airline designator (P229)? Can you suggest one? I'm not even sure how that could relate to this question. It seems to me a rather unfortunate amalgam obscuring the question about a value of P217. If we keep confusing criteria, we won't be getting anywhere.
      --- Jura 14:34, 15 February 2016 (UTC)
      • It is not a question of a 4th option on this question, it is a question of another question. To be explicit, I have added the second question below. ArthurPSmith (talk) 15:41, 15 February 2016 (UTC)

Question to sort out (2)

I believe the main question is whether deliberately non-unique "identifiers", or indeed things that are more classifications than identifiers, should be grouped with and considered to also be "external identifiers". For example, inventory number (P217) has 7 different wikidata entries with the value '705'. IATA airline designator (P229) has 6 different wikidata entries with the value 'MV'. These particular examples are short strings and so not surprisingly non-unique. Do such non-unique strings belong in the separate box that is being planned for "external identifiers"?
  • (A) Yes, every string-valued property that was not decided to be a non-external identifier last year is an identifier, there's no need to question this further.
  • (B) Yes if the property uses the term "identifier" at all, it's an identifier, regardless of how it looks or works in practice
  • (C) No, users will expect strings grouped as "external identifiers" to uniquely identify an entity and always point to that same entity, with only minor exceptions.

ArthurPSmith (talk) 15:41, 15 February 2016 (UTC)

  • Maybe you want to add (D), a lengthy and detailed review of the sort done last year found 5 properties that shouldn't have been included + a series of more recently created properties that property creators forgot to add to the list. Thanks to those who helped sort them out.
    --- Jura 07:09, 17 February 2016 (UTC)
    • Who did this "lengthy and detailed review"? What were the criteria for finding 5 (it's 5 now?) that were incorrect? We have hundreds of properties still on the "unchecked" lists that I don't think anybody has thought carefully about yet. We have hundreds that are "disputed" but you seem to think only 5 of them are actually problematic - why? Your assertion that they have been reviewed in detail is absurd. I don't even know what criteria were used in the "sort done last year" and you have failed after repeated queries on this point to produce anything resembling a coherent description of how that sort was done. I have done a lot of work here putting together statistics and suggesting options. It's way past time for you to actually respond to the questions you have been repeatedly asked here. I assume you think IATA airline designator (P229) is just fine - why do you think so? Please describe in concrete terms what makes you think it qualifies, beyond the fact that somebody decided it qualified in the "sort done last year". ArthurPSmith (talk) 14:44, 17 February 2016 (UTC)
      • It was a suggestion for an option. If we can't even formulate a multiple choice question without digressing, I'm afraid this is not helping sort this out. Maybe you could help us with another proposal of Sbrit to sort out instead: which samples to pick for father (P22).
        --- Jura 15:17, 18 February 2016 (UTC)
        • Your suggestion for an option was ridiculous, as I explained. There has been no "lengthy and detailed review", there has been a random and haphazard review, many items spot-checked in some detail, but the only systematic detailed work I'm aware of has been the statistical analysis I ran which you have discounted for reasons you still will not explain clearly. What can I do to get you to actually answer questions? Do you or do you not claim that IATA airline designator (P229) is suitable as an external identifier? And if so, what are your reasons, other than that somebody decided so last year? And now you cryptically point to something else I have no idea what you are referring to - there is no user named "Sbrit" on wikidata, and I can find no discussion of anything about father (P22) that seems relevant to anything I might have expertise in or this discussion. You complain of me "digressing" - I don't think I'm the one who is avoiding answering questions here. ArthurPSmith (talk) 16:15, 18 February 2016 (UTC)
          • If you think the nature of IATA airline designator (P229) is a question that needs to be discussed in detail, why don't you formulate as a separate question to be sorted out? If you just bring it up in every other discussion I'm not sure how that helps. You even seem disappointed if people don't follow your digression.
            --- Jura 07:11, 19 February 2016 (UTC)
            • Jura, THIS question specifically mentioned IATA airline designator (P229) as an example. Is your answer your proposal D? I have asked you to explain D because to me it makes no sense. This is not a digression, this is the key question, question #2. ArthurPSmith (talk) 14:30, 19 February 2016 (UTC)
              • To me none of your options seem to make much sense. This is probably why I don't understand why you are holding things up. I wonder whether you try to replace identifiers with the authority control templates that some Wikipedias have. At least, let's try to sort out the airline one in a separate section.
                --- Jura 07:41, 21 February 2016 (UTC)
                • I am holding things up? I'm not the one who moved dozens of items from "good to convert" to "disputed" for unexplained reasons. ArthurPSmith (talk) 14:43, 22 February 2016 (UTC)
  • An inventory number does not identify anything without the collection it belongs to. The identifier is the couple "collection, number". The whole statement, not only the main value, has to be taken into account. I don't know if the identifier datatype is made for that kind of complexity. author  TomT0m / talk page 10:24, 21 February 2016 (UTC)
    <side note>as far as I remember, I was against the use of "inventory number" as a main statement property, but was more for a qualification of a "collection" or a "part of" statement.</side note> author  TomT0m / talk page 10:26, 21 February 2016 (UTC)

What is an identifier, anyway?

My PhD thesis contains a section on identifiers I'd like to share here. During research I found there is not much literature about the general concept of identifiers. Maybe the following excerpt from section 3.2 (page 59-71) is of interest to somebody:

In its most general form, a digital identifier is a piece of data (string, number, letter, symbol, etc.) that refers to an object. [...] In contrast to general metadata, an identifier should be unique (no homonyms), persistent and short, at least in some context. Distinct objects must have distinct identifiers to avoid ambiguity, and the number of identifiers that refer to the same object (synonyms) should be low for practical reasons. [...]

Unambiguity (each identifier must refer to only one object) and uniqueness (for each object there should be only one identifier) are often combined as uniqueness as the most important requirement for an identifier. Other properties frequently cited as important qualities are persistence (identifiers should not change over time), scope (the context of an identifier should be broad or even ‘global’), readability (identifiers should be easy to remember or contain information), and actionability (given an identifier one should be able to do something with it, for instance access the identified object). [...]

To ensure the required properties it needs an identifier system as controlled mechanism or convention for creating, managing, and using identifiers [...] An identifier system defines which identifiers exist (registry); or how identifiers are created and managed (assignment politics); how recorded associations between identifier and referent can be looked up (resolving); which syntax rules as naming conventions apply (grammar); or which relations to other identifier systems exist [...] Above all, many goals of an identifier system cannot be solved on a purely technical level.

I doubt that this discussion will find a commonly agreed on, strict and easy definition of identifier. As rules of thumb I'd say identifiers

  • need to refer to something
  • should at least aim at unambiguity, uniqueness and persistence
  • come with some kind of identifier system

-- JakobVoss (talk) 19:50, 15 February 2016 (UTC)

  • Hi Jakob - very interesting! I think your definitions (uniqueness etc) are opposite of ours as we are looking at the identifiers via properties on items. That is, to me (and I think wikidata), "unique" refers to the relationship associated with the constraint Template:Constraint:Unique_value, but that actually corresponds to what you call "unambiguity" for identifiers. Your use of the term "unique" actually corresponds to the constraint Template:Constraint:Single_value - what I have been referring to as single-valuedness. I think uniqueness or "unambiguity" in your terminology is the more important constraint, as your first paragraph seems to indicate also. ArthurPSmith (talk) 15:21, 16 February 2016 (UTC)
  • Still have to read it, but the excerpt looks interesting. I think it's in a direction we want to go as we wouldn't want to split the screen differently merely based on some calculation. (Unique and single value constraints have no bearing on the screen layout).
    --- Jura 17:45, 16 February 2016 (UTC)
    • Everyone (other than you?) seems to agree that identifiers are intended in principle to be unambiguous and unique. That shows up in the data as a high percentage of single and unique values. It's not "merely" a calculation, it's a way of actually measuring how unambiguous and unique the property is (based on the data we have). - Nikki (talk) 09:51, 17 February 2016 (UTC)
      • I think we mainly disagree on the way to measure that. I think most of us agree that this shouldn't be constrained to some current property-item relation. It seems we just don't understand the same way how IATA airline designator (P229) or catalog codes work when identifying.
        --- Jura 10:38, 17 February 2016 (UTC)

Beside the form there is also a content aspect. Let's take a book. A "book" can have an identifier as

  1. a work GND ID (P227) [= Wikipedia article]
  2. an edition DNB editions (P1292) [should have its own item, see Wikidata:WikiProject Books ]
  3. hard or soft cover (first, second, third ... edition) ISBN-13 (P212)
  4. the work or edition (translation of a work) on a website, that talks about books

All four aspects have their own "identifiers" but they are quite different. For a bookshop an ISBN is unique. For Wikidata the ISBN is ambiguous (= one item can have multiple ISBNs). --Kolja21 (talk) 13:25, 4 March 2016 (UTC)

  • Kolja21 - just to be clearer on terminology, "one item can have multiple x's" is being referred to mostly on this page as violations of the single-valued constraint, though in Jakob's discussion above the term used is "unique". In any case "unambiguous" in Jakob's discussion = "unique constraint" on the rest of this page is a different criterion. I agree that one item having multiple instances of an external identifier is not really a big problem. It's even more the case with ISSN's than ISBN's due to the long-running nature of serials that the ISSN identifies, but I still think it belongs in the external identifiers list. That is I believe the criteria we are settling on are that identifiers should be "unique" ("unambiguous" in Jakob's terminology) but they may not be single-valued (but probably shouldn't usually have more than a handful of values for a given item). ArthurPSmith (talk) 15:33, 4 March 2016 (UTC)
    • Thanks for the clarification. For non-native English speakers this is a tough subject. --Kolja21 (talk) 17:34, 4 March 2016 (UTC)
      • confusing for all of us I think! I wonder if we should try to turn all this into an RFC for a clear definition of external identifiers? (there was a comment on project chat asking why that wasn't done...) ArthurPSmith (talk) 18:57, 4 March 2016 (UTC)
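
To make the two terms used on this page concrete, here is a small Python sketch that checks one property's statements against both constraints: the "unique value" constraint ("unambiguous" in Jakob's terminology) and the "single value" constraint ("unique" in Jakob's terminology). The function name, the input format and the sample data are assumptions made for this example; this is not how the constraint reports are actually generated.

```python
from collections import defaultdict


def constraint_violations(statements):
    """statements: iterable of (item_id, value) pairs for one property.

    Returns two dicts:
      - unique-value violations: value -> items sharing that value
      - single-value violations: item -> values given to that item
    """
    items_by_value = defaultdict(set)
    values_by_item = defaultdict(set)
    for item, value in statements:
        items_by_value[value].add(item)
        values_by_item[item].add(value)
    unique_value = {v: sorted(i) for v, i in items_by_value.items() if len(i) > 1}
    single_value = {i: sorted(v) for i, v in values_by_item.items() if len(v) > 1}
    return unique_value, single_value


# Made-up item IDs and values, for illustration only.
sample = [("Q1", "705"), ("Q2", "705"), ("Q3", "978-A"), ("Q3", "978-B")]
unique_value, single_value = constraint_violations(sample)
print("unique value violations:", unique_value)
print("single value violations:", single_value)
```

In the sample, the value "705" shared by two items is the kind of violation that matters for identifiers under the criteria discussed here, while the item with two values is the case described above as not a big problem.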

/Strings

JakobVoss: Do you think anything on the above list would qualify as an identifier (except the ones already crossed-out)?
--- Jura 06:45, 17 February 2016 (UTC)

I have not checked all of them but they don't look much like identifiers to me. The only exception I found was sRGB color hex triplet (P465). -- JakobVoss (talk) 16:11, 17 February 2016 (UTC)
I was wondering about sRGB color hex triplet (P465) also. However it doesn't seem to be used in practice just to identify colors so I'm not sure. For information, it's been used 140 times, with 7 duplicates and 9 exceptions to being single-valued. Oddly, Terreneuvian (Q515287) has a sRGB color hex triplet (P465). ArthurPSmith (talk) 19:20, 17 February 2016 (UTC)
  • I skipped that one as it's being used instead of color (P462), this despite its description and constraints. Maybe the two uses should be with different properties. At least we don't appear to have lost other identifiers in the initial selection.
    --- Jura 15:29, 18 February 2016 (UTC)

Completeness

What's the current state of the conversion process? I notice, for instance, that ORCID iD (P496) does not appear to have been converted. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 23:41, 21 February 2016 (UTC)

Andy Mabbett see the attached /0, /1, /2 pages (linked at the top of this page) for detailed lists of identifiers - what has been converted are the things listed as "good to convert" there. There should have been some analysis justifying the conversion. However as the discussion on this page indicates, we don't seem to have an entirely settled definition of what "good to convert" means, so that's a bit of a problem. ArthurPSmith (talk) 14:45, 22 February 2016 (UTC)
Thank you. I've seen those pages. ORCID iD (P496) is listed on /0 as "good to convert", along with a number of others, but not yet done; hence my question. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 06:01, 26 February 2016 (UTC)
Ah, yes, good question, I'd like to know too. The developers seemed to be moving through the list steadily since the introduction of external ID's last week, but the last update was I think Tuesday (Feb 23). So maybe we need to prod some developers to start working on them again? Or maybe better to wait until we have sorted through all the "unchecked" items first? ArthurPSmith (talk) 14:50, 26 February 2016 (UTC)
  • Update! A whole bunch were converted today - see /2. So the work continues. ArthurPSmith (talk) 22:17, 26 February 2016 (UTC)

How is IATA airline designator (P229) an identifier?

This property is just one example but I feel it is typical of the cases that I find problematic. IATA airline designator (P229) is a "designator" by the name, and according to the property description in English at least it is a "two-character identifier for an airline". It is indeed possible to identify the airline given this 2-character code IF you know some additional context: the date of applicability, and the general location of operations. Without that additional information, however, this two-character code is ambiguous, and this is by the intention of the designation authority. If you go to the IATA website and enter 'MB' (for example) right now, you receive 2 results, not 1. If you enter 'MV', you receive zero results, despite the fact that that is the value of this property for 6 airlines in wikidata: Armenian International Airways (Q200412) (dissolved in 2005), Maldivian Air Taxi (Q1423590) (merged in 2013), Air UK Leisure (Q4698233) (renamed to Leisure International Airways (Q6520288) in 1996, merged in 1998), Aviastar (Q4828598) (still in operations?), Manas Air (Q6746971) (ceased operations in 2001). I don't believe this designator should qualify as an "external identifier" among the other wikidata external identifiers because of this ambiguity and lack of long-term persistence. Do others feel differently? If so please explain below. ArthurPSmith (talk) 15:30, 22 February 2016 (UTC)

I agree, this is more a case of "abbreviation". I also wonder about IATA airport code (P238), ICAO airport code (P239), and ICAO airline designator (P230), which are similar in concept. --Srittau (talk) 20:48, 24 February 2016 (UTC)
  • I think they use them to uniquely identify an airline.
    --- Jura 15:37, 5 March 2016 (UTC)

Datatype of properties from Template:Medical properties

I'm posting it from admins noticeboard, as @Multichill: suggested.

Could someone check and change (if not done yet) the datatype of the properties from the {{Medical properties}} Databases list to External identifier? As they're identifiers, not regular statements, their type shouldn't be String. Thanks! --Rezonansowy (talk) 17:20, 24 February 2016 (UTC)

  • Rezonansowy - see the /0, /1, /2 pages linked at the top of this page. Many of the properties listed in that Databases list have been reviewed and are disputed for one reason or another as not actually qualifying as external identifiers - for example SMILES and inChI. If you disagree with those arguments please explain your position through a comment on the associated property. A number of the other properties in that list strike me more as classifications than identifiers (operations and procedures key (OPS) (P1691) for instance - though that's only been used twice in wikidata so it's hard to tell how it would work). ArthurPSmith (talk) 22:18, 24 February 2016 (UTC)
      • @ArthurPSmith: And what about MeSH ID (P486) and MedlinePlus ID (P604)? Most properties ending with ID qualify as External identifier, I think. --Rezonansowy (talk) 23:28, 24 February 2016 (UTC)
      • Have a look at /0. It seems it was initially determined to be an identifier, but disputed by ArthurPSmith.
        --- Jura 09:05, 25 February 2016 (UTC)
      • Rezonansowy - the main definition we seem to have agreed on so far is the above stated "An identifier is part of a consistent system to uniquely identify entities that are issued under the control of a single authority." In practice, identifiers that seem to have that single authority and have been widely applied in wikidata in a way that leads to 99% unique and single-valued statements are not being disputed by anybody (except Jura when I make that assessment). However, MeSH ID (P486) and MedlinePlus ID (P604) don't seem to fit those criteria - in practice it is very often the case that several distinct wikidata items have been given the same ID. Is this a mistake, or are these things not really identifiers in the above sense? If you disagree with this definition and feel it is important these be included, please indicate your reasoning here, thanks! ArthurPSmith (talk) 13:05, 25 February 2016 (UTC) - As a specific example, see MeSH ID (P486) value 'D008579' - it has been assigned to 17 wikidata items, apparently various varieties of meningioma (Q369157). Is this correct? Perhaps this is more a classification system than an identifier system? ArthurPSmith (talk) 13:23, 25 February 2016 (UTC)

Another reason external identifiers should be (almost always) unique

I just ran across this phabricator task to provide a means to look up wikidata entities by their external id values. Sounds useful - but only if external id's are indeed (almost always) unambiguous. Magnus seems to have worked on a tool to do this, though the link seems to be broken at the moment. Anyway, something else to keep in mind. I kind of wish I'd known about this sooner though! ArthurPSmith (talk) 20:18, 29 February 2016 (UTC)

  • Can you clarify this? Consider Wikidata entries that have a ChEBI and a ChemSpider identifier. Currently, both are accepted as unique identifier, but both use a different definition of uniqueness. In fact, there is not a strict 1-to-1 relation between the two. How do you envision Wikidata can overcome this, if it insists on identifiers to be unique? Egon Willighagen (talk) 15:01, 5 March 2016 (UTC)
    • The question is what would you expect a service that looks up a wikidata entity by ChEBI to respond with? A single item (in 99%+ of cases) or a list? If it ought to be a list then maybe it's not really an identifier, but more a classifier. ArthurPSmith (talk) 15:15, 7 March 2016 (UTC)
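
The lookup question above can be sketched in Python as a reverse index from (property, value) to items. This is only an illustration of the idea, not the service proposed in the phabricator task or Magnus's tool; the function names, the property ID "P_EXAMPLE" and the item IDs are all hypothetical.

```python
from collections import defaultdict


def build_index(statements):
    """Map (property_id, value) -> sorted list of item ids.

    statements: iterable of (item_id, property_id, value) triples.
    """
    index = defaultdict(set)
    for item, prop, value in statements:
        index[(prop, value)].add(item)
    return {key: sorted(items) for key, items in index.items()}


def lookup(index, prop, value):
    """Return every item carrying this external id value (possibly none)."""
    return index.get((prop, value), [])


# Hypothetical statements, for illustration only.
index = build_index([
    ("Q100", "P_EXAMPLE", "A-1"),
    ("Q101", "P_EXAMPLE", "B-2"),
    ("Q102", "P_EXAMPLE", "B-2"),
])
print(lookup(index, "P_EXAMPLE", "A-1"))  # one match: behaves like an identifier
print(lookup(index, "P_EXAMPLE", "B-2"))  # several matches: more like a classifier
print(lookup(index, "P_EXAMPLE", "Z-9"))  # no match
```

Returning a list in every case leaves it to the caller to decide how to treat values that match more than one item.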

Open development questions

  • 1. generate links for properties that would remain string (so we don't have to keep relying on the gadget) or have monolingual string
  • 2. generate links for properties that need qualifiers/other properties to generate links
  • 3. generate links for properties that currently have some special coding in the javascript.

I had asked Lydia for an update on (1.)
--- Jura 15:41, 5 March 2016 (UTC)

We can generate links for properties that remain string. But for now the gadget will have to do this. It'll take a while. I think 3 mostly happens for 2. I don't have an answer for this yet. I'd like us to get a list of these once we're a bit further in the conversion and look at specific cases to see what we can do about them. Lydia Pintscher (WMDE) (talk) 16:49, 10 March 2016 (UTC)

Reference work

I'm not sure if an article in a reference work should be treated as an identifier.

For the future ("Wikidata - in pretty!", Reasonator) it would be nice if identifiers (like LCAuth) and references are listed separately. --Kolja21 (talk) 21:10, 6 March 2016 (UTC)

PASE name: identifier or string?

It would be useful to have some input on whether the proposed property PASE name should be an identifier or a string.

This is proposed as a companion property to the just created PASE ID (P2625). Entries in the PASE database have a URL based on the numeric string contained in P2625 (eg Athelstan (Q170017) -> 13909 -> [1]). They also have a unique string intended to be more human-friendly, eg Athelstan (Q170017) -> "Æthelstan 18". It is the latter that the proposed "PASE name" property is intended to capture.

The PASE database is more fine-grained than Wikidata -- there are some instances where reliable sources, and Wikidata, will treat all mentions of a name in a set of documents as referring to the same person, whereas PASE will sometimes have more than one entry, with a headnote that they very likely refer to the same person. In such cases we are likely to be able to identify a primary value for P2625, but we should also note a secondary "probable" additional value. Since the P2625 identifiers and the PASE names are in 1:1 correspondence, it makes sense for the PASE name to be a qualifier to P2625.

Since therefore "PASE name" (i) would not have a URL formatter, and (ii) is envisaged to be used as a qualifier to a P2625 statement, rather than a standalone property, I originally asked for it as a string property, rather than an external identifier. On the other hand, in many respects it is an external identifier, since (a) it is an identifier, and (b) it is used externally. Andy therefore held off from approving the property, pending consideration of what its appropriate type should be.

There are now about 2500 cases where PASE matches have been identified (stored in the en-wiki template PASE). It would be much the easiest to populate both PASE ID and the PASE name at the same time. Since PASE ID (P2625) is now created and ready to go, it would be useful to get a clear decision on the property type for "PASE name", to get that created as well, so that the two could then be populated.  – The preceding unsigned comment was added by Jheald (talk • contribs) at 14:21, 22 March 2016 (UTC).

We have generally not been converting qualifiers to "external id" - the main purpose of the "external id" designation is to list the properties in a separate area and include their URL links directly, which isn't an issue for this proposal. I would recommend this be simply a String datatype. ArthurPSmith (talk) 17:46, 22 March 2016 (UTC)

Proposal to speed up conversion

Please see Wikidata:Project chat#Conversion of datatype to external-id: process is stuck. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:11, 22 March 2016 (UTC)

This is an embarrassing mess, and some people should look internally and review their proposals and processes. There are some clear gimmes that should just get done, and resolve those. Then work on narrowing the gap.  — billinghurst sDrewth 21:39, 22 March 2016 (UTC)

Let's get this done!

It seems we've reached the point where things are not moving anymore and there isn't much consensus building around the remaining ones. Therefore I suggest the following: I go through all the remaining undecided and disputed properties and make a decision on them on Wednesday (11th of May). Then everyone can have a look at it until Monday (16th of May). And then on Tuesday (17th of May) we do the final conversion. Lydia Pintscher (WMDE) (talk) 09:42, 9 May 2016 (UTC)

Hi Lydia - here's the list I don't have anything against converting - but others might: ISNI (P213) CAS Registry Number (P231) EC chemical compound ID (P232) InChI (P234) InChIKey (P235) ISSN (P236) FAA airport code (P240) OCLC control number (P243) ATC code (P267) CALIS ID (P270) ISO 3166-1 alpha-2 code (P297) ISO 3166-1 alpha-3 code (P298) ISO 3166-1 numeric code (P299) ISO 3166-2 code (P300) IETF language tag (P305) MeSH ID (P486) ISO 15924 alpha-4 code (P506) Historic Scotland ID (P709) Minor Planet Center observatory code (P717) ISIN (P946) Brazilian municipality code (P1585) Parliamentary record identifier (P2172) Wiki Loves Monuments ID (P2186) ArthurPSmith (talk) 15:32, 9 May 2016 (UTC)
To add to the list of ArthurPSmith, I would welcome the conversion of: OMIM ID (P492) MeSH ID (P486) MedlinePlus ID (P604) ICD-9 (P493) ICD-10 (P494) NCI Thesaurus ID (P1748) Ensembl Gene ID (P594) HomoloGene ID (P593) RefSeq RNA ID (P639) Ensembl Transcript ID (P704) --Andrawaag (talk) 19:34, 9 May 2016 (UTC)
Would you add some sort of helpful explanation of why you are moving them where?
--- Jura 12:52, 13 May 2016 (UTC)
It is taking super long to do this already and I'd rather concentrate on the ones where there is real disagreement. I am moving many that are clearly identifiers. --Lydia Pintscher (WMDE) (talk) 15:12, 13 May 2016 (UTC)

Sorry it took longer. The ArticlePlaceholder rollout took more of my time than I expected. I have now gone through /0. Please review. I am moving on to the next pages. --Lydia Pintscher (WMDE) (talk) 14:44, 13 May 2016 (UTC)

I've now done /1 as well. --Lydia Pintscher (WMDE) (talk) 15:12, 13 May 2016 (UTC)
And /2 is done as well. Please go over it and see if I made any mistakes or if there are any remaining that people have strong feelings about. Since I've taken longer than I wanted to do this I'm moving the time to review it as well. New conversion date therefore is 19th of May. --Lydia Pintscher (WMDE) (talk) 15:37, 13 May 2016 (UTC)
I don't agree with your approach. Quite a few disputed identifiers are now in the good to convert sections. I prefer quality over speed. By forcing this you're creating a messy situation. Multichill (talk) 16:51, 13 May 2016 (UTC)
I have people complaining one way or the other. I have given everyone a lot of time to get this moving. I don't see how we ever get it done otherwise. --Lydia Pintscher (WMDE) (talk) 19:22, 13 May 2016 (UTC)
I support what Lydia's trying to do here, though I agree some of them shouldn't have been declared "good" based on the discussion that had preceded. I've created a separate section on the '0' page for properties with serious objections to being used as an identifier and moved the ones Multichill specifically objected to there (along with some others I thought were really not close to meeting the criteria we did discuss here). Maybe instead of interspersing objections we should try to separate/sort better so we can at least know the ones that are not going to be a serious problem and let them be taken care of? ISNI in particular has waited an awfully long time here... ArthurPSmith (talk) 19:43, 13 May 2016 (UTC)
Yes that would already help a lot I believe. Thanks! --Lydia Pintscher (WMDE) (talk) 19:48, 13 May 2016 (UTC)
I support the conversion of all properties in the sections Good to convert and Mix of good to convert and disputed. In the latter section there are mainly properties with unique value violations. But these violations arise because there is a mess on Wikidata and not on the external site. --Pasleim (talk) 12:19, 16 May 2016 (UTC)
+1 to Pasleim. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 18:24, 16 May 2016 (UTC)
Conversion of ISNI should be avoided unless some other changes are made or if WMDE doesn't care that all links are broken.
--- Jura 15:18, 16 May 2016 (UTC)
Can you give a specific reason why ISNI is not an external identifier? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 18:24, 16 May 2016 (UTC)
Jura1 is probably referring to the problem that here on Wikidata we use the display format of ISNI, i.e. four blocks of 4 digits, separated by spaces. However, for the links we need to omit the spaces. As with IMDb ID (P345) or SOC Code (2010) (P919), this reformatting is done in MediaWiki:Gadget-AuthorityControl.js. We need to ensure that this will still work after the conversion. --Pasleim (talk) 18:44, 16 May 2016 (UTC)
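As a minimal sketch of the reformatting described above (not the gadget's actual code), the display form of an ISNI can be validated and stripped of its spaces before it is substituted into a formatter URL. The function names and the `https://example.org/isni/$1` formatter URL are illustrative assumptions; allowing a final `X` check character is also an assumption that goes beyond the "four blocks of 4 digits" wording.

```python
import re

# Display form: four space-separated blocks of four characters.
# Assumption: the last character may be the check character "X".
ISNI_DISPLAY = re.compile(r"[0-9]{4} [0-9]{4} [0-9]{4} [0-9]{3}[0-9X]")


def isni_for_link(value: str) -> str:
    """Return the ISNI without spaces, as needed inside a link."""
    if not ISNI_DISPLAY.fullmatch(value):
        raise ValueError(f"not an ISNI in display format: {value!r}")
    return value.replace(" ", "")


def build_link(formatter_url: str, value: str) -> str:
    """Substitute the compact ISNI for the $1 placeholder of a formatter URL."""
    return formatter_url.replace("$1", isni_for_link(value))


if __name__ == "__main__":
    # Hypothetical formatter URL, used only for illustration.
    print(build_link("https://example.org/isni/$1", "0000 0001 2103 2683"))
```

A plain formatter URL with a `$1` placeholder cannot express this space-stripping step on its own, which is why the transformation has to live somewhere else (the gadget, or some other intermediary).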
Yes, the question is if a specific property for ISNI should be using property datatype "external id" and its current features, not if ISNI is an "external identifier".
--- Jura 18:48, 16 May 2016 (UTC)
If there is no question that ISNI is an external identifier(!) then it should be converted ASAP. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:24, 16 May 2016 (UTC)
I don't think you understand the question at hand. Why would we ignore it?
--- Jura 05:49, 17 May 2016 (UTC)
There is a solution to this if the gadget could be made to work at its own URL, just change the formatter URL's to point to the gadget instead of to isni.org or whatever. In any case, this is far from the only external id that does not have a correct formatter URL, it should NOT delay converting ISNI to this form. ArthurPSmith (talk) 15:18, 17 May 2016 (UTC)
If it's something that you know how to do and that just needs to be done beforehand to avoid breaking hundreds of thousands of links, the constructive thing to do would be to do just that.
--- Jura 17:16, 19 May 2016 (UTC)
tools.wmflabs.org would be the place for it I assume. I could work on it but somebody who's actually got an account on there already and owns a space they know how to use might be better suited. But if nobody steps up I'll take a look probably in a week or two. ArthurPSmith (talk) 19:58, 19 May 2016 (UTC)
@Jura1, Pigsonthewing, Lydia Pintscher (WMDE): - check out the new wikidata-externalid-url tool - I imported most of the translations from the authority control gadget, so for example it should work for IMDB id's also. I haven't attempted to actually switch any of the current formatter URL's to use this, I wanted a little feedback from others first. But it should work. ArthurPSmith (talk) 20:24, 31 May 2016 (UTC)
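The idea behind such a resolver can be sketched as follows: the formatter URL points at a small service, which looks up a per-property rule, transforms the stored value, and returns the real target URL to redirect to. This is a hypothetical illustration, not the code of the wikidata-externalid-url tool; the function names, the rule table, the IMDb prefix-to-path mapping and the `example.org` ISNI target are all assumptions.

```python
# Assumed mapping from IMDb ID prefix to URL path segment.
IMDB_PATHS = {"tt": "title", "nm": "name", "co": "company", "ch": "character"}


def resolve_imdb(value: str) -> str:
    """Pick the IMDb URL path from the two-letter prefix of the ID."""
    path = IMDB_PATHS.get(value[:2])
    if path is None:
        raise ValueError(f"unrecognised IMDb ID prefix: {value!r}")
    return f"https://www.imdb.com/{path}/{value}/"


def resolve_isni(value: str) -> str:
    """Drop the display spaces; the target host here is a placeholder."""
    return "https://example.org/isni/" + value.replace(" ", "")


# Hypothetical rule table keyed by Wikidata property ID.
RESOLVERS = {"P345": resolve_imdb, "P213": resolve_isni}


def resolve(property_id: str, value: str) -> str:
    """Return the URL a resolver service would redirect to."""
    try:
        rule = RESOLVERS[property_id]
    except KeyError:
        raise ValueError(f"no rule for property {property_id}") from None
    return rule(value)


if __name__ == "__main__":
    print(resolve("P345", "tt0111161"))
    print(resolve("P345", "nm0000123"))
```

Keeping the rules in one table means a property with several target URL patterns, such as IMDb, needs only a single formatter URL on the property itself.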
Thank you so much! --Lydia Pintscher (WMDE) (talk) 09:25, 1 June 2016 (UTC)
Please can you clarify which external-ids this is intended to be used for? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:24, 1 June 2016 (UTC)
@Pigsonthewing, Lydia Pintscher (WMDE): I have listed the properties it works for on the home page there. I've also added formatter URL values that use this resolver to all the relevant properties - ISNI (P213), IMDB (P345), HURDAT (P502), E number (P628), SOC code (P919), TA98 (P1323) and CricketArchive (P2698). However I don't yet see it actually in action - I assume there's some caching on the external id links? Lydia, how long before this should take effect? ArthurPSmith (talk) 15:28, 1 June 2016 (UTC)
Thank you. LGTM. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 11:35, 2 June 2016 (UTC)
The pertinent question is, "Is ISNI an external identifier?". I understand that perfectly, and have answered, not ignored, it. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:24, 1 June 2016 (UTC)
@Lydia Pintscher (WMDE): regarding the new external id URL service - I updated the formatter URL for the external id CricketArchive player ID (P2698) yesterday (it never had a formatter before) but when I go to an item using it, for example, Abdul Aziz (Q22958463), there is still no URL link for that id as of this morning. It is also not working for other properties such as IMDB and ISNI, but there I only made the new URL a "preferred" formatter, so that might possibly be another issue. How long is the delay before we could expect a replacement formatter URL to function correctly for external ids? ArthurPSmith (talk) 14:26, 2 June 2016 (UTC)
Hmm I think it should happen pretty quickly. Can you try purging the item that has a linked identifier? And turn off the authority control gadget to make sure it doesn't interfere? --Lydia Pintscher (WMDE) (talk) 17:56, 2 June 2016 (UTC)
Hmm. Maybe browser related but I'm not sure how. If I go to Sachin Tendulkar (Q9488) the link to cricketarchive.com does work - so ok, it seems to have taken effect. Anyway, that means ISNI ought to work too now. So I think there's no remaining obstacle here on those grounds. ArthurPSmith (talk) 20:45, 2 June 2016 (UTC)
@Lydia Pintscher (WMDE): Ok, I gave it some time and it looked like (for example for IMDB) external id linking was, even after a week, still not using the new tool-based formatter URLs in general, despite them being marked as "preferred". So I have removed the old formatter URLs and left only the new wikidata-externalid-url links for the properties it handles. However, this is now incompatible with the old authority control gadget so it will have broken at least some links (ISNI should work as the prefix URL is the same, but some of the others do different kinds of transformations). I guess I may need to edit the authority control gadget to fix this. Anyway, just so you know, with that change the links do work correctly (in particular all the IMDB links now work - hurray!) ArthurPSmith (talk) 14:52, 8 June 2016 (UTC)
Hmm, looks like the gadget is edit protected. Which is probably a good thing. Anyway, I left a note on the talk page there about the issue. ArthurPSmith (talk) 15:06, 8 June 2016 (UTC)
Awesome! :) Are we then good to go and convert more? Especially ISNI? --Lydia Pintscher (WMDE) (talk) 09:38, 10 June 2016 (UTC)
Yes, let's do everything in "Good to convert" including ISNI ASAP, thanks! ArthurPSmith (talk) 14:25, 10 June 2016 (UTC)

Requests[edit]