Wikidata:Property proposal/mapping relation type


mapping relation

Originally proposed at Wikidata:Property proposal/Authority control

Description: qualifier that defines the relation of the item to the external identifier more precisely, in terms of the SKOS mapping relations
Data type: Item
Domain: statements of properties of type external identifier (Q21754218)
Allowed values: one of exact match (Q39893449), close match (Q39893184), broad match (Q39894595), narrow match (Q39893967) or related match (Q39894604)
Examples:
  • overseas countries and territories (Q1451600): STW Thesaurus for Economics ID (P3911) → 29738-2 (Overseas territories); "mapping relation" → close match (Q39893184)
  • Lake Constance Region (Q34947397): STW Thesaurus for Economics ID (P3911) → 30083-1 (Lake Constance region); "mapping relation" → exact match (Q39893449)
  • Assessment center (Q265558): STW Thesaurus for Economics ID (P3911) → 12570-6 (Executive selection); "mapping relation" → related match (Q39894604)
Source: Simple Knowledge Organization System (Q2288360) (see formal and extended definition of mapping relations)
Planned use: more complete mapping of STW Thesaurus for Economics (Q26903352) to Wikidata (for now), and of other knowledge organization systems already linked by external identifiers
See also: exact match (P2888), narrower external class (P3950) (for values of type URL)
Motivation

Originally proposed in the Wikidata:Project chat under the title "Making Wikidata fit as a linking hub for knowledge organization systems" and copied over here, as suggested by ChristianKl. Jneubert (talk) 11:00, 25 August 2017 (UTC)

Wikidata covers more and more aspects of human knowledge. In certain domains or subject fields, well-established knowledge organization systems (Q6423319), such as thesauri, classification schemes, taxonomies, or subject heading systems, exist and organize knowledge by defining concepts and optionally their mutual hierarchical or associative relationships. Connecting these systems, which originate mostly from the library and information science world, to Wikidata - via either external identifier (Q21754218) properties or exact match (P2888) - is a widely used and encouraged practice.

For these knowledge organization systems (KOS), Wikidata already works as a linking hub, because it connects concepts defined in one KOS, via the corresponding Wikidata property, to all other concepts connected to the same Wikidata item via other external id properties. However, while mapping external entities is straightforward for people, it gets increasingly difficult for organizations, locations, or abstract concepts. Sometimes the concepts only partly overlap, sometimes there are slight differences in meaning, and sometimes there is only an associative relation (e.g. between an activity and the tool used in that activity). Currently, such kinds of matches cannot be covered cleanly. This suggestion aims to change that, making Wikidata fit as a universal linking hub for KOS.

W3C's Simple Knowledge Organization System (Q2288360) (SKOS) standard defines and explains five mapping relations for linking to external concepts:

  • exactMatch indicates that two concepts have equivalent meaning, and the link can be exploited across a wide range of applications and schemes. The link is meant to be transitive (A = B and B = C means A = C).
  • closeMatch indicates that two concepts are sufficiently similar that they can be used interchangeably in many applications. This link is not meant to be transitive.
  • narrowMatch indicates that one concept is narrower than the other (for the representation of hierarchical links). The link is not meant to be transitive.
  • broadMatch indicates that one concept is broader than the other (inverse of narrower). The link is not meant to be transitive.
  • relatedMatch indicates a non-hierarchical associative relationship between two concepts. The link is not meant to be transitive.
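
The practical difference between the transitive exactMatch and the non-transitive relations can be illustrated with a minimal Python sketch. It assumes that links are given as (concept, relation, concept) triples; the function name and the identifiers in the example are hypothetical and belong to neither SKOS nor Wikidata. Only exactMatch links merge concepts into one cluster, while a closeMatch link leaves its two ends in separate clusters.

```python
from collections import defaultdict


def exact_match_clusters(links):
    """Group concepts connected by exactMatch links, treating them as transitive.

    All other relations (closeMatch, broadMatch, ...) are deliberately
    not propagated, because they are not meant to be transitive.
    """
    parent = {}

    def find(concept):
        # Union-find lookup with path halving.
        parent.setdefault(concept, concept)
        while parent[concept] != concept:
            parent[concept] = parent[parent[concept]]
            concept = parent[concept]
        return concept

    for a, relation, b in links:
        root_a, root_b = find(a), find(b)
        if relation == "exactMatch":
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for concept in list(parent):
        clusters[find(concept)].add(concept)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers, for illustration only.
links = [
    ("wd:A", "exactMatch", "kos1:A"),
    ("kos1:A", "exactMatch", "kos2:A"),
    ("wd:A", "closeMatch", "kos3:B"),
]
print(exact_match_clusters(links))
```

In this example, "wd:A", "kos1:A" and "kos2:A" end up in one cluster (A = B and B = C means A = C), whereas "kos3:B" stays on its own.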

I suggest implementing these relations in Wikidata as follows:

  1. Create four items "close match", "broad match", "narrow match", "related match", with definitions according to SKOS.
  2. Create a new property "mapping relation type" (instance-of Wikidata qualifier (Q15720608)) for use as an optional qualifier for properties of type external identifier, with values constrained to "close match", "broad match", "narrow match" or "related match". An exact match would be implied by a missing "mapping relation type" qualifier.
  3. Create corresponding constraints, in particular constraints to be used instead of "single value" and "unique value", taking into account that 1:1 relationships between an item and a concept identified by an external identifier may be complemented or replaced by multiple qualified mappings (n:m). The use of such constraints in the definition of the external id properties could indicate whether they are intended to be used with different mapping relation types.
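
Step 2 above can be sketched as a small Python check. This is only an illustration of the rule as originally proposed (four allowed values, exact match implied by a missing qualifier), not actual Wikidata constraint code; the function name is hypothetical, and the item IDs are those listed in the proposal box.

```python
# Allowed qualifier values in the original proposal (IDs from the proposal box).
ALLOWED_RELATIONS = {
    "Q39893184": "close match",
    "Q39894595": "broad match",
    "Q39893967": "narrow match",
    "Q39894604": "related match",
}

# Relation implied when no qualifier is present (original proposal only).
IMPLIED_RELATION = "exact match"


def effective_relation(qualifier_value=None):
    """Return the mapping relation expressed by an external-id statement.

    qualifier_value is the item ID used as "mapping relation type"
    qualifier, or None if the statement carries no such qualifier.
    """
    if qualifier_value is None:
        return IMPLIED_RELATION
    if qualifier_value not in ALLOWED_RELATIONS:
        raise ValueError(f"not an allowed mapping relation: {qualifier_value!r}")
    return ALLOWED_RELATIONS[qualifier_value]


print(effective_relation())             # unqualified statement
print(effective_relation("Q39893184"))  # qualified statement
```

Under the later revision of the proposal, "exact match" becomes a fifth explicit value and an unqualified statement no longer implies any particular relation.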

That implementation could work as an extension of the already well-established mechanisms. Because an exact match is implied in the absence of a "mapping relation type" qualifier, it would leave all existing external identifier relations undisturbed, yet would allow adding more precise mappings where appropriate.

As an important point in practice, Mix'n'match or equivalent tools can be applied for creating links to external vocabularies and catalogs without change. Once the external id property value has been added to the matching Wikidata item, it can immediately be qualified more precisely with a mapping relation type.

As a point of caution, queries would have to be aware of qualifiers to external id properties, actually meaning "this is not an exact match". However, that would not apply to the large group of existing properties for persons or other entities where only exact 1:1 relationships exist.

What do you think of this proposal? - Jneubert (talk) 11:59, 24 August 2017 (UTC)

Definitions of External identifier

These are the definitions of external identifier (Q21754218) I was able to find (if you know other, deviating or more expressive, sources, please add them here). Jneubert (talk) 16:42, 6 September 2017 (UTC)

"String that represents an identifier used in an external system. Will display as external link if a formatter URL is defined." (Help:Data_type)
"Some properties have values that are strings used in the databases of external organisations. They uniquely identify an item. For example, an ISBN for a book or the unique part of the URL of a movie or an actor in the Internet Movie Database." (Wikidata:Glossary)
Discussion

(First three comments copied from Wikidata:Project chat - I hope that's ok. Jneubert (talk) 10:59, 25 August 2017 (UTC))

This sounds reasonable to me. How about creating a property proposal? I think that's a better place to have this discussion. ChristianKl (talk) 13:52, 24 August 2017 (UTC)
I mostly support this. However, the SKOS relations (other than "exact") are somewhat vaguely defined: it's not clear what are the boundaries of "close", and "broad" (and conversely "narrow") can refer to a subclass relation, an instance-class relation, or a part-whole relation (or possibly other types of "broadening"). "related" could refer to any sort of relationship. Therefore I think in general it is preferable that, if no Wikidata item already exists which is an "exact match" to an external URI, a new item should be created that is an "exact match", and the more precise Wikidata properties used to relate that new item to existing items. Using these SKOS non-exact relations should, in my view, be only a fallback when creation of new Wikidata items is not practical for some reason. ArthurPSmith (talk) 14:05, 24 August 2017 (UTC)
Well, KOS are not ontologies, and almost always have messy corners from an ontological point of view. However elaborate, they have been created to index or classify publications, and their concepts were tailored for that purpose. E.g., the STW Thesaurus for Economics (Q26903352) has a concept "Content management", and uses that for "Content management system", too. (Probably because among the three-hundred-something publications that have been found relevant for economics and indexed with this concept, only very few focus on the system aspect.) For the more precisely defined Wikidata items content management (Q173373) and content management system (Q131093), any mapping will necessarily be inexact.
In the first example above (overseas territories), the concepts on both sides are not very well defined (brittle re. Brexit on the WD side, and defined only by the indexed publications on the STW side). I would very much hesitate to declare an exact match in that situation. Jneubert (talk) 11:17, 25 August 2017 (UTC)
@Jneubert: a qualifier seems useful because this applies to every property, Support d1g (talk) 15:36, 24 August 2017 (UTC)
  • Support. Thank you for proposing these ideas. I support the creation of these properties/qualifiers and their application as described. This seems like a reasonable first step toward being able to map external systems to Wikidata. YULdigitalpreservation (talk) 13:47, 25 August 2017 (UTC)
  • Currently, I'm doubtful that the name/description is clear enough to tell a new user what this property is about. Is there prior art that we can use to orient ourselves when it comes to name/description? ChristianKl (talk) 08:12, 27 August 2017 (UTC)
Unfortunately, I'm not aware of anything from the KOS domain which could apply here. Perhaps somebody else is? Particularly if similar mechanisms have been used elsewhere in Wikidata?
If we have to make it up ourselves: What would you think of "mapping relation modifier" as name and "Qualifier for the inexact mapping of an item to an external identifier (derived from SKOS relations)" as description? That would stress the point that the user modifies a (by default exact) relation. Jneubert (talk) 07:18, 28 August 2017 (UTC)
  • In general I support this, but I'm not sure it is a good idea to indicate exact matches by the absence of a qualifier. In my mind, the mapping relationships should default to something like closeMatch, if they should have a default type at all. exactMatch implies transitivity and that is always a bit dangerous, when you have to link between KOS (including Wikidata) that have different points of view. But safest would be to also declare exactMatch as one possible qualifier. Absence of qualifier would simply mean that we don't know the precise type of relationship - somewhat like skos:mappingRelation (the superproperty of all SKOS mapping relationships), in effect. Osma Suominen (talk) 14:24, 29 August 2017 (UTC)
Thanks for your thoughtful comment. The rationale behind proposing exact match as the default was that - as far as I can see - external id properties have up to now been used to express identity (without ontological baggage). This is true for people, and probably also for some or most of the widely used identifiers for genes/proteins/RNA and the like. I didn't want to devalue the millions of existing external id links where an exact match is the most appropriate interpretation. However, you are right re. the possibly unconsidered and undesirable consequences of the transitivity of "exact match". That may already be an issue with the third-most used external id, Geonames. So I can agree to your suggested solution of additionally introducing an explicit "exact match" relationship, and of considering the unqualified properties as "matching" the external entity to an extent which can sometimes be derived from the domain of the items (e.g., for people), while at other times users and applications have to take a (hopefully informed) guess. I wonder what others think of that proposed modification of the proposal. Jneubert (talk) 10:05, 30 August 2017 (UTC)
@jura1: You've made a comment re. exact matches in the project chat. I'm not sure if it hits the same spot, and if the proposed solution would solve your concerns. Jneubert (talk) 10:09, 30 August 2017 (UTC)
An additional advantage of an explicit "exact match" value may be that it would allow defining a "mandatory qualifier" constraint on the external id property, which would play nicely with a "one of" constraint on the qualifier. That would allow imposing a fairly strong regime on external ids for which a community consensus can be reached that they should be "properly" mapped. Jneubert (talk) 07:44, 31 August 2017 (UTC)
A point of caution, however: We should carefully consider how a mandatory qualifier plays with Mix-n-match and mapping workflows. It will increase the workload, because each statement has to be qualified. Doing this manually for every statement is a serious additional burden. On the other hand, for each larger vocabulary mapping effort, a suitable relation can be chosen as the default and, after working through Mix-n-match lists and adding qualifiers for the deviating relations, the qualifier for the "default" can be added automatically (e.g., via Quickstatements). Jneubert (talk) 09:17, 31 August 2017 (UTC)
Hi @Magnus_Manske: Does Mix-n-match support the insertion of external id properties when these properties have mandatory qualifier constraints? (That would allow qualifying the properties in a second step, independent of M-n-m.) Jneubert (talk) 09:17, 31 August 2017 (UTC)
Mix'n'match is intended for "equivalent" matches. It does not support adding qualifiers when adding statements to Wikidata. There is no storage, no interface code for this. Adding such qualifiers when matching would likely slow down and/or clutter the interface. I suggest that adding such qualifiers would be a separate task. It should be simple to generate a list of items with statements lacking such qualifiers via SPARQL. Qualifiers would then have to be added manually, as the decision about a "matching type" is likely one that cannot be done automatically. --Magnus Manske (talk) 15:32, 8 September 2017 (UTC)
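
The SPARQL listing mentioned in the comment above might look roughly like the query produced by the following Python sketch. It relies on the p:, ps: and pq: prefixes that the Wikidata Query Service uses for statements, statement values and qualifiers. The helper name is hypothetical, and since the proposed qualifier has no property ID at this point in the discussion, the ID passed for it is a placeholder that must be replaced.

```python
import re

# Wikidata property IDs have the form "P" followed by a positive number.
PROPERTY_ID = re.compile(r"P[1-9][0-9]*")


def unqualified_statements_query(external_id_property, qualifier_property):
    """Build a SPARQL query for statements that lack a given qualifier."""
    for property_id in (external_id_property, qualifier_property):
        if not PROPERTY_ID.fullmatch(property_id):
            raise ValueError(f"not a property ID: {property_id!r}")
    return (
        "SELECT ?item ?value WHERE {\n"
        f"  ?item p:{external_id_property} ?statement .\n"
        f"  ?statement ps:{external_id_property} ?value .\n"
        f"  FILTER NOT EXISTS {{ ?statement pq:{qualifier_property} ?relation . }}\n"
        "}"
    )


# Placeholder only: NOT the real ID of the proposed qualifier.
QUALIFIER_PLACEHOLDER = "P99999999"

# P3911 is the STW Thesaurus for Economics ID used in the examples above.
query = unqualified_statements_query("P3911", QUALIFIER_PLACEHOLDER)
print(query)
```

The resulting list would only show where qualifiers are missing; deciding which relation applies remains a manual step, as noted above.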
Thanks! Completely agree. Jneubert (talk) 15:44, 8 September 2017 (UTC)
Agree as well. Applications similar to the above wouldn't be possible any more or would be slowed down. Accordingly, the domain of such a qualifier needs to be limited.
--- Jura 09:19, 9 September 2017 (UTC)
✓ Done Added "exact match" as explicit value (instead of default), as proposed by Osma Suominen. Jneubert (talk) 11:02, 13 September 2017 (UTC)
  • Oppose domain. If this is added to any property, effectively the current approach for external ids is invalidated.
    --- Jura 06:31, 2 September 2017 (UTC)
    • What is the “current approach for external ids”? I don't get your point. Sorry. :-/ --Michael Büchner (talk) 14:13, 5 September 2017 (UTC)
      • Currently people know they can do a 1-1 matching with external ids. If this is applied to random properties, one would need to check for the absence of qualifiers every time.
        --- Jura 07:00, 6 September 2017 (UTC)
        • I agree that external ids "normally" are used to express 1-1 relationships. Their definition however actually only says that an external id "uniquely identifies an item" in an external system (see definitions above). That means, the identifier is unique in that system, but in my eyes says nothing about the Wikidata side or the relationship. So I cannot see that "the current approach for external ids is invalidated".
The 1-1 relationship is normally expressed by a combination of single value and unique value / distinct values constraints - which is used by many properties of type external id, and which can be taken as a strong indication against the use of relation type qualifiers. The absence of qualifiers can be further and formally enforced with an allowed qualifiers constraint with "no value" (as explained on the help page). On the other hand, qualifiers can be made mandatory for selected properties. So it can be actively precluded that qualifiers are "applied to random properties". Additionally, as stated above, qualifiers obviously do not make sense in large domains (e.g., persons). And finally, it takes effort to attach qualifiers, so it is not very likely that it happens randomly. Jneubert (talk) 17:05, 6 September 2017 (UTC)
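
How the combination of the two constraints expresses a 1-1 relationship can be shown with a minimal Python sketch. It is not the actual constraint-checking code of Wikidata; the function name and the sample identifiers are hypothetical. "Single value" is read here as "each item has at most one value", and "unique value" as "each value occurs on at most one item".

```python
from collections import defaultdict


def constraint_violations(statements):
    """Check (item, external_id) pairs against the two 1-1 constraints."""
    values_by_item = defaultdict(set)
    items_by_value = defaultdict(set)
    for item, value in statements:
        values_by_item[item].add(value)
        items_by_value[value].add(item)
    return {
        # Items carrying more than one identifier value.
        "single value": sorted(
            item for item, values in values_by_item.items() if len(values) > 1
        ),
        # Identifier values attached to more than one item.
        "unique value": sorted(
            value for value, items in items_by_value.items() if len(items) > 1
        ),
    }


# Hypothetical data, for illustration only.
statements = [
    ("item-1", "id-A"),
    ("item-1", "id-B"),  # second value on the same item
    ("item-2", "id-C"),
    ("item-3", "id-C"),  # same value on a second item
]
print(constraint_violations(statements))
```

A property whose statements produce no violations of either kind is used strictly 1-1; qualified n:m mappings would show up in one or both lists.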
Constraint or not, can you explain how the applicability of this qualifier would be limited beforehand? E.g., would it not apply to any properties with single & unique value constraints?
--- Jura 09:19, 9 September 2017 (UTC)
As mentioned above, I doubt that this could evolve into a widespread problem, due to the nature of most external id properties and the effort and level of expertise required to apply qualifiers at all. (I've checked VIAF ID (P214): for more than a million IDs, only a handful of qualifiers show up, some probably assigned in a wrong position.) I'd suggest handling that on a per-property basis. Trying to create some sophisticated custom constraint would add load to the machines involved and, more so, to users trying to understand what's going on. Jneubert (talk) 07:56, 11 September 2017 (UTC)
It's probably even worse if it's handled separately. People could match the identifier for Berlin to the item for Germany and might think that the exact qualifier would be added later. Users exporting to a list of identifiers and items are misled because qualifiers haven't been included.
--- Jura 14:36, 11 September 2017 (UTC)
Perhaps this is a misunderstanding: With "handle on a per-property basis" I meant the definitions of the external id properties (which may require or preclude the qualifier), not the individual property values. Jneubert (talk) 17:30, 11 September 2017 (UTC)
No, sorry. The idea is to restrict this property to well-known SKOS relations, in order to make Wikidata fit as a linking hub for knowledge organization systems, which can be queried in a predictable and reliable way. If I understand your use case correctly, confidence values by ABS are quite well defined, but only for their domain. Confidence values applied in other domains, e.g., to the mapping of GND and DDC, may be defined with a quite different mindset. Jneubert (talk) 07:56, 11 September 2017 (UTC)

@ChristianKl, ArthurPSmith, D1gggg, YULdigitalpreservation, Osmasuominen, Jura1:, @Hanshandlampe, 99of9: I've modified the proposal by adding "exact match" as explicit value (instead of default), as suggested by Osma. Additionally, I've slightly modified the description (in an attempt to be clearer to users, as pointed out by ChKlein). Since the values would be restricted to five explicit options, the property name could perhaps simply be "mapping relation" - what do you think? -- Jneubert (talk) 11:02, 13 September 2017 (UTC)

  • Weak support I don't feel this is essential, but given the discussion above I do feel it could be helpful. I've shortened the English label as suggested by Jneubert just now, that seems fine. I do have a concern about what Jura mentioned just above - we perhaps should add usage instructions that these relations should only be used where there is a very close relationship. That is to make sure not to, as Jura suggests, link everything in Germany to "Germany". Again that may be more appropriate to specify on identifiers property-by-property. ArthurPSmith (talk) 13:32, 13 September 2017 (UTC)
  • I've added a draft explanation When to use "mapping relation" qualifiers?. When anchored at the start of the property talk page, I'd expect that it could additionally clarify the intended use and help to avoid the insertion of random nonsense relations, as described by Jura above. (If there is a better place to add such explanations, please let me know.) Jneubert (talk) 09:20, 15 September 2017 (UTC)
  • Oppose I think the Lake Constance sample illustrates well why this is a bad idea. We do have a dedicated item, but the sample uses another one. Similarly this can lead people to add identifiers for Berlin to the item for Germany.
    --- Jura 07:41, 1 October 2017 (UTC)
  • Hi Jura, thanks for the hint to Lake Constance Region (Q34947397). I was not aware of that item until now. Apparently it has been introduced by ChristianKl during this discussion and linked to the corresponding STW descriptor. So I've adapted the example above, because it was not intended to make the case for redundant links.
Though the introduction of the new item is of course legitimate, I wouldn't have created it myself, for two reasons: First, the new item is much less "attractive" for linking, because it has no Wikipedia pages and no other external identifiers attached. Secondly, it would have been defined no better than it is currently (has part Lake Constance, externally identified by a descriptor of an economics thesaurus). It lacks properties which would make it more useful (population, surface area, etc.). Probably, exact definitions of the region in Germany, Switzerland and Austria and values for the corresponding properties could be researched in the publications linked from the STW descriptor, but that by far exceeds the amount of work which can be done during a mapping effort.
While the case was solved here by creating a new item, I still wouldn't see that as a general solution. While entities such as "Lake Constance region" are clearly identifiable, this does not hold for many concepts in the social sciences or humanities. Introducing slightly differing concepts in these fields, sometimes perhaps only coined by some particular school of thought, only to create an exactly matching linking point from external vocabularies would, in my eyes, do no good to Wikidata. Jneubert (talk) 13:02, 4 October 2017 (UTC)
  •  Comment @ArthurPSmith, Jura1: Completely agree about the "fallback" status and that it's better to create new exactly matching WD entities, rather than go rampant with broader/narrower.
This proposal thematically overlaps with external subproperty (P2236), external superproperty (P2235), narrower external class (P3950), equivalent property (P1628), equivalent class (P1709), exact match (P2888). It's true those props are "for values of type URL" and universal rather than per-authority. But as you can see at Wikidata:Property_proposal/Schema.org_ID, the boundary between the "URL" approach and the "external-id" approach is not so clear-cut. The guidelines also overlap with Help:Statements/Guidelines_for_external_relationships. So while I think this proposal is useful, together with those previous policies it will create more confusion!
The current tens of millions of existing external-id statements are created with the meaning of "identity". Just because there are constraint violations (eg one WD matched to several in one external authority), doesn't mean the intent was not "identity". So @OsmaSuominen, Jneubert: "exact match" is an appropriate default value and I propose to remove it. Or do you volunteer to go check those tens of millions of statements and add it?
I also propose to remove "related match", which is too vague to be useful; I don't know many linksets to use this prop: @Jneubert: feel free to prove me wrong. --Vladimir Alexiev (talk) 10:13, 2 October 2017 (UTC)
  • Hi Vladimir Alexiev, re. "exact match" as a default: While I agree that "identity" probably was the starting point for external identifiers, it has never been stated that the external id property should hold strong enough for transitive matches (please have a look at the definitions above). Redefining it afterwards with such an extended meaning seems wrong to me, exactly because nobody can verify the existing millions of links.
As the ones responsible for one vocabulary (STW), we can do this (add the "exact match" qualifier automatically to the few hundred existing links, and verify them), but this is hardly an option for large and already widely linked vocabs such as AAT. Yet, the proposal does not require that you (or Osma, or I) do that. The existing (unqualified) external id values stand as defined, and it is in the responsibility of the user if she exploits it transitively. Jneubert (talk) 13:26, 4 October 2017 (UTC)
  • I agree on removing "related match", although I've occasionally wondered whether Wikidata ought to have a property for "related item" (there's "see also" for properties, but nothing like that for items as far as I know). In fact I think this might be helpful if just limited to "exact match" and "close match". But if we need to include "broader" and "narrower", I think we need usage instructions on the lines of "this should be used to describe the relationship of the external URL to the closest matching wikidata item, not to more distantly related items". By the way, please note both opposing votes here are from the same person. ArthurPSmith (talk) 20:40, 2 October 2017 (UTC)
  • Hi Vladimir Alexiev, ArthurPSmith, I think "related match" should be kept. There are cases where it is better suited than close or broad/narrow match. A few examples:
  • In economics, Wikidata has Kuznets curve (Q1349471) (empirical relationship between economic development and inequality level), while STW additionally has Environmental Kuznets curve (which alleges the same highly disputed relationship between economic development and environmental quality).
  • I've added Assessment center (Q265558) and Executive selection (STW) in the example section above.
  • For a work created by two or more authors, an archive may provide a link to the correspondence between the authors.
  • For an occupation, an educational database may have an entry about a course of instructions for that occupation.
Besides single examples, supporting the full set of SKOS mapping relations is an important factor for making Wikidata fit as a linking hub for knowledge organization systems. I'd agree with Vladimir's argument that the use of skos:relatedMatch is less widespread than the use of other mapping relations (though from STW alone, it is used in link sets to GND and to the Thesaurus for the Social Sciences). Because Wikidata is already so comprehensive and can be extended easily, non-exact relations will be less frequent than in mappings between different subject thesauri, and perhaps a "related match" qualifier will be rarely used. But we should not arbitrarily restrict the suitability of Wikidata for a complete mapping from external knowledge organization systems. Jneubert (talk) 14:54, 4 October 2017 (UTC)
  • Hi ArthurPSmith, I agree with your statement that "this should be used to describe the relationship of the external URL to the closest matching wikidata item, not to more distantly related items" as part of some usage instructions. Would you mind inserting it into the draft here? Or should that go to another location? Jneubert (talk) 15:07, 4 October 2017 (UTC)
  • I think we should attempt to include the information in the proposal above. Too much copy-and-paste of discussion from other forums has been added to this page already. If the objective is merely to map STW Thesaurus for Economics ID (P3911) to pages that have Wikipedia articles (maybe even at German Wikipedia), the proposal should probably be limited to a qualifier for P3911 only. Applying the initially proposed qualifier to any external identifier on items that have, in the proposer's view, sufficient statements and Wikipedia articles linked is just too likely to break the entire system.
    --- Jura 10:24, 8 October 2017 (UTC)
Orphanet describes 3-methylglutaconic aciduria type 3 (Q2823332), which is a specific kind of metabolic disorder, and provides cross-references to (among other things) MeSH and OMIM. These are records that also have information about this specific disease. Orphanet also provides a cross-reference to ICD-10 ID: E71.1. The issue is that E71.1 in ICD-10 is actually "Other disorders of branched-chain amino-acid metabolism", which is a catch-all category for certain metabolic disorders that aren't explicitly defined by other codes in ICD-10. This is not an exact match; it is a narrower -> broader term relationship. (And this fact is captured by Orphanet and retrievable.) There are several items in Wikidata with this ICD-10 code. So the two ways I can think of to properly capture this relationship are:
  1. Add "E71.1" as an external ID to 3-methylglutaconic aciduria type 3 (Q2823332) with the qualifier broadMatch
  2. Create a new item "Other disorders of branched-chain amino-acid metabolism", and make 3-methylglutaconic aciduria type 3 (Q2823332) (as well as the other 11 items) subclasses of this.
I much prefer option 1 as I think it is significantly cleaner and easier to understand. Creating the item "Other disorders of branched-chain amino-acid metabolism" would serve to just pollute the disease space in Wikidata with nonsense structural diseases that serve little purpose.
Gstupp (talk) 21:20, 10 October 2017 (UTC)