Wikidata talk:WikiProject Taxonomy/Archive/2020/06

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Understanding Wikidata taxonomy

I'm trying to make sense of the way Wikidata models taxa and taxonomic names, and have attached my current understanding of the relationships between taxa (and publications). I've also written a blog post on this: Making sense of how Wikidata models taxonomy. I'd welcome any comments on this diagram. My goal here is to understand the current situation, and use that to design queries for an app I've been developing to navigate Wikidata. The app is here: ALEC, and there's some background to this project here: Wikidata and the bibliography of life in the time of coronavirus. Given the enormous potential of Wikidata, I'm keen to come up with user-friendly ways to explore taxonomic information, especially links between taxa, their names, and the evidence supporting that taxonomy. --Rdmpage (talk) 10:07, 20 April 2020 (UTC)

Hi Roderic, good to have you here again. I think you missed reference has role (P6184). In theory nomenclatural acts could be queryable like the ones from Taxon (Q2003024). Regards --Succu (talk) 20:53, 20 April 2020 (UTC)
When I say "ontologically logical" I mean we don't say "reference has role → original combination", therefore I don't see why we should say "reference has role → new combination", hence the item that I created "new combination reference..." Christian Ferrer (talk) 07:36, 21 April 2020 (UTC)
@Succu: Nice to be "back". I keep circling around this stuff, and having built my own knowledge graph for Australian animals I'm beginning to see the attraction of doing that sort of work here rather than in my own little world. I've added reference has role (P6184) to the diagram. There's clearly some overlap between linking references to taxon names via taxon name (P225) and via P5326 (P5326). Is there a consensus about which approach is preferred? – The preceding unsigned comment was added by Rdmpage (talk • contribs) at 08:53, 21 April 2020 (UTC).
It's a wiki, so there isn't a recommendation. Personally I do not like P5326 (P5326). It's separated from taxon name (P225) and the usage is not clear. reference has role (P6184) is a lot more flexible and fits perfectly with the general usage of references. --Succu (talk) 19:59, 21 April 2020 (UTC)

Any attempt to understand - or fix - taxonomy in Wikidata needs to address the issues described in, for example, What heart rate does your name have? Further discussion may be found at Scientific names of taxa should be a separate entities from the taxa themselves. Good luck! Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:41, 21 April 2020 (UTC)

@Pigsonthewing: Yes there is a distinction between taxa and taxonomic names, and initially I thought Wikidata had these so intertwined that I avoided interacting with it too much. But the idea of having taxon name (P225) that has qualifiers helps tackle the distinction between names and taxa, and I don't think there's a consensus on the best way to model names and taxa. And given the multiplicity of classifications available, I think Wikidata may be well placed to help capture those without having to decide on which one to follow. So I'm optimistic that things aren't quite as bad as you suggest. --Rdmpage (talk) 15:17, 21 April 2020 (UTC)
@Rdmpage: Hello. I think the diagram is very useful, by the way, and should be added to the taxonomy tutorial.
I think the distinction between taxa and taxonomic names is an extremely severe problem with this part of Wikidata, and no, there is no consensus on how to solve it. One taxon can correspond to several taxon name items, and we badly need a way to single out one of them to hold the properties, like wikilinks, which apply to the whole organism (and are not name-dependent). We should also have a way of expressing that one name or another is the claimed current name, but that should be independent.
At present I don't think that the data model is well enough defined that it is possible to automatically derive the author information string from the Wikidata taxon items (I proposed an algorithm to do that in the taxonomy tutorial comments). I also tried to do an exercise (described here) to add the fungus author information strings semi-automatically based on Index Fungorum, but I had to stop due to another user deleting my taxon name (P225) properties. The root of the dispute was that I was adding too many invalid (or obsolete) taxon name items, but in fact I needed to do that to conform to the rules for the author information. I have tried to make proposals for how to clearly distinguish old name items, such as in this discussion, and also how to have items which represent a whole taxon. The proposal in Scientific names of taxa should be a separate entities from the taxa themselves would be good, but I think it would be too big a change.
It would be excellent if we could make progress on these issues. But I think that things are very bad - the current data model is not fit for purpose. Strobilomyces (talk) 19:16, 31 May 2020 (UTC)
@Strobilomyces: Glad you found the diagram useful; I'm still tweaking it as I learn more about the various properties. I'm torn between arguing that the best way forward is to start from scratch and distinguish names and taxa, or accepting things are what they are and trying to figure out how best to work within those constraints. The problem with any new model is getting people to agree on how to model names and taxa. People who do this for a living don't agree, and any discussion of taxa and names seems to inevitably descend into a quagmire. I suspect it would require a lot of energy to make major changes. I'm mildly consoled that this is not just a problem with taxonomy: even things as "simple" as books start to get complicated very quickly (and prompt similar discussions; see Wikidata_talk:WikiProject_Books#Physical_book_vs._a_literary_work_published_in_book_form).
Not sure what the way forward is, but personally I guess I'd be interested in establishing:
  1. For a given terminal taxon (e.g., species), can we retrieve all facts about that species (which may be attached to different names)?
  2. For a given taxon, can we retrieve all names that have been used for that taxon?
  3. Can we retrieve a given classification from Wikidata (e.g., here is the classification of mammals according to Mammal Species of the World (Third edition) (Q1538807))?
In this context I view "taxa" as nodes in a classification. I think (1) is the biggest challenge, given that I think this is what most people will want from Wikidata, namely facts about a given species. I think (2) is doable, although there are an alarming number of properties involved, and of course (2) is closely related to (1). Classification is challenging because Wikidata supports multiple classifications (without always being explicit about where those classifications come from), which is related to (1) because the same species can appear in multiple places in Wikidata as part of different classifications. For myself, I think the answers to these questions will help me decide how much to invest in adding taxonomic and nomenclatural data to Wikidata, versus doing that somewhere else. --Rdmpage (talk) 09:47, 1 June 2020 (UTC)
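As a rough illustration of question (3), the sketch below assembles a SPARQL query that keeps only those parent taxon (P171) statements whose reference cites a chosen source. It assumes the source is recorded on the statement with stated in (P248), which will not hold for every item, and it uses the standard Wikidata Query Service prefixes. The function name and the row limit are hypothetical, and the query has not been tuned against the live endpoint.

```python
import re
from string import Template

# SPARQL template: parent-child pairs whose P171 statement cites a given source.
# Assumption: the classification source is attached via "stated in" (P248).
CLASSIFICATION_TEMPLATE = Template("""
SELECT ?child ?childName ?parent WHERE {
  ?child p:P171 ?statement .
  ?statement ps:P171 ?parent ;
             prov:wasDerivedFrom ?reference .
  ?reference pr:P248 wd:$source .
  OPTIONAL { ?child wdt:P225 ?childName . }
}
LIMIT $limit
""")


def classification_query(source_qid: str, limit: int = 1000) -> str:
    """Return a SPARQL query for parent-child pairs referenced to one source."""
    # Reject anything that is not a plain item identifier such as "Q1538807".
    if not re.fullmatch(r"Q[1-9]\d*", source_qid):
        raise ValueError(f"not a Wikidata item ID: {source_qid!r}")
    return CLASSIFICATION_TEMPLATE.substitute(source=source_qid, limit=int(limit))


if __name__ == "__main__":
    # Mammal Species of the World (Third edition), the example given above.
    print(classification_query("Q1538807"))
```

Because each species can carry several parent taxon statements, filtering on the reference rather than on the item is what lets one source's tree be separated from the others.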
@Rdmpage: It is certainly a problem to get agreement on a taxonomy model, and I also think that it is best to aim at minimal changes.
As far as I know, the only present specification of how to use the taxonomy data model is the taxonomy tutorial. It would be a great improvement if this could be updated and if a more formal definition of the acceptable model of the taxonomy items could be given.
To retrieve all names for a given taxon, the tutorial says that taxon synonym (P1420) and subject has role (P2868) = synonym (Q1040689) with of (P642) should be used. By the way, I think that the pointer in the reverse direction is extremely useful, though in theory it is redundant (in a situation where you can easily do SPARQL queries). The claims should be accompanied by references to the classification authorities. For a given reference source it is necessary to choose one of the name items as the current one (which will have the P1420 claim). It should be illegal for two current name items to point at the same synonym item for a given reference source. It is true that in principle the set of names which belong to a given taxon depends on the classification authority.
I think that synonyms need to be divided into "stable" and "contentious" ones, and the two types have to be treated differently. The w:en:Synonym (taxonomy) page describes "objective" as against "subjective" synonyms for zoology and "homotypic"/"nomenclatural" as against "heterotypic"/"taxonomic" synonyms for botany. I think that normally all experts agree that the names linked as "objective"/"homotypic"/"nomenclatural" synonyms identify exactly the same sets of organisms, even though there may be disagreement as to which name is the current one to use. For instance when a name is found to be nomenclaturally invalid, there is no doubt about the valid name which replaces it, and so that is a stable synonymy. For fungi we have lots of such synonyms. The "subjective"/"heterotypic"/"taxonomic" synonyms are more contentious but I think that in cases where the synonymy was established long ago, they could also be regarded as stable. If someone came along and argued that such a synonym was uncertain, it would be necessary to change the synonym over to being contentious.
I don't think we can do anything for names linked as "contentious" synonyms - those groups of names need to be treated like separate taxa. But each group of names linked by "stable" synonyms constitutes a different taxon. Each item linked through taxon synonym (P1420) could have multiple "stable" references (as well as "contentious" ones). The "stable" synonym references don't have to agree as to the sets of name items which belong to the taxon or which name of the set is the current one; the name set of the taxon would be the union for all "stable" references of all the name items found through the links going both ways. If the "stable" references didn't agree, that would be an indication that some of the synonyms were actually "contentious", but it would not be an error.
The stable synonyms should be identifiable as such, perhaps by a new property of the reference.
Apart from this, one of the name items belonging to the taxon should be selected as the taxon item, and the wikilinks and name-independent properties (such as mushroom cap shape (P784) in the case of fungi) should be gathered together on the taxon item only. This is especially important in the case of wikilinks - since WP articles are per taxon, the system doesn't work unless the wikilinks are all on the same item. The taxon item should not be confused with the current name item (of which there could be several), and it should be emphasised that the choice of the taxon item is not important.
We should have a way of showing which is the taxon item. It could be with a new value or qualifier of the instance of (P31) claim. Only one item of all the names connected by "stable" references should be allowed to be a taxon item.
I think the problem of the classification is simpler, it is all done through parent taxon (P171). There can be multiple parents and references can be used to show which system is relevant. But there needs to be a defined default reference to be used by software such as taxoboxes, and such software would need to be changed. All of that needs to be defined very clearly.
So in my view quite a small set of changes agreed in a RFC, with a very clear instruction manual, could make the system viable. Strobilomyces (talk) 20:57, 1 June 2020 (UTC)
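The rule proposed above, that the name set of a taxon is the union over all "stable" references of the name items reached through links in both directions, amounts to finding connected components. A minimal sketch follows, assuming the stable links have already been extracted as (current name item, synonym item, reference source) triples. The item labels and source names are made up for illustration, and no "stable" marker exists in Wikidata at present.

```python
from collections import defaultdict


def taxon_name_sets(stable_links):
    """Group name items into taxa: connected components over stable synonym links.

    stable_links: iterable of (current_item, synonym_item, source) triples.
    The source is ignored here because the proposal takes the union over
    all stable references, whichever name each one treats as current.
    """
    parent = {}

    def find(item):
        # Union-find lookup with path halving.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for current, synonym, _source in stable_links:
        root_a, root_b = find(current), find(synonym)
        if root_a != root_b:
            parent[root_b] = root_a  # links count in both directions

    groups = defaultdict(set)
    for item in parent:
        groups[find(item)].add(item)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    links = [
        ("name-A", "name-B", "source-1"),  # source-1 treats A as current
        ("name-C", "name-B", "source-2"),  # source-2 treats C as current
        ("name-D", "name-E", "source-1"),
    ]
    print(taxon_name_sets(links))
```

In the sample data the two sources disagree about the current name, yet name-A, name-B and name-C still end up in one set, which matches the point that such disagreement is not an error.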

Items about fish

https://w.wiki/Sd8 contains a list of items about fishes that currently aren't instances of taxon. Can someone who knows more about animals look into them and merge or fix them? ChristianKl 17:07, 2 June 2020 (UTC)

Looking more into it, there are in total 107 instances of species right now. Likely all those items need work. ChristianKl 17:16, 2 June 2020 (UTC)
@ChristianKl: Done. --Succu (talk) 19:01, 2 June 2020 (UTC)
Thanks. ChristianKl 19:55, 2 June 2020 (UTC)

Linking (lots of) taxonomic names to literature

So I'm now at the point where I'm thinking about bulk uploading links between taxonomic names and literature. By "bulk" I mean on the order of hundreds of thousands. Obviously this will depend on how much of the literature is in Wikidata, but it's already a lot, and I'm trying to add as many taxonomic publications as I can (on the order of tens of thousands).

Obviously I'd like to do this "right", that is, in a way that the community finds useful and that aligns with what has already been done. If, following @Succu:, we use reference has role (P6184), then it looks like first valid description (Q1361864), recombination (Q14594740) and replacement name (Q749462) are the most obvious roles to annotate each taxon name → reference pair. These are certainly the most commonly used roles based on this query. For example, at the moment there are a little under 6000 taxon names linked to "first description" publications, and approx. 8000 linked to "recombination" publications (e.g. species moved from one genus to another).
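For anyone wanting to reproduce a role count of this kind, here is a small sketch. It is not necessarily the query linked above. It assumes the roles sit as reference has role (P6184) inside references on taxon name (P225) statements, and that results arrive in the standard SPARQL JSON results format. The helper name and the sample response are made up for illustration.

```python
from collections import Counter

# Every (taxon, role) pair where a reference on the taxon name statement
# carries "reference has role" (P6184).
ROLE_QUERY = """
SELECT ?taxon ?role WHERE {
  ?taxon p:P225 ?nameStatement .
  ?nameStatement prov:wasDerivedFrom ?reference .
  ?reference pr:P6184 ?role .
}
"""


def tally_roles(sparql_json):
    """Count how often each role item occurs in a SPARQL JSON result."""
    counts = Counter()
    for binding in sparql_json["results"]["bindings"]:
        # Values are entity URIs; keep only the trailing item ID.
        counts[binding["role"]["value"].rsplit("/", 1)[-1]] += 1
    return counts


if __name__ == "__main__":
    # Made-up response in the SPARQL JSON shape, for illustration only.
    entity = "http://www.wikidata.org/entity/"
    sample = {"results": {"bindings": [
        {"role": {"type": "uri", "value": entity + "Q1361864"}},
        {"role": {"type": "uri", "value": entity + "Q14594740"}},
        {"role": {"type": "uri", "value": entity + "Q1361864"}},
    ]}}
    print(tally_roles(sample))
```

Tallying on the client side keeps the query simple; a GROUP BY in SPARQL would do the same job on the server.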

So what I'm envisaging is taking data from, say, International Plant Names Index (Q922063) and my own mapping between IPNI and DOIs, etc., and uploading that to Wikidata. Each existing taxon name that has an IPNI plant ID (P961), and whose corresponding publication exists in Wikidata, would be linked.

Complications

Plant names are a bit easier to handle because botanists track name changes. Zoologists don't, so in several databases (e.g. Mammal Species of the World (Third edition) (Q1538807)) we have current names linked to the original description. These are references for the name in a sense, but not always for the current name (World Spider Catalog (Q3570011) is another example of this). I'm not sure what the best thing to do is in this case. If we know the original name, then the reference can be linked to that name, but maybe it's worth having the reference linked to the current name (without a role?) so that it is available to people.

Request

What do people think about this? My plan is to keep adding the references needed while seeing what people think about linking references to names on a big scale.

Hi, Roderic. I think this query gives a slightly more accurate count of reference has role (P6184). Linking taxon name (P225) to the nomenclatural act (Q56027914) is a very important step. I hope you can add the exact page reference (page(s) (P304)) too. Some days ago I started creating missing taxon names and references with a DOI from IPNI, from 2019 back to 2000. --Succu (talk) 19:02, 9 May 2020 (UTC)
@Rdmpage: Thanks for getting involved with this. The general approach to doing such things at scale: (i) pick a few example cases, perhaps a mix of typical cases and some atypical ones that may help define the scope of the initial modelling efforts, (ii) look at how this is modeled currently and compare that to what you think it should be, (iii) improve those models by editing the items for the example cases and perhaps proposing additional properties, (iv) apply the improved model to some other items (tens rather than thousands) and see how that fits or is perceived, (v) set up quality checks (property constraints, Shape Expressions, maintenance queries), explore what they give and adjust accordingly, (vi) pipe your workflows into existing tooling like QuickStatements or make a bot request, (vii) adjust as necessary. I am sure you are aware of much of that, and from your posts and edits here and elsewhere, I see that you are well on this path already. However, I am bringing this up here because I think the first step is really important and because not everyone following this discussion may be aware that there is an effort to use a dedicated Wikibase instance (where properties can be created or deleted more quickly than here and without interfering with existing Wikidata functionality) to serve as a playground for the modelling of taxa and taxon names, including their respective relationships to the literature. While that wiki is seeded with Tyrannosaurus rex-related content, other example cases are expected to be modelled there as well, and anyone is welcome to join that effort (for which I am also pinging Andrawaag, Qgroom and Mtrekels) and to help document it here on Wikidata. 
While that is ongoing, the part that we can work on immediately (and have been for a while) is filling the gaps in terms of the relevant literature (especially sources other than journal articles with DOI) as well as annotation with topics and authors, so that when there is scalable progress in the taxonomic and nomenclatural modelling part, the literature part here is ready to be linked up. --Daniel Mietchen (talk) 16:16, 10 May 2020 (UTC)
@Daniel Mietchen: So there seem to be two separate things here. Regarding the literature, yes there's lots to do here, and I'm adding lots of missing stuff, building on the huge amount of work you and others have been doing. For example I'm adding articles with non-CrossRef DOIs, JSTOR content, articles with Handles, BioStor content, etc. There's also a lot of CrossRef DOI content not in Wikidata, and huge gaps in citation linking (for all sorts of reasons, not least that CrossRef's citation data is spotty at best). So lots to do on that front. Re the taxon name stuff, if I look at Tyrannosaurus rex in the stand-alone Wikibase I see two(!) properties, one of which points to a Zenodo record (using a URL with some weird hash included, not even a DOI), and this is for a reference that has an existing identifier (a Handle). (I confess I am not a fan of people minting Zenodo DOIs for content that has an existing publisher who may well plan to mint DOIs for this content in the future, but that's another story.) My concern here is that this stand-alone project seems to have made somewhat modest progress so far, and there are existing ways to include links between names and the literature in Wikidata which already have thousands of entries. I hope you can see that the dilemma I face is that I either go ahead using the existing framework, which may have limitations but is at least being used, or wait for something which does not yet exist. --Rdmpage (talk) 17:25, 10 May 2020 (UTC)
The trex-wikibase is an experiment in isolation and not intended to extend beyond that specific example. It grew from the observation that the way taxa are modelled in Wikidata is rather monolithic. Potential changes to the underlying data model, minor or major, can only be suggested one property at a time. This makes adjusting the underlying data schema to cater to different taxonomic use-cases hard, if possible at all. With the recent developments, where deploying Wikibases through Docker or wbstack is made easy, it is suddenly possible to first craft a complete implementation/description of a taxon in a designated wikibase before engaging in Wikidata. Call it a draft. At a second stage the draft can act as a description of one view on a taxonomy. If the Wikidata community at large does not accept that view, there is that wikibase where that use-case can be catered for. In this scenario other examples deserve their own wikibase, if needed. --Andrawaag (talk) 20:09, 10 May 2020 (UTC)
@Andrawaag: Thanks for clarifying. The way Wikidata models taxa and names is not ideal, but I think it's probably flexible enough to be usable. It seems clear that Wikidata isn't "monolithic" in the sense of supporting only one classification - you can't guarantee to retrieve a tree simply by collecting parent-child taxon relationships. Hence you could argue that it can store multiple classifications simply by giving a source for each parent-child relationship (so that a query can retrieve a given classification by restricting parent-child pairs to just those with a given authority). Personally I often wish that Wikispecies had restricted itself just to names, not taxa, so that it would be a database of nomenclature, but I think the reality is that too few people care about the name/taxon distinction for it to gain enough traction. My worry about having separate Wikibases is that doing that immediately cuts you off from the two greatest assets of Wikidata: the community of people and bots (often not at all interested in taxonomy) that can contribute, and the connectivity which is at the heart of the knowledge graph idea. But I understand the desire to be able to do things in other ways; I often find myself looking at Wikidata and wondering why on earth people did things the way they did. But maybe sometimes it's better to play within the constraints? --Rdmpage (talk) 08:53, 11 May 2020 (UTC)
@Daniel Mietchen: @Andrawaag: And just to be clear, I'm trying to figure out the best way forward here. If you hadn't set up a separate Wikibase I'd probably be arguing that is exactly what we should be doing, and Wikidata itself is broken! So, there's an element of me just trying out different arguments to help clarify what is the best use of all the data I've accumulated. --Rdmpage (talk) 09:29, 11 May 2020 (UTC)
@Rdmpage, Andrawaag: Perhaps a good way forward on the taxonomy/ nomenclature end here is to propose a few more taxa/ names other than T. rex to serve as test cases for the integration with the literature. These can be worked on here or via some similar Wikibase playgrounds and serve to set up workflows that can be refined at small scale before being scaled up. Also pinging Myrmoteras who is particularly interested in the taxon treatment aspect of that. --Daniel Mietchen (talk) 13:11, 11 May 2020 (UTC)
@Daniel Mietchen, Andrawaag: Excuse my ignorance, but is it clear that the existing Wikidata properties aren't adequate for modelling names and taxa? In my own attempt to figure out where we are now (Wikidata_talk:WikiProject_Taxonomy#Understanding_Wikidata_taxonomy) I was struck by how much work has already been done. Are there compelling reasons not to use what's already there? --Rdmpage (talk) 14:36, 11 May 2020 (UTC)
@Rdmpage, Andrawaag: No, it is not "clear that the existing Wikidata properties aren't adequate for modelling names and taxa", but neither is it clear that they are. Apart from the question which properties to have, data modelling also includes considering which properties to use stand-alone and/ or as qualifiers, with what constraints (e.g. on what classes of items) etc. — things like whether and how taxa and taxon names (valid or invalid) should be combined into one item or spread out over separate ones, or whether the names should perhaps be lexemes, and how all that relates to types and other specimens as well as to molecular sequences, taxon treatments, the nomenclatural codes, the wider literature and data landscape and the respective authors. Some systematic exploration of this space seems merited in my view, but it should not stand in the way of carefully enriching what we already have here. As long as we strive for consistency within our current data models (and by and large, we are), we can minimize the likely pain that any potential future changes to them might cause. --Daniel Mietchen (talk) 13:22, 12 May 2020 (UTC)
@Daniel Mietchen: Yes, there is a lot that could be explored, and lots of things to think about, especially regarding specimen and sequence databases. The sense I'm getting is that we can do this in parallel: on the one hand we could populate Wikidata with names linked to the literature using existing properties, and at the same time we can experiment with other ways to model these entities. Doing one doesn't preclude the other, so long as we are aware that things might change in the future. Does that seem a reasonable summary? --Rdmpage (talk) 14:31, 12 May 2020 (UTC)
@Andrawaag: What kind of „different taxonomic use-cases“ do you have in mind? Which questions can't be answered with the current WD model (given we have the data here)? --Succu (talk) 18:00, 11 May 2020 (UTC)
@Succu: so linking the reference to the nomenclatural act (eg: recombination) is preferable to the taxon name being an instance of recombination? I'm new here and trying to understand how best to add data. Thanks. Friesen5000 (talk) 23:16, 12 May 2020 (UTC)
@Succu: It is not so much the existing data, but more the mismatch in underlying schemas that seems to cause concern. Although I do see Wikidata as the perfect place to present and maintain disagreement, it can be really hard to convince others of the existence of multiple views. E.g., there seems to be an issue with how to order taxa, taxon names, and treatments. Furthermore, the schema seems to differ between the taxonomies of plants, animals, etc. Again, I am convinced that all (sometimes disagreeing) views can happily co-exist on Wikidata as long as proper references are maintained. However, getting there is discouraging to some. That is why first maturing a schema on a (partly) isolated wikibase might be the best way forward to integration into Wikidata, and that is exactly what we are doing in this wikibase. Call it a staging platform. --Andrawaag (talk) 07:32, 23 May 2020 (UTC)


Hi Rdmpage, I am not quite sure I understand what this is all about, but are you aware of the existence of the property P5326 (P5326)? (I strictly restrict its use to original combinations and I do not know how to model literature for recombinations.) Totodu74 (talk) 13:13, 18 June 2020 (UTC)
@Totodu74: Hi, yes I'm aware of P5326 (P5326) (it's in the diagram I discuss above in "Understanding_Wikidata_taxonomy"). As often seems to be the case in Wikidata, there are several ways to do the same thing, and it's often not clear which way is best. Should we use P5326 (P5326) to link a name to a reference, or add the reference to taxon name (P225)? (I would lean towards the latter approach as it links the reference to the name.)

New property: World Checklist of Vascular Plants ID?

WikiProject Taxonomy has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead.

Hello! Is it possible to create a new property for World Checklist of Vascular Plants (Q96147023) ID ? TED 22:26, 8 June 2020 (UTC)

@TED: That is possible, as it's a good ID and hasn't been created yet. You just need to make a proposal. Mr. Fulano! Talk 23:48, 11 June 2020 (UTC)
@Mr. Fulano: how can I make a proposal? TED 23:53, 11 June 2020 (UTC)
@TED: Since you want to propose an authority control property, you should go to this page, enter the name of the property (World Checklist of Vascular Plants ID) in "Property name", create a request and fill in the template. If you want to know how to fill it in, you can use this proposal as an example. Mr. Fulano! Talk 00:07, 12 June 2020 (UTC)

Plagiobothrys plurisepaleus & Plagiobothrys plurisepalus

Plagiobothrys plurisepalus (Q17417249) and Plagiobothrys plurisepaleus (Q63153811) are orthographic variants of one another, with CHAH considering Plagiobothrys sepalus to be an orthographic variant of CHAH's accepted name, Plagiobothrys sepaleus. They are the same taxon concept. Hence these two items should be merged. MargaretRDonald (talk) 02:29, 19 June 2020 (UTC)

Synonyms for a novice editor

Hello all! Just a quick question. Balsamia polysperma (Q96677624) is deemed to be a synonym of Balsamia fragiformis (Q10425586) by Index Fungorum (Q1860469). Other external IDs, however, do not mark it as a synonym. I do not have professional knowledge of taxonomy, so I just wanted to see if I have structured this correctly on the item page. (tJosve05a (c) 07:06, 27 June 2020 (UTC)

@User:Josve05a: Please use taxon synonym (P1420) at Balsamia fragiformis (Q10425586) to model this. With your reference of course. --Succu (talk) 20:03, 27 June 2020 (UTC)
Alright, I've done that now for this item and will do so in the future. But can we create a query (or warning) for all instances where "P31:synonym of X" exists but there is no matching taxon synonym (P1420) at X? (tJosve05a (c) 07:42, 28 June 2020 (UTC)
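One possible shape for such a maintenance query is sketched below. It assumes "P31:synonym of X" means an instance of (P31) statement with value synonym (Q1040689) qualified by of (P642) pointing at X, and it reports pairs where X lacks a taxon synonym (P1420) claim back to the item. The function name and the optional limit are hypothetical, and the query has not been checked for timeouts on the live query service.

```python
# SPARQL: items marked "instance of synonym, of X" where X has no
# taxon synonym (P1420) statement pointing back at the item.
MISSING_SYNONYM_QUERY = """
SELECT ?synonym ?accepted WHERE {
  ?synonym p:P31 ?statement .
  ?statement ps:P31 wd:Q1040689 ;
             pq:P642 ?accepted .
  FILTER NOT EXISTS { ?accepted wdt:P1420 ?synonym . }
}
"""


def missing_synonym_query(limit=None):
    """Return the maintenance query, optionally capped at `limit` rows."""
    if limit is None:
        return MISSING_SYNONYM_QUERY
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return MISSING_SYNONYM_QUERY + f"LIMIT {int(limit)}\n"


if __name__ == "__main__":
    print(missing_synonym_query(limit=100))
```

Note that wdt:P1420 only sees non-deprecated, best-rank statements, so a deprecated taxon synonym claim would still be reported as missing.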

FYI--GZWDer (talk) 16:28, 28 June 2020 (UTC)