User:Strobilomyces/IFWDFAL detail


Index Fungorum/Wikidata Fungus Author Loader - detailed procedure

The software is written in Python and runs under Windows. I use PyCharm as IDE. Each batch of species names has a designation like g_m where "g" identifies a large group of batches and "m" is the particular batch. The folder and file names are then built in a standard way based on "g" and "m". During the exercise I was continually revising the software and introducing more automation.

After creating the input text file for a batch of species, the subsequent Python steps of the procedure can be invoked from Python monitor commands. Also I started to develop a graphical dashboard (using "tkinter") to make the process more convenient, but that is an independent sub-task not further described here.

The main TinyDB database table is called "taxa" and contains one "document" for each taxon item to be updated or generated. In this type of database a table consists of "documents" which can be queried. The document for a taxon includes fields for name and status information (including errors and warnings) and also data brought in from Wikidata (WD) and Index Fungorum (IF). A complete list of the fields can be seen here.

I started off with small batches of species and was working up to 50 and more at a time, hoping to automate the process better with more experience. First I applied the process to family Marasmiaceae, then I was working through order Agaricales in alphabetical order of genus. The steps of the process are described in more detail in the following sections.


1. Create the input species list for the batch

The overall process adds author citation information to species which already exist in WD. To create the input I used a SPARQL query like

select ?WDtaxonName where {
  VALUES ?parent { wd:Q1860430 wd:Q5184890 wd:Q10461686 }
  ?WDsp wdt:P31  wd:Q16521.   # instance of taxon
  ?WDsp wdt:P171 ?parent. # parent taxon
  ?WDsp wdt:P105 wd:Q7432.    # taxon rank species
  ?WDsp wdt:P225 ?WDtaxonName.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}  ORDER BY ?WDtaxonName

The output is downloaded as CSV to Notepad++ and the tabs are replaced by spaces. The ?parent values are genera in WD, which were found earlier by another SPARQL query.
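As a rough illustration of this step, the query text could be built from a list of genus Q numbers and the downloaded result read back into a list of names. This is only a sketch: the function names are hypothetical, and it assumes that the downloaded CSV has a single column headed "WDtaxonName".

```python
import csv
import io


def species_query(parent_qids):
    """Build the species-list query for the given parent genus Q numbers."""
    values = " ".join(f"wd:{qid}" for qid in parent_qids)
    return (
        "SELECT ?WDtaxonName WHERE {\n"
        f"  VALUES ?parent {{ {values} }}\n"
        "  ?WDsp wdt:P31  wd:Q16521.   # instance of taxon\n"
        "  ?WDsp wdt:P171 ?parent.     # parent taxon\n"
        "  ?WDsp wdt:P105 wd:Q7432.    # taxon rank species\n"
        "  ?WDsp wdt:P225 ?WDtaxonName.\n"
        "} ORDER BY ?WDtaxonName"
    )


def names_from_csv(text):
    """Return the taxon names from CSV text (assumed header: WDtaxonName)."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["WDtaxonName"].strip() for row in reader
            if row["WDtaxonName"].strip()]


query = species_query(["Q1860430", "Q5184890", "Q10461686"])
names = names_from_csv("WDtaxonName\r\nMarasmius oreades\r\nMarasmius rotula\r\n")
```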


2. Create and populate the database entries

First skeleton entries are created in the TinyDB database for each input species in the batch.

Then for each input species, the WD Q number is found using the SPARQL API based on property taxon name (P225), then the full WD information is obtained in JSON from the WD API and translated into a Python object. Then the relevant properties are extracted, checked and written to the TinyDB entry. This WD information is needed so that changes will not be executed if they have already been done.
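The Q number lookup can be sketched roughly as follows. The function names are hypothetical and the network call itself is left out; the sample response only imitates the standard SPARQL JSON results layout, in which each binding holds the entity URI.

```python
import json


def qid_query(taxon_name):
    """Build a SPARQL query that finds items by taxon name (P225)."""
    escaped = taxon_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?item WHERE {{ ?item wdt:P225 "{escaped}" . }}'


def qids_from_response(response_text):
    """Extract Q numbers from a SPARQL JSON result document."""
    data = json.loads(response_text)
    qids = []
    for binding in data["results"]["bindings"]:
        uri = binding["item"]["value"]       # e.g. http://www.wikidata.org/entity/Q...
        qids.append(uri.rsplit("/", 1)[-1])  # keep only the part after the last slash
    return qids


# Hand-written sample in the SPARQL JSON results layout (not a live response).
sample = json.dumps({
    "head": {"vars": ["item"]},
    "results": {"bindings": [
        {"item": {"type": "uri",
                  "value": "http://www.wikidata.org/entity/Q54359393"}}
    ]},
})
found = qids_from_response(sample)
```

A name that yields no Q number, or more than one, would presumably be recorded as an error in the TinyDB entry rather than processed further.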

Then for each input species without an error, the relevant information is extracted from IF based on the species name. Since IF often has multiple entries for the same name (but with different author citations), this involves an algorithm to select one entry if possible. Entries are rejected if there was a parsing error, if the author string or publication year is missing, or if the record is flagged with certain bad status indications; if that still leaves multiple entries, those with no current name are rejected as well. For the case where no unique entry is found, the TinyDB database contains a "special taxon ids" table which maps names to IF id numbers. A mapping in this table establishes the IF entry immediately and overrides the algorithm; where needed the table can be updated manually and the database entry repopulated. When the IF entry is found, the relevant fields (the ones listed here which begin "IF") are loaded into the database, together with any error and warning messages.
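The selection rules can be sketched like this. It is a minimal illustration, not the actual code: the record field names, the contents of BAD_STATUSES and the sample records (including the name, ids and years) are all invented for the example.

```python
# Hypothetical status flags; the real list of "bad" indications is not given here.
BAD_STATUSES = {"invalid", "illegitimate"}


def select_if_entry(name, candidates, special_ids=None):
    """Return the single acceptable IF record for a name, or None."""
    special_ids = special_ids or {}
    if name in special_ids:
        # A "special taxon ids" mapping overrides the algorithm.
        wanted = special_ids[name]
        return next((c for c in candidates if c["id"] == wanted), None)
    usable = [
        c for c in candidates
        if not c.get("parse_error")
        and c.get("authors")
        and c.get("year")
        and c.get("status") not in BAD_STATUSES
    ]
    if len(usable) > 1:
        # Second pass: drop entries which have no current name.
        usable = [c for c in usable if c.get("current_name")]
    return usable[0] if len(usable) == 1 else None


# Invented records for a fictitious name.
records = [
    {"id": 101, "authors": "Pers.", "year": "1801", "current_name": "Agaricus exemplaris"},
    {"id": 102, "authors": "", "year": "1850", "current_name": "Agaricus exemplaris"},
    {"id": 103, "authors": "Fr.", "year": "1821", "current_name": ""},
]
chosen = select_if_entry("Agaricus exemplaris", records)
forced = select_if_entry("Agaricus exemplaris", records,
                         special_ids={"Agaricus exemplaris": 103})
```

Returning None for the ambiguous case leaves the decision to a manual entry in the "special taxon ids" table.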

If there is already an author list present in WD, it is checked against the authors from IF.


3. Generation of basionym and parent entries in the DB

For each input species with a basionym (and without error), the basionym item is searched for in WD and if it exists, the Q number is recorded in the DB. If it does not exist, a new entry is created for it in the taxon table and the relevant fields are populated from IF. A similar process is carried out for replaced synonyms.

When a new basionym entry, replaced synonym entry, or parent entry is created, its own parent is checked and if necessary created and populated as another entry in a recursive process. Also a parent may itself have a basionym or replaced synonym. In fact there is no limit to the number of new taxa which could be generated in this process, since an item at the species level can have a basionym at the subspecies level, whose parent species can have a further basionym at the subspecies level, and so on.

The parent names of species and subspecies are derived automatically, but for higher ranks this process needs manual intervention.
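For species and infraspecific names, the derivation of the parent name might look like the sketch below. The function name and the exact rules are assumptions made for illustration: a two-word name is treated as a species, and a four-word name as a species name followed by a rank indicator and an infraspecific epithet.

```python
def parent_name(taxon_name):
    """Derive the parent name for a species or infraspecific name.

    Returns None for anything else, which then needs manual handling.
    """
    words = taxon_name.split()
    if len(words) == 2 and words[1][0].islower():
        return words[0]                 # species -> genus
    if len(words) == 4 and words[1][0].islower():
        return " ".join(words[:2])      # e.g. "Genus species var. x" -> species
    return None


species_parent = parent_name("Clitocybe mortuosa")
infraspecific_parent = parent_name("Agaricus metachrous β mortuosa")
genus_parent = parent_name("Gymnogaster")
```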


4. Missing authors

For each author abbreviation in the IF author citation string, the item for that author is searched for in WD based on property botanist author abbreviation (P428) using a SPARQL API call. A space may need to be deleted from the IF author abbreviation to convert it to the IPNI abbreviation which is used in WD. The Q numbers of the authors found are added in the TinyDB DB. If the author is not in WD, it is searched for in IPNI and if found there, V1 QuickStatements commands to create the author item are generated. However the sex or gender (P21) property of the author is not included as it is not in IPNI.
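The author handling can be illustrated with the following sketch. The function names are hypothetical, the splitting rule only covers parentheses, commas, "&" and "ex", and the space-removal rule is an assumption about how IF and IPNI abbreviations differ. The commands use the tab-separated V1 QuickStatements layout; the author label and abbreviation in the example are invented.

```python
import re


def split_authors(citation):
    """Split an IF author citation string into single author abbreviations."""
    parts = re.split(r"[(),&]|\bex\b", citation)
    return [p.strip() for p in parts if p.strip()]


def to_ipni(abbreviation):
    """Delete spaces after full stops, e.g. 'P. Kumm.' -> 'P.Kumm.'."""
    return re.sub(r"\.\s+", ".", abbreviation.strip())


def author_query(ipni_abbreviation):
    """Build a SPARQL query for an author by botanist author abbreviation."""
    return f'SELECT ?author WHERE {{ ?author wdt:P428 "{ipni_abbreviation}" . }}'


def new_author_commands(label, ipni_abbreviation):
    """V1 QuickStatements lines to create a person item with a P428 value."""
    return [
        "CREATE",
        f'LAST\tLen\t"{label}"',
        "LAST\tP31\tQ5",                       # instance of: human
        f'LAST\tP428\t"{ipni_abbreviation}"',  # botanist author abbreviation
    ]


abbreviations = [to_ipni(a) for a in split_authors("(Pers.) P. Kumm.")]
commands = new_author_commands("Anna Example", "A.Example")  # invented author
```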

A manual search in WD is needed before creating the new author item; it could be that the person is already in WD and it is only necessary to set the P428 value. IF should use the IPNI author abbreviations (sometimes with one extra space), but the two databases may be out of step. If the given abbreviation is not in IPNI, IPNI should be searched manually using the author's name, which can be looked up in IF. The software includes a table which maps IF author abbreviations to IPNI/WD author abbreviations in exceptional cases; this table needs to be updated if there is a discrepancy, and Paul Kirk of IF should also be informed. If the author is completely missing from IPNI, it is necessary to send an email to ipnifeedback@kew.org; they normally add the missing entry promptly.

After the author item has been set up in WD, it will be necessary to refresh the relevant taxon item(s) in the TinyDB DB.


5. Problems finding names

There are various reasons why a name in WD may not be found in IF, in which case manual investigation will be needed to decide how to proceed.

  1. Sometimes very doubtful names are introduced into WD, especially from the iNaturalist site, which seems very unreliable. For instance Gliophorus fenestratus (Q49629887) does not seem to exist anywhere else; in any case the IFWDFAL process will not work without an IF entry.
  2. Alternative spellings can cause names not to be found in IF. For instance Clavaria gibbseae and Clavaria gibbsiae both had items in WD but the former is just a mis-spelling and I merged them manually. Entoloma readii and Entoloma readiae are similar. However it is not possible to merge them if both spellings have wikilinks to wikipedia articles of the same language. Swedish WP items are generated automatically and that sort of duplication can happen. In one case I followed it up using a talk page and the Swedish editors were helpful and merged their articles accordingly.

    Another example of variant spelling is Entoloma chalybaeum, which was in WD as Entoloma chalybeum; in this case I altered the WD taxon name (P225) to agree with IF.

    Accents are not allowed in species names, but Calyptella nabambissoënsis is spelt with an accent in WD and Mycobank, though not in IF.

  3. There may be a complete mistake in WD. For instance Gymnogaster is a genus of fungi and also a genus of insects (that sort of name conflict is allowed by the nomenclature rules). Gymnogaster buphthalma was wrongly given Gymnogaster (Q1610174) as a parent instead of Gymnogaster (insect) (Q28946615). Also Flammulina may be a fungus or a mollusc.
  4. Early mycologists sometimes used Greek letters such as β as a sort of rank indication, for example the basionym of Clitocybe mortuosa is Agaricus metachrous β mortuosa (evidently the latter is below the species level). Although the IF API returns those letters in its output, queries which use them do not work (as far as I can tell). Sometimes the characters seen on the interactive IF interface are different from those returned through the API. The "special taxon ids" table (see above) can be used to get around this problem with manual intervention.
  5. In rare cases the species may have been hidden in IF, probably because correct publication data are not available. For instance Agaricus aurantioviolaceus is in Mycobank. The IF/Mycobank number 463125 comes from IF and so the name must exist in IF, but querying on this name is not allowed. In fact the IF information can still be obtained from IF through the API based on the IF/Mycobank number.
  6. If multiple IF entries are still found after applying the selection algorithm described in section 2, the "special taxon ids" table can be used to set one of them (chosen manually). For instance the name Agaricus decipiens existed with four different author citation strings and corresponding different meanings: "Scop.", "Willd.", "Pers.", and "W.G.Sm."; one of them had to be chosen manually.


6. Creation of new items in WD

When all possible interventions have been made to correct problems, and the corresponding database entries have been recalculated ("repopulated"), the QS commands to generate the items to be created (for basionyms, replaced synonyms or parents) are generated. A warning is always generated for replaced synonyms (so that they will be checked manually) and for those taxa which depend on other taxa which are not yet in WD. For a given batch, 3 files are generated for those new items which have no errors against them:

  1. commands to create items for which there are no warnings,
  2. [for new items with warnings] those command lines which do not need Q numbers of items not yet created, and which therefore can be executed immediately,
  3. [for new items with warnings] those command lines which use taxon ids not yet created; the Q numbers in question being replaced by a phrase in asterisks.
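The division into the three files can be pictured with the sketch below. The data layout (a list of dictionaries with "commands", "warnings" and "errors" keys) is an assumption made for illustration; the only point taken from the description above is that a not-yet-known Q number appears as a phrase in asterisks.

```python
import re

# A placeholder is a phrase between asterisks standing for an unknown Q number.
PLACEHOLDER = re.compile(r"\*[^*\t]+\*")


def split_commands(items):
    """Sort command lines into the three files described above."""
    no_warning, immediate, deferred = [], [], []
    for item in items:
        if item.get("errors"):
            continue                      # items with errors are left out
        for line in item["commands"]:
            if not item.get("warnings"):
                no_warning.append(line)   # file 1
            elif PLACEHOLDER.search(line):
                deferred.append(line)     # file 3: needs a Q number later
            else:
                immediate.append(line)    # file 2: can be run at once
    return no_warning, immediate, deferred


example_items = [
    {"commands": ["CREATE", 'LAST\tLen\t"Armillaria straminea"',
                  "*Armillaria straminea*\tP694\t*Agaricus stramineus*"],
     "warnings": ["depends on an item not yet in WD"]},
    {"commands": ["CREATE"], "errors": ["no IF entry"]},
]
file1, file2, file3 = split_commands(example_items)
```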

The new items sometimes cannot be created in one run, since two new items may have to point at each other and the Q number of each one is not known until it has been created. For instance the basionym of Floccularia straminea (Q54359393) is Armillaria straminea, which has the replaced synonym Agaricus stramineus Krombh.; items have to be created for the last two names. The item number of Agaricus stramineus is needed to set the replaced synonym (for nom. nov.) (P694) value of Armillaria straminea, and the item number of Armillaria straminea is needed to set the subject has role (P2868) value of Agaricus stramineus.

Each file will be executed using V1 QuickStatements. First the commands are copied and pasted to the QS page and the "Import V1 commands" button is clicked. On the resulting screen checking is easy (since item labels are given as well as Q numbers) and "Run" is clicked to actually make the changes in WD.

If there are any lines in file 3, they can only be completed after file 2 has been executed so that the relevant Q numbers are known. In the present state of the procedure the necessary substitutions in file 3 are made manually.
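Purely as an illustration of how these substitutions might be automated (they are manual in the procedure as it stands), a helper could replace each phrase in asterisks by the Q number of the item once it is known. The function name is hypothetical and the Q numbers in the example are made up.

```python
import re

# Captures the phrase between asterisks, e.g. *Agaricus stramineus*.
PLACEHOLDER = re.compile(r"\*([^*\t]+)\*")


def fill_placeholders(lines, qids):
    """Replace *name* placeholders by Q numbers; unknown names are left as-is."""
    def replace(match):
        return qids.get(match.group(1), match.group(0))
    return [PLACEHOLDER.sub(replace, line) for line in lines]


pending = ["*Armillaria straminea*\tP694\t*Agaricus stramineus*"]
# Made-up Q numbers standing in for the items created from file 2.
known = {"Armillaria straminea": "Q900000001",
         "Agaricus stramineus": "Q900000002"}
ready = fill_placeholders(pending, known)
```

Lines which still contain an asterisk phrase after this step would have to wait for a later run.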

Then these changes are checked in Wikidata.


7. Generation and execution of updates in WD

After the new items have been inserted, the database entries for items which depend on them have to be recalculated ("repopulated") so that the new Q numbers are found where appropriate, and the relevant warnings will no longer appear.

Every query to find a WD item based on claims has to use SPARQL and the SPARQL interface has a very troublesome limitation in that it depends on a cache which sometimes takes hours to refresh. This means that after loading new items it may be necessary to wait hours before the repopulation can take place successfully.
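One way to cope with the delay is to repeat the lookup at intervals until the new item becomes visible. The sketch below assumes a hypothetical lookup function which returns a Q number or None; the number of attempts and the waiting time are arbitrary choices.

```python
import time


def wait_for_qid(lookup, name, attempts=6, delay_seconds=600.0, sleep=time.sleep):
    """Call lookup(name) until it returns a Q number or attempts run out."""
    for attempt in range(attempts):
        qid = lookup(name)
        if qid is not None:
            return qid
        if attempt < attempts - 1:
            sleep(delay_seconds)   # give the SPARQL cache time to refresh
    return None


# Demonstration with a fake lookup that succeeds on the third call
# and a fake sleep that only records the requested delays.
calls = []
delays = []

def fake_lookup(name):
    calls.append(name)
    return "Q900000001" if len(calls) >= 3 else None

result = wait_for_qid(fake_lookup, "Armillaria straminea",
                      delay_seconds=1.0, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the helper testable without real waiting.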

When the taxon entries depending on new items have been repopulated, the QS statements to do the updates to existing taxon items are generated. Note that for updates not only the IF information, but also the WD information which is already there has to be checked to determine what changes need to be done; for instance if the taxon name (P225) author qualifiers are already there, they are not added again. For a given batch, 2 files are generated for those items which have no errors against them:

  1. commands to update items for which there are no warnings,
  2. commands to update items for which there are warnings, and which therefore need special checking.
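The check against the existing WD data mentioned above can be sketched for the author qualifiers as follows. The function name is hypothetical, taxon author (P405) is assumed to be the qualifier in question, and the author Q numbers in the example are made up. The line follows the tab-separated V1 QuickStatements layout of item, property and value followed by qualifier pairs.

```python
def author_qualifier_command(item_qid, taxon_name, if_author_qids, wd_author_qids):
    """Build one QS line adding only the author qualifiers missing from WD.

    Returns None when every author from IF is already present in WD.
    """
    missing = [q for q in if_author_qids if q not in wd_author_qids]
    if not missing:
        return None
    fields = [item_qid, "P225", f'"{taxon_name}"']
    for author_qid in missing:
        fields += ["P405", author_qid]   # taxon author qualifier
    return "\t".join(fields)


# Made-up author Q numbers; the first one is assumed to be in WD already.
line = author_qualifier_command(
    "Q54359393", "Floccularia straminea",
    if_author_qids=["Q900000010", "Q900000011"],
    wd_author_qids=["Q900000010"],
)
nothing_to_do = author_qualifier_command(
    "Q54359393", "Floccularia straminea",
    if_author_qids=["Q900000010"],
    wd_author_qids=["Q900000010"],
)
```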

Again each of these two files will be executed using V1 QuickStatements, as described above for newly inserted items.

Finally the batch of updates is checked by comparing WD and IF for particular species. Special attention is paid when the warnings show that problems might occur (such as with replaced synonyms). On the other hand, where experience shows that the process normally works smoothly, only a proportion of cases are checked. The problems and error cases are recorded in a LibreOffice spreadsheet.