Our goal is to make Wikidata the canonical resource for referencing and translating these identifiers. The goal of this sprint is to:
- Continue adding all identifiers for human genes and proteins from the User:ProteinBoxBot/201408 sprint
- Create a generic workflow for additional resources
- Create an update bot
- Search for a property ID: www.wikidata.org/w/api.php?action=wbsearchentities&search=<property>&language=en&type=property (e.g. Entrez)
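The property search above can be scripted. The following is a minimal sketch, not the bot's actual code: the helper names `property_search_url` and `property_ids` are hypothetical, and `format=json` is added to the query so the response can be parsed. The second helper assumes the response has already been fetched and decoded from JSON.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def property_search_url(term, language="en"):
    """Build a wbsearchentities URL that looks for properties matching `term`."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "property",
        "format": "json",  # not in the URL above; added so the reply is JSON
    }
    return API_ENDPOINT + "?" + urlencode(params)


def property_ids(response):
    """Return (id, label) pairs from a decoded wbsearchentities response."""
    return [(hit["id"], hit.get("label", "")) for hit in response.get("search", [])]
```

Calling `property_search_url("Entrez")` yields a URL that can be opened in a browser or passed to any HTTP client.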
Original source data files
- Ensembl: ftp://ftp.ensembl.org/pub/release-76/mysql/homo_sapiens_core_76_38/
- Entrez Gene: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
(All information comes in via http://mygene.info.)
- Separate two bot functions: 1) creating stubs, 2) adding additional properties and updating; confirm that both can use the Wikidata API on the Entrez Gene property
- Add the remaining Entrez genes as stubs in Wikidata
- Create infrastructure for a continuously running update bot
Separate two bot functions
In this sprint three bots have been developed:
The first bot, the StubBot, takes a text file containing Entrez Gene identifiers and their labels as input. For each identifier, an item with the corresponding label is added to Wikidata if it does not already exist. Three properties are added:
- The Entrez Gene Identifier (Property:P351)
- Subclass of Gene (Property:P279 and Q7187)
- Found in Taxonomy Human (Property:P703 and Q5)
It took about a week for the StubBot to process all of Entrez Gene. A log file records each Entrez Gene ID and its corresponding Wikidata ID: Entrez2WikiData_Success.log
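As an illustration of what such a stub contains, the sketch below assembles the label and the three statements into the JSON structure that the Wikidata API accepts as the `data` parameter of `wbeditentity`. It is an assumption about how a stub could be built, not the StubBot's actual code; the function names are hypothetical, and authentication and the API call itself are left out.

```python
def string_claim(prop, value):
    """Statement whose value is a plain string (used here for the Entrez Gene ID)."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {"value": value, "type": "string"},
        },
        "type": "statement",
        "rank": "normal",
    }


def item_claim(prop, numeric_id):
    """Statement whose value is another Wikidata item, given by its numeric Q-id."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {
                "value": {"entity-type": "item", "numeric-id": numeric_id},
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "rank": "normal",
    }


def stub_payload(entrez_id, label):
    """Label plus the three statements listed above, for one gene stub."""
    return {
        "labels": {"en": {"language": "en", "value": label}},
        "claims": [
            string_claim("P351", str(entrez_id)),  # Entrez Gene identifier
            item_claim("P279", 7187),              # subclass of: gene (Q7187)
            item_claim("P703", 5),                 # found in taxon: Q5, as listed above
        ],
    }
```

Serialising the result with `json.dumps(stub_payload(1017, "CDK2"))` gives a string that could be submitted when creating a new item; the gene in this call is only an example.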
Once the stubs are created for the Entrez Gene entries, a second bot takes over to enrich them with content provided by http://mygene.info. At the time of writing, this bot is still running and has covered 34% of the entries in one week.
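A rough sketch of how the enrichment data for one gene could be requested is shown below. The `/v3/gene/<id>` path and the field names are assumptions based on the public mygene.info service and may differ from what the bot used during this sprint; `mygene_url` is a hypothetical helper that only builds the request URL.

```python
from urllib.parse import urlencode

MYGENE_BASE = "http://mygene.info/v3/gene/"  # assumed API version


def mygene_url(entrez_id, fields=("symbol", "name", "ensembl.gene")):
    """Build a mygene.info annotation URL for a single Entrez Gene ID."""
    # safe="," keeps the comma-separated field list readable in the URL
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return MYGENE_BASE + str(entrez_id) + "?" + query
```

The returned URL can be fetched with any HTTP client; the JSON reply then supplies the values for the additional properties.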
The third bot written during this sprint corrects an error made in last month's sprint. In its first development stages, the bot did not check whether an entry for the Entrez gene label already existed in Wikidata but was annotated as a subclass of something other than Gene. This error caused around 200 erroneous additions, which the geneProteinCorrectionBot fixed. This bot is needed only once, since the bug responsible is fixed in the latest ProteinBoxBot code.
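The missing check can be sketched as follows. This is an illustration of the idea, not the ProteinBoxBot code: it assumes the existing item has been retrieved as entity JSON (for example via `wbgetentities`) and inspects its "subclass of" (P279) statements before anything is added. The function names and the three return labels are hypothetical.

```python
GENE_QID = 7187  # numeric part of Q7187 (gene)


def subclass_targets(entity):
    """Collect the numeric Q-ids that the item is declared a subclass of."""
    targets = set()
    for statement in entity.get("claims", {}).get("P279", []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue is None:
            # "novalue" and "somevalue" snaks carry no datavalue
            continue
        targets.add(datavalue["value"]["numeric-id"])
    return targets


def classify_existing_item(entity):
    """Return 'gene', 'other' or 'unclassified' for an existing item."""
    targets = subclass_targets(entity)
    if GENE_QID in targets:
        return "gene"
    if targets:
        return "other"  # same label, but annotated as a subclass of something else
    return "unclassified"
```

Only items classified as `"gene"` would be safe to enrich; an `"other"` result signals a label clash of the kind that produced the erroneous additions.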