Our goal is to make Wikidata the canonical resource for referencing and translating these identifiers. The goal of this sprint is to:
- Continue adding all identifiers for human genes and proteins from the User:ProteinBoxBot/201408 sprint.
- Create a generic workflow for additional resources
- Create an update bot
In the last two sprints we focused on getting the genes of the human genome into Wikidata. In its final form, the process consisted of two bot tasks. The first was a stub creator: for each Entrez Gene entry, it created a stub containing the title of the gene, its aliases, its Entrez Gene identifier, and claims that the entry is a subclass of gene and belongs to the human species. The second bot enriched each entry with related identifiers and chromosomal positions. The overall process was quite slow because of the many API calls it made. To add a claim, the bot had to check whether that claim already existed (first call); if not, it created the claim (second call) and then added a reference (third call).
Creating a stub meant checking whether the entry exists and, if not, creating it, adding the label, adding the aliases, stating that it is a gene from the human species with its Entrez Gene identifier, and adding a reference to Entrez for each claim, which comes to 10 individual API calls. Each subsequent identifier required 3 more API calls, so a gene with 5 identifiers resulted in at least 25 API calls. Processing all human Entrez entries took us 6 weeks.
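The call count above can be written out as a small calculation. This is only an illustrative sketch of the arithmetic in the text; the function and constant names are made up for this example and are not part of ProteinBoxBot.

```python
# Figures taken from the description above (names are hypothetical).
CALLS_PER_STUB = 10        # create the entry, label, aliases, base claims, references
CALLS_PER_IDENTIFIER = 3   # existence check + claim + reference


def old_workflow_calls(n_identifiers: int) -> int:
    """Minimum number of API calls needed for one gene in the old workflow."""
    return CALLS_PER_STUB + CALLS_PER_IDENTIFIER * n_identifiers


if __name__ == "__main__":
    # A gene with 5 additional identifiers.
    print(old_workflow_calls(5))  # prints 25
```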
On the Wikidata mailing list it was suggested that we use a different API call, "wbeditentity", in which the whole data model of a Wikidata entry is submitted in a single API call. The objective of this sprint is to adapt ProteinBoxBot to use this "wbeditentity" call to increase its performance. As a test case, the genome of the house mouse will be added.
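As a rough sketch of the single-call approach, the label, aliases and a claim for a gene can be collected into one JSON document and sent as the `data` parameter of the `wbeditentity` action. The code below only builds the payload and the request parameters; it does not contact Wikidata. The property ID `P351` for the Entrez Gene identifier, the helper names and the example values are assumptions for illustration and should be checked before use. References and item-valued claims (subclass of, species) are left out for brevity.

```python
import json

# Assumption: property ID used for the Entrez Gene identifier.
ENTREZ_GENE_ID = "P351"


def string_claim(prop: str, value: str) -> dict:
    """Build one statement with a plain string value."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": prop,
            "datavalue": {"value": value, "type": "string"},
        },
        "type": "statement",
        "rank": "normal",
    }


def build_gene_entity(label: str, aliases: list, entrez_id: str, lang: str = "en") -> dict:
    """Collect label, aliases and claims for one gene in a single document."""
    return {
        "labels": {lang: {"language": lang, "value": label}},
        "aliases": {lang: [{"language": lang, "value": a} for a in aliases]},
        "claims": [string_claim(ENTREZ_GENE_ID, entrez_id)],
    }


def build_request_params(entity: dict, token: str) -> dict:
    """Parameters for one wbeditentity request that creates a new item."""
    return {
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(entity),
        "token": token,
        "format": "json",
    }


if __name__ == "__main__":
    # Placeholder values, not a real gene record.
    entity = build_gene_entity("example gene", ["EXG1"], "12345")
    params = build_request_params(entity, token="EDIT-TOKEN-PLACEHOLDER")
    print(params["action"], len(json.loads(params["data"])["claims"]))
```

With this shape, a gene with any number of identifiers needs one write request: each extra identifier becomes another entry in the `claims` list instead of three more calls.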
- create stubs for all DO classes on Wikidata (Action Andra)
- create stubs for all SYMP classes on Wikidata (Action Andra)
- Manually complete the entry for Chapare hemorrhagic fever (Action Elvira)
Original source data files
- Entrez Gene: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
(all information comes in via http://mygene.info)
- adaptation to wbeditentity API
- add mouse
- automation on schedule
- import DO
adaptation to wbeditentity API
- Refactor the existing ProteinBoxBot to use the wbeditentity API call.
- test on 10 genes
- test on 100 genes
- test on 1000 genes
- full run
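The staged test plan above (10, 100, 1000 genes, then a full run) could be driven by a small helper like the one below. This is a hypothetical sketch, not existing bot code: `process` stands in for whatever function writes one gene and returns True on success. It assumes the bot is safe to re-run on genes it has already handled, since each stage starts again from the beginning of the list.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Stage sizes from the plan above; None means "all genes".
STAGE_SIZES: Tuple[Optional[int], ...] = (10, 100, 1000, None)


def run_staged(
    gene_ids: Sequence[str],
    process: Callable[[str], bool],
    stage_sizes: Sequence[Optional[int]] = STAGE_SIZES,
) -> List[Tuple[int, int]]:
    """Run `process` on growing prefixes of `gene_ids`.

    Returns one (batch_size, failure_count) pair per stage and stops
    after the first stage that reports a failure.
    """
    results = []
    for size in stage_sizes:
        batch = gene_ids if size is None else gene_ids[:size]
        failures = [g for g in batch if not process(g)]
        results.append((len(batch), len(failures)))
        if failures:
            break  # fix the problem before moving on to a larger batch
    return results


if __name__ == "__main__":
    ids = [str(i) for i in range(2000)]  # placeholder identifiers
    print(run_staged(ids, lambda gene_id: True))
```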
automation on schedule
- investigate automation by Sentry
- investigate integration with the update cycle of mygene.info
import DO
- confirm all necessary properties exist
- list what is already captured in wikidata
- prototype an example disease by hand
- write the bot
- test on 2 diseases
- test on 10 diseases
- full run