Hi, I am Chinmay Naik, an open source enthusiast, a free soul and an Arsenal fan. My first foray into Wikidata was through the GSOC-2013 project "Convert Gene Bot to write to Wikidata" for the organisation Crowdsourcing Biology at the Scripps Research Institute. Now focusing on adding Global Economic data, Wikidata toolkits ...
Pygenewiki (Gene Wiki Bot) is complete and has successfully added 10,000+ gene items to Wikidata. Cheers!! Now moving on to Global Economic data.
We have reworked the model representation. The fields of the PBB templates will now be captured in four separate Wikidata items:
- HUMAN GENE
- HUMAN PROTEIN
- MOUSE GENE
- MOUSE PROTEIN
The project is hosted here. It is under active development!
Short description: Pygenewiki (Gene Wiki Bot) automatically creates and updates the ProteinBox templates that form the infoboxes of Gene Wiki articles. It retrieves information about genes from databases (such as NCBI, HUGO etc.) and populates the ProteinBox templates with this information; the templates are then inserted into Gene Wiki articles. Wikidata is a free knowledge base that aims to provide structured data easily accessible to anyone. The proposed bot would capture the information about a particular gene in a Wikidata item and map it to the corresponding gene article. This will help to create structured data about genes in an easily accessible database. Other aspects of a gene, such as its relationships with other genes, diseases etc., can also be captured.
The current Gene Wiki Bot has the following design:
- It retrieves gene information from public databases through MyGene.info.
- It populates the gene templates with this gene information; these templates form the infoboxes in gene articles.
- It also creates new article stubs for new gene articles.
The proposed bot will have the following design:
- Continue to retrieve gene information from public databases through MyGene.info.
- Populate the associated gene Wikidata item with the corresponding gene information.
- To fill the infobox of a new gene article, first create the associated gene Wikidata item, then link it with the Wikipedia article.
Consider the outline of the current implementation of Gene Wiki Bot:
- The GNF_Protein_box Template defines a set of fields to capture gene information. The fields represent the data present in the infoboxes of gene articles. A ProteinBox class is defined with the above fields.
- The bot accesses public databases through MyGene.info. MyGene.info provides REST web services to retrieve gene data. The bot retrieves three JSON documents to construct a ProteinBox: gene.json (information about the human gene), meta.json (metadata about the gene) and homolog.json (information about the corresponding mouse gene). These JSON objects are parsed, the gene values are extracted, and a ProteinBox object for the corresponding gene is constructed.
- The bot uses a Redis database to link genes to the published data found in PubMed. The Redis database is initialized with the PubMed ids (published article ids) for each gene. These PubMed articles form the references of Gene Wiki articles.
- When a new article stub is created, these references are inserted through the Template:Cite pmid template. This syntax fills out the citation template every time we want to cite a source. The template is then converted to Template:Cite journal syntax by a bot.
- The bot scans Commons for protein structure images. If an image is not present in Commons, it downloads the PDB file from the RCSB database, renders it using PyMOL and uploads it to Commons.
- The bot uses MwClient to query Wikipedia. MwClient is a Python client library for the MediaWiki API which provides access to all implemented API functions. Using MwClient, the bot retrieves the ProteinBox Template of the corresponding gene from Wikipedia. It parses the template and extracts the current gene information.
- The current gene information from the ProteinBox Template is compared with the constructed ProteinBox gene information. Outdated or missing gene values are updated in the ProteinBox Template of the corresponding gene.
- The implementation of the bot is template driven. For each gene, a corresponding ProteinBox Template is maintained, e.g. Template:PBB/5649 for reelin. Gene information is collected in PBB templates, and the PBB templates are included in Gene Wiki articles.
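The comparison step in the outline above (current template values against freshly constructed values) could be sketched roughly as follows. This is only an illustration and not the bot's actual code: the function name `diff_fields`, the use of plain dictionaries in place of ProteinBox objects, and the field names in the sample data are all assumptions.

```python
from typing import Dict


def diff_fields(current: Dict[str, str], fresh: Dict[str, str]) -> Dict[str, str]:
    """Return the fields that are missing or outdated in `current`.

    `current` holds values parsed from the existing template,
    `fresh` holds values built from the retrieved gene data.
    """
    updates = {}
    for name, new_value in fresh.items():
        if not new_value:
            # No fresh value was retrieved, so leave the template field alone.
            continue
        if current.get(name, "").strip() != new_value.strip():
            updates[name] = new_value
    return updates


# Illustrative field names and values only.
current = {"Hs_EntrezGene": "5649", "Symbol": "", "Name": "reelin"}
fresh = {"Hs_EntrezGene": "5649", "Symbol": "RELN", "Name": "reelin"}
updates = diff_fields(current, fresh)
```

Only the fields returned in `updates` would need to be written back, which keeps the number of edits per gene small.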
The proposed outline of the implementation of the Bot will be as follows:
- The first link of the bot will be retained. Thus, the bot will continue to access gene information from databases as the current one does (i.e. through MyGene.info). The Redis database that handles the gene-PubMed mapping, Template:Cite pmid to fill in PubMed references, the insertion of protein structure images through Commons, PyMOL for image rendering (if an image is not present in Commons) and Py.test for unit tests will also be retained in the proposed bot.
- The bot will move away from the template-driven implementation. The main purpose of maintaining a separate template for each gene was to store gene-specific information. This let the bot work with templates without directly touching the gene Wikipedia articles. It abstracted the process of maintaining gene article infoboxes and shortened the gene Wikipedia articles as well. Since a separate Wikidata item will store the gene information, we no longer need PBB templates.
- The original GNF_Template structure will be retained with some modification. The parameter values will be replaced by Wikidata inclusion syntax.
- The Wikidata API is an extension of the current MediaWiki API. As such, MwClient will be ineffective for querying Wikidata. The proposed bot will therefore use the PywikipediaBot framework (rewrite branch) to query Wikidata. This framework supports all the necessary functions, such as creating Wikidata items, statements and properties, querying Wikidata etc.
Create/Update WIKIDATA Item
To capture gene information, a separate Wikidata item should be created for each gene. To prevent duplication, the bot should first check whether such a gene Wikidata item already exists. Wikidata is a document-oriented database focused on items. It has been proposed to classify these items into categories (Wikidata development notes). It would then be possible to query the entire group of gene items by specifying a category, similar to the current set of templates in “Human Proteins”. Otherwise, we would have to keep track of the entire set of gene items ourselves.
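The "check before create" rule could look roughly like the sketch below. It assumes the bot keeps its own index from Entrez Gene ID to Wikidata id, which is one possible way of tracking the gene items; `find_or_create` and the `create` callback are hypothetical names, and the callback stands in for whatever call actually creates an item on Wikidata.

```python
from typing import Callable, Dict


def find_or_create(entrez_id: str,
                   index: Dict[str, str],
                   create: Callable[[str], str]) -> str:
    """Return the item id for a gene, creating the item only if it is unknown."""
    item_id = index.get(entrez_id)
    if item_id is None:
        item_id = create(entrez_id)   # hypothetical creation step
        index[entrez_id] = item_id    # remember it to prevent duplicates
    return item_id


# Stand-in for the real creation call; it only records what was requested.
created = []

def fake_create(entrez_id: str) -> str:
    created.append(entrez_id)
    return "Q-placeholder-%s" % entrez_id


index = {"5649": "Q414043"}  # reelin, from the example in this proposal
first = find_or_create("5649", index, fake_create)
second = find_or_create("0000", index, fake_create)
third = find_or_create("0000", index, fake_create)
```

Here the known gene is returned without any creation call, and the unknown one is created exactly once even when requested twice.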
Create/Update WIKIDATA Item Statements
Wikidata item statements will store the various aspects of gene information. These include identifier properties such as Entrez ID, Type etc., relationship properties and so on. Each statement will have a property, an associated value, references, qualifiers etc. A separate Wikidata item for the mouse gene will also be created to capture the last eight identifiers (Mm_EntrezGene etc.). The 'HasOrthologousID' property will point to the mouse Wikidata item.
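As a rough picture of the statement structure described above, the sketch below models a statement as a property with a value, references and qualifiers, and links a human item to a mouse item. It is not the Wikidata data model itself: the class names `Statement` and `GeneItem`, the item labels and the reference string are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Statement:
    prop: str                                   # property name
    value: str                                  # associated value
    references: List[str] = field(default_factory=list)
    qualifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class GeneItem:
    label: str
    statements: List[Statement] = field(default_factory=list)

    def add(self, prop: str, value: str,
            references: Optional[List[str]] = None) -> Statement:
        """Attach a new statement to this item and return it."""
        statement = Statement(prop, value, list(references or []))
        self.statements.append(statement)
        return statement


# Illustrative items: one for the human gene, one for the mouse gene.
human = GeneItem("reelin (human)")
mouse = GeneItem("reelin (mouse)")
human.add("Entrez Gene ID", "5649", references=["MyGene.info"])
human.add("HasOrthologousID", mouse.label)  # points at the mouse item
```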
The structure of references is under development. The Wikidata model specifies that website links will be supported in the future. A URL/URI datatype is also in the pipeline. It would be better to work with such properties as strings for now; when the URL datatype becomes available, we can convert those properties automatically.
There is a long list of external links associated with each gene (e.g. gene ontology information, HUGO nomenclature etc.). These external links are hidden by default, and working with them as they are would be fine as well. However, some of these links are significant enough to be stored on Wikidata and then linked to gene articles.
Populate Gene Item Statements with Correct Gene Information
- Data retrieved from MyGene.info will be used here. If a particular gene value is empty, then the corresponding property value will be left empty. The bot will first need to check the value type of each property.
- When a Wikidata item or property is created, it is assigned a constant identifier (e.g. reelin has the Wikidata id Q414043, and the property 'Entrez Gene ID' has the id P351). The gene item is an important object within Wikidata and Wikipedia. As such, separate dictionaries can be maintained to hold the entire set of gene Wikidata items and the entire set of properties used to represent gene items. Whenever a new gene Wikidata item or biological property is created, the corresponding dictionary can be updated. Ex:
Gene_Wiki_items = { 'reelin' : 414043, …. }
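A small sketch of how such dictionaries might be used is given below. The two ids come from the example above (reelin is Q414043, 'Entrez Gene ID' is P351); the helper names `item_id`, `property_id` and `register_item` are hypothetical.

```python
# Numeric ids only; the Q/P prefix is added when an id is needed.
GENE_WIKI_ITEMS = {"reelin": 414043}
GENE_PROPERTIES = {"Entrez Gene ID": 351}


def item_id(gene_name: str) -> str:
    """Return the Wikidata item id (Q...) for a known gene."""
    return "Q%d" % GENE_WIKI_ITEMS[gene_name]


def property_id(property_name: str) -> str:
    """Return the Wikidata property id (P...) for a known property."""
    return "P%d" % GENE_PROPERTIES[property_name]


def register_item(gene_name: str, numeric_id: int) -> None:
    """Record a newly created gene item so later runs can find it."""
    if gene_name in GENE_WIKI_ITEMS:
        raise ValueError("item already registered: %s" % gene_name)
    GENE_WIKI_ITEMS[gene_name] = numeric_id


reelin_item = item_id("reelin")
entrez_property = property_id("Entrez Gene ID")
```

Refusing to register a name twice is a design choice made for this sketch; it gives an early warning if the bot is about to create a duplicate item.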
Map Wikidata Items to Wikipedia articles
By default, every gene Wikipedia article will be linked with the corresponding Wikidata item. The main principle of the proposed bot can be termed “Transclusion of Wikidata items”.
The source of the GNF_Protein_Box Template can be modified as follows (sample modification of the image template field):
Infobox3cols ….. | image = …..........
- All template field values can be replaced by the corresponding Wikidata inclusion syntax for a property. When the page is rendered, the template structure will be inserted first. The parser will then resolve the calls against the linked Wikidata item. The template syntax in gene Wikipedia articles will not need to change. Currently, only “Transclusion of default Wikidata items” is supported.
- The Wikidata development team plans to support “Transclusion of Wikidata items other than the default item” as well. This feature will help us in the following scenario. Suppose we want to transclude information from the Wikidata item representing the mouse gene or another organism. Then, in the GNF_Protein_Box Template, we will be able to extract such information by specifying the link to that Wikidata item. The syntax could be something like this:
PBB|<id of human wikidata item>|<id of mice wikidata item>
Transclusion of other items is yet to be implemented, and the syntax above is just one possible way of supporting it.
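To make the idea of replacing template field values with inclusion calls more concrete, the sketch below generates the parameter lines of a template from a field-to-property mapping. It assumes the `{{#property:...}}` parser function for the default (linked) item; the helper names and the pairing of the field name with P351 are illustrative and would need to be checked against the final template design.

```python
from typing import Dict, List


def inclusion_call(property_id: str) -> str:
    """Wikitext that pulls a property value from the linked Wikidata item."""
    return "{{#property:%s}}" % property_id


def render_fields(field_to_property: Dict[str, str]) -> List[str]:
    """Build one '| field = value' template line per mapped field."""
    return ["| %s = %s" % (name, inclusion_call(pid))
            for name, pid in sorted(field_to_property.items())]


# Hypothetical mapping: the field name is an example, P351 is 'Entrez Gene ID'.
lines = render_fields({"Hs_EntrezGene": "P351"})
```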
The proposed bot will support the following functions:
- Create a new property for every gene Wikidata item. This can be implemented by obtaining the entire set of items from the category “Human Proteins” and then updating each item.
- Update a specific set of gene items. The items to be updated can be specified by gene name or Wikidata id in a text file (better than console input when there is a large number of Wikidata items).
- Update specific properties (e.g. only the image property) of a particular set of gene items. One possible implementation is to specify the gene names and properties in a text file.
- Output information about a specific set of Wikidata items to the bot host. Output the latest revisions made to a specific set of gene items, together with the modifications made by each of these revisions.
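The text-file input mentioned in the list above could be parsed along these lines. The file format is an assumption made for this sketch: one gene name or Wikidata id per line, optionally followed by a colon and a comma-separated list of properties, with `#` starting a comment line. The function name `parse_update_spec` is hypothetical.

```python
from typing import Dict, Iterable, List


def parse_update_spec(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each listed item to the properties to update.

    An empty property list means "update every property of this item".
    """
    spec = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        item, _, props = line.partition(":")
        spec[item.strip()] = [p.strip() for p in props.split(",") if p.strip()]
    return spec


sample = [
    "# gene name or Wikidata id, optionally followed by properties",
    "reelin: image",
    "Q414043",
    "",
]
spec = parse_update_spec(sample)
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.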
Authorized Login
As in the current bot, an authorized user name and password will be required to operate the bot. This is currently implemented in the Pywikipediabot framework, which checks authorization for all MediaWiki sites. Some modification is needed so that it checks only the Wikidata login.
Tackle Errors/manual incorrect edits / spamming / vandalism
The Pywikipediabot framework handles network errors, login errors, incorrect edits (adding a wrong data type for a property value) and invalid items.
- The proposed bot will build upon the existing error handling to tackle other errors such as 'modifying non-existent properties', 'valuetype mismatch', 'valid references' etc. As in the current bot, we can build a logger to handle errors effectively.
- Wikidata is easily accessible and can be edited by humans and bots alike. A human or bot may edit a gene Wikidata item, correctly or incorrectly, deliberately or by accident. Such information will propagate to gene Wikipedia articles. It is necessary to have mechanisms to handle such situations.
- The gene Wikidata items can be made semi-protected pages so that only auto-confirmed users can edit them.
- A simple mechanism against vandalism would be to patrol the recent changes made to each item. The bot can monitor a watchlist of gene Wikidata items.
- If we have gene Wikidata statements that are guaranteed to hold correct information (example: Entrez Gene ID 5649 for reelin), then we need a separate mechanism to prevent modifications to such statements. Various ideas are currently being discussed: http://meta.wikimedia.org/wiki/Wikidata/Preventing_unwanted_edits#Protect_data_that_will_not_conceivably_change
- Revisions made by the bot are distinguished from manual revisions since they are not added to the recent changes list. From the watchlist, the bot gets the recent changes made to gene Wikidata items. If a manual revision is encountered, it fetches the revised data and checks whether the revision reflects updated information from MyGene.info. If not, it alerts the bot administrator that a manual revision has occurred and displays the revised information. The bot administrator needs to determine whether such a revision is valid; the bot itself is unable to make such decisions.
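The review step for manual revisions could be sketched as below. The shape of a change record (a dictionary with `user`, `bot`, `property` and `new_value` keys) is an assumption for illustration and not the format returned by any particular API; `revisions_to_review` is a hypothetical name, and `reference` stands for values freshly retrieved from MyGene.info.

```python
from typing import Dict, List


def revisions_to_review(changes: List[dict],
                        reference: Dict[str, str]) -> List[dict]:
    """Return manual revisions that disagree with the reference data.

    The bot does not revert anything here; it only collects the
    revisions that the bot administrator should look at.
    """
    flagged = []
    for change in changes:
        if change["bot"]:
            continue  # revisions made by bots are not reviewed here
        if change["new_value"] != reference.get(change["property"]):
            flagged.append(change)
    return flagged


reference = {"Entrez Gene ID": "5649"}
changes = [
    {"user": "ExampleBot", "bot": True,
     "property": "Entrez Gene ID", "new_value": "5649"},
    {"user": "Alice", "bot": False,
     "property": "Entrez Gene ID", "new_value": "5649"},
    {"user": "Bob", "bot": False,
     "property": "Entrez Gene ID", "new_value": "1234"},
]
flagged = revisions_to_review(changes, reference)
```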
The single biggest challenge will be adaptability. Wikidata is currently in its infancy. A key aspect will therefore be to stay up to date with the current development phase of Wikidata and to make the best use of implemented features in this project. This can be achieved through constant communication with the developers of the Wikidata project. I have already reached out to them on their mailing list and have posted a few queries regarding the currently implemented features of Wikidata. With their help, I have tried out various examples on the Italian Wikipedia (since Wikidata was not yet activated on the English Wikipedia at the time).
- A comprehensive set of properties for a gene Wikidata item to capture gene information.
- A modified GNF_Protein_Box Template.
- A complete set of gene Wikidata items.
- Deployment of gene Wikidata items to gene articles.
- A highly reliable bot to automatically update, maintain and query gene Wikidata items.
So far, I have written rough bot scripts that create sample Wikidata items and query Wikidata properties using the Pywikipediabot framework.
Present - May 27th: Study the Pywikipediabot framework further. Get familiar with the Wikidata API. Keep in constant touch with the Wikidata development community and the Crowdsourcing Biology community. As new phases of Wikidata are rolled out, become familiar with their features. Study the properties created by the Molecular Biology task force further. Plan out the entire structure of statements to represent a gene Wikidata item.
May 27th – June 17th (Community Bonding Period):
Manually create a test case gene Wikidata item with the proposed statements/properties. Fill these statements with the correct property values and references. Plan how interactions should work between items and articles and between items and other items.
Work with the community to design a new GNF_Protein_Box Template structure. Manually link the test case gene item with a test case wiki article. Work on transclusion of the new GNF_Protein_Box Template through Wikidata inclusion syntax.
(Deliverable 1)
By the end of this period, the implementation details should be clear to me. I will have university exams from 29th May to 6th June. Apart from that week, I will devote my time fully to the project.
June 17th – June 30th (Coding Phase 1 - [2 weeks])
Work on the first link of the bot, i.e. extracting data from public databases. Work on creating containers to hold the data needed to populate a gene Wikidata item.
July 1st – July 20th (Coding Phase 2 - [3 weeks])
Work on the second link of the bot, i.e. scripts for the creation of Wikidata items. Work on filling gene Wikidata statements with the appropriate gene information through the bot.
July 20th – August 4th (Coding Phase 3 - [2 weeks])
Deployment of Wikidata to Gene Wiki articles. The infoboxes of gene Wikipedia articles will be filled with data from gene Wikidata items. Incorporate any modifications (changes in the datatype of properties, e.g. references) into the Wikidata items. Work on security mechanisms for the gene Wikidata items (tackling errors etc.).
August 4th – August 18th (Coding Phase 4 - [2 weeks])
Add further functionality to the bot, such as updating a specific set of Wikidata items, outputting data for a specific set of Wikidata items, modifying only specific properties of a given set of items, creating a new property for all gene Wikidata items etc.
August 19th – Sep 9th (Testing Phase - [3 weeks])
Rigorous testing of the bot. Review the existing code with help from my mentor. Work through all the cases and errors that can occur while operating the bot, such as network failure, invalid edits and so on.
Write tests to demonstrate correct handling by the bot. After the first setup of the bot, these tests can be used to verify a correct installation. Provide sandbox gene test items so that first-time users can learn to operate the bot on them.
Sep 10th – Sep 14th (Documentation)
Exhaustive documentation of the bot along with examples. Specify the setup requirements, a step-by-step installation guide, bot usage examples etc.
Sep 15th – Sep 23rd (Final Phase)
Review the bot code, improve the documentation, tests etc.
Stretch goals
There are other aspects of gene information which can be collected on a gene Wikidata item, e.g. relationship properties from gene to gene, mapping genes to diseases etc. This will involve:
- creating a new set of properties for each gene Wikidata item
- retrieving such information from public databases
- creating a new template to capture this additional gene information, or redesigning the existing GNF_Protein_box template
Additionally, working towards further improvement of the visualization elements of genewiki and genewiki+.
WHY ME? My preparation for the project
I am now familiar with the existing codebase and have worked on issues in it as well. Since I started contributing to the project early, I have been following Wikidata development since the Phase 2 rollout of Wikidata on the 11 live Wikipedias.
I have been in constant touch with the mentors (Ben, Max) to discuss proposal ideas, and with the Wikidata development community to identify the currently implemented features of Wikidata. Everything from implementing test examples on the Italian Wikipedia to building prototype Wikidata items has given me a concrete understanding of how the bot needs to operate.
I have studied the development notes of the Wikidata development community, which gave me a brief idea of the next set of features to be implemented. This has helped me understand the direction in which Wikidata development is heading. It will help me immensely in identifying which features to work with now (e.g. references, transclusion etc.) and which to defer until they are implemented. Though Wikidata is in its initial stages, I can clearly identify the path ahead even in a somewhat foggy development atmosphere.
Link to code that you have written:
I have worked on a variety of projects of different scales. Small projects include a Notepad-type application in C++, a hash table implementation etc. Medium-scale projects include recommender systems (based on Singular Value Decomposition). For the existing Gene Wiki Bot, I have isolated issues #1 and #2 and submitted patches for them. I have begun working on prototype scripts for the current project.
Gene Wiki patches:
- https://bitbucket.org/sulab/pygenewiki/pull-request/4/issue-2-handling/diff
- https://bitbucket.org/sulab/pygenewiki/pull-request/7/issue1/diff (output log after updating templates affected by issue #1: https://bitbucket.org/chinmay26/myrepo/src/20a5255e5ea383bfb05c8f4cd804ab512315765d/log1?at=default)
Other projects repo: https://bitbucket.org/chinmay26/myrepo/overview
Personal goals for the summer
It will give me a great deal of personal satisfaction to play a part in maintaining the gene Wikipedia articles. Developing code to maintain a form of central repository for genes which will be used by researchers will be amazing. I have always admired the work of the Wikipedia community and would absolutely love to be a part of this wonderful community. I consider this a starting point in a long endeavour with the Open Source community. It will also help me improve my understanding of biological concepts. Perhaps, next time I read a diagnosis report, I can make something out of it or at least know where to look for information :)