User:ProteinBoxBot/Gene and protein items


Introduction

The ProteinBoxBot maintains information about genes, diseases, and drugs in Wikidata. The entities in each of these three domains are maintained by a corresponding sub-process of the main bot.

The objective of the gene sub-process (or sub-bot) is to add information about genes to Wikidata and to keep that information up to date.

Current Scope

The set of entities maintained by this bot is determined by their presence in the expert-curated NCBI Entrez Gene database.

At present, the bot is limited to genes from Homo sapiens and Mus musculus.
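
As an illustration of this scope restriction, the following minimal sketch filters tab-separated gene records by NCBI Taxonomy ID (9606 for Homo sapiens, 10090 for Mus musculus). The column layout (taxon ID, Entrez Gene ID, symbol), the function name, and the sample rows are assumptions made for the example; they are not taken from the bot's code, and the gene IDs and symbols are made up.

```python
import csv
import io

# NCBI Taxonomy IDs for the two species currently in scope.
IN_SCOPE_TAXA = {"9606": "Homo sapiens", "10090": "Mus musculus"}


def in_scope_genes(tsv_text):
    """Yield (entrez_gene_id, symbol, species) for genes of in-scope taxa.

    Assumes each row holds: taxon ID, Entrez Gene ID, gene symbol.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) < 3 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        tax_id, gene_id, symbol = row[0], row[1], row[2]
        if tax_id in IN_SCOPE_TAXA:
            yield gene_id, symbol, IN_SCOPE_TAXA[tax_id]


# Made-up sample rows: one human, one mouse, one out-of-scope taxon (10116).
SAMPLE = "9606\t1001\tGENEA\n10090\t2002\tGenea\n10116\t3003\tGenea\n"
SCOPED = list(in_scope_genes(SAMPLE))
```

Here only the first two sample rows are kept; the third belongs to a taxon outside the current scope and is dropped.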

Items maintained by this bot

Gene properties currently maintained by this bot for these items

| Property | Description | Datatype | Expected value (if not listed, see property definition) |
|---|---|---|---|
| Property:P279 | subclass of | Item | Should always include gene (Q7187) |
| Property:P351 | Entrez Gene ID | String | Should exist for EVERY item processed by this bot |
| Property:P703 | found in taxon | Item | Currently should only include either human (Q5) or mouse (Q83310) |
| Property:P594 | Ensembl Gene ID | String | |
| Property:P704 | Ensembl Transcript ID | String | |
| Property:P353 | Gene symbol | String | |
| Property:P354 | HGNC symbol | String | |
| Property:P593 | HomoloGene ID | String | |
| Property:P639 | RefSeq RNA ID | String | |
| Property:P1057 | chromosome | Item | |
| Property:P684 | ortholog | Item | Should be a gene. The property definition suggests using a species qualifier; this information could also be pulled from the target gene's taxon property, so perhaps skip the qualifier? |
| Property:P644 | genomic start | String | Requires a qualifier that indicates the genome build |
| Property:P645 | genomic stop | String | Same as genomic start |
| Property:P671 | Mouse Genome Informatics ID | String | |
| Property:P688 | encodes | Item | |
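
The expected values listed above can be read as simple consistency rules. The sketch below checks a gene item's claims against the three rules the table states explicitly (P279, P351, P703). It is only an illustration: the flat `claims` dictionary, the function name, and the sample items are assumptions, not the bot's actual data structures.

```python
GENE_Q = "Q7187"                 # gene
ALLOWED_TAXA = {"Q5", "Q83310"}  # human, mouse (as listed in the table)


def check_gene_item(claims):
    """Return a list of problems found in a gene item's claims.

    `claims` maps a property ID to the list of its values (an assumed,
    simplified representation of an item's statements).
    """
    problems = []
    if GENE_Q not in claims.get("P279", []):
        problems.append("P279 must include Q7187 (gene)")
    if not claims.get("P351"):
        problems.append("P351 (Entrez Gene ID) is missing")
    unexpected = set(claims.get("P703", [])) - ALLOWED_TAXA
    if unexpected:
        problems.append("P703 has unexpected taxa: " + ", ".join(sorted(unexpected)))
    return problems


# Hypothetical items; "Q_OTHER" is a placeholder, not a real Wikidata ID.
GOOD = {"P279": ["Q7187"], "P351": ["1017"], "P703": ["Q5"]}
BAD = {"P279": ["Q8054"], "P703": ["Q5", "Q_OTHER"]}
```

Calling `check_gene_item(GOOD)` returns an empty list, while `check_gene_item(BAD)` reports one problem per violated rule.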

Gene properties PLANNED for this bot

| Property | Description | Datatype | Expected value (if not listed, see property definition) | Note |
|---|---|---|---|---|
| Property:P692 | Gene Atlas Image | Image | | Candidate for improvement if a newer, similar data source can be found. For now, let's put in the links so we can completely reproduce the current template. Note that the genes that were already in Wikipedia pretty much all have their Gene Atlas images linked. This property is listed here because the bot does not maintain these links; they came from a one-time import. |

The 'encodes' property links gene items to items specifically about the protein, RNA, or other 'product' of the gene. A single gene corresponds to a particular region of a genome that is related to some set of functions. These functions are carried out by the gene's products. Different products may perform vastly different functions. Hence we separate functional information from the gene item itself and attach it to the product items wherever possible. (See discussion.)
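
To make the gene/product split concrete, here is a small sketch of a gene item that points to its products via P688 (encodes), with functional information such as P680 (molecular function) attached to the product rather than to the gene. It also checks that each product points back with P702 (encoded by). The reciprocal check, the dictionary layout, and all item IDs are assumptions for illustration only; the page does not describe how the bot verifies these links.

```python
def missing_reciprocal_links(items):
    """Find gene -> product links that lack the matching product -> gene link.

    `items` maps an item ID to its claims (property ID -> list of item IDs).
    Returns (gene_id, product_id) pairs where the gene has P688 (encodes)
    pointing at the product, but the product has no P702 (encoded by)
    pointing back at the gene.
    """
    missing = []
    for item_id, claims in items.items():
        for product_id in claims.get("P688", []):
            product_claims = items.get(product_id, {})
            if item_id not in product_claims.get("P702", []):
                missing.append((item_id, product_id))
    return missing


# Placeholder IDs, not real Wikidata items.
ITEMS = {
    "Q_GENE": {"P688": ["Q_PROTEIN", "Q_RNA"]},
    # Functional information lives on the product, not on the gene.
    "Q_PROTEIN": {"P702": ["Q_GENE"], "P680": ["Q_FUNCTION"]},
    "Q_RNA": {},  # no 'encoded by' statement yet
}
MISSING = missing_reciprocal_links(ITEMS)
```

In this example only the RNA product is reported, because it lacks an 'encoded by' statement pointing back to the gene.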

PROTEIN properties maintained for this bot

| Property | Description | Datatype | Expected value (if not listed, see property definition) | Note |
|---|---|---|---|---|
| P279 | subclass of | Item | One of: Protein (Q8054), RNA (Q11053), non-coding RNA (Q427087) | |
| P352 | UniProt ID | String | Should exist for EVERY item processed by this bot | |
| P638 | PDB ID | String | | |
| P637 | RefSeq Protein ID | String | | |
| P702 | encoded by | Item | Should exist for EVERY item processed by this bot | |
| P705 | Ensembl Protein ID | String | | |
| P591 | EC number | String | | |
| P18 | Protein Structure Image | Commons Media File | | Should this point to a Commons category with multiple images? |
| P681 | Cell Component | Item | | |
| P682 | Biological Process | Item | | |
| P680 | Molecular Function | Item | | |

PROTEIN properties PLANNED for this bot

Data sources

The bot retrieves its content from the following trusted sources:

Bot approval

Bot approval discussion July 2013: Wikidata:Requests_for_permissions/Bot/ProteinBoxBot

Re-approval discussion June 2015: Wikidata:Requests_for_permissions/Bot/ProteinBoxBot_2

  1. http://www.ncbi.nlm.nih.gov/pubmed/23175613


Implementation

The bot code is open source and available for inspection. It is implemented in Python and is intended to be deployed as a cron job. The current bot is able to update Wikidata with all content (labels, synonyms, and external references) in less than 12 hours.