User:ProteinBoxBot/201408 sprint


Overall summary

The goal of this sprint is to load all identifiers for human genes and proteins into Wikidata, so that Wikidata can serve as the canonical resource for referencing and translating between these identifiers.


Background info

Current example (Reelin)

human gene: Q414043

human protein: Q13569356

mouse gene: Q14331135

mouse protein: Q14331165


pyWikidata documentation:

Past code repo:

Wikidata properties and “schema”:

PyWikidata example scripts:

Sandbox repository:

Data files

Game plan


  1. confirm all necessary properties exist
  2. list what is already captured in Wikidata
  3. prototype an example gene by hand
  4. write the bot
  5. test on 2 genes
  6. test on 10 genes
  7. full run
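The staged rollout in steps 5 to 7 could be driven by a small helper like the one below. This is only a sketch: `run_staged` and the `process` callable are hypothetical names standing in for whatever the bot actually exposes, and the choice to halt on the first failure is an assumption, not something this page specifies.

```python
def run_staged(gene_ids, process, stage_sizes=(2, 10)):
    """Process IDs in growing stages; stop at the first failure.

    `process` is a hypothetical callable that returns True on success.
    Returns the list of IDs processed successfully, in order.
    """
    ids = list(gene_ids)
    # Each test stage is a prefix of the full list; the last stage is everything.
    stages = [ids[:size] for size in stage_sizes] + [ids]
    done = []
    seen = set()
    for stage in stages:
        for gene_id in stage:
            if gene_id in seen:
                continue  # already handled in an earlier, smaller stage
            if not process(gene_id):
                return done  # halt so the problem can be inspected
            seen.add(gene_id)
            done.append(gene_id)
    return done


# Dry run with a stand-in for the real bot call: it "fails" on one ID.
example_ids = ["2", "32", "361", "649", "5618"]
processed = run_staged(example_ids, process=lambda gene_id: gene_id != "649")
```

In the dry run, the first stage handles two IDs, the second stage continues from there, and the run stops as soon as the stand-in reports a failure, so nothing after the failing ID is touched.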

Step 1 - Define properties


Missing properties


Step 2 - List what is in Wikidata

Wikidata (WD) does not reject duplicates: adding an entry that already exists simply creates a second copy. Before adding an entry, we therefore need to check whether it already exists in WD. @andrawaag has not yet been able to resolve existing entries based on either an identifier or a label.
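One possible shape for the "check before adding" logic is sketched below. It assumes the existing items have already been fetched into plain Python dictionaries; how to fetch them is exactly the open question above and is not addressed here. The function names and the `ensembl_gene_id` key are hypothetical, not pyWikidata calls or real Wikidata property IDs, and the second lookup uses a made-up placeholder identifier.

```python
def build_identifier_index(existing_items, property_name):
    """Map each identifier value to the item IDs that already carry it."""
    index = {}
    for item in existing_items:
        for value in item.get(property_name, []):
            index.setdefault(value, []).append(item["qid"])
    return index


def plan_action(identifier, index):
    """Decide what the bot should do for one identifier."""
    matches = index.get(identifier, [])
    if not matches:
        return ("create", None)        # nothing there yet: safe to add
    if len(matches) == 1:
        return ("update", matches[0])  # exactly one item: reuse it
    return ("review", matches)         # duplicates already exist: needs a human


# Toy snapshot of items already in Wikidata (hypothetical in-memory form).
existing = [
    {"qid": "Q17501046", "ensembl_gene_id": ["ENSG00000121207"]},
]
index = build_identifier_index(existing, "ensembl_gene_id")

known = plan_action("ENSG00000121207", index)
unknown = plan_action("ENSG-placeholder", index)  # placeholder, not a real ID
```

Returning a separate "review" outcome keeps the bot from guessing when duplicates are already present.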

  • List of Genes
Gene resource | Number of entries | Number in Wikidata | Python script | Script results
Ensembl | 62757 | 5247 (on 2014-08-10) | source | Ensembl genes in wikidata
Entrez Gene | 47808 | 1891 (on 2014-08-10) | source | Entrez genes in wikidata
HGNC | tba | tba | |
  • List of Proteins

Step 3 - Prototype an example gene by hand


  • glyceraldehyde-3-phosphate dehydrogenase Q17487710


  • Breast cancer type 1 susceptibility protein Q17487737

Step 4 - Write the bot

  • Add ensembl entry: [1]
  • Add entrez gene entry: [2]

Step 5 - Test on 2 entries

  • Ensembl: ENSG00000121207 lecithin retinol acyltransferase Q17501046
  • Entrez Gene: 6256 retinoid X receptor, alpha Q17516011

Bot code

Step 6 - Test on 10 entries

  • Entrez gene 2, alpha-2-macroglobulin Q17543707
  • Entrez gene 32, ACACB Q17543787
  • Entrez gene 361, Aquaporin 4 Q17543931
  • Entrez gene 649, Bone morphogenetic protein 1 Q17543993
  • Entrez gene 5618, Prolactin Receptor Q17544128
  • Entrez gene 10413, YAP1 Q17544238
  • Entrez gene 22914, KLRK1 Q17552950 * skipped due to multiple entries in Ensembl listing in
  • Entrez gene 406947, MIR155 Q17553105 * skipped due to multiple entries in Ensembl listing in
  • Entrez gene 26238, C6orf123 Q17553915
  • Entrez gene 101241891, LINC00850 Q17554559
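The two skipped entries above suggest separating unambiguous mappings from ambiguous ones before the bot writes anything. The sketch below shows one way to do that, assuming the cross-reference is available as (Entrez ID, Ensembl ID) pairs. `split_by_ambiguity` is a hypothetical helper, and the Ensembl values are placeholders, not real identifiers.

```python
from collections import defaultdict


def split_by_ambiguity(pairs):
    """Split (entrez_id, ensembl_id) pairs into unique and ambiguous mappings.

    Returns (unique, ambiguous): `unique` maps an Entrez ID to its single
    Ensembl ID; `ambiguous` maps an Entrez ID to a sorted list of candidates.
    """
    mapping = defaultdict(set)
    for entrez_id, ensembl_id in pairs:
        mapping[entrez_id].add(ensembl_id)
    unique = {k: next(iter(v)) for k, v in mapping.items() if len(v) == 1}
    ambiguous = {k: sorted(v) for k, v in mapping.items() if len(v) > 1}
    return unique, ambiguous


# Placeholder Ensembl values; only the Entrez IDs come from the list above.
pairs = [
    ("2", "ENSG-placeholder-1"),
    ("22914", "ENSG-placeholder-2"),
    ("22914", "ENSG-placeholder-3"),
]
unique, ambiguous = split_by_ambiguity(pairs)
```

Entries in `ambiguous` can then be logged and skipped, as was done for the two entries marked with an asterisk.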

Bot code

Step 7 - Full run