User:SCIdude

"things, not strings"

Amit Singhal

Hello, I'm trying to improve the molbio part of Wikidata by manual and batch editing. Although I am a software developer (main language C++), I have prepared many books for Project Gutenberg (Q22673), contributed to German Wikipedia (Q48183) from 2006 to 2012 (as User:Ayacop), and have biocurated extensively for UniProt-GOA (Q28018111) and Reactome (Q2134522).

Ralf Stephan (Q67363620)

Babel user information
de-N This user speaks German as a native language.
en-3 This user has advanced knowledge of English.
fr-1 This user has basic knowledge of French.
la-1 This user can contribute with a basic level of Latin.
ru-0 This user has no knowledge of Russian (or understands it only with difficulty).
it-0 This user is not able to communicate in Italian (or understands it only with considerable difficulty).
This user is a member of WikiProject Microbiology.
This user is a member of WikiProject Molecular biology.
This user is a member of WikiProject Chemistry.

Current ideas:

Illustration of Wikidata gene items properties (2019-08).svg
Illustration of Wikidata protein items properties (2019-08).svg
  • subc-of-protein AND foundation of anatomy ID --> usually a duplicate
  • for every GO complex, list parts and make subunit families
    • eyeball GO components with 'complex', equal to subc-of-complex?
  • pfam with func carbohydrate binding --> physically interacts with carbohydrate (GO: intersection_of: GO:0005488 ! binding | intersection_of: has_input CHEBI:16646 ! carbohydrate)
  • instance of protein fragment with "of"
  • construct accessible pipe for verifying TCDB ID of proteins / families
  • import https://www.ebi.ac.uk/complexportal/home for linking reactome complexes
    • use cpx-homo.tab and UniProt IDs to associate CPX ids with Reactome complex ids
  • complexes from Protein Ontology
  • MeSH protein entries are usually species-independent. Check heuristically and use
  • connect Reactome entities with existing families
  • Arabidopsis and Dictyostelium import
  • incomplete ChEBI: add reference for all InChIKeys, InChI, isomeric SMILES, canonical SMILES
  • incomplete ChEBI: add reference for all (subst is-a class)
  • incomplete ChEBI: add reference for all (class subc-of class)
  • ChEBI: import completion, full class hierarchy
  • ChEBI: import completion, all substances
  • ChEBI: check all substances are in their classes
  • PMCREF: use https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=2685584
  • use "substrate of"
  • subc-of-enzyme inhibitor + phys. interact. with XY --> inhibitor of XY
  • subc-of-agonist + phys. interact. with XY --> agonist of XY
  • use subc to describe pgroups exactly
  • UniProt protein families
  • sync prot-->part of-->enzfam if exact molfunc is annotated
  • multifunctional enzymes?
  • some proteins encoded by same gene, mark as variants
  • interpro and superfamily with description "InterPro Domain"---> really are domain superfamilies
  • check IPR items for correct Pfam (via IPR), also move Pfam from other item
  • GO items with a changed label are suspected to be the result of Wikipedians' fumbling
  • if TCDB fam X subclass-of TCDB fam Y --> missing reference dbhierarchy heuristics
  • peptidases with endopep func
  • IUPHAR IDs without Wikidata, anyone?
  • IUPHAR family IDs, anyone?
  • Membranome classes https://membranome.org/
  • add "stated as" qual. to ChEBI ids of amino acids / their zwitterions; make special constraint including this
  • BindingDB ids?
  • dbSNP import?
  • missing OMIM phenotypes, e.g. 1?
  • OMIM phenotypic series, see their FAQ
  • orthology group ids/groups, see bot issue
  • do all ions have charge in their label?
  • next MONDO sync?
  • German labels from Brockhaus
  • remove em-dashes from labels
  • items with dewiki but without enwiki and en-label
  • industrial processes don't list all reactants/modifiers as "has part"
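
A minimal sketch for the PMCREF idea above, in Python. It rebuilds the elink URL from the list and pulls the cited PubMed IDs out of the JSON reply. The function names are mine, and the JSON layout (`linksets` → `linksetdbs` → `links`) is an assumption about what the service returns, so check it against a real response before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
LINKNAME = "pmc_refs_pubmed"


def elink_url(pmc_id):
    """Build the elink URL for the references of one PMC article."""
    params = {
        "dbfrom": "pmc",
        "linkname": LINKNAME,
        "retmode": "json",
        "id": pmc_id,  # numeric PMC ID, as in the URL in the list above
    }
    return ELINK + "?" + urlencode(params)


def cited_pmids(payload):
    """Collect PubMed IDs from a parsed elink reply.

    Assumed layout: {"linksets": [{"linksetdbs": [{"linkname": ..., "links": [...]}]}]}
    """
    pmids = []
    for linkset in payload.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            if linksetdb.get("linkname") == LINKNAME:
                pmids.extend(str(pmid) for pmid in linksetdb.get("links", []))
    return pmids


def fetch_cited_pmids(pmc_id, timeout=30.0):
    """Query the service once and return the cited PubMed IDs (needs network)."""
    with urlopen(elink_url(pmc_id), timeout=timeout) as response:
        return cited_pmids(json.load(response))
```

Calling `fetch_cited_pmids("2685584")` would issue the request from the list. Parsing is kept separate from fetching so the extraction step can be checked offline against a saved reply.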

In a manual attempt to create/curate WD items for cleavage products (fragments) of proteins, I have worked on insulin (Q7240673), Angiotensinogen (Q267200), Ghrelin and obestatin prepropeptide (Q66216544), Glucagon (Q66310097), Proopiomelanocortin (Q418896), Cerebellin 1 precursor (Q21115606), Natriuretic peptide B (Q422288), Endothelin 1 (Q66361339), Apelin (Q2386988), Tachykinin precursor 1 (Q21123080), Secretogranin II (Q21105303), Thymosin beta 4 X-linked (Q7799643), Vasoactive intestinal peptide (Q66499176), VGF nerve growth factor inducible (Q21122290), augurin precursor (Q66535298), Chromogranin A (Q3698322), and Cathelicidin antimicrobial peptide (Q411181).

What I'm doing is roughly this:

  • if gene and protein are in one item, duplicate to get separate items (moving sitelinks first to the protein)
  • remove wrong statements on either (e.g. no PDB/protein IDs/GOA function/localization annotations on genes), make sure the gene has at most GO process annotations
  • create/check all relevant fragment objects, move statements to the resp. item: EnsemblP should be on prepro/pro
  • separate out aliases to resp. objects
  • add "has part" with all fragments to prepro object
  • complete "encodes/encoded by" everywhere
  • add "exact match" qualifier to the fragment's UniProt ID, e.g. https://www.uniprot.org/uniprot/Q9UBU3#PRO_0000019202
  • add Reactome, ChEBI, ChEMBL, IUPHAR IDs to fragment if existing (Reactome labels like GENE(1-100) also to fragment aliases)
  • add "part of" Reactome process or reaction if missing
  • (maybe) move GOA function annotations to resp. fragment if applicable
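
Two of the steps above are mechanical enough to sketch in Python: building the "exact match" URL for a UniProt fragment, and recognizing Reactome labels of the form GENE(1-100) before adding them as fragment aliases. Both helper names are hypothetical, and the label pattern is only what the example in the list suggests. Real Reactome labels may be more varied.

```python
import re


def uniprot_chain_url(accession, chain_id):
    """URL pointing at one processed chain/peptide of a UniProt entry."""
    return f"https://www.uniprot.org/uniprot/{accession}#{chain_id}"


# Assumed shape: a symbol without spaces or parentheses, then (start-end).
REACTOME_LABEL = re.compile(r"^(?P<gene>[^()\s]+)\((?P<start>\d+)-(?P<end>\d+)\)$")


def parse_reactome_label(label):
    """Return (gene, start, end) for labels like GENE(1-100), else None."""
    match = REACTOME_LABEL.match(label.strip())
    if match is None:
        return None
    return match["gene"], int(match["start"]), int(match["end"])
```

For example, `uniprot_chain_url("Q9UBU3", "PRO_0000019202")` reproduces the URL from the list, and `parse_reactome_label("GENE(1-100)")` yields the symbol and residue range, which can then be compared against the fragment's coordinates before the alias is added.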

misc

{{section resolved|~~~~}} {{Q|21105303}}