From Wikidata


The ProteinBoxBot and the associated MicrobeBot are working to populate Wikidata with information whose quality and currency would justify using this content to drive scientific and potentially even medical applications. To achieve this goal, it is vital that data consumers be able to easily access and interpret the evidence underlying the claims. The patterns for representing the semantics of the associated evidence, and the provenance trails linking back to the original sources of information, must be stably defined and implemented so that software can be written (most likely by other groups) that exposes this information to the intended end users. The purpose of this page is to identify and resolve any inconsistencies in the approaches taken so far and to define specific guidelines for moving forward. These can serve to organize the Gene Wiki team's efforts and potentially serve as a model for other groups with similar aims.

Provenance Patterns

Given a Statement (claim & semantic qualifier), who made the claim? When? Where can I get the data? In other words, what are the references?

Guidelines for Referencing Databases, Ontologies and similar Web-native information entities


  1. stated in (P248), with the object of the reference pointing at a Database item (e.g. Monarch Disease Ontology (Q27468140)) or, if it exists, a Database version item in Wikidata (e.g. InterPro Release 59.0 (Q27135875)). The version item is preferred where feasible.
  2. retrieved (P813), set to the most recent point in time when the claim was checked against the database. E.g. if a bot re-examines a data source and nothing has changed, this date can be reset to show this.
  3. To provide a direct link to the data, use either a data-source-specific identifier such as InterPro ID (P2926) or, if none is available, a reference URL (P854) linked directly to the relevant entry in the database.

DO NOT USE: imported from (P143), software version (P348), language of work or name (P407)
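
As a rough illustration of the three guidelines above, the sketch below assembles one reference in the JSON layout the Wikibase API uses for references (`snaks` plus `snaks-order`). The helper names, the retrieval date, and the InterPro identifier `IPR000001` are placeholders chosen for this example, not values taken from an actual bot run.

```python
from datetime import date

# Calendar model URI for the proleptic Gregorian calendar in Wikibase time values.
GREGORIAN = "http://www.wikidata.org/entity/Q1985727"


def item_snak(prop, qid):
    """Snak whose value is another Wikidata item (e.g. stated in)."""
    value = {"entity-type": "item", "numeric-id": int(qid[1:]), "id": qid}
    return {"snaktype": "value", "property": prop,
            "datavalue": {"type": "wikibase-entityid", "value": value}}


def day_snak(prop, day):
    """Snak holding a date with day precision (precision code 11)."""
    value = {"time": f"+{day.isoformat()}T00:00:00Z", "timezone": 0,
             "before": 0, "after": 0, "precision": 11,
             "calendarmodel": GREGORIAN}
    return {"snaktype": "value", "property": prop,
            "datavalue": {"type": "time", "value": value}}


def string_snak(prop, text):
    """Snak for string-like values (external identifiers, URLs)."""
    return {"snaktype": "value", "property": prop,
            "datavalue": {"type": "string", "value": text}}


def database_reference(stated_in_qid, retrieved, id_prop=None, id_value=None, url=None):
    """Build a reference: stated in + retrieved + identifier (or URL fallback)."""
    snaks = [item_snak("P248", stated_in_qid), day_snak("P813", retrieved)]
    if id_prop and id_value:
        snaks.append(string_snak(id_prop, id_value))   # guideline 3, preferred
    elif url:
        snaks.append(string_snak("P854", url))         # guideline 3, fallback
    else:
        raise ValueError("provide an identifier property/value or a reference URL")
    return {"snaks": {s["property"]: [s] for s in snaks},
            "snaks-order": [s["property"] for s in snaks]}


# InterPro Release 59.0 (Q27135875); date and identifier are placeholders.
ref = database_reference("Q27135875", date(2016, 11, 1), "P2926", "IPR000001")
```

The resulting dictionary contains none of the discouraged properties (P143, P348, P407), and falls back to reference URL (P854) only when no identifier property is supplied.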

If the information item owner uses a stable version system - e.g. 'official releases' on a periodic schedule:

Then use a Database version item (e.g. Phenocarta release 2016-02-04 (Q22978334)) to record:

  1. publication date -> timestamp for release
  2. archive URL -> URL for downloading data from that release
  3. edition or translation of -> Connection to parent database item (e.g. Phenocarta (Q22330995))

Else (the information item is updated unpredictably, e.g. through irregular pushes to a version control system):

Then use one item to represent the information entity (e.g. an item for the Disease Ontology), record publication date and archive URL on it, and update those fields whenever the information entity changes (and those changes are noticed and acted upon by a relevant bot or other editor).
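
The if/else rule above can be sketched as two small helpers: one picks the item that stated in (P248) should point at, the other lists the metadata recorded on that item. This is only a sketch. The function names and the archive URL are hypothetical, and the property IDs P577 (publication date), P1065 (archive URL) and P629 (edition or translation of) are my reading of the property names used above.

```python
from datetime import date


def stated_in_target(database_qid, release_qid=None):
    """Prefer the version item when the source publishes stable releases."""
    return release_qid or database_qid


def version_metadata(published, archive_url, parent_qid=None):
    """Claims to record on a release item, or on the database item itself
    when the source has no stable releases (then parent_qid stays None)."""
    claims = {
        "P577": published.isoformat(),  # publication date: timestamp of the release
        "P1065": archive_url,           # archive URL: where that release can be downloaded
    }
    if parent_qid:
        claims["P629"] = parent_qid     # edition or translation of: parent database
    return claims


# Versioned source: Phenocarta release 2016-02-04 (Q22978334), parent Q22330995.
# The URL is a placeholder, not the real download location.
release = version_metadata(date(2016, 2, 4),
                           "https://example.org/phenocarta-2016-02-04.tsv",
                           parent_qid="Q22330995")
target = stated_in_target("Q22330995", release_qid="Q22978334")
```

For an unversioned source, `stated_in_target` is called without a release item and `version_metadata` is re-applied to the single database item each time a change is detected.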

Possible model extension in the future:

  • Add the concept Dataset Distribution to link different representations of a database (such as isql, tsv, RDF, ...) to the database version item in Wikidata, to align with the W3C dataset description specification [1].

Guidelines for Referencing Articles (e.g. journal articles)

Use: stated in (P248) with the object of the reference pointing at an item for the article (e.g. The Structure of Electronic Excitation Levels in Insulating Crystals (Q21709348))

Guidelines for referencing individual curators

Use: curator (P1640) as a property on the reference

Guidelines for referencing Gene Ontology annotations (and template for other complex annotations coming from multiple parties and presented through multiple aggregation services)

GO annotations should be referenced in a manner similar to how they are displayed in QuickGO (example; format described in detail here and here). Data from the "With" column is not captured at this time. Each GO term statement should have a qualifier stating the determination method (P459). A statement can have multiple determination methods and multiple references. The reference should include the following properties:


Example GO annotation reference model Q27553062

Background info


  • Discussion [2]

Miscellaneous notes on references

Not all statements require sourcing: Help:Sources/Items not needing sources

Property:P2352 (Applies to taxon) would be a useful qualifier to use for describing claims coming from model organism research.

Property:P1013 (criterion used) might be useful in combination with 'determination method' - should be careful not to conflate.

Scientific Evidence Semantics

Given a Claim such as "Metformin Treats Diabetes" or "P53 has subcellular localization Nucleus", we would like to be able to answer the question of "according to what experimental method?" How was the 'fact' created?

Evidence Code Ontology

Wikidata items synchronized with ECO could play a role here, providing a stable reference for recording the semantics of evidence statements. The proposal is to do this with qualifiers on scientific claims.

Qualifier Properties used by the Gene Wiki project
name | definition | notes
determination method | qualifier stating how a value has been determined | So far used for genetic association and gene ontology (molecular function, biological process, cellular component) claims.
use | main use of the subject (includes current and former usage) | So far used in claims linking proteins to drugs to indicate the nature of the relation, e.g. agonist.
as | generic qualifier | So far used in claims linking drugs to proteins to indicate the nature of the relation, e.g. agonist. Gene Wiki ACTION item: review this reciprocal model, seems wrong.
genomic assembly | specify the genome assembly on which the feature is placed | Used to indicate the specific genome build used for gene coordinates.
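
To show how a data consumer might ask "according to what method, and from which source?", here is a sketch that builds a SPARQL query for the Wikidata Query Service. It relies on the standard statement, qualifier and reference prefixes (`p:`, `ps:`, `pq:`, `pr:`, `prov:`). The function name is hypothetical, and pairing item Q29548 with biological process (assumed here to be P682) is only an illustrative guess, not a claim about what that item contains.

```python
import re


def evidence_query(item_qid, claim_prop):
    """Return a SPARQL query listing, for each statement of `claim_prop` on the
    item, its value, determination method (P459), source (P248) and, when
    present, its retrieved date (P813)."""
    if not re.fullmatch(r"Q\d+", item_qid) or not re.fullmatch(r"P\d+", claim_prop):
        raise ValueError("expected a QID like Q42 and a PID like P31")
    return f"""
SELECT ?value ?method ?source ?retrieved WHERE {{
  wd:{item_qid} p:{claim_prop} ?statement .
  ?statement ps:{claim_prop} ?value ;
             pq:P459 ?method ;
             prov:wasDerivedFrom ?ref .
  ?ref pr:P248 ?source .
  OPTIONAL {{ ?ref pr:P813 ?retrieved . }}
}}
""".strip()


query = evidence_query("Q29548", "P682")
print(query)
```

Because a statement can carry several determination methods and several references, the result has one row per (method, reference) combination.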

Proposal for use of Instance and Subclass properties

These properties are important to use consistently, as basically all downstream software will want to employ them. As one example, they would be useful when building Cytoscape apps or similar apps, both as part of query filters and as a way to color network nodes.

  1. For hierarchical entities (e.g. diseases in an ontology, processes from the Gene Ontology): propose the 'punning' pattern. Assign each item to be an instance of a metaclass (e.g. 'disease') and use the subclass of relation to reflect the hierarchical structure (is-a or subclass-of) of the ontology. (Side note: avoid over-using subclass of, e.g. for part-of relations.)
  2. For bottom-level concepts (e.g. specific variants, genes, or proteins that you would never reasonably expect to have subclasses): propose to treat them as instances and not use the subclass relation.
Current Usages
Name | Instance of | Subclass of | Notes
Interpro Items | Protein Family, Protein Domain, Active Site, Binding Site, Conserved Site, PTM, Repeat | Many. Represents hierarchy |
Drugs | chemical compound, pharmaceutical drug | Many. Represents hierarchy |
GO Terms | None | Many. Represents hierarchy | Make all instance of GO term?
Gene | None | gene, protein-coding gene | Make all instance of gene?
Protein | None | protein | Make all instance of protein?
Taxon | taxon | None. parent taxon encodes hierarchy |
Disease | disease | Mostly hierarchy. ~100 are subclass of disease |
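
The two proposed rules are simple enough to check mechanically. The sketch below is a hypothetical lint function, not part of any existing bot. It takes the labels currently used for instance of (P31) and subclass of (P279) on an item and reports where the item departs from the proposal. Classifying an item as "hierarchical" or "leaf" is left to the caller.

```python
def check_typing(kind, instance_of, subclass_of):
    """Return a list of deviations from the proposed typing pattern.

    kind: "hierarchical" (ontology terms, e.g. diseases, GO terms) or
          "leaf" (specific genes, proteins, variants).
    """
    if kind not in ("hierarchical", "leaf"):
        raise ValueError("kind must be 'hierarchical' or 'leaf'")
    problems = []
    if not instance_of:
        # Both rules expect an instance-of claim (a metaclass for hierarchical items).
        problems.append("missing instance of (P31)")
    if kind == "leaf" and subclass_of:
        # Rule 2: bottom-level concepts are instances only.
        problems.append("bottom-level item should not use subclass of (P279)")
    return problems


# The 'Gene' row of the table above: no instance of, subclass of gene.
gene_report = check_typing("leaf", [], ["gene", "protein-coding gene"])
# The 'Disease' row: instance of disease, subclass of its parent term.
disease_report = check_typing("hierarchical", ["disease"], ["parent disease term"])
```

Root terms of a hierarchy may legitimately lack a subclass of claim, so the sketch does not flag hierarchical items without one.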

Use of reciprocal properties

This is a question. When there are reciprocal properties such as 'part of' and 'has part', when should we use both and why?
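
Whichever policy is chosen, the current state can at least be measured. The sketch below finds statements whose reciprocal counterpart is absent, given plain (subject, property, object) triples. The inverse table is an assumption for illustration: part of (P361) with has part (P527), and encodes (P688) with encoded by (P702). The identifiers in the example are placeholders rather than real QIDs.

```python
# Assumed inverse pairs; extend as needed.
INVERSE = {"P361": "P527", "P527": "P361", "P688": "P702", "P702": "P688"}


def missing_inverses(triples):
    """Return the reciprocal triples that are implied but not present."""
    present = set(triples)
    return sorted(
        (obj, INVERSE[prop], subj)
        for subj, prop, obj in present
        if prop in INVERSE and (obj, INVERSE[prop], subj) not in present
    )


triples = [
    ("GENE", "P688", "PROTEIN"),      # gene encodes protein
    ("PROTEIN", "P702", "GENE"),      # reciprocal present
    ("PROTEIN", "P527", "DOMAIN"),    # protein has part domain, no 'part of' back
]
gaps = missing_inverses(triples)
```

If the decision is to store only one direction, the same function can be used the other way round, to list reciprocal statements that should be removed.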

Current landscape of content added by the gene wiki project

Reference Properties used by the Gene Wiki project
name | definition | notes | Gene wiki action/discussion
stated in | to be used in the source field, to indicate where a claim is made | This should be the primary way we connect claims to sources, including both databases and published articles. |
imported from | source of this claim's value (use only in References section) | The consensus appears to be that this property should only be used to describe data imported from a Wikipedia (see Help:Sources and imported from talk). We are currently using 'imported from' to (1) refer to a database, with 'stated in' referring to a specific release of that database (e.g. a particular genome build such as on Q414043 (Q20950174)), and (2) refer to a third-party service we are using for convenience (e.g. the OLS for the gene ontology statements such as on Q29548). Unfortunately, there is no way for others to know that these are our intentions, as the two uses are inconsistent with each other and do not match what is described in Help:Sources. | Adopt a strategy more consistent with current property semantics. Suggest: use 'stated in' to connect to an item representing a specific version of the database (e.g. PhenoCarta Feb 4) that is, in turn, linked to the primary database entity via an 'edition' or similar property.
retrieved | date or point in time that information was retrieved from a database or website (for use in online sources) | |
language of work or name | for works (for original language use P364 and for persons P103 and P1412) | This property is not really intended for, nor really useful in, the gene wiki context. Further, it is frequently considered for deletion (Property talk:P407). Note that if we wanted a way to indicate a language (which we have not needed so far) we could also use Property:P2439, which seems more appropriate and less contentious. | Suggest: Remove these from all of our references. They don't do harm per se but are unnecessary and add complexity and hence confusion. We want clean examples for others to emulate.
software version | version(s) of the software, current and past | Right now, this looks like it is only being used to provide a source for the 'has part' relations linking proteins to domains drawn from InterPro. This looks like an alternative way to model database/software version, different from the item-based approach taken with genome builds. We should pick one method and be consistent. Note also that the protein 'has part' domain relations have redundant 'imported from' and 'stated in' references. (The same problems appear on the 'subClassOf' relations between proteins and families.) | Suggest: Do not use for databases. Use 'stated in' linked to an item for the database version.
reference URL | should be used for internet URLs as references | Only use this when Wikidata does not have an authority control property referring to the resource you are linking to. For example, use PubMed Id (Property:P698) instead of a reference URL pointing to an entry in PubMed. Note that Help:Sources#Web page suggests that uses of this reference property should be qualified with publication date (P577), title (P1476), archive URL (P1065), archive date (P2960), language (P2439). | Gene Wiki ACTION item: decide if these qualifiers are important. Main usage is for gene-disease associations coming from PhenoCARTA. Raises the question of whether we should be citing PhenoCARTA (an aggregator) or primary sources and, in either case, whether we should propose an authority control id (e.g. pubmed id) property for these sources.
Authority Control Properties | UniProt ID, NDF-RT ID, Guide to Pharmacology Ligand ID, Disease Ontology ID | When available, these should be used instead of reference URL for online databases. |
Articles | | When citing articles, we should cite Wikidata items for the article and create them if they are absent, e.g. stated in: Wikidata item, where the Wikidata item itself carries the PMID or PMCID. NOTE: can make an item for you out of the PMID. | Suggest: replace use of PMID as a reference property with a stated in connected to an item for the article in question.
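
The suggestions in this table lend themselves to an automated audit of existing database references. The sketch below is a hypothetical checker, not an existing tool. It takes the set of property IDs found in one reference and returns the clean-up actions implied by the guidelines on this page. The caller supplies which properties count as authority control identifiers.

```python
# Properties this page recommends against in Gene Wiki references.
DISCOURAGED = {
    "P143": "imported from",
    "P348": "software version",
    "P407": "language of work or name",
}


def audit_reference(props, authority_props=()):
    """List suggested fixes for one database reference, given its property IDs."""
    props = set(props)
    problems = [f"remove {DISCOURAGED[p]} ({p})"
                for p in sorted(props & DISCOURAGED.keys())]
    if "P248" not in props:
        problems.append("add stated in (P248)")
    if "P813" not in props:
        problems.append("add retrieved (P813)")
    if "P854" in props and props & set(authority_props):
        problems.append("drop reference URL (P854): an identifier property is present")
    return problems


# Reference shape listed on this page for the InterPro-derived 'has part' claims:
# stated in, imported from, software version, publication date, reference URL.
report = audit_reference({"P248", "P143", "P348", "P577", "P854"})
```

The retrieved check applies to database references only. Article references that use stated in with an article item would need a separate rule.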

Examples of all (non-identifier) claims we are concerned with

Example item | Subject Instance of | property | Bi? | Object Instance of | qualifiers props | reference props | More
Interleukin-26 | None. Subclass of Protein | biological process, cellular component, molecular function | No | None. SubclassOf parent on GO is a graph | determination method | stated in, UniProt ID, retrieved, language of work or name |
IL26 | None. Subclass of Gene | Chromosome | No | None. SubclassOf autosome. | | stated in, imported from, retrieved, Entrez Gene ID |
IL26 | None. Subclass of Gene | Encodes (Encoded by reverse this with same properties) | No | None. Subclass of Protein | | stated in, UniProt ID, retrieved, language of work or name |
IL26 | None. Subclass of Gene | Genomic Start, Genomic End | No | Number | genomic assembly | stated in, imported from, retrieved |
IL26 | None. Subclass of Gene | strand orientation | No | None. either reverse strand or forward strand | genomic assembly | stated in, imported from, retrieved, Entrez Gene ID | why would this have a gene id and start position not?
IL26 | None. Subclass of Gene | found in taxon | No | Taxon | | stated in, imported from, retrieved, Entrez Gene ID |
IL26 ulcerative colitis | None. Subclass of Gene | genetic association | Yes | None. SubclassOf disease | determination method | reference URL, stated in, imported from, retrieved | Suggest: drop the imported from.
Osteoporosis | None. Subclass of Disease | drugs used for treatment | No | Pharmaceutical drug | | stated in, NDF-RT ID, language of work or name, retrieved | gene wiki intention
Mm3 | None. Subclass of Protein | physically interacts with | Yes | Pharmaceutical drug chemical compound | use | stated in, Guide to Pharmacology Ligand ID, language of work or name, retrieved | Intended use, gene wiki.
Acetylcholine | Pharmaceutical drug chemical compound | physically interacts with | Yes | None. Subclass of Protein | as | stated in, Guide to Pharmacology Ligand ID, language of work or name, retrieved |
Mycobacterium leprae | Many, we use: Taxon to stand for organism | cause of | | Many, We use SubclassesOf disease | | Disease Ontology ID, stated in |
Human HTR2A | Many, we use Subclasses of Protein | has part | | Protein Domain, Active Site, Binding Site, Conserved Site, PTM, Repeat | | stated in, imported from, software version, publication date, reference URL |
Human HTR2A | Many, we use Subclasses of Protein, Gene disease | subclass of | | Protein Family | | stated in, imported from, software version, publication date, reference URL |

Examples of some identifier claims that matter to us

Osteoporosis | | exact match | external URL | stated in, Disease Ontology ID, retrieved, archive URL, reference URL
Dinoprostone | Chemical Compound | chemical formula | String | stated in, Drugbank ID, language of work or name, title, publication date
Dinoprostone | Chemical Compound | canonical SMILES | String | stated in, ChEMBL ID, language of work or name, title, retrieved