The ProteinBoxBot and associated MicrobeBot are working to populate Wikidata with information at a level of quality and up-to-dateness that would justify the use of this content to drive scientific and potentially even medical applications. To achieve this goal, it is vital that data consumers be able to easily access and interpret the evidence underlying the claims. The patterns for representing the semantics of the associated evidence and the provenance trails linking back to the original sources of information must be stably defined and implemented such that software can be written (most likely by other groups) that exposes this information to the intended end users. The purpose of this page is to identify and resolve any inconsistencies in the approaches taken so far and to define specific guidelines for moving forward. These can serve to organize the gene wiki team's efforts and potentially serve as a model for other groups with similar aims.
Given a Statement (claim & semantic qualifier), who made the claim? When? Where can I get the data? In other words, what are the references?
Guidelines for Referencing Databases, Ontologies and similar Web-native information entities.
- stated in (P248) with the object of the reference pointing at a Database item (e.g. Monarch Disease Ontology (Q27468140)) or, if it exists, a Database version item in Wikidata (e.g. InterPro Release 59.0 (Q27135875)). The version item is preferred where feasible.
- retrieved (P813) the most recent point in time when the claim was checked against the database. E.g. if a bot re-examines a data source and nothing has changed, this date can be reset to show this.
- In order to provide a direct link to the data, use either a data source specific identifier like InterPro ID (P2926) or, if none is present, a reference URL (P854) linked directly to the relevant entry in the database.
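The three parts of a database reference listed above can be sketched as a small helper. This is only an illustration, not the bot's actual code: the function name, the flat `{property ID: value}` layout (much simpler than the real Wikibase JSON), the retrieval date, and the InterPro entry ID `IPR000001` are assumptions chosen for the example.

```python
from datetime import date
from typing import Optional

STATED_IN = "P248"      # stated in
RETRIEVED = "P813"      # retrieved
REFERENCE_URL = "P854"  # reference URL


def database_reference(source_qid: str,
                       retrieved: date,
                       id_property: Optional[str] = None,
                       id_value: Optional[str] = None,
                       url: Optional[str] = None) -> dict:
    """Return a simplified reference as {property ID: value} (hypothetical helper)."""
    ref = {STATED_IN: source_qid, RETRIEVED: retrieved.isoformat()}
    if id_property and id_value:
        # Preferred: a data source specific identifier property.
        ref[id_property] = id_value
    elif url:
        # Fallback: a URL pointing straight at the database entry.
        ref[REFERENCE_URL] = url
    else:
        raise ValueError("need an identifier or a reference URL")
    return ref


# Cite the version item "InterPro Release 59.0" (Q27135875) with an InterPro ID (P2926).
# The date and the entry ID are placeholders for illustration.
ref = database_reference("Q27135875", date(2016, 11, 1),
                         id_property="P2926", id_value="IPR000001")
print(ref)
```

When a bot re-checks the source and finds nothing changed, only the `P813` value would need to be refreshed.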
If the information item owner uses a stable version system - e.g. 'official releases' on a periodic schedule:
Then use a Database version item (e.g. Phenocarta release 2016-02-04 (Q22978334)) to record:
- publication date -> timestamp for release
- archive URL -> URL for downloading data from that release
- edition or translation of -> Connection to parent database item (e.g. Phenocarta (Q22330995))
Else (the information item is updated unpredictably, e.g. through irregular pushes to a version control system):
Then use one item to reflect the information entity (e.g. an item for the Disease Ontology) to record publication date and archive URL, and update those fields whenever the information entity changes (and those changes are noticed and acted upon by a relevant bot or other editor).
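The versioned/unversioned decision above can be sketched as follows. The function names and the label-keyed dictionary are hypothetical, the archive URL is a placeholder, and the publication date is simply read off the release label; none of this is taken from an existing bot.

```python
from typing import Optional


def item_to_cite(database_qid: str, release_qid: Optional[str] = None) -> str:
    """Object of 'stated in': the release item if one exists, else the database item."""
    return release_qid or database_qid


def version_claims(publication_date: str, archive_url: str,
                   parent_qid: Optional[str] = None) -> dict:
    """Claims to record, keyed by property label.

    With parent_qid set, the claims describe a database version item.
    Without it, they describe the single item used for a database that
    has no stable releases (these values are then updated in place).
    """
    claims = {"publication date": publication_date, "archive URL": archive_url}
    if parent_qid is not None:
        claims["edition or translation of"] = parent_qid
    return claims


# Placeholder URL, for illustration only.
ARCHIVE_URL = "https://example.org/phenocarta/2016-02-04"

# Phenocarta release 2016-02-04 (Q22978334) is a version of Phenocarta (Q22330995).
release = version_claims("2016-02-04", ARCHIVE_URL, parent_qid="Q22330995")
cited = item_to_cite("Q22330995", release_qid="Q22978334")
print(cited, release)
```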
Possible model extension in the future:
- Add the concept Dataset Distribution to link different representations of a database such as isql, tsv, RDF, ... to the database version item in Wikidata, to align with the W3C dataset description specification.
Guidelines for Referencing Articles (e.g. journal articles)
Use: stated in (P248) with the object of the reference pointing at an item for the article (e.g. The Structure of Electronic Excitation Levels in Insulating Crystals (Q21709348))
Guidelines for referencing individual curators
Use: curator (P1640) as a property on the reference
Guidelines for referencing Gene Ontology annotations (and template for other complex annotations coming from multiple parties and presented through multiple aggregation services)
GO annotations should be referenced in a manner similar to how they are displayed in QuickGO (example. Format described in detail here and here). Data from the "With" column is not captured at this time. Each GO term statement should have a qualifier stating the determination method (P459). A statement can have multiple determination methods and multiple references. The reference should include the following properties:
- stated in (P248): The original source of the data. May be a scientific article (Q13442814), or a database (Q8513) that inferred the annotation electronically.
- curator (P1640): The human (Q5), organization (Q43229), or database (Q8513) that curated this information.
- retrieved (P813): the most recent point in time when the claim was checked against the database. E.g. if a bot re-examines a data source and nothing has changed, this date can be reset to show this.
- data source specific identifier or reference URL (P854): provides a direct link to the data.
- determination method (P459): In order to link references with determination method qualifiers on statements with multiple determination methods, the determination method property should also be added to the reference.
Example GO annotation reference model Q27553062
- Discussion 
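A minimal sketch of the GO reference model described above, including the check that ties each reference to a determination method qualifier on the statement. The helper names and the flat dictionary layout are assumptions, and the `Q_...` strings are placeholders rather than real Wikidata items.

```python
STATED_IN = "P248"
CURATOR = "P1640"
RETRIEVED = "P813"
DETERMINATION_METHOD = "P459"


def go_reference(stated_in, curator, retrieved, determination_method, link):
    """Build a simplified GO annotation reference.

    `link` is a (property ID, value) pair: either a data source specific
    identifier property or reference URL (P854).
    """
    link_property, link_value = link
    return {
        STATED_IN: stated_in,
        CURATOR: curator,
        RETRIEVED: retrieved,
        link_property: link_value,
        DETERMINATION_METHOD: determination_method,
    }


def unmatched_references(qualifier_methods, references):
    """Return references whose determination method is not a qualifier on the statement."""
    allowed = set(qualifier_methods)
    return [r for r in references if r.get(DETERMINATION_METHOD) not in allowed]


# Placeholder item IDs and URL, for illustration only.
ref_a = go_reference("Q_ARTICLE", "Q_CURATOR", "2016-11-01", "Q_METHOD_A",
                     ("P854", "https://example.org/annotation/1"))
ref_b = go_reference("Q_DATABASE", "Q_DATABASE", "2016-11-01", "Q_METHOD_B",
                     ("P854", "https://example.org/annotation/2"))

# The statement carries only Q_METHOD_A as a qualifier, so ref_b is flagged.
print(unmatched_references(["Q_METHOD_A"], [ref_a, ref_b]))
```

Repeating the determination method on the reference is what makes this check possible: a consumer can tell which evidence supports which method when a statement lists several.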
Miscellaneous notes on references
Not all statements require sourcing: Help:Sources/Items not needing sources
Property:P2352 (Applies to taxon) would be a useful qualifier to use for describing claims coming from model organism research.
Property:P1013 (criterion used) might be useful in combination with 'determination method' - should be careful not to conflate.
Scientific Evidence Semantics
Given a Claim such as "Metformin Treats Diabetes" or "P53 has subcellular localization Nucleus", we would like to be able to answer the question of "according to what experimental method?" How was the 'fact' created?
Wikidata items synchronized with ECO could play a role here, providing a stable reference for recording the semantics of evidence statements. The proposal is to do this as qualifiers on scientific claims.
|determination method||qualifier stating how a value has been determined||so far used for genetic association and gene ontology (molecular function, biological process, cellular component) claims.|
|use||main use of the subject (includes current and former usage)||So far used in claims linking proteins to drugs to indicate the nature of the relation - e.g. agonist.|
|as||generic qualifier||So far used in claims linking drugs to proteins to indicate the nature of the relation - e.g. agonist. Gene Wiki ACTION item - review this reciprocal model, seems wrong.|
|genomic assembly||specify the genome assembly on which the feature is placed||Used to indicate specific genome build used for gene coordinates.|
Proposal for use of Instance and Subclass properties
These properties are important to use consistently as basically all downstream software will want to employ them. As one example, it would be useful for building cytoscape apps or knowledge.bio apps as part of query filters and as a way to color network nodes.
- For hierarchical entities - e.g. diseases in an ontology, processes from the gene ontology, etc. Propose the 'punning' pattern. Assign each item to be an Instance Of a metaclass (e.g. 'disease') and use the subClassOf relation to reflect the hierarchical structure (isa or subclassof) of the ontology. (Side note: avoid over-using subclassOf, e.g. for part-of relations.)
- For bottom level concepts - e.g. specific variants, genes, or proteins that you would never reasonably expect to have subclasses. Propose to treat them as instances and not use the subclass relation.
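The two proposals above can be sketched as a single rule for generating typing claims. The function name is hypothetical and the `Q_...` values are placeholders; only instance of (P31) and subclass of (P279) are real property IDs.

```python
INSTANCE_OF = "P31"   # instance of
SUBCLASS_OF = "P279"  # subclass of


def typing_claims(metaclass_qid, parent_qids=(), is_leaf=False):
    """Return (property, value) pairs for an item under the proposed pattern.

    Hierarchical items (is_leaf=False) are 'punned': instance of a metaclass
    plus subclass of each ontology parent. Bottom-level items (is_leaf=True)
    get instance of only.
    """
    claims = [(INSTANCE_OF, metaclass_qid)]
    if is_leaf:
        if parent_qids:
            raise ValueError("bottom-level items should not use subclass of")
        return claims
    claims.extend((SUBCLASS_OF, parent) for parent in parent_qids)
    return claims


# A disease term with one parent in the ontology (placeholder IDs).
print(typing_claims("Q_DISEASE", ["Q_PARENT_DISEASE"]))
# A specific gene, treated as a bottom-level instance.
print(typing_claims("Q_GENE", is_leaf=True))
```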
|Name||Instance of||Subclass of||Notes|
|Interpro Items||Protein Family, Protein Domain, Active Site, Binding Site, Conserved Site, PTM, Repeat||Many. Represents hierarchy|
|Drugs||chemical compound, pharmaceutical drug||Many. Represents hierarchy|
|GO Terms||None||Many. Represents hierarchy||Make all instance of GO term?|
|Gene||None||gene, protein-coding gene||Make all instance of gene?|
|Protein||None||protein||Make all instance of protein?|
|Taxon||taxon||None. parent taxon encodes hierarchy|
|Disease||disease||Mostly hierarchy. ~100 are subclass of disease|
Use of reciprocal properties
This is a question. When there are reciprocal properties such as 'part of' and 'has part', when should we use both and why?
Current landscape of content added by the gene wiki project
|name||definition||notes||Gene wiki action/discussion|
|stated in||to be used in the source field, to indicate where a claim is made||This should be the primary way we connect claims to sources including both databases and published articles.|
|imported from||source of this claim's value (use only in References section)||The consensus appears to be that this property should only be used to describe data imported from a Wikipedia. (See Help:Sources and imported from talk). We are currently using 'imported from' to
Unfortunately, it seems there is no way for others to know that these are our intentions as they are inconsistent with each other and do not match what is described in Help:Sources
|Adopt a strategy more consistent with current property semantics. Suggest:|
|retrieved||date or point in time that information was retrieved from a database or website (for use in online sources)|
|language of work or name||for works (for original language use P364 and for persons P103 and P1412)||This property is not really intended for, nor really useful in, the gene wiki context. Further, it is frequently considered for deletion (Property talk:P407). Note that if we wanted a way to indicate a language (which we really don't need to so far) we could also use Property:P2439, which seems more appropriate and less contentious.||Suggest: Remove these from all of our references. They don't do harm per se but are unnecessary and add complexity and hence confusion. We want clean examples for others to emulate.|
|software version||version(s) of the software, current and past||Right now, this looks like it is only being used to provide a source for the 'has part' relations linking proteins to domains drawn from InterPro. This looks like an alternative way to model database/software version from the item-based approach taken with genome builds.||We should pick one method and be consistent. Note also that the protein 'has part' domain relations have redundant 'imported from' and 'stated in' references. (Same problems appear on the 'subClassOf' relations between proteins and families.) Suggest: Do not use for databases. Use 'stated in' linked to an item for the database version.|
|reference URL||should be used for internet URLs as references||Only use this when Wikidata does not have an authority control property referring to the resources you are linking to. For example, use PubMed Id (Property:P698) instead of a reference URL pointing to an entry in PubMed. Note Help:Sources#Web page suggests that uses of this reference property should be qualified with publication date (P577), title (P1476), archive URL (P1065), archive date (P2960), language (P2439).||Gene Wiki ACTION item - decide if these qualifiers are important. Main usage is for gene-disease associations coming from PhenoCARTA. Raises the question of whether we should be citing PhenoCARTA (an aggregator) or if we should be citing primary sources and in either case, whether we should propose an authority control id (e.g. pubmed id) property for these sources.|
|Authority Control Properties||UniProt ID, NDF-RT ID, Guide to Pharmacology Ligand ID, Disease Ontology ID||When available, these should be used instead of reference URL for online databases.|
|Articles.||When citing articles, we should cite Wikidata items for the article and create them if they are absent, e.g. stated in: Wikidata item, where the Wikidata item itself carries the PMID or PMCID. NOTE: https://tools.wmflabs.org/sourcemd/ can make an item for you out of the PMID.||Suggest: replace use of PMID as a reference property with a stated in connected to an item for the article in question.|
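Two of the suggestions in the table (dropping language of work or name, P407, from references, and replacing a PMID reference property with stated in pointing at an article item) could be applied with a clean-up step like the sketch below. The function, the flat `{property ID: value}` reference layout, the PMID value, and the `Q_ARTICLE` item are all hypothetical.

```python
PMID = "P698"              # PubMed ID
LANGUAGE_OF_WORK = "P407"  # language of work or name
STATED_IN = "P248"         # stated in


def tidy_reference(reference: dict, article_items: dict) -> dict:
    """Return a cleaned copy of a simplified reference.

    `article_items` maps a PMID string to the Wikidata item for that article.
    A PMID is only replaced when such an item is known and the reference
    has no 'stated in' value yet.
    """
    tidy = {p: v for p, v in reference.items() if p != LANGUAGE_OF_WORK}
    pmid = tidy.get(PMID)
    if pmid in article_items and STATED_IN not in tidy:
        tidy[STATED_IN] = article_items[pmid]
        del tidy[PMID]
    return tidy


# Placeholder PMID and article item, for illustration only.
known_articles = {"12345": "Q_ARTICLE"}
old_reference = {"P698": "12345", "P407": "Q1860"}
print(tidy_reference(old_reference, known_articles))
```

A reference whose PMID has no article item yet is left with its PMID, so the item can be created first (for example with the sourcemd tool mentioned above) and the clean-up rerun.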