Wikidata talk:WikiProject Molecular biology

This is the talk page for discussing improvements to Wikidata:WikiProject Molecular biology.

GO Term Provenance


Hey all, I'm planning on adding/updating GO annotations to protein items using a new provenance pattern that preserves references to the original curator and the journal article the claim was sourced from. The pattern is as follows:

GO annotations will be referenced in a manner similar to how they are displayed in QuickGO (example; the format is described in detail here and here). Data from the "With" column is not captured at this time. Each GO term statement should have a qualifier stating the determination method (P459). A statement can have multiple determination methods and multiple references. Each reference should include the following properties:

stated in (P248): The original source of the data. May be a scholarly article (Q13442814), or a database (Q8513) that inferred the annotation electronically. (Note: If this is a journal article, the item is created.)
curator (P1640): The human (Q5), organization (Q43229), or database (Q8513) that curated this information.
retrieved (P813): the most recent point in time when the claim was checked against the database. E.g. if a bot re-examines a data source and nothing has changed, this date can be reset to show this.
data source specific identifier or reference URL (P854): In order to provide a direct link to the data
determination method (P459): In order to link references with determination method qualifiers on statements with multiple determination methods, the determination method property should also be added to the reference.

An example item is RNA-binding protein POP5 YAL033W (Q27553062). Comments and suggestions are welcome. Some notes:

The data is downloaded from the UniProt-GOA database using QuickGO. So should stated in (P248) be GOA (Q28018111) in addition to the journal article (if one exists)?
Should we keep a reference URL to the QuickGO query?
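
As a rough illustration of the pattern above, here is a minimal sketch that assembles one reference and one statement as plain Python dictionaries. The property IDs come from the list above; the function names and the dictionary layout are assumptions made for illustration only and are not the bot's actual code or the Wikibase JSON format.

```python
from datetime import date

# Property IDs as listed in the proposal above.
STATED_IN = "P248"
CURATOR = "P1640"
RETRIEVED = "P813"
REFERENCE_URL = "P854"
DETERMINATION_METHOD = "P459"


def build_go_reference(stated_in, curator, retrieved, reference_url, method):
    """Return one reference as a property -> value mapping (hypothetical layout)."""
    return {
        STATED_IN: stated_in,
        CURATOR: curator,
        RETRIEVED: retrieved.isoformat(),
        REFERENCE_URL: reference_url,
        # The method is repeated on the reference so that each reference can be
        # matched to one of the determination-method qualifiers on the statement.
        DETERMINATION_METHOD: method,
    }


def build_go_statement(go_term, references):
    """Return a statement whose qualifiers list every method used by its references."""
    methods = []
    for ref in references:
        method = ref[DETERMINATION_METHOD]
        if method not in methods:
            methods.append(method)
    return {
        "value": go_term,
        "qualifiers": {DETERMINATION_METHOD: methods},
        "references": list(references),
    }
```

Calling build_go_statement with two references that use different methods yields a statement with two determination method qualifiers, each traceable to its reference.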

Improper aliases

There are more than 300k items for individual proteins that have "protein" as an alias, like [1]. These aliases make no sense. Maybe a bot can remove them.--GZWDer (talk) 14:18, 6 February 2017 (UTC)[reply]
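
A bot step for this could be as small as the following sketch. The function name and the set of generic aliases are assumptions for illustration; a real bot would also need to read and write the aliases through the Wikidata API.

```python
# Aliases considered too generic to be useful (hypothetical list).
GENERIC_ALIASES = {"protein"}


def drop_generic_aliases(aliases, generic=GENERIC_ALIASES):
    """Return the aliases with generic ones removed, keeping the original order."""
    return [a for a in aliases if a.strip().lower() not in generic]
```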

Subclass of -> Instance of for Genes and Proteins

It's very useful for application building and querying to be able to know what an entity "is" without having to traverse class hierarchies. For each ontology term we maintain, we add an appropriate instance of relation. For example, blindness (Q10874) is a subclass of retinal disease (Q550455) and an instance of disease (Q12136). If there are no objections, we (ProteinBoxBot) will move the "subclass of" gene (Q7187) or protein (Q8054) to "instance of" for gene and protein items. Proteins that are in protein families will be a subclass of that family (e.g. Succinyl-CoA:glutarate-CoA transferase (Q21124586)). Gstupp (talk) 20:23, 2 April 2017 (UTC)[reply]
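
The proposed move can be sketched as follows. This is only an illustration on a simplified claims dictionary (property ID mapped to a list of item IDs); the function name and data layout are assumptions, not ProteinBoxBot's implementation.

```python
GENE, PROTEIN = "Q7187", "Q8054"
INSTANCE_OF, SUBCLASS_OF = "P31", "P279"


def move_to_instance_of(claims):
    """Move 'subclass of' gene/protein to 'instance of'; keep other subclass claims.

    claims maps a property ID to a list of item IDs. A new dict is returned
    and the input is left untouched.
    """
    result = {prop: list(values) for prop, values in claims.items()}
    subclass_values = result.get(SUBCLASS_OF, [])
    moved = [q for q in subclass_values if q in (GENE, PROTEIN)]
    # Family memberships and other subclass statements stay where they are.
    result[SUBCLASS_OF] = [q for q in subclass_values if q not in (GENE, PROTEIN)]
    instance_values = result.setdefault(INSTANCE_OF, [])
    for q in moved:
        if q not in instance_values:
            instance_values.append(q)
    return result
```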

Soliciting suggestions of new data sources

Dear all, we on the Gene Wiki / ProteinBoxBot team are doing some planning and prioritization of future biomedical data sets to load, and we'd like to solicit suggestions from the broader Wikidata community. Historically, the scope of our bot loading effort has revolved around genes, proteins, drugs, diseases, and microbes. And more recently we've also helped related groups load data on genetic variants and pathways. We would welcome suggestions of either other related entity types that should be systematically loaded, or data sources that describe relationships between these entity types. Obviously, availability of a high-quality, CC0-licensed data source is essential. Please let us know if you have any suggestions. (Cross posting to WD:MB, WD:MED, and Wikidata:WikiProject_Chemistry.) Best, Andrew Su (talk) 20:03, 23 June 2017 (UTC)[reply]

Hi @Andrew Su:. How about "cytogenetic location" data? (e.g. ABO gene located at "9q34.2" [2]). When I was making Template:Genetics properties, I found that cytogenetic location data does not exist yet. As you know, all (or almost?) genes already have genomic start (P644), genomic end (P645) (basepair location in specific GRCh version). Thank you for your effort for that! --Was a bee (talk) 10:00, 5 July 2017 (UTC)[reply]

Currently there is a property proposal (Wikidata:Property proposal/Cytogenetic location). No opinions have come in yet (too technical...?) --Was a bee (talk) 10:20, 5 July 2017 (UTC)[reply]

@Was a bee: I created an issue for it here: https://github.com/SuLab/GeneWikiCentral/issues/38. I think it should be relatively straightforward to add, but I did tag it as "low priority". If there are compelling use cases or queries that would benefit from adding this info, let us know and we can look at upping the priority. Thanks for the suggestion! Best, Andrew Su (talk) 16:21, 5 July 2017 (UTC)[reply]

@Andrew Su: Yesterday I tried adding a new column to Infobox_gene (en:Module_talk:Infobox_gene#Gene_location_column_added). I don't know what you think about that column addition, but what I'm thinking now is that it would be useful for general readers if band information were accessible through that column. What do you think?--Was a bee (talk) 05:03, 19 August 2017 (UTC)[reply]

@was a bee: Bravo, I love it! Added my support for the property proposal... If you create/enhance the visualization to include cytogenetic location, we will load and maintain the data using our bot. Nice work! Best, Andrew Su (talk) 19:20, 21 August 2017 (UTC)[reply]

Hi @Andrew Su:. How about "Open Targets" data? (e.g. for the F12 gene the current version of the Open Targets Platform (which is free to use, no need for registration) shows the association of that gene with 192 diseases [3]). The association is based on different types of information (or evidence) such as genetics (somatic or germline), drug information, text mining, affected biochemical pathways, RNA differential expression and mouse models. The opposite is also true: one can start from the disease point of view and find which genes are associated with that disease (e.g. there are 3206 genes, or targets, associated with Alzheimer's) [4]. Wikidata could also link to a profile page of a gene (e.g. F12 [5]) or disease [6]. --Rejancar (talk) 10:00, 13 December 2017 (UTC)[reply]

classification of properties

I created Wikidata property to identify proteins (Q42415644) and Wikidata property to identify proteins (Q42415644) to organize all properties that uniquely identify such items (see also Wikidata:Identifier). This does not include genomic start (P644) and genomic end (P645) because they only identify an item if used together. Could you please have a look at this list (unless it has become empty) and also classify these properties? If the properties do not identify individual genes or individual proteins, they must be put into Wikidata property for authority control (Q18614948) or another of its subclasses. -- JakobVoss (talk) 15:03, 30 October 2017 (UTC)[reply]

Haplogroups

Hi, this may not be directly within the scope of this project. However, this project may still be the best place to ask for help. I would like to convert Template:Infobox haplogroup (Q10562645) in fiwiki to use Wikidata. At the template level I can do it, but I need help with choosing the correct Wikidata properties to store the data. It would also be great if someone could store the information from the w:en:Haplogroup N (mtDNA) infobox in the Haplogroup N (Q118710) Wikidata item, in line with the current practices of this wikiproject, as an example. --Zache (talk) 09:43, 3 December 2017 (UTC)[reply]

Ok, I made a wikiproject page for haplogroups, Wikidata:Haplogroups, and I tried to populate Haplogroup N (Q118710). So now I have some questions:

Is place of birth (P19) and date of birth (P569) limited to single entities or can I use those properties with species too?
How to store the Defining mutations? I used the has part(s) (P527) for this.

All other suggestions/comments are welcome too --Zache (talk) 10:13, 8 December 2017 (UTC)[reply]

Sometimes people confuse Y-DNA and mtDNA haplogroups. So how about different from (P1889)? --Was a bee (talk) 13:00, 8 December 2017 (UTC)[reply]

Added different from (P1889), thanks. --Zache (talk) 13:15, 8 December 2017 (UTC)[reply]

Help needed merging Gene Wiki pages

En:C1S has been merged into En:Complement component 1s. I now need to merge the corresponding Wikidata items (Q17854065 and Q5156403 respectively), but given there are corresponding articles in other languages, I am not sure how to go about merging the Wikidata items. Do the corresponding articles in other languages also need to be merged? Any pointers would be greatly appreciated. Cheers. Boghog (talk) 07:41, 20 December 2017 (UTC)[reply]

Boghog I merged the items. There weren't any conflicting articles in the same language (unlike English), so I just moved the corresponding wiki links over to one item and then merged it. Gstupp (talk) 20:42, 20 December 2017 (UTC)[reply]

Thanks for merging and for the pointers. Much appreciated. Boghog (talk) 20:47, 20 December 2017 (UTC)[reply]

Classification of the entities managed by this project in Wikidata

Hi, the project ontology detected problems in the classification of genes, proteins and processes, mainly that those entities are both subclasses and instances of the same item, which is an ontological problem. There is also inconsistent usage of sources like the Gene Ontology with respect to the actual statements created on Wikidata by ProteinBoxBot. The issue has been raised on Project chat in a discussion about issues in our class tree, amongst others, and a discussion started there with the bot owners. They ask for a community consensus on a proposed solution; you’re welcome to comment there. If that’s too confusing, maybe I can write a subpage for comment here; please ask. author TomT0m / talk page 11:55, 8 February 2018 (UTC)[reply]

Describing a molecular biological process

Hi,

I'm wondering how to describe a molecular biological process on Wikidata. For example:

Substance T release
Binds to substance T receptors on Microglias
Microglia activation
Release alphaTNF

It could be an item that stores the process, but it would also be interesting to have the list of receptors and produced molecules for cells. Before I misuse the properties, what would be good practice? I guess there may be examples or guides somewhere that I missed?

Regards

-- Thibdx (talk) 00:33, 23 December 2018 (UTC)[reply]

It's not exactly clear to me the type of statements you'd like to add. Can you give a few examples? Best, Andrew Su (talk) 05:51, 18 January 2019 (UTC)[reply]

I think Reactome and biocyc/metacyc are the resp. bio databases that have years of experience building ontologies for these kinds of concepts. Please check first how they do it. --SCIdude (talk) 14:00, 26 July 2019 (UTC)[reply]

Possible merge required

Is Cell cycle regulator of NHEJ (Q21438637) a duplicate of CYREN (Q18045925)? w:en:C7orf49 would like to be attached to a wikidata item, but it's not clear to me which of these two is its friend. --Tagishsimon (talk) 04:57, 18 January 2019 (UTC)[reply]

@Tagishsimon: those two items should not be merged. One describes the human gene and the other describes the human protein, and they have reciprocal statements based on encodes (P688) and encoded by (P702). The convention that is primarily used is to link the WP page to the gene item (CYREN (Q18045925) in this case). Hope that helps! Best, Andrew Su (talk) 05:48, 18 January 2019 (UTC)[reply]

Excellent; thank you, Andrew. --Tagishsimon (talk) 06:08, 18 January 2019 (UTC)[reply]
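
The reciprocal-statement convention mentioned in this thread suggests a simple pre-merge check. The sketch below is hypothetical: items are represented as small dictionaries with an "id" and a "claims" mapping, and the function name is made up for illustration.

```python
ENCODES, ENCODED_BY = "P688", "P702"


def is_gene_protein_pair(item_a, item_b):
    """Return True if the two items are linked as gene and protein.

    Each item is a dict with "id" and "claims" (property ID -> list of item IDs).
    A pair qualifies when one item encodes the other and the other carries the
    reciprocal 'encoded by' statement, in which case they should not be merged.
    """
    def linked(gene, protein):
        return (protein["id"] in gene["claims"].get(ENCODES, [])
                and gene["id"] in protein["claims"].get(ENCODED_BY, []))

    return linked(item_a, item_b) or linked(item_b, item_a)
```

With the two items from this thread represented as described (gene Q18045925 encoding protein Q21438637 and the reverse statement on the protein), the function reports a gene/protein pair rather than a duplicate.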

Molecular Reaction?

On the properties page (https://www.wikidata.org/wiki/Wikidata:WikiProject_Molecular_biology/Properties), in the subsection named Proposed Properties linking genes to genes, you have a property that is named reaction. I think that is very interesting, and here (https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry#Modelling_a_chemical_reaction?) they are also talking about that. Could you please tell me what its description is? I could not figure that out from https://string-db.org. Juliansteinb (talk) 20:07, 10 April 2019 (UTC)[reply]

A possible Science/STEM User Group

There's a discussion about a possible User Group for STEM over at Meta:Talk:STEM_Wiki_User_Group. The idea would be to help coordinate, collaborate and network cross-subject, cross-wiki and cross-language to share experience and resources that may be valuable to the relevant wikiprojects. Current discussion includes preferred scope and structure. T.Shafee(evo&evo) (talk) 02:36, 26 May 2019 (UTC)[reply]

"determination method" property on GOA references

Hello, GOA refs provide statements backed by determination methods (DM); the certainty of statements about genes/proteins can be deduced from the determination method (P459) values provided by one or several references. The problem is that the same property determination method (P459) is both on the statement and its reference which makes no sense semantically---consequently there are scope violations. I'd propose a new property applicable to references, e.g. "uses determination method" or "provides evidence" that has the same value as the determination method (P459) at the moment. Batch changing this would be no problem I think. Comments? --SCIdude (talk) 13:42, 29 July 2019 (UTC)[reply]

Yeah, I agree that it's a bit semantically imprecise as it's currently implemented. In my opinion, the two options are to create a new property as you suggest (advantage: semantically precise, resolves scope violations; disadvantage: proliferation of highly-related properties may cause confusion) or to expand the scope of the existing property (advantage: simpler data model that still resolves scope violations; disadvantages: formalizes a semantically imprecise solution). I'm not passionate either way. We can certainly modify the bot to match whatever consensus emerges here... Best, Andrew Su (talk) 16:29, 29 July 2019 (UTC)[reply]

prepro property

There appears to have been no discussion/proposal about a property for precursor proteins, e.g. preproinsulin (Q39798) has precursor preproinsulin (Q7240673). The connected type would be protein precursor (Q258658). Possible labels:

"precursor"
"has precursor"
"cleaved from"

Possible qualifier: "catalyst" or "cleaving enzyme", another property?

Please comment on this, I'd like to make it my first property proposal. --SCIdude (talk) 05:53, 30 July 2019 (UTC)[reply]

I would support such a property. A number of precursor proteins serve as biomarkers for diseases eg- PIIINP in Vascular Ehlers Danlos and miscellaneous fibrosis diseases or Pro-SFTPB for idiopathic pulmonary disease. My guess is that you'll need to justify why 'has part' or 'part of' isn't sufficient, so having an example ready will likely help. -- Gtsulab (talk) 18:25, 31 July 2019 (UTC)[reply]

To stay with insulin, the end product is a hexamer of two parts of the prepro. Also there may be posttransl. modifications. --SCIdude (talk) 17:27, 1 August 2019 (UTC)[reply]

WD UniProt duplicates/fragments policy

From https://www.uniprot.org/help/redundancy:

UniProtKB/TrEMBL: one record for 100% identical full-length sequences in one species;
UniProtKB/Swiss-Prot: one record per gene in one species;

UniProt protein ID (P352) has distinct and single value constraints which means with:

UniProtKB/TrEMBL: fragments have different IDs, no violation, but identical proteins from different bact. strains (same species) get the same ID, violate the constraints
UniProtKB/Swiss-Prot: fragments of the same gene product have the same ID as the prepro, violate the constraints

To adapt the constraints to be more lax one would need to check that the two proteins have a precursor relationship (=PASS), or that, if taxa are different, the common ancestor of the two taxa is species or below (=PASS). Is this possible? If not, the identical bact. proteins could be merged but not the different fragments of the same prepro. --SCIdude (talk) 17:12, 1 August 2019 (UTC)[reply]
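
To make the proposed relaxed check concrete, here is a minimal sketch. Everything in it is an assumption for illustration: proteins are dicts with "id" and "taxon", precursor relationships are given as a set of ID pairs, and the taxonomy is supplied as two lookup tables (parent taxon and rank) that are assumed to be acyclic.

```python
SPECIES_OR_BELOW = {"species", "subspecies", "strain"}


def lineage(taxon, parent_of):
    """Return the taxon followed by all of its ancestors (parent_of must be acyclic)."""
    chain = [taxon]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def common_ancestor(taxon_a, taxon_b, parent_of):
    """Return the lowest taxon shared by both lineages, or None."""
    seen = set(lineage(taxon_a, parent_of))
    for taxon in lineage(taxon_b, parent_of):
        if taxon in seen:
            return taxon
    return None


def shared_id_allowed(p1, p2, precursor_pairs, parent_of, rank_of):
    """Return True (PASS) if two items may share one UniProt ID under the relaxed rule."""
    # Rule 1: the two proteins have a precursor relationship.
    if (p1["id"], p2["id"]) in precursor_pairs or (p2["id"], p1["id"]) in precursor_pairs:
        return True
    # Rule 2: taxa differ, but their common ancestor is a species or below.
    if p1["taxon"] != p2["taxon"]:
        ancestor = common_ancestor(p1["taxon"], p2["taxon"], parent_of)
        return ancestor is not None and rank_of.get(ancestor) in SPECIES_OR_BELOW
    # Same taxon and no precursor relationship: still a violation.
    return False
```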

In their 2019-Nov release UniProt has made changes that allow several statements to be made specifically on isoforms/fragments/cleavage products associated with a full UniProt "protein" entry sequence, see https://www.uniprot.org/news/2019/12/18/release.

Clearly, many UniProt entries must now be considered as containers of subentries with associated statements, creating the need for separate WD items, which can be recognized by being instances of isoform (Q609809), protein variant (Q77030030), or protein fragment (Q78782478). I expect Gene Ontology moving to annotation of subentries in the future, as well. --SCIdude (talk) 16:49, 4 February 2020 (UTC)[reply]

Problems with PDB and GOA from UniProt imports

As mentioned, UniProt/SwissProt entries contain all gene products from a gene, including cleavage products. That means PDB and GOA statements get lumped together at the UniProt entry, and so they are imported and dumped into the first item that carries the UniProt ID. So we had, for example, 3D structures and GOA statements about angiotensin-2 (the main peptide hormone) in the angiotensinogen item. This must be sorted out manually, but worse, it will be re-dumped with the next bot run. This is a general heads-up for the issue. The cause for the GOA facet is at GOA, because apparently it's not a rule to specify which peptide is referred to, just the UniProt ID. The PDB mixing happens at UniProt. If you're a bot dev, I know you cannot handle this easily, so I suggest to STOP the import of these statements FOR UniProt entries that have curated fragments (there are 883 for human with keyword "Cleavage on pair of basic residues [KW-0165]" out of roughly 50k human entries, so the percentage is less than 2%). --SCIdude (talk) 09:35, 6 August 2019 (UTC)[reply]

@SCIdude: Thanks for digging into this issue and also suggesting a practical solution -- much appreciated! Please bear with us if it takes us a little bit to implement the proposed solution or if it takes a couple iterations to get it precisely right. I'm going to jot down some notes here just to be sure I have it right. You specifically mention angiotensinogen (Q267200) and I see you've removed a bunch of PDB IDs that appear to correspond to the peptide hormone. (Should 5E2Q also be removed then?) The Uniprot ID is P01019, and in the "Names & Taxonomy" section I see that it is cleaved into 8 fragments. So in that case, you are proposing that the bot would not import PDB IDs. Can you tell us how you did your query to get the 883 proteins you mention? Also, I see you removed some of the GO annotations. Anything we should be aware of there? (EDIT: of course, it's the same issue of annotations on the peptide being transferred to the parent protein. Got it...) Best, Andrew Su (talk) 16:01, 6 August 2019 (UTC)[reply]

Yes, I added 5E2Q to angiotensin II (Q412999) but forgot to remove the original. The UniProt query was "human AND keyword:KW-0165" but I just see that "OS:9606 AND keyword:KW-0165" gives 308 hits (297 reviewed)---my query apparently added a bunch of "human" virus proteins---so even fewer entries are affected. As to GOA, I copied ALL GO annotations from angiotensinogen (Q267200) to angiotensins (Q65963433), removed those on angiotensinogen (Q267200) that referred specifically to one of the peptides, and removed those on angiotensins (Q65963433) that referred specifically to angiotensinogen (Q267200). I'm the author of most GOA annotations of M.tuberculosis so I'm feeling somewhat confident in my work. Nevertheless, I refrained from further detailing the angiotensins (Q65963433) annotations to the specific peptides. --SCIdude (talk) 16:39, 6 August 2019 (UTC)[reply]

(tangent) I have a somewhat peptide/protein-centric view: GO annotations are practically about those entities. They are what has function, and what localizes somewhere. So I think the annotations should associate with those WD items that represent those entities. --SCIdude (talk) 16:46, 6 August 2019 (UTC)[reply]

I just see that the KW-0165 keyword is not on AGT so I have the numbers wrong. You can't trust UniProt to put that keyword everywhere needed. --SCIdude (talk) 17:24, 6 August 2019 (UTC)[reply]
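
The suggestion earlier in this thread (skip the import for entries with curated fragments) could look roughly like the sketch below. The entry layout is hypothetical: "keywords" and "chains" are assumed field names, and treating "more than one listed chain" as evidence of curated fragments is an assumption added here because, as noted just above, the keyword alone is not reliable.

```python
CLEAVAGE_KEYWORD = "KW-0165"


def has_curated_fragments(entry):
    """Return True if the entry carries the cleavage keyword or lists several chains.

    entry is a dict with optional "keywords" and "chains" sequences
    (hypothetical field names, not the UniProt record format).
    """
    keywords = entry.get("keywords", ())
    chains = entry.get("chains", ())
    return CLEAVAGE_KEYWORD in keywords or len(chains) > 1


def entries_to_import(entries):
    """Keep only entries whose PDB/GOA statements can be attached unambiguously."""
    return [entry for entry in entries if not has_curated_fragments(entry)]
```

An entry like P01019, which the thread describes as cleaved into 8 fragments but lacking the keyword, would be skipped by the chain count rather than by the keyword test.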

@SCIdude: On second look at this issue, I'm actually not sure what the best solution here is. Originally I thought that GO annotators were annotating the peptide (eg angiotensin II (Q412999)) and that the gene-centric nature of our bots (or the protein-centric nature of uniprot) was corrupting that somehow. But actually it looks like it's the annotators themselves that are associating the annotations with the parent protein (eg [7]). While I can see that the changes you've made are more correct biologically, the referencing then becomes wrong. Technically, the original bot edits are correct from a Wikidata standpoint because they accurately reflect a source's claim (like the statement that the earth is flat). The ideal long-term situation I think is to get the GO annotators to be more precise on exactly what they're annotating, which would obviate the need for our bot to make any judgement calls. Your thoughts (or others' thoughts)? Best, Andrew Su (talk) 05:07, 8 August 2019 (UTC)[reply]

GO annotators associate UniProt IDs with PMIDs. When I curated I didn't realize there were sub-IDs like P01019#PRO_0000032457 (I think UniProt prefers PRO_0000032457 as ID, from their help page). Certainly the annotation tool's input mask wouldn't have allowed that, but that was 8 years ago. Of course technically it's all correct, but then no database can be trusted 100%, so we need to edit the remaining 2%. And I have seen lots of plain wrong GO annotations in UniProt entries too, apart from all electronic annotations (IEA) and traceable author statements (TAS) that never get tested in the lab... --SCIdude (talk) 05:59, 8 August 2019 (UTC)[reply]

As to the referencing becoming wrong, why do you say that---the annotation still associates with the same UniProt ID, just a subobject of that ID. Can you please give an example? Regards --SCIdude (talk) 06:09, 8 August 2019 (UTC)[reply]

However, I think your bot is buggy when it places all annotations in one protein item when there are several that have the same UniProt ID. I'd suggest in that case use the gene. Maybe also the PDBs. What do you say? --SCIdude (talk) 16:04, 8 August 2019 (UTC)[reply]

Sorry, I wasn't as precise as I should have been regarding referencing. You added the UniProt identifier P01019 to the new item for angiotensins (Q65963433) (and added it to angiotensin II (Q412999)), which is the same as for angiotensinogen (Q267200). I definitely see your rationale, but I worry a bit about putting that same identifier on those related-but-different concepts. In my mind, the uniprot identifier should _only_ apply to angiotensinogen (Q267200). If we removed that identifier from angiotensins (Q65963433) (and added, for example, PRO_0000032457 in a new property), then the reference URL (P854) that points to the Uniprot entry doesn't make sense any more.

On the more general point of database accuracy, I absolutely agree that no DB can be trusted 100%. But under the idea that wikidata is a collection of claims and not a database of "truth", I believe that our bot should simply reflect what the GO annotators say as precisely as possible, right or wrong. For whatever that is wrong, I think the ideal scenario is to 1) notify the GO annotators so it can be fixed at the source, and 2) add a Wikidata statement but use a relevant scientific article in the reference. Your thoughts?

(and replying to your last comment, which came after I wrote the above) I think that zeroes in on the issue -- should there be several items that have the same UniProt ID? I think you say yes while my gut says no (but again can see the rationale). Other folks have opinions here? Best, Andrew Su (talk) 16:15, 8 August 2019 (UTC)[reply]

Specifying PTM Type

I'm interested in adding a qualifier on the type of post-translational modification (phosphorylation, N-linked glycosylation, sumoylation, etc.) when using Property:P1917. Does anyone have suggestions on the best property to use for such a qualifier? For example, the phosphorylation of alpha-synuclein at S129 is associated with Parkinson's disease. In this case, I would create a statement on the page for alpha synuclein (Q288591): posttranslational modification association with (P1917), value: Parkinson's disease (Q11085). Qualifier: property (need suggestions for this), value: protein phosphorylation (Q7251493). Thanks -- Gtsulab (talk) 19:54, 6 August 2019 (UTC)[reply]

A PTM property on the main object could be modeled as "has part" with an amino acid position added. So why not add the disease as qualifier to the PTM? --SCIdude (talk) 04:28, 7 August 2019 (UTC)[reply]

Thanks for the suggestion. I think the property that I'm using posttranslational modification association with (P1917) is already intended to link the protein subject with a disease object, so it doesn't make sense to add the disease as a qualifier when it would already be the main value. I think I can use your suggestion for 'has part' in the qualifier for this statement to link 'protein phosphorylation'. I don't see a good way for including the amino acid residue modified, since adding it as a value for a property would require that it first be added as an entity, no? I don't fancy creating entities for all amino acid residues that are modified. Gtsulab (talk) 20:08, 7 August 2019 (UTC)[reply]

RefSeq has a way to specify "Site"s, e.g. https://www.ncbi.nlm.nih.gov/protein/NP_000930.1?from=87&to=87. You could use a qualifier on the 'has part'-'protein phosphorylation' with the property connects with (P2789) giving the link as value. That was the only property I found; is there a better one? --SCIdude (talk) 16:25, 8 August 2019 (UTC)[reply]

Ah there is also applies to part (P518).. no, both need items, sigh --SCIdude (talk) 16:29, 8 August 2019 (UTC)[reply]

Thank you so much for your helpful suggestions, and thanks for proposing a property that would solve this issue. I'll wait for the community to decide on the best way to handle this before proceeding any further. Gtsulab (talk) 20:23, 13 August 2019 (UTC)[reply]

GOA "P" (process) annotations on genes

Genes participate in biological processes so I was surprised when seeing the type constraint to proteins of biological process (P682). Maybe bots expect that annotations go into a WD object representing a gene together with its products? But if there exists a separate object for the gene then certainly biological process (P682) should be there as superset of all annotations of its peptide/protein products. So I'm adding "gene" to the type constraints. Does this make sense? --SCIdude (talk) 07:28, 7 August 2019 (UTC)[reply]

I suspect this is because biological process (P682) was derived from Gene Ontology biological processes, in which the "Annotations represent the normal functions of gene products" according to the principles of GO annotations. That said, I think it would make sense to expand the constraints to genes if there are GO annotations for long non-coding RNA genes and MicroRNA genes (like NEAT1 (Q18054071) or MIR155 (Q17553105)). Gtsulab (talk) 19:45, 7 August 2019 (UTC)[reply]

specifying aa position in a protein

There is no way at the moment to define the position and extent of anything that is a part of a protein. Generally, this could be one amino acid (aa), an aa chain, or a set of non-overlapping aa chains. The reason for this generality is to be able to apply this to all of e.g. position of a mutation, span of protein domains, or position of cleaved peptides spanning one or more subchains. One important definition would be the start offset, which usually is 1 here, i.e. the number given to the first aa in a protein. The model would then have the use cases:

1. simple position property, e.g. "amino acid position" expecting an integer>0
2. single span
   a. "amino acid start position" expecting an integer>0
   b. "amino acid end position" expecting an integer>0
3. set of spans, i.e. multiple pairs of start/end

Examples

phenylalanine hydroxylase (Q420604):
    has part--->protein phosphorylation (Q7251493)
        amino acid position--->16
    has part--->ACT domain (Q24745293)
        amino acid start position--->36
        amino acid end position--->114
    protein variant associated with--->phenylketonuria (Q194041)
        amino acid position--->39

Of course, when the site is not on protein parts then it's all simple.
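
The three use cases above (single position, single span, set of spans) can be captured with one small data type. This is only a sketch of the proposed model under the stated start offset of 1; the class and function names are made up, and none of the three properties exists yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A run of amino acids, 1-based and inclusive on both ends."""
    start: int
    end: int

    def __post_init__(self):
        # Positions are integers > 0 and a span cannot end before it starts.
        if self.start < 1 or self.end < self.start:
            raise ValueError(f"invalid span {self.start}..{self.end}")


def position(pos):
    """Use case 1: a single amino acid is a span of length one."""
    return Span(pos, pos)


def validate_span_set(spans):
    """Use case 3: return the spans sorted by start, rejecting overlaps."""
    ordered = sorted(spans, key=lambda s: s.start)
    for previous, current in zip(ordered, ordered[1:]):
        if current.start <= previous.end:
            raise ValueError(f"spans overlap: {previous} and {current}")
    return ordered
```

In these terms, the phosphorylation site in the example is position(16) and the ACT domain is Span(36, 114).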

Challenge example: defining a disease related mutation on a subchain. NOTE that the mutation position is usually given as position on the prepro (here 48 on preproinsulin) but the prepro is not part of the peptide, so it makes more sense to give it as the position on the sub-chain:

insulin:
    posttranslational modification association with (P1917)--->type 2 diabetes mellitus
        amino acid position--->24
        applies to part (P518)--->insulin B chain

The semantic problem in the above is that the position is not really associated with the B chain, both are with the statement about insulin

Challenge example: defining a disulfide bond (Q423252) in preproinsulin (Q39798) (one of the bonds is between pos. 7 of the insulin A chain (Q50372833) and pos. 7 of the insulin B chain (Q56837827)):

insulin A chain:
    has part--->disulfide bond
        amino acid start position--->7
        connects with (P2789)--->insulin B chain
        amino acid end position--->7
(need not be reciprocally defined?)

The semantic problem here is that the end position is not really associated with the B chain, both are with the statement about the A chain. I'm not sure if these are real problems, or if they can be resolved at all.

So with three new properties we could define the coordinates of anything that is an aa in a protein, or an aa chain as part of a protein.

Comments? --SCIdude (talk) 09:02, 13 August 2019 (UTC) P.S. "protein variant associated with" does not exist yet either...[reply]

It seems the proposal has stalled. I think the best alternative now would be a proposal that is very broad, like "position in sequence" to also get good support. --SCIdude (talk) 08:23, 23 August 2019 (UTC)[reply]

TCDB import done / UniProt coverage

With the new TCDB property import, six thousand proteins now have such an ID, which is an exact correspondence, i.e. the TCDB entries mostly describe a single protein in UniProt. There are however 9,400 more in TCDB where the UniProt entry is not in Wikidata, e.g. P23586 from Arabidopsis. A quick search shows there are only two(!) protein items from A. thaliana with UniProt. Are there priorities for import at all? It looks like imports are "up for grabs" since there is lots from prokaryotes---which is good. --SCIdude (talk) 14:38, 6 September 2019 (UTC)[reply]

WD enzymatic activities are GO

Thumbs up to ProteinBoxBot.

Using the EC-->GO mapping: http://current.geneontology.org/ontology/external2go/ec2go (data version: 'releases/2019-07-01') I checked that all 5,162 enzyme-related GO function IDs have a WD item; each of them has the correct EC; 50 of them had several ECs; all checked, they are already in ec2go. This means the EC number statements of catalytic activity GO items are exactly as given in ec2go, i.e. they are canonical. --SCIdude (talk) 16:20, 11 September 2019 (UTC)[reply]
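
A check of this kind can be sketched as follows. It assumes the usual external2go line layout ("EC:number > GO:term name ; GO:id", with comment lines starting with "!"); the function names and the shape of the Wikidata-side input (GO ID mapped to a set of EC numbers) are assumptions for illustration and not the script actually used.

```python
from collections import defaultdict


def parse_ec2go(lines):
    """Return a dict mapping each GO ID to the set of EC numbers mapped to it."""
    go_to_ec = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and header comments
        left, separator, right = line.partition(" > ")
        if not separator:
            continue  # not a mapping line
        ec = left.strip()
        if ec.startswith("EC:"):
            ec = ec[3:]
        # The GO ID is the part after the last " ; ".
        go_id = right.rpartition(" ; ")[2].strip()
        go_to_ec[go_id].add(ec)
    return dict(go_to_ec)


def ec_mismatches(go_to_ec, wikidata_ec):
    """Return GO IDs whose EC numbers in Wikidata differ from the ec2go mapping."""
    return sorted(
        go_id for go_id, ecs in go_to_ec.items()
        if wikidata_ec.get(go_id, set()) != ecs
    )
```

An empty list from ec_mismatches corresponds to the result reported above, i.e. the EC statements on the GO items match ec2go exactly.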

EC is sparsely mapped in WD

Wikipedia articles, from which we get almost all enzyme family items, tend to favor terms that are widely used. Example: "oxidoreductase, acting on the CH-CH group of donors" is not one of them, although it is a main general enzyme family (EC:1.3). This naturally leads to a sparse mapping of the EC categorization (tree without leaf nodes):

├── 1 oxidoreductase (Q407479)
│   ├── 1 alcohol oxidoreductase (Q4713306)
│   └── 12 hydrogenase (Q424135)
├── 2 transferase (Q407355)
│   ├── 1 
│   │   ├── 1 methyltransferase (Q415875)
│   │   └── 4 amidinotransferase (Q68688747)
│   ├── 3 acyltransferases (Q2609152)
│   ├── 4 glycosyltransferases (Q67201373)
│   │   └── 1 hexosyltransferase (Q5749058)
│   ├── 6 
│   │   └── 1 transaminase (Q424288)
│   ├── 7 
│   │   └── 6 diphosphotransferase (Q5279763)
│   └── 8 
│       ├── 2 sulfotransferase subgroup (Q175950)
│       └── 3 CoA-transferase (Q68689639)
├── 3 hydrolase (Q96286)
│   ├── 1 esterase (Q418750)
│   │   ├── 1 carboxylesterase (Q409840)
│   │   ├── 3 phosphatase (Q422476)
│   │   ├── 4 phosphoric diester hydrolase (Q67202883)
│   │   └── 2 thioesterase, subgroup (Q7784664)
│   ├── 2 glycosidase (Q13527914)
│   │   └── 1 glycoside hydrolase superfamily (Q375795)
│   ├── 4 peptidase (Q212410)
│   │   ├── 22 cysteine protease (Q419343)
│   │   ├── 11 aminopeptidase (Q419527)
│   │   ├── 21 serine endopeptidase (Q420032)
│   │   ├── 24 metalloendopeptidase (Q6822865)
│   │   ├── 17 metalloexopeptidase (Q6822868)
│   │   └── 25 Threonine protease (Q7798075)
│   ├── 5 
│   │   ├── 1 amidohydrolases (Q4746164)
│   │   └── 2 amidohydrolases (Q4746164)
│   └── 6 
│       └── 4 helicase (Q138864)
├── 4 lyase (Q407727)
│   ├── 1 
│   │   └── 1 carboxy-lyases (Q417781)
│   └── 2 
│       └── 1 hydro-lyase (Q16915067)
├── 5 isomerase (Q118026)
│   └── 2 cis-trans isomerase (Q5122112)
├── 6 ligases (Q410221)
└── 7 transport protein (Q2449730)

Fortunately the GO enzymatic activities are complete in WD, so we have two different hierarchies. Ninety percent of the 5,000 families have a link to a GO enzymatic activity through molecular function (P680). I'm considering linking back using has cause (P828) if the link is one-to-one (i.e. exact). This would enable automatic creation of instance of (P31)/subclass of (P279) statements for single enzymes if they have such an enzymatic activity, finally placing single enzymes in the above hierarchy instead of just instance of (P31)/subclass of (P279)-->enzyme (Q8047) or protein (Q8054). I'm just not clear which of instance of (P31)/subclass of (P279) to use for this. --SCIdude (talk) 07:26, 29 September 2019 (UTC)[reply]
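The "one-to-one" condition can be made concrete with a small Python sketch. It assumes the family-to-GO links have already been exported as (family, GO activity) pairs; the function name and the placeholder identifiers are hypothetical.

```python
from collections import Counter


def one_to_one_links(pairs):
    """Keep only (family, go_activity) pairs where each side occurs exactly once."""
    unique_pairs = set(pairs)
    family_count = Counter(family for family, _ in unique_pairs)
    go_count = Counter(go for _, go in unique_pairs)
    return sorted(
        (family, go) for family, go in unique_pairs
        if family_count[family] == 1 and go_count[go] == 1
    )


# Placeholder identifiers, not real items.
EXAMPLE_PAIRS = [
    ("FAMILY_A", "GO_X"),   # exact: kept
    ("FAMILY_B", "GO_Y"),   # FAMILY_B also links to GO_Z: dropped
    ("FAMILY_B", "GO_Z"),
    ("FAMILY_C", "GO_W"),   # GO_W is shared with FAMILY_D: dropped
    ("FAMILY_D", "GO_W"),
]
```

Only the pairs returned here would be candidates for a back-link; everything else would need a manual decision.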

Correction of Wikidata descriptions of Wikipedia protein articles

Hi, about 95 % of the 2.4k protein articles on de.wp display the wd description 'gene', and i guess this is similar in en.wp. The description is displayed on Wikipedia articles in the desktop and mobile version, and in de.wp we sporadically receive complaints. There has been a discussion on this issue here: Wikidata_talk:WikiProject_Chemistry#Proteins. Can you help, with a bot run, to correct the wd item linking wp protein articles in different languages from gene to protein? Or change the description? All the best, --Ghilt (talk) 14:05, 6 November 2019 (UTC)[reply]

@Ghilt: Of course the description of gene items should not be changed. What has to be done is to move the item's sitelinks to a different (=protein) item, or to one item that covers everything that is covered in the WP article. This can be automated (QS can move sitelinks), so I think we just need to decide if the target item should be the protein, or a newly created item that can potentially cover more than gene + protein. As you can see from the insulin (Q70598743) example and from looking at the enwp and dewp articles, they are about gene, protein, preproteins, protein subchains, protein complex, protein family and superfamily, and pharmaceutical too. --SCIdude (talk) 14:34, 6 November 2019 (UTC)[reply]

Hmm, out of "gene, protein, preproteins, protein subchains, protein complex, protein family and superfamily, and pharmaceutical" only the gene is not a protein. But i can live with the gene+protein solution. --Ghilt (talk) 15:55, 6 November 2019 (UTC)[reply]

The pharmaceutical could be a mixture too, and still wikipedians will put it all in one article, you never know. What next? Actually there are 1093 such gene items on WD that encode proteins and link to dewp. In none of the cases is there a second sitelink on the protein. If there is no objection I'll move them to the proteins. PS: our estimates 1K/2.4K are quite different, please use this SPARQL query to see my list:

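# Gene items that encode a protein and have a German Wikipedia sitelink
SELECT ?g ?gLabel ?p ?pLabel WHERE {
   ?g wdt:P31 wd:Q7187 .          # instance of gene
   ?g wdt:P688 ?p .               # encodes protein ?p
   ?article schema:about ?g .     # sitelink attached to the gene item
   ?article schema:isPartOf <https://de.wikipedia.org/> .

   SERVICE wikibase:label {
      bd:serviceParam wikibase:language "de" .
   }
}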

PPS: to be exact I would move all sitelinks of all languages together if they all have no second sitelink on the protein. Also I would inspect the list before starting QS, there are genes with multiple protein fragments encoded. --SCIdude (talk) 16:31, 6 November 2019 (UTC)[reply]
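The selection rule in the PPS can be sketched in Python as follows. This is only an illustration under assumptions: the input is taken to be a dict already built from query results, and the function name, dict layout and identifiers are hypothetical.

```python
def split_candidates(genes):
    """Split gene items into movable ones and ones needing manual review.

    genes: {gene_id: {"sitelinks": [wiki, ...],
                      "proteins": {protein_id: [wiki, ...]}}}
    """
    movable, review = [], []
    for gene_id, info in sorted(genes.items()):
        if not info["sitelinks"]:
            continue  # nothing to move
        proteins = info["proteins"]
        if any(proteins.values()):
            continue  # a protein already carries a sitelink: leave untouched
        if len(proteins) == 1:
            # Exactly one encoded protein without sitelinks: move everything.
            movable.append((gene_id, next(iter(proteins))))
        elif len(proteins) > 1:
            # Several encoded protein fragments: inspect by hand first.
            review.append(gene_id)
    return movable, review


# Placeholder identifiers, not real items.
EXAMPLE_GENES = {
    "GENE_1": {"sitelinks": ["dewiki"], "proteins": {"PROT_1": []}},
    "GENE_2": {"sitelinks": ["dewiki", "enwiki"], "proteins": {"PROT_2": ["enwiki"]}},
    "GENE_3": {"sitelinks": ["dewiki"], "proteins": {"PROT_3A": [], "PROT_3B": []}},
    "GENE_4": {"sitelinks": [], "proteins": {"PROT_4": []}},
}
```

The "movable" list would feed the batch move, while the "review" list corresponds to the genes with multiple encoded fragments mentioned above.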

For some historical context on the mapping of Wikipedia pages to Wikidata, you can find the previous discussions here. Note there was also more extensive discussion in Wikipedia on how the Wikipedia pages should be created and what should go in them. Gtsulab (talk) 18:58, 6 November 2019 (UTC)[reply]

@Ghilt: There were actually only 828 candidates, the process is running now and is documented here. So <5% of gene items with sitelinks are affected if I'm counting correctly, enwiki has much greater coverage than dewiki. --SCIdude (talk) 16:08, 7 November 2019 (UTC)[reply]

What i forgot to mention: Insulin is somewhat of an exception, since its enwp article actually contains information on the gene, whereas by far most enwp protein articles don't. But hmm, where does the discrepancy 828 via SPARQL vs. 2.4K via PetScan come from? By the way, the difference in coverage between enwp and dewp is that most protein articles in enwp were made by ProteinBoxBot, which copied the text from NCBI. There is no German-language protein database. But any change away from 'gene' towards 'protein' would be great. --Ghilt (talk) 18:23, 7 November 2019 (UTC)[reply]

Thanks for correcting! The discrepancy stems from the fact that a majority of protein articles don't have a description on wd. --Ghilt (talk) 21:42, 7 November 2019 (UTC)[reply]

Manuscript: Wikidata as a FAIR knowledge graph for the life sciences

Dear all: You may have seen that we recently published a preprint entitled "Wikidata as a FAIR knowledge graph for the life sciences". This manuscript was primarily spearheaded by the Gene Wiki team, which has been active in data modeling and data ingestion for a variety of biomedical resources.

Our goal was to write a manuscript that educated the general biological community about Wikidata and to drive more growth and participation. To do this, we selected and described a series of scientific vignettes -- identifier translation, integrative biomedical SPARQL queries, crowdsourced curation, Wikidata-backed application development, and phenotype-based disease diagnosis. Those vignettes were based on our own areas of interest as well as our guess at what would appeal to our target audience.

Of course, there are many possible vignettes that could fit under the broad title we chose. As a matter of practicality, we could not include them all while still creating a final product of reasonable length and focus.

However, upon further reflection and discussion with colleagues, we realized that while the selection of vignettes needed to be somewhat limited, the manuscript should reflect a more complete and inclusive representation of the people behind the larger movement, including those that worked on aspects that weren't directly highlighted as vignettes. Therefore, we'd like to invite anyone to add their name to the author list or acknowledgements by adding their name to Wikidata:WikiProject Molecular biology/FAIR_knowledge_graph. Note that due to journal policies, all authors must still meet the ICMJE standards, but interpreted according to the broadly-defined title of the manuscript. (That broader scope might also be summarized by the class-level diagram shown at right, which is included as Figure 1 in the manuscript.)

Finally, this message is being cross-posted to many places. We will monitor replies at Wikidata talk:WikiProject Molecular biology, or please {{Ping}} me to notify me of replies or discussion elsewhere. Best, Andrew Su (talk) 22:41, 18 December 2019 (UTC)[reply]

canonical databases

Please note the addition of Wikidata:WikiProject_Molecular_biology/Properties#Main_classes_and_their_canonical_database. --SCIdude (talk) 17:05, 18 January 2020 (UTC)[reply]

Wikidata at the Bioinformatics Community Conference 2020?

What do people think about going to the community-guided conference day at the Bioinformatics Community Conference?
I've drafted a suggestion for some wikidata activities in the communal conference planning document.
Would others be interested in attending? Could be an ideal outreach opportunity to a community with closely aligned goals and interests. T.Shafee(evo&evo) (talk) 12:06, 11 May 2020 (UTC)[reply]

bulk statement deletion

There are 750k items having both instance and subclass of protein statements. Almost all were created by a defective bot in 2019. This is an announcement that I intend to delete all of these subclass statements because

a single protein is not a class
having wrong subclass statements impedes any computational analysis of the database

Voice your opinion now, please.

Ref.: https://www.wikidata.org/wiki/User:SCIdude/Protein_bugs#instance_+_subclass_of_protein --SCIdude (talk) 09:42, 20 August 2020 (UTC)[reply]

Just a quick reply to request that we prioritize deliberation and consensus over speed in having this discussion. This topic has been raised before with reasonable arguments on both sides. The inclusion of both statements was not due to a defect but a conscious choice, and it was made at least as far back as 2017. As with everything here, we should absolutely consider proposals to improve past work. But I just don't want to move too quickly to action, because ping-ponging between the two options will, I think, not serve anyone. I'll try to dig up some of the previous discussions... Best, Andrew Su (talk) 15:31, 20 August 2020 (UTC)[reply]

I'm sure we've had on-wiki discussions as well (still looking) but I found this thread on the mailing list (starts in Sep 2014, note also continues into Oct 2014). Just as a starting point... (EDIT: going to convert to a running list below that I'll add to as I find them, and invite others to do the same...)

Best, Andrew Su (talk) 04:04, 21 August 2020 (UTC)[reply]

The links, I'm sorry to say, have no relevant arguments.

The mailing list discussion about chemicals (small molecules) has been resolved in the meantime by the Wikiproject Chemistry, as they have now a standard way to create such items (see Wikidata:WikiProject_Chemistry/Guidelines). While this shouldn't concern us, please note they do not have subclasses on compounds, only on classes of compounds.
The second link argues for having instance statements on every gene/protein. No question.
The situation with diseases is completely different. Science is frequently finding subclasses for diseases which up to that point appeared monolithic. Does not apply to proteins with defined sequence.

I'm a bit disappointed that the links you gave did not contribute to the matter at hand, but I'm waiting over the weekend, nevertheless. --SCIdude (talk) 06:40, 21 August 2020 (UTC)[reply]

Perhaps I should state more clearly what is planned:

every single protein (not family or group) that has a subclass-of-protein statement will have it removed. It will still have the instance-of-protein statement.
what if you want to express membership of a protein? See how our bot does it: Q7240673#P361 (Insulin). Isn't that consensus, too? --SCIdude (talk) 06:51, 21 August 2020 (UTC)[reply]
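The planned change can be sketched in a few lines of Python. This assumes the relevant statements have already been exported into a dict per item; the function name and data layout are hypothetical, and the sketch does not itself distinguish families or groups beyond requiring an instance-of-protein statement.

```python
PROTEIN = "Q8054"  # protein


def subclass_statements_to_remove(items):
    """Return ids of items that have both P31 -> protein and P279 -> protein.

    items: {item_id: {"P31": set of ids, "P279": set of ids}}
    Only the P279 -> protein statement would be removed; P31 stays.
    """
    return sorted(
        item_id for item_id, claims in items.items()
        if PROTEIN in claims.get("P31", set())
        and PROTEIN in claims.get("P279", set())
    )


# Placeholder item ids for illustration.
EXAMPLE_ITEMS = {
    "ITEM_BOTH": {"P31": {"Q8054"}, "P279": {"Q8054"}},
    "ITEM_INSTANCE_ONLY": {"P31": {"Q8054"}, "P279": set()},
    "ITEM_SUBCLASS_ONLY": {"P31": set(), "P279": {"Q8054"}},
}
```

Items that only have the subclass statement (for example families or groups) are left alone by this filter.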

I'm more in favor of first redefining the right model. Actually, some ontologies seen as authoritative by the chemistry projects consider chemicals to be classes. Proteins are chemicals, and therefore having them as subclasses certainly has its arguments. I would even argue this applies for (human) proteins more than for chemicals, as human proteins are actually a group of related chemical structures. For example, DL-homocystine (Q58879461) actually is a subclass of chemical compound. As such, I think it makes sense to have any human protein be a subclass of protein. But I second the proposal to first work out the data model. --Egon Willighagen (talk) 16:51, 21 August 2020 (UTC)[reply]

DL-homocystine (Q58879461) is a group of compounds and should not be taken as a representation of the mixture of those compounds (which is DL-Homocystine (Q72482585) and NOT a subclass).

I am not per se against deleting this duplication of instance vs. subclass of. As Andrew pointed out already, this was not the case of a defective bot; it was a design choice. However, many sourced resources do not contain these bare ontological claims (to be a class or not to be a class), so that decision needs to be made at the design phase and, to be honest, it is not a clear-cut case that proteins can't be classes. On the contrary, ontologically speaking, a Wikidata item on a protein is not necessarily an instance of a protein, but more a record of an annotation of a concept that might reflect a protein. If I would take this position, I would also have to re-evaluate the use of found in taxon, for example. So the whole issue is not as simple as all proteins are instances. On the other part of that spectrum, if we both share that protein we do have two instances of the class of that protein. In short, I do recognise the problematic nature of the current schema where both instance of (P31) and subclass of (P279) co-exist, but I am not convinced that simply deleting all subclass statements will solve that problem. Clearly a better model is needed. And if we need a short-term solution I would argue to not remove the subclass of (P279) but the instance of (P31) statements, to reflect the notion that a specific protein is an annotation of a macromolecular concept that can have multiple instances. --Andrawaag (talk) 22:14, 21 August 2020 (UTC)[reply]

I would urge to create that schema so that we get consistency. There would be need for two classes, as drug proteins for example are well defined both in sequence and modification. --SCIdude (talk) 06:58, 22 August 2020 (UTC)[reply]

Defining polyamide (Q145273) as a second-order class (Q24017414) is problematic as polyamide (Q145273) is subclassing things like biological macromolecule (Q66560214) that aren't second-order class (Q24017414) but first-order class (Q104086571). We should likely have one first-order class and a second order class here linked via is metaclass for (P8225). Maybe the first-order class has the name "protein" and the second-order class the name "protein type" but other names would work as well. ChristianKl ❪✉❫ 15:57, 12 February 2021 (UTC)[reply]

Merge in WikiProject Bioinformatics?

The Wikidata:WikiProject Bioinformatics page contains some info that could be useful on this project page. It's also not hugely active. I've put a note over at its talkpage suggesting a merge. T.Shafee(evo&evo) (talk) 01:40, 27 August 2020 (UTC)[reply]

non-coding RNA items

This is a heads up about an upcoming bot run that will create around 100k items of the types:

This does not aim for completeness, the goal is to demerge those items that were in the past created as both gene and ncRNA. In particular, miRNA items are not created in this run, but their precursors. Maybe some other person wants to extend on this. Opinions? --SCIdude (talk) 18:25, 1 November 2020 (UTC)[reply]

@SCIdude: There is some history to this I would like to explain. Although I do understand your concerns, those statements are properly referenced and reflect a modelling decision which goes back to the early days when the bot was developed. The statements reflect how those genes are described in NCBI gene, which is the case to this day. E.g. BACE1-AS still has a claim that this is "Gene type ncRNA". So although I agree this needs remodelling, that decision reflects other downstream decisions, which is for example also tied into the issue of whether or not RNA IDs show up on gene items. Moving forward I would like to propose to start creating EntitySchemas for each of the above mentioned types. Currently the modelling choices of the human genes are described in EntitySchema:E37 where the different "gene types" are listed. Andrawaag (talk) 19:23, 2 November 2020 (UTC)[reply]

I agree with both Andrawaag's historical perspective, and with SCIdude's suggestion we revisit this modeling decision made long ago. While it might broaden the scope of SCIdude's original proposal, I suggest we include protein-coding gene (Q20747295) in the discussion here with the hopes we can maintain consistency in the data modeling, rather than having an inconsistent data model that depends on gene type. Best, Andrew Su (talk) 06:20, 3 November 2020 (UTC)[reply]

Thanks, it is a natural idea to subclass gene as in the Sequence Ontology, from which protein-coding gene (Q20747295) is derived. Accordingly I created ncRNA gene (Q101110906) from http://purl.obolibrary.org/obo/SO_0001263. Is there a problem with creating a ShEx for this, and just removing the line "wd:Q427087 # ncRNA" in the gene_types definition of E37? The bot would either no longer work on ncRNA gene items, or need a special implementation for them. If the ProteinBoxBot no longer updates these, the SCIdudebot needs to step in. Did I miss anything? --SCIdude (talk) 16:52, 3 November 2020 (UTC) PS: I just see E37 is for human genes only so there is no formal consistency for other taxa at the moment, anyway. --SCIdude (talk) 17:20, 3 November 2020 (UTC)[reply]

@Andrawaag:@Andrew Su: Please see E251 and E252 and the change to E37. Example items are Q18037560 and Q101126626. If you agree with this design then the above question about changes needs to be answered. Effectively, the mixing of types E251/252 needs to be stopped. Who then creates/changes existing items has to be decided. --SCIdude (talk) 18:07, 4 November 2020 (UTC)[reply]

@SCIdude: E251 and E252 look good to me, though I'm no ShEx expert. So I'd be fine moving forward with those models. My question is primarily whether we should move forward decoupling the gene and the transcript for ncRNAs without doing it for the other gene types. The advantage of that route is that things move quicker with less coordination of moving parts. The disadvantage is that the data models between the different gene types differ somewhat substantially. They would still be compliant with the ShEx of course, but I could imagine it could be confusing to a consumer of the data who hasn't seen this discussion... Best, Andrew Su (talk) 16:08, 9 November 2020 (UTC)[reply]

I agree. The situation is always preliminary as long as not all concepts are created. I would not advocate creating all those messenger RNAs. But reality might force us in specific cases. For example, a therapeutic may appear that binds to a specific mRNA and, from this point, an additional specialized shape for protein-coding genes will be necessary. --SCIdude (talk) 18:06, 9 November 2020 (UTC)[reply]

@SCIdude: like @Andrew Su: the schemas look good to me, and I am supposed to be a ShEx expert :). The question that remains is whether or not we should then also move forward and remove the statements that are currently added and in sync with the other models. My preferred next steps are to start implementing those models, BUT leave the current models in place, i.e. not deleting the statements added to Wikidata coming from the current/previous model. This will lead to redundant statements, but with correct references set, it is possible to distinguish between the different models. In a way, implementing these new models is like a breaking new feature in software development, which might have some disruptive effects in downstream tools. The next step would be to integrate both models into a more consistent model, but this will take some time and that is why we should accept some plurality in the models applied. But eventually we should aim for this normalization towards a consistent model describing genes and gene products where we can remove the statements from the gene items. As long as references and qualifiers are properly set, Wikidata supports such plurality in data models covered. --Andrawaag (talk) 19:41, 9 November 2020 (UTC)[reply]

@Andrawaag: You want to treat what is a software bug fix like an API release, I don't agree with this. Nevertheless, since the infrastructure to enforce shapes needs time to implement and, again, I'm the only one working on this apparently, I'm setting a warning time of three months from now. That should be enough to adapt downstream applications. --SCIdude (talk) 06:05, 10 November 2020 (UTC)[reply]

ShEx tab

There is now a ShEx tab on the project page. Please add any I might have missed. --SCIdude (talk) 17:13, 3 November 2020 (UTC)[reply]

peptide (Q172847)

Currently, peptide (Q172847) has different definitions in German and English. In German it's any amino acid chain; in English it's only short chains. If we want to go for the English definition then we need a new item that stands for any amino acid chain. ChristianKl ❪✉❫ 11:12, 11 February 2021 (UTC)[reply]

Does protein (Q8054) work for amino acid chains of any length? Best, Andrew Su (talk) 05:07, 12 February 2021 (UTC)[reply]

Alternatively, polypeptide (Q3084232) for the general item, with the subtypes protein and peptide that all other items are sorted into. There's obviously some edge cases and ambiguity at the small protein / large peptide end of (e.g. defensin (Q412820) sometimes referred to as proteins, or peptides), but it looks like it would work most of the time. T.Shafee(evo&evo) (talk) 05:49, 12 February 2021 (UTC)[reply]

Given how I understand the term protein, an amino acid chain with a length of 2 clearly isn't a protein.

polypeptide (Q3084232) currently has a description according to which it's for long, "continuous, and unbranched peptide chain". If we use it as a name we would have to change it.

We likely want one item for any amino acid chain, one for long ones and another one for short ones. We might also have more specific notions of long/short. ChristianKl ❪✉❫ 15:49, 12 February 2021 (UTC)[reply]

Design Q&A

I have added answers to design questions that might come up here: Properties#Modeling_questions. Please comment. --SCIdude (talk) 18:02, 1 March 2021 (UTC)[reply]

Complex Portal Bot

Hello all,

We are building a bot for connecting Complex Portal Information on Wikidata. We would love to have your opinion at Wikidata:Requests_for_permissions/Bot/ComplexPortalBot!

For a set of curated species, the bot will (in synchrony with Complex Portal curation)

Add items for macromolecular complexes absent on Wikidata
Add label, aliases, description and core statements (e.g., "instance of")
Link macromolecular complex items to their components via "part of" relations
Link macromolecular complex to Gene Ontology terms

For more information, details about the process are available at User:ProteinBoxBot/2020_complex_portal

Best, -- TiagoLubiana (talk) 19:38, 4 March 2021 (UTC)[reply]

Gene items missing P31 and P279

A large proportion of the items that have no instance of (P31) or subclass of (P279) have a description in the format "protein-coding gene in the species [species name]". Is it appropriate to add P31=gene (Q7187) and P279=protein-coding gene (Q20747295) to these in batches, or is it more complicated than that?

This is the search where these appear. - PKM (talk) 23:47, 4 October 2021 (UTC)[reply]

@PKM: There is indeed a substantial number of such items, and it is not so strange to add P31=gene (Q7187) and P279=protein-coding gene (Q20747295). I dived a bit into this and identified almost 3k such items. Of those, 292 were actually deprecated gene annotations. I wrote a bot for this, which 1. deprecates statements in case a gene annotation has been deprecated, except for Entrez Gene ID (P351), since this still resolves to an NCBI page describing the deprecation, or 2. adds instance of (P31) and subclass of (P279) to those items which still return gene annotations. I have run this bot on a few items and intend to run it on more, in subsequent batches, until all have been dealt with. I do this gradually to allow potential subsequent curation tasks. --Andrawaag (talk) 10:25, 12 October 2021 (UTC)[reply]
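The two branches described above could look roughly like this in Python. It is a sketch of the decision logic only, not the actual bot: the input layout, the function name and the placeholder property id are assumptions, and no Wikidata API calls are made.

```python
ENTREZ_GENE_ID = "P351"
GENE = "Q7187"
PROTEIN_CODING_GENE = "Q20747295"


def plan_edits(item):
    """Return a list of planned edits for one gene item.

    item: {"deprecated_in_ncbi": bool,
           "ncbi_sourced_props": [property id, ...],
           "P31": set of ids, "P279": set of ids}
    """
    if item["deprecated_in_ncbi"]:
        # Branch 1: deprecate NCBI-sourced statements, but keep the Entrez
        # Gene ID, which still resolves to a page describing the deprecation.
        return [
            ("deprecate", prop, None)
            for prop in item["ncbi_sourced_props"]
            if prop != ENTREZ_GENE_ID
        ]
    # Branch 2: the annotation is still live, so add the missing class statements.
    edits = []
    if GENE not in item["P31"]:
        edits.append(("add", "P31", GENE))
    if PROTEIN_CODING_GENE not in item["P279"]:
        edits.append(("add", "P279", PROTEIN_CODING_GENE))
    return edits
```

Returning a plan instead of editing directly makes it easy to run small batches and review them, in line with the gradual approach described above.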

I agree that items having a non-obsolete Entrez ID can automatically get P31/P279 added. --SCIdude (talk) 14:03, 12 October 2021 (UTC)[reply]

Thank you for writing that bot - much the best way to clean these up. - PKM (talk) 22:59, 14 October 2021 (UTC)[reply]

Suggestions for modeling drug to gene relationships?

I have a number of academic/expert-curated gene-disease, gene-drug, disease-phenotype relationships that were submitted as part of the Gene Wiki Review series Q28108851 that I'd like to add to Wikidata. While it has been relatively straightforward to add the gene-disease and disease-phenotype relationships, I'm unsure how to model some of the drug/gene relationships that have been provided by the authors. These relationships include:

5-azacytidine increases the expression of or translation of GABRB2
Efficacy of Loreclezole is reduced in mutations/variants of GABRB2
Oxycodone Reduces the expression of this gene or the translation of GABRB2

Suggestions on the best way to proceed? Gtsulab (talk) 20:50, 11 October 2021 (UTC)[reply]

@Gtsulab: the only things you can claim come from experimental papers, and the basic mechanics is that drugs bind to a target, which is modelled by physically interacts with (P129) and its subproperties. If the interaction partner is a protein, that protein takes part in biological processes which are already annotated by Gene Ontology biological process (P682) statements. Your examples:

azacitidine (Q416451) interacts with DNA methyltransferase 1 (Q5205796), RNA, and DNA. References can be found in https://go.drugbank.com/drugs/DB00928 see Targets.
loreclezole (Q6680282) acts as a GABAA receptor positive allosteric modulator (PMID 8183949 says enWP). For this interaction there is a property: positive allosteric modulator of (P3778), the interaction partner would be the protein complex subunits shown to interact in PMID 8183949, i.e. the human beta2 and beta3 subunits (just reading the abstract but you would need to read the paper too) which are Gamma-aminobutyric acid type A receptor subunit beta2 (Q21113314) and Gamma-aminobutyric acid type A receptor subunit beta3 (Q21113312), so you need two statements
oxycodone (Q407535) see https://go.drugbank.com/drugs/DB00497, interacts with opioid receptor mu 1 (Q5123372), opioid receptor kappa 1 (Q8083989), Opioid receptor delta 1 (Q21115334). I haven't found an experimental paper showing interactions affecting GABRB2 expression.

So you see, genes rarely interact, it's the proteins that do all the work. And we didn't even account for variants and necessary modifications. Have a look at reactome.org to see the big picture. --SCIdude (talk) 09:11, 12 October 2021 (UTC)[reply]
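As an illustration of the loreclezole case, the two statements could be prepared as QuickStatements V1 lines with a short Python helper. This is a sketch under assumptions: it uses the tab-separated V1 layout with an "S"-prefixed source property, and it cites the paper via PubMed ID (P698) in the reference, which is only one option; using stated in (P248) with the article's item may be preferable, but that item is not given here. The helper name is hypothetical.

```python
def qs_statement(subject, prop, target, pmid=None):
    """Build one QuickStatements V1 line: subject, property, value, optional source."""
    parts = [subject, prop, target]
    if pmid is not None:
        # Assumption: reference given as PubMed ID (P698), written as S698 in V1 syntax.
        parts += ["S698", f'"{pmid}"']
    return "\t".join(parts)


LORECLEZOLE = "Q6680282"
BETA_SUBUNITS = ["Q21113314", "Q21113312"]  # GABA-A receptor subunits beta2 and beta3

# One "positive allosteric modulator of" (P3778) statement per subunit.
LINES = [
    qs_statement(LORECLEZOLE, "P3778", subunit, pmid="8183949")
    for subunit in BETA_SUBUNITS
]
```

The same helper would work for physically interacts with (P129) statements, once the supporting paper has been read and identified.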

Thanks, @SCIdude, for the prompt response and suggestions! I really appreciate it. The way you've modeled the given examples make sense to me, but it does also suggest that we limit ourselves to only results from certain types of experimental protocols. If a paper is about observations on a gene expression array after exposure to a drug or toxicant, then the observable relationship may lack the additional mechanistic info. Perhaps this level of evidence is insufficient for Wikidata. Do you think I should omit it if I come across it in the future? Gtsulab (talk) 19:42, 12 October 2021 (UTC)[reply]

@Gtsulab: This level of evidence is insufficient in general. If X causes increased expression of mRNA coding for protein Y then you could model that, it just doesn't tell you what's really happening. X could bind directly to the Y gene promoter, it could bind to transcription factor Z1 which binds to the promoter, or Z2 which binds to Z1 etc., or it could block the repressor Z3 which then doesn't bind etc., or it could prevent some of these factors from being expressed etc.

But one real problem is that transcriptomics results often turn out to be irreproducible, and you would need meta-studies to reliably settle the question. But why do that if there is direct support for the actual mechanics? And there is lots of direct evidence that we don't have modeled. There is lots that even reactome.org hasn't covered, and they focus mainly on human.

But the real problem for WD in the long term would be the number of results from transcriptomics studies. I mean we are even holding back on adding papers. --SCIdude (talk) 06:44, 13 October 2021 (UTC)[reply]

Makes sense, thanks @SCIdude --Gtsulab (talk) 14:37, 14 October 2021 (UTC)[reply]

Update the modelling of protein families?

Currently, proteins are linked to their families by part of (P361), or in the inverse sense by has part(s) (P527), which looks like a bit of a stretch.

"Part of" and "has part" are used in many different contexts, often referring to physical partonomy relations (i.e. proteins as part of protein complexes).

Should we have a property "has molecular family" (or similar) linking a protein to its families?

TiagoLubiana (talk) 21:22, 19 May 2022 (UTC)[reply]

Hibernating Gene Wiki Bots.

Unfortunately, the Gene Wiki project currently has no funding. I will continue to monitor for community input and where possible curate upcoming issues, however I don't have the bandwidth to deal with the scale the different automated tasks can require. This is why we have put our Wikidata bots into hibernation. The different life science resources will not be updated until further notice.

The Wikidata Integrator currently has a thriving community that extends beyond the life-science use case. I will continue to maintain that library, and pull requests and/or issue reports remain more than welcome, but support for the Wikidata Integrator will be less than before. --Andrawaag (talk) 11:55, 1 June 2022 (UTC)[reply]

I will take over the Gene Ontology entity maintenance which was neglected. First create the 881 missing new items. SCIdude (talk) 08:11, 8 July 2022 (UTC)[reply]

Also, there were 3,600 missing subclass statements (i.e. is_a in GO). --SCIdude (talk) 16:58, 14 July 2022 (UTC)[reply]

New property proposal

Hey, letting you know of this new property proposal related to molecular biology: Wikidata:Property proposal/part of molecular family

Thanks! TiagoLubiana (talk) 20:03, 1 July 2022 (UTC)[reply]

UniProt item cleanup

Just a heads up that there will be some cleanup within and of our 622,751 items with UniProt accession.

For example 1) in Q13561329#P680 one can see duplicate claims with references that can be merged into a single claim, 2) some claims are no longer made in UniProt.

Some items themselves may have been obsoleted in UniProt, others need to be added.

These are the most obvious issues, others may be hidden. Any ideas? SCIdude (talk) 17:33, 6 October 2022 (UTC)[reply]
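Issue 1) amounts to grouping claims by value and combining their references. A minimal Python sketch, assuming claims have been exported as (value, references) tuples and ignoring qualifiers, which a real cleanup would also have to compare; the function name and placeholder values are hypothetical.

```python
def merge_duplicate_claims(claims):
    """Merge claims with the same value into one claim carrying all references.

    claims: list of (value, [reference, ...]) in item order.
    Returns a list of (value, [reference, ...]) with one entry per distinct value.
    """
    merged = {}
    for value, references in claims:
        bucket = merged.setdefault(value, [])
        for reference in references:
            if reference not in bucket:  # drop references that are exact repeats
                bucket.append(reference)
    return list(merged.items())


# Placeholder values and references for illustration.
EXAMPLE_CLAIMS = [
    ("GO_TERM_A", ["ref1"]),
    ("GO_TERM_B", ["ref2"]),
    ("GO_TERM_A", ["ref3", "ref1"]),
]
```

Claims that no longer appear in UniProt (issue 2) would need a separate comparison against a fresh export and are not handled here.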

@SCIdude: Sounds like the only feasible solution is to build a data pipeline from UniProt to Wikidata that's maintained by a Wikidata bot. I think all of UniProt's data can be sourced using SPARQL, so it seems feasible. Where are UniProt identifiers currently sourced from? Seppi333 (Insert 2¢) 05:28, 28 July 2023 (UTC)[reply]

@Andrawaag SCIdude (talk) 09:23, 28 July 2023 (UTC)

New property proposal: CIViC gene ID

See Wikidata:Property proposal/CIViC gene ID, comments are very welcome! TiagoLubiana (talk) 13:53, 10 October 2022 (UTC)

New property proposal: ICTV Tax-Id

Hello, Concerning the ICTV (International Committee on Taxonomy of Viruses): Wikidata still only knows the legacy ICTV ID linking to ICTVdb. ICTVdb was abolished a long time ago and is only available via the web archive, without any updates. The new ICTV Tax-Id does not exist at Wikidata yet. You can find it in the German template "Infobox Virus" (Virusbox), labeled "ICTV Taxon History". This is important because it lets you trace all the renaming and changes in the taxonomic classification performed by the ICTV. In the table provided by ICTV's Taxonomy Browser (https://ictv.global/taxonomy ) you'll find the link in the column on the far right. It is composed of a URL path (always the same) followed by the Tax-Id. For an example see https://ictv.global/taxonomy/taxondetails?taxnode_id=202215027 . Btw: the same link is provided in the far right column of the Excel table "Master Species List" (there only available for species). The link target provides full information about the taxon history: name and classification changes. In addition, there is a link to the proposals that initiated these changes. As this is usually a zip file containing a Word docx and an Excel xlsx file, the information cannot be found by web search engines. However, the docx contains important information about which virus strains belong to a species, etymology, phylogeny, etc. As these proposals have been accepted by the ICTV, this information appears to be reliable. The same proposal may be shared between taxa.

Translated from the German original: For the ICTV there is still only the ancient ICTV ID for the ICTVdb. That was abolished long ago and is only available in its old state via the web archive. The new ICTV Tax-Id apparently does not exist yet. It can be found in the German Virusbox under ICTV Taxon History, because there one can trace all the renamings and changes in the taxonomic classification. At the ICTV it is the link in the rightmost column of the Taxonomy Browser table (only the ID, without the path), see https://ictv.global/taxonomy . In the Excel table (Master Species List) it is also the rightmost column (there only available for species). Could this parameter be added?

Sample article in German WP providing the ICTV Tax-Id link (see last line in the box at upper right): de:Moumouvirus

Tax-Id Format: Number

Applicable: only for virus taxa from species up to realm. Ernsts (talk) 06:31, 13 May 2023 (UTC)

Hello @Ernsts: there is already a property, Property:P1076, which is not in use at the moment: not a single Wikidata item has this property as a statement.

From my point of view this property could be used for example in d:Q118322307 and d:Q118351767

After adding the formatter URL

https://ictv.global/taxonomy/taxondetails?taxnode_id=$1

in the property it might take a while (a day or so), until the ID is clickable in the objects.
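
As an illustration of how such a formatter URL turns a stored ID into a link, here is a small Python sketch. It assumes the usual convention that `$1` in the formatter URL is replaced by the identifier value; the function name is made up for this example, and the sample ID is the one from the example link given earlier in this thread.

```python
from urllib.parse import quote

# Formatter URL as proposed above; "$1" is the placeholder for the ID.
FORMATTER_URL = "https://ictv.global/taxonomy/taxondetails?taxnode_id=$1"


def expand_formatter_url(formatter_url, identifier):
    """Substitute the identifier for the "$1" placeholder (hypothetical helper)."""
    # Percent-encode the value so that unexpected characters cannot break the URL.
    return formatter_url.replace("$1", quote(str(identifier), safe=""))


if __name__ == "__main__":
    print(expand_formatter_url(FORMATTER_URL, 202215027))
```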

A related project is Wikidata:WikiProject Taxonomy

Properties can be found at

Template:Taxonomy properties
Property:P4628 - ICTV virus genome composition
Property:P1076 - ICTV virus ID

M2k~dewiki (talk) 19:17, 13 May 2023 (UTC)

Hello, Thanks for these details. In fact, if the ICTV virus ID is not used (any more) at this time, it may be (re-)used. However, any reference to the former ICTVdb must be removed, the syntax (format) has to be checked to see whether it meets the new requirements, and the path for the URL has to be implemented (as you mention). I'll try to assign some sample values soon.--Ernsts (talk) 19:38, 13 May 2023 (UTC)

Looks fine. I just changed the description to fit the new purpose/usage. So we have to wait about one day until the values are clickable. BTW, it would be great if the ICTV Tax-Id of the German boxes (if unique and just a number) could be imported to Wikidata in some way. But unfortunately I can't estimate how much effort that would take.--Ernsts (talk) 20:30, 13 May 2023 (UTC)

If there is a single numeric value (that is, only the ID) in the infobox for parameter ICTV_Tax in Infobox Virus, it should be possible to import the value using

https://pltools.toolforge.org/harvesttemplates/

de:Pyramimonas-orientalis-Virus currently has two values. M2k~dewiki (talk) 20:34, 13 May 2023 (UTC)

The genus is monotypic, so the lemma treats both genus and species. A lemma may also treat a collection of taxa of the same rank, such as de:Influenzavirus. However, these are exceptions. In most cases there is only a single number. For species only, there might be another way, by exploiting ICTV's Master Species List (MSL) at https://ictv.global/msl/current: go to the data sheet, where the rightmost column has the full link to the species' Taxon Details File. The correct line can be found via the scientific name. If Wikipedia has a scientific name matching an old MSL, those can be found at https://ictv.global/msl . All these ways may be used to fill gaps left by the earlier attempts. It is a question of cost and benefit how much effort one wants to invest. --Ernsts (talk) 22:33, 13 May 2023 (UTC)

Examples can be found at User:M2k~dewiki/Tools/Enrich_Objects M2k~dewiki (talk) 20:35, 13 May 2023 (UTC)

Seems to be a rather mighty tool :-). I am unfortunately still busy creating some articles on virus taxa officially confirmed with the latest MSL of 8 April, plus an article on host amoeba, see en:Discussion:Mimiviridae#ReorgImiter. I'm afraid I don't have enough time to delve further into the matter. Could you please make the import from the de WP? If you have any questions, please do not hesitate to contact me! I'd be glad to have a look at the completed form before the action starts (by the way, I had added the ICTV_Tax parameter to the template myself some years ago, maybe you have already seen it). So long!--Ernsts (talk) 22:33, 13 May 2023 (UTC)

@Ernsts: the link for imports is:

https://pltools.toolforge.org/harvesttemplates/?siteid=de&project=wikipedia&namespace=0&p=P1076&template=Infobox%20Virus&parameters=ICTV_Tax&depth=1&constraints=Q21502404%7CQ52004125%7CQ19474404%7CQ21502410%7CQ21503247&alreadyset=1&wikisyntax=1

To start the import you need to click on "Login" before the first run. Afterwards you can click on "Load" + "Start".

While trying to import there are some format violations, for example:

Just as expected, collections of human pathogen viruses/species of a higher taxon (genus or family)--Ernsts (talk) 23:25, 13 May 2023 (UTC)

as well as Constraint violation: item requires statement violation

I did not mention: an article about a virus-caused disease with a section about the virus, which includes the box.--Ernsts (talk) 23:25, 13 May 2023 (UTC)

A current list with all objects with IDs can be found at:

https://w.wiki/6hby

M2k~dewiki (talk) 22:44, 13 May 2023 (UTC)

For de:Vesicular stomatitis Indiana virus and de:Zaire-Ebolavirus (for example) there is a distinct violation M2k~dewiki (talk) 23:00, 13 May 2023 (UTC)

Does the list contain the errors, the good cases, or all? I wonder why mimivirus is included, but not moumouvirus. Both boxes look fine. I cannot see the difference. Concerning the distinct violation, I don't see the problem; the box looks fine, too.--Ernsts (talk) 23:25, 13 May 2023 (UTC)

By default, HarvestTemplates only imports values that do not violate a constraint, so it does not insert an ID into two objects, or an ID that is not numeric (format violation), etc. So the list only contains values that have been imported without error. M2k~dewiki (talk) 23:29, 13 May 2023 (UTC)
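
The filtering described above can be sketched in Python as follows. This is only an illustration of the two checks mentioned (numeric format and one ID per object); it is not HarvestTemplates' actual code, and the function name, the input layout, and the sample values in the demo are all assumptions.

```python
import re
from collections import Counter

# Assumption: a valid ID consists of ASCII digits only ("Tax-Id Format: Number").
NUMERIC_ID = re.compile(r"[0-9]+")


def split_importable(candidates):
    """Split (article, raw_value) pairs into importable and rejected entries.

    An entry is rejected if the value is not purely numeric (format violation)
    or if the same value appears for more than one article (distinct-values
    violation). Rejected entries carry the reason as a third element.
    """
    counts = Counter(value.strip() for _, value in candidates)
    importable, rejected = [], []
    for article, value in candidates:
        value = value.strip()
        if not NUMERIC_ID.fullmatch(value):
            rejected.append((article, value, "format violation"))
        elif counts[value] > 1:
            rejected.append((article, value, "distinct-values violation"))
        else:
            importable.append((article, value))
    return importable, rejected


if __name__ == "__main__":
    # Hypothetical infobox values, for demonstration only.
    sample = [
        ("Article A", "202215027"),
        ("Article B", "111, 222"),  # two values in one parameter
        ("Article C", "333"),
        ("Article D", "333"),       # same ID as Article C
    ]
    ok, bad = split_importable(sample)
    print("importable:", ok)
    print("rejected:", bad)
```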

Looked at it again, perfect!!! It doesn't get any better than that. I added the two cases with distinct violation by hand; the first one had a wrong ID in the de box, hence the double assignment. Many thanks and good night!--Ernsts (talk) 23:54, 13 May 2023 (UTC)

Currently there are about 160 entries that could not be imported. If you click on the HarvestTemplates link, after an initial "Login" plus "Load" + "Start" you will get a complete list of errors. M2k~dewiki (talk) 23:57, 13 May 2023 (UTC)

I did. Among the "errors" are articles without an ID, articles about diseases with a downstream box in the section about the pathogen, and those with several IDs. In the latter case, you could look at Wikidata for the taxon rank (genus, for example), and then in the template parameter for the ID before (genus) to get even more out of it. But that is a lot of effort with little benefit. Is it possible to evaluate an Excel list like the MSL? You could use the scientific name of a species to find the appropriate row, and the last column would contain the full link. But that is probably also a lot of work. P.S.: Because the ICTV keeps changing the scientific names (most recently on 8 April), especially to binary names (such as Homo sapiens), and this is not yet tracked in Wikidata, one would also have to import older versions of the MSL.--Ernsts (talk) 06:56, 14 May 2023 (UTC)

Please also see Wikidata:Properties for deletion/P1076 M2k~dewiki (talk) 17:02, 15 May 2023 (UTC)

@Ernsts: There is no "true" ID. Best use would be as reference URL (P854) as part of a reference. --Succu (talk) 19:59, 15 May 2023 (UTC) PS: I noticed MSL #38 was released. Another job for my bot. --Succu (talk) 19:59, 15 May 2023 (UTC)

Some_UMLS_CUI_(P2892)_concerns

https://www.wikidata.org/wiki/Property_talk:P2892#Some_UMLS_CUI_(P2892)_concerns Vladimir Alexiev (talk) 14:27, 21 June 2023 (UTC)

HGNC-UNIPROT ID pairs

Hi everyone.

For context, I operate a bot on Wikipedia (Wikipedia:Wikipedia:WikiProject Molecular Biology/Molecular and Cell Biology/Human protein-coding genes - current script) that programmatically writes/updates the current and complete list of human protein-coding genes using a dataset from the HGNC database, which is updated on a daily basis.

Following a discussion with User:Boghog at Wikipedia:Wikipedia:Articles for deletion/List of proteins in the human body (NB: the discussion here isn't that relevant to Wikidata), there seems to be some interest in expanding the lists my bot generates to include more information about the listed genes and encoded protein(s) using Wikidata as a data source (NB: the other alternative would be to source this data from UniProt). In order to do that, I need to access all data items on every human protein-coding gene with an HGNC ID and encoded protein with a UniProt ID on Wikidata that are listed in the HGNC dataset. Failing that, I can't perform a 1:1 merge of HGNC-UNIPROT ID pairs in the dataset I'd source from Wikidata with the HGNC dataset from the file I linked. Following a discussion at Wikipedia:User talk:Boghog, it's apparent that I'll need to program a Wikidata bot to unbork Wikidata before I can perform a 1:1 merge on HGNC-UNIPROT ID pairs. I figured I'd just give you all a heads-up here to let you know of the issues I've found and will need to fix if there's a consensus to expand the current list articles my Wikipedia bot generates. Seppi333 (Insert 2¢) 05:26, 28 July 2023 (UTC)
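
To make the 1:1 requirement concrete, here is a minimal Python sketch that flags ID pairs which would break a 1:1 merge. It is not the bot's script: the function name and the input layout (a list of `(hgnc_id, uniprot_id)` tuples, e.g. collected from Wikidata) are assumptions, and the IDs in the demo are placeholders.

```python
from collections import defaultdict


def find_non_one_to_one(pairs):
    """Return the HGNC IDs and UniProt IDs that map to more than one partner.

    pairs: iterable of (hgnc_id, uniprot_id) tuples (hypothetical layout).
    A 1:1 merge is only safe when both returned dicts are empty.
    """
    hgnc_to_uniprot = defaultdict(set)
    uniprot_to_hgnc = defaultdict(set)
    for hgnc_id, uniprot_id in pairs:
        hgnc_to_uniprot[hgnc_id].add(uniprot_id)
        uniprot_to_hgnc[uniprot_id].add(hgnc_id)
    ambiguous_hgnc = {h: sorted(u) for h, u in hgnc_to_uniprot.items() if len(u) > 1}
    ambiguous_uniprot = {u: sorted(h) for u, h in uniprot_to_hgnc.items() if len(h) > 1}
    return ambiguous_hgnc, ambiguous_uniprot


if __name__ == "__main__":
    # Placeholder identifiers, not real HGNC or UniProt records.
    demo = [
        ("HGNC:1", "P00001"),
        ("HGNC:2", "P00002"),
        ("HGNC:2", "P00003"),  # one gene, two proteins
        ("HGNC:3", "P00001"),  # one protein, two genes
    ]
    print(find_non_one_to_one(demo))
```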

protein (Q8054) status as instance of (P31) second-order class (Q24017414)

protein (Q8054) is stated to be instance of (P31) second-order class (Q24017414). This means that all instances of protein (Q8054) are classes all of whose instances are individuals. But there are many instances of protein (Q8054) that are also subclass of (P279) it, e.g., retrotransposon (Q413988). Then instances of these instances, e.g., diversity-generating retroelement (Q101438742), are also instances of protein (Q8054). But then protein (Q8054) can't be a second-order class.

What is the real situation? Is protein (Q8054) a variable-order class (Q23958852) or is it not the case that items like retrotransposon (Q413988) are both instances and subclasses of protein (Q8054)?

And what does it mean to be both an instance and a subclass of protein (Q8054)? Peter F. Patel-Schneider (talk) 21:46, 20 August 2023 (UTC)

@Peter F. Patel-Schneider Usually in ontologies "Protein" is just a class. The individual things you can count under a suitable microscope are all instances of proteins. Then retrotransposon (Q413988) is just a subclass of "protein", as a concrete instance of it is just … a protein.

Metaclasses should only appear if we want to classify subclasses of proteins. For example, if we want to tag the levels of amylase (Q17153) vs. Amylase proximal Dmel_CG18730 (Q29815901), the latter is definitely a subclass of the former, and it is a specific kind of subclass because a specific sequence defines it, whereas the more generic one is defined by the function of the molecules and encompasses a whole family of molecules with different chemical formulas.

In that hypothesis, the metaclasses could be "class of protein with definite function" versus "class of proteins with definite formula".

We might have

⟨ class of protein with definite formula ⟩ metasubclass of (P2445) ⟨ class of proteins with definite function ⟩

because usually the formula-defined classes will be lower in the hierarchy than the function-defined ones. author TomT0m / talk page 14:31, 21 August 2023 (UTC)

@TomT0m That's how I would have done it. But https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology/Problems/instance_and_subclass_of_same_class lists 751400 items that are both instances and subclasses of protein (Q8054). So I was wondering about the meaning of (this part of) life. Peter F. Patel-Schneider (talk) 16:08, 21 August 2023 (UTC)
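
For readers unfamiliar with the pattern being counted, here is a toy Python sketch of the check "item is both instance of (P31) and subclass of (P279) the same class". It works on an in-memory list of triples rather than on Wikidata itself; the function name and triple layout are assumptions, and the second item in the demo is a made-up placeholder.

```python
def instance_and_subclass_of_same(statements):
    """Find (item, class) pairs where the item is both P31 and P279 of the class.

    statements: iterable of (subject, property, object) string triples
    (hypothetical layout, not a Wikidata API format).
    """
    instance_of = set()
    subclass_of = set()
    for subject, prop, obj in statements:
        if prop == "P31":
            instance_of.add((subject, obj))
        elif prop == "P279":
            subclass_of.add((subject, obj))
    # An item is flagged when the same (item, class) pair occurs in both sets.
    return sorted(instance_of & subclass_of)


if __name__ == "__main__":
    toy = [
        ("Q413988", "P31", "Q8054"),    # pattern discussed in this thread
        ("Q413988", "P279", "Q8054"),
        ("Q_EXAMPLE", "P279", "Q8054"),  # placeholder: subclass only, not flagged
    ]
    print(instance_and_subclass_of_same(toy))
```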

A quick check with my classification.js script on amylase (Q17153) shows that there is a problem with chemical compound (Q11173), which is both a concrete and an abstract object: see this query of the subclass tree. This would be at least one of the problems here. It seems it’s both a subclass of "chemical compound" and "type of chemical compound", which is incorrect. author TomT0m / talk page 17:49, 21 August 2023 (UTC)

I see that @Swpb presumably tried to fix this by this edit, but I don’t think this is the right fix. It’s incorrect that "compound" is a subclass of "type of compound". Subclasses of compound might be instances of "type of compound", so they cannot be both instances and subclasses of it. And they are the right types of compound. author TomT0m / talk page 18:44, 21 August 2023 (UTC)

I removed the statement because it was leading to seemingly false inferences like 2,4-Dimethylbenzylamine (Q72469479) instance of (P31) abstract entity (Q7184903). I don't know much chemistry, but I'm pretty sure chemicals are physically real. Maybe chemical component (Q20026787) was supposed to be an instance of metaclass (Q19478619)? Swpb (talk) 20:32, 21 August 2023 (UTC)

@Swpb It is correct. Its instances, concrete molecules, are concrete, but the class itself is an abstract object, something we use to classify concrete objects. All classes are abstract ideas. Physical objects are concrete; classes are abstract objects we use to describe them. A specific cow is concrete, the "cow" class is abstract. A (strict) metaclass can be thought of as an object all of whose instances are abstract objects, for example. There is an issue if an item ends up being an instance of both a class of physical objects and a metaclass, and so both an abstract and a concrete thing, which is not possible and is here an indication that there is an ontology problem. author TomT0m / talk page 20:47, 21 August 2023 (UTC)

Please don't explain basic ontology terms to me, I'm familiar with them. It seems to me your application of them here disagrees with common practice. You could say house (Q3947) is just an "abstract" class of physical houses, but nevertheless, it is a subclass of physical object (Q223557), as are hundreds of thousands of other such items. I don't see how the "class" of molecules 2,4-Dimethylbenzylamine (Q72469479) is any different from the "class" of houses house (Q3947). house (Q3947) is understood to be a class by virtue of having a P279 statement on it; we don't state house (Q3947) instance of (P31) class (Q5127848). Please try to be succinct in your reply. Swpb (talk) 20:55, 21 August 2023 (UTC)

You’re right. The real problem is that "type of compound", my creation, is supposed to be a metaclass: its instances should be classes. So "chemical compound" cannot be a subclass of it. But it’s an old creation of mine, and it is a bit unclear how it should be used; it has no really good definition of what kind of compound it should be used on, like in my example of "class of proteins with definite function" in my first comment. So we should either find a good and useful definition for it, like "type of compound with a definite formula", or delete it altogether if we can’t find something it is useful for. I guess I created it when some people wanted to have an instance of (P31) value to find compound items on Wikidata. author TomT0m / talk page 21:10, 21 August 2023 (UTC)

BioCyc ID

Please consider adding the BioCyc ID for the reasons explained in a text at that wikilink. Maxim Masiutin (talk) 22:57, 19 November 2023 (UTC)

Tool for Gene Set Over-Representation Analysis using Wikidata and Wikipedia

Hi! I am developing, as part of my PhD, a tool based on Wikidata for doing gene-set enrichment and displaying the results with navigable Wikipedia pages.

It is meant as an exploratory, knowledge-building tool for bioinformaticians trying to make sense of gene lists.

Maybe it interests this group: https://wikiora.toolforge.org/

(feedback is welcome)

Cheers! TiagoLubiana (talk) 15:50, 16 July 2024 (UTC)
