Shortcut: WT:MB

Wikidata talk:WikiProject Molecular biology

species

I read "human is a priority". At the Wikidata level, shouldn't we avoid this type of approach and try to be open to all organisms from the beginning? Or you should rename the task force to "Human Molecular biology task force" ;-) --Chandres (talk) 14:12, 4 March 2013 (UTC)

Yes, I think this task force should eventually target all organisms. We suggested starting with human since that nicely overlaps with our interests and expertise. But we are also interested in the Long Tail of microbial genomes ([1]), so making sure the models/properties/tools are generalizable is definitely on our mind. And more to the point, if you and/or others are interested in other organisms, then by all means, let's actively move forward on multiple fronts! Cheers, Andrew Su (talk) 19:17, 4 March 2013 (UTC)

Let's get started?

Now that the string datatype has been implemented, seems like things are ready to move on getting some of our proposed properties approved. Anyone want to take the lead on proposing a few over on the property proposal page? Andrew Su (talk) 17:44, 7 March 2013 (UTC)

I had a little time and requested a couple of properties. All they need now is a few reviews. --Tobias1984 (talk) 10:39, 27 May 2013 (UTC)
See currently proposed properties. These properties require some specialized knowledge, but they are just string-type identifiers, and as such should not cause major structural problems. I think that if they have not been reviewed in a few days, they can safely be created. --Zolo (talk) 10:00, 29 May 2013 (UTC)
Although it would be nice if at least one person would check over the proposals and support them ;). Where are all the people of this task force? --Tobias1984 (talk) 21:14, 1 June 2013 (UTC)

Scope?

I read that you want to at least add data for 10k human genes, but we could easily imagine adding data for millions of genes in UniProt (for example). Is this the place to discuss the scope of the project? Cheers, --Dan Bolser (talk) 15:58, 12 March 2013 (UTC)

Hi Dan, absolutely, we want to scale up to all known genes and proteins in all organisms. Our local database here (a newer version of what's available at http://mygene.info) has loaded all ~8.7 million genes from NCBI's gene_info file ([2]), plus all known links to Ensembl, UniProt, PDB, GO, etc etc. So all of our infrastructure is converging on a complete loading of all knowledge. As a pilot project though, we are proposing to start with the 10k human genes that encompass the Gene Wiki, just because our team has interest and bandwidth to make sure all of that is high quality. As mentioned above, if there are eyeballs and hands to help us broaden our initial pilot, by all means, we're open to it... Cheers, Andrew Su (talk) 17:58, 12 March 2013 (UTC)

User:Ricordisamoa/AuthorityControl.js

As external links are probably important here, you may want to use and expand User:Ricordisamoa/AuthorityControl.js that automatically creates links for some string properties. --Zolo (talk) 10:15, 9 March 2013 (UTC)

Hi Zolo, just to confirm, the tool above only changes how string properties are rendered in the web interface, not how the data are represented in wikidata, correct? Assuming I'm understanding correctly, sounds great! Once we get a few of our proposed properties created, we'll add our identifiers. (For others, instructions on how to add this user script are at Wikidata:Tools#AuthorityControl.js.) Cheers, Andrew Su (talk) 23:04, 12 March 2013 (UTC)
Yes, it only changes the way things are displayed. In the longer term, it might make sense to store external IDs differently from other strings, as full statements with sources and qualifiers may not make much sense for them. Apparently, this is something the development team is thinking about, but if something gets done, it will probably not be in the next few months. --Zolo (talk) 08:15, 13 March 2013 (UTC)
Note: it is now called MediaWiki:Gadget-AuthorityControl.js and is activated by default for all users. --Zolo (talk) 17:31, 2 April 2013 (UTC)

Distinguishing between genes and proteins

On Wikipedia, proteins are covered in the same article as their corresponding gene; see BRCA1, APOE4, etc. What are others' thoughts on separating proteins and their corresponding genes into separate items on Wikidata, e.g. such that 'BRCA1' would be a subclass of gene, and 'breast cancer type 1 susceptibility protein' would be a subclass of protein? And how about distinguishing between homologous genes and proteins in different organisms? Emw (talk) 02:41, 19 March 2013 (UTC)

Great discussion point. With the Gene Wiki, we did make the decision (with en:WP:MCB) to lump all the data about the gene and the corresponding protein product(s) into a single page. That was mostly because most pages were sufficiently underdeveloped such that splitting them would have been non-productive fragmentation. I'm torn about whether to take this same strategy with Wikidata. On the one hand, splitting them into two items is probably the more "accurate" way of representing reality. However, I think consumers of the data generally won't care about the difference, so it would be one more step with doing integrative queries. For example, suppose I want to find all genes on chromosome 2 whose protein products have a kinase domain. If the gene and protein items were distinct, presumably it would require another "join" in the query. Anyway, very interested to hear others' thoughts... Cheers, Andrew Su (talk) 04:59, 19 March 2013 (UTC)
From a semantic point of view, separating genes from proteins really makes a lot of sense to me, especially since you can think of properties proteins have (function etc.) which do not really apply to genes. I can understand that people who want to use the data don't really care whether they look at a gene or a protein, but it doesn't seem like you would want to modify the raw data for that. Wouldn't it be possible to change the way genes and proteins are presented here on Wikidata through an extension or something like that? --TWillemsen (talk) 09:56, 25 May 2013 (UTC)
I am conflicted on this one. In this case, we have a natural immutable hierarchy that is a result of the central dogma (gene → protein → function). As Andrew has pointed out above, in the vast majority of cases, we have a single article about the gene and the protein encoded by that gene. This situation IMHO is unlikely to change since the subjects of genes and proteins are so interrelated. Separate gene and protein articles, where they exist, have slowly been merged over time. Furthermore, I am not aware of a single example of a gene/protein article that has been split into two. On the other hand, gene and protein are distinct entities ... Boghog (talk) 10:59, 23 June 2013 (UTC)
Hmm, the central dogma is wrong: there are a lot of examples where one gene can give different proteins (in terms of sequence, and sometimes even the name changes). You can also find examples of proteins with different functions depending on, for example, phosphorylation status. The rational solution would be to have the protein in the same item as the gene. But in the case of multiple proteins from one gene, or different functions (not multiple functions), it can be really difficult to link the different data. --Chandres (talk) 20:27, 23 June 2013 (UTC)
I also think that we can and should part ways with Wikipedia's structure in certain areas. We could for example manage the sitelinks in the gene items and create protein items that usually don't have sitelinks but can hold statements independent from the genes. Maybe we should have an RfC about this topic, and then we would have a guideline on how to manage similar cases. --Tobias1984 (talk) 21:00, 23 June 2013 (UTC)
Of course one gene can code for more than one protein (alternative splicing, post translational modification, etc.). This does not however invalidate the central dogma and the hierarchy (a one to many relationship). Boghog (talk) 21:05, 23 June 2013 (UTC)
Physically, the gene and proteins are separate entities, but I think it makes more sense to put them on one page. For the related properties it's clear if they belong to the gene or the protein. In an ideal world, we would have data on all the splice and PTM forms, but this is not the case and so pooling all the evidence (and perhaps adding qualifiers as to which form is meant) seems more manageable. MichaK (talk) 08:06, 24 June 2013 (UTC)
As some here have noted, genes and proteins are different things. They're strongly related, of course, but they're not the same thing. Wikidata is about things--its items are not encyclopedic articles. While it makes sense for Wikipedia to cover a gene and its proteins in a single article because of convenience for the humans that read the article, the same thing doesn't make sense in Wikidata. We can have as many items as we want, we can link them however we want, and we can display the data however we want. Wikidata's current user interface is not the final word, and better (and even domain-specific) interfaces can be built to display the data--in this case to put the data about genes and their proteins in one place. For an example of a different kind of Wikidata user interface, see the Reasonator. Silver hr (talk) 21:03, 24 June 2013 (UTC)
The two discussions about how Wikidata should handle gene-protein distinctions and how we should handle ortholog distinctions involve the same question. How should biological sequence information be partitioned on Wikidata? Should it be divided into many items such that one item represents one discrete type of biological entity, or divided into fewer items such that one item represents one gene product, including information like which entities the gene product derives from, etc.?
Separating genes and their encoded proteins into separate items, and separating those by which organisms express them, seems like it would be more unwieldy initially. For an article like reelin, it would entail at least four Wikidata items: one for the RELN gene in humans, one for the RELN gene in mice, one for the reelin protein in humans, and one for the reelin protein in mice. To be consistent with this approach, Wikidata would need separate items for RNA (in humans and mice). And even when an EC number mapped to only one gene product, Wikipedia would need a separate item for that enzyme class (see here and here for more detail). So Wikidata would need six items (and quite possibly more) to represent the information in one PBB template.
On the other hand, putting these distinct biological types into separate Wikidata items seems like it would make it easier to assign properties to each specific type. It's technically possible to distinguish between claims that apply to a gene or its product (or to its orthologs) in one large item, but that simply offloads the unwieldiness onto the claims. In other words, claims would need to be bloated with qualifiers in order to unambiguously specify which statement applied to which biological entity.
My impression is that it would be better to divide the articles in question such that one item represents one discrete type of biological entity. This seems like it might make the initial mapping of Wikidata items onto the PBB template less straightforward, but -- in time -- allow for greater expressivity about each biological entity within that template. Emw (talk) 23:38, 24 June 2013 (UTC)
Thank you for this summary, and you are right of course that the gene/protein question and the ortholog/paralog issue are highly related. It seems to me that the consensus is forming around separating all these conceptual entities into separate items (right?). I lean this way too, where my only hesitation comes from the fact that I'm largely ignorant of how the Wikidata querying system will work. Emw states that this design would "make the initial mapping of Wikidata items onto the PBB template less straightforward". I'm fine with "less straightforward" as long as it could be done (integrating data from multiple Wikidata items into a single Wikipedia template). Can someone more knowledgeable confirm that this is true? Cheers, Andrew Su (talk) 07:26, 25 June 2013 (UTC)
I forgot that I had a call with Denny this morning, so I asked the question I posed above. The answer is that currently only one-to-one mappings between wikipedia and wikidata are allowed (meaning we would not be able to import data from multiple wikidata items into a single wikipedia template), but that support for this feature is definitely planned and will likely come near the end of the year. So with that context, I think we separate out genes from proteins, and also the various orthologs. Long term that seems like the best solution to me. I'm going to make a specific proposal to this effect below... Cheers, Andrew Su (talk) 17:18, 25 June 2013 (UTC)
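To make the extra "join" mentioned earlier in this thread concrete, here is a minimal sketch in Python of the chromosome 2 / kinase domain query under the separate-items model. Everything in it is an assumption for illustration: the item names, the toy data, and the property names ("chromosome", "encodes", "domains") are hypothetical placeholders, not actual Wikidata items or properties, and this is not how the Wikidata query system works.

```python
# Hypothetical toy data. Item IDs and property names are placeholders,
# not real Wikidata items or properties.
genes = {
    "gene_A": {"chromosome": "2", "encodes": ["protein_A"]},
    "gene_B": {"chromosome": "2", "encodes": ["protein_B"]},
    "gene_C": {"chromosome": "7", "encodes": ["protein_C"]},
}
proteins = {
    "protein_A": {"domains": {"kinase"}},
    "protein_B": {"domains": {"SH2"}},
    "protein_C": {"domains": {"kinase"}},
}


def genes_with_protein_domain(genes, proteins, chromosome, domain):
    """Return gene IDs on `chromosome` with at least one product carrying `domain`."""
    hits = []
    for gene_id, gene in genes.items():
        if gene["chromosome"] != chromosome:
            continue
        # The extra "join": follow the gene -> protein link to reach domain data.
        if any(domain in proteins[p]["domains"] for p in gene["encodes"]):
            hits.append(gene_id)
    return sorted(hits)


result = genes_with_protein_domain(genes, proteins, "2", "kinase")
print(result)
```

With a single combined gene/protein item, the inner lookup through "encodes" would disappear; with separate items it is one additional hop per gene, which is the cost being weighed here.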

Human/mouse/... ID

How do we distinguish between IDs for humans and mice? Present Reelin P351 value is for humans, but the property does not describe that. Should a qualifier be used? (I know not much about non-human bioinformatics) — Finn Årup Nielsen (fnielsen) (talk) 14:55, 17 April 2013 (UTC)

Ooops, sorry I missed this post way back when. Yes, we need to better model how orthologs are handled. Personally, rather than putting it in a qualifier, I'd propose creating a separate topic for the mouse gene, and then relating them via a new property called "ortholog". There are too many functional differences to lump all orthologs into a single topic. Your thoughts? Cheers, Andrew Su (talk) 00:27, 25 May 2013 (UTC)
I just updated RELN (Q414043) with some information and I think that the invalid ID (P89) qualifier (mouse/human) works pretty well. We should still think about how the information should be sourced. --Tobias1984 (talk) 21:06, 13 June 2013 (UTC)
I'm still not 100% sure I agree with having the human and mouse genes represented in the same item. Many of a gene/protein's properties might be species-specific. For example suppose (completely hypothetical here) that reelin interacts with VLDL receptor in both human and mouse, but interacts with APP only in human. How would we model that? What happens if there are disagreements on what the true ortholog relationships should be? Again, I tend to favor creating a separate topic for each species-specific gene, and then linking them via an 'ortholog' property. Other thoughts? Cheers, Andrew Su (talk) 06:38, 21 June 2013 (UTC)
Different interaction in different species can be expressed with qualifiers, but it seems more tricky for disagreements over orthologs. I am no biologist, but I feel that separate items is a more manageable long-term solution. If so, should RELN (Q414043) be taken to mean "human reelin", or should "human reelin" have its own item ? I do not know if there would be much to say about reelin in general, perhaps things like its evolutionary history. --Zolo (talk) 07:12, 21 June 2013 (UTC)
Yes, I would agree that RELN (Q414043) would specifically refer to the human version (which would be noted with invalid ID (P89) and human (Q5)), and we would then create a new item for "mouse reelin". As far as evolutionary history, I think those reciprocal "ortholog" relationships can be encoded as statements on both items. Make sense? Cheers, Andrew Su (talk) 08:34, 21 June 2013 (UTC)
Hi Andrew! I made an example of how qualifiers could be used to show different interactions for different species (http://www.wikidata.org/w/index.php?title=Q4115189&oldid=51778169). The bottom two examples of VLDL protein show that additional qualifiers could be added to further describe the interaction (my weird example: Sandbox-item has physical interaction with VLDL protein in Mice, but only in 1980). --Tobias1984 (talk) 09:11, 21 June 2013 (UTC)
Hi Tobias, yes, point well taken that qualifiers could be added there too. My concern though is that almost all of the properties would end up having species-specific qualifiers. For example, in addition to the ones already shown, I think regulates (molecular biology) (P128), gene symbol (P353), and RefSeq Protein ID (P637) would all be candidates. In that case, it might be easier just to separate them. And on a philosophical level (even though I hate philosophical arguments), I tend to think that human reelin is in fact a different thing than mouse reelin. Your thoughts? Others' thoughts? Cheers, Andrew Su (talk) 16:52, 21 June 2013 (UTC)
The question of what would be the easier solution is difficult to answer at the moment. The problem has multiple dimensions too: what is easier for people to view and edit, what is easier for bots to edit, and what is easier to query? Just recently most people agreed that certain editions of books should have their own item. But then a query for "books written by author" will return all the editions that link to the author. That means that the query has to be more complicated than in the one-item solution. If there is something nature hates, it is parceling of information ;). Do you have time to look at Wikidata:Property_proposal/Term#RefSeq so the GSoC student can get up and running? --Tobias1984 (talk) 17:20, 21 June 2013 (UTC)
Yeah, good points. I'm personally not so concerned about the views/edits, but query complexity is definitely a downside to over-fragmentation. But I also worry that if we decide fragmentation is better later, it will be a big pain to "fix" things after we've already created 10000+ items. Ugh, no perfect solution... (And yes, I did add my support to Wikidata:Property_proposal/Term#RefSeq. Thanks for the reminder...) Cheers, Andrew Su (talk) 17:55, 21 June 2013 (UTC)
At the moment, I do not see what would be made harder to query. Presumably, most queries will be species-specific anyway.
Yes, I suppose it makes sense to link ortholog genes through an "ortholog" property, though of course, if we have many species, and link all ortholog pairs across all species, we get something almost as redundant as the old interwiki system that Wikidata is supposed to avoid ;). --Zolo (talk) 07:36, 23 June 2013 (UTC)
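As an illustration of the two options debated above, the following Python sketch encodes Andrew's "completely hypothetical" reelin example (VLDL receptor in both species, APP only in human) both ways. The dictionary layout and the "interacts with" and "ortholog" property names are assumptions made for this sketch, not the actual Wikidata data model or existing properties.

```python
# Model 1: a single item, with the species recorded as a qualifier on every
# statement. Data follows the hypothetical example from the discussion.
single_item = {
    "reelin": [
        {"property": "interacts with", "value": "VLDL receptor", "species": "human"},
        {"property": "interacts with", "value": "VLDL receptor", "species": "mouse"},
        {"property": "interacts with", "value": "APP", "species": "human"},
    ]
}

# Model 2: one item per species, linked by a hypothetical "ortholog" property.
separate_items = {
    "human reelin": {
        "interacts with": ["VLDL receptor", "APP"],
        "ortholog": ["mouse reelin"],
    },
    "mouse reelin": {
        "interacts with": ["VLDL receptor"],
        "ortholog": ["human reelin"],
    },
}


def interactions_single(statements, species):
    """Model 1: every lookup has to filter on the species qualifier."""
    return sorted(
        s["value"]
        for s in statements
        if s["property"] == "interacts with" and s["species"] == species
    )


def interactions_separate(items, item_id):
    """Model 2: the species is implied by the item itself."""
    return sorted(items[item_id]["interacts with"])


print(interactions_single(single_item["reelin"], "mouse"))
print(interactions_separate(separate_items, "human reelin"))
```

Both models can answer the same question; the difference is whether the species travels with each statement (Model 1) or with the item (Model 2), at the price of reciprocal "ortholog" links.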

Interesting discussion. I think this issue boils down to whether it is better to have a hierarchical vs relational data model for this type of data. A hierarchical model might make sense if the hierarchy were immutable and not subject to change over time. The ortholog hierarchy is a product of evolution, which implies that there may be some ambiguities. In cases where the corresponding orthologs within two species only have a single paralog within each of the species, there is no debate that these two proteins are orthologs. However, if there is more than one paralog of a given protein in either of the species, things can get complicated, particularly if the similarity between orthologs is comparable to the similarity between paralogs. I think looking at the HomoloGene database is instructive. Assignment of orthologs is based on a clustering algorithm. The exact assignment of orthologs can and does change slightly over time as more protein sequences from different organisms are added. Hence from a maintenance standpoint, as Andrew has already suggested, I think a relational model is better in this case (separate database entries for orthologs from different species linked by a new ortholog property). Boghog (talk) 09:41, 23 June 2013 (UTC)

Adding the paralog layer is important for the debate. In the future the system will be used for other species; when we start on the plant world, we will have orthologs, with paralogs, that have the same function in different localizations, or different functions in the same compartment. I really think that one entry per gene per species is more relevant in the long term. --Chandres (talk) 20:15, 23 June 2013 (UTC)
For small protein families, putting proteins of different species on one page might look feasible, but I don't think it scales well. So each protein should have its own page. Regarding the ortholog links: As we add more species, the number of necessary links explodes. I'm not really sure if there's a good representation of gene trees in the WikiData model. But in the beginning, adding mouse--human orthologs to the respective protein pages makes a lot of sense. MichaK (talk) 08:00, 24 June 2013 (UTC)
Andrew Su has alerted me to this discussion. We will have the Quest for Orthologs meeting in a few weeks, and I will try to get feedback then. In the meantime, my feeling is that creating ortholog groups makes the most sense: for a given taxonomic level (e.g., LCA of human and mouse) you get the orthologs plus any in-paralogs. This is notably implemented in a clear way in OMA; I'm sure that they can help you if needed. Marcrr (talk) 15:23, 1 July 2013 (UTC)
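The concern that the number of ortholog links "explodes" as species are added can be quantified with a small sketch. It assumes the simplest case, exactly one ortholog per species and no paralogs, and compares reciprocal pairwise "ortholog" statements against one membership statement per item pointing at a shared ortholog-group item. The function names and the species list are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pairwise_ortholog_statements(species):
    """Reciprocal pairwise links: two statements (A->B and B->A) per species pair."""
    pairs = list(combinations(species, 2))
    return 2 * len(pairs)


def group_membership_statements(species):
    """Ortholog-group model: one statement per species-specific item."""
    return len(species)


species = ["human", "mouse", "rat", "zebrafish", "fruit fly"]
print(pairwise_ortholog_statements(species))  # 20
print(group_membership_statements(species))   # 5
```

Pairwise linking grows quadratically, n(n-1) statements for n species, while group membership grows linearly. With paralogs in the picture the real counts would differ, but the scaling argument is the same.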

Hi, when you are done with this decision, could you add a word about it in Help:Modeling#Molecular_biology? This help page is precisely intended to index and discuss such modeling decisions (it probably will be split in the future). It would be nice to have some use-case items, to list relevant properties, and even to detail an example, and/or a link to the page on this project where this decision will be documented. TomT0m (talk) 14:03, 30 June 2013 (UTC)

Great idea. Though is there a formal way to describe a data model? We've so far been doing everything by example (e.g., RELN (Q414043)), but something more formal would certainly be good too... Cheers, Andrew Su (talk)
If we'd like to formalize the data model, I think a good approach would be to keep working out basic properties as we have been, then express that in OWL. OWL is the lingua franca for speaking about ontologies for the Semantic Web, which in my opinion is what we're building (often unwittingly) here and throughout Wikidata. Some relevant literature and resources on OWL and biological ontologies:
One of the high-level points I've taken from the literature is that the Gene Ontology's "is-a" relationship is modeled with the OWL property rdfs:subClassOf. Wikidata has a property explicitly based on that W3C recommendation: subclass of (P279). My initial impression is that it would work for all the gene and protein items to be created to have the statements "subclass of (P279) gene (Q7187)" and "subclass of (P279) protein (Q8054)". Emw (talk) 17:29, 30 June 2013 (UTC)
Good idea to try to formalize the model. I'll start a RfC to try to establish a common language on Wikidata on how to use instance of and subclass to establish a formal type model, this will be a support for this discussion. I have a few ideas but it might take a few days before it is really on. I think this will be a good point in the discussion you try to start for quite a while :) TomT0m (talk) 10:48, 1 July 2013 (UTC)
Concerning OWL and XML, please note that there is a workgroup of the Quest for Orthologs trying to establish standards to represent orthology relations: http://questfororthologs.org/standards Marcrr (talk) 15:25, 1 July 2013 (UTC)

Pathways

Hi everyone! New member of the community here (writing from the Amsterdam Wikimedia Hackathon!). I was wondering if I could propose some properties not only for genes and proteins, but also for pathways. I really think this could be a good addition. A lot of pathway data is already available, but not very structured. What are your opinions on this? If you think it is a good idea, I can create some examples here. TWillemsen (talk) 18:37, 24 May 2013 (UTC)

I think pathway information would be fantastic! And there are a few structured resources for pathways -- Pathway Commons, Wikipathways, KEGG, etc. You're probably already familiar with them, but just to be sure... Let us know how it goes! Cheers, Andrew Su (talk) 00:25, 25 May 2013 (UTC)
Yeah, I've worked with those pathways. But in my experience those resources are not really data-oriented, for good reasons of course. Anyway, I'll be proposing some properties for pathways here this weekend. I'm still very new to Wikidata, so please let me know if I don't follow guidelines :) --TWillemsen (talk) 09:41, 25 May 2013 (UTC)

Re: Support for Property Creations

Hi everyone! For the GSoC Gene Wiki project [3], we have proposed a set of properties [4] to capture the fields of the infobox [5]. Kindly take part in the property proposals [6] through your comments and/or support. --Chinmay26 (talk) 19:28, 18 June 2013 (UTC)

I can create the property as soon as there is more support. Maybe a lot of people don't have watch-list-email-notifications turned on. --Tobias1984 (talk) 10:15, 21 June 2013 (UTC)

Proposal for handling genes and proteins, and species-specific orthologs

In an attempt to summarize the consensus that I think we're reaching here and here, I propose (well, mostly reiterating and formalizing Emw's proposal) that the data from a single PBB template on Wikipedia be separated out into four Wikidata items: the human gene, the human protein, the mouse gene, and the mouse protein. Later, I think we can consider separating out the RNAs as well, but I don't think this is justified at the moment since there are few (if any) RNA-specific statements. Please lodge your support or opposition below... Cheers, Andrew Su (talk) 17:35, 25 June 2013 (UTC)

Support

Oppose

More comments

Please make sure you've reviewed the comments already made here and here.

Just to make sure it is clear, please note that User:Chinmay26 is a GSoC intern this summer. So even if you think this basic model will need to be tweaked in the future, getting consensus around the plan above will allow Chinmay to get started building the basic pieces of infrastructure for uploading and maintaining genomic data... Cheers, Andrew Su (talk) 17:35, 25 June 2013 (UTC)

Distinguishing enzymes and gene products

There's a question about the nature of EC number (P591): Property_talk:P591#Distinguishing_enzymes_and_gene_products. Feedback is welcome! Emw (talk) 00:37, 27 June 2013 (UTC)

Using RefSeq (P656)

I've mocked up the use of RefSeq (P656) over on reelin (Q13569356) and RELN (Q414043), but I'm not convinced this is the best way of structuring things. Any thoughts? Cheers, Andrew Su (talk) 20:58, 29 June 2013 (UTC)

Is there a reason we need GenBank accessions for Wikidata items? The GenBank accessions used to derive a given RefSeq accession are noted in the latter's 'COMMENT' field, see e.g. NM_011261. So my impression is that it's probably extraneous and unnecessary to include GenBank accessions in 'RNA ID' or 'Protein ID' claims, in which case those properties could be renamed to 'RefSeq RNA ID' and 'RefSeq protein ID' and we could do away with RefSeq (P656) (which is currently used as a qualifier). Emw (talk) 05:36, 30 June 2013 (UTC)
Yeah, I'm not 100% sure that we need it either. Perhaps for organisms that don't have strong RefSeq support. Perhaps we shouldn't worry about that case for now? Are there other sources of non-RefSeq RNA and protein sequences that are important? Where should we put the Ensembl transcript IDs (ENST*) and Ensembl protein IDs (ENSP*)? Not sure about this... Cheers, Andrew Su (talk) 15:49, 30 June 2013 (UTC)
I agree that we probably don't need to worry about organisms that don't have strong RefSeq support for now. If we'd like Ensembl IDs for transcripts and proteins, then I think it would make sense to have separate properties for each of those. Emw (talk) 16:56, 30 June 2013 (UTC)
(And shouldn't RELN (Q414043) and reelin (Q13569356) be switched? The Wikipedia article is about the protein, but the Wikidata item with the sitelinks is about the gene.) Emw (talk) 05:49, 30 June 2013 (UTC)
Well, here's where Wikipedia's semantic ambiguity makes things difficult. The WP article combines information about the gene and the corresponding protein. That was a conscious decision that we made at WP:MCB. Ultimately the infobox template for reelin will need to draw from all four reelin-related wikidata items (human/mouse gene/protein). So I think the link as it stands is fine, but certainly open to more discussion on how best to handle things... Cheers, Andrew Su (talk) 15:49, 30 June 2013 (UTC)
I also think that the items should be switched. Even though Wikipedia infoboxes are both about the gene and the protein, the textual content is primarily about the protein, and that is true for all languages. Actually fr:Reelin hardly even mentions the gene. --Zolo (talk) 22:00, 1 July 2013 (UTC)
I moved the sitelinks to Q13569356. --Zolo (talk) 15:31, 3 July 2013 (UTC)

New properties needed

Given the format that has been agreed upon, we need some additional properties:

  • ortholog of
  • encoded by and the symmetric property "encodes". Or just one of them ?
  • any others ?

--Zolo (talk) 15:34, 3 July 2013 (UTC)

I think we need "Taxonomy ID" to note that human (Q5) is 9606. And yes, I do think we should have the reciprocal links. The brother and sister properties are used on both ends of that relationship, so that's a similar situation, right? Cheers, Andrew Su (talk) 23:06, 3 July 2013 (UTC)
I added them to Wikidata:Property proposal/Term#Biochemistry and molecular biology / Biochemie und Molekularbiologie / Biochimie et biologie moléculaire. -Zolo (talk) 07:26, 4 July 2013 (UTC)

Sourcing requirements for bots

There's an RFC that's relevant for the GSoC project: Wikidata:Requests_for_comment/Sourcing_requirements_for_bots. Emw (talk) 17:18, 6 July 2013 (UTC)

gene, RNA, and protein identifiers

All, as we move forward with modeling gene and protein items, I want to be sure we have consensus. Right now, we are generally following a model that has database-specific properties:

However, I think there is an argument to have just three properties ("Gene ID", "RNA ID", and "Protein ID"), where the different types of identifiers are differentiated by the "Source". For example, RELN (Q414043) could have a property "Gene ID" --> "5649" (Source: National Center for Biotechnology Information (Q82494)), and also "Gene ID" --> "ENSG00000189056" (Source: Ensembl (Q1344256)). I tend to like the simplicity of this system because it will prevent explosion of the number of properties, especially as we move to other organisms (Flybase, Wormbase, Pombase, RGD, etc...). Thoughts? Cheers, Andrew Su (talk) 18:27, 9 July 2013 (UTC)
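A rough sketch of the two layouts being compared may help. It uses the RELN values quoted above ("5649" from NCBI, "ENSG00000189056" from Ensembl). The dictionary structure is an assumption for illustration only, and "Ensembl Gene ID" is a hypothetical property name standing in for a database-specific Ensembl property; only Entrez Gene ID (P351) is named in this discussion.

```python
# Layout A: one property per database.
# "Ensembl Gene ID" is a hypothetical property name used for illustration.
db_specific = {
    "Entrez Gene ID": "5649",
    "Ensembl Gene ID": "ENSG00000189056",
}

# Assumed mapping from each database-specific property to a "Source" item.
SOURCE_FOR_PROPERTY = {
    "Entrez Gene ID": "National Center for Biotechnology Information (Q82494)",
    "Ensembl Gene ID": "Ensembl (Q1344256)",
}


def to_generic(statements):
    """Layout B: a single "Gene ID" property, differentiated by its source."""
    return [
        {"property": "Gene ID", "value": value, "source": SOURCE_FOR_PROPERTY[prop]}
        for prop, value in statements.items()
    ]


def lookup(generic_statements, source):
    """Find the identifier(s) issued by a given source."""
    return [s["value"] for s in generic_statements if s["source"] == source]


generic = to_generic(db_specific)
print(lookup(generic, "Ensembl (Q1344256)"))
```

Layout B keeps the property count fixed as new databases (Flybase, Wormbase, and so on) are added, but every consumer then has to filter on the source to pick out a particular identifier.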

There is a RfC about this topic running right now: RfC:How to classify items. --Tobias1984 (talk) 19:43, 9 July 2013 (UTC)
Yes, in order to prevent an explosion of properties, I think it would make sense to have fewer, more fundamental properties. We would of course need to agree on a standard "base" name. For genes, the most natural would be Human Genome Project (Q192446) and proteins, UniProt ID (P352). Boghog (talk) 20:05, 9 July 2013 (UTC)
The Classification RFC linked above seems mostly unrelated to this discussion. That discussion concerns whether we should use many domain-specific "type of" properties or two such properties to construct instance relations and subsumption hierarchies. (I happen to think we should use the latter approach.) This discussion seems to be about how to handle identifier properties, not basic membership properties.
For identifier properties, there seems to be more precedent on Wikidata to have each separate ID as its own property, rather than grouping identifiers by type and then individuating them with qualifiers. For example, we've got several very popular properties VIAF identifier (P214), GND identifier (P227), LCNAF identifier (P244), NDL identifier (P349), and BnF identifier (P268). Each of those properties is really describing the same type of thing: an authority file ID. However, instead of having a single "authority ID" property with one qualifier per authority (e.g. OCLC, GND, LC, NDL, BNF), you can see in e.g. On the Origin of Species (Q20124) that each identifier has its own property.
That said, type-specific organization of properties as suggested by Andrew seems like it could be a good idea. Here are a few questions and comments:
  1. How would identifiers from different databases operated by the same organization be qualified? For example, the gene ID properties Entrez Gene ID (P351) and HomoloGene ID (P593) come from different databases -- Gene and HomoloGene -- operated by the same organization, National Center for Biotechnology Information (Q82494). So sourcing at least gene ID strictly by organization doesn't seem feasible. In cases where a biological sequence (e.g. a gene, RNA or protein) has multiple identifiers from the same organization, should that qualifier simply point to the database instead of the organization, for example Gene and HomoloGene (Q468215)? (The problem doesn't necessarily go away if we don't consider "HomoloGene ID" to be a "gene ID". If we wanted to represent both reelin's NCBI Gene ID 5649 and NCBI RefSeq ID NG_011877.1 with a "Gene ID" property, how would we do that?)
  2. Is having "RNA ID" and "Protein ID" on a gene item redundant with the proposed "encodes" property? This seems like it could have implications on how we organize our ID properties for biological sequence items.
I'll end my comment by pointing out that Andrew's proposal, or something like it, might enable an even simpler way of handling IDs. If we were to organize these sequence identifier properties by type, then we might be able to designate one "preferred ID" among the set of IDs for each property with Wikidata's upcoming claim ranking feature. This would allow the statement with the preferred ID to be shown to all users and displayed in Wikipedia infoboxes by default. The various ID properties are probably all of equivalent accuracy in themselves (so our usage of ranking would deviate slightly from the feature's official description), but this might be a nice added benefit. Emw (talk) 04:26, 10 July 2013 (UTC)
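To make the "preferred ID" idea concrete, here is a small Python sketch. It assumes three rank levels named preferred, normal and deprecated; the dictionary layout, the `best_statements` helper, and the choice of which ID is marked preferred are all hypothetical and only for illustration.

```python
# Lower number means "shown first"; deprecated statements are never shown.
RANK_ORDER = {"preferred": 0, "normal": 1, "deprecated": 2}


def best_statements(statements):
    """Return the statements that share the best available rank."""
    usable = [s for s in statements if s["rank"] != "deprecated"]
    if not usable:
        return []
    top = min(RANK_ORDER[s["rank"]] for s in usable)
    return [s for s in usable if RANK_ORDER[s["rank"]] == top]


# Hypothetical ranking: which identifier should be preferred is undecided.
gene_ids = [
    {"value": "5649", "source": "NCBI Gene", "rank": "preferred"},
    {"value": "ENSG00000189056", "source": "Ensembl", "rank": "normal"},
]

print([s["value"] for s in best_statements(gene_ids)])
```

An infobox could then display only what `best_statements` returns, falling back to all normal-rank IDs when nothing is marked preferred.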
A trivial solution to question #1 above is to assign "source" to a specific database instead of an organization. For example:
Thanks for pointing out the "claim ranking" feature which looks very useful. To rephrase what I stated above, IMHO the "preferred id" sources for genes and proteins should be Human Genome Project (Q192446) and UniProt ID (P352) respectively. It is not so clear what the "preferred id" would be for mRNA however. Boghog (talk) 09:32, 10 July 2013 (UTC)
Sorry for the "slow" reply (only relative to you all!)... Just to get caught up quickly in bullet point style...
  1. Tobias, the linked RfC is interesting indeed. However, I personally find it to be too abstract. I'd propose that we focus on coming up with the best solution for this particular corner of Wikidata, and worry about the implications/relationship to the rest of Wikidata later. Otherwise, we run the risk of paralysis. Sound reasonable?
  2. I agree with Boghog that we should just create an item for every unique source.
  3. Boghog, I'm not sure I 100% understand what you mean by "base name" / "preferred ID". Especially as it relates to Human Genome Project (Q192446). Can you clarify?
  4. Regardless, I think Boghog and Emw are supportive of the alternate plan above. Tobias (and anyone else who's interested), do you have any objections or refinements? In tangible terms, I think the game plan would involve:
  • creating a "Gene ID" property (we already have RefSeq RNA ID (P639) and RefSeq Protein ID (P637))
  • creating items for any database identifier providers that don't already exist (to start, "HUGO Gene Nomenclature Committee (HGNC)", "NCBI Entrez Gene")
  • migrate all the uses of the DB-specific properties (e.g., UniProt ID (P352), HGNC ID (P354)) to the new system.
  • eventually propose deletion of the DB-specific properties
Any thoughts/refinements/dissent? Cheers, Andrew Su (talk) 01:18, 11 July 2013 (UTC)
A few comments:
  • I think we should consider how the proposed "encodes" property relates to this alternate plan. It seems the alternate plan would have statements for "Gene ID", "RNA ID" and "Protein ID" in each gene item. However, isn't that information redundant with "encodes"?
  • Since Human Genome Project (Q192446) isn't a sequence database I don't think it makes sense to use it as a source for identifier properties. (Yes, HGP is a high-level project that has generated much of the underlying biological sequence data for the human genome, but it's not the entity asserting that, say, reelin has any particular identifier.)
I agree with Boghog's statement that these ID properties should be sourced to a biological database, not an organization. I think for our purposes we can consider items with Wikipedia pages using 'Infobox biodatabase' to be valid sources for these ID properties. Emw (talk) 02:07, 11 July 2013 (UTC)
Responding to Andrew's question above, what I meant by "base name" / "preferred ID" is for situations where multiple databases provide an equivalent data field (e.g., gene name), and where the value (e.g., reelin) may not be identical between databases. Therefore we should indicate which database provides preferred values for a given data field. For example, many databases provide gene names (HUGO, NCBI gene, etc.). Furthermore HUGO provides an approved gene name, which most other databases including NCBI gene replicate. However not all databases may use the currently approved HUGO gene name. Hence the need to specify a "preferred ID" (if I have interpreted Emw correctly). Does that make sense? Boghog (talk) 04:31, 11 July 2013 (UTC)
Regarding the "Encodes" / "Encoded by" proposed properties, I still think those are relevant for linking the gene item to the protein item. I didn't mean to suggest that statements for "Gene ID", "RNA ID" and "Protein ID" would all appear in the gene items. Rather, I think "Gene ID" would show up on gene items, and "Protein ID" would show up on protein items. (RNAs of course are the sticky one. I'd propose that in general, "RNA ID" shows up under the gene object, unless the RNA has a defined function in which case we break it out as its own item. But these would be edge cases.) Everything else you both stated makes sense to me, and I agree... Andrew Su (talk)
Thanks for the clarification -- that works for me. One very minor note, though: if "Gene ID" would only be used on gene items and "Protein ID" on protein items, then would it be simpler to just say "Sequence ID" when referring to the ID of the current gene or protein item? This seems like it would be similar to the approach most sequence databases take. For example, when referring to the ID of the "current" sequence on a given record, RefSeq, GenBank and UniProt simply say "accession" for the sequence ID, rather than "gene accession" or "protein accession" (see here, here and here). If this seems like it has notable drawbacks, then "Gene ID" and "Protein ID" seem fine to me. Emw (talk) 05:32, 11 July 2013 (UTC)


There are basically two extremes Wikidata could use (correct me if I'm wrong)

  • (1) Use a property "ID" for all identifiers.
      • Pro: Very few properties
      • Contra: No lists for constraint violations, no way of finding out if each item has every identifier, hard to query for Wikipedia infoboxes, hard to construct links to those databases
  • (2) Create properties for all identifiers.
      • Pro: Lists of constraint violations, lists that show if each item has each property, easy for Wikipedias to get infobox information, easy to construct a URL from "base-URL" + "identifier"
      • Contra: Very many properties
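Two of the advantages listed for option (2) can be sketched in a few lines of Python. This is only an illustration: the `BASE_URL` mapping and both helper functions are hypothetical, and the only URL pattern used is the NCBI Gene one that appears elsewhere on this page.

```python
# Base URLs keyed by property ID. The mapping structure is an assumption;
# only the NCBI Gene URL pattern is taken from this page.
BASE_URL = {
    "P351": "http://www.ncbi.nlm.nih.gov/gene/",  # Entrez Gene ID
}


def link_for(prop, identifier):
    """Build an external link as base URL + identifier, or None if unmapped."""
    base = BASE_URL.get(prop)
    return base + identifier if base is not None else None


def items_missing(items, prop):
    """List item IDs that have no value for `prop` (a completeness check)."""
    return sorted(qid for qid, claims in items.items() if prop not in claims)


items = {
    "Q414043": {"P351": "5649"},  # RELN
    "Q14358793": {},              # left empty here purely as an example
}

print(link_for("P351", items["Q414043"]["P351"]))
print(items_missing(items, "P351"))
```

With a single generic "ID" property, both functions would first need to inspect a qualifier on every statement, which is the extra cost described under option (1).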

A possible solution would be to allow for properties to be nested too. So all the gene properties could be a subclass of "Gene ID" and "GeneID & RNA-ID & ProteinID" would be a subclass of "sequence ID" and "sequence ID" would be a subclass of "ID". I don't know if there are any plans to implement this here, but I think somebody once said in the ProjectChat that that was the way SemanticWeb handles these problems.
I personally think that the identifiers are not that important. The true potential of Wikidata is the item and number datatype that will allow us to create an unbelievable mesh of interlinked data that will in the end be more important than the 50+ identifier-properties that each item will receive sooner or later. --Tobias1984 (talk) 09:59, 11 July 2013 (UTC)
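The nesting idea above could look roughly like the following Python sketch. Nothing here reflects an implemented Wikidata feature: the `PARENT` hierarchy and the helper names are hypothetical and simply mirror the "Gene ID" under "sequence ID" under "ID" example.

```python
# Hypothetical property hierarchy, stored as child -> parent.
PARENT = {
    "Entrez Gene ID": "Gene ID",
    "Ensembl Gene ID": "Gene ID",
    "Gene ID": "sequence ID",
    "RNA ID": "sequence ID",
    "Protein ID": "sequence ID",
    "sequence ID": "ID",
}


def is_subproperty_of(prop, ancestor):
    """True if `prop` equals `ancestor` or reaches it by following PARENT."""
    while prop is not None:
        if prop == ancestor:
            return True
        prop = PARENT.get(prop)
    return False


def values_under(claims, ancestor):
    """Collect values of every claim whose property falls under `ancestor`."""
    return [value for prop, value in claims if is_subproperty_of(prop, ancestor)]


claims = [("Entrez Gene ID", "5649"), ("Ensembl Gene ID", "ENSG00000189056")]
print(values_under(claims, "Gene ID"))
```

This keeps database-specific properties (and their constraint reports) while still allowing a query for "any gene identifier".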

Tobias, you raise some good points about the limitations of the proposed system. I think (hope) some of them will end up being non-issues, but they are issues at the moment. So since this is not an undeniably positive move, let's just continue with the status quo and use database-specific properties. We can always make a change later. To help things move along then, I will:
I think that will cover all the identifier properties needed for the current gene infoboxes. Please discuss more if anyone disagrees with any of these changes! Cheers, Andrew Su (talk) 16:30, 11 July 2013 (UTC)

Gene and protein labels and descriptions[edit]

Since we're talking about identifiers for genes and proteins, I thought it'd be fitting to also discuss their labels and descriptions.

The proposal / guideline at Help:Label#Disambiguation says "When an article title includes disambiguation in it, either by placing it after a comma or by placing it in parenthesis, the disambiguation should be left out. Disambiguation information should instead be placed in the description field". Some of our items don't follow this styling, e.g. we've got items labeled "reelin (human gene)", "reelin (human protein)" and "Reln (mouse gene)".

Proposed label and description format:

  • Human genes:
      Label: HGNC gene symbol, e.g. RELN
      Description: human gene
  • Mouse genes:
      Label: MGI gene symbol, e.g. Reln
      Description: mouse gene
  • Human proteins:
      Label: HGNC full name, e.g. reelin
      Description: human protein
  • Mouse proteins:
      Label: MGI name, e.g. reelin
      Description: mouse protein

For convenience, the HGNC entry for RELN is here and the MGI entry for Reln is here. What do others think? Emw (talk) 03:22, 11 July 2013 (UTC)
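For anyone scripting this, the proposed format can be expressed as a short Python function. This is a sketch only; the function name and its parameters are hypothetical and not part of any existing bot.

```python
def label_and_description(species, kind, symbol, full_name):
    """Return (label, description) following the format proposed above.

    Genes are labeled with the gene symbol and proteins with the full name.
    The description is lowercase and carries the disambiguation.
    """
    if kind not in ("gene", "protein"):
        raise ValueError("kind must be 'gene' or 'protein'")
    label = symbol if kind == "gene" else full_name
    return label, f"{species.lower()} {kind}"


print(label_and_description("human", "gene", "RELN", "reelin"))
print(label_and_description("mouse", "protein", "Reln", "reelin"))
```

The label never contains a parenthetical such as "(human gene)"; that information lives only in the description, as Help:Label#Disambiguation asks.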

I worry a bit that the label won't be interpretable to a non-scientist, but this is a pretty minor worry. The item is really defined by its statements, so the label and description (I think) are really there just for convenience only... So bottom line, I like this proposal... Cheers, Andrew Su (talk) 05:05, 11 July 2013 (UTC)

Hi, I'm not sure if this topic has been discussed before. The few times I made gene or protein databases for projects I used ENSEMBL IDs as primary identifiers as they are usually quite complete and convenient to parse from the data files. However, one problem with ENSEMBL IDs is important to keep in mind. Such an ID is rather meaningless without knowing which ENSEMBL database *version* it belongs to. ENSEMBL IDs change quite frequently across database versions, and keeping a local database up to date, so that a stored ID still refers to the gene/protein it was originally assigned to, isn't trivial (although Ensembl maintains tables which record any such changes). I don't see any mention of a 'version' with the ENSEMBL IDs in Wikidata. The 'source' info of the 'ENSEMBL GENE ID' property links only to 'Ensembl' in general, not a specific database version, and the ID itself also links to the general Ensembl entry, which is the latest version. Would that imply that there is a bot that regularly updates the ENSEMBL IDs in Wikidata to resolve any conflicts that arise? Cheers, Optimale (talk) 11:42, 6 August 2013 (UTC)

Genome assembly database?[edit]

Hi,

I'm collecting some data on WP here: http://en.wikipedia.org/wiki/List_of_sequenced_plant_genomes

Can we store all those values in WD? --Dan Bolser (talk) 13:48, 10 July 2013 (UTC)

Yes! Putting genome metadata onto Wikidata is a great idea. Each genome should probably be represented as an item. How are these genomes represented, e.g. are most of them assemblies, sequence maps, or?
Here's a possible mapping for fields in the table at list of sequenced plant genomes; some based on relevant existing properties:
Organism strain: invalid ID (P89) (perhaps we should create a property "strain" to support assigning sub-species information through qualifiers)
Family: this is extraneous, since we should be able to deduce it from the above field
Relevance: maybe unnecessary?
Genome size: we should probably use a generic "length" property with units "Mbp" (or whichever order of magnitude of "bp" is most appropriate)
Number of genes predicted: we might want to propose a new property for this
Organization: new property, like above
Year of completion: I would suggest using publication (P577) to specify the most precise date possible
Assembly status: what does this mean?
@Emw: Status of the assembly using a controlled vocabulary, described here: wikipedia:Talk:List_of_sequenced_plant_genomes --Dan Bolser (talk) 12:03, 20 May 2014 (UTC)
There are more interesting genome properties to consider, but this is a start. Emw (talk) 02:50, 11 July 2013 (UTC)
Agreed, these would be cool data to add! In addition, I can think of two things that might be nice to include that aren't already in your table. First is the sequence identifier for the sequenced genome. Second is the NCBI Taxonomy ID (P685). Possible to add those columns? Cheers, Andrew Su (talk) 04:53, 11 July 2013 (UTC)
Hello! I started work on this today with lots of help from User:Magnus Manske (I was a bit clueless before). We jumped in and proposed one of the properties in the table (and suggested by User:Emw too), here: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Natural_science#Genome_size
I totally agree that GCA and tax_id would be valuable properties to add to the table (in addition I'm planning some form of quality ontology). Are there already properties for these two identifiers? Cheers, --Dan Bolser (talk) 12:39, 15 April 2014 (UTC)

property support or comments please[edit]

Hello all, there are a few proposed properties that could use your support or comments please. Starting with this link, the properties in question are "Encoded by", "Ensembl Transcript ID", and "Ensembl Protein ID". Please contribute your thoughts! Cheers, Andrew Su (talk) 18:01, 16 July 2013 (UTC)

I've supported those properties, which all make sense to me. Are there properties that the GSoC project needs but doesn't have yet? Emw (talk) 03:22, 17 July 2013 (UTC)
In general we still need a lot of item-properties that link genes and proteins to other items. For example "mutation in gene 1 causes disease A". We could also have more links to neurological and physiological functions. --Tobias1984 (talk) 08:33, 17 July 2013 (UTC)
I think new proposals can be done on a separate WD:Property proposal/Biology or WD:Property proposal/Science, as WD:Property proposal/Term is annoyingly slow to load, and so diverse that there does not seem to be much point in having everything together. --Zolo (talk) 10:36, 17 July 2013 (UTC)
This discussion could take place in project chat or in a new RfC, and be generalised ... TomT0m (talk) 10:52, 17 July 2013 (UTC)
The reasoning behind the limited amount of subpages is that people should actually also review properties from other scientific fields and find properties that overlap. But I can see the problem with the page being huge at the moment (usually 100+ proposals since March). Maybe we should make a subpage for biology and life sciences? --Tobias1984 (talk) 10:56, 17 July 2013 (UTC)
To avoid a huge number of proposals, I would support grouping properties if they make sense together or if they are similar, and voting for a group of properties instead of property by property. TomT0m (talk) 11:02, 17 July 2013 (UTC)
I trimmed down the page to about 80 proposals. That should help with the load times. It would help to get some votes for Wikidata:Property_proposal/Term#Medicine_.2F_Medizin_.2F_M.C3.A9decine. --Tobias1984 (talk) 20:30, 17 July 2013 (UTC)

Infobox neuron is next. Please vote for the 6 proposals at Wikidata:Property_proposal/Term#presynaptic_connection_.28afferent.29 --Tobias1984 (talk) 09:39, 23 July 2013 (UTC)

Did these properties get accepted? If so can you link? Cheers, --Dan Bolser (talk) 12:42, 15 April 2014 (UTC)
@Dan Bolser: A hopefully complete list of the existing properties is at Wikidata:WikiProject_Molecular_biology/Properties Tobias1984 (talk) 15:24, 15 April 2014 (UTC)

links between genes/proteins, diseases, and drugs[edit]

As Tobias mentioned above, it would be great if we could start thinking about how to model the properties that link diseases with genes or proteins. I'd also add links to/from drugs as well. Let me splash down a few thoughts on each of those edge types:

  • gene-disease: primary data source would be OMIM, which has ~4000 links. The relationship type ("mutation causes", "trinucleotide repeat expansion causes", "translocation causes", etc.) looks like it can be parsed out with some effort.
  • gene-drug: primary source would be Drugbank, which has somewhere between 2700 and 10,000 links. No apparent relationship types.
  • disease-drug: primary source would be NDF-RT, which has ~55,000 links. Semantic types include "may_treat" (48000), "may_prevent" (6000), "may_diagnose" (1000), and induces (700).

Questions I would pose to this community are: 1) are there additional/better data sources, and 2) how detailed or generic do we want the properties to be? Cheers, Andrew Su (talk) 23:39, 18 July 2013 (UTC)
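To help frame question 2, here is a minimal Python sketch of typed edges between genes, diseases and drugs. The relation names "mutation causes", "may_treat" and "may_prevent" follow the comment above, while "targets", all entity names, and the `drugs_for_gene` helper are made-up placeholders with no biological meaning.

```python
from collections import defaultdict

# (subject, relation, object) triples. Entity names are placeholders.
EDGES = [
    ("gene_A", "mutation causes", "disease_X"),
    ("drug_1", "may_treat", "disease_X"),
    ("drug_2", "may_prevent", "disease_X"),
    ("drug_1", "targets", "gene_B"),  # assumed label; no relation type was apparent
]


def drugs_for_gene(edges, gene, drug_relations=("may_treat", "may_prevent")):
    """Find drugs linked to `gene` indirectly through a shared disease."""
    diseases = {obj for subj, _rel, obj in edges if subj == gene}
    found = defaultdict(set)
    for subj, rel, obj in edges:
        if rel in drug_relations and obj in diseases:
            found[obj].add((subj, rel))
    return {disease: sorted(pairs) for disease, pairs in found.items()}


print(drugs_for_gene(EDGES, "gene_A"))
```

Keeping the relation type on each edge lets a query stay generic ("any drug relation") or become specific ("may_treat only") without changing the stored data.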

Linking genotypes, diseases and drugs would be a great idea. There are many efforts ongoing to link together topics in medicine and genetics in ontologies, and I think doing so in Wikidata will be one of the project's more compelling applications. I'm interested in linking genotypes and phenotypes -- especially diseases. For that, in addition to OMIM, there are also relevant resources in the Human Phenotype Ontology (HPO), MedGen and ClinVar.
HPO has annotations of all the clinical entries in OMIM, and includes gene-to-phenotype mappings. Importantly, HPO also puts this information into an ontology available in OWL, the standard language for representing ontologies: see HPO downloads, hp.owl. We might even be able to import HPO en masse to Wikidata, where it could be linked to other important domains beyond genetics and disease classification.
MedGen is a new project to aggregate information for medical genetics (docs). A MedGen search for RELN shows how the project can map gene names to a page incorporating information from HPO, OMIM, and relevant structured data on clinical features, phenotype hierarchies, etc. Further, if we're interested in how genetic variants are associated with clinically significant phenotypes, ClinVar can tell us that -- see for example the ClinVar entry for RELN. Emw (talk) 03:20, 19 July 2013 (UTC)
Fantastic, thanks for the feedback! A few notes on each of the resources you mentioned:
  • Human Phenotype Ontology: We've actually had quite a few offline discussions with the HPO group, and there is a mutual commitment to work together. I think they would argue that HPO does not deal with diseases per se, but more with the observable clinical symptoms that patients have. So we absolutely were going to link diseases (as indexed by the Disease Ontology) with the related phenotypes (in HPO). Since phenotypes uniquely apply to diseases, I was going to wait on that until after we had our initial gene/protein - disease - drug triangle set up.
  • MedGen: Excellent suggestion! A quick scan of this download file shows that they have almost 6000 links between genes and diseases. My guess is that it largely overlaps OMIM, but even if it only made OMIM available in a more easily-parsable format, that would be a big win. One nice thing is that MedGen also tracks the PMIDs for each link here.
  • ClinVar: This file has 7826 unique links between genes and diseases, nicely parsed via Entrez Gene ID and UMLS CUIs. Again, might largely overlap OMIM, but it's definitely worth adding to the mix! Obviously ClinVar also annotates specific genetic variants. I think annotation of specific variants will be something that we tackle in the future, but for the moment I think we will have our hands full with just the gene-disease links.
Thanks for the great ideas! (And agreed, I'm very confident that getting all these data into Wikidata will lead to many biological killer apps...) Cheers, Andrew Su (talk) 23:08, 19 July 2013 (UTC)
That roadmap sounds really exciting Andrew and Emw. I think it would also be interesting to crosslink and use http://humanmetabolism.org/ as a source. --Tobias1984 (talk) 09:37, 23 July 2013 (UTC)
Just wanted to point out that the ontology of the HPO can be used for patients, diseases, variants... There is already a rich collection of disease-phenotype annotations available on the HPO site (mostly OMIM, but also ORPHANET and DECIPHER). Cmungall (talk) 22:51, 5 February 2015 (UTC)

New ProteinBoxBot Bot flag request[edit]

Hi everyone! I have submitted a request for bot flag here. The test runs are here. Kindly chime in with your thoughts. Chinmay26 (talk) 18:05, 19 July 2013 (UTC)

A week has passed. Ymblanter is waiting for final comments so the bot can be approved. --Tobias1984 (talk) 15:35, 26 July 2013 (UTC)
If you have experience with bots, please also review Wikidata:Requests_for_permissions/Bot/Chembot. We have a lot of overlapping properties with the chemistry task force. --Tobias1984 (talk) 14:22, 29 July 2013 (UTC)

ProteinBoxBot edits[edit]

Hi, I have run ProteinBoxBot for the first 3 proteins under human proteins. The bot does not handle EC Classification, Gene_atlas image, or Alias yet. Sample item: www.wikidata.org/wiki/Q411507. The bot also does not update the appropriate qualifiers yet (working on it). I just wanted to run this by the community to confirm/clarify if there are any issues regarding the edits. Chinmay26 (talk) 22:32, 29 July 2013 (UTC)

Thanks Chinmay. I looked over ProteinBoxBot's recent contributions and it seems like things are coming along. A few comments on Cyp21a1 (Q14358793) that generalize to other gene/protein items:
  • The description field reads "Mouse Gene". This should be lowercase -- "mouse gene". Same for "human gene", "human protein", and "mouse protein", since none involve proper nouns.
  • The RefSeq RNA ID (P639) claims should only be used for RefSeq accessions. Per the "Distinguishing Features" section of http://www.ncbi.nlm.nih.gov/refseq/about/, RefSeq accessions all contain underscores -- for example, NM_009995 is a RefSeq accession, but AI323066 is not (it's a GenBank accession).
Below is a template I would suggest we use for these various RefSeq properties. The "valid accession prefixes" constraints are derived from the official table mapping RefSeq accession numbers to molecule types.
| Property | Valid accession prefixes | Molecule type | Should be used on Wikidata items that are subclasses of... | Example usage in reelin items |
| --- | --- | --- | --- | --- |
| RefSeq (P656) | NG_, NT_, NC_, AW_, NW_, NS_, NZ_ | genomic DNA | gene (Q7187) | RELN (Q414043) RefSeq (P656) NG_011877.1 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NG_") |
| RefSeq RNA ID (P639) | NM_, NR_, XM_, XR_ | RNA | gene (Q7187) | RELN (Q414043) RefSeq RNA ID (P639) NM_005045.3 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NM_") |
| RefSeq Protein ID (P637) | NP_, AP_, YP_, XP_, ZP_ | protein | protein (Q8054) | reelin (Q13569356) RefSeq Protein ID (P637) NP_005036.2 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NP_") |
Overall things seem to be making good progress! Emw (talk) 03:48, 30 July 2013 (UTC)
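The prefix constraints in the table above could be checked by a bot before it writes a claim. Here is a minimal Python sketch; the `VALID_PREFIXES` mapping and `is_valid_accession` are hypothetical names, and the prefix lists are copied from the table as posted rather than re-checked against NCBI.

```python
# Allowed accession prefixes per property, copied from the table above.
VALID_PREFIXES = {
    "P656": ("NG_", "NT_", "NC_", "AW_", "NW_", "NS_", "NZ_"),  # RefSeq (genomic DNA)
    "P639": ("NM_", "NR_", "XM_", "XR_"),                        # RefSeq RNA ID
    "P637": ("NP_", "AP_", "YP_", "XP_", "ZP_"),                 # RefSeq Protein ID
}


def is_valid_accession(prop, accession):
    """True if `accession` starts with a prefix allowed for `prop`.

    Unknown properties have no allowed prefixes, so they always fail.
    """
    return accession.startswith(VALID_PREFIXES.get(prop, ()))


for prop, acc in [("P639", "NM_005045.3"), ("P639", "AI323066")]:
    print(prop, acc, is_valid_accession(prop, acc))
```

A check like this would have rejected the GenBank accession AI323066 as a RefSeq RNA ID (P639) value.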
Small point, but could you also add descriptions in other languages:
en:human gene
de:menschliches Gen
es:gen humano
fr:gène humain
it:gene umano
zh:人类基因
I would like to add (Sarilho1 (talk) 14:36, 30 July 2013 (UTC)):
pt:gene humano
pt-br:gene humano

Properties[edit]

physically interacts with (P129) is still missing examples, constraints and a description of its scope. We should decide if this proposal (Wikidata:Property_proposal/Term#drug_interaction_.28en.29_.2F_Arzneimittelwechselwirkung_.28de.29) needs to be a separate property or not. --Tobias1984 (talk) 18:23, 31 July 2013 (UTC)

Another proposal needing attention drug action altered by --Tobias1984 (talk) 12:54, 2 August 2013 (UTC)

Discussion about drug-drug interaction qualifiers[edit]

I would like to invite the participants of this project to give their opinion in this discussion: Wikidata_talk:Medicine_task_force#drug-drug_interaction. --Tobias1984 (talk) 19:39, 6 August 2013 (UTC)

reelin and RELN[edit]

Human Gene:

Mouse Gene:

Human

Mouse:

Wouldn't it be a good idea to connect reelin (human) and reelin (mouse) to reelin (in general)? The sitelinks could go into that general item, and statements that are true for both could go only with the parent item. The same could be done for the gene. --Tobias1984 (talk) 11:33, 26 August 2013 (UTC)

Sorry I missed this post. I personally think we should not have items for "in general". Human reelin and mouse reelin are real and tangible things, and I think the abstraction will create more problems than it's worth. For example, what is true for "in general" undoubtedly differs depending on what species you're considering. One claim that is true for human, mouse and rat may not be true for fly, and of course reelin is probably not present in the genomes of many lower organisms. My two cents... Cheers, Andrew Su (talk) 23:04, 10 September 2013 (UTC)

Duplicates ProteinBoxBot[edit]

We probably need to discuss some items that ProteinBoxBot is currently creating. Some items might be duplicates we can merge, others are distinct concepts. Merging needs to be done carefully, in order not to break all the links. The first duplicate I found is: invalid ID (Q14330657), aging (Q332154). In my opinion those two could be merged. Any opinions? --Tobias1984 (talk) 17:57, 10 September 2013 (UTC)

Hi Tobias, you raise a very good point. Chinmay has put in quite a few checks to avoid creating duplicates. Of course, no system is perfect (especially with the somewhat incomplete query API at the moment), so letting us know when you run across them is very useful. Speaking on those two examples in particular... I think the ageing/aging issue would be difficult to detect. The spelling difference prevents us from doing an exact string match, although I think that if it was added as an alias then Chinmay's program would have detected the existing item. (In theory, we may have been able to use the MeSH ID to make the match, and if that becomes a common theme then we can add that feature.) I think the second example for endoplasmic reticulum exists because Q14327640 was created before the redundancy checking was in place... I think that is not a problem in any of the most recent runs. Unless you have any objection, I'll go try to use the merge tool to fix these? Cheers, Andrew Su (talk) 23:00, 10 September 2013 (UTC)
(Edit conflict) I was about to say much the same. I added "aging" (American English) as an alias of the item labeled "ageing" (British English). Gene Ontology terms use American English, but the Wikipedia articles on a good proportion of biomedical subjects use British English titles. Adding the bolded text in article leads as an alias seems like it would solve this problem in one fell swoop, but given the GSOC time constraints I suspect we'll need to do more manual data input. Emw (talk) 23:13, 10 September 2013 (UTC)
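The effect of adding an alias on exact-match duplicate detection can be shown with a short Python sketch. This is not ProteinBoxBot's actual code; `normalize` and `find_existing` are hypothetical helpers, and the item data is reduced to a label and an alias list.

```python
def normalize(text):
    """Lowercase and collapse whitespace so comparison ignores formatting."""
    return " ".join(text.lower().split())


def find_existing(term, items):
    """Return IDs of items whose label or any alias matches `term` exactly."""
    wanted = normalize(term)
    return [
        qid
        for qid, item in items.items()
        if wanted in {normalize(n) for n in [item["label"], *item.get("aliases", [])]}
    ]


# Existing item labeled with the British spelling and no aliases yet.
items = {"Q332154": {"label": "ageing", "aliases": []}}

before = find_existing("aging", items)      # no match, so a duplicate gets created
items["Q332154"]["aliases"].append("aging")
after = find_existing("aging", items)       # the alias now catches the GO term

print(before, after)
```

Matching on a shared external identifier, as suggested above for MeSH IDs, would be a stronger check than string matching, but it needs both sides to carry that ID.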
@Andrew Su: Will it mess up the bot if we delete them now, or will it use the other item automatically? --Tobias1984 (talk) 07:01, 11 September 2013 (UTC)
Another one? cell-cell signaling (Q14758911) and cell signaling (Q210973) --Tobias1984 (talk) 12:03, 11 September 2013 (UTC)
We should probably always keep the item with the lower number and we also have to change all the incoming links:
Changing the links is now pretty easy with User:BeneBot*/movelinks.js. --Tobias1984 (talk) 15:02, 9 December 2013 (UTC)
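The "keep the lower number, then fix incoming links" rule can be sketched in Python as follows. This is illustrative only and does not describe how movelinks.js works; both helper names and the subject item `Q_example` are hypothetical, and whether the two example items should really be merged is a separate question.

```python
def merge_target(qid_a, qid_b):
    """Return (keep, redirect), keeping the item with the lower numeric ID."""
    keep, redirect = sorted((qid_a, qid_b), key=lambda q: int(q[1:]))
    return keep, redirect


def retarget_links(claims, redirect, keep):
    """Rewrite claim values that point at `redirect` so they point at `keep`."""
    return [(subj, prop, keep if obj == redirect else obj) for subj, prop, obj in claims]


keep, redirect = merge_target("Q14758911", "Q210973")

# One incoming link from a placeholder item via biological process (P682).
claims = [("Q_example", "P682", "Q14758911")]
print(keep, redirect, retarget_links(claims, redirect, keep))
```

Comparing the numeric part matters: sorting the raw strings would put "Q14758911" before "Q210973".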

ProteinBoxBot progress[edit]

Hi all, in case anyone is interested in how Chinmay's PBB project is going, you can see his recent efforts at https://www.wikidata.org/wiki/Special:ApiSandbox#action=wbsearchentities&format=json&search=entrez%3A&language=en&type=item&limit=50 or https://www.wikidata.org/w/index.php?title=Special:Contributions/ProteinBoxBot&offset=&limit=500&target=ProteinBoxBot. We're in the home stretch on his GSoC project (which is why we haven't been more active in the discussions) so the current priority is to make sure the code that he's written is robust to many different gene examples. Cheers, Andrew Su (talk) 23:08, 10 September 2013 (UTC)

Hormones and biological process (P682)[edit]

Couple of questions. Would it be ok to connect hormones with biological process (P682) to their functions or do we need another property? Example: progesterone (Q26963) = pregnancy (Q11995). Also we need a property for "where the hormone is made in the body". Does anybody know what to call it or is there a property we can recycle? --Tobias1984 (talk) 17:53, 12 September 2013 (UTC)

I like this suggestion. It seems there are different kinds of relationships we'd want to use - inhibits, activates. This might tie in with a general discussion on representing pathways in Wikidata. Note that CHEBI includes a link between progesterone and the role 'contraceptive drug' (which might itself be linked to GO:pregnancy). As for where substances are made, in GO we include RO:occurs_in links between processes such as 'progesterone biosynthetic process' and structures or tissues in Uberon (we don't have a particular link for progesterone; this could be added, though we have strict criteria). In addition we have links between 'progesterone biosynthetic process' and CHEBI:progesterone. These could be chained together to get a chemical to tissue link. Cmungall (talk) 23:01, 5 February 2015 (UTC)
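The chaining described above might look like this in Python. It is a sketch under stated assumptions: the two dictionaries stand in for the process-to-chemical and occurs_in links, `produced_in` is a hypothetical helper, and the tissue is a placeholder because no such link exists for progesterone yet.

```python
# Process -> chemical it produces (stand-in for the GO to CHEBI link).
HAS_OUTPUT = {"progesterone biosynthetic process": "progesterone"}

# Process -> tissue where it occurs (stand-in for RO:occurs_in to Uberon).
# The tissue name is a placeholder, not real annotation data.
OCCURS_IN = {"progesterone biosynthetic process": "tissue_placeholder"}


def produced_in(chemical):
    """Tissues where some process that outputs `chemical` occurs."""
    return sorted(
        OCCURS_IN[process]
        for process, output in HAS_OUTPUT.items()
        if output == chemical and process in OCCURS_IN
    )


print(produced_in("progesterone"))
```

The chemical-to-tissue answer is derived rather than stored, so it stays consistent with whatever the two underlying links say.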

Endorse funding for wikidata query tool and more?[edit]

There is a proposal for funding to build a Wikidata toolkit for developers. If you like it, please head over there and let them know by adding an endorsement. https://meta.wikimedia.org/wiki/Grants:IEG/Wikidata_Toolkit#Endorsements:

Updates and additions to URL mappings for molecular biology properties[edit]

I've added a request to improve URL mappings for several of the ID properties we use in gene and protein items at MediaWiki_talk:Gadget-AuthorityControl.js#Updates_and_additions_for_molecular_biology_properties. That link also describes an interesting bug with the URL mapping for Ensembl Gene ID (P594) -- it goes to a page expecting a human gene ID even when it's used on claims for mouse genes. Please take a glance over that and note any comments or questions there. Thanks, Emw (talk) 12:51, 15 October 2013 (UTC)

Property proposal: chromosome[edit]

See Wikidata:Property_proposal/Natural_science#chromosome. Emw (talk) 20:56, 14 December 2013 (UTC)

protein binding[edit]

Hi, I have trouble labelling and sorting out this item, which is in the list of most used items without a French label: Plasma protein binding (Q14633864). Is it a duplicate of peptide bond? Is it a (class of) bond between proteins? TomT0m (talk) 22:52, 10 February 2014 (UTC)

The concept Plasma protein binding (Q14633864) comes from the Gene Ontology (GO), specifically http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0005515. It is a type of binding, which in turn is a type of molecular function. It isn't a synonym of 'peptide bond' nor a subclass of bond between proteins. More information is available in the GO link; more about GO in general is here: http://www.geneontology.org/GO.doc.shtml. Emw (talk) 00:44, 11 February 2014 (UTC)
I saw that link, but I am more confused and curious because protein amino acid binding is cited as a synonym. Is it a bond between two proteins formed by bonding some of their peptides? TomT0m (talk) 17:47, 11 February 2014 (UTC)
Let's compare definitions and a few subclasses of relevant GO terms:
Definition: Interacting selectively and non-covalently with peptides, any of a group of organic compounds comprising two or more amino acids linked by peptide bonds.
Subclasses: beta-amyloid binding, oligopeptide binding, peptide hormone binding
Definition: Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules).
Subclasses: apolipoprotein binding, heat shock protein binding, p53 binding
The difference between 'protein binding' and 'peptide binding' is the difference between proteins and peptides: size. Page 85 of Lehninger: Principles of Biochemistry (4th edition) says "molecules referred to as polypeptides generally have molecular weights below 10,000 and those called proteins have higher molecular weights." And that's the case with the children of 'peptide binding' and 'protein binding'. Beta amyloid has 36-43 amino acids (for comparison, cytochrome C has 104 residues and a molecular weight of about 13,000), and so is a peptide. Apolipoprotein E, an apolipoprotein, has 317 amino acids, and so is a protein. Beta amyloid would be an object of peptide binding, and apolipoprotein E would be an object of protein binding.
So both 'protein binding' and 'peptide binding' would involve binding amino acids, just amino acids in a bigger or smaller molecule. Importantly, the definitions for both terms explicitly note that they are non-covalent. A peptide bond is a type of covalent bond, so it would be quite incorrect to synonymize "peptide binding" with "peptide bond". Another important thing to consider is that, per GO, 'binding' is a type of activity, not a bond; 'binding' is a process, not an object. Emw (talk) 02:55, 12 February 2014 (UTC)
Thanks, my high school courses are far away, and not in English :) So to sum up, proteins and peptides are made of bonds between amino acids, whereas peptide binding and protein binding form peptide and protein complexes respectively (which may include other kinds of molecules). TomT0m (talk) 11:55, 12 February 2014 (UTC)
Yup, pretty much. Emw (talk) 12:16, 12 February 2014 (UTC)

Wikidata Infobox on Czech Wikipedia

Czech Wikipedia is currently interested in adding protein data to their articles. This would be our chance to test our data in the field, by building a Lua-Infobox that uses Wikidata-data. By adding the infobox one at a time we can slowly work out problems, add sources and gather experience for further deployments. @Hypothalamus: would be our go-to person. Hypothalamus could choose an example page (ideally a page that isn't visited too much, so mistakes can be fixed without anybody noticing). We also still have to find somebody with some Lua-infobox experience. --Tobias1984 (talk) 11:26, 25 March 2014 (UTC)

Yes, the original suggestion was targeted to User:Andrew Su here. I suggest we experiment with infoboxes on a short article on Hepcidin. Is there anything I can do at this point? Do you want me to ask at Czech wiki's Village pump if there is anyone with Lua programming skills willing to help? Hypothalamus (talk) 19:35, 7 May 2014 (UTC)

Andrew Su
Genewiki123
Marc Robinson-Rechavi
Pierre Lindenbaum
Michael Kuhn
Boghog
Emw
Chandres
Dan Bolser
Dan Lawson
Kizar
Pradyumna
Chinmay
Timo Willemsen
Salvatore Loguercio
Tobias1984
Daniel Mietchen
Optimale
Mcnabber091
Ben Moore
Klortho
Hypothalamus
Vojtěch Dostál
Gtsulab
Andra Waagmeester
Sebotic
Mvolz
Toniher
Notified participants of Wikiproject Molecular biology Tobias1984 (talk) 21:03, 7 May 2014 (UTC)

@Hypothalamus: - If you could find a lua-programmer on Czech wiki that would be great. There are some working examples of infoboxes that would just need adjustment e.g. taxobox on hebrew-wiki [7] and the module [8]. This bug still puts some limitations on pulling data from other items than the item that is connected to the wikipedia article [9]. But I think that the most important information can be pulled from the item itself. Tobias1984 (talk) 11:34, 8 May 2014 (UTC)
I'd be happy to help. Can read Czech and Lua fine but am lousy at producing them properly. --Daniel Mietchen (talk) 12:05, 9 May 2014 (UTC)
@Daniel Mietchen, Hypothalamus: - I think another example is at https://fr.wikipedia.org/wiki/Module:Infobox and https://en.wikipedia.org/wiki/Wikipedia:Lua - I still have to read up on how this all works. Especially the modules that are required by Wikipedia still confuse me. Tobias1984 (talk) 18:48, 9 May 2014 (UTC)
@Daniel Mietchen, Hypothalamus: - Another example: fr:Module:Infobox/Composé_chimique used on fr:Undéc-1-ène - Daniel if you have time it would be great if you could help out Hypothalamus and czech-wiki. I still have trouble reading the code and don't really understand what goes where. Tobias1984 (talk) 13:20, 10 May 2014 (UTC)
This is a great initiative and I would be happy to help. I am a native Czech, molecular biologist, but cannot write in Lua. --Vojtěch Dostál (talk) 21:58, 13 May 2014 (UTC)
@Vojtěch Dostál: Welcome to the WikiProject. Help is of course always appreciated and needed at every corner of Wikipedia and Wikidata :) - I see that you didn't make any edits yet to Wikidata. You could start by just looking at some items of your favorite articles and see what kind of information you can add to them. It is good to understand the data structure, so when the infoboxes are put onto czech-wiki you will already know how to fix mistakes. Ping me if you need any help! -Tobias1984 (talk) 22:31, 13 May 2014 (UTC)
@Tobias1984: I am user Vojtech.dostal but have just renamed my account and did not merge all of them :-). Thank you for your kind offer - as long as I do not dig into the programming - I think I'll be all right. --Vojtěch Dostál (talk) 22:39, 13 May 2014 (UTC)

I can't work out how to use the Wikidata:WikiProject Molecular biology page

Sorry if it's just me, but I wanted to add my new proposed property (genome size) there but couldn't work out how to do it. --Dan Bolser (talk) 12:45, 15 April 2014 (UTC)

@Dan Bolser: Do you have trouble with the markup language? The template of the property list together with the table syntax can be confusing at times. I can help you if you describe your problem. Tobias1984 (talk) 15:26, 15 April 2014 (UTC)
@User:Tobias1984: Syntax is fine, I just don't know what properties to use with what templates. --Dan Bolser (talk) 11:33, 20 May 2014 (UTC)

Gene alias

Hello to everyone! I am a beginner on Wikidata, so I ask you to be patient with me :) I was wondering whether adding aliases (aka) to the names of the genes has already been considered (e.g. aliases reported by GeneCards or other similar databases). As an example I tried to edit the page FOSL2, but obviously it is not a task that can be carried out one-by-one by hand. Have you thought about developing a bot for this purpose? I am at your disposal! Amicobromo (talk) 10:38, 31 July 2014 (UTC)

@Amicobromo: - @Andrew Su: has run the previous bot edits for this project. Maybe he can incorporate aliases on the next run. But it also depends on whether the data is available to us. - If you like, you can also program your own PyWikiBot. But there is also still much work to be done by humans, for example checking constraint violations on properties. -Tobias1984 (talk) 10:07, 1 August 2014 (UTC)
@Tobias1984: Thank you for your reply. I will prepare a tsv (or csv) file with all gene aliases I can find in public databases, so @Andrew Su: may use this information to add this task to his bot. Let me know, thank you! 79.38.219.90 10:14, 1 August 2014 (UTC)
@Amicobromo: Thanks for your interest in biological data on Wikidata! Gene aliases are of course very important. The primary source databases for that information are NCBI Entrez Gene and Ensembl, both of which provide alias information in downloadable files. As @Tobias1984: mentioned, we are writing a bot (currently under the care of @Andrawaag:), and we are taking those data sources into account. If you'd like to help, let's try to figure out how to get you involved with our bot development! I'll ask Andrawaag to chime in here too... Cheers, Andrew Su (talk) 18:07, 1 August 2014 (UTC)

ProteinBoxBot August 2014

Notified participants of Wikiproject Molecular biology Pinging because I hadn't seen this page, and maybe some other people didn't either: User:ProteinBoxBot/201408 sprint. The bot has done a few runs in August. Please check its work and update the constraint violation templates (e.g. Property talk:P639). New violations will show up after 1 or 2 days.

Some constraint violations also need fixing:

Don't hesitate to ping me for help with the constraint templates. -Tobias1984 (talk) 12:29, 27 August 2014 (UTC)

Yes, thank you Tobias1984! We welcome any and all input. The primary goal right now is to reimplement the infrastructure we previously created so that it is more robust. So far we've only done a few test edits. We will post here when we need feedback from the larger community. Hopefully soon! Cheers, Andrew Su (talk) 23:59, 28 August 2014 (UTC)

ProteinBoxBot September 2014

Hi all. Over the last month I have been refactoring ProteinBoxBot. The progress is reported in User:ProteinBoxBot/201408 sprint. The workflow is as follows:

  1. For a gene label, check whether it is already covered in Wikidata.
  2. If so, obtain the Wikidata ID.
    1. Check whether the Wikidata entry already contains a "subclass of" property. If its value is anything other than what the bot would have added ("gene" or "protein"), the bot should throw a warning and skip that gene.
  3. If not, create a new page.
  4. With the gene identifier, extract related identifiers from http://mygene.info.
  5. Add the related identifiers as statements to Wikidata.
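
The workflow above can be sketched in a few lines of Python. This is not the bot's actual code: the `Item` class, the in-memory `store` standing in for Wikidata, and the `lookup_identifiers` callback standing in for a mygene.info query are all hypothetical, chosen so the control flow can be read and run without network access.

```python
from dataclasses import dataclass, field

# Classes the bot itself would have added (taken from step 2.1 above).
BOT_CLASSES = {"gene", "protein"}


@dataclass
class Item:
    """Hypothetical, simplified stand-in for a Wikidata item."""
    label: str
    subclass_of: set = field(default_factory=set)
    identifiers: dict = field(default_factory=dict)


def process_gene(label, store, lookup_identifiers):
    """Apply steps 1-5 to one gene label; return 'skipped', 'updated' or 'created'."""
    item = store.get(label)                      # steps 1 and 2: already covered?
    if item is not None:
        unexpected = item.subclass_of - BOT_CLASSES
        if unexpected:                           # step 2.1: warn and skip
            print(f"warning: {label} is subclass of {sorted(unexpected)}; skipping")
            return "skipped"
        outcome = "updated"
    else:                                        # step 3: create a new item
        item = Item(label=label, subclass_of={"gene"})
        store[label] = item
        outcome = "created"
    # Steps 4 and 5: fetch related identifiers and attach them as statements.
    item.identifiers.update(lookup_identifiers(label))
    return outcome
```

Passing the lookup in as a function keeps the sketch testable; a real run would replace it with a call to the identifier service.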

The refurbished ProteinBoxBot has been tested on about 2000 Entrez genes (see https://www.wikidata.org/wiki/Special:Contributions/ProteinBoxBot) and, if no objections are raised, I hope to launch ProteinBoxBot on the remaining 40000 entries later this week. - AndraWaag (talk)

We have revised the bot approach a bit. Entrez genes and their related identifiers are now added in two steps. In the first step, only a stub page is created. This stub contains a label, a symbol, synonyms, the Entrez gene identifier, and its species. In the second step, the information from mygene.info is obtained and added as related identifiers. The first step (the stubs) is finished now and the second step is currently running. --Andrawaag (talk) 22:37, 19 September 2014 (UTC)

Royal Society of Chemistry - Wikimedian in Residence

Hi folks,

I've just started work as w:Wikimedian in Residence at the w:Royal Society of Chemistry. Over the coming year, I'll be working with RSC staff and members, to help them to improve the coverage of chemistry-related topics in Wikipedia and sister projects.

You can keep track of progress at w:Wikipedia:GLAM/Royal Society of Chemistry, and use the talk page if you have any questions or suggestions.

How can I and the RSC support your work to improve Wikipedia? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:17, 24 September 2014 (UTC)

The key thing is to make sure we coordinate ongoing chemistry (especially drug-related) tasks (mainly bots) within Wikidata. Check out the work of User:AlepfuBot as an example. Curious what you are planning to do? One thing that would be really helpful in general is to produce some examples of how to get Wikidata content into Wikipedia. --Genewiki123 (talk) 16:30, 24 September 2014 (UTC)

Strategy to merge duplicates in WikiData

Notified participants of Wikiproject Molecular biology Hi All,

We are successfully adding entries from Entrez Gene to Wikidata. Both the house mouse genome and the human genome have been processed once, and we are in the process of designing an update bot to keep the data up-to-date. The workflow is as follows. When entering a new entry from Entrez Gene, the bot checks whether a matching entry already exists in Wikidata. Here the description of a gene entry is the key, and the symbols and synonyms are added as aliases. However, Wikidata entries exist where the symbol is the key. This has led to some interesting duplicates (e.g. Q15316328 and Q18254780). The question is what strategy to apply here.

My proposition would be to keep the description as key and add the symbol and synonyms as aliases. Currently, if no description exists (i.e. a dash, "-"), the entry is ignored by ProteinBoxBot, but I would propose to use the symbol here in the next iteration of the bot. The question is then how to deal with merges like Q18046951 into Q1521757. Here the description got removed in the merge process. In my opinion the merge should have been in the other direction, where Q1521757 would be merged into Q18046951.
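
A rough Python sketch of this proposal, under stated assumptions: items are plain dictionaries with hypothetical `description` and `aliases` fields, and the function names are invented for illustration. It shows the matching key (description, falling back to the symbol when the description is a dash) and a merge that keeps the item carrying the description.

```python
def match_key(description, symbol):
    """Key used to look up an existing entry: the description, or the
    symbol when the source provides no description (a dash, "-")."""
    if description and description.strip() != "-":
        return description.strip()
    return symbol


def merge_duplicates(a, b):
    """Merge two duplicate items, keeping the one that has a description.

    If neither item has a description, the second one is kept.
    Aliases from both items are combined on the surviving item.
    """
    keep, drop = (a, b) if a.get("description") else (b, a)
    aliases = sorted(set(keep.get("aliases", [])) | set(drop.get("aliases", [])))
    return {**keep, "aliases": aliases}
```

The direction of the merge is decided by the data rather than by argument order, so the description cannot be lost just because the items were passed in the "wrong" way round.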

Any thoughts, objections, or approval on both the workflow where the description is key and the proposition to merge duplicates in the direction of the entry that contains this description? Andrawaag (talk) 10:57, 17 October 2014 (UTC)

Let's Do SNPs!

i.e. dbSNP: http://www.ncbi.nlm.nih.gov/projects/SNP/

We could also add the GWAS catalog: http://www.genome.gov/gwastudies/

With all human genes now in wikidata, we could link each of those SNPs both to the gene they're in, and also to the disease the SNP is associated with.

In terms of properties, we'd need:

property: orientation, values "plus" and "minus" (depending on what strand is read, the value for the SNP might be reversed)

property: in gene, to show what gene the SNP is in.

property: implicated in (implicated in disease or trait), to show what disease or trait the SNP is implicated in.

genotype: values AA, AG, AT, AC, GA, GG, GT etc., and then each genotype could be annotated with a value, e.g. .2, and then a description of that value, "diopters", to indicate that the SNP was associated with a change of .2 diopters in the case of myopia...

OR

nucleotide: value A, G, T or C.

This is actually the hardest bit, deciding how to represent the different values for the SNP. So in GWAS, we only have a single nucleotide associated with each disease, and that's relatively easy to represent within one Wikidata entry. The problem is that some studies report only the association between one nucleotide (i.e. "the presence of T") and the magnitude of the effect, and others look at both nucleotides (i.e. AA, AT, or TT) and associate each genotype with an effect.

The difference, of course, is that you can have a dose responsive effect or dominance could be involved... so just doing the single nucleotide, as opposed to the genotype, doesn't give you the complete picture

And then the other issue is that a single SNP may have effects on multiple traits/diseases, so we need to make sure that say ".2" and "diopters" is connected because the same snp might also cause say 200% increase in glaucoma or something.

From a DB design perspective, the normal way to do this is to put all associations into their own table, i.e. have a Wikidata entry for the SNP, and then have an entry for each genotype that points to the SNP. But I'm not sure what the WD way of doing things would be in this instance.
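
To make the "associations in their own table" idea concrete, here is a small Python sketch. The `Association` record and its field names are hypothetical, the SNP identifier is a placeholder, and the numbers are only the illustrative ones used above (not real findings). Each row keeps the value, its unit, and the trait together, so effects on different traits cannot get mixed up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One row of a hypothetical association table."""
    snp: str       # SNP identifier, e.g. an rs number
    variant: str   # a single allele ("T") or a genotype ("AT")
    trait: str     # disease or trait the effect applies to
    effect: float  # magnitude of the effect
    unit: str      # unit that gives the magnitude its meaning


# Made-up rows echoing the illustrative numbers in this post.
ROWS = [
    Association("rs_example", "AT", "myopia", 0.2, "diopters"),
    Association("rs_example", "AT", "glaucoma", 200.0, "percent increase"),
]


def effects_on(trait, associations):
    """Map each variant to its (effect, unit) pair for a single trait."""
    return {a.variant: (a.effect, a.unit) for a in associations if a.trait == trait}
```

Because the variant is just a string, the same structure holds both single-nucleotide and genotype-level reports; whether Wikidata would model a row as a separate item or as a qualified statement is left open here.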

Any thoughts? Mvolz (talk) 21:02, 25 October 2014 (UTC)

Sample SNP entry here: rs8176058 (Q18341737). Did what I could with more general properties! Mvolz (talk) 21:35, 25 October 2014 (UTC)
One I found that was already here: rs267601217 (Q15304616)

While I'm all for it, I assume everyone has seen SNPedia? --Magnus Manske (talk) 23:48, 26 October 2014 (UTC)

Mvolz, thanks for starting this conversation. Structured data about genetic variation is clearly essential if we want to discuss the connection between genes and diseases on Wikidata. As you may know, dbSNP deals not only with SNPs, but also other types of small variants, e.g. insertions, deletions, indels, multiple nucleotide variants, microsatellites, etc. Starting with small genetic variants (i.e. variants generally < 50 bp in length), even perhaps just single nucleotide variants, seems like a good start.
Even that task is huge. The current version of dbSNP (dbSNP build 142) has over 112,000,000 RS's for human. As you note, each RS often represents multiple alleles -- the reference allele and one or more variant alleles. As can be seen in e.g. http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=429358, each RS is also mapped to two genome assemblies -- GRCh38 and GRCh37.p13. Representing clinical assertions about genetic variations is also complex, see e.g. http://www.ncbi.nlm.nih.gov/clinvar/?term=rs429358 and http://www.ncbi.nlm.nih.gov/clinvar/variation/17872/#clinical-assertions. This genetic variation and clinical assertion data is updated regularly.
I suggest starting off with a narrow subset of genetic variation: variants of clinical significance like those in SNPedia (thanks Magnus), which has data on some 57,000 genetic variants.
I also think we should adopt standards and conventions in the controlled vocabulary used to discuss genetic variations. Relevant ontologies, standards, etc.:
Regarding the proposed properties for representing SNPs:
  • orientation: Seems good to me
  • in gene: Good idea. It's worth noting that many clinically relevant variants occur upstream (or downstream) of a particular gene.
  • genotype: Let's clarify that "genotype" entails the sum of alleles at a given locus, e.g. two sequences for humans. I'd recommend using a slash to delimit each allele, which helps when considering alleles involving multiple nucleotides. That is, I recommend using "A/A" rather than "AA".
  • nucleotide: This should probably be called "allele".
  • implicated in: Implicated in and cause of (P1542) differ in evidentiality, but they both deal with causation, and I think statement ranks like "preferred" or "deprecated" could be used to capture that difference in sense. See also Help:Modeling_causes#Malaria, immediate cause of (P1536) and contributing factor of (P1537). In other words, my initial impression is that we should use one of those properties, not a new one.
  • clinical significance: would augment cause of (or contributing factor of or immediate cause of) statements with ACMG-recommended values noted in previous list -- pathogenic, likely pathogenic, unknown significance, likely benign, benign
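
As a small illustration of the slash-delimited genotype suggestion in the list above, here is a Python sketch of a parser. The function name and the restriction to the letters A, C, G and T are assumptions made for the example.

```python
def parse_genotype(text):
    """Split a slash-delimited genotype such as "A/A" into its alleles."""
    alleles = text.strip().split("/")
    if len(alleles) < 2:
        raise ValueError(f"expected slash-delimited alleles, got {text!r}")
    for allele in alleles:
        # Assumption: alleles are non-empty strings of A, C, G, T only.
        if not allele or set(allele) - set("ACGT"):
            raise ValueError(f"invalid allele {allele!r} in {text!r}")
    return tuple(alleles)
```

The delimiter is what keeps multi-nucleotide alleles unambiguous: "AT/A" clearly has the alleles AT and A, whereas the undelimited "ATA" could be split in more than one way.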
I'd also suggest including which transcript(s) the variant occurs in. And let's also keep in mind that many medically relevant variants are structural variants, e.g. copy number variations. dbVar and DGVa store information on that. Finally, let's be sure to get our provenance statements reasonably precise, so we can track not only the organization making a particular statement, but also which build or release the statement is sourced in. Emw (talk) 04:27, 27 October 2014 (UTC)
Thanks Mvolz! This is something that has been on our group's radar for a while now and I'm happy to see the discussion starting here at Wikidata. A while back we hacked together a gene-snp-disease Semantic MediaWiki with data extracted from Wikipedia and SNPedia. It might be useful to have a play with that to see how you like the structure. See http://genewikiplus.org/wiki/Main_Page . Also note that Chunlei Wu is leading the development of a service called http://myvariant.info that should be a great way to programmatically gain access to SNP annotations. If it all works out, we ought to be able to use it much like we are using http://mygene.info now to feed bots to populate Wikidata with this information. I agree with Emw though - we will want to stage this in a way that puts some useful content in here first without completely flooding the system with human variation data. (Though in the long run I would hope to see few if any limits on the amount of content like this that gets into Wikidata.) --Genewiki123 (talk) 16:56, 27 October 2014 (UTC)

Should these be merged?

Kell antigen system (Q1738190) and Kell blood group, metallo-endopeptidase (Q18028243)?

The English Wikipedia article mostly talks about the gene (and has a gene infobox) but technically the antigen system and the gene itself are not the same thing...

Thoughts?

A better way to link the two?

Mvolz (talk) 22:00, 25 October 2014 (UTC)

Infobox enzyme

Nobody has gathered the identifiers from Infobox enzyme yet:

5044 transclusions would make it well worth the effort. -Tobias1984 (talk) 20:12, 27 October 2014 (UTC)

General genes, specific diseases

In discussing a draft import of Disease Ontology classification, Andra explained that the claim "subclass of: disease" was added to all diseases because that's what was done for genes. Having suggested the "subclass of gene" approach for genes like RELN (Q414043) but complained about an analogous "subclass disease" approach for diseases like Alzheimer's disease (Q11081), I'd like to examine these approaches.

Consider how knowledge about diseases and genes are organized in sources. For example, Alzheimer's disease is said to be "subclass of tauopathy" in Disease Ontology (DO) and "subclass of other degenerative diseases of the nervous system" in ICD-10. The DO entry has 5 ancestors between "Alzheimer's disease" and "disease".

Now consider how genes are classified in the Ontology of Genes and Genomes (OGG), a modern ontology that aligns with major works like Gene Ontology. RELN is said to be "subclass of protein-coding gene of Homo sapiens" in OGG. There are 3 ancestors between "RELN" and "gene". This structure seems reasonable to me -- one layer accounts for the type of gene product (protein) and the remaining two account for organismic taxonomy (human genes, eukaryotic genes).

To compare apples to apples, it's necessary to account for the fact that DO does not include non-human diseases, but OGG does include non-human genes. If DO accounted for non-human diseases like OGG, that would add 2 ancestors to Alzheimer's disease. Non-human diseases are relevant for Wikidata -- e.g. blight (Q4273292) and scrapie (Q170102). So, Alzheimer's disease would have 6 or 7 ancestors, and RELN would have 2 or 3.
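
The depth comparison above amounts to counting the classes on the "subclass of" path between a term and a root. A minimal Python sketch, assuming a single-parent hierarchy stored as a dictionary; the two middle class names in the example chain are placeholders, not actual OGG labels.

```python
def ancestors_between(term, root, parent_of):
    """Return the classes strictly between `term` and `root` on the
    subclass-of chain, assuming each class has at most one parent."""
    path = []
    seen = {term}
    current = parent_of.get(term)
    while current is not None and current != root:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        path.append(current)
        current = parent_of.get(current)
    if current is None:
        raise ValueError(f"{root!r} is not an ancestor of {term!r}")
    return path


# Placeholder chain with the depth described for RELN (3 classes in between).
PARENT_OF = {
    "RELN": "protein-coding gene of Homo sapiens",
    "protein-coding gene of Homo sapiens": "placeholder: human gene",
    "placeholder: human gene": "placeholder: eukaryotic gene",
    "placeholder: eukaryotic gene": "gene",
}
```

Here `len(ancestors_between("RELN", "gene", PARENT_OF))` gives the number of intermediate classes. Real ontologies allow multiple parents, so an actual comparison would need to pick a path or report a range, as done informally above.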

This is mostly just an exploration of how different domains do classification. I wouldn't oppose using OGG's approach; it may even be helpful, since OGG aligns with other major biomedical ontologies. Emw (talk) 19:02, 30 November 2014 (UTC)

I don't know enough about the wikidata technology stack to answer authoritatively here, but I would be a bit wary of directly replicating a realist OWL model in Wikidata triples. Liberal use of OWL classes (genes, proteins, diseases, pathways, chemicals) makes a certain amount of sense when your stack is based around DL reasoners. And it's also highly defensible from an ontological/philosophical point of view. But there are also good reasons for a more OWL-individual centric approach, some obvious, some subtle. The wikidata datamodel and stack may push things further towards individual-centric modeling. I think some kind of formal mapping to the OBO world could still be maintained, along the lines of prototype theories. This feels like a big issue that spans multiple WikiProjects. Cmungall (talk) 22:41, 5 February 2015 (UTC)
Cmungall, you mention there are also good reasons for a more OWL-individual centric approach, and that the Wikidata data model and stack may push things towards individual-centric modeling. Could you elaborate? Also, what makes you wary of adopting ontological realism in Wikidata?
Briefly: think of basic queries like 'how many genes in human?'. You'd end up having to bolt on some kind of metaclass system to constrain results to the desired hierarchical rank/layer. Also, does WD intend to commit to the same model theoretic semantics as OWL? How are existential restrictions mapped? I'm guessing these are all irrelevant to WD, in which case you don't end up buying anything with a class representation, you just make certain things harder. As for realism, it can lead to ontological hair-splitting distinctions; these may be very useful for precise reasoning and modeling, but I would guess WD users would prefer to see disease-qua-disposition, disease-qua-process, disease-qua-disorder etc lumped into an overarching concept. In terms of Ceusters and Smith's 3-level Granular Partition Theory of reality (http://www.jbiomedsem.com/content/1/1/10), ontologies are good for representing L1, but I'd argue WD would more naturally represent L2. (with the caveat that I am new to WD and may have misunderstood some of the goals) Cmungall (talk) 23:11, 7 February 2015 (UTC)
Are you aware of the Wikidata RDF exports? As seen in e.g. the 2015-01-26 dump, it includes OWL dumps for Wikidata's class hierarchy (wikidata-taxonomy.nt.gz) and instance layer (wikidata-instances.nt.gz). Emw (talk) 00:23, 6 February 2015 (UTC)

Launch of WikiProject Wikidata for research

Hi, this is to let you know that we've launched WikiProject Wikidata for research in order to stimulate a closer interaction between Wikidata and research, both on a technical and a community level. As a first activity, we are drafting a research proposal on the matter (cf. blog post). It would be great if you would see room for interaction! Thanks, --Daniel Mietchen (talk) 01:39, 9 December 2014 (UTC)

Replacing P643 with P1057

Notified participants of Wikiproject Molecular biology
Per discussion in Property_talk:P643, Pasleim is planning to replace invalid ID (P643) (Genloc Chr) with chromosome (P1057). Genloc Chr (P643) has been deprecated for over a year and this has been an outstanding maintenance task.

Pasleim, please be sure to only convert a P643 claim into a P1057 claim when the item has no P1057 claim yet. Many genes, proteins etc. already have P1057 statements and we don't want two redundant P1057 claims on those items. Once there are no items with a P643 claim and without a P1057 claim, we should be able to delete the deprecated property. Thanks for taking this on! Emw (talk) 15:59, 27 December 2014 (UTC)
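
The rule above can be written as a short check. This is only a sketch in Python, not the bot's code: it assumes an item's claims are available as a dictionary from property ID to a list of values, and the function name is made up.

```python
def p1057_values_to_add(claims):
    """Return the values to add as chromosome (P1057) claims for one item.

    `claims` maps a property ID to its list of values. Values are copied
    from the deprecated P643 only when the item has no P1057 claim yet,
    so no redundant P1057 statement is created.
    """
    if claims.get("P1057"):
        return []
    return list(claims.get("P643", []))
```

Items for which this returns an empty list either already have a P1057 claim or never had a P643 claim, which matches the condition for deleting the deprecated property.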

FYI, I have now deleted P643. --Pasleim (talk) 13:12, 30 December 2014 (UTC)

Identifier Syntax

I have a question and a mild paranoia about the identifier syntax. I have seen confusion and wasted effort before when people can't seem to agree how to decompose an identifier into a prefix and local portion. For example, consider MGI genes. I consider these to be a "local" part which is numeric, and the prefix "MGI". Some people understood "identifier" to mean the number, and others understood the "identifier" to include the prefix. This led to "identifiers" being created that were MGI:MGI:nnnnn. I would like to ensure this doesn't happen again. So for a field like "Gene Ontology identifier", are we meant to enter the numeric portion, or the full ID? The GO's position on identifiers is here: http://wiki.geneontology.org/index.php/Identifiers I would also like to make sure the rules for translating Wikidata identifiers to OBO PURLs are clear, and that each identifier property is linked to a source such as identifiers.org. I can help with this but not sure where to start. Cmungall (talk) 21:04, 7 February 2015 (UTC)
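
One way to guard against the doubled-prefix problem is to normalize on input. The Python sketch below is an illustration only: the function names are invented, it assumes the local part is purely numeric (as in the MGI and GO examples), and it does not settle whether a Wikidata property should store the prefixed or the unprefixed form.

```python
def split_identifier(identifier, prefix):
    """Split an identifier into (prefix, local part), tolerating a missing
    or repeated prefix such as "12345" or "MGI:MGI:12345"."""
    local = identifier.strip()
    marker = prefix + ":"
    # Strip every leading copy of the prefix, ignoring case.
    while local.upper().startswith(marker.upper()):
        local = local[len(marker):]
    # Assumption: the local part is numeric; leading zeros are preserved.
    if not local.isdigit():
        raise ValueError(f"unexpected local part {local!r} in {identifier!r}")
    return prefix, local


def full_identifier(identifier, prefix):
    """Return the canonical "PREFIX:local" form of an identifier."""
    return ":".join(split_identifier(identifier, prefix))
```

Whatever form is chosen for storage, running every incoming value through one function like this means "12345", "MGI:12345" and "MGI:MGI:12345" all end up identical.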

Gene Disease Interactions

All genes and diseases have now been put into WD and they are kept up-to-date by bots. This is a great development. The next step is to introduce relations between genes and diseases. For this, we have built an OWL ontology with the classes 'gene' and 'disease' and several properties. It can be found here: [10]. We would like to put this out for community discussion now.

The general idea behind this approach would be to first collaboratively build an OWL based ontology (e.g. with webprotege.stanford.edu) which represents all classes and properties necessary in order to represent a certain relation/topic/part of reality in WD. This could then be proposed as a whole for WD property creation, so that all required properties are created via one request. The ontology created could then also serve as a basis for data export from WD and would enable partial or complete export of a certain topic covered by the ontology. The ontology would therefore serve as the relational scaffold for a certain part of WD, mediating import and export processes or just giving a user a quick overview of how things relate. To start, it would be important to discuss the gene-disease interaction properties and get them created, based on the ontology shown in the link above. Sebotic (talk) 21:12, 3 April 2015 (UTC)


Tobias1984
Doc James
User:Bluerasberry
Wouterstomp
Gambo7
Daniel Mietchen
Andrew Su
Peter.C
Klortho
Remember
Matthiassamwald
Projekt ANA
Andrux
Pavel Dušek
Was a bee
Alepfu
FloNight
Genewiki123
Emw
emitraka
Lschriml
Mvolz
Franciaio
sebotic
Notified participants of Wikiproject Medicine

Notified participants of Wikiproject Molecular biology