Wikidata talk:WikiProject Molecular biology

Shortcut: WT:MB

species

I read "human is a priority". At the Wikidata level, shouldn't we avoid this type of approach and try to be open to all organisms from the beginning? Or you should rename the task force to "Human Molecular biology task force" ;-) --Chandres (talk) 14:12, 4 March 2013 (UTC)

Yes, I think this task force should eventually target all organisms. We suggested starting with human since that nicely overlaps with our interests and expertise. But we are also interested in the Long Tail of microbial genomes too ([1]), so making sure the models/properties/tools are generalizable is definitely on our mind. And more to the point, if you and/or others are interested in other organisms, then by all means, let's actively move forward on multiple fronts! Cheers, Andrew Su (talk) 19:17, 4 March 2013 (UTC)

Let's get started?

Now that the string datatype has been implemented, seems like things are ready to move on getting some of our proposed properties approved. Anyone want to take the lead on proposing a few over on the property proposal page? Andrew Su (talk) 17:44, 7 March 2013 (UTC)

I had a little time and requested a couple of properties. All they need now is a few reviews. --Tobias1984 (talk) 10:39, 27 May 2013 (UTC)
See currently proposed properties. These properties require some specialized knowledge, but they are just string-type identifiers, and as such should not cause major structural problems. I think that if they have not been reviewed in a few days, they can safely be created. --Zolo (talk) 10:00, 29 May 2013 (UTC)
Although it would be nice if at least one person would check over the proposals and support them ;). Where are all the people of this task force? --Tobias1984 (talk) 21:14, 1 June 2013 (UTC)

Scope?

I read that you want to at least add data for 10k human genes, but we could easily imagine adding data for millions of genes in UniProt (for example). Is this the place to discuss the scope of the project? Cheers, --Dan Bolser (talk) 15:58, 12 March 2013 (UTC)

Hi Dan, absolutely, we want to scale up to all known genes and proteins in all organisms. Our local database here (a newer version of what's available at http://mygene.info) has loaded all ~8.7 million genes from NCBI's gene_info file ([2]), plus all known links to Ensembl, UniProt, PDB, GO, etc etc. So all of our infrastructure is converging on a complete loading of all knowledge. As a pilot project though, we are proposing to start with the 10k human genes that encompass the Gene Wiki, just because our team has interest and bandwidth to make sure all of that is high quality. As mentioned above, if there are eyeballs and hands to help us broaden our initial pilot, by all means, we're open to it... Cheers, Andrew Su (talk) 17:58, 12 March 2013 (UTC)

User:Ricordisamoa/AuthorityControl.js

As external links are probably important here, you may want to use and expand User:Ricordisamoa/AuthorityControl.js that automatically creates links for some string properties. --Zolo (talk) 10:15, 9 March 2013 (UTC)

Hi Zolo, just to confirm, the tool above only changes how string properties are rendered in the web interface, not how the data are represented in wikidata, correct? Assuming I'm understanding correctly, sounds great! Once we get a few of our proposed properties created, we'll add our identifiers. (For others, instructions on how to add this user script are at Wikidata:Tools#AuthorityControl.js.) Cheers, Andrew Su (talk) 23:04, 12 March 2013 (UTC)
Yes, it only changes the way things are displayed. In the longer term, it might make sense to store external IDs differently from other strings, as full statements with sources and qualifiers may not make much sense for them. Apparently, this is something the development team is thinking about, but if something gets done, it will probably not be in the next few months. --Zolo (talk) 08:15, 13 March 2013 (UTC)
Note: it is now called MediaWiki:Gadget-AuthorityControl.js and is activated by default for all users. --Zolo (talk) 17:31, 2 April 2013 (UTC)

Distinguishing between genes and proteins

On Wikipedia, proteins are covered in the same article as their corresponding gene; see BRCA1, APOE4, etc. What are others' thoughts on separating proteins and their corresponding genes into separate items on Wikidata, e.g. such that 'BRCA1' would be a subclass of gene, and 'breast cancer type 1 susceptibility protein' would be a subclass of protein? And how about distinguishing between homologous genes and proteins in different organisms? Emw (talk) 02:41, 19 March 2013 (UTC)

Great discussion point. With the Gene Wiki, we did make the decision (with en:WP:MCB) to lump all the data about the gene and the corresponding protein product(s) into a single page. That was mostly because most pages were sufficiently underdeveloped such that splitting them would have been non-productive fragmentation. I'm torn about whether to take this same strategy with Wikidata. On the one hand, splitting them into two items is probably the more "accurate" way of representing reality. However, I think consumers of the data generally won't care about the difference, so it would be one more step with doing integrative queries. For example, suppose I want to find all genes on chromosome 2 whose protein products have a kinase domain. If the gene and protein items were distinct, presumably it would require another "join" in the query. Anyway, very interested to hear others' thoughts... Cheers, Andrew Su (talk) 04:59, 19 March 2013 (UTC)
From a semantic point of view, separating genes from proteins really makes a lot of sense to me, especially since you can think of properties that proteins have (function etc.) which do not really apply to genes. I can understand that people who want to use the data don't really care whether they are looking at a gene or a protein, but it doesn't seem like you would want to modify the raw data for that. Wouldn't it be possible to change the way genes and proteins are presented here on Wikidata through an extension or something like that? --TWillemsen (talk) 09:56, 25 May 2013 (UTC)
I am conflicted on this one. In this case, we have a natural immutable hierarchy that is a result of the central dogma (gene → protein → function). As Andrew has pointed out above, in the vast majority of cases, we have a single article about the gene and the protein encoded by that gene. This situation IMHO is unlikely to change since the subjects of genes and proteins are so interrelated. Separate gene and protein articles, where they exist, have slowly been merged over time. Furthermore, I am not aware of a single example of a gene/protein article that has been split into two. On the other hand, gene and protein are distinct entities ... Boghog (talk) 10:59, 23 June 2013 (UTC)
Hmm, the central dogma is wrong: there are a lot of examples where one gene can give different proteins (in terms of sequence, and sometimes even the name changes). You can also find examples of proteins with different functions depending on, for example, phosphorylation status. The rational solution would be to have the protein in the same item as the gene. But in the case of multiple proteins from one gene, or of different functions (not multiple functions), it can be really difficult to link the different data. --Chandres (talk) 20:27, 23 June 2013 (UTC)
I also think that we can and should part ways with Wikipedia's structure in certain areas. We could for example manage the sitelinks in the gene items and create protein items that usually don't have sitelinks but can hold statements independent from the genes. Maybe we should have an RfC about this topic, and then we would have a guideline for how to manage similar cases. --Tobias1984 (talk) 21:00, 23 June 2013 (UTC)
Of course one gene can code for more than one protein (alternative splicing, post-translational modification, etc.). This does not, however, invalidate the central dogma and the hierarchy (a one-to-many relationship). Boghog (talk) 21:05, 23 June 2013 (UTC)
Physically, the gene and proteins are separate entities, but I think it makes more sense to put them on one page. For the related properties it's clear if they belong to the gene or the protein. In an ideal world, we would have data on all the splice and PTM forms, but this is not the case and so pooling all the evidence (and perhaps adding qualifiers as to which form is meant) seems more manageable. MichaK (talk) 08:06, 24 June 2013 (UTC)
As some here have noted, genes and proteins are different things. They're strongly related, of course, but they're not the same thing. Wikidata is about things--its items are not encyclopedic articles. While it makes sense for Wikipedia to cover a gene and its proteins in a single article because of convenience for the humans that read the article, the same thing doesn't make sense in Wikidata. We can have as many items as we want, we can link them however we want, and we can display the data however we want. Wikidata's current user interface is not the final word, and better (and even domain-specific) interfaces can be built to display the data--in this case to put the data about genes and their proteins in one place. For an example of a different kind of Wikidata user interface, see the Reasonator. Silver hr (talk) 21:03, 24 June 2013 (UTC)
The two discussions about how Wikidata should handle gene-protein distinctions and how we should handle ortholog distinctions involve the same question. How should biological sequence information be partitioned on Wikidata? Should it be divided into many items such that one item represents one discrete type of biological entity, or divided into fewer items such that one item represents one gene product, including information like which entities the gene product derives from, etc.?
Separating genes and their encoded proteins into separate items, and separating those by which organisms express them, seems like it would be more unwieldy initially. For an article like reelin, it would entail at least four Wikidata items: one for the RELN gene in humans, one for the RELN gene in mice, one for the reelin protein in humans, and one for the reelin protein in mice. To be consistent with this approach, Wikidata would need separate items for RNA (in humans and mice). And even when an EC number mapped to only one gene product, Wikidata would need a separate item for that enzyme class (see here and here for more detail). So Wikidata would need six items (and quite possibly more) to represent the information in one PBB template.
On the other hand, putting these distinct biological types into separate Wikidata items seems like it would make it easier to assign properties to each specific type. It's technically possible to distinguish between claims that apply to a gene or its product (or to its orthologs) in one large item, but that simply offloads the unwieldiness onto the claims. In other words, claims would need to be bloated with qualifiers in order to unambiguously specify which statement applied to which biological entity.
My impression is that it would be better to divide the articles in question such that one item represents one discrete type of biological entity. This seems like it might make the initial mapping of Wikidata items onto the PBB template less straightforward, but -- in time -- allow for greater expressivity about each biological entity within that template. Emw (talk) 23:38, 24 June 2013 (UTC)
Thank you for this summary, and you are right of course that the gene/protein question and the ortholog/paralog issue are highly related. It seems to me that the consensus is forming around separating all these conceptual entities into separate items (right?). I lean this way too, where my only hesitation comes from the fact that I'm largely ignorant of how the Wikidata querying system will work. Emw states that this design would "make the initial mapping of Wikidata items onto the PBB template less straightforward". I'm fine with "less straightforward" as long as it could be done (integrating data from multiple Wikidata items into a single Wikipedia template). Can someone more knowledgeable confirm that this is true? Cheers, Andrew Su (talk) 07:26, 25 June 2013 (UTC)
I forgot that I had a call with Denny this morning, so I asked the question I posed above. The answer is that currently only one-to-one mappings between wikipedia and wikidata are allowed (meaning we would not be able to import data from multiple wikidata items into a single wikipedia template), but that support for this feature is definitely planned and will likely come near the end of the year. So with that context, I think we separate out genes from proteins, and also the various orthologs. Long term that seems like the best solution to me. I'm going to make a specific proposal to this effect below... Cheers, Andrew Su (talk) 17:18, 25 June 2013 (UTC)
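
To make the query cost of that split concrete, here is a minimal Python sketch of the example raised earlier in this thread (genes on chromosome 2 whose protein products have a kinase domain). Everything in it is an assumption for illustration: the item IDs, the `encodes`, `chromosome` and `domains` keys, and the data itself are hypothetical, not real Wikidata properties or content.

```python
# Toy item store; IDs, keys and values are hypothetical, not real Wikidata data.
ITEMS = {
    "gene:A": {"type": "gene", "chromosome": "2", "encodes": ["protein:A"]},
    "gene:B": {"type": "gene", "chromosome": "2", "encodes": ["protein:B"]},
    "gene:C": {"type": "gene", "chromosome": "7", "encodes": ["protein:C"]},
    "protein:A": {"type": "protein", "domains": ["kinase"]},
    "protein:B": {"type": "protein", "domains": ["SH2"]},
    "protein:C": {"type": "protein", "domains": ["kinase"]},
}


def genes_with_protein_domain(items, chromosome, domain):
    """Return gene IDs on `chromosome` with at least one product carrying `domain`."""
    hits = []
    for item_id, item in items.items():
        if item.get("type") != "gene" or item.get("chromosome") != chromosome:
            continue
        # The extra "join": follow each gene -> protein link to the protein item.
        products = (items[p] for p in item.get("encodes", []))
        if any(domain in p.get("domains", []) for p in products):
            hits.append(item_id)
    return sorted(hits)


print(genes_with_protein_domain(ITEMS, "2", "kinase"))
```

In this toy store the join is one dictionary lookup per gene-to-protein link. Whether the eventual Wikidata query system makes the equivalent hop equally cheap is exactly the open question raised above.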

Human/mouse/... ID

How do we distinguish between IDs for humans and mice? Present Reelin P351 value is for humans, but the property does not describe that. Should a qualifier be used? (I know not much about non-human bioinformatics) — Finn Årup Nielsen (fnielsen) (talk) 14:55, 17 April 2013 (UTC)

Ooops, sorry I missed this post way back when. Yes, we need to better model how orthologs are handled. Personally, rather than putting it in a qualifier, I'd propose creating a separate topic for the mouse gene, and then relating them via a new property called "ortholog". There are too many functional differences to lump all orthologs into a single topic. Your thoughts? Cheers, Andrew Su (talk) 00:27, 25 May 2013 (UTC)
I just updated RELN (Q414043) with some information and I think that the (no label) (P89) qualifier (mouse/human) works pretty well. We should still think about how the information should be sourced. --Tobias1984 (talk) 21:06, 13 June 2013 (UTC)
I'm still not 100% sure I agree with having the human and mouse genes represented in the same item. Many of a gene/protein's properties might be species-specific. For example suppose (completely hypothetical here) that reelin interacts with VLDL receptor in both human and mouse, but interacts with APP only in human. How would we model that? What happens if there are disagreements on what the true ortholog relationships should be? Again, I tend to favor creating a separate topic for each species-specific gene, and then linking them via an 'ortholog' property. Other thoughts? Cheers, Andrew Su (talk) 06:38, 21 June 2013 (UTC)
Different interaction in different species can be expressed with qualifiers, but it seems more tricky for disagreements over orthologs. I am no biologist, but I feel that separate items is a more manageable long-term solution. If so, should RELN (Q414043) be taken to mean "human reelin", or should "human reelin" have its own item ? I do not know if there would be much to say about reelin in general, perhaps things like its evolutionary history. --Zolo (talk) 07:12, 21 June 2013 (UTC)
Yes, I would agree that RELN (Q414043) would specifically refer to the human version (which would be noted with (no label) (P89) and human (Q5)), and we would then create a new item for "mouse reelin". As far as evolutionary history, I think those reciprocal "ortholog" relationships can be encoded as statements on both items. Make sense? Cheers, Andrew Su (talk) 08:34, 21 June 2013 (UTC)
Hi Andrew! I made an example of how qualifiers could be used to show different interactions for different species (http://www.wikidata.org/w/index.php?title=Q4115189&oldid=51778169). The bottom two examples of VLDL protein show that additional qualifiers could be added to further describe the interaction (my weird example: Sandbox-item has physical interaction with VLDL protein in mice, but only in 1980). --Tobias1984 (talk) 09:11, 21 June 2013 (UTC)
Hi Tobias, yes point well taken that qualifiers could be added there too. My concern though is that almost all of the properties would end up having species-specific qualifiers. For example, in addition to the ones already shown, I think regulates (molecular biology) (P128), gene symbol (P353), and RefSeq Protein ID (P637) would all be candidates. In that case, it might be easier just to separate them. And on a philosophical level (even though I hate philosophical arguments), I tend to think that human reelin is in fact a different thing than mouse reelin. Your thoughts? Others thoughts? Cheers, Andrew Su (talk) 16:52, 21 June 2013 (UTC)
The question of what would be the easier solution is difficult to answer at the moment. The problem has multiple-dimensions too. What is easier for people to view and edit; What is easier for bots to edit; What is easier to query? Just recently most people agreed that certain editions of books should have their own item. But then a query for "books written by author" will return all the editions that link to the author. That means that the query has to be more complicated than in the one-item solution. If there is something nature hates, it is parceling of information ;). Do you have time to look at Wikidata:Property_proposal/Term#RefSeq so the GSoC student can get up and running? --Tobias1984 (talk) 17:20, 21 June 2013 (UTC)
Yeah, good points. I'm personally not so concerned about the views/edits, but query complexity is definitely a downside to over-fragmentation. But I also worry that if we decide fragmentation is better later, it will be a big pain to "fix" things after we've already created 10000+ items. Ugh, no perfect solution... (And yes, I did add my support to Wikidata:Property_proposal/Term#RefSeq. Thanks for the reminder...) Cheers, Andrew Su (talk) 17:55, 21 June 2013 (UTC)
At the moment, I do not see what would be made harder to query. Presumably, most queries will be species-specific anyway.
Yes, I suppose it makes sense to link ortholog genes through an "ortholog" property, though of course, if we have many species and link all ortholog pairs across all species, we get something almost as redundant as the old interwiki system that Wikidata is supposed to avoid ;). --Zolo (talk) 07:36, 23 June 2013 (UTC)
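
The two options weighed in this thread can be put side by side in a small Python sketch. It uses the thought experiment from above (reelin interacting with VLDL receptor in both species but with APP only in human), which was explicitly hypothetical, so none of this is a real biological claim. The dictionary layout, the `species` key and the `ortholog of` key are assumptions for illustration, not the actual Wikidata data model.

```python
# Hypothetical data taken from the thought experiment in this thread;
# none of it is a real biological claim or real Wikidata content.

# Option 1: one item, every statement carries a species qualifier.
single_item = {
    "label": "reelin",
    "statements": [
        {"property": "interacts with", "value": "VLDL receptor", "species": "human"},
        {"property": "interacts with", "value": "VLDL receptor", "species": "mouse"},
        {"property": "interacts with", "value": "APP", "species": "human"},
    ],
}

# Option 2: one item per species, linked by a proposed "ortholog of" property.
separate_items = {
    "human reelin": {
        "interacts with": ["VLDL receptor", "APP"],
        "ortholog of": ["mouse reelin"],
    },
    "mouse reelin": {
        "interacts with": ["VLDL receptor"],
        "ortholog of": ["human reelin"],
    },
}


def partners_from_qualifiers(item, species):
    """Filter a shared item's statements by their species qualifier."""
    return [
        s["value"]
        for s in item["statements"]
        if s["property"] == "interacts with" and s["species"] == species
    ]


def partners_from_items(items, label):
    """Read interaction partners directly off a species-specific item."""
    return list(items[label]["interacts with"])


print(partners_from_qualifiers(single_item, "mouse"))
print(partners_from_items(separate_items, "mouse reelin"))
```

Both layouts answer the same question. The difference is where the bookkeeping lives: in option 1 every statement needs a qualifier, while in option 2 the species is fixed once per item and the cost moves to maintaining the ortholog links.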

Interesting discussion. I think this issue boils down to whether it is better to have a hierarchical or a relational data model for this type of data. A hierarchical model might make sense if the hierarchy were immutable and not subject to change over time. The ortholog hierarchy is a product of evolution, which implies that there may be some ambiguities. In cases where the corresponding orthologs within two species only have a single paralog within each of the species, there is no debate that these two proteins are orthologs. However, if there is more than one paralog of a given protein in either of the species, things can get complicated, particularly if the similarity between orthologs is comparable to the similarity between paralogs. I think looking at the HomoloGene database is instructive. Assignment of orthologs is based on a clustering algorithm. The exact assignments of orthologs can and do change slightly over time as more protein sequences from different organisms are added. Hence from a maintenance standpoint, as Andrew has already suggested, I think a relational model is better in this case (separate database entries for orthologs from different species linked by a new ortholog property). Boghog (talk) 09:41, 23 June 2013 (UTC)

Adding the paralog layer is important for the debate. In the future the system will be used for other species; when we start on the plant world, we will have orthologs, with paralogs, that have the same function in different localizations, or different functions in the same compartment. I really think that one entry per gene per species is more relevant in the long term. --Chandres (talk) 20:15, 23 June 2013 (UTC)
For small protein families, putting proteins of different species on one page might look feasible, but I don't think it scales well. So each protein should have its own page. Regarding the ortholog links: As we add more species, the number of necessary links explodes. I'm not really sure if there's a good representation of gene trees in the WikiData model. But in the beginning, adding mouse--human orthologs to the respective protein pages makes a lot of sense. MichaK (talk) 08:00, 24 June 2013 (UTC)
Andrew Su has alerted me to this discussion. We will have the Quest for Orthologs meeting in a few weeks, and I will try to get feedback then. In the meantime, my feeling is that creating ortholog groups makes the most sense: for a given taxonomic level (e.g., LCA of human and mouse) you get the orthologs plus any in-paralogs. This is notably implemented in a clear way in OMA; I'm sure that they can help you if needed. Marcrr (talk) 15:23, 1 July 2013 (UTC)
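
The "explosion" of ortholog links mentioned above can be quantified with a short Python sketch. It makes two simplifying assumptions that are not part of the discussion: exactly one ortholog per species (no paralogs), and one statement per direction of each link. The function names are hypothetical.

```python
def pairwise_ortholog_links(n_species):
    """Reciprocal statements needed if every ortholog item links to every other one."""
    return n_species * (n_species - 1)


def group_membership_links(n_species):
    """Statements needed if each ortholog item links only to one shared group item."""
    return n_species


for n in (2, 10, 100):
    print(n, pairwise_ortholog_links(n), group_membership_links(n))
```

Under these assumptions the pairwise scheme grows quadratically and the group scheme linearly, which is consistent with starting from human-mouse pairs and moving to ortholog groups as more species are added.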

Hi, when you are done with this decision, could you add a word about it in Help:Modeling#Molecular_biology? This help page is precisely intended to index and discuss such modeling decisions (it will probably be split in the future). It would be nice to have some use-case items, to list relevant properties, and even to detail an example, and/or a link to the page on this project where this decision will be documented. TomT0m (talk) 14:03, 30 June 2013 (UTC)

Great idea. Though is there a formal way to describe a data model? We've so far been doing everything by example (e.g., RELN (Q414043)), but something more formal would certainly be good too... Cheers, Andrew Su (talk)
If we'd like to formalize the data model, I think a good approach would be to keep working out basic properties as we have been, then express that in OWL. OWL is the lingua franca for speaking about ontologies for the Semantic Web, which in my opinion is what we're building (often unwittingly) here and throughout Wikidata. Some relevant literature and resources on OWL and biological ontologies:
One of the high-level points I've taken from the literature is that the Gene Ontology's "is-a" relationship is modeled with the OWL property rdfs:subClassOf. Wikidata has a property explicitly based on that W3C recommendation: subclass of (P279). My initial impression is that it would work for all the gene and protein items to be created to have the statements "subclass of (P279) gene (Q7187)" and "subclass of (P279) protein (Q8054)". Emw (talk) 17:29, 30 June 2013 (UTC)
Good idea to try to formalize the model. I'll start an RfC to try to establish a common language on Wikidata for how to use instance of and subclass of to establish a formal type model; this will support this discussion. I have a few ideas, but it might take a few days before it is really up. I think this will be a good point in the discussion you have been trying to start for quite a while :) TomT0m (talk) 10:48, 1 July 2013 (UTC)
Concerning OWL and XML, please note that there is a workgroup of the Quest for Orthologs trying to establish standards to represent orthology relations: http://questfororthologs.org/standards Marcrr (talk) 15:25, 1 July 2013 (UTC)

Pathways

Hi everyone! New member of the community here (writing from the Amsterdam Wikimedia Hackathon!). I was wondering if I could propose some properties not only for genes and proteins, but also for pathways. I really think this could be a good addition. A lot of pathway data is already available, but not very structured. What are your opinions on this? If you think it is a good idea, I can create some examples here. TWillemsen (talk) 18:37, 24 May 2013 (UTC)

I think pathway information would be fantastic! And there are a few structured resources for pathways -- Pathway Commons, Wikipathways, KEGG, etc. You're probably already familiar with them, but just to be sure... Let us know how it goes! Cheers, Andrew Su (talk) 00:25, 25 May 2013 (UTC)
Yeah, I've worked with those pathways. But in my experience those resources are not really data-oriented, for good reasons of course. Anyway, I'll be proposing some properties for pathways here this weekend. I'm still very new to Wikidata, so please let me know if I don't follow guidelines :) --TWillemsen (talk) 09:41, 25 May 2013 (UTC)

Re: Support for Property Creations

Hi everyone! For the GSoC Gene Wiki project [3], we have proposed a set of properties [4] to capture the fields of the infobox [5]. Kindly take part in the property proposals [6] through your comments and/or support. --Chinmay26 (talk) 19:28, 18 June 2013 (UTC)

I can create the property as soon as there is more support. Maybe a lot of people don't have watch-list-email-notifications turned on. --Tobias1984 (talk) 10:15, 21 June 2013 (UTC)

Proposal for handling genes and proteins, and species-specific orthologs

In an attempt to summarize the consensus that I think we're reaching here and here, I propose (well, mostly reiterating and formalizing Emw's proposal) that the data from a single PBB template on Wikipedia be separated out into four Wikidata items: the human gene, the human protein, the mouse gene, and the mouse protein. Later, I think we can consider separating out the RNAs as well, but I don't think this is justified at the moment since there are few (if any) RNA-specific statements. Please lodge your support or opposition below... Cheers, Andrew Su (talk) 17:35, 25 June 2013 (UTC)

Support

Oppose

More comments

Please make sure you've reviewed the comments already made here and here.

Just to make sure it is clear, please note that User:Chinmay26 is a GSoC intern this summer. So even if you think this basic model will need to be tweaked in the future, getting consensus around the plan above will allow Chinmay to get started building the basic pieces of infrastructure for uploading and maintaining genomic data... Cheers, Andrew Su (talk) 17:35, 25 June 2013 (UTC)

Distinguishing enzymes and gene products

There's a question about the nature of EC number (P591): Property_talk:P591#Distinguishing_enzymes_and_gene_products. Feedback is welcome! Emw (talk) 00:37, 27 June 2013 (UTC)

Using RefSeq (P656)

I've mocked up the use of RefSeq (P656) over on reelin (Q13569356) and RELN (Q414043), but I'm not convinced this is the best way of structuring things. Any thoughts? Cheers, Andrew Su (talk) 20:58, 29 June 2013 (UTC)

Is there a reason we need GenBank accessions for Wikidata items? The GenBank accessions used to derive a given RefSeq accession are noted in the latter's 'COMMENT' field, see e.g. NM_011261. So my impression is that it's probably extraneous and unnecessary to include GenBank accessions in 'RNA ID' or 'Protein ID' claims, in which case those properties could be renamed to 'RefSeq RNA ID' and 'RefSeq protein ID' and we could do away with RefSeq (P656) (which is currently used as a qualifier). Emw (talk) 05:36, 30 June 2013 (UTC)
Yeah, I'm not 100% sure that we need it either. Perhaps for organisms that don't have strong RefSeq support. Perhaps we shouldn't worry about that case for now? Are there other sources of non-RefSeq RNA and protein sequences that are important? Where should we put the Ensembl transcript IDs (ENST*) and Ensembl protein IDs (ENSP*)? Not sure about this... Cheers, Andrew Su (talk) 15:49, 30 June 2013 (UTC)
I agree that we probably don't need to worry about organisms that don't have strong RefSeq support for now. If we'd like Ensembl IDs for transcripts and proteins, then I think it would make sense to have separate properties for each of those. Emw (talk) 16:56, 30 June 2013 (UTC)
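
If the properties were renamed to 'RefSeq RNA ID' and 'RefSeq protein ID' as suggested above, a bot could route an accession to the right property from its prefix alone. The Python sketch below assumes the standard RefSeq accession prefixes and covers only a subset of them; the function name and the returned labels are hypothetical, and the two accessions printed are the ones quoted in this thread.

```python
# Subset of RefSeq accession prefixes; other prefixes exist and return None here.
REFSEQ_PREFIX_TYPE = {
    "NM": "RNA",      # curated mRNA
    "NR": "RNA",      # curated non-coding RNA
    "XM": "RNA",      # predicted mRNA
    "XR": "RNA",      # predicted non-coding RNA
    "NP": "protein",  # curated protein
    "XP": "protein",  # predicted protein
    "NG": "genomic",  # genomic region
    "NC": "genomic",  # complete genomic molecule
}


def refseq_molecule_type(accession):
    """Return 'RNA', 'protein' or 'genomic' for a RefSeq-style accession, else None."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIX_TYPE.get(prefix.upper())


print(refseq_molecule_type("NM_011261"))
print(refseq_molecule_type("NG_011877.1"))
```

Identifiers without an underscore, such as Ensembl IDs, fall through to `None`, which fits the suggestion that Ensembl transcript and protein IDs get separate properties of their own.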
(And shouldn't RELN (Q414043) and reelin (Q13569356) be switched? The Wikipedia article is about the protein, but the Wikidata item with the sitelinks is about the gene.) Emw (talk) 05:49, 30 June 2013 (UTC)
Well, here's where Wikipedia's semantic ambiguity makes things difficult. The WP article combines information about the gene and the corresponding protein. That was a conscious decision that we made at WP:MCB. Ultimately the infobox template for reelin will need to draw from all four reelin-related wikidata items (human/mouse gene/protein). So I think the link as it stands is fine, but certainly open to more discussion on how best to handle things... Cheers, Andrew Su (talk) 15:49, 30 June 2013 (UTC)
I also think that the items should be switched. Even though Wikipedia infoboxes are both about the gene and the protein, the textual content is primarily about the protein, and that is true for all languages. Actually fr:Reelin hardly even mentions the gene. --Zolo (talk) 22:00, 1 July 2013 (UTC)
I moved the sitelinks to Q13569356. --Zolo (talk) 15:31, 3 July 2013 (UTC)

New properties needed

Given the format that has been agreed upon, we need some additional properties:

  • ortholog of
  • encoded by and the symmetric property "encodes". Or just one of them ?
  • any others ?

--Zolo (talk) 15:34, 3 July 2013 (UTC)

I think we need "Taxonomy ID" to note that human (Q5) is 9606. And yes, I do think we should have the reciprocal links. The brother and sister properties are used on both ends of that relationship, so that's a similar situation, right? Cheers, Andrew Su (talk) 23:06, 3 July 2013 (UTC)
I added them to Wikidata:Property proposal/Term#Biochemistry and molecular biology / Biochemie und Molekularbiologie / Biochimie et biologie moléculaire. -Zolo (talk) 07:26, 4 July 2013 (UTC)
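
Keeping reciprocal links in step is mostly a bookkeeping problem, which the following Python sketch illustrates. The property names are the ones proposed in this section; the pairing table, the in-memory store, the item labels and both helper functions are assumptions for illustration, not an existing Wikidata or bot API.

```python
from collections import defaultdict

# Proposed property names from this section; the pairing table is an assumption.
INVERSE = {
    "encodes": "encoded by",
    "encoded by": "encodes",
    "ortholog of": "ortholog of",  # symmetric: it is its own inverse
}


def add_reciprocal(store, subject, prop, obj):
    """Add `subject prop obj` and the matching inverse statement on `obj`."""
    store[subject][prop].add(obj)
    store[obj][INVERSE[prop]].add(subject)


def missing_inverses(store):
    """List (subject, prop, obj) statements whose inverse statement is absent."""
    return sorted(
        (subj, prop, obj)
        for subj, props in store.items()
        for prop, objs in props.items()
        for obj in objs
        if subj not in store.get(obj, {}).get(INVERSE[prop], set())
    )


store = defaultdict(lambda: defaultdict(set))
add_reciprocal(store, "human RELN gene", "encodes", "human reelin protein")
add_reciprocal(store, "human RELN gene", "ortholog of", "mouse Reln gene")
# A one-sided statement, added on purpose to show what the check reports.
store["mouse Reln gene"]["encodes"].add("mouse reelin protein")

print(missing_inverses(store))
```

A check like `missing_inverses` is the price of storing both directions. Storing only one of "encodes" or "encoded by" would avoid it, at the cost of a reverse lookup whenever the other direction is needed.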

Sourcing requirements for bots

There's an RFC that's relevant for the GSoC project: Wikidata:Requests_for_comment/Sourcing_requirements_for_bots. Emw (talk) 17:18, 6 July 2013 (UTC)

gene, RNA, and protein identifiers

All, as we move forward with modeling gene and protein items, I want to be sure we have consensus. Right now, we are generally following a model that has database-specific properties:

However, I think there is an argument to have just three properties ("Gene ID", "RNA ID", and "Protein ID"), where the different types of identifiers are differentiated by the "Source". For example, RELN (Q414043) could have a property "Gene ID" --> "5649" (Source: National Center for Biotechnology Information (Q82494)), and also "Gene ID" --> "ENSG00000189056" (Source: Ensembl (Q1344256)). I tend to like the simplicity of this system because it will prevent explosion of the number of properties, especially as we move to other organisms (Flybase, Wormbase, Pombase, RGD, etc...). Thoughts? Cheers, Andrew Su (talk) 18:27, 9 July 2013 (UTC)

There is a RfC about this topic running right now: RfC:How to classify items. --Tobias1984 (talk) 19:43, 9 July 2013 (UTC)
Yes, in order to prevent an explosion of properties, I think it would make sense to have fewer, more fundamental properties. We would of course need to agree on a standard "base" name. For genes, the most natural would be Human Genome Project (Q192446) and proteins, UniProt ID (P352). Boghog (talk) 20:05, 9 July 2013 (UTC)
The Classification RFC linked above seems mostly unrelated to this discussion. That discussion concerns whether we should use many domain-specific "type of" properties or two such properties to construct instance relations and subsumption hierarchies. (I happen to think we should use the latter approach.) This discussion seems to be about how to handle identifier properties, not basic membership properties.
For identifier properties, there seems to be more precedent on Wikidata to have each separate ID as its own property, rather than grouping identifiers by type and then individuating them with qualifiers. For example, we've got several very popular properties VIAF identifier (P214), GND identifier (P227), LCNAF identifier (P244), NDL identifier (P349), and BnF identifier (P268). Each of those properties is really describing the same type of thing: an authority file ID. However, instead of having a single "authority ID" property with one qualifier per authority (e.g. OCLC, GND, LC, NDL, BNF), you can see in e.g. On the Origin of Species (Q20124) that each identifier has its own property.
That said, type-specific organization of properties as suggested by Andrew seems like it could be a good idea. Here are a few questions and comments:
  1. How would identifiers from different databases operated by the same organization be qualified? For example, the gene ID properties Entrez Gene ID (P351) and Homologene ID (P593) come from different databases -- Gene and HomoloGene -- operated by the same organization, National Center for Biotechnology Information (Q82494). So sourcing at least gene ID strictly by organization doesn't seem feasible. In cases where a biological sequence (e.g. a gene, RNA or protein) has multiple identifiers from the same organization, should that qualifier simply point to the database instead of the organization, for example Gene and HomoloGene (Q468215)? (The problem doesn't necessarily go away if we don't consider "HomoloGene ID" to be a "gene ID". If we wanted to represent both reelin's NCBI Gene ID 5649 and NCBI RefSeq ID NG_011877.1 with a "Gene ID" property, how would we do that?)
  2. Is having "RNA ID" and "Protein ID" on a gene item redundant with the proposed "encodes" property? This seems like it could have implications on how we organize our ID properties for biological sequence items.
I'll end my comment by pointing out that Andrew's proposal, or something like it, might enable an even simpler way of handling IDs. If we were to organize these sequence identifier properties by type, then we might be able to designate one "preferred ID" among the set of IDs for each property with Wikidata's upcoming claim ranking feature. This would allow the statement with the preferred ID to be shown to all users and displayed in Wikipedia infoboxes by default. The various ID properties are probably all of equivalent accuracy in themselves (so our usage of ranking would deviate slightly from the feature's official description), but this might be a nice added benefit. Emw (talk) 04:26, 10 July 2013 (UTC)
A trivial solution to question #1 above is to assign "source" to a specific database instead of an organization. For example:
Thanks for pointing out the "claim ranking" feature which looks very useful. To rephrase what I stated above, IMHO the "preferred id" sources for genes and proteins should be Human Genome Project (Q192446) and UniProt ID (P352) respectively. It is not so clear what the "preferred id" would be for mRNA however. Boghog (talk) 09:32, 10 July 2013 (UTC)
Sorry for the "slow" reply (only relative to you all!)... Just to get caught up quickly in bullet point style...
  1. Tobias, the linked RfC is interesting indeed. However, I personally find it to be too abstract. I'd propose that we focus on coming up with the best solution for this particular corner of Wikidata, and worry about the implications/relationship to the rest of Wikidata later. Otherwise, we run the risk of paralysis. Sound reasonable?
  2. I agree with Boghog that we should just create an item for every unique source.
  3. Boghog, I'm not sure I 100% understand what you mean by "base name" / "preferred ID". Especially as it relates to Human Genome Project (Q192446). Can you clarify?
  4. Regardless, I think Boghog and Emw are supportive of the alternate plan above. Tobias (and anyone else who's interested), do you have any objections or refinements? In tangible terms, I think the game plan would involve:
  • creating a "Gene ID" property (we already have RefSeq RNA ID (P639) and RefSeq Protein ID (P637))
  • creating items for any database identifier providers that don't already exist (to start, "HUGO Gene Nomenclature Committee (HGNC)", "NCBI Entrez Gene")
  • migrate all the uses of the DB-specific properties (e.g., UniProt ID (P352), HGNC ID (P354)) to the new system.
  • eventually propose deletion of the DB-specific properties
Any thoughts/refinements/dissent? Cheers, Andrew Su (talk) 01:18, 11 July 2013 (UTC)
A few comments:
  • I think we should consider how the proposed "encodes" property relates to this alternate plan. It seems the alternate plan would have statements for "Gene ID", "RNA ID" and "Protein ID" in each gene item. However, isn't that information redundant with "encodes"?
  • Since Human Genome Project (Q192446) isn't a sequence database I don't think it makes sense to use it as a source for identifier properties. (Yes, HGP is a high-level project that has generated much of the underlying biological sequence data for the human genome, but it's not the entity asserting that, say, reelin has any particular identifier.)
I agree with Boghog's statement that these ID properties should be sourced to a biological database, not an organization. I think for our purposes we can consider items with Wikipedia pages using 'Infobox biodatabase' to be valid sources for these ID properties. Emw (talk) 02:07, 11 July 2013 (UTC)
Responding to Andrew's question above, what I meant by "base name" / "preferred ID" is for situations where multiple databases provide an equivalent data field (e.g., gene name), and where the value (e.g., reelin) may not be identical between databases. Therefore we should indicate which database provides preferred values for a given data field. For example, many databases provide gene names (HUGO, NCBI gene, etc.). Furthermore HUGO provides an approved gene name, which most other databases including NCBI gene replicate. However not all databases may use the currently approved HUGO gene name. Hence the need to specify a "preferred ID" (if I have interpreted Emw correctly). Does that make sense? Boghog (talk) 04:31, 11 July 2013 (UTC)
Regarding the "Encodes" / "Encoded by" proposed properties, I still think those are relevant for linking the gene item to the protein item. I didn't mean to suggest that statements for "Gene ID", "RNA ID" and "Protein ID" would all appear in the gene items. Rather, I think "Gene ID" would show up on gene items, and "Protein ID" would show up on protein items. (RNAs of course are the sticky one. I'd propose that in general, "RNA ID" shows up under the gene object, unless the RNA has a defined function in which case we break it out as its own item. But these would be edge cases.) Everything else you both stated makes sense to me, and I agree... Andrew Su (talk)
Thanks for the clarification -- that works for me. One very minor note, though: if "Gene ID" would only be used on gene items and "Protein ID" on protein items, then would it be simpler to just say "Sequence ID" when referring to the ID of the current gene or protein item? This seems like it would be similar to the approach most sequence databases take. For example, when referring to the ID of the "current" sequence on a given record, RefSeq, GenBank and UniProt simply say "accession" for the sequence ID, rather than "gene accession" or "protein accession" (see here, here and here). If this seems like it has notable drawbacks, then "Gene ID" and "Protein ID" seem fine to me. Emw (talk) 05:32, 11 July 2013 (UTC)


There are basically two extremes Wikidata could use (correct me if I'm wrong)

  • (1) Use a property "ID" for all identifiers.
  • Pro: Very few properties
  • Contra: No lists for constraint violations, no way of finding out if each item has every identifier, hard to query for Wikipedia infoboxes, hard to construct links to those databases
  • (2) Create properties for all identifiers.
  • Pro: Lists of constraint violations, lists that show if each item has each property, easy for Wikipedias to get infobox information, easy to construct a URL from "base-URL" + "identifier"
  • Contra: Very many properties
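As a rough illustration of two of the "Pro" points for option (2), the following Python sketch builds a link from a base URL plus an identifier and lists which expected properties an item lacks. The `URL_TEMPLATES` table and both helper functions are hypothetical; the single URL pattern is taken from the NCBI gene link quoted elsewhere on this page.

```python
# Property ID -> URL pattern. Only one entry is filled in; further patterns
# would have to be confirmed against each database (assumption).
URL_TEMPLATES = {
    "P351": "http://www.ncbi.nlm.nih.gov/gene/{id}",  # Entrez Gene ID
}


def build_url(prop, identifier):
    """Return a link for the identifier, or None if no pattern is known."""
    template = URL_TEMPLATES.get(prop)
    return template.format(id=identifier) if template else None


def missing_properties(item_claims, required):
    """List the required properties that an item does not have yet."""
    return sorted(set(required) - set(item_claims))
```

With a single generic "ID" property, both helpers would need to inspect qualifiers on every statement instead of looking up a property directly.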

A possible solution would be to allow for properties to be nested too. So all the gene properties could be a subclass of "Gene ID" and "GeneID & RNA-ID & ProteinID" would be a subclass of "sequence ID" and "sequence ID" would be a subclass of "ID". I don't know if there are any plans to implement this here, but I think somebody once said in the ProjectChat that that was the way SemanticWeb handles these problems.
I personally think that the identifiers are not that important. The true potential of Wikidata is the item and number datatype that will allow us to create an unbelievable mesh of interlinked data that will in the end be more important than the 50+ identifier-properties that each item will receive sooner or later. --Tobias1984 (talk) 09:59, 11 July 2013 (UTC)

Tobias, you raise some good points about the limitations of the proposed system. I think (hope) some of them will end up being non-issues, but they are issues at the moment. So since this is not an undeniably positive move, let's just continue with the status quo and use database-specific properties. We can always make a change later. To help things move along then, I will:
I think that will cover all the identifier properties needed for the current gene infoboxes. Please discuss more if anyone disagrees with any of these changes! Cheers, Andrew Su (talk) 16:30, 11 July 2013 (UTC)

Gene and protein labels and descriptions

Since we're talking about identifiers for genes and proteins, I thought it'd be fitting to also discuss their labels and descriptions.

The proposal / guideline at Help:Label#Disambiguation says "When an article title includes disambiguation in it, either by placing it after a comma or by placing it in parenthesis, the disambiguation should be left out. Disambiguation information should instead be placed in the description field". Some of our items don't follow this styling, e.g. we've got items labeled "reelin (human gene)", "reelin (human protein)" and "Reln (mouse gene)".

Proposed label and description format:

  • Human genes:
Label: HGNC gene symbol, e.g. RELN
Description: human gene
  • Mouse genes:
Label: MGI gene symbol, e.g. Reln
Description: mouse gene
  • Human proteins:
Label: HGNC full name, e.g. reelin
Description: human protein
  • Mouse proteins:
Label: MGI name, e.g. reelin
Description: mouse protein
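The format above could be expressed as a small helper. This is only a sketch of the proposed convention; the function name and its arguments are hypothetical, and a real bot would take the symbol and name from HGNC or MGI.

```python
def label_and_description(species, kind, symbol, full_name):
    """Return (label, description) following the proposed format.

    Genes are labeled with the gene symbol, proteins with the full name;
    the description is always the lowercase "<species> <kind>".
    """
    label = symbol if kind == "gene" else full_name
    return label, f"{species} {kind}"
```

For example, `label_and_description("human", "gene", "RELN", "reelin")` gives the label "RELN" with the description "human gene", while the mouse protein call gives "reelin" with "mouse protein".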

For convenience, the HGNC entry for RELN is here and the MGI entry for Reln is here. What do others think? Emw (talk) 03:22, 11 July 2013 (UTC)

I worry a bit that the label won't be interpretable to a non-scientist, but this is a pretty minor worry. The item is really defined by its statements, so the label and description (I think) are there just for convenience... So bottom line, I like this proposal... Cheers, Andrew Su (talk) 05:05, 11 July 2013 (UTC)

Hi, I'm not sure if this topic has been discussed before. The few times I made gene or protein databases for projects I used ENSEMBL IDs as primary identifiers, as they are usually quite complete and convenient to parse from the data files. However, one problem with ENSEMBL IDs is important to keep in mind. Such an ID is rather meaningless without knowing which ENSEMBL database *version* it belongs to. ENSEMBL IDs change quite frequently across database versions, and keeping a local database up to date, so that a stored ID still refers to the gene/protein it was originally assigned to, isn't trivial (although Ensembl maintains tables that record any such changes). I don't see any mention of a 'version' with the ENSEMBL IDs in Wikidata. The 'source' info of the 'ENSEMBL GENE ID' property links only to 'Ensembl' in general, not to a specific database version, and the ID itself links to the general Ensembl entry, which is the latest version. Does that imply that there is a bot that regularly updates the ENSEMBL IDs in Wikidata to resolve any conflicts that arise? Cheers, Optimale (talk) 11:42, 6 August 2013 (UTC)

Genome assembly database?

Hi,

I'm collecting some data on WP here: http://en.wikipedia.org/wiki/List_of_sequenced_plant_genomes

Can we store all those values in WD? --Dan Bolser (talk) 13:48, 10 July 2013 (UTC)

Yes! Putting genome metadata onto Wikidata is a great idea. Each genome should probably be represented as an item. How are these genomes represented, e.g. are most of them assemblies, sequence maps, or?
Here's a possible mapping for fields in the table at list of sequenced plant genomes; some based on relevant existing properties:
Organism strain: (no label) (P89) (perhaps we should create a property "strain" to support assigning sub-species information through qualifiers)
Family: this is extraneous, since we should be able to deduce it from the above field
Relevance: maybe unnecessary?
Genome size: we should probably use a generic "length" property with units "Mbp" (or whichever order of magnitude of "bp" is most appropriate)
Number of genes predicted: we might want to propose a new property for this
Organization: new property, like above
Year of completion: I would suggest using date of publication (P577) to specify the most precise date possible
Assembly status: what does this mean?
@Emw: Status of the assembly using a controlled vocabulary, described here: wikipedia:Talk:List_of_sequenced_plant_genomes --Dan Bolser (talk) 12:03, 20 May 2014 (UTC)
There are more interesting genome properties to consider, but this is a start. Emw (talk) 02:50, 11 July 2013 (UTC)
Agreed, these would be cool data to add! In addition, I can think of two things that might be nice to include that aren't already in your table. First is the sequence identifier for the sequenced genome. Second is the NCBI Taxonomy ID (P685). Possible to add those columns? Cheers, Andrew Su (talk) 04:53, 11 July 2013 (UTC)
Hello! I started work on this today with lots of help from User:Magnus Manske (I was a bit clueless before). We jumped in and proposed one of the properties in the table (and suggested by User:Emw too), here: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Natural_science#Genome_size
I totally agree that GCA and tax_id would be valuable properties to add to the table (in addition I'm planning some form of quality ontology). Are there already properties for these two identifiers? Cheers, --Dan Bolser (talk) 12:39, 15 April 2014 (UTC)

property support or comments please

Hello all, there are a few proposed properties that could use your support or comments please. Starting with this link, the properties in question are "Encoded by", "Ensembl Transcript ID", and "Ensembl Protein ID". Please contribute your thoughts! Cheers, Andrew Su (talk) 18:01, 16 July 2013 (UTC)

I've supported those properties, which all make sense to me. Are there properties that the GSoC project needs but doesn't have yet? Emw (talk) 03:22, 17 July 2013 (UTC)
In general we still need a lot of item-properties that link genes and proteins to other items. For example "mutation in gene 1 causes disease A". We could also have more links to neurological and physiological functions. --Tobias1984 (talk) 08:33, 17 July 2013 (UTC)
I think new proposals can be made on a separate WD:Property proposal/Biology or WD:Property proposal/Science, as WD:Property proposal/Term is annoyingly slow to load, and so diverse that there does not seem to be much point in having everything together. --Zolo (talk) 10:36, 17 July 2013 (UTC)
This discussion could take place in project chat or in a new RfC, and be generalised ... TomT0m (talk) 10:52, 17 July 2013 (UTC)
The reasoning behind the limited amount of subpages is that people should actually also review properties from other scientific fields and find properties that overlap. But I can see the problem with the page being huge at the moment (usually 100+ proposals since March). Maybe we should make a subpage for biology and life sciences? --Tobias1984 (talk) 10:56, 17 July 2013 (UTC)
To avoid a huge number of proposals, I would support grouping properties if they make sense together or are similar, and voting on a group of properties instead of property by property. TomT0m (talk) 11:02, 17 July 2013 (UTC)
I trimmed down the page to about 80 proposals. That should help with the load times. It would help to get some votes for Wikidata:Property_proposal/Term#Medicine_.2F_Medizin_.2F_M.C3.A9decine. --Tobias1984 (talk) 20:30, 17 July 2013 (UTC)

Infobox neuron is next. Please vote for the 6 proposals at Wikidata:Property_proposal/Term#presynaptic_connection_.28afferent.29 --Tobias1984 (talk) 09:39, 23 July 2013 (UTC)

Did these properties get accepted? If so can you link? Cheers, --Dan Bolser (talk) 12:42, 15 April 2014 (UTC)
@Dan Bolser: A hopefully complete list of the existing properties is at Wikidata:WikiProject_Molecular_biology/Properties Tobias1984 (talk) 15:24, 15 April 2014 (UTC)

links between genes/proteins, diseases, and drugs

As Tobias mentioned above, it would be great if we could start thinking about how to model the properties that link diseases with genes or proteins. I'd also add links to/from drugs as well. Let me splash down a few thoughts on each of those edge types:

  • gene-disease: primary data source would be OMIM, which has ~4000 links. The relationship type ("mutation causes", "trinucleotide repeat expansion causes", "translocation causes", etc.) looks like it can be parsed out with some effort.
  • gene-drug: primary source would be Drugbank, which has somewhere between 2700 and 10,000 links. No apparent relationship types.
  • disease-drug: primary source would be NDF-RT, which has ~55,000 links. Semantic types include "may_treat" (48000), "may_prevent" (6000), "may_diagnose" (1000), and induces (700).

Questions I would pose to this community are: 1) are there additional/better data sources, and 2) how detailed or generic do we want the properties to be? Cheers, Andrew Su (talk) 23:39, 18 July 2013 (UTC)

Linking genotypes, diseases and drugs would be a great idea. There are many efforts ongoing to link together topics in medicine and genetics in ontologies, and I think doing so in Wikidata will be one of the project's more compelling applications. I'm interested in linking genotypes and phenotypes -- especially diseases. For that, in addition to OMIM, there are also relevant resources in the Human Phenotype Ontology (HPO), MedGen and ClinVar.
HPO has annotations of all the clinical entries in OMIM, and includes gene-to-phenotype mappings. Importantly, HPO also puts this information into an ontology available in OWL, the standard language for representing ontologies: see HPO downloads, hp.owl. We might even be able to import HPO en masse to Wikidata, where it could be linked to other important domains beyond genetics and disease classification.
MedGen is a new project to aggregate information for medical genetics (docs). A MedGen search for RELN shows how the project can map gene names to a page incorporating information from HPO, OMIM, and relevant structured data on clinical features, phenotype hierarchies, etc. Further, if we're interested in how genetic variants are associated with clinically significant phenotypes, ClinVar can tell us that -- see for example the ClinVar entry for RELN. Emw (talk) 03:20, 19 July 2013 (UTC)
Fantastic, thanks for the feedback! A few notes on each of the resources you mentioned:
  • Human Phenotype Ontology: We've actually had quite a few offline discussions with the HPO group, and there is a mutual commitment to work together. I think they would argue that HPO does not deal with diseases per se, but more with the observable clinical symptoms that patients have. So we absolutely were going to link diseases (as indexed by the Disease Ontology) with the related phenotypes (in HPO). Since phenotypes uniquely apply to diseases, I was going to wait on that until after we had our initial gene/protein - disease - drug triangle set up.
  • MedGen: Excellent suggestion! A quick scan of this download file shows that they have almost 6000 links between genes and diseases. My guess is that it largely overlaps OMIM, but even if it only made OMIM available in a more easily-parsable format, that would be a big win. One nice thing is that MedGen also tracks the PMIDs for each link here.
  • ClinVar: This file has 7826 unique links between genes and diseases, nicely parsed via Entrez Gene ID and UMLS CUIs. Again, might largely overlap OMIM, but it's definitely worth adding to the mix! Obviously ClinVar also annotates specific genetic variants. I think annotation of specific variants will be something that we tackle in the future, but for the moment I think we will have our hands full with just the gene-disease links.
Thanks for the great ideas! (And agreed, I'm very confident that getting all these data into Wikidata will lead to many biological killer apps...) Cheers, Andrew Su (talk) 23:08, 19 July 2013 (UTC)
That roadmap sounds really exciting Andrew and Emw. I think it would also be interesting to crosslink and use http://humanmetabolism.org/ as a source. --Tobias1984 (talk) 09:37, 23 July 2013 (UTC)

New ProteinBoxBot Bot flag request

Hi everyone! I have submitted a request for a bot flag here. The test runs are here. Kindly chime in with your thoughts. Chinmay26 (talk) 18:05, 19 July 2013 (UTC)

A week has passed. Ymblanter is waiting for final comments so the bot can be approved. --Tobias1984 (talk) 15:35, 26 July 2013 (UTC)
If you have experience with bots, please also review Wikidata:Requests_for_permissions/Bot/Chembot. We have a lot of overlapping properties with the chemistry task force. --Tobias1984 (talk) 14:22, 29 July 2013 (UTC)

ProteinBoxBot edits

Hi, I have run ProteinBoxBot on the first 3 proteins under human proteins. The bot does not yet handle EC Classification, Gene_atlas image, or Alias. Sample item: www.wikidata.org/wiki/Q411507. The bot also does not update the appropriate qualifiers yet (working on it). I just wanted to run this by the community to confirm or clarify whether there are any issues with the edits. Chinmay26 (talk) 22:32, 29 July 2013 (UTC)

Thanks Chinmay. I looked over ProteinBoxBot's recent contributions and it seems like things are coming along. A few comments on Cyp21a1 (Q14358793) that generalize to other gene/protein items:
  • The description field reads "Mouse Gene". This should be lowercase -- "mouse gene". Same for "human gene", "human protein", and "mouse protein", since none involve proper nouns.
  • The RefSeq RNA ID (P639) claims should only be used for RefSeq accessions. Per the "Distinguishing Features" section of http://www.ncbi.nlm.nih.gov/refseq/about/, RefSeq accessions all contain underscores -- for example, NM_009995 is a RefSeq accession, but AI323066 is not (it's a GenBank accession).
Below is a template I would suggest we use for these various RefSeq properties. The "valid accession prefixes" constraints are derived from the official table mapping RefSeq accession numbers to molecule types.
| Property | Valid accession prefixes | Molecule type | Should be used on Wikidata items that are subclasses of... | Example usage in reelin items |
| --- | --- | --- | --- | --- |
| RefSeq (P656) | NG_, NT_, NC_, AW_, NW_, NS_, NZ_ | genomic DNA | gene (Q7187) | RELN (Q414043) RefSeq (P656) NG_011877.1 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NG_") |
| RefSeq RNA ID (P639) | NM_, NR_, XM_, XR_ | RNA | gene (Q7187) | RELN (Q414043) RefSeq RNA ID (P639) NM_005045.3 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NM_") |
| RefSeq Protein ID (P637) | NP_, AP_, YP_, XP_, ZP_ | protein | protein (Q8054) | reelin (Q13569356) RefSeq Protein ID (P637) NP_005036.2 (per http://www.ncbi.nlm.nih.gov/gene/5649, Ctrl-F for "NP_") |
Overall things seem to be making good progress! Emw (talk) 03:48, 30 July 2013 (UTC)
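The prefix constraints in the template above lend themselves to a simple automated check before a bot writes a claim. The sketch below is only an illustration, not ProteinBoxBot's code: `VALID_PREFIXES` copies the prefixes exactly as listed in this thread, and `is_valid_accession` is a hypothetical helper name.

```python
# Property ID -> accession prefixes accepted for that property,
# copied from the template in this discussion.
VALID_PREFIXES = {
    "P656": ("NG_", "NT_", "NC_", "AW_", "NW_", "NS_", "NZ_"),  # genomic DNA
    "P639": ("NM_", "NR_", "XM_", "XR_"),                       # RNA
    "P637": ("NP_", "AP_", "YP_", "XP_", "ZP_"),                # protein
}


def is_valid_accession(prop, accession):
    """True if the accession starts with a prefix allowed for the property."""
    # str.startswith accepts a tuple; an unknown property yields an
    # empty tuple, which never matches.
    return accession.startswith(VALID_PREFIXES.get(prop, ()))
```

Under this check, the GenBank accession AI323066 mentioned above would be rejected for RefSeq RNA ID (P639), while NM_005045.3 would pass.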
Small point, but could you also add descriptions in other languages:
en:human gene
de:menschliches Gen
es:gen humano
fr:gène humain
it:gene umano
zh:人类基因
I would like to add (Sarilho1 (talk) 14:36, 30 July 2013 (UTC)):
pt:gene humano
pt-br:gene humano

Properties

physically interacts with (P129) is still missing examples, constraints and a description of its scope. We should decide if this proposal (Wikidata:Property_proposal/Term#drug_interaction_.28en.29_.2F_Arzneimittelwechselwirkung_.28de.29) needs to be a separate property or not. --Tobias1984 (talk) 18:23, 31 July 2013 (UTC)

Another proposal needing attention: drug action altered by --Tobias1984 (talk) 12:54, 2 August 2013 (UTC)

Discussion about drug-drug interaction qualifiers

I would like to invite the participants of this project to give their opinion in this discussion: Wikidata_talk:Medicine_task_force#drug-drug_interaction. --Tobias1984 (talk) 19:39, 6 August 2013 (UTC)

reelin and RELN

Human Gene:

Mouse Gene:

Human

Mouse:

Wouldn't it be a good idea to connect reelin (human) and reelin (mouse) to reelin (in general)? The sitelinks could go into that general item, and statements that are true for both would only need to go on the parent item. The same could be done for the gene. --Tobias1984 (talk) 11:33, 26 August 2013 (UTC)

Sorry I missed this post. I personally think we should not have items for "in general". Human reelin and mouse reelin are real and tangible things, and I think the abstraction will create more problems than it's worth. For example, what is true for "in general" undoubtedly differs depending on what species you're considering. One claim that is true for human, mouse and rat may not be true for fly, and of course reelin is probably not present in the genomes of many lower organisms. My two cents... Cheers, Andrew Su (talk) 23:04, 10 September 2013 (UTC)

Duplicates ProteinBoxBot

We probably need to discuss some items that ProteinBoxBot is currently creating. Some items might be duplicates we can merge, others are distinct concepts. Merging needs to be done carefully, in order not to break all the links. The first duplicate I found is: (no label) (Q14330657), aging (Q332154). In my opinion those two could be merged. Any opinions? --Tobias1984 (talk) 17:57, 10 September 2013 (UTC)

Hi Tobias, you raise a very good point. Chinmay has put in quite a few checks to avoid creating duplicates. Of course, no system is perfect (especially with the somewhat incomplete query API at the moment), so letting us know when you run across them is very useful. Speaking on those two examples in particular... I think the ageing/aging issue would be difficult to detect. The spelling difference prevents us from doing an exact string match, although I think that if it was added as an alias then Chinmay's program would have detected the existing item. (In theory, we may have been able to use the MeSH ID to make the match, and if that becomes a common theme then we can add that feature.) I think the second example for endoplasmic reticulum exists because Q14327640 was created before the redundancy checking was in place... I think that is not a problem in any of the most recent runs. Unless you have any objection, I'll go try to use the merge tool to fix these? Cheers, Andrew Su (talk) 23:00, 10 September 2013 (UTC)
(Edit conflict) I was about to say much the same. I added "aging" (American English) as an alias of the item labeled "ageing" (British English). Gene Ontology terms use American English, but the Wikipedia articles on a good proportion of biomedical subjects use British English titles. Adding the bolded text in article leads as an alias seems like it would solve this problem in one fell swoop, but given the GSOC time constraints I suspect we'll need to do more manual data input. Emw (talk) 23:13, 10 September 2013 (UTC)
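The alias-based matching discussed here can be sketched in a few lines of Python. This is not Chinmay's actual duplicate check; the item structure and the `normalize` and `find_existing` helpers are assumptions made for illustration. It shows why adding "aging" as an alias lets an exact (case-insensitive) string match find the existing "ageing" item.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.casefold().split())


def find_existing(candidate_label, items):
    """Return the ID of an item whose label or alias matches, else None.

    items: dict mapping item ID -> {"label": str, "aliases": [str, ...]}
    (a hypothetical in-memory stand-in for a Wikidata search).
    """
    wanted = normalize(candidate_label)
    for qid, item in items.items():
        names = [item["label"], *item.get("aliases", [])]
        if wanted in {normalize(name) for name in names}:
            return qid
    return None
```

Without the alias, a lookup for "aging" against an item labeled only "ageing" returns nothing, and a duplicate would be created.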
@Andrew Su: Will it mess up the bot if we delete them now, or will it use the other item automatically? --Tobias1984 (talk) 07:01, 11 September 2013 (UTC)
Another one? cell-cell signaling (Q14758911) and cell signaling (Q210973) --Tobias1984 (talk) 12:03, 11 September 2013 (UTC)
We should probably always keep the item with the lower number and we also have to change all the incoming links:
Changing the links is now pretty easy with User:BeneBot*/movelinks.js. --Tobias1984 (talk) 15:02, 9 December 2013 (UTC)

ProteinBoxBot progress

Hi all, in case anyone is interested in how Chinmay's PBB project is going, you can see his recent efforts at https://www.wikidata.org/wiki/Special:ApiSandbox#action=wbsearchentities&format=json&search=entrez%3A&language=en&type=item&limit=50 or https://www.wikidata.org/w/index.php?title=Special:Contributions/ProteinBoxBot&offset=&limit=500&target=ProteinBoxBot. We're in the home stretch on his GSoC project (which is why we haven't been more active in the discussions) so the current priority is to make sure the code that he's written is robust to many different gene examples. Cheers, Andrew Su (talk) 23:08, 10 September 2013 (UTC)
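For anyone who wants to run the same search outside the API sandbox, the query parameters in the first link can be assembled programmatically. The sketch below only builds the request URL and performs no network call; the endpoint `https://www.wikidata.org/w/api.php` and the `search_url` helper name are not from the post above and should be treated as assumptions.

```python
from urllib.parse import urlencode

# Standard MediaWiki API entry point for Wikidata (assumed here).
API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def search_url(term, language="en", limit=50):
    """Build a wbsearchentities URL mirroring the sandbox link's parameters."""
    params = {
        "action": "wbsearchentities",
        "format": "json",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
    }
    return API_ENDPOINT + "?" + urlencode(params)
```

Calling `search_url("entrez:")` percent-encodes the colon in the search term, matching the `search=entrez%3A` fragment in the sandbox link.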

Hormones and biological process (P682)

Couple of questions. Would it be ok to connect hormones with biological process (P682) to their functions or do we need another property? Example: progesterone (Q26963) = pregnancy (Q11995). Also we need a property for "where the hormone is made in the body". Does anybody know what to call it or is there a property we can recycle? --Tobias1984 (talk) 17:53, 12 September 2013 (UTC)

Endorse funding for wikidata query tool and more?

There is a proposal for funding to build a Wikidata toolkit for developers. If you like it, please head over there and let them know by adding an endorsement. https://meta.wikimedia.org/wiki/Grants:IEG/Wikidata_Toolkit#Endorsements:

Updates and additions to URL mappings for molecular biology properties

I've added a request to improve URL mappings for several of the ID properties we use in gene and protein items at MediaWiki_talk:Gadget-AuthorityControl.js#Updates_and_additions_for_molecular_biology_properties. That link also describes an interesting bug with the URL mapping for Ensembl Gene ID (P594) -- it goes to a page expecting a human gene ID even when it's used on claims for mouse genes. Please take a glance over that and note any comments or questions there. Thanks, Emw (talk) 12:51, 15 October 2013 (UTC)

Property proposal: chromosome

See Wikidata:Property_proposal/Natural_science#chromosome. Emw (talk) 20:56, 14 December 2013 (UTC)

protein binding

Hi, I have trouble labelling and sorting out this item, which is in the list of most-used items without a French label: protein binding (Q14633864). Is it a duplicate of peptide bond? Is it a (class of) bond between proteins? TomT0m (talk) 22:52, 10 February 2014 (UTC)

The concept protein binding (Q14633864) comes from the Gene Ontology (GO), specifically http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0005515. It is a type of binding, which in turn is a type of molecular function. It isn't a synonym of 'peptide bond' nor a subclass of bond between proteins. More information is available in the GO link; more about GO in general is here: http://www.geneontology.org/GO.doc.shtml. Emw (talk) 00:44, 11 February 2014 (UTC)
I saw that link. I am more confused and curious because protein amino acid binding is cited as a synonym. Is it a bond between two proteins, formed by bonding some of their peptides? TomT0m (talk) 17:47, 11 February 2014 (UTC)
Let's compare definitions and a few subclasses of relevant GO terms:
Definition: Interacting selectively and non-covalently with peptides, any of a group of organic compounds comprising two or more amino acids linked by peptide bonds.
Subclasses: beta-amyloid binding, oligopeptide binding, peptide hormone binding
Definition: Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules).
Subclasses: apolipoprotein binding, heat shock protein binding, p53 binding
The difference between 'protein binding' and 'peptide binding' is the difference between proteins and peptides: size. Page 85 of Lehninger: Principles of Biochemistry (4th edition) says "molecules referred to as polypeptides generally have molecular weights below 10,000 and those called proteins have higher molecular weights." And that's the case with the children of 'peptide binding' and 'protein binding'. Beta amyloid has 36-43 amino acids (for comparison, cytochrome C has 104 residues and a weight of 13,000 residues), and so is a peptide. Apolipoprotein E, an apolipoprotein, has 317 amino acids, and so is a protein. Beta amyloid would be an object of peptide binding, and apolipoprotein E would be an object of protein binding.
So both 'protein binding' and 'peptide binding' would involve binding amino acids, just amino acids in a bigger or smaller molecule. Importantly, the definitions for both terms explicitly note that they are non-covalent. A peptide bond is a type of covalent bond, so it would be quite incorrect to synonymize "peptide binding" with "peptide bond". Another important thing to consider is that, per GO, 'binding' is a type of activity, not a bond; 'binding' is a process, not an object. Emw (talk) 02:55, 12 February 2014 (UTC)
Thanks, my high school courses are far away, and were not in English :) So to sum up, proteins and peptides are made of bonds between amino acids, whereas peptide binding and protein binding form peptide and protein complexes, respectively (which may include other kinds of molecules). TomT0m (talk) 11:55, 12 February 2014 (UTC)
Yup, pretty much. Emw (talk) 12:16, 12 February 2014 (UTC)
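For illustration only, the size rule of thumb quoted above can be sketched in a few lines of Python. This is a rough approximation, not part of GO or Wikidata: the function names are hypothetical, and the figure of about 110 daltons per residue is an assumed textbook average, so real molecular weights will differ.

```python
# Rough size-based classification, following the rule of thumb quoted above.
# ASSUMPTION: an average residue weighs about 110 daltons (textbook approximation).
AVG_RESIDUE_MASS = 110
# Molecular weight below which a chain is generally called a (poly)peptide.
PEPTIDE_CUTOFF = 10_000


def estimate_molecular_weight(residue_count: int) -> int:
    """Estimate molecular weight from the number of amino acid residues."""
    return residue_count * AVG_RESIDUE_MASS


def classify_chain(residue_count: int) -> str:
    """Return 'peptide' or 'protein' based on the estimated molecular weight."""
    if estimate_molecular_weight(residue_count) < PEPTIDE_CUTOFF:
        return "peptide"
    return "protein"


# Residue counts taken from the discussion above (42 is within the 36-43 range).
for name, residues in [("beta amyloid", 42), ("cytochrome c", 104), ("apolipoprotein E", 317)]:
    print(name, estimate_molecular_weight(residues), classify_chain(residues))
```

With these assumptions, beta amyloid falls under the cutoff, while cytochrome c and apolipoprotein E fall above it, which matches the split between the children of 'peptide binding' and 'protein binding'.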

Wikidata Infobox on Czech Wikipedia[edit]

Czech Wikipedia is currently interested in adding protein data to their articles. This would be our chance to test our data in the field, by building a Lua infobox that uses Wikidata data. By adding the infobox one page at a time we can slowly work out problems, add sources and gather experience for further deployments. @Hypothalamus: would be our go-to person. Hypothalamus could choose an example page (ideally a page that isn't visited too much, so mistakes can be fixed without anybody noticing). We also still have to find somebody with some Lua infobox experience. --Tobias1984 (talk) 11:26, 25 March 2014 (UTC)

Yes, the original suggestion was targeted to User:Andrew Su here. I suggest we experiment with infoboxes on a short article on Hepcidin. Is there anything I can do at this point? Do you want me to ask at Czech wiki's Village pump if there is anyone with Lua programming skills willing to help? Hypothalamus (talk) 19:35, 7 May 2014 (UTC)

Andrew Su
Genewiki123
Marc Robinson-Rechavi
Pierre Lindenbaum
Michael Kuhn
Boghog
Emw
Chandres
Dan Bolser
Dan Lawson
Kizar
Pradyumna
Chinmay
Timo Willemsen
Salvatore Loguercio
Tobias1984
Daniel Mietchen
Optimale
Mcnabber091
Ben Moore
Klortho
Hypothalamus
Vojtěch Dostál
Gtsulab
Notified participants of WikiProject Molecular biology Tobias1984 (talk) 21:03, 7 May 2014 (UTC)

@Hypothalamus: - If you could find a Lua programmer on Czech wiki that would be great. There are some working examples of infoboxes that would just need adjustment, e.g. the taxobox on Hebrew wiki [7] and the module [8]. This bug still puts some limitations on pulling data from items other than the one connected to the Wikipedia article [9]. But I think that the most important information can be pulled from the item itself. Tobias1984 (talk) 11:34, 8 May 2014 (UTC)
I'd be happy to help. Can read Czech and Lua fine but am lousy at producing them properly. --Daniel Mietchen (talk) 12:05, 9 May 2014 (UTC)
@Daniel Mietchen, Hypothalamus: - I think another example is at https://fr.wikipedia.org/wiki/Module:Infobox and https://en.wikipedia.org/wiki/Wikipedia:Lua - I still have to read up on how this all works. Especially the modules that are required by Wikipedia still confuse me. Tobias1984 (talk) 18:48, 9 May 2014 (UTC)
@Daniel Mietchen, Hypothalamus: - Another example: fr:Module:Infobox/Composé_chimique used on fr:Undéc-1-ène - Daniel, if you have time it would be great if you could help out Hypothalamus and Czech wiki. I still have trouble reading the code and don't really understand what goes where. Tobias1984 (talk) 13:20, 10 May 2014 (UTC)
This is a great initiative and I would be happy to help. I am a native Czech, molecular biologist, but cannot write in Lua. --Vojtěch Dostál (talk) 21:58, 13 May 2014 (UTC)
@Vojtěch Dostál: Welcome to the WikiProject. Help is of course always appreciated and needed at every corner of Wikipedia and Wikidata :) - I see that you haven't made any edits to Wikidata yet. You could start by just looking at the items of some of your favorite articles and seeing what kind of information you can add to them. It is good to understand the data structure, so when the infoboxes are put onto Czech wiki you will already know how to fix mistakes. Ping me if you need any help! -Tobias1984 (talk) 22:31, 13 May 2014 (UTC)
@Tobias1984: I am user Vojtech.dostal but have just renamed my account and did not merge all of them :-). Thank you for your kind offer - as long as I do not dig into the programming - I think I'll be all right. --Vojtěch Dostál (talk) 22:39, 13 May 2014 (UTC)
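For anyone trying to picture what "pulling data from the item itself" involves, here is a minimal sketch in Python rather than Lua, so it is not the infobox module itself. It reads a string value out of a hand-written fragment shaped like Wikidata's entity JSON. The item ID, property ID and value are placeholders, and the helper name is hypothetical.

```python
from typing import Any, Dict, Optional

# ASSUMPTION: a hand-written fragment shaped like Wikidata's entity JSON.
# "Q0", "P0" and "EXAMPLE-ID" are placeholders, not real identifiers.
SAMPLE_ENTITY: Dict[str, Any] = {
    "id": "Q0",
    "labels": {"cs": {"language": "cs", "value": "hepcidin"}},
    "claims": {
        "P0": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P0",
                    "datavalue": {"value": "EXAMPLE-ID", "type": "string"},
                }
            }
        ]
    },
}


def first_string_value(entity: Dict[str, Any], property_id: str) -> Optional[str]:
    """Return the first string value stated for property_id, or None if absent."""
    for statement in entity.get("claims", {}).get(property_id, []):
        snak = statement.get("mainsnak", {})
        # Skip "novalue" / "somevalue" snaks, which carry no datavalue.
        if snak.get("snaktype") != "value":
            continue
        value = snak["datavalue"]["value"]
        if isinstance(value, str):
            return value
    return None


print(first_string_value(SAMPLE_ENTITY, "P0"))
```

A missing property returns None, so an infobox row built on this kind of lookup could simply be left out instead of showing an error.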

I can't work out how to use the Wikidata:WikiProject Molecular biology page[edit]

Sorry if it's just me, but I wanted to add my new proposed property (genome size) there but couldn't work out how to do it. --Dan Bolser (talk) 12:45, 15 April 2014 (UTC)

@Dan Bolser: Do you have trouble with the markup language? The template of the property list together with the table syntax can be confusing at times. I can help you if you describe your problem. Tobias1984 (talk) 15:26, 15 April 2014 (UTC)
@User:Tobias1984: Syntax is fine, I just don't know what properties to use with what templates. --Dan Bolser (talk) 11:33, 20 May 2014 (UTC)