Property talk:P591


Documentation

EC number
classification scheme for enzymes
Description Enzyme Commission number (EC number) is a numerical classification scheme for enzymes, based on the chemical reactions they catalyze. As a system of enzyme nomenclature, every EC number is associated with a recommended name for the respective enzyme.
Represents Enzyme Commission number (Q741108)
Data type String
Template parameter Template:Infobox_enzyme = EC_number
Domain item (note: this should be moved to the property statements)
Allowed values \d\.\d{1,2}\.\d{1,2}\.\d{1,3}
Usage notes ≠ P232
Example Triacylglycerol lipase (Q7839871) → 3.1.1.3
alcohol dehydrogenase (Q410754) → 1.1.1.1
creatine kinase (Q409458) → 2.7.3.2
Source http://www.chem.qmul.ac.uk/iubmb/enzyme/
Formatter URL http://enzyme.expasy.org/EC/$1
Robot and gadget jobs collect data from infoboxes and online databases
Tracking: usage Category:Pages using Wikidata property P591 (Q26249999)
Lists
Proposal discussion Property proposal/Archive/8#P591
Current uses 11,156
Format “\d\.(\d{1,2}|-{1})\.(\d{1,2}|-{1})\.(\d{1,3}|-{1})”: value must be formatted using this pattern (PCRE syntax).
Exceptions are possible as rare values may exist.
List of violations of this constraint: Database reports/Constraint violations/P591#Format
Distinct values: this property likely contains a value that is different from all other items.
Exceptions are possible as rare values may exist.
List of violations of this constraint: Database reports/Constraint violations/P591#distinct values, SPARQL (every item), SPARQL (by value)
Single value: this property generally contains a single value.
Exceptions are possible as rare values may exist.
List of violations of this constraint: Database reports/Constraint violations/P591#single value, SPARQL
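
The Format constraint and the formatter URL listed above can be exercised with a few lines of Python. This is only a minimal sketch: the pattern and the URL are copied from this documentation box, while the function names `is_valid_ec` and `formatter_link` are made up for illustration and are not part of any Wikidata tooling.

```python
import re

# Pattern copied from the Format constraint above (PCRE syntax, also accepted by Python's re).
EC_FORMAT = re.compile(r"\d\.(\d{1,2}|-{1})\.(\d{1,2}|-{1})\.(\d{1,3}|-{1})")

# Formatter URL from the documentation above; "$1" stands for the property value.
FORMATTER_URL = "http://enzyme.expasy.org/EC/$1"


def is_valid_ec(value: str) -> bool:
    """Return True if the whole string matches the Format constraint."""
    return EC_FORMAT.fullmatch(value) is not None


def formatter_link(value: str) -> str:
    """Substitute a validated EC number into the formatter URL (hypothetical helper)."""
    if not is_valid_ec(value):
        raise ValueError(f"not a valid EC number: {value!r}")
    return FORMATTER_URL.replace("$1", value)


print(is_valid_ec("3.4.24.79"))   # True: fully specified EC number
print(is_valid_ec("3.4.21.-"))    # True: "-" is allowed for an unspecified level
print(is_valid_ec("3.4.24"))      # False: the constraint requires all four positions
print(formatter_link("1.1.1.1"))  # http://enzyme.expasy.org/EC/1.1.1.1
```

Note that a three-part value such as "3.4.24" fails the constraint as written; under this pattern a partial classification has to be spelled "3.4.24.-".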


Distinguishing enzymes and gene products

The GNF Protein box template has a field for EC number and this property captures that information as a string. However, I think it might be better to use an 'item' datatype. This is because EC numbers identify a distinct kind of object -- an enzyme -- that has specific properties, namely what type of enzyme it is a subclass of, and what reaction the enzyme catalyzes. For example, pappalysin-1 (Q13107614) -- also known as EC 3.4.24.79 -- is a subclass of (P279) metalloendopeptidase (Q6822865) (EC 3.4.24), and thus a subclass of peptidase (Q212410) (EC 3.4), hydrolase (Q96286) (EC 3) and enzyme (Q8047). The reactions catalyzed by these higher-level enzymes are increasingly general, and thus could be thought to be inherited (and extended) in the derived, more specific classes of enzyme.

Enzymes are not the same as gene products. Some gene products are not enzymes and thus lack an EC number, for example no label (Q7671985), a type of tubulin protein. Some gene products can have multiple enzymatic activities and thus multiple EC numbers, like serine racemase with EC numbers 4.3.1.17, 4.3.1.18 and 5.1.1.18. Also, while all enzymes are gene products, gene products can be other proteins or RNA. For example, RNase P is an enzyme consisting entirely of RNA, and has EC number 3.1.26.5.

If EC number (P591) pointed to a Wikidata item rather than a string, then it might make sense to make the item's label the EC number rather than its EC "accepted name", and make the EC name an alias of the item. Or maybe not. Whichever label EC numbers' items have, I think changing this property's datatype to 'item' would be a good design decision. At the time of this writing, this property is only used on three items, so replacing its usage would be trivial.

What are others' thoughts on changing the datatype of this property to 'item'? Emw (talk) 00:24, 27 June 2013 (UTC)
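
The subclass chain described in the comment above (EC 3.4.24.79 under EC 3.4.24, EC 3.4 and EC 3) follows mechanically from the dotted structure of the number. A rough sketch, assuming that dropping trailing levels is an acceptable way to find the parent classes; `ec_ancestors` is a hypothetical helper, and the labels are the ones quoted in the comment:

```python
from typing import Dict, List

# Labels for the higher-level classes, as quoted in the comment above.
LABELS: Dict[str, str] = {
    "3.4.24": "metalloendopeptidase (Q6822865)",
    "3.4": "peptidase (Q212410)",
    "3": "hydrolase (Q96286)",
}


def ec_ancestors(ec_number: str) -> List[str]:
    """List the progressively more general EC levels above an EC number.

    Trailing "-" placeholders (as in "3.4.21.-") are treated as unspecified levels.
    """
    parts = [p for p in ec_number.split(".") if p != "-"]
    # Drop one trailing level at a time: 3.4.24.79 -> 3.4.24 -> 3.4 -> 3
    return [".".join(parts[:n]) for n in range(len(parts) - 1, 0, -1)]


for level in ec_ancestors("3.4.24.79"):
    print(level, LABELS.get(level, "(no label known here)"))
# 3.4.24 metalloendopeptidase (Q6822865)
# 3.4 peptidase (Q212410)
# 3 hydrolase (Q96286)
```

With a string datatype this chain has to be recomputed from the value each time; with an item datatype it would be stored explicitly as subclass of (P279) statements.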

Hmm, another sticky data modeling question. I actually think we might want to keep this property as a string, and then create a new property that is called something like "EC class" with datatype item. In that system, pappalysin-1 (Q13107614) (the human protein) would have "EC class" that points to a new item for "pappalysin-1" that represents the enzyme class as a concept. (The mouse protein for pappalysin-1, also yet to be created, would also have a property for "EC class" that points to the new pappalysin-1 item.) The pappalysin-1 item (enzyme class) would have a property EC number (P591) with the string "3.4.24.79", and also be a subclass of (P279) metalloendopeptidase (Q6822865). That item metalloendopeptidase (Q6822865) in turn would also have EC number (P591) be "3.4.24", and so on.
I think the only fundamental philosophical difference between the two proposals is that I think your statement "pappalysin-1 (Q13107614) -- also known as EC 3.4.24.79" is imprecise. I think pappalysin-1 (Q13107614) is one instance of a biological concept that has an identifier "3.4.24.79", but that other items (like the mouse/rat/etc. proteins) will also be an instance of that concept. I think this distinction is even more important when the EC annotations aren't so specific. Take for example RELN (Q414043). According to UniProt, the EC number is 3.4.21.-, or "Hydrolase". So I think we should add on RELN (Q414043) a statement that "EC class" is hydrolase (Q96286), and then on hydrolase (Q96286) add a statement that EC number (P591) is "3.4.21.-". My two cents... Hope that was clear... Cheers, Andrew Su (talk) 01:15, 27 June 2013 (UTC)
Interestingly, according to the main contributor to Pappalysin-1 -- the English Wikipedia article for pappalysin-1 (Q13107614) -- that item would represent the enzyme class, and Pappalysin 1 (Q1476411) would represent the protein with that enzyme class (see here and here for more details). The Wikipedia article for Pappalysin 1 (Q1476411) notes the protein has an alias "pappalysin 1". This hints at a problem with using the EC accepted name as the main identifier for enzyme objects: in many cases, an enzyme's EC name effectively overlaps the name of the gene products associated with that enzyme. In other words, the EC class label would be "pappalysin-1" and an alias for the gene product is "pappalysin 1"; these virtually identical identifiers are certain to cause confusion.
Surely there can be one-to-many relationships between an EC number and gene products (either orthologs of the same gene product or similar gene products within the same organism). But I don't see how using the EC class as the item's label would be more precise, since there's a one-to-one mapping between EC class / accepted name and EC number. (The BRENDA interface is somewhat confusing: 'hydrolase' is EC 3, 'peptide hydrolase' is EC 3.4, and 'serine hydrolase' is EC 3.4.21. Each EC name should note its EC number there.)
Whether the EC name or EC number is the item's label, it seems we both agree that: A) the enzyme item should be separate from the gene product item, and B) the EC property used on gene product items should have datatype 'item'. Emw (talk) 04:14, 27 June 2013 (UTC)
Thanks for the clarification. Yes, I misread: pappalysin-1 (Q13107614) refers to the enzyme class, not a specific protein. It somewhat annoys me that EC gets down to that level of granularity when it's pretty much one enzyme (per species) per EC, because it leads to exactly this type of confusion (*). But anyway, neither here nor there. So then in that light, I think we are almost completely on the same page. I agree with your assessment of what we agree on. In terms of where to put the EC number on enzyme class pages, I prefer having a string-based "EC number" property (in addition to or instead of using the EC number in the label) simply because I think that piece of knowledge should be typed with a property (and the label is semantically meaningless). Make sense?
(*) And as an aside, my only quibble with the wording on Pappalysin-1 then is that I think the opening line should be "Pappalysin-1 (...) is a class of enzymes." to reduce confusion... In case Dcirovic happens to be here... Cheers, Andrew Su (talk) 04:40, 27 June 2013 (UTC)
I largely agree with everything said above, but I just wanted to emphasize that strictly speaking, an EC number denotes a reaction and not a protein. There are at least "185 EC nodes with two or more experimentally characterized - or predicted - structurally unrelated proteins" (see PMID:20433725). That is, convergent evolution has produced enzymes with completely unrelated folds that catalyze identical reactions. Boghog (talk) 18:25, 27 June 2013 (UTC)
100% agree, good thing for us to keep in mind. And it does underscore why I think EC goes too deep. "hydrolase", "protease", "metalloendopeptidase" all make sense to me as enzyme classes because they do reflect in my mind a specific enzymatic reaction. But "pappalysin-1"? That to me is a specific protein (which happens to be an enzyme), but not a unique reaction worthy of an EC number. </rant>... Cheers, Andrew Su (talk) 18:54, 27 June 2013 (UTC)
Having an EC name like "pappalysin-1" instead of an EC number like "3.4.24.79" as the label of enzyme items seems reasonable to me. Perhaps we can also have the EC number as an alias, since Wikidata's search feature returns items by either label or alias, and users could foreseeably search by EC number. While labeling enzyme items with the EC name might cause confusion in claims for proteins involving granular EC classifications, it allows us to give higher-level enzyme items like hydrolase (Q96286) human-friendly labels like "hydrolase" (the EC accepted name) instead of labels like (EC 3) (the EC number).
So is this the claim structure we're converging on?:
subclass of (P279) protein (Q8054)
EC name pappalysin-1 (Q13107614)
...
subclass of (P279) metalloendopeptidase (Q6822865)
EC number (P591) 3.4.24.79
enzymatic reaction: "Cleavage of the Met135-Lys bond in insulin-like growth factor binding protein (IGFBP)-4, and the Ser143-Lys bond in IGFBP-5" (Source: http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/24/79.html. This property would be nice to have, but doesn't seem essential for a first-pass implementation.)
...
I think the important thing is for the property that links gene products to enzymes to have an item datatype rather than a string datatype. Emw (talk) 00:42, 28 June 2013 (UTC)
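
To make the proposed structure concrete, here is a small Python sketch of the two kinds of item. It rests on one assumption that the list above does not spell out: that the first group of claims sits on the gene product item Pappalysin 1 (Q1476411) and the second on the enzyme item pappalysin-1 (Q13107614). The class and field names (`EnzymeClass`, `GeneProduct`, `enzyme_items`) are placeholders for illustration, not property names anyone has agreed on.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnzymeClass:
    """Enzyme item: carries the string-valued EC number (P591)."""
    qid: str
    label: str
    ec_number: str
    subclass_of: Optional["EnzymeClass"] = None  # subclass of (P279)


@dataclass
class GeneProduct:
    """Gene product item: links to enzyme items instead of storing a string."""
    qid: str
    label: str
    # A list, because one gene product may have several enzymatic activities.
    enzyme_items: List[EnzymeClass] = field(default_factory=list)


def ec_numbers_of(product: GeneProduct) -> List[str]:
    """Collect EC numbers by following the item-valued links."""
    return [enzyme.ec_number for enzyme in product.enzyme_items]


def lineage(enzyme: EnzymeClass) -> List[str]:
    """Walk the subclass-of chain upward, starting from the given enzyme item."""
    labels = []
    node: Optional[EnzymeClass] = enzyme
    while node is not None:
        labels.append(node.label)
        node = node.subclass_of
    return labels


# "3.4.24" is the value suggested earlier in this thread for the parent item.
metalloendopeptidase = EnzymeClass("Q6822865", "metalloendopeptidase", "3.4.24")
pappalysin_enzyme = EnzymeClass(
    "Q13107614", "pappalysin-1", "3.4.24.79", subclass_of=metalloendopeptidase
)
pappalysin_protein = GeneProduct("Q1476411", "Pappalysin 1", [pappalysin_enzyme])

print(ec_numbers_of(pappalysin_protein))  # ['3.4.24.79']
print(lineage(pappalysin_enzyme))         # ['pappalysin-1', 'metalloendopeptidase']
```

The point of the sketch is that the EC number lives in exactly one place (the enzyme item), and any number of gene product items, such as orthologs from other species, can point to it.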
Nice! I think this would be fantastic! The only suggestion I'd have would be to use EC class instead of EC name. Anything "name" to me implies a string data type. But I'm not passionate on this point... Cheers, Andrew Su (talk) 01:10, 28 June 2013 (UTC)
'EC class' works! Think this is ready for a property proposal? Emw (talk) 01:24, 28 June 2013 (UTC)

'EC class' may not be an ideal label since it may cause confusion. According to EC conventions, a class (or main division) refers to the first digit, a sub-class to the second set of digits, a sub-sub-class to the third set of digits, and the fourth set of digits is the serial number of the individual enzyme. The tag 'EC accepted name' would be more in line with EC conventions. It also might be more accurate to rename 'EC number' to 'EC code number'. The former name is widely used, but the latter may help distinguish it from European Commission numbers. Boghog (talk) 06:58, 28 June 2013 (UTC)

What about "EC classification"? Both "EC name" and "EC accepted name" strike me as odd because we're not pointing to a name (string), we're pointing to a concept (item) (which happens to have a name). Anyway, I suspect I'm just being pedantic here. I'll be fine with anything. Emw, since you raised the suggestion, perhaps you can put in the new property proposal however you see fit (perhaps after giving Boghog a chance to chime in again if he likes)? Cheers, Andrew Su (talk) 07:26, 28 June 2013 (UTC)
I agree that "EC classification" would be better. Boghog (talk) 15:05, 29 June 2013 (UTC)
Agreed, see Wikidata:Property_proposal/Term#EC_classification. Emw (talk) 17:42, 29 June 2013 (UTC)