Auxiliary matching for the COSING Cosmetic Dataset

Hi Alchemists,

After getting a property for COSING created, I've imported the COSING IDs for matching in Mix N'Match. COSING is a huge cosmetic dataset used EU-wide (and beyond). It has a lot of interesting information about the stuff we put on our bodies daily, and it is essential for parsing cosmetic ingredient lists for Open Beauty Facts.

Some automatic matches have been made by M&M, but it would increase reliability of matches if we could use the following columns as sanity checks, and perhaps to automate matching.

INCI name, INN name, Ph. Eur. Name, CAS No, EC No

They are available as open data on the EU Open Data Portal, and in the target page on the EU Commission website.

A step further would be to import the additional columns as statements (e.g. HUMECTANT, MASKING, SKIN CONDITIONING, SOLVENT properties, SCCS opinions…)

@Teolemon: The optimal import process is the following:
1) Extract WD data in the format Q number/CAS number/EC number into a table A.
2) Extract EU data in the format INCI name/INN name/Ph. Eur. Name/CAS number/EC number into a table B.
3) Match lines in tables A and B that have both the same CAS number and the same EC number, and create a new table C with the format Q number/CAS number/EC number/INCI name/INN name/Ph. Eur. Name/function (e.g. HUMECTANT, MASKING, SKIN CONDITIONING, SOLVENT properties, SCCS opinions…).
4) Match lines in tables A and B that share the CAS number or the EC number but not both, and create a new table D with the format Q number/CAS number/EC number/INCI name/INN name/Ph. Eur. Name/function (e.g. HUMECTANT, MASKING, SKIN CONDITIONING, SOLVENT properties, SCCS opinions…).
5) Table D has to be checked to verify that the WD items are correctly defined, and missing or wrong data have to be added/corrected. Once completed/corrected, lines of Table D can be added to Table C.
6) Table C can be imported after a bot request.
7) Table C has to be saved somewhere and used as a reference for periodic data checks in WD to identify vandalism or wrong data handling. Snipre (talk) 23:58, 2 January 2018 (UTC)
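Steps 3 and 4 above could be sketched roughly as follows. This is only a minimal illustration in Python, assuming both tables are already loaded in memory; the names `WdRow`, `CosingRow` and `match_tables` are hypothetical, only the INCI name is carried over from table B for brevity, and a missing identifier is assumed to be stored as `None`.

```python
from typing import List, NamedTuple, Optional, Tuple


class WdRow(NamedTuple):
    """One line of table A (hypothetical layout): a Wikidata item."""
    qid: str
    cas: Optional[str]
    ec: Optional[str]


class CosingRow(NamedTuple):
    """One line of table B (hypothetical layout): a COSING entry."""
    inci: str
    cas: Optional[str]
    ec: Optional[str]


Pair = Tuple[WdRow, CosingRow]


def match_tables(table_a: List[WdRow],
                 table_b: List[CosingRow]) -> Tuple[List[Pair], List[Pair]]:
    """Return (table_c, table_d).

    table_c: pairs where CAS number AND EC number both agree (step 3).
    table_d: pairs where exactly one of the two agrees (step 4);
             these are the lines that need a manual check (step 5).
    """
    table_c: List[Pair] = []
    table_d: List[Pair] = []
    for a in table_a:
        for b in table_b:
            # A missing identifier never counts as a match.
            cas_same = a.cas is not None and a.cas == b.cas
            ec_same = a.ec is not None and a.ec == b.ec
            if cas_same and ec_same:
                table_c.append((a, b))
            elif cas_same or ec_same:
                table_d.append((a, b))
    return table_c, table_d
```

The nested loop is quadratic, which is fine for a sketch; for the full dataset an index keyed on CAS number and on EC number would probably be preferable.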
Thanks a lot. Daunting task, but at least the plan is clear :) Teolemon (talk) 12:52, 23 January 2018 (UTC)
The Cosing database currently comprises 25937 ingredients (with an INCI name) and substances (with a chemical name) that are components of perfumes or subject to restrictions in the annexes of the European regulation. The INCI names are decided and defined by the INC, and Cosing reports only those recognized in the EU, just as the IECIC database reports those recognized in China. The Cosing database is periodically updated: over 10000 new ingredients were added a year ago. The Cosing number alone does not clearly identify an ingredient when the INCI name is missing as an identifier in Wikidata. I proposed to include the INCI name as a property.--Rodolfo Baraldini (talk) 12:37, 3 March 2018 (UTC)

Plural or singular labels

Should the labels of items like carboxylic acid (Q134856), amine (Q167198) and others that describe groups, classes, families etc. of compounds be plural or singular? At first I used plural in Polish labels, then someone convinced me that it would be better to use singular. For some time, however, I have been using plural again ;) for Polish labels, because it makes more sense (singular does not in my language; writing in Polish that 'amine' is a 'group of...' is quite funny).

Should there be some guidelines in this case, or is it maybe a language-specific matter? Now, most of the English labels are singular (like in articles), but e.g. many Russian labels are plural. I do not know English well enough to say whether it's better to write 'amine is a class/group/family' (BTW the use of these three names could also be regulated somehow) or 'amines are a class...'. Wostr (talk) 22:56, 1 January 2018 (UTC)

@Wostr: Help:Label#Labels_in_English Snipre (talk) 19:52, 3 January 2018 (UTC)
@Snipre:, okay... but I can't find the answer on that page (or maybe you want to indicate that every language has to establish its own guidelines? — if so, it's still not clear to me whether to use plural or singular in English?). Wostr (talk) 20:03, 3 January 2018 (UTC)
@Wostr: Sorry, I didn't read the page to find the info. But this should be solved on that page. Under ontology building rules, a label has to be singular if the singular form can be used (some concepts are always plural). So as we can use the concept amine in the singular form, the singular form should be used. Better open a new section there, and if no opposition is raised we can add this new rule to the rules list. Snipre (talk) 20:36, 3 January 2018 (UTC)
  • note: ChEBI uses plural for compound classes (like in [1]). Wostr (talk) 21:43, 4 January 2018 (UTC)
  • I think the wikidata (English) label policy comes from English wikipedia: en:Wikipedia:Naming conventions (plurals). That said we should probably have our own independent statement of it. ArthurPSmith (talk) 21:55, 4 January 2018 (UTC)
    • Otherwise, there will always be 1=2 (or 2=1). WP is an atavism and a rudiment, living by its own rules (only human-readable (Q16716513)), which will lead to its collapse (few people need the right thing, because there is a better alternative). --Fractaler (talk) 18:15, 6 January 2018 (UTC)
    • There are two main types of exceptions to this rule: Articles on groups or classes of specific things – am I the only person who thinks that articles about compound classes and groups are a violation of this? ;) But, in fact, there are some more specific guidelines – en:Wikipedia:Naming_conventions_(chemistry)#Groups_of_compounds from 2008 (however, I couldn't find any related discussion in WikiProjects Chemistry/Chemicals). Wostr (talk) 20:20, 6 January 2018 (UTC)
  • note 2: Glossary of Class Names of Organic Compounds and Reactive Intermediates Based on Structure (IUPAC Recommendations 1995) uses plural names (probably to make clearer distinction between classes and compound names, e.g. pyridine vs pyridines). Wostr (talk) 19:45, 6 January 2018 (UTC)
  • note 3: it seems that other wikis use plural names (en:Wikipedia_talk:WikiProject_Chemistry/Archive_26#Plural_names_for_classes_of_compounds), but this should be verified. There was a short discussion, but it seems quite obvious that class names in Polish should be written in plural. Wostr (talk) 20:20, 6 January 2018 (UTC)
  • note 4: Glossary of Class Names of Polymers Based on Chemical Structure and Molecular Architecture (IUPAC Recommendations 2009) uses singular class names for polymers to enable an individual polymer within a class to be referred to by using the indefinite article, “a”. For example, poly(3-octylthiophene) is a polythiophene and polythiophene itself is also a polythiophene. That may be related to the use of instance of (P31). Wostr (talk) 20:20, 6 January 2018 (UTC)
We are creating a knowledge base, a common terminological coordinate system. With its help, we can currently only say, for example: dicarboxylic acid (Q422050) is carboxylic acid (Q134856), tricarboxylic acid (Q2823314) is carboxylic acid (Q134856). And we cannot say that, for example, "dicarboxylic acid (Q422050) and tricarboxylic acid (Q2823314)" is carboxylic acid (Q134856), because we do not have "group of carboxylic acids"/"carboxylic acids". --Fractaler (talk) 13:02, 8 January 2018 (UTC)
We do not need both 'carboxylic acid' and 'carboxylic acids' items. One of them is redundant, as both items describe the same thing: a structural class of chemical compounds. I think there is only a linguistic problem with the proper grammatical number and with is a (as a synonym of instance of); in this case the use of instance of (P31) is IMHO incorrect, as the correct statements are: tricarboxylic acids < subclass of > carboxylic acids and dicarboxylic acids < subclass of > carboxylic acids. Wostr (talk) 14:24, 8 January 2018 (UTC)
Ok, what is, for example, "dicarboxylic acid (Q422050) and tricarboxylic acid (Q2823314)" by your version? --Fractaler (talk) 10:05, 9 January 2018 (UTC)
Classes of chemical compounds, of course. Wostr (talk) 14:22, 9 January 2018 (UTC)
"Dicarboxylic and tricarboxylic acids" is "classes" or "class"? "dicarboxylic and tetracarboxylic acids" is "classes" or "class"? "dicarboxylic, tetracarboxylic and heptacarboxylic acids" is "classes" or "class"? Groups or group? --Fractaler (talk) 14:43, 9 January 2018 (UTC)
There is nothing like Dicarboxylic and tricarboxylic acids right now, so discussion about this makes no sense. Dicarboxylic acids is a class and a subclass of Carboxylic acids class. Same with Tricarboxylic acids and so on. Dicarboxylic acid = Dicarboxylic acids as a chemical class (and speaking 1≠2 in that case would be rhetorical figure only, as there is a linguistic problem which version to use, not substantive or methodological problem and both dicarboxylic acid and dicarboxylic acids are describing the same thing: class of chemical compounds). Wostr (talk) 19:18, 9 January 2018 (UTC)
There is nothing like Dicarboxylic and tricarboxylic acids right now where? --Fractaler (talk) 20:12, 9 January 2018 (UTC)
As a WD item. (And as a separate and well-established chemical class). Wostr (talk) 20:35, 9 January 2018 (UTC)
For a long time in ru-Wikipedia I asked the WP editors: for whom do you create Wikipedia? They did not even want to think about it. And now a powerful competitor has appeared. Do you think that we are making WD items for Wikidata (Q2013)? We are making a knowledge base (Q593744), a common hierarchy of terms, a common frame of reference (Q184876). So it is used not only in Wikidata, but also outside it (in Internet conversations, articles, etc.). Example of use: 1) "Alice gave Bob malic acid (malic acid (Q190143))". So "Alice gave Bob (one) dicarboxylic acid" (dicarboxylic acid (Q422050), 1 acid). 2) "Alice gave Bob citric and succinic (succinic acid (Q213050)) acids". So "Alice gave Bob carboxylic acids" (several acids, >1). Can you restate this message (where 2≠1, 1≠2) so that the meaning does not change if you have carboxylic acid = carboxylic acids (2=1, 1=2)? --Fractaler (talk) 09:30, 10 January 2018 (UTC)
I think that your problem is not of any connection with the title problem and I really don't understand what you're trying to achieve. Wostr (talk) 13:28, 10 January 2018 (UTC)
The task is very simple. You are trying to convey your thoughts to your interlocutor (he is a foreigner and does not understand your language). But you have a universal means, a single system of terms - Wikidata. Try to apply it. For example, replacing the words in the sentences (above) with the Wikidata items so that the foreigner correctly understands you. --Fractaler (talk) 14:28, 10 January 2018 (UTC)
English is not my first language and I also have problems in using it, but I'm trying to use it as best as I can. However, I won't replace English or any other language with some pseudo-language based on P's and Q's, sorry. It seems to me that I would understand Russian better (as my first language is Polish and for a short time I studied Russian) than this kind of discussion. As for the previous comment: I don't think that we will ever have duplicated items like 'carboxylic acid' and 'carboxylic acids' (both describing a chemical class) just for the sake of some linguistic problems that are not part of chemical classification. That would be a problem for Wiktionary, and maybe in the future Wiktionaries will be integrated into/with Wikidata – that would allow the required grammatical number to be chosen automatically. What this topic is about is to decide (A) whether there should be some consistency in naming chemical classes between languages and (B) if so, whether the names should be plural or singular. Wostr (talk) 15:28, 10 January 2018 (UTC)
WD-language (only Q*, because P* can be replaced by Q*) is not a pseudo-language. It is a common terminological reference system (i.e., not only a relative, but an absolute path to the term), the next stage (after the mathematical language, semantic network (Q1045785), etc.) of the evolution of the universal language. Mathematics can explain to anyone who understands it why 2≠1, 1≠2. WD can show anyone that "carboxylic acid" is not "carboxylic acids", is not a synonym, is not a "linguistic problem", because these are object of group (Q36809769) and group (Q16887380). Is object of group (Q36809769) a group (Q16887380)? (And if it is convenient, we can continue here.) Fractaler (talk) 08:34, 11 January 2018 (UTC)
The advantage of using the singular form is the possibility of automatic text creation using the labels (#label instance is a #label class). Then, as WD will be linked to Wiktionary with the creation of new properties for the plural and the feminine, the label should be the masculine singular form. Snipre (talk) 12:36, 11 January 2018 (UTC)
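The label-based text creation mentioned above could look roughly like this. It is only a sketch in Python: the function name `describe` is hypothetical, the labels are passed in as plain strings rather than fetched from WD, and the article choice is a naive vowel check that works for English only.

```python
def describe(instance_label: str, class_label: str) -> str:
    """Build an 'X is a Y' sentence from two English labels."""
    # Naive English article rule (assumption): 'an' before a vowel letter.
    article = "an" if class_label[0].lower() in "aeiou" else "a"
    return f"{instance_label} is {article} {class_label}"


# Singular class label: the generated sentence reads naturally.
print(describe("malic acid", "dicarboxylic acid"))
# Plural class label: the same template yields an ungrammatical sentence.
print(describe("malic acid", "dicarboxylic acids"))
```

With plural class labels the template would need the grammatical number from somewhere else, for example from Wiktionary data.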
I don't think I have any choice in using grammatical gender... And still there is no agreement that we should use instance of (P31) to chemical compounds at all. Wostr (talk) 15:31, 11 January 2018 (UTC)
Do we apply the grammar of the language in the grammar of Wikidata? On what grounds? singular (Q110786)/plural (Q146786) is grammatical number (Q104083) (grammatical category of nouns, pronouns, and adjective and verb agreement that expresses count distinctions (such as "one", "two", or "three or more")). Does anyone else think that between 2 (3, 4, etc.) and 1 there is no difference? That between singular (Q110786)/plural (Q146786) there is no difference? If there are still doubts, then apply the grammatical categories in the sentences (Wikidata has sentence (non-functional linguistics) (Q41796)?). Yes, singular (Q110786)/plural (Q146786) refer to the sentence, to the full, absolute path (full object name (Q38667285), breadcrumb (Q846205)), and not to Wikidata's "item/superitem" or "subitem/item", short object name (Q38667440). Just use this cool tool to check for a number. --Fractaler (talk) 14:05, 11 January 2018 (UTC)
@Snipre, Fractaler: Okay, let's reverse this problem. In WD we will have a chemical classification of compounds based on structure. I am quite sure of this, as it is the basic classification in chemistry. So, we will have items that are equivalent to chemical classes/groups/or whatever you like to call them. And let's take an example here: there is a WD class that describes every compound having a pyridine ring in its structure, so it's something similar to commons:Category:Pyridines. Should this specific item be named (in English) pyridine or pyridines? (in Russian) пиридин or пиридины? This item will certainly include pyridine (Q210385) (but at this moment we cannot be sure of the exact relation, instance of (P31) or subclass of (P279)). What I can say from the Polish language perspective is that having it (in Polish) in singular does not make sense, because a group/class of objects has to be plural, and with singular labels I would have to come up with some weird descriptions matching singular labels (like kind/type of compound, which maybe sounds okay in English but not so well in Polish). Wostr (talk) 15:31, 11 January 2018 (UTC)
@Wostr: Who said that classification based on substructures was a good idea? When you say "we will have chemical classification of compounds based on structure", I think you are concluding the discussion before it starts. Just go deeper into your example: if a chemical compound has a dozen functional groups, with your classification we will have a dozen instance of statements. Is that the correct way to do things? And then take a big protein with hundreds of functional groups and other substructures. Do you really think that using a classification can help for complex molecules? For me, we shouldn't use functional groups as classification criteria; instead, as we are doing with elements, we should use a new property like "functional group" or "substructure".
Then for your example of pyridine, just write the relation between pyridine (compound) and pyridine (class). Can't you write in Polish pyridine (compound) is a pyridine (class)? So if you can write that sentence with pyridine (class) in the singular form, why do you have so many problems writing it in the label?
If you still have a problem, then use another, simpler concept: a dog is an animal. You can easily write animal in the singular form, and I think this is the case in Polish too. So can you still say after that that "singular does not make sense"? And this is why I want the lowest level of the classification to use "instance of", because with that rule you can always create sentences like A is a B, A is a C,... This allows an easy check of the classification links. I can always say carbon dioxide is a gas, is a chemical compound, is a chemical substance,... if I define carbon dioxide as an instance of something. If I define carbon dioxide as a subclass, how can I test the relations: carbon dioxide is a subclass of chemical compound, carbon dioxide is a subclass of chemical substance? This is not as clear as instance of. Snipre (talk) 22:48, 15 January 2018 (UTC)
@Snipre: chemical classification based on structure is used in every chemical database that I know of, so using it seems obvious to me – in other words: not using it would be a great loss. And about proteins: chemical classification based on structure is used for what is defined as 'small compounds'; for proteins etc. other classifications are used. Using has part (P527) or another property is of no use here and would be incorrect – acridine (Q342713) does not have pyridine (Q210385) as has part (P527); what it has as a part is a pyridine ring without some hydrogen atoms etc.
I can write 'pyridine (compound) is a pyridine (class)', but: (1) the description won't match the label in singular – pyridine (class) is a class of objects, so it seems natural that it should be plural and the description should have 'class of compounds etc.', (2) is a is a synonym (one of many) for instance of, and it's much better IMHO to write 'pyridine (compound) is an instance of pyridines (class)' than '... of pyridine (class)', (3) what you wrote is correct only for compounds, not for compound classes: 'dibenzopyridine (class) is a subclass of pyridine (class)'? (quite nonsensical to me). Wostr (talk) 23:02, 15 January 2018 (UTC)
@Wostr: Please can you point out where I wrote that we won't use structures as descriptors for chemicals? I never said we shouldn't use structure; I just say that structure should be described using properties other than instance of/subclass of. What is the big advantage of that? We won't mix structure classification with other classifications based on use or regulation. Just remember that ethanol is not only an instance of chemical compound or alcohol, but also a solvent, a drug, a fuel, ... Your structure classification based on instance/subclass will be mixed with dozens of other classifications.
"chemical classification based on structure is used in every chemical database" What is the point of doing what is already done in other databases? Will WD just be a mirror of ChEBI? You really show a poor innovative spirit if your argument is mainly "the others are doing it like that, so just do it like that". Following that spirit, WP and WD should never have been created, as reference encyclopedias already existed. And finally, you show the same lack of ambition when you say that proteins should be classified in a different way than small molecules. Why do we have to do that? Shouldn't we be a little more open? Try new things? If you really want to copy ChEBI, better extract the ChEBI classification directly into your wiki and avoid all the import and maintenance work needed to keep the ChEBI classification up to date in WD. It is nonsense to copy ChEBI into WD and then WD into WP:pl if you can do the import directly from ChEBI to WP:pl.
And for the label problem, your problem is the way you formulate the label: class of objects. Why can't you use as description pyridine (class) = any compound with a pyridine structure or a compound with a specific substructure having 6 carbons ...? No need of plural for that. Snipre (talk) 22:45, 16 January 2018 (UTC)
@Snipre: starting from the last mentioned thing: yes, I can. But the version with 'any compound...' looks as if someone tried to adjust the description to a poorly chosen label, whereas plural corresponds with e.g. IUPAC definitions, and ChEBI also has no problem with using plural together with is a ;) Second, this distinction (singular – compounds, plural – classes) naturally helps in choosing the right item.
Hmm, and what is the problem with mixing many classifications in subclass of (P279)/instance of (P31)? In most cases of chemical compounds there should be max a few classes, not dozens. Do I understand correctly that your point is to add P31:'chemical compound' everywhere and use some other property like 'chemical class' to add structural classification? If so, could you propose such a property – it would be much easier for me to have some formal basis for adding classes (right now I use class of chemical compounds (Q47154513) and subclass of (P279)/instance of (P31), but I'm almost sure that I will have to modify this in the future – however, thanks to class of chemical compounds (Q47154513) it will be possible to easily obtain all items that are in fact classes, not compounds, and modify them).
And about my ambitions and innovative spirit (I assume that this is not argumentum ad personam): the fundamental principle in Wikipedia is no original research, thus I'm not here for inventing anything, only for... repurposing what has already been invented into something that will be best suited to WD's needs. I have no illusions that a few Wikidatians would be able to invent a unified and correct chemical classification for all chemical compounds including macromolecules – something that has not been created so far, even by rich chemical corporations and scientific institutions, and that has been described as impossible by many authors (because of too big differences between different forms of what chemists call 'chemical compound').
In my opinion, le mieux est l'ennemi du bien – we do not have any chance for the best, we don't have good. What we could have is something I would call reasonably good – and that means existent classifications like in ChEBI. What we have now in WD is nothing and chasing the best will leave us with nothing. Wostr (talk) 23:19, 16 January 2018 (UTC)
@Wostr: Sorry for the delay of my answer.
* In most cases of chemical compounds there should be max a few classes, not dozens. Wrong, if you have a chemical with 15 different functional groups, then using a classification based on structure you will have 15 instance of. Again you assume that the classification will be used only on small molecules which is not the case. Just think again that we have proteins and other big natural compounds so we have to consider them and not just saying that is another kind of classification: we need something which can cover all items under chemical compounds. So if you don't want to have protein in chemical compounds subclasses, please provide the classification tree integrating chemical compound and protein with the definition allowing to differentiate protein from chemical compounds. We shouldn't create different classifications but one classification.
And I won't create any property if we haven't agreed about the need for it. I don't want to start a process if nobody is convinced of its utility. The goal of the discussion is mainly to avoid any useless action by defining a priori what is necessary.
My point is first to start with chemical compound everywhere and then to start the creation of a classification which is not based on the usual functional group structure. For me, a hydroxyl group on a big molecule doesn't allow us to say that the molecule is an alcohol.
Why do you need to add instance of class of chemical compounds (Q47154513) ? You can retrieve all these items by a query looking for all subclass of chemical compounds including subclass of subclass. See
SELECT ?compound WHERE {
  ?compound (wdt:P279/wdt:P279*) wd:Q11173 .
}
We are using a database, so instead of creating unnecessary structure, use database properties; in that case we can use queries instead of a useless classification.
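To make the idea of retrieving "subclass of subclass" items concrete without running anything against the live database, here is a small offline sketch in Python. It only mimics following subclass of (P279) links transitively on a toy graph; the function name `all_subclasses` is hypothetical and the edges below are assumptions for illustration, not actual WD statements.

```python
from typing import Dict, Set


def all_subclasses(subclass_of: Dict[str, Set[str]], root: str) -> Set[str]:
    """Return every item reachable from `root` by following
    'subclass of' links backwards, through one or more steps."""
    # Invert the mapping: superclass -> set of direct subclasses.
    children: Dict[str, Set[str]] = {}
    for item, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(item)

    found: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in found:  # also guards against cycles
                found.add(child)
                stack.append(child)
    return found


# Toy 'subclass of' edges (assumed for illustration only).
toy_edges = {
    "carboxylic acid": {"chemical compound"},
    "dicarboxylic acid": {"carboxylic acid"},
    "tricarboxylic acid": {"carboxylic acid"},
    "gas": {"state of matter"},
}

print(sorted(all_subclasses(toy_edges, "chemical compound")))
```

Anything that is linked into the tree through subclass of is returned, whatever kind of classification put it there, so the quality of the result depends entirely on how clean the subclass links are.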
So if you don't want to create something new, why do you want to import into WD something which is available and maintained outside WD? ChEBI is one possible classification and not THE classification. If you want to follow the WP rules correctly, then please do it completely: according to NPOV (neutral point of view) we can't apply a single point of view from a single reference like ChEBI. And finally, I am wondering whether WP:pl follows your rule of no original research when creating categories in WP. And as classification in WD is similar to categories in WP, I think we can apply the same rules. Your argument is very poor because you criticize the unclear definition of chemical compound while using the even less clear concept of class of chemical compounds (Q47154513). Are you really being coherent? I don't think so. Snipre (talk) 14:48, 22 January 2018 (UTC)
PS: I have nothing against you personally; I am just trying to find something which can convince me of your proposition.
@Snipre: that is simply not true — e.g. in monoethanolamine (Q410387) there shouldn't be two separate classes (alcohols and amines), but only one (hydroxyamines). The whole concept of chemical classification based on structure is to create more specified subclasses if there are enough compounds sharing specific structure.
From your SPARQL I get a bunch of unrelated items being minerals etc., but there are in fact many classifications of chemical compounds that should be separated, e.g. class of chemical compounds (Q47154513) is about structural classification (maybe there should be 'structural' in the label), there is also functional classification already present in WD (like 'acids', 'bases', 'oxidants' etc.), there is classification based on use of compound (pigments, bla bla). Adding instance of (P31) with specific item about class (like class of chemical compounds (Q47154513)) is IMHO the easiest way to retrieve only items being classes in specific classification.
And the structural classification can be used on macromolecules too, but in a different way (as is already done in science): e.g. by indicating amino acid building blocks (not every functional group in every amino acid) or another macrostructural feature (not every functional group that feature is composed of). But at some point the transition between the classification of small compounds and macromolecules, even the smoothest transition, is some kind of boundary between a very accurate structural indication for small compounds and something like an estimation for macromolecules.
And the unclear definition of concepts like class of chemical compounds (Q47154513) is IMHO an advantage here (until we establish a solid model for classification), because a classification based on such items will be correct whether we model chemical compounds as molecular entities or choose to model these items as chemical substances. Wostr (talk) 16:48, 22 January 2018 (UTC)
Now for pyridine (Q210385): properties (mass (P2067), etc). It is mass (P2067) of what? 1 (pyridine, пиридин), >1 (pyridines, пиридины), compound, compounds, substance, substances?--Fractaler (talk) 17:13, 15 January 2018 (UTC)
@Fractaler: why is pyridine (Q210385) a problem here? The mass of pyridine is the mass of a molecule (if in Daltons [u]) or the mass of a mole of molecules (if in moles [mol]). But I don't understand why you are asking me this. The problem is not how to name pyridine (Q210385) (it will always be pyridine/пиридин because it's about a molecule/compound). The problem here is how to name pyridines (Q47317020): the item that describes a class of compounds = all compounds that have a pyridine ring in their structure (so this class includes e.g. bromazepam (Q422435), 2,4,6-trimethylpyridine (Q409155), 2,6-pyridinedicarboxylic acid (Q417164), 2,6-lutidine (Q209284), 4-methylpyridine (Q2189778) and many, many others). It's similar to Wikipedia categories: ru:Категория:Пиридины is for every compound from the pyridine class = a compound having a pyridine ring in its structure. Wostr (talk) 19:38, 15 January 2018 (UTC)
pyridines (Q47317020) (pyridines, "class of chemical compounds with pyridine ring"): what does "compounds with pyridine ring" mean here? --Fractaler (talk) 09:10, 16 January 2018 (UTC)
@Fractaler: that compounds have pyridine ring (i.e. pyridine core without one or more hydrogen atoms) as part of their structure. Wostr (talk) 16:02, 16 January 2018 (UTC)
@Wostr: Is the structure of a compound the same as the structure of a molecule? Can a compound have a ring? Or does a molecule of pyrimidine have a pyridine ring?--Fractaler (talk) 18:35, 16 January 2018 (UTC)
@Fractaler: sorry, but I have an impression that either you're not a chemist, or we have too big language barrier here. Structure of compound is the same as structure of molecule (because in the most popular definition of compound, it is substance composed of one kind of molecules, and the terms 'structure of compound' and 'structure of molecule' are used interchangeably). And yes, compound/molecule can have a ring, e.g. toluene has benzene ring and methyl group. But no, pyrimidine does not have pyridine ring – pyrimidine ring has two heteroatoms in its ring, pyridine only one heteroatom in its ring. Wostr (talk) 18:48, 16 January 2018 (UTC)
@Wostr: substance composed of one kind of molecules. Agree. So, a substance consists of molecules. A molecule consists of atoms. So we have two levels, and of course objects on these levels cannot have the same structures (they are not fractals). --Fractaler (talk) 19:07, 16 January 2018 (UTC)
@Fractaler: I really don't know what you are getting at. IUPAC uses 'compound' in both meanings, as most chemists do. So e.g. 'alkynes' are a subclass of 'acetylenes' (both are chemical classes): (1) 'alkynes' (molecules in which there is one C≡C) is a subclass of 'acetylenes' (molecules in which there is one or more C≡C); (2) 'alkynes' (substances composed of molecules in which there is one C≡C) is a subclass of 'acetylenes' (substances composed of molecules in which there is one or more C≡C). The exact meaning depends on the chosen definition and classification tree, but at this level the classification remains the same. So, if we choose to classify all 'chemical compounds' as 'molecules' (cf. discussion about definition of chemical compound), then the classification will be based on this. If we choose otherwise ('chemical compounds' are 'substances composed of one kind of molecules'), then the classification will be the same, with the same definitions and the same connections. Wostr (talk) 19:17, 16 January 2018 (UTC)
@Wostr: Does IUPAC make a knowledge base, or do we? IUPAC is just for rules, for notability, for the life of an item in the WD space, to make it legitimate. Also as a link to WD, a very cool source and so on for notability. --Fractaler (talk) 19:29, 16 January 2018 (UTC)
@Fractaler: at this point I think that further discussion makes really no sense, sorry. Wostr (talk) 19:31, 16 January 2018 (UTC)
@Fractaler: Wikidata's idea is to be a secondary source. The definitions of other authorities matter a great deal. ChristianKl 14:32, 17 January 2018 (UTC)
Of course. I mean: no Wikidata:Notability (IUPAC, other cool sources), no life in WD for item. Fractaler (talk) 14:39, 17 January 2018 (UTC)
Ok, but idea that structure of molecule = structure of compound is wrong. We can ask other editors here about this. --Fractaler (talk) 19:44, 16 January 2018 (UTC)
@Wostr: IUPAC chooses not to use the term chemical compound and doesn't have any definition for it. It does have a concept of inclusion compounds that seems to me like it includes multiple molecules. ChristianKl 14:32, 17 January 2018 (UTC)
After reading a bit more it seems that neither ChEBI nor IUPAC has a concept of a "chemical compound". Why should we have one? Would it make sense to rename the item to "pure substance"? ChristianKl 15:30, 17 January 2018 (UTC)
/conflict/ @ChristianKl: yep, and that's why when an IUPAC definition contains 'compound' (and IUPAC uses compound very often, yet without defining it), it's a matter of context which version (substance vs molecule) should be used – so in WD the problem is a matter of how 'chemical compound' is defined, or of the chosen model for the top-level chemistry items, i.e. aromatic compound (Q19834818) will always be a valid chemical class for e.g. benzene (Q2270) or pyridine (Q210385) – but both items can be modelled either as a substance or as a molecule (the discussion about how we should treat items about chemical compounds – as items about molecules or items about substances composed of such molecules – is still ongoing: Wikidata talk:WikiProject Chemistry/Proposal:Models). Wostr (talk) 15:38, 17 January 2018 (UTC)
Not rename, as 'pure substance' is not a synonym of 'chemical compound' (pure substance includes also chemical elements), but 'chemical compound' can be just ignored in modelling chemical items (but it is still notable concept and should have its item in WD) and other terms may be used. Wostr (talk) 15:40, 17 January 2018 (UTC)
If it's a notable concept, why don't ChEBI and IUPAC define it? It seems to me like they don't have a concept in their database with that name because the term has no clear meaning and is used with different meanings and they prefer to have terms with clear meanings.
An "inclusion compound" for example is not a single molecule or even multiple molecules of the same type but a complex. ChristianKl❫ 18:17, 17 January 2018 (UTC)
@ChristianKl: It is one of most basic concepts in chemistry, but... it has more than one definition (some widely used, some used only in narrow fields). And I don't advocate for using 'chemical compound' concept as a base for classification – but using 'compound' as a part of many terms and definitions is unavoidable.
And you are of course right that IUPAC does not want to define it – if clear definition is required, IUPAC uses e.g. 'molecular entity', 'chemical substance', 'chemical species' etc. But when this is not required (i.e. when something can be related to both 'molecular entity' and 'chemical substance'), IUPAC uses 'chemical compound' a lot. This is the case of the whole chemical nomenclature, where chemical names are not limited to molecular entities but are valid also for chemical substances composed of such entities.
You are right too about the fact that the most widely used definition of a compound ('chemical substance composed of molecules') is not strictly used by IUPAC. And there are many more similar examples: salts and hydrates are not molecules either (but are usually included in something called 'chemical compounds'). Frustrated Lewis pairs are not compounds either, etc. etc.
But I think that has nothing to do with the title problem ;) You can write about this in Wikidata talk:WikiProject Chemistry/Proposal:Models and I would be grateful if you do so, because there are not many users in this WikiProject (that are active in the discussions) and I think it would be easier to come to conclusions with more users.
And, frankly speaking, I do not know why this discussion in this topic came to this point. Fractaler is asking questions which are IMHO not related to the title problem, and I just don't know what is wrong or what his proposals are. I decided to not participate in discussions with him, as I see that he's indefinitely blocked on his home project for disrupting actions and I really don't have time to solve his enigmatic and philosophical questions. Wostr (talk) 23:44, 17 January 2018 (UTC)
@Wostr:I see that he's indefinitely blocked on his home project for disrupting actions: You see? for disrupting actions? Proof, please. Otherwise I will consider it as the distribution of unverified information (real chemists, like other real scientists, fake information do not spread). And how can you discuss something here, if you use the terms (plural label, singular label), which have not yet been defined? Mankind has long passed the stage of such philosophical discussions. --Fractaler (talk) 09:30, 18 January 2018 (UTC)
@Fractaler: sorry, I really don't have time to answer questions/comments that (1) are not related to the problem or I can't find any relation to the problem – maybe there is some relation, but I wrote a few times, that I don't understand how your questions are related to anything in this topic; (2) are full of Q's and P's (like your comment on January 11th about grammatical numbers) what makes them really hard to understand; (3) includes only enigmatic questions without any proposals or even indications of what is wrong. Since I'm a volunteer, like all of us here, I also have the right not to answer to your questions and comments — the right I intend to use until your questions will be substantive and clearly related to the problem. As for the proof you want: your block log on is not a secret. Wostr (talk) 18:48, 18 January 2018 (UTC)
Excuse me, yes, I have such a drawback (to lead a person to an idea). Ok, my idea was (I probably needed to say this at the beginning): pyridine (Q210385) (pyridine, molecule (Q11369) (molecule), молекула пиридина) is component (Q1310239) (component, компонент) of pyridines (set of all pyridine molecules, пиридины, группа всех пиридиновых молекул).
block log said "disrupting actions" and I see disrupting actions (in my opinion, maybe this is the result of my poor knowledge of English) it sounds differently (has a different meaning). Fractaler (talk) 07:45, 19 January 2018 (UTC)
@Fractaler: so your idea is something like:
  1. existing item pyridine (Q210385) would describe one molecule (molecule (Q11369)) of pyridine
  2. there will be also another item (not existing now) that would describe pyridine/pyridines as a portion of matter (chemical substance (Q79529) → set of molecules of pyridine (Q210385))
Do I get this right? Wostr (talk) 19:55, 19 January 2018 (UTC)
Yes! (again sorry for using that bad method "step by step"). --Fractaler (talk) 20:04, 19 January 2018 (UTC)
Okay, so there is also the third thing. The 1st is 'item about one molecule' (this is obviously singular, like pyridine (Q210385)), the 2nd is 'item about set of molecules of one type' (the grammatical number here is not so obvious: if it was about countable number of molecule, then it would certainly be plural, like two pyridines, five pyridines etc., but as it's a set of uncountable and not specified number of molecules, the grammatical number may be language-specific, e.g. in Polish it would be singular).
The above two things can be discussed here: Wikidata talk:WikiProject Chemistry/Proposal:Models – I think there was a comment similar to your idea. But it's not clear yet if we should use only the 1st (molecule items), only the 2nd (substance items) or maybe both.
The 3rd thing is: pyridines (Q47317020). It's not the 1st nor the 2nd thing. This is for grouping different kind of molecules (or substances if we choose not to use molecule items at all) into classes on the basis of their structure – so it's some abstract class, not something what can be achieved in reality. Let's take an example based on your idea:
But all three items are pyridines (Q47317020) = their molecules have pyridine ring in structure (or rather all three items belongs to chemical class named pyridines (Q47317020)). So it's here the problem: should this 3rd abstract class be пиридин or пиридины? Wostr (talk) 00:14, 20 January 2018 (UTC)
@Wostr: about grammatical number may be language-specific - it seems that while it is necessary to point at the place of the label: one molecule X, a small amount of (small) molecules X, many molecules (large group) X, all molecules X (the whole group) X.
about based on your idea - this is not my idea, it is the idea of/from chemical elements. There we had the same problem (the problem was due to homonymy (Q21701659)) - an element or elements, an atom or atoms, an isotope or isotopes, etc. As soon as the words are indicated (the totality of all atoms of a certain kind, the atom from this set, the molecule of such atoms, the totality of such molecules), the problem disappears. So, α-picoline (Q2216745), 4-ethylpyridine (Q27257452), triprolidine (Q417654),..., X = 1) molecule of X, 2) all molecules of X, 3) something other (because homonymy (Q21701659) now).
their molecules have pyridine ring in structure, so, how about "molecule with pyridine ring in structure"/"molecules with pyridine ring in structure"?
So, to continue, is it better to move to the project (WikiProject Chemistry), or to continue here and make specific proposals there? Fractaler (talk) 09:37, 20 January 2018 (UTC)

potassium ferrocyanide[edit]

We have potassium ferrocyanide (Q422017) and no label (Q27279279). In the 1st we have all sitelinks and Wikipedia-imported properties about anhydrous form and PubChem-imported properties about trihydrate, the 2nd is PubChem-created item for anhydrous form. Which is better: move all properties, sitelinks, labels etc. about anhydrous form to no label (Q27279279) or move PubChem-imported data between these two items? (we have also no label (Q27110378), but this has to be merged when the above problem is fixed). Wostr (talk) 20:27, 27 January 2018 (UTC)

Usually the correct way is to respect the label: if the label is different from the data, this often means that the data import was not correct as people don't check what is the real compound defined by the item. Snipre (talk) 08:46, 29 January 2018 (UTC)
Okay, I'll fix this that way, thanks. Wostr (talk) 18:57, 30 January 2018 (UTC)
potassium ferrocyanide (Q422017), no label (Q27279279) and no label (Q27110378) merged; data about trihydrate moved to new item potassium ferrocyanide trihydrate (Q47520593). It appears that the 1st step was that some bot added wrong PubChem CID (but correct CAS#), then another bot changed data (including CAS#) based on PubChem CID.
However, in potassium ferrocyanide (Q422017) there are doubled ChemSpider IDs, InChI, SMILES now – that is because in databases like PubChem/ChemSpider there are different items about the same compounds, but represented in different ways (like [2] and [3] in this case). I don't know whether I should delete one ID or maybe add some exceptions to the unique value constraint? Wostr (talk) 19:29, 30 January 2018 (UTC)

Help I made a mess![edit]

Hi. I'm not sure if this is the best place to ask, but it has been brought to my attention that a big edit I did on a batch of items for Human Genes has gone wrong. I was trying to copy the en aliases to cy using QuickStatements but some aliases have become fragmented, so '9-cis,12-cis,15-cis-octadecatrienoic' has become '9-cis' '12-cis' and '15-cis' for example. At the very least I need help to get these removed so that I can start from scratch. Or if someone knows how to transfer the aliases programmatically that would be even better. I can provide a list of Q's for affected items. Please can someone help? Best Jason.nlw (talk) 16:34, 30 January 2018 (UTC)

If you can create a list of the bad aliases along with the list of Q's then a bot could remove those aliases. Or if you think it's ok a bot could remove all the cy-language aliases for those items. About how many items were affected? ArthurPSmith (talk) 16:57, 30 January 2018 (UTC)
It affects up to 888 items, and around 9000 aliases (Genes have a lot!) I couldn't easily prepare a list of affected aliases so I think it would be best to remove all cy aliases from those 888 items and I will start again. None of these had aliases before my edit yesterday. Here is a list of the items. Thanks for your help! Jason.nlw (talk) 17:23, 30 January 2018 (UTC)
Hmm, I've been trying to fix these, but so far the bot approaches I've taken don't work. Something special about removing aliases? Both QuickStatements and WikidataIntegrator don't seem to want to do that. I will look at it again but maybe you should post a request on Wikidata:Bot requests to get it done sooner... ArthurPSmith (talk) 19:57, 31 January 2018 (UTC)
Thanks ArthurPSmith, I will post a bot request as suggested. Thanks again for your help. Jason.nlw (talk) 09:54, 1 February 2018 (UTC)
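A minimal sketch of how the list of bad aliases requested above might be prepared, assuming the en and cy aliases of each item have already been exported into plain Python lists (the export step, the function name and the "ALA" alias are illustrative assumptions, not part of any existing tool). Since the cy aliases were meant to be exact copies of the en ones, it flags every cy alias that is not itself an en alias but appears inside an en alias containing a comma.

```python
def find_fragmented_aliases(en_aliases, cy_aliases):
    """Return cy aliases that look like pieces of a comma-containing en alias.

    Heuristic only: a cy alias is flagged when it is not an exact en alias
    but is a substring of some en alias that contains a comma.
    """
    intact = set(en_aliases)
    with_commas = [alias for alias in en_aliases if "," in alias]
    flagged = []
    for alias in cy_aliases:
        if alias in intact:
            continue  # copied correctly, keep it
        if any(alias in source for source in with_commas):
            flagged.append(alias)
    return flagged


# Example built from the alias quoted in this thread; "ALA" is a made-up intact alias.
en = ["9-cis,12-cis,15-cis-octadecatrienoic", "ALA"]
cy = ["9-cis", "12-cis", "15-cis", "ALA"]
bad = find_fragmented_aliases(en, cy)
print(bad)
```

The resulting pairs of Q-id and flagged alias could then be handed to whoever runs the clean-up bot; substring matching can over-flag, so the list would still need a quick manual look.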

chemical formula (P274)[edit]

It is now set with single value constraint (Q19474404), but clearly there is more than one way to write chemical formula (I noticed that when I tried to add inorganic formula [cation-anion]). Shouldn't this be removed? And the Hill formulas should be tagged with criterion used (P1013) = Hill system (Q900739) (and maybe other formulas too, but with different items like 'inorganic formula' [I don't know at this moment whether this kind of formulas have any official name])? So it would be mandatory qualifier constraint (Q21510856) and one-of constraint (Q21510859) (Hill system (Q900739) and others). Wostr (talk) 21:00, 30 January 2018 (UTC)

@Wostr: We already discussed the problem a little (see Wikidata_talk:WikiProject_Chemistry#New_property_for_composition). First step: define the different kinds of chemical formulas in order to see if new properties are required or not. Then, if there is no need for a new property, we should replace the constraint of a unique chemical formula by the constraint of the qualifier criterion used (P1013) with the different possibilities. We can perhaps do a bot request to add criterion used (P1013) = Hill system (Q900739) to all statements chemical formula (P274) having PubChem as reference. Snipre (talk) 08:18, 31 January 2018 (UTC)
I'll check in a few days if there are any official names for different formula writing systems. Wostr (talk) 15:37, 31 January 2018 (UTC)
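For reference, the Hill system ordering mentioned above can be sketched in a few lines: if carbon is present, C comes first, then H, then all other elements alphabetically; without carbon, every element (including H) is listed alphabetically. The function name and the dictionary input format are assumptions made for illustration only.

```python
# Translation table for rendering digits as Unicode subscripts.
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def hill_formula(counts, unicode_subscripts=False):
    """Build a Hill-system formula from a mapping of element symbol -> atom count."""
    symbols = sorted(counts)
    if "C" in counts:
        # Carbon first, hydrogen second (if present), then the rest alphabetically.
        rest = [s for s in symbols if s not in ("C", "H")]
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        order = symbols
    parts = [s if counts[s] == 1 else f"{s}{counts[s]}" for s in order]
    formula = "".join(parts)
    return formula.translate(SUBSCRIPTS) if unicode_subscripts else formula


print(hill_formula({"K": 4, "Fe": 1, "C": 6, "N": 6}))         # anhydrous potassium ferrocyanide
print(hill_formula({"H": 2, "O": 1}, unicode_subscripts=True))
```

An "inorganic" cation-anion formula for the same compound would order the elements differently, which is why a single value constraint without a criterion qualifier is too strict.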

New Wikidata aware <chem/> tags[edit]

In wikitext, <chem /> tags represent chemical sum formulae. For instance, <chem>H2O</chem> represents Q283. However, the rendering mechanism has some issues. Most fundamentally, it is based on mhchem version 2, which is not optimal and cannot be updated to mhchem v3. Maybe @mhchem: can add a link to the details about the incompatibilities.

In Wikidata there is already a property P274 which expresses the sum formula as UTF chars. This information can currently be displayed using either the parser function invoke or the lua module wikidata.

My goal is to improve the situation, by adding a better version of chem tags and use information from Wikidata in Wikipedia. I would like to find a page where I can interact with the community to brainstorm how this could be done best. Is this the correct place? From a technical perspective, I see two main questions:

  1. What grammar should be used to encode chemical structures?
  2. Where should the data be stored (inside the tag or in wikidata)?

--Physikerwelt (talk) 12:01, 21 February 2018 (UTC)

@Physikerwelt: This is probably the right place to discuss this. What do you see as the problem(s) with chemical formula (P274)? We also have general formula (P1673) which works similarly, and chemical structure (P117) which links to an image. ArthurPSmith (talk) 18:26, 21 February 2018 (UTC)
Physikerwelt, what do you mean by grammar in relation to chemical formulae? Wostr (talk) 19:33, 21 February 2018 (UTC)
I am not entirely sure if there are problems with the properties mentioned above. I clearly see the svg image as a disadvantage since it's hard to change. For instance, adding another element to the example linked in chemical structure (P117) [4] would require the user to download the svg, change it, upload a new svg and link that. With mhchem one can express more than just sum formulae, such as "chemical equations" [5] or even more complex structures.
Are there any SPARQL queries using chemical properties? They would probably give a better intuition of what is already good and what could be improved. --Physikerwelt (talk) 16:36, 26 February 2018 (UTC)
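As a starting point, here is a small sketch of such a query, wrapped in a Python helper that only builds the query text (nothing is sent over the network here). It lists items together with their chemical formula (P274); the helper name and the default limit are assumptions for illustration.

```python
def formula_query(limit=10):
    """Return a SPARQL query listing items with a chemical formula (P274)."""
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?formula WHERE {{
  ?item wdt:P274 ?formula .
}}
LIMIT {int(limit)}
""".strip()


print(formula_query(5))
```

The printed text can be pasted into the Wikidata Query Service to see how the formulae are currently stored.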
And how is it possible to obtain such a structural formula in mhchem? Wostr (talk) 00:01, 10 March 2018 (UTC)
@Physikerwelt: A good thing would be to discuss a new « chemical formula » datatype with the dev team : @Lea Lacroix (WMDE): author  TomT0m / talk page 18:41, 21 March 2018 (UTC)

peramivir hydrates[edit]

Are these two (peramivir trihydrate (Q47495829) and peramivir tetrahydrate (Q27158395)) duplicates? Some ext-ids indicate so, but e.g. PubChem pages are for trihydrate and tetrahydrate (but maybe it's a mistake – names in PubChem indicate trihydrate)? Wostr (talk) 21:28, 23 February 2018 (UTC)

Never look at the names in PubChem: they are not under the control of the PubChem team and nobody checks whether the name is consistent with the formula.
PubChem has 2 different structures (see InChIKey) so 2 items are necessary. So the correct way is to delete all redundant IDs and to rename peramivir tetrahydrate (Q27158395) to peramivir tetrahydrate. Snipre (talk) 22:57, 23 February 2018 (UTC)
Okay, thanks for the changes. Wostr (talk) 00:12, 27 February 2018 (UTC)

Beilstein numbers[edit]

I wonder why we have Beilstein Registry Number (P1579), whose name suggests that it contains numbers from the Beilstein database? As far as I know there are no such numbers right now and Elsevier's Reaxys uses only 'Reaxys Registry Number'. What's more, this property is used in items that are not in Beilstein, but in other databases available through Reaxys; also, some of these numbers come from sources where it is indicated that the number is a Reaxys Registry Number, not Beilstein. I think we should rename this property accordingly. Wostr (talk) 00:12, 27 February 2018 (UTC)

@Wostr: Do you know if the Reaxys database uses the same numbers as the Beilstein database? I mean, if compound x had Beilstein number YYY in the Beilstein database, does Reaxys reuse that number YYY as Reaxys number? Snipre (talk) 14:18, 13 March 2018 (UTC)
AFAIK (from the time I had access to Reaxys, which was about half a year ago) there is no such thing as Beilstein database or Beilstein numbers right now – there is only Reaxys (with all the information from Beilstein, Gmelin etc. databases included in it and Reaxys numbers). But to be sure, I will ask a person who has this access. Wostr (talk) 14:45, 13 March 2018 (UTC) PS Also, sometimes there are both ids (Reaxys and Beilstein) in ChEBI and both are the same number, but nevertheless I asked about it. Wostr (talk) 14:51, 13 March 2018 (UTC)
Yep, the Beilstein RN is now Reaxys RN (or Reaxys ID) and the numbers are the same. Also, some examples: in ChEBI [6], [7]; a source [8]; funnily, we have some Beilstein RN imported from sources where it is described as Reaxys RN, see e.g. sulfuric acid (Q4118). I think we should relabel this property and Beilstein RN should be an alias. That's, however, not the case of Gmelin numbers (Gmelin number (P1578)), as these are different from Reaxys/Beilstein numbers. Wostr (talk) 16:59, 13 March 2018 (UTC)
Yes, Reaxys is a superset of Beilstein now, using the same numbers. Most of these Beilstein numbers I imported from ChEBI. So maybe we should relabel Beilstein to Reaxys. Sebotic (talk) 13:33, 5 April 2018 (UTC)

GHS data after creation of Property:P4952[edit]

As I see that Snipre is making some progress in relation to this property, I have to ask about the proper value in safety classification and labelling (P4952), because the proposition that we should use e.g. safety classification and labelling (P4952) = Regulation (EC) No. 1272/2008 (Q2005334) may cause some problems. I'm placing this in subsection, because I'm planning to compile a list of needed changes and needed new items, which I place in the next subsections to discuss.

1. Value in Property:P4952[edit]

If we use safety classification and labelling (P4952) = Regulation (EC) No. 1272/2008 (Q2005334) in items, it can have some implications in the future, because very few people understand which H-phrases one should choose from the source and place in WD. As an example for further discussion, the GHS classification and labelling for 2,2,4-trimethylpentane (Q209130) taken from Sigma-Aldrich SDS for European Union, relatively up-to-date (2017) [9]:

  • classes and categories (classification): Flammable liquids (Category 2); Aspiration hazard (Category 1); Skin irritation (Category 2); Specific target organ toxicity - single exposure (Category 3); Acute aquatic toxicity (Category 1); Chronic aquatic toxicity (Category 1)
  • H-phrases (classification): H225, H304, H315, H336, H400, H410
  • H-phrases (labelling): H225, H304, H315, H336, H410
  • EUH-phrases (labelling): none
  • P-phrases (labelling): P210, P261, P273, P301 + P310, P331, P501
  • GHS pictograms (labelling): 02, 07, 08, 09
  • signal word (labelling): Danger

So, the options I see are:

  1. use safety classification and labelling (P4952) = Regulation (EC) No. 1272/2008 (Q2005334)
    • it will have to be clearly indicated that GHS hazard statement (obsolete) (P728) is only used for: H-phrases (labelling).
    • in this option it will not be possible to add both classification and labelling data in one item (so the TomT0m's method for classification using subclass of (P279) would have to be adopted).
  2. use safety classification and labelling (P4952) = GHS labelling (Q50490754) (and if we agree to add GHS classification using P4792, also safety classification and labelling (P4952) = GHS classification (Q50490688))
  3. use safety classification and labelling (P4952) = Qxxx (Qxxx created as a subclass of e.g. Regulation (EC) No. 1272/2008 (Q2005334) and GHS labelling (Q50490754): GHS labelling according to CLP Regulation)
    • there will be no need for qualifiers, but we would need a few new items for each document (USA, EU, Japan, etc., etc.)
    • if we agree to add GHS classification using P4792, we would have two items for each country, e.g. Qxxx: GHS labelling according to CLP Regulation and Qyyy: GHS classification according to CLP Regulation.

But maybe there is some other way which I don't see? Or maybe some problems may be eliminated in a way I'm not familiar with? Wostr (talk) 19:04, 14 March 2018 (UTC)

@Wostr: Do we need to do the difference ? You never find all labelling data (signal word, GHS pictograms, H-phrases, P-phrases, EUH-phrases) under classification so if you have only H-phrases without other data this means that the editor took the information from the wrong section. Then if the editor mixed H-phrases from classification section and other labelling data from labelling section then this is not our fault: if someone doesn't understand the difference between both sections then we can't teach everyone about everything. I prefer to specify in the property page the rules of use (meaning that P4952 used with Regulation (EC) No. 1272/2008 (Q2005334) implies that only labelling data from labelling section) and that's it. Snipre (talk) 14:23, 21 March 2018 (UTC)
@Snipre: the problem is that I've corrected dozens of GHS data in Wikipedia, because someone added wrong H-phrases (because I didn't know there is a difference etc.), so that's why I am a bit oversensitive on this. And we don't have to make the distinction by safety classification and labelling (P4952) = GHS labelling (Q50490754), we can agree that GHS hazard statement (obsolete) (P728) should be used for labelling H-phrases and add some complex constraints (that would catch situations where there is a probability that classification H-phrases have been added; if it's possible of course to make such constraints, e.g. if there is Hxxx and Hyyy then...). That may however be kind of confusing if we agree in the future that classification (classes, categories) should be added by safety classification and labelling (P4952) too – then it should be noted somewhere that: H-phrases in safety classification and labelling (P4952) are for labelling and H-phrases for classification have to be taken from GHS categories items by some query. Maybe Wikidata usage instructions (P2559) can be of some use here. Wostr (talk) 17:57, 21 March 2018 (UTC)
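To make the "if there is Hxxx and Hyyy then..." idea concrete, here is a rough sketch using the 2,2,4-trimethylpentane data quoted earlier in this section. The rule table is an assumption derived only from that one example (H400 appears in the classification list but not in the labelling list, where H410 is present); a real constraint would need a rule set checked against the regulation.

```python
# H-phrases for 2,2,4-trimethylpentane as quoted from the SDS in this section.
classification = {"H225", "H304", "H315", "H336", "H400", "H410"}
labelling = {"H225", "H304", "H315", "H336", "H410"}

# Hypothetical rule table: phrase -> phrase that seems to replace it on the label.
# Only the H400/H410 pair is backed by the example above.
REDUNDANT_ON_LABEL = {"H400": "H410"}


def suspicious_phrases(h_phrases, rules=REDUNDANT_ON_LABEL):
    """Return phrases suggesting that classification data was entered as labelling data."""
    present = set(h_phrases)
    return sorted(p for p, covering in rules.items()
                  if p in present and covering in present)


print(sorted(classification - labelling))   # phrases found only under classification
print(suspicious_phrases(classification))   # would be flagged if entered as labelling
print(suspicious_phrases(labelling))        # nothing to flag
```

A check like this cannot prove that the wrong section was used; it can only produce a list of statements worth a second look.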

2. NFPA 704[edit]

Do we agree to file a bot request for merging existing NFPA 704 data into new property? And, of course, adding constraint to NFPA 704 properties that from now these properties should be used as qualifiers only?

The proposed model (identical like in the property's discussion):

Wostr (talk) 19:09, 14 March 2018 (UTC)

  • As there is no answer for my bot request (migration of NFPA 704 from the old model to the new one), I'll try to do most of these edits myself using QuickStatements (and the rest manually). This will take some time and will result in a situation in which for a few days some part of the NFPA 704 data will be present in WD in the old model (every NFPA 704 property separated) and some in the new model (every NFPA 704 property as a qualifier of safety classification and labelling (P4952)). Wikipedias using NFPA 704 data were notified ~a week ago about the change. If anyone has any comments about this, please let me know. Wostr (talk) 09:56, 27 April 2018 (UTC)
  • Most of the NFPA 704 data has been changed to the new model. The completed batch included P143-sourced NFPA 704 data only (most of the NFPA 704 data we have): ~150 items with full NFPA 704 labelling (4 properties) and ~1040 items with 3 properties (without NFPA 704 Special/Other). There are over 100 items in which NFPA 704 is incomplete/unsourced/sourced in a way that was not easy to convert using QuickStatements etc.; these I'll try to edit manually (after the update of the constraint violations pages). Wostr (talk) 00:32, 5 May 2018 (UTC)
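The restructuring described in these two notes can be pictured with a small data sketch: four separate statements in the old model become one safety classification and labelling (P4952) statement carrying the NFPA 704 values as qualifiers. The key names, the example item id and the example values are placeholders, not real property IDs or real data, and using NFPA 704 (Q208273) as the statement value is an assumption taken from the discussion in the next subsection.

```python
# Placeholder names for the four NFPA 704 properties (not real property IDs).
NFPA_KEYS = ("health", "fire", "instability", "special")


def to_qualifier_model(item_id, old_claims, value_item="Q208273"):
    """Fold separate NFPA 704 claims into one P4952 statement with qualifiers."""
    qualifiers = {key: old_claims[key] for key in NFPA_KEYS
                  if old_claims.get(key) is not None}
    return {
        "item": item_id,
        "property": "P4952",
        "value": value_item,
        "qualifiers": qualifiers,
    }


# Made-up example of an item with three of the four values (no Special/Other).
old = {"health": "0", "fire": "3", "instability": "0"}
new = to_qualifier_model("Q_EXAMPLE", old)
print(new)
print("complete" if len(new["qualifiers"]) == len(NFPA_KEYS) else "incomplete")
```

Items that come out with fewer than three qualifiers would fall into the "edit manually" group mentioned above.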

Agreement to distinguish between system and document[edit]

Do we agree to use legal documents or standard documents instead of classification systems for safety classification and labelling (P4952) ?

For example:

Globally Harmonized System of Classification and Labelling of Chemicals (Q899146) is a system but can have different applications depending on the country. For EU, US and China at least, some differences can appear due to different regulatory application texts. And we can't rely on the source to determine the right application text. For example, an international company has to issue an MSDS for each country where its chemical is sold, according to the local regulatory text. So for one product sold by one company, we can have at least 4 MSDS with slight differences (one for US, one for EU, one for China and one following the UN documentation). I don't know about other countries and I hope contributors can help me to define which text is relevant for each country.

Then, if we agree on that solution for Globally Harmonized System of Classification and Labelling of Chemicals (Q899146), do we agree to use the same distinction for other safety classification systems like NFPA 704 (Q208273)? NFPA 704 (Q208273) is for the system and we have to create a new item for the document which describes the NFPA 704 system? Snipre (talk) 14:48, 21 March 2018 (UTC)

That solution would solve two problems; normally we should use the system item in safety classification and labelling (P4952) with some qualifier to distinguish between different jurisdictions. I don't know though if we should, e.g. for EU GHS, distinguish between different ATPs? With NFPA 704 the problem is that the document is NFPA 704 (it's an NFPA standard and 704 is a code for this standard) which introduces a system (AFAIK usually called NFPA 704 too) to determine which categories should be used in the NFPA 704 hazard diamond. So in the case of NFPA 704 I think we already have the document item.
The problem is for GHS, because I really don't know how the GHS for the US and other countries is placed in legal acts – if it's a single document we can use just one item for a specific country, or maybe there was more than one document at different times. Fortunately, in Russian Wikipedia there is no GHS in their infoboxes so there won't be mass uploads of their unsourced data – but nevertheless I'll try to determine how it is done in Russia (AFAIK GHS in Russia will be mandatory from 2020? 2021?). Wostr (talk) 18:13, 21 March 2018 (UTC)
@Wostr: I don't like to mix different types of items as value for safety classification and labelling (P4952):
No mixing of concepts, that's the rule to avoid bad inferences later. Snipre (talk) 20:33, 21 March 2018 (UTC)
Okay, I know what you mean. We should establish some constraint in this property, because we will have the 'NFPA 704' item (about the system), the 'NFPA 704: Standard System for the Identification of the Hazards of Materials for Emergency Response' item (about the standard) and a few 'NFPA 704: Standard System for the Identification of the Hazards of Materials for Emergency Response (version xxxx)' items about editions of this standard. It won't be clear to people which item they should use. And, if I understand this correctly, only the edition items will be correct? However, this will be somewhat inconsistent with using Regulation (EC) No. 1272/2008 (Q2005334) – there were several amendments to this regulation (most of them called ATPs) which introduced some changes to the EU GHS. There are situations where GHS data according to the CLP Regulation after X ATP is different from GHS data (for the same substance) after X+1 ATP. So, should we make items for different ATPs and use them in safety classification and labelling (P4952)? Wostr (talk) 23:06, 21 March 2018 (UTC)
@Wostr: You clearly described the problem and no we won't use the versions because there is no way to define which version was used to define the classification/labelling of a compound. Only the fundamental document is mentioned in the SDS, not the version. If I list the versions, this is just to have an idea about the up-date of the fundamental document: if you have no up-date since 10-20 years, perhaps a new fundamental document is used. Snipre (talk) 11:03, 22 March 2018 (UTC)
  • This and this may be of some help. BTW I think that – when we agree on all issues regarding this property – we could establish the full instruction here and just transclude relevant sections of this instruction to all properties discussions (rather than write instructions one by one). Wostr (talk) 14:16, 22 March 2018 (UTC)

GHS statements[edit]

I've created items for GHS pictograms, H and P statements (see here). I will add items for EUH/AUH statements and for obsolete H/P statements next week. Also, I'll try to convert old GHS data to the new model so that GHS hazard statement (obsolete) (P728) and GHS precautionary statements (obsolete) (P940) can be deleted. Wostr (talk) 19:47, 17 April 2018 (UTC)

EC Inventory[edit]

The EC Inventory is a database that contains 106,211 unique substances/entries. Has it been (partially/fully) imported? EC ID (P232) is currently used in 20,339 items. --Leyo 12:08, 9 April 2018 (UTC)

@Leyo: No, and I prefer to avoid any large data import before a good curation of the existing items:
- we still have 1122 items sharing the same CAS number and 196 items with 2 different CAS numbers (see report)
- 82 items sharing the same EC number (see "Single_value"_violations report)
- 88 items sharing the same InChIKey and 396 items having 2 different InChIKey (see [10])
Just adding large amount of data in the current situation will create more mess.
If you really want to work with the above source, you can extract the EC number and the CAS number from WD items having exactly one value for each of these two properties and check if both values are the same in the EC inventory database, then create a list of conflicts and we will curate that list. Snipre (talk) 13:45, 9 April 2018 (UTC)
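The check proposed here could look roughly like the following sketch. It assumes that the Wikidata side has already been exported as (Q-id, CAS number, EC number) tuples for items with exactly one value of each, and that the EC inventory has been loaded as a simple EC number to CAS number mapping (one CAS number per EC number, which is a simplification). All identifiers in the example are made-up placeholders.

```python
def find_conflicts(wd_items, ec_inventory):
    """List WD items whose CAS/EC pair disagrees with the EC inventory.

    wd_items: iterable of (qid, cas, ec) tuples.
    ec_inventory: dict mapping EC number -> CAS number.
    """
    conflicts = []
    for qid, cas, ec in wd_items:
        expected_cas = ec_inventory.get(ec)
        if expected_cas is None:
            conflicts.append((qid, cas, ec, "EC number not in inventory"))
        elif expected_cas != cas:
            conflicts.append((qid, cas, ec, f"inventory CAS is {expected_cas}"))
    return conflicts


# Placeholder data, not real identifiers.
wd = [("Q_A", "CAS-1", "EC-1"), ("Q_B", "CAS-2", "EC-2"), ("Q_C", "CAS-3", "EC-9")]
inventory = {"EC-1": "CAS-1", "EC-2": "CAS-7"}
for row in find_conflicts(wd, inventory):
    print(row)
```

The output is only a worklist: each conflict still has to be resolved by checking which of the two sources describes the right substance.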
Items with CAS number issues or having EC numbers already shall not be changed.
Unfortunately, I am not really skilled in doing tasks like the one you proposed efficiently. --Leyo 14:20, 9 April 2018 (UTC)
@Leyo: So you can see what WD will need in the future: comparing datasets and analysing possible matches. If we have 4 datasets and, for one entry, 3 datasets have the same data, can we conclude that the entry is the same in all datasets? And can we do the same if only 2 datasets have the same data?
But before doing that kind of job we have to clean our reference dataset, WD, and be sure that we don't have 2 items for the same chemical or one item mixing data about 2 chemicals. Snipre (talk) 14:52, 9 April 2018 (UTC)
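The "3 out of 4" versus "2 out of 4" question can be made explicit with a small Python sketch. It does not say which threshold is the right one; it only shows how a threshold would be applied. The function name `agreed_value` and the parameter `min_support` are hypothetical.

```python
from collections import Counter


def agreed_value(values_by_dataset, min_support):
    """Return the value shared by at least min_support datasets, else None.

    values_by_dataset: dict mapping a dataset name to the value it gives
                       for one entry (None when the dataset has no value).
    """
    present = [v for v in values_by_dataset.values() if v is not None]
    top = Counter(present).most_common(2)
    if not top:
        return None
    # A tie between the two most frequent values is treated as no agreement.
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    value, support = top[0]
    return value if support >= min_support else None
```

With `min_support=3`, an entry where only two datasets agree is returned as `None` and goes to manual review; with `min_support=2` it is accepted unless two different values are equally frequent.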
Just to be clear: I was not suggesting creating any new items, but importing the EC number into existing items that lack an EC number, based on the CAS number in the item. Items with CAS number issues are to be skipped. I don't think that such an import would cause many issues. If it does, I will fix them manually. --Leyo 15:00, 9 April 2018 (UTC)
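A minimal Python sketch of the import proposed here, under stated assumptions: `propose_ec_additions`, the layout of `wd_items`, and the CAS-to-EC mapping `cas_to_ec` are hypothetical names, and "CAS number issues" is read as "several CAS numbers on one item, or one CAS number shared by several items".

```python
from collections import Counter


def propose_ec_additions(wd_items, cas_to_ec):
    """Return {Q number: EC number} for items that could receive an EC number.

    wd_items: dict mapping a Q number to {"cas": [...], "ec": [...]}
              (hypothetical layout).
    cas_to_ec: dict mapping a CAS number to its EC number in the EC Inventory
               (hypothetical layout).
    """
    # Count how many items use each CAS number, to detect shared values.
    cas_usage = Counter(
        cas for props in wd_items.values() for cas in set(props["cas"])
    )
    proposals = {}
    for qid, props in wd_items.items():
        if props["ec"]:
            continue  # item already has an EC number: leave it unchanged
        if len(props["cas"]) != 1:
            continue  # no CAS number, or several: skip
        cas = props["cas"][0]
        if cas_usage[cas] > 1:
            continue  # CAS number shared with another item: skip
        ec = cas_to_ec.get(cas)
        if ec is not None:
            proposals[qid] = ec
    return proposals
```

The result is a list of proposed additions that can be reviewed before any edit is made; nothing is written to WD by this sketch.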
@Leyo: This is not only a question of new items, this is a question of adding the data to the right place. In any case you have to make a choice in the data import process:
  • use the CAS numbers in WD as matching parameter and then add the corresponding EC number from the EC inventory database
  • use the EC numbers in WD as matching parameter and then add the corresponding CAS number from the EC inventory database
In each case you need to curate the existing items that have constraint violations before you can run that import process. If you have 2 items with the same CAS number, do you want to add the EC number to both items without checking whether the CAS number is correctly used?
If you try to use the name or the chemical formula to match the WD items with the EC Inventory database, in the best case you will find no correlation, in the worst case you will add the data to the wrong item (typical example: an item whose English label describes one isomer while the item data describe the mixture of isomers).
If you want to be convincing about the relevance of your proposal, perhaps you can describe the process you will use to add the data? Just to explain my position: one year ago, more than 1000 constraint violations were reported for CAS numbers. With the help of several contributors, we were able to reduce that number to less than 600. I don't want to see that number growing again just because someone wants to add data without caring about the consequences. I am direct because I have spent a lot of time curating data and I am tired of trying to improve WD while others just play with data without any care.
I prefer few data with few errors to a lot of data with a lot of errors. Snipre (talk) 19:58, 9 April 2018 (UTC)
Most of your questions have already been answered. Didn't I express myself clearly? --Leyo 12:46, 10 April 2018 (UTC)
@Leyo: Sorry, I missed the "Items with CAS number issues are to be skipped". I would propose doing the inverse: use the EC number as matching parameter and add the CAS number. The CAS number is not a reliable parameter, especially not in WD. Snipre (talk) 11:16, 13 April 2018 (UTC)
Well, I intend to add EC numbers. There are currently 72,137 items with a CAS number, but only 20,336 with an EC number. I wonder how many items contain the latter, but not the former. --Leyo 12:14, 13 April 2018 (UTC)
The problem is that CAS numbers are not reliable, mainly because we don't have an official open source for CAS numbers. Snipre (talk) 13:47, 13 April 2018 (UTC)
By the way, can you extract the ECHA InfoCard ID from the ECHA database and add it to the corresponding EC number? Snipre (talk) 11:19, 13 April 2018 (UTC)
A while ago, ECHA InfoCard ID (P2566) was added to items based on the CAS number by a bot. --Leyo 12:14, 13 April 2018 (UTC)

tetrakis(triphenylphosphine)palladium (Q27284745)

Is tetrakis(triphenylphosphine)palladium (Q27284745) supposed to be a Pd or a Pb compound? It links to CID 91667687, which is erroneous in that sense, i.e. a mishmash. --Leyo 16:30, 20 April 2018 (UTC) PS. It is potentially a duplicate of Tetrakis(triphenylphosphine)palladium(0) (Q2366402).

It seems that this is an erroneous entry imported from an external database, and I think there are two options: (1) if the lead compound exists, we should update tetrakis(triphenylphosphine)palladium (Q27284745) accordingly (by removing some properties, moving them to Tetrakis(triphenylphosphine)palladium(0) (Q2366402), etc.); (2) we could merge these two items and deprecate the erroneous data (with Wikidata reason for deprecation (Q27949697) and a proper value; we have applies to other compound (Q51734763), so I think we should also have something like erroneous entry in external source or more specific reasons, because this is not an isolated case and we have had and will have issues like this). As I can't find anything about this lead complex, I'd choose the second option. Wostr (talk) 17:09, 20 April 2018 (UTC)
@Leyo, Wostr: Better to report the error to the PubChem team and see what they answer. Email of PubChem: Please indicate the CID of the palladium complex too. Snipre (talk) 22:26, 20 April 2018 (UTC)
I did so. --Leyo 07:52, 25 April 2018 (UTC)

