Topic on User talk:Wostr/Structured Discussions Archive 1

SCIdude (talkcontribs)

Please correct this. How many did you get wrong in this batch?

Wostr (talkcontribs)

Yes, you're right that this item describes a stereochemically defined compound. I did not know that there may be InChIs with "?" in the /b or /t sublayers that do not indicate an undefined stereocenter. The problem is the 3-iminopyrazol-1-yl group. The InChI from PubChem indicates that the double-bond stereochemistry of H-N=C< is undefined. However, it's hard to reproduce this in any software available to me, and even the PubChem structure redrawn in ChemDraw gives a different InChI.

I'll check all the InChIs in these 4 batches for possible 'false positives' in the /b sublayer. I can't tell you right now how big this problem is, but I'll contact you as soon as I know. Then I'll correct all incorrectly changed statements, but I can't tell you right now whether these kinds of errors are occasional, so that I can correct them manually, or whether I'll have to use semi-automatic tools.

SCIdude (talkcontribs)

It may be a problem with InChIs from ChEBI. If so, we should replace them in bulk if that solves the problem.

Wostr (talkcontribs)

After a quick check I see that there may be about 20–30 items in these batches that have to be checked. I'll do that manually (however, I don't know when, probably tomorrow or the day after, as I have a really busy week at work).

Wostr (talkcontribs)

There are 30 items that I'll be reviewing; all are listed here. It seems that most of the problems are the result of double bonds on nitrogen atoms or double bonds in rings. I'm not sure why the InChI in these items shows e.g. the oxime group HO-N=C< as a group that should have defined stereochemistry. However, there are situations like Heme O (Q620211): the InChI from PubChem shows an undefined configuration for many double bonds of the porphyrin, while the InChI from ChemSpider shows all those double bonds as stereochemically defined.

I'll try to check whether these InChIs from PubChem are correct for these items. Maybe we should have more than one InChI in such situations (even with a deprecated rank).

SCIdude (talkcontribs)

This problem also showed up with my current ChEBI InChIKey fixes, resulting in different keys. I agree that multiple InChIs and keys are unavoidable. But when using different ranks, we should have a consistent way to assign them. For example, do we prefer not to specify oxime bonds? What about diazo -N=N- bonds? PubChem usually leaves them unspecified (I agree with this). As to porphyrin bonds, is the (E) configuration geometrically possible? If not, the bond does not need to be specified. This has to be defined on some project page. Could you please do this?

SCIdude (talkcontribs)

BTW I think I found out why the ChEBI InChIs may have a problem. Take the SMILES "C1C[C@@H]2CC[C@H]1C2", which is norbornane with redundant stereo information (the centers are potentially stereogenic, but not in this case). When this is input in PubChem, the stereo specs are removed automatically; when it is input in ChEBI, they are not. As a result, the InChIs differ. So effectively it's a ChEBI software problem.

Wostr (talkcontribs)

As I thought that ~30 incorrect items is a very low number given the scale of QS batches, I checked the whole batch in Excel rather than trying to query it using SPARQL from WD.

I found 774 items whose InChIs have ? in the /b sublayer but that may not be groups of stereoisomers. I've manually checked all the items (unfortunately, most have only one source, the DSSTOX database, because they were created by GZWDer imports) and found:

  • about 58.5% are correct (mostly undefined configuration of C=C bonds or substituted diazo bonds)
  • about 23.8% have to be checked more carefully, but I think that most are correct (about 95% of these have undefined configuration of double bonds in rings with eight or more members; I checked that it is possible to have an eight-membered ring with at least one E double bond, so these items are probably correctly described as 'group of stereoisomers')
  • about 17.7% are probably incorrect (about 85% of these are unsubstituted imino groups, which many databases treat as stereochemically undefined, although at least 3 different InChIs can be assigned in such situations; the rest are unsubstituted diazo groups, heterocyclic compounds or some weird borderline cases, plus some items in which InChIs from different sources differ).
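To give a rough sense of scale, the shares above can be converted into approximate item counts. This is only a sketch: the percentages are stated as "about", so the counts are estimates, and the names `TOTAL`, `SHARES` and `counts` are made up for this illustration.

```python
# Approximate item counts behind the percentages quoted above.
# TOTAL and the shares come from the post; the counts are rounded estimates.
TOTAL = 774

SHARES = {
    "correct": 58.5,
    "needs closer check": 23.8,
    "probably incorrect": 17.7,
}

counts = {label: round(TOTAL * pct / 100) for label, pct in SHARES.items()}

for label, n in counts.items():
    print(f"{label}: ~{n} items")
```

Under these assumptions, that is roughly 453 correct items, 184 to be checked more carefully and 137 probably incorrect ones.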

I'll post on the Wikidata:WikiProject Chemistry discussion page in a few days about this problem. Most incorrectly added 'groups of stereoisomers' for compounds with an unsubstituted imino group can be reverted using QS, so technically it won't be a problem, but we have to do it in a uniform way for all cases.

The problem you mentioned with norbornane and redundant stereo descriptors may cause additional problems in the future. I added 'group of stereoisomers' to items that have ? in the InChI sublayers /b or /t. However, there are also many InChIs for groups of stereoisomers that lack these sublayers: if a compound has 2 stereocentres and 1 is undefined, there is a /t sublayer with a ? descriptor for that stereocentre; if both stereocentres are undefined, there is no /t sublayer at all. So we still have thousands of 'groups of stereoisomers' (with all stereocentres undefined) that are classified as 'chemical compounds'. I asked Egon Willighagen about the script he wrote in 2019; items like norbornane may be false positives that we should try to exclude.
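The "? in /b or /t" check described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the script actually used for the batches: the function names are hypothetical, the parsing only splits the InChI string on "/" and looks at layers starting with b or t, and the second test string is the hexa-2,4-diene InChI hand-edited to contain a "?" rather than a value taken from any database.

```python
import re


def undefined_stereo_entries(inchi: str) -> dict[str, list[str]]:
    """Collect /b and /t entries that carry a '?' descriptor.

    Layers are found by splitting on '/'; the version prefix is skipped.
    Formula layers start with an uppercase letter, so they never match
    the lowercase 'b' or 't' layer prefixes.
    """
    found: dict[str, list[str]] = {"b": [], "t": []}
    for layer in inchi.split("/")[1:]:
        prefix = layer[:1]
        if prefix in found:
            # Entries are separated by ',' (and ';' between components).
            for entry in re.split(r"[,;]", layer[1:]):
                if entry.endswith("?"):
                    found[prefix].append(entry)
    return found


def has_undefined_stereo(inchi: str) -> bool:
    """True if any /b or /t layer contains a '?' descriptor."""
    return any(undefined_stereo_entries(inchi).values())


# (E)-but-2-ene: the double bond is fully defined.
defined = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"
# Illustrative, hand-edited string: one of two double bonds marked '?'.
partial = "InChI=1S/C6H10/c1-3-5-6-4-2/h3-6H,1-2H3/b5-3+,6-4?"

print(has_undefined_stereo(defined))
print(undefined_stereo_entries(partial))
```

As discussed in this thread, a check like this is only a first filter. It flags cases such as unsubstituted imino groups that are not really groups of stereoisomers, and it misses compounds whose stereocentres are all undefined, because their InChIs have no /t sublayer at all.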