User:SoCalChemBot/Analysis of PubChem CIDs in WD and WP
The PubChem compound ID (PubChem CID) is one of the most widely used open chemical identifiers. It is also widely used in Wikipedia chemical infoboxes. In order to determine the concordance of PubChem CIDs in Wikidata items and the corresponding Wikipedia infoboxes (linked through interwiki links), we retrieved PubChem CIDs from all WP chemistry templates (Drugbox and chembox), and compared those values to Wikidata. The snapshot was taken on 31 January 2017.
Based on the approach above, we retrieved 16,991 distinct Wikipedia chemical compounds. Here is a detailed overview with more numbers:
The full list of Wikidata-to-Wikipedia comparisons, indicating matches and mismatches, can be found here
The Jupyter Notebooks:
- Data aggregation: wp_wd_chem_data_aggregation.ipynb
- Data analysis: WP_WD_PubChem_CID-comparison.ipynb
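The notebooks above hold the actual implementation. As a rough illustration only of the aggregation step described in the introduction, pulling a PubChem CID out of infobox wikitext could look like the sketch below. The parameter name `PubChem`, the helper `extract_cids`, and the sample wikitext are assumptions made for this example; the real templates may use other parameter names or indexed variants.

```python
import re

# Assumption: the infobox stores the CID on a line such as "| PubChem = 2244".
# Real Drugbox/chembox templates may use different or numbered parameter names.
CID_PARAM = re.compile(r"^\s*\|\s*PubChem\s*=\s*(\d+)\s*$", re.MULTILINE)


def extract_cids(wikitext: str) -> list[int]:
    """Return every numeric CID found on a matching infobox parameter line."""
    return [int(value) for value in CID_PARAM.findall(wikitext)]


# Hypothetical sample wikitext, not copied from any article.
sample = """{{Chembox
| Name = Aspirin
| PubChem = 2244
}}"""

print(extract_cids(sample))
```

Empty or non-numeric parameter values are skipped rather than guessed, so an article without a usable CID simply yields an empty list.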
Discrepancies needing manual curation and/or classification
Group A: Compounds where PubChem CID of WP and WD do not match (1067)
When resolving these conflicts, it is NOT sufficient to just choose one CID and paste it into the WD item and the WP infobox. All other chemical identifiers also need to be matched. It is very important in this process to make sure that the correct stereoisomer and salt form are used and that these are consistent between WD and WP. Most errors are due to mix-ups of different stereoisomers or salt forms. In total, these are 1,067 articles/items, or 6.3% of all articles with a CID.
Group B: Wikipedia articles with Wikidata items needing improvements (736)
This set contains the 736 WP articles that have a CID but whose corresponding WD item either has no CID or is empty.
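The two groups can be expressed as a simple classification rule. The following is a minimal sketch, not the code from the analysis notebook; the function `classify`, the label names, and the sample records are hypothetical and only illustrate the logic described above.

```python
from collections import Counter
from typing import Optional


def classify(wp_cid: Optional[int], wd_cids: set[int]) -> str:
    """Compare the CID from a WP infobox with the CIDs on the linked WD item.

    wp_cid  -- CID found in the Wikipedia infobox, or None if absent
    wd_cids -- CIDs stated on the Wikidata item (empty set if none)
    """
    if wp_cid is None:
        return "no_wp_cid"   # outside both groups: nothing to compare
    if not wd_cids:
        return "group_b"     # WP has a CID, the WD item has none
    if wp_cid in wd_cids:
        return "match"
    return "group_a"         # both have CIDs, but they disagree


# Hypothetical records: (article title, WP CID, WD CIDs)
records = [
    ("Article 1", 2244, {2244}),
    ("Article 2", 3672, {1234}),
    ("Article 3", 2519, set()),
]

counts = Counter(classify(wp, wd) for _, wp, wd in records)
print(counts)
```

A Group A result only flags a conflict. As noted above, resolving it still requires checking the other identifiers, the stereoisomer, and the salt form by hand.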