The ChemID Initiative aims to compare the free chemical databases in order to match their corresponding IDs, and to use Wikidata as the central point of connection between the databases' data sets.
Different databases list chemicals according to different criteria and identify each chemical by an internal identifier. These identifiers are often used outside the databases to identify chemicals. The best-known identifier is the CAS number.
Most databases integrate some identifiers from other databases in order to link data sets spread across the different databases, but this is not done systematically: for example, database A adds the identifiers of databases B and C, database B integrates the identifiers of databases C and D, database C adds the identifiers of databases E, F and G, and so on.
Wikidata can be a central point of connection between the databases and the data they store by listing, in the item of each chemical, the identifiers from all databases.
How to achieve this goal?
Most of the databases are open data and free (but we have to check the licences to see to what extent we can import the IDs), and they even offer access to their data in standard machine-readable formats (XML, SDF, ...).
For each chemical, these data sets contain different data: internal identifier, identifiers of other databases, chemical formula, InChI, InChIKey, SMILES, CAS number, IUPAC name, ... All these data can be used to match the data sets of the databases and to find the records that correspond to a single chemical.
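As an illustration of this kind of matching, the sketch below groups records from two data sets under a shared structural key. It assumes that both data sets expose an InChIKey field; the field names (`id_a`, `id_b`, `inchikey`), the helper `match_by_key` and all values are hypothetical placeholders, not real database entries.

```python
from collections import defaultdict

# Hypothetical records: identifiers and keys are placeholders only.
db_a = [
    {"id_a": "A-001", "inchikey": "KEY-ONE"},
    {"id_a": "A-002", "inchikey": "KEY-TWO"},
]
db_b = [
    {"id_b": "B-17", "inchikey": "KEY-TWO"},
    {"id_b": "B-42", "inchikey": "KEY-THREE"},
]


def match_by_key(records_a, records_b, key="inchikey"):
    """Group the identifiers of both data sets under their shared key."""
    merged = defaultdict(dict)
    for rec in records_a + records_b:
        for field, value in rec.items():
            if field != key:
                merged[rec[key]][field] = value
    return dict(merged)


matches = match_by_key(db_a, db_b)
# Keys present in both data sets now carry both internal identifiers.
both = {k: v for k, v in matches.items() if "id_a" in v and "id_b" in v}
```

Records that appear in only one data set stay in `matches` with a single identifier, so they can be reviewed separately instead of being silently dropped.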
- First task is to create a list matching the Q numbers of the Wikidata items that define chemicals with one well-known identifier (PubChem CID).
  - This has to be done manually and, in order to create a reference list, at least twice by different groups of people. The resulting lists can then be compared and the discrepancies examined in detail. This is necessary because the different Wikipedias do not manage their chemical articles in the same way.
  - The best way to perform this task is to provide the list of Q numbers of all chemicals present in Wikidata to the Chemistry projects of the different Wikipedias and to ask them to add the corresponding external identifier.
  - After some months, the lists are collected and the data are compared.
- Second task is to download the data from the databases cited above and to store all of it in a single format.
- Third task is to compare the reference list of Q numbers with the database created in task 2, using the external identifier.
- Last step is to import the data into Wikidata.
Schematic view of the process:
- Task1 : Q number and ID1 -> [Q number;ID1]
- Task2 : ID1, ID2, ID3 and ID4 -> [ID1;ID2;ID3;ID4]
- Task3 : [Q number;ID1] and [ID1;ID2;ID3;ID4] -> [Q number;ID1;ID2;ID3;ID4]
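The schematic above amounts to a join on ID1. A minimal sketch, assuming the reference list is a sequence of (Q number, ID1) pairs and the merged database is a list of rows keyed by ID1; the function name `join_on_id1` and all values are hypothetical placeholders.

```python
def join_on_id1(reference, crossrefs):
    """Task 3: extend each [Q number; ID1] pair with the IDs found for ID1."""
    by_id1 = {row["ID1"]: row for row in crossrefs}
    joined, unmatched = [], []
    for q_number, id1 in reference:
        row = by_id1.get(id1)
        if row is None:
            # No database record for this ID1: keep it for manual review.
            unmatched.append(q_number)
        else:
            joined.append({"Q": q_number, **row})
    return joined, unmatched


# Task 1 output (placeholder values).
reference = [("Q-EXAMPLE-1", "100"), ("Q-EXAMPLE-2", "200")]
# Task 2 output (placeholder values).
crossrefs = [{"ID1": "100", "ID2": "x1", "ID3": "y1", "ID4": "z1"}]

joined, unmatched = join_on_id1(reference, crossrefs)
```

Returning the unmatched Q numbers separately makes it easy to see which items still lack a record in the merged database before anything is imported.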
- Contributors ready to check data from the Wikipedias and to clearly identify the chemical described by a Wikidata item
- People with programming skills to handle data from the databases and to create a super database
- Bot operators ready to import data from the super database into Wikidata
A list of all articles about chemicals from WP:en, WP:de and WP:fr is available here, with the PubChem ID and the CAS number extracted from the articles. This list contains 12882 chemicals. The job now is to check the PubChem ID of all these chemicals on the PubChem website, so that selected data can be extracted from the PubChem database.
If you want to participate in this task, just ask user:Snipre for access to the online spreadsheet.
An intermediary step can be performed: the comparison of the PubChem IDs of different data sets from different WP.
- 26.3.2014: Request for data extraction from WP:de put in Wikipedia:Redaktion Chemie
- 9.4.2014: Got the list of articles with CAS number, PubChem CID and Q number for WP:de and WP:en, see fr:Utilisateur:Snipre/Infobox Chimie/en and fr:Utilisateur:Snipre/Infobox Chimie/de
- 28.3.2015: A single table with PubChem CIDs is available, along with a first analysis
Available data in external databases
X data available, - no data available, ? not checked
From the data extracted from de:WP, en:WP and fr:WP, 12882 chemicals were identified (the criterion for identification was that the article has a chemical infobox and a link to WD).
The merged list gives the following results:
- 12882 chemicals with Q number
- 2347 chemicals without PubChem CID (in any WP)
- 6555 chemicals with only one PubChem CID (only one WP has a value)
- 3093 chemicals with two PubChem CID (two WP have a value)
  - 476 chemicals have different values
  - 2617 chemicals have the same value
- 885 chemicals have three PubChem CID (all WP have a value)
  - 18 chemicals have a different PubChem CID in each WP
  - 139 chemicals have the same PubChem CID in at least 2 WP
  - 728 chemicals have the same PubChem CID in all 3 WP
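A breakdown of this kind can be computed from the merged table. The sketch below assumes one row per Q number holding the CID found in each of the three Wikipedias, with `None` where an edition has no value. The function name, category labels and sample values are hypothetical and are not the data behind the figures above.

```python
from collections import Counter


def classify(cids):
    """Classify the PubChem CIDs reported by the Wikipedias for one chemical.

    `cids` holds one entry per Wikipedia; None means no CID in that edition.
    """
    values = [c for c in cids if c is not None]
    if not values:
        return "no CID"
    if len(values) == 1:
        return "one CID"
    distinct = len(set(values))
    if distinct == 1:
        return "all agree"
    if distinct == len(values):
        return "all differ"
    return "partial agreement"  # three values, exactly two of them equal


# Placeholder rows: Q number -> (CID in de, CID in en, CID in fr).
rows = {
    "Q-EXAMPLE-1": (None, None, None),
    "Q-EXAMPLE-2": (10, None, None),
    "Q-EXAMPLE-3": (10, 10, None),
    "Q-EXAMPLE-4": (10, 11, 10),
    "Q-EXAMPLE-5": (10, 11, 12),
}
summary = Counter(classify(cids) for cids in rows.values())
```

Rows classified as "all differ" or "partial agreement" are the ones that need a manual check against PubChem.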