User:Carlinmack

Crossref Project

Crossref is a central index of research items across publishers[1]. In April 2022 they released a metadata dump of 134 million records, which is available to download via a torrent. The benefits of it being a torrent are that you can download the dump piecewise rather than all at once, and the data is transferred peer-to-peer, reducing the load on any single node in the network. The licensing of Crossref metadata is not the usual CC license but rather in the territory of "no known restrictions apply", and as such the dump is not given a license.

Annotating Wikidata items with licenses

Originally I worked with pieces of the dump and toyed with the idea of downloading chunks of the dump to process one at a time so that I didn't need to keep 150 GB of storage free. This turned out to be far more complicated than I wanted to deal with, so I made space and downloaded the full dump, which completed the same day.

The code is public domain and is available at https://github.com/carlinmack/qid-id/blob/main/crossref.py. Control of the script is done manually at the bottom of the file. While the script isn't novice friendly and is uncommented, I'd like to think that it would be easy to modify or take inspiration from.

In short, the script iterates over the full dump and extracts DOIs and license information for licenses which contain "creativecommons". Then I query WDQS with the resulting list of DOIs to find whether they exist in Wikidata and whether they are already licensed. I then create a list of QuickStatements for the DOIs which are in Wikidata but do not have a license. Specifically:

  • main(batch) — Iterates over the Crossref dump and generates CC.csv, which has the columns [doi, licenseURL, licenseQID]. The QID of the license is derived from Wikidata.
  • getIDs(batch) — Takes the DOIs from CC.csv and sends them in batches to WDQS to get the corresponding QID and optional license. It outputs qid-doi.csv, which has the columns [qid, doi, wdLicenseQID].
  • collate(batch) — Joins CC.csv and qid-doi.csv and outputs alreadyLicensed.csv, which contains the information for the DOIs which have QIDs and license information in Wikidata, and qs.csv, which contains the QuickStatements, a P275 (copyright license) and P6216 (copyright status) for each DOI.
  • edit() — Uses pywikibot to perform the edits on Wikidata from qs.csv.

The batch flag is for testing and will add a user-specified suffix to all data files so that runs are not overwritten.
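
For illustration, the following is a minimal sketch of what the main() step roughly does, assuming the dump is a directory of gzipped JSON files each containing an "items" array (as in the Crossref public data file); the license-to-QID mapping shown is illustrative, whereas the real script derives the QIDs from Wikidata.

# Minimal sketch of main(): extract CC-licensed DOIs from the Crossref dump.
# Assumes the dump is a directory of *.json.gz files, each holding an "items" array.
import csv
import gzip
import json
from pathlib import Path

# Illustrative mapping from license URL to Wikidata QID; the real script derives
# these from Wikidata rather than hard-coding them.
LICENSE_QIDS = {
    "http://creativecommons.org/licenses/by/4.0/": "Q20007257",  # CC BY 4.0
}

def main(dump_dir="crossref-dump", out="CC.csv"):
    with open(out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["doi", "licenseURL", "licenseQID"])
        for path in Path(dump_dir).glob("*.json.gz"):
            with gzip.open(path, "rt") as dump_file:
                for record in json.load(dump_file)["items"]:
                    for license_info in record.get("license", []):
                        url = license_info["URL"]
                        if "creativecommons" in url:
                            writer.writerow([record["DOI"], url, LICENSE_QIDS.get(url, "")])
                            break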

In December 2022 I ran this code, found 300k items to annotate, and processed these via QuickStatements batches. After subsequent discussion I discovered that Wikidata normalises DOIs to uppercase, and so I had missed all DOIs which contain letters. I re-ran my code and found 1.45 million DOIs which could be annotated on Wikidata with licenses.
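
In other words, the normalisation is simply a matter of uppercasing the Crossref DOI before comparing it with the P356 value, for example:

# Wikidata stores DOIs (P356) in uppercase, so Crossref DOIs must be uppercased
# before matching:
doi = "10.7717/peerj.11233"   # as it appears in the Crossref dump
doi.upper()                   # '10.7717/PEERJ.11233', the form stored in Wikidata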

Coverage of Crossref in Wikidata

WikiCite relies on Wikidata as its source for collating sources, and as such it would be interesting to know how much of Crossref is currently covered by Wikidata.

Of the 134 million DOIs in the dump, we filtered the dataset based on the type of record to rule out items that are unlikely to be used as sources, such as journal issues and book series. We restricted the dataset to ["journal-article", "book-chapter", "proceedings-article", "dissertation", "book", "monograph", "dataset"], which left us with 112 million (112,013,354) DOIs.
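
As a sketch, the filter amounts to a membership check on the record type (the "type" field name is an assumption based on the Crossref metadata format; the actual filtering code may differ):

# Keep only record types that are likely to be cited as sources.
KEEP_TYPES = {
    "journal-article", "book-chapter", "proceedings-article",
    "dissertation", "book", "monograph", "dataset",
}

def is_relevant(record):
    return record.get("type") in KEEP_TYPES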

Using a sample size calculator, we find that we only need to sample 16.5k DOIs to estimate what percentage of this population can be found in Wikidata with 99% confidence and a 1% margin of error. However, as my script is quick enough, I instead checked 1% of the list, or 1.1 million DOIs.
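
For reference, this is the standard Cochran-style calculation that such sample size calculators typically perform; the exact formula behind the calculator I used is an assumption.

import math

def sample_size(population, z=2.576, margin=0.01, p=0.5):
    # z = 2.576 for 99% confidence; p = 0.5 is the most conservative proportion.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction (negligible for a population of 112 million).
    return math.ceil(n0 / (1 + (n0 - 1) / population))

sample_size(112_013_354)  # 16,587, roughly the 16.5k figure quoted above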

I created my sample using shuf -n 1120133 input.txt > output.txt, as this was recommended on Stack Overflow. I then used my ID-to-QID script, which processed all 1.1 million DOIs in 50 minutes.

The script works as follows. The list of DOIs is read into a Python list and split into pages of 125k items. We split into pages in case there are malformed items which would cause the entire process to fail. For each page, we iterate through the DOIs in 100-item chunks, which are inserted into a SPARQL query like so (only including 10 DOIs for readability):

SELECT ?item ?id ?license {
 VALUES ?id { '10.4067/S0717-95022018000401439' '10.1111/J.1471-0528.1982.TB05083.X'
'10.1016/S1297-9589(03)00146-2' '10.17116/ONKOLOG2020905145' '10.1177/1368430299024002' '10.1109/TNET.2008.2011734'
'10.32838/TNU-2663-5941/2020.6-2/03' '10.1039/C4RA01952K' '10.7717/PEERJ.11233' '10.1016/J.ACTAASTRO.2014.11.037' }
 ?item wdt:P356 ?id .
 optional {?item wdt:P275 ?license}
}
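
A condensed sketch of this loop, including the chunked VALUES clause and the final DataFrame, might look as follows, using SPARQLWrapper and pandas; the 125k-item paging and error handling of the real script are omitted, and the user agent string is a placeholder.

import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """SELECT ?item ?id ?license {{
 VALUES ?id {{ {dois} }}
 ?item wdt:P356 ?id .
 optional {{?item wdt:P275 ?license}}
}}"""

def lookup(dois, chunk_size=100):
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="doi-coverage sketch (placeholder user agent)")
    sparql.setReturnFormat(JSON)
    rows = []
    for i in range(0, len(dois), chunk_size):
        # Build the VALUES clause from one 100-item chunk, uppercasing the DOIs.
        values = " ".join(f"'{doi.upper()}'" for doi in dois[i:i + chunk_size])
        sparql.setQuery(QUERY.format(dois=values))
        for binding in sparql.query().convert()["results"]["bindings"]:
            rows.append({
                "qid": binding["item"]["value"].rsplit("/", 1)[-1],
                "doi": binding["id"]["value"],
                "license": binding["license"]["value"].rsplit("/", 1)[-1]
                           if "license" in binding else "",
            })
    return pd.DataFrame(rows)

# lookup(sample_dois).to_csv("qid-doi.csv", index=False)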

We store the results in a list before creating a pandas DataFrame and saving it as a CSV. The result of this process is several CSV files, which we can join together and process into a report. Our report is as follows:

  • Total number of DOIs: 112,013,354
  • Sample size: 1,120,133, or 1.00% of the total
  • 251,865 were found on Wikidata, or 22.49% of the sample
  • 3,657 have a license in Wikidata, or 1.45% of those found

From our random sample of the data, which is well above the size required for statistical significance, we find that 22% of the filtered Crossref data dump is on Wikidata.

Coverage of Wikidata in Crossref

To validate the above, we will next retrieve a list of DOIs from Wikidata (hopefully on the order of millions) and see how many of these are found in the Crossref dump.