Wikidata:ScienceSource project/WikiCite backlogs


WikiCite, which began life as a conference series, may mean different things to different people. But it certainly includes the construction here on Wikidata of a large scientific bibliography, which can function as a back end for further projects.

One clear motivation is to be able to run queries that are more complex than those the current repositories support. This is where ScienceSource came in, with the idea that the MEDRS criterion on Wikipedia for acceptable referencing of health information could in principle be applied through a SPARQL query on Wikidata.

For a practical implementation, one comes across the "backlogs": issues that need to be resolved before one can just press a button and get answers on MEDRS, or on any other such ambitious goal (for example, the bibliographical work behind a systematic review). At the very least, the information one is hoping to work with needs to have been imported into Wikidata, and to be accurate. So there are "missing information" issues, in particular. There is a constant need for updates.

A simple form of MEDRS looks only at publication type and date of publication. It could be run just with instance of (P31) and publication date (P577) statements. The latter is typically added at the creation of an item about an article, but there is now some ambiguity between e-publication date and hard copy publication date. This is a simple example where ambiguity can create rather fuzzy statements.
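As a rough illustration of such a two-statement check, the sketch below restricts the ScienceSource focus list to items typed as review articles and published within the last five calendar years. The five-year window and the choice of wd:Q7318358 as the item for "review article" are assumptions made for this example and should be checked before use; the focus list item wd:Q55439927 is the one used in the queries on this page.

```sparql
# Sketch only: a simplified MEDRS-style filter on publication type and date
SELECT ?item ?itemLabel ?date
WHERE {
  ?item wdt:P5008 wd:Q55439927;   # on the ScienceSource focus list
        wdt:P31 wd:Q7318358;      # assumed QID for "review article"
        wdt:P577 ?date.           # publication date (e-publication or hard copy, as recorded)
  # Assumed window: keep items from the last five calendar years
  FILTER(YEAR(NOW()) - YEAR(?date) <= 5)

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?date)
```

An item carrying both an e-publication date and a hard copy date will appear once per date, which is one way the ambiguity mentioned above shows up in results.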

To get further with MEDRS, there is a need to be lenient on publication dates for topics that count as "neglected diseases" or "rare diseases". So first you need a reliable source of topical information, and definitions of "neglected" or "rare". Then a definition of "predatory journal" is needed, to exclude articles in journals that have a poor reputation and cannot be relied upon for peer review.

These concepts should in principle be computable, from some data, but this is an area in which there are multiple definitions to choose from. The relevant data would be very interesting, but what is here is certainly deficient. So it is natural to resort to ad hoc whitelists and blacklists, which can only be provisional.
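A provisional blacklist of this kind could be applied directly in a query. The sketch below is only an illustration: it assumes the journal is recorded with published in (P1433), and the two listed items are the Wikidata sandbox items, used here as placeholders rather than as real journals.

```sparql
# Sketch only: exclude focus list articles whose journal is on an ad hoc blacklist
SELECT ?item ?itemLabel ?journal ?journalLabel
WHERE {
  ?item wdt:P5008 wd:Q55439927;   # on the ScienceSource focus list
        wdt:P1433 ?journal.       # published in
  # Placeholder blacklist: sandbox items standing in for journal QIDs
  FILTER(?journal NOT IN (wd:Q4115189, wd:Q13406268))

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

A whitelist works the same way with IN in place of NOT IN. Items with no published in (P1433) statement drop out of both versions, which is itself a "missing information" issue.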

At this point in the discussion, it is probably better to turn back and look again at what "WikiCite backlogs" should be addressed first, as foundational work. A MEDRS query would just be one application.

Queries

Missing information

#MeSH descriptor statements, for D-numbers, without MeSH tree code
SELECT ?item ?itemLabel
    WHERE {?item wdt:P486 ?meshid.
           FILTER STRSTARTS(?meshid, 'D')
           MINUS {?item wdt:P672 [ ]}
           
           SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
          }
#ScienceSource focus list items without CC license statement
SELECT ?item ?itemLabel ?date
     WHERE {?item wdt:P5008 wd:Q55439927;
                  wdt:P577 ?date .
            MINUS {?item wdt:P275 [ ]}
            
            SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
            
            }
ORDER BY ASC(?date)
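To gauge the size of this backlog rather than list every item, a variant of the license query can count the affected items by publication year. This is a sketch built from the same properties as the query above; the variable names ?year and ?count are chosen for illustration.

```sparql
# Sketch only: focus list items without a license statement, counted by publication year
SELECT ?year (COUNT(DISTINCT ?item) AS ?count)
WHERE {
  ?item wdt:P5008 wd:Q55439927;   # on the ScienceSource focus list
        wdt:P577 ?date.           # publication date
  MINUS { ?item wdt:P275 [] }     # no license statement
  BIND(YEAR(?date) AS ?year)
}
GROUP BY ?year
ORDER BY ASC(?year)
```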

Inaccurate information

#Sample type from Followups page
SELECT ?item 
 WHERE {?item wdt:P921 wd:Q58779680}

See Wikidata:ScienceSource project/Followups for many more. Typically they come from incorrect disambiguation by the SourceMD tool.

#Sample query for broader terms to check
SELECT ?item ?title ?date
  WHERE {?item wdt:P5008 wd:Q55439927;
               wdt:P921 wd:Q808;         
               wdt:P577 ?date;
               wdt:P1476 ?title.
         
         SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
        }
    
ORDER BY ASC(?date)

See Wikidata:ScienceSource project/NCBI2wikidata rsplus1. Typically these will be taken from runs carried out on NCBI2wikidata output, processed with a script.

Completed: 2022/11/07