User:Charles Matthews/ContentMine workshop 15 December 2018
Programme for [ContentMine workshop] 15 December 2018, Makespace classroom, 16 Mill Lane, Cambridge; see m:Meetup/Cambridge/39.
Introduction
Introductory videos:
Number | Video link | Topic |
---|---|---|
1 | Introduction to the project | |
2 | ScienceSource focus list | |
3 | What's a neglected disease? | |
4 | ScienceSource annotations | |
Control content
- "In the beginning was the dictionary"
ContentMine dictionaries are
- lists of search terms like "leukemia",
- paired with Wikidata items such as leukemia (Q29496), and
- usually extracted from Wikidata by a query.
Example: http://sciencesource.wmflabs.org/wiki/File:Cereal_dictionary2.png
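As a rough illustration of the idea, a dictionary can be pictured as a mapping from search terms to Wikidata Q-numbers. The Python sketch below is not the ContentMine implementation: the spelling variant "leukaemia" and the helper name `find_terms` are assumptions added only to show several terms pointing at one item.

```python
import re

# Search terms paired with Wikidata items. "leukemia" -> Q29496 comes from
# the text above; the spelling variant is an assumed extra entry.
DICTIONARY = {
    "leukemia": "Q29496",
    "leukaemia": "Q29496",
}


def find_terms(text, dictionary):
    """Return (term, Q-number, character offset) for each whole-word hit."""
    hits = []
    for term, qid in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((term, qid, match.start()))
    # Report hits in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[2])


sample = "Leukemia (also spelled leukaemia) is a group of blood cancers."
for term, qid, offset in find_terms(sample, DICTIONARY):
    print(term, qid, offset)
```

Each hit records where the term was found, which is the kind of information an annotation needs in order to point back into the text.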
- Wikidata
Wikidata is Wikimedia's multilingual knowledge base, part of the Semantic Web family of machine-readable sites. It is also illustrated with several million images.
Our interests are mainly in drugs, diseases and scientific papers, which are not the best topics to show off the illustrations. But here is something in the drug field:
#ImageGrid for compounds "-sterone".
#defaultView:ImageGrid
SELECT DISTINCT ?item ?itemLabel ?pic
WHERE
{
  ?item wdt:P31 wd:Q11173;
        rdfs:label ?itemLabel.
  FILTER (lang(?itemLabel) = "en")
  FILTER regex (?itemLabel, "(sterone)$")
  OPTIONAL { ?item wdt:P18 ?pic }
}
So what is that?
- SPARQL
The query language common to Wikidata and other Semantic Web sites.
First activity: run this query.
#ImageGrid for taxon
#defaultView:ImageGrid
SELECT DISTINCT ?item ?itemLabel ?pic
WHERE
{
  ?item wdt:P31 wd:Q16521;
        wdt:P171* wd:Q21860.
  OPTIONAL { ?item wdt:P18 ?pic }
}
Then replace Q21860 with the number on your card, and run it again.
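The substitution step can also be scripted. The following Python sketch swaps the Q-number in the taxon query and builds a link that opens the result in the Wikidata Query Service. The function names are hypothetical, `Q12345` is an arbitrary placeholder rather than a real card number, and the link format (URL-encoded query after `#` on query.wikidata.org) is assumed to be the one the service accepts.

```python
import re
from urllib.parse import quote

# The taxon query from the first activity, with the original Q-number.
TAXON_QUERY = """#defaultView:ImageGrid
SELECT DISTINCT ?item ?itemLabel ?pic
WHERE
{
  ?item wdt:P31 wd:Q16521;
        wdt:P171* wd:Q21860.
  OPTIONAL { ?item wdt:P18 ?pic }
}"""


def query_for_card(card_qid):
    """Return the taxon query with Q21860 replaced by the card's Q-number."""
    if not re.fullmatch(r"Q[1-9][0-9]*", card_qid):
        raise ValueError("expected a Q-number such as Q21860")
    return TAXON_QUERY.replace("Q21860", card_qid)


def query_service_link(query):
    """Build a link that loads the query in the Wikidata Query Service."""
    return "https://query.wikidata.org/#" + quote(query, safe="")


# Placeholder Q-number standing in for the number on a card.
print(query_service_link(query_for_card("Q12345")))
```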
Content close-ups
- Metadata
Roughly speaking, data that can be used for cataloguing purposes.
- Annotations
Layers of commentary about a given text, made up of comments pointing to places in it, directly or indirectly.
- Co-occurrence
Two search terms found in the same text. To see examples, start from the list of sample article texts, which is ordered alphabetically. Given the Q-number, navigate to the item by entering Item:Q... in the browser address bar. Use "What links here" to find anchor points. From the anchor point statements, find the annotations. Then locate the terms in the actual text.
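To make the definition of co-occurrence concrete, here is a minimal Python sketch that reports which pairs of search terms appear in the same text. It is an illustration only, not how ScienceSource stores co-occurrences: the two sample sentences and the function name `cooccurrences` are invented for the example, and matching is by simple case-insensitive substring.

```python
from itertools import combinations


def cooccurrences(texts, terms):
    """Map each pair of terms to the names of the texts containing both."""
    pairs = {}
    for name, body in texts.items():
        lowered = body.lower()
        # Terms present in this text, sorted so each pair has a fixed order.
        present = sorted(term for term in terms if term.lower() in lowered)
        for pair in combinations(present, 2):
            pairs.setdefault(pair, []).append(name)
    return pairs


# Invented sample texts, for illustration only.
texts = {
    "text1": "Imatinib is used to treat chronic myeloid leukemia.",
    "text2": "Leukemia is a cancer of the blood.",
}
terms = ["leukemia", "imatinib"]

for pair, names in cooccurrences(texts, terms).items():
    print(pair, names)
```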
Assess content
The ScienceSource project wiki at http://sciencesource.wmflabs.org/ will apply SPARQL to explore uploaded papers, in a number of ways. User:Charles Matthews/ScienceSource queries is a warehouse of queries, in a readable form.
- Histograms
Infographics that represent where search term annotations lie, when you divide up a text into a number of parts.
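A histogram of this kind can be sketched in a few lines of Python. The example assumes each annotation is known by its character offset in the text. The offsets, the text length and the function name `annotation_histogram` are made up for illustration and do not come from ScienceSource data.

```python
def annotation_histogram(offsets, text_length, parts):
    """Count annotations falling in each of `parts` equal slices of a text."""
    if parts < 1 or text_length < 1:
        raise ValueError("parts and text_length must be positive")
    counts = [0] * parts
    for offset in offsets:
        if not 0 <= offset < text_length:
            raise ValueError("offset outside the text")
        # Integer arithmetic maps an offset to the index of its slice.
        counts[offset * parts // text_length] += 1
    return counts


# Invented offsets in a text of 1000 characters, divided into 5 parts.
counts = annotation_histogram([10, 120, 130, 480, 990], 1000, 5)
for index, count in enumerate(counts, start=1):
    print(f"part {index}: " + "#" * count)
```

Hits clustered in the first part of a text tell a different story from hits spread evenly throughout, which is the reason for dividing the text up.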
- TF-IDF
Formula used to rank search term annotations in a corpus of texts.
A table like this, but on a larger scale, can record how many hits each dictionary term has in each text of a corpus.
text | term1 | term2 | term3 | term4 | term5 |
---|---|---|---|---|---|
text1 | 3 | 0 | 0 | 4 | 1 |
text2 | 5 | 1 | 2 | 0 | 2 |
text3 | 0 | 0 | 0 | 2 | 0 |
The idea is to scale each row by a factor (TF) that takes into account the length of the text, and each column by a factor (IDF) that varies inversely with the number of non-zero entries in that column. The scaled entries are then ranked by "interestingness".
With some caveats, SPARQL can carry out this ranking within ScienceSource, on batches of articles.
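The scaling can be tried out on the small table above. This Python sketch uses one common variant of the formula and is not necessarily the one used in ScienceSource. Two assumptions are made: since the table gives no text lengths, the total number of hits in a row stands in for the length of the text, and IDF is taken as the logarithm of the number of texts divided by the number of non-zero entries in the column.

```python
import math

TERMS = ["term1", "term2", "term3", "term4", "term5"]

# Hit counts copied from the table above.
COUNTS = {
    "text1": [3, 0, 0, 4, 1],
    "text2": [5, 1, 2, 0, 2],
    "text3": [0, 0, 0, 2, 0],
}


def tf_idf(counts, terms):
    """Return {(text, term): score} for every non-zero entry of the table."""
    n_texts = len(counts)
    # Number of texts in which each term occurs (non-zero column entries).
    doc_freq = [
        sum(1 for row in counts.values() if row[j] > 0)
        for j in range(len(terms))
    ]
    scores = {}
    for text, row in counts.items():
        total = sum(row)  # assumed stand-in for the length of the text
        for j, hits in enumerate(row):
            if hits == 0:
                continue
            tf = hits / total
            idf = math.log(n_texts / doc_freq[j])
            scores[(text, terms[j])] = tf * idf
    return scores


ranked = sorted(tf_idf(COUNTS, TERMS).items(), key=lambda kv: kv[1], reverse=True)
for (text, term), score in ranked:
    print(f"{text} {term} {score:.3f}")
```

With these assumptions the top-ranked entry is term4 in text3: the text is short and all of its hits are for that one term, so the row factor outweighs the fact that term4 also appears in text1.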
- MEDRS
Guideline used on Wikipedia about "medical reliable sources", applied to decide which citations are acceptable for health information.
A subproblem is to exclude papers from "predatory publishers". Third activity: hunt the predator!
#Journals without publishers on ScienceSource focus list
SELECT DISTINCT ?journal ?journalLabel
WHERE
{
  ?item wdt:P5008 wd:Q55439927;
        wdt:P1433 ?journal.
  MINUS { ?journal wdt:P123 ?publisher }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Auxiliary query: replace Q5506062 with the Q-number of the journal.
#Query to check for articles published in a given journal.
SELECT ?article ?articleLabel
WHERE
{
  ?article wdt:P1433 wd:Q5506062.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}