Wikidata:WikiProject Scholia/Robustifying/Testing

Testing whether and how Scholia performs is a key element of its development. Scholia is used thousands of times a day to visualize content that changes on time scales from minutes to years, so development of the tool has to take the typical trajectories of such changes into account. Currently, such trajectories often mean that Scholia functionality breaks when it would be most useful, i.e. for highly curated content. The Robustifying Scholia project aims to address such situations.

Pilot corpora are example collections of Wikidata content that serve as model Scholia profiles for various use cases. Scholia profiles content from Wikidata, so the more complete the Wikidata content behind a query, the higher the quality of the resulting Scholia profile; incomplete content in Wikidata yields an incomplete profile. The pilot corpora below are complete enough to showcase as model examples to copy, to critique, or to use in demonstrating Scholia and especially its limits.

Backend testing comprises activities aimed at the subset of Scholia's limits that are due to the infrastructure that Scholia runs on, i.e. a Flask web app deployed on Wikimedia Toolforge that embeds JavaScript-enhanced iframes from the Wikidata Query Service; these iframes contain the results of SPARQL queries triggered by the web app. In the context of the Robustifying Scholia project, we are reviewing all of these components and their interactions to explore room for optimization.
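The iframe-based architecture described above can be illustrated with a minimal Python sketch. It assumes the public Wikidata Query Service embed page, which takes the URL-encoded SPARQL query as the URL fragment; the helper names and the iframe markup are hypothetical and are not taken from the Scholia code base.

```python
from urllib.parse import quote

# Assumption: the public Wikidata Query Service embed page, which reads the
# URL-encoded SPARQL query from the URL fragment.
WDQS_EMBED_BASE = "https://query.wikidata.org/embed.html#"


def build_embed_url(sparql: str) -> str:
    """Return a URL that renders the results of `sparql` in an embeddable page."""
    # safe="" makes sure characters such as "/", "?" and "{" are percent-encoded.
    return WDQS_EMBED_BASE + quote(sparql, safe="")


def build_iframe(sparql: str, height: int = 400) -> str:
    """Return an HTML iframe snippet (hypothetical markup, for illustration only)."""
    return f'<iframe src="{build_embed_url(sparql)}" height="{height}"></iframe>'


# Works whose main subject (P921) is Zika virus (Q202864).
query = "SELECT ?work WHERE { ?work wdt:P921 wd:Q202864 } LIMIT 10"
print(build_embed_url(query))
```

Each panel on a profile page corresponds to one such iframe, so a single page view triggers several queries against the query service at roughly the same time.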

About

Background

The ideal pilot corpora are complete datasets with some networking to other complete datasets. For example, to profile a topic it is not necessary to also profile the leading authors publishing on that topic, but for the purpose of showcasing Scholia, having various links to click through can make a better first impression on anyone learning about Scholia as a feature-rich tool. More complete profiles demonstrate the sort of content users may curate for useful applications, and can inspire others to contribute data to profile content of their own choice.

As is common in Wikimedia projects, hundreds of people continuously edit the data corpora which are the foundations of these profiles, and consequently they are always changing. In the context of this crowdsourced engagement, individuals and the collective community do wiki-style documentation of changes, develop and publish guidelines for the general Wikidata environment, and experience personal and collective insight and cultural development toward best practices for curating this content. As of November 2019, the pace of Wikidata and Scholia development is too rapid to justify regular updates on best practices for curation. Anyone wishing to do curation should instead consider the examples below as showcased models where individuals and groups have worked toward completeness. For further details, it is best to ask any experienced Wikidata contributor or Scholia project participant.

Usage

Pilot corpora only make sense in the context of particular profiling aspects. For example, a profile for a university may be from the perspective of papers from that university as an organization, or about that university as a topic. As different profiles present different content, the subject of a profile may be complete in one aspect but not another.
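To make the notion of aspects concrete, the following sketch builds profile URLs for the same Wikidata item under two aspects. It assumes a Scholia URL scheme of the form base/aspect/QID on the public Toolforge deployment; the base URL and the helper name are assumptions for illustration, and the deployment address may differ.

```python
# Assumption: Scholia profile URLs follow the pattern <base>/<aspect>/<QID>.
SCHOLIA_BASE = "https://scholia.toolforge.org"


def profile_url(aspect: str, qid: str) -> str:
    """Return the profile URL for a Wikidata item under the given aspect."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return f"{SCHOLIA_BASE}/{aspect}/{qid}"


# The same university, profiled as an organization and as a topic.
for aspect in ("organization", "topic"):
    print(profile_url(aspect, "Q1062129"))
```

Because each aspect runs a different set of queries, a corpus that renders well under one of these URLs may still be incomplete or fail under the other.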

In the context of the Robustifying Scholia project, the corpora are used to test the limits of Scholia, and to explore workarounds and alternatives.

Key resources

Scholia corpora tickets in GitHub
Wikidata:WikiProject Zika Corpus. The literature around Zika virus (Q202864) serves as the primary testing ground for Scholia, but it is too small to
- reach audiences interested in unrelated topics
- hit some of the limits that Scholia experiences for some larger corpora

Testing goals for 2020

This project requires systematic testing of Scholia performance across possible use cases or usage scenarios. On that basis, we will assemble a corpus of examples that test Scholia’s technical limits that can help us optimize the infrastructure or inform technical design decisions. We will make such decisions around the types of visualizations available through Scholia and how they are cached or preserved, the ways in which the data to visualize gets into Scholia from Wikidata, the ways in which users can configure the experience (e.g. for comparisons), and the ways in which Scholia is integrated with WikiCite curation workflows, or hardware requirements.
While these technical test sets may be of limited use or interest outside the Scholia development team, the systematic testing of Scholia’s limits can also help identify circumstances where the tool works well, and in conjunction with usage information, we can then start to build pilot datasets like the Zika corpus or the Invasion biology corpus to serve as examples that engage different user communities. Having example sets to show off creates a model and workflow for others to emulate to open, expand, integrate or clean up the datasets which are relevant to them.
—Scholia team, Robustifying Scholia, 2019

Milestones

Corpora curation

Every corpus which Scholia presents for profiling is a milestone in Scholia development. The stages of corpus development are ingestion into Wikidata, curation to refine quality, and designation as ready for use and public feedback.

A corpus is a collection of data which Scholia can visualize in a useful way. Examples are below. The Robustifying Scholia proposal explains that these are important test cases because if Scholia can profile a corpus, then profiles should also work for any similar general case. It is important to curate corpora for development, because profiles of corpora are a major attraction to users, and because the most common community crowdsourcing behavior in Scholia participation is the curation of corpora.

Corpora curation should be familiar to Wikidata and Wikimedia editors, as content curation is community behavior focused on a dataset. More specific to Scholia development and the goals for 2020 is the development of technical infrastructure which enables Scholia to profile corpora.

Technical developments relevant to Scholia

Factoring out the SPARQL end point URL
1. https://github.com/fnielsen/scholia/issues/809
Wikibase testing
1. https://github.com/fnielsen/scholia/issues/916
2. OCLC Wikibase pilot “Project Passage”
  1. http://hangingtogether.org/?p=7385
  2. http://hangingtogether.org/?p=7398
  3. http://hangingtogether.org/?p=7433
  4. https://www.oclc.org/research/publications/2019/oclcresearch-creating-library-linked-data-with-wikibase-project-passage.html
  5. https://doi.org/10.25333/faq3-ax08
  6. Has a frontend called “Passage Explorer”
Learning Wikibase
1. http://learningwikibase.com/
Professional Wikibase hosting?
1. E.g. https://twitter.com/ProWikiExperts/status/1130334015637082112
Running Wikibase and Semantic MediaWiki on the same wiki
1. https://twitter.com/wikiworks/status/1176863528864616448
2. https://twitter.com/datao/status/1176862020332871681
Determination of the price of operating Scholia
1. https://twitter.com/andrawaag/status/1187025994177302528
2. https://twitter.com/ProWikiExperts/status/1130334015637082112
Dockerizing Scholia
1. A specialist will be hired to work on this in January 2020
2. One pull request for a Docker file: https://github.com/fnielsen/scholia/pull/691/files
3. Depends on the SPARQL endpoint being configurable, see https://github.com/fnielsen/scholia/issues/809
SPARQL visualizer tool from Potsdam
1. https://github.com/fnielsen/scholia/issues/754
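
Several of the items above (factoring out the SPARQL endpoint URL, Wikibase testing, Dockerizing Scholia) hinge on the endpoint being configurable. One common way to do this is to read the endpoint from an environment variable, as in the minimal sketch below. The variable name SCHOLIA_SPARQL_ENDPOINT and the helper function are hypothetical and are not taken from the Scholia code base; the default is the public Wikidata Query Service endpoint.

```python
import os

# Default: the public Wikidata Query Service SPARQL endpoint.
DEFAULT_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Hypothetical variable name, chosen for this illustration only.
ENDPOINT_ENV_VAR = "SCHOLIA_SPARQL_ENDPOINT"


def get_sparql_endpoint(environ=None) -> str:
    """Return the configured SPARQL endpoint, falling back to the default."""
    env = os.environ if environ is None else environ
    endpoint = env.get(ENDPOINT_ENV_VAR, "").strip()
    return endpoint or DEFAULT_SPARQL_ENDPOINT


print(get_sparql_endpoint())
```

With a setup along these lines, a container or a test run against another Wikibase instance would only need to set the variable, without code changes.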

Corpora

For the moment, corpora that indicate technical issues (testing corpora) and corpora for which Scholia works fine and thus allows community engagement (community corpora) are both listed here, since transitions between the two groups are common as bugs and errors arise and get addressed. We are exploring how best to distinguish between the two when it matters.

Gallery

Testing corpora

If queries take longer than a minute, they time out. This is frequent for complex queries.
For comparisons, the likelihood of time-outs or other errors increases with the number of items to be compared (in this case, seven authors named "Li Li").
A rate-limiting error occurs when too many requests are sent to the Wikidata Query Service from the same IP address within a short period. As Scholia profiles include multiple panels that are all requested at roughly the same time, the error is rather frequent. A simple remedy is to reload the page after a minute or so.
Publications with hundreds of authors can cause problems with network visualizations like the co-author graph.
Well-curated topics can cause the browser to stall.
Sometimes, the iframe embedded from the Wikidata Query Service does not result in any visual display.
Page number data is not readily available in a structured format, which is why this panel on an author's profile is often empty even if they have published lots of works during the indicated period.
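
The "reload after a while" remedy for time-outs and rate limiting can be automated on the client side with retries and exponential backoff. The following is a minimal sketch of that generic technique, not a description of how Scholia handles these errors; the RateLimited exception, the function name, and the default delays are hypothetical.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal that the query service asked the client to slow down."""


def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` and retry on rate limiting or time-outs.

    The wait doubles after every failed attempt (base_delay, 2 * base_delay, ...).
    The last failure is re-raised so the caller can show an error panel.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except (RateLimited, TimeoutError):
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage: wrap whatever function performs the actual request, for example
# fetch_with_retry(lambda: run_query(sparql)), where run_query is a
# hypothetical function that raises RateLimited or TimeoutError on failure.
```

Staggering the panel requests in this way trades a slower page load for fewer empty panels.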

Organizations

The following query (source) provides a list of organizations, sorted (in descending order) by number of affiliated people known to Wikidata. For most of these 200 most curated institutions, several of the panels in Scholia's organization aspect have issues with display.

The following query uses these properties: employer (P108), member of (P463), affiliation (P1416), part of (P361), GRID ID (P2427).

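# Count affiliated researchers per institution, restricted to institutions with a GRID ID
SELECT ?count ?institution ?institutionLabel
WITH {
  SELECT (COUNT(DISTINCT ?researcher) AS ?count) ?institution WHERE {
    # employer, member of, or affiliation, followed by any number of "part of" steps
    ?researcher ( wdt:P108 | wdt:P463 | wdt:P1416 ) / wdt:P361* ?institution .
    ?institution wdt:P2427 ?grid .
  }
  GROUP BY ?institution
} AS %result
WHERE {
  INCLUDE %result
  # Label fallback languages ("ep" and "jp" in the original are not valid codes; "es" and "ja" assumed)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,ja,nl,no,ru,sv,zh" . }
}
ORDER BY DESC(?count)
LIMIT 200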

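For testing purposes, the results of the query above can be processed outside the browser. The sketch below assumes the response has been fetched in the standard SPARQL 1.1 JSON results format and uses the variable names from the query (?count, ?institution, ?institutionLabel); the function name and the output layout are hypothetical.

```python
def parse_institution_counts(result: dict) -> list:
    """Turn a SPARQL JSON result for the query above into a list of plain dicts.

    Each binding maps a variable name to an object with a "value" key, as
    defined by the SPARQL 1.1 Query Results JSON Format.
    """
    rows = []
    for binding in result["results"]["bindings"]:
        entity_uri = binding["institution"]["value"]
        rows.append(
            {
                # The QID is the last path segment of the entity URI.
                "qid": entity_uri.rsplit("/", 1)[-1],
                # The label may be missing if no fallback language matches.
                "label": binding.get("institutionLabel", {}).get("value", ""),
                "count": int(binding["count"]["value"]),
            }
        )
    return rows
```

The resulting list of QIDs could then be fed into a script that requests the organization aspect for each institution and records which panels fail.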

Universities

Africa

Asia

Tohoku University (Q1062129)

Europe

North America

Department or subgroup

Other

Topics

Sustainable development

In this section, the target corpora are bolded, whereas the other items are listed to provide some of the context of curation in this area.

Sustainable Development Goals (Q7649586)

Individuals

For individuals, our testing mainly revolves around those that have Wikipedia articles with Scholia profiles as well as candidates for having that template added to their English Wikipedia article. However, we might list a few individuals below who do not neatly fit these two groups but who may be of interest for another reason.

Physicians

Other scientists

Jo Dunkley (Q28757988)
Karine Breckpot (Q38326061)
Almaz A. Aldashev (Q25571999)
Paolo Morettini (Q76757065)
- high-energy physicist, so loads of papers with loads of co-authors, which brings Scholia to its limits
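
A rough calculation shows why papers with very large author lists strain network visualizations such as the co-author graph: if every pair of authors on a paper is connected, the number of edges grows quadratically with the number of authors. The sketch below only does this arithmetic; the function name is hypothetical and the model ignores any deduplication or filtering that an actual query might apply.

```python
from math import comb


def co_author_pairs(n_authors: int) -> int:
    """Number of distinct author pairs on a single paper with n_authors authors."""
    if n_authors < 0:
        raise ValueError("n_authors must be non-negative")
    return comb(n_authors, 2)  # n * (n - 1) / 2


for n in (5, 50, 500):
    print(n, co_author_pairs(n))
# 5 10
# 50 1225
# 500 124750
```

A single paper with 500 authors thus contributes more than a hundred thousand potential edges before any other work by the same authors is considered.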

Other academics

Publishing

Journals

Publishers

Locations

Copenhagen (Q1748)

Countries

Denmark
Netherlands
- via Wikidata:Wiki-wetenschappers
India
Tanzania
Uganda
Estonia
- via Estonian Research Portal person ID (P2953)
