Wikidata:WikiProject Scholia/Robustifying/Testing



Testing whether and how well Scholia performs is a key element of its development. Scholia is used thousands of times a day to visualize content that changes on time scales from minutes to years, so development of the tool has to take the typical trajectories of such changes into account. Currently, these trajectories often mean that Scholia functionality breaks when it would be most useful, i.e. for highly curated content. The Robustifying Scholia project aims to address such situations.

Pilot corpora are example collections of Wikidata content that model Scholia profiles for various use cases. Scholia builds its profiles from Wikidata content, so the more complete the Wikidata content behind a query, the higher the quality of the resulting profile; when the Wikidata content is incomplete, the profile is incomplete too. The pilot corpora below are complete enough to showcase as model examples to copy, to critique, or to use when demonstrating Scholia and especially its limits.

Backend testing comprises activities aimed at addressing the subset of Scholia's limits that are due to the infrastructure Scholia runs on, i.e. a Flask web app deployed on Wikimedia Toolforge that embeds JavaScript-enhanced iframes from the Wikidata Query Service, which contain the results of SPARQL queries triggered by the web app. In the context of the Robustifying Scholia project, we are reviewing all of these components and their interactions to explore room for optimization.
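
The embedding step described above can be sketched in a few lines of Python. This is a minimal illustration, not Scholia's actual code: the function names `wdqs_embed_url` and `iframe_html` are hypothetical, and the sketch assumes the embed page of the public Wikidata Query Service, which takes the percent-encoded SPARQL text in the URL fragment.

```python
from urllib.parse import quote

# Embed page of the public Wikidata Query Service; the SPARQL text travels
# percent-encoded in the URL fragment (the part after "#").
WDQS_EMBED_BASE = "https://query.wikidata.org/embed.html"


def wdqs_embed_url(sparql: str) -> str:
    """Return a URL that renders the results of `sparql` on the embed page."""
    # safe="" also encodes "/" so the whole query is one opaque fragment.
    return f"{WDQS_EMBED_BASE}#{quote(sparql, safe='')}"


def iframe_html(sparql: str, height: int = 400) -> str:
    """Hypothetical helper: wrap the embed URL in an iframe tag."""
    return f'<iframe src="{wdqs_embed_url(sparql)}" height="{height}"></iframe>'
```

Because the query is executed by the browser inside the iframe, a slow or failing query shows up as a broken panel rather than as an error in the web app itself, which is one reason the components need to be tested together.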

About

Background

The ideal pilot corpora are complete datasets with some links to other complete datasets. For example, to profile a topic it is not necessary to also profile the leading authors publishing on that topic, but for the purpose of showcasing Scholia, having linked profiles to click through to can make a better first impression on anyone learning about Scholia as a feature-rich tool. More complete profiles demonstrate the sort of content users may curate for useful applications, and can inspire others to contribute data to profile content of their own choice.

As is common in Wikimedia projects, hundreds of people continuously edit the data corpora that are the foundations of these profiles, and consequently the corpora are always changing. In the context of this crowdsourced engagement, individuals and the collective community document changes wiki-style, develop and publish guidelines for the general Wikidata environment, and experience personal and collective insight and cultural development toward best practices for curating this content. As of November 2019, the pace of Wikidata and Scholia development is too rapid to justify regular updates on best practices for curation. Anyone wishing to do curation should instead consider the examples below as showcased models where individuals and groups have worked toward completeness. For further details it is best to ask any experienced Wikidata contributor or Scholia project participant.

Usage

Pilot corpora only make sense in the context of particular profiling aspects. For example, a profile for a university may be from the perspective of papers from that university as an organization, or about that university as a topic. As different profiles present different content, the subject of a profile may be complete in one aspect but not another.
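
To make the notion of aspects concrete, the following Python sketch builds profile addresses for one item under two aspects. It is an illustration under stated assumptions: the base URL and the `/<aspect>/<QID>` path layout are assumed for a Scholia deployment, `profile_url` is a hypothetical helper, and `Q12345` is a placeholder identifier rather than a specific university.

```python
import re

# Assumption: base URL of a Scholia deployment; adjust to the instance in use.
SCHOLIA_BASE = "https://scholia.toolforge.org"

# Wikidata item identifiers are "Q" followed by a positive integer.
_QID = re.compile(r"Q[1-9][0-9]*")


def profile_url(aspect: str, qid: str, base: str = SCHOLIA_BASE) -> str:
    """Build the address of one aspect-specific profile for a Wikidata item."""
    if not _QID.fullmatch(qid):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return f"{base.rstrip('/')}/{aspect}/{qid}"


# The same item viewed under two aspects yields two different profiles.
for aspect in ("organization", "topic"):
    print(profile_url(aspect, "Q12345"))
```

A corpus listing therefore needs to record the aspect together with the item, since completeness is judged per aspect.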

In the context of the Robustifying Scholia project, the corpora are used to test the limits of Scholia, and to explore workarounds and alternatives.

Key resources

Testing goals for 2020

This project requires systematic testing of Scholia performance across possible use cases or usage scenarios. On that basis, we will assemble a corpus of examples that test Scholia’s technical limits that can help us optimize the infrastructure or inform technical design decisions. We will make such decisions around the types of visualizations available through Scholia and how they are cached or preserved, the ways in which the data to visualize gets into Scholia from Wikidata, the ways in which users can configure the experience (e.g. for comparisons), and the ways in which Scholia is integrated with WikiCite curation workflows, or hardware requirements.
While these technical test sets may be of limited use or interest outside the Scholia development team, the systematic testing of Scholia’s limits can also help identify circumstances where the tool works well, and in conjunction with usage information, we can then start to build pilot datasets like the Zika corpus or the Invasion biology corpus to serve as examples that engage different user communities. Having example sets to show off creates a model and workflow for others to emulate to open, expand, integrate or clean up the datasets which are relevant to them.
—Scholia team, Robustifying Scholia, 2019
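
One way to approach the systematic performance testing described in the proposal is a small timing harness. The Python sketch below is an illustration only and not part of the Scholia code base: `time_panels`, the report layout, and the 30-second budget are assumptions made for this example. Each panel is represented by a callable that runs its query, so the harness itself needs no network access.

```python
import time
from typing import Callable, Dict

# Assumption: a budget in seconds above which a panel counts as "at risk".
# Query services enforce timeouts of their own; choose a budget below
# whatever limit applies to the endpoint in use.
DEFAULT_BUDGET_S = 30.0


def time_panels(panels: Dict[str, Callable[[], object]],
                budget_s: float = DEFAULT_BUDGET_S) -> Dict[str, dict]:
    """Run each panel's query callable once and record how it fared."""
    report = {}
    for name, run_query in panels.items():
        start = time.perf_counter()
        try:
            run_query()
            error = None
        except Exception as exc:  # record failures instead of aborting the run
            error = repr(exc)
        elapsed = time.perf_counter() - start
        report[name] = {
            "seconds": elapsed,
            "error": error,
            "over_budget": elapsed > budget_s,
        }
    return report
```

Running such a harness over the same corpus at intervals would show how panel timings drift as the underlying Wikidata content grows.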

Milestones

Corpora curation

Every corpus that Scholia presents for profiling is a milestone in Scholia development. The stages of corpus development are ingestion into Wikidata, curation to refine quality, and designation as ready for use and public feedback.

A corpus is a collection of data which Scholia can visualize in a useful way. Examples are below. The Robustifying Scholia proposal explains that these are important test cases because if Scholia can profile a corpus, then profiles should also work for any similar general case. It is important to curate corpora for development, because profiles of corpora are a major attraction to users, and because the most common community crowdsourcing behavior in Scholia participation is the curation of corpora.

Corpora curation should be familiar to Wikidata and Wikimedia editors, as content curation is community behavior focused on a dataset. More specific to Scholia development and the goals for 2020 is the development of technical infrastructure that enables Scholia to profile corpora.

Technical developments relevant to Scholia
  1. Factoring out the SPARQL endpoint URL
    1. https://github.com/fnielsen/scholia/issues/809
  2. Wikibase testing
    1. https://github.com/fnielsen/scholia/issues/916
    2. OCLC Wikibase pilot “Project Passage”
      1. http://hangingtogether.org/?p=7385
      2. http://hangingtogether.org/?p=7398
      3. http://hangingtogether.org/?p=7433
      4. https://www.oclc.org/research/publications/2019/oclcresearch-creating-library-linked-data-with-wikibase-project-passage.html
      5. https://doi.org/10.25333/faq3-ax08
      6. Has a frontend called “Passage Explorer”
  3. Learning Wikibase
    1. http://learningwikibase.com/
  4. Professional Wikibase hosting?
    1. E.g. https://twitter.com/ProWikiExperts/status/1130334015637082112
  5. Running Wikibase and Semantic MediaWiki on the same wiki
    1. https://twitter.com/wikiworks/status/1176863528864616448
    2. https://twitter.com/datao/status/1176862020332871681
  6. Determination of the price of operating Scholia
    1. https://twitter.com/andrawaag/status/1187025994177302528
    2. https://twitter.com/ProWikiExperts/status/1130334015637082112
  7. Dockerizing Scholia
    1. A specialist will be hired to work on this in January 2020
    2. One pull request for a Docker file: https://github.com/fnielsen/scholia/pull/691/files
    3. Depends on the SPARQL endpoint being configurable, see https://github.com/fnielsen/scholia/issues/809
  8. SPARQL visualizer tool from Potsdam
    1. https://github.com/fnielsen/scholia/issues/754
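
Several of the items above (factoring out the SPARQL endpoint URL, Wikibase testing, Dockerizing) hinge on the endpoint being configurable. A minimal Python sketch of one way to do this reads the endpoint from the environment and falls back to the public Wikidata Query Service. The variable name `SCHOLIA_SPARQL_ENDPOINT` and the helper `sparql_endpoint` are hypothetical, chosen for this illustration; the linked issues may settle on a different mechanism.

```python
import os
from typing import Mapping, Optional

# Default: the SPARQL endpoint of the public Wikidata Query Service.
DEFAULT_ENDPOINT = "https://query.wikidata.org/sparql"

# Hypothetical variable name, chosen for this sketch only.
ENV_VAR = "SCHOLIA_SPARQL_ENDPOINT"


def sparql_endpoint(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured endpoint, or the default if none is set."""
    env = os.environ if environ is None else environ
    value = env.get(ENV_VAR, "").strip()
    # An unset or blank variable falls back to the default endpoint.
    return value or DEFAULT_ENDPOINT
```

With a setup like this, a container could be pointed at another Wikibase instance by setting the variable at start-up, without changes to the application code.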

Corpora

For the moment, corpora that indicate technical issues (testing corpora) and corpora for which Scholia works fine and thus allows community engagement (community corpora) are both listed here, since transitions between the two groups are common as bugs and errors arise and get addressed. We are exploring how best to distinguish between the two when it matters.

Gallery

Testing corpora

Organizations

The following query provides a list of organizations, sorted in descending order by the number of affiliated people known to Wikidata. For most of these 200 most curated institutions, several of the panels in Scholia's organization aspect have issues with display.

The query uses these:

  • Properties: employer (P108), member of (P463), affiliation (P1416), part of (P361), GRID ID (P2427)

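    # Count the distinct people linked to each institution that has a GRID ID.
    SELECT ?count ?institution ?institutionLabel
    WITH {
      SELECT (COUNT(DISTINCT ?researcher) AS ?count) ?institution WHERE {
        # Employer, member of, or affiliation, optionally followed by "part of" links.
        ?researcher ( wdt:P108 | wdt:P463 | wdt:P1416 ) / wdt:P361* ?institution .
        ?institution wdt:P2427 ?grid .
      }
      GROUP BY ?institution
    } AS %result
    WHERE {
      INCLUDE %result
      # Resolve labels, trying the listed languages in order.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,da,de,es,fr,ja,nl,no,ru,sv,zh" . }
    }
    ORDER BY DESC(?count)
    LIMIT 200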
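
For testing outside the browser, the results of such a query can be fetched and turned into a ranked list. The Python sketch below is a hedged illustration, not Scholia code: `fetch` and `rank_institutions` are hypothetical helpers, the User-Agent string is a placeholder, and the parsing assumes the standard SPARQL JSON results format with the variable names used in the query above.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# SPARQL endpoint of the public Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def fetch(sparql: str, endpoint: str = ENDPOINT) -> dict:
    """Send the query and return the decoded SPARQL JSON results."""
    request = Request(
        f"{endpoint}?{urlencode({'query': sparql})}",
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder; identify your own tool and contact here.
            "User-Agent": "scholia-testing-sketch/0.1 (example)",
        },
    )
    with urlopen(request, timeout=90) as response:
        return json.load(response)


def rank_institutions(results: dict) -> List[Tuple[int, str, str]]:
    """Turn result bindings into (count, QID, label) tuples, in result order."""
    rows = []
    for binding in results["results"]["bindings"]:
        uri = binding["institution"]["value"]
        qid = uri.rsplit("/", 1)[-1]  # entity URI ends in the QID
        # Fall back to the QID when the label service returned no label.
        label = binding.get("institutionLabel", {}).get("value", qid)
        rows.append((int(binding["count"]["value"]), qid, label))
    return rows
```

Calling `rank_institutions(fetch(query))` with the query text above would give the list of institutions whose organization profiles are candidates for testing.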
    

Universities

Africa
Asia
Europe
North America

Department or subgroup

Other

Topics

Sustainable development

In this section, the target corpora are bolded, whereas the other items are listed to provide some of the context of curation in this area.

Other topics

Individuals

For individuals, our testing mainly revolves around those whose Wikipedia articles link to Scholia profiles through a template, as well as candidates for having that template added to their English Wikipedia article. However, we might list a few individuals below who do not neatly fit these two groups but who may be of interest for another reason.

Physicians
Other scientists
Other academics

Publishing

Journals
Publishers

Locations

Countries

Events

Awards

Fields

Clinical trials

Chemistry

Genes

Metabolic pathways

Taxon

Lexemes

See also