Wikidata:Tools/Wikidata Topic Curator


Wikidata Topic Curator is a rewrite of ItemSubjector as a web app that helps Wikimedians add relevant topics to items.

Features

Given a topic QID, the tool fetches articles that match the label, an alias, or a custom user-provided term for that QID and that are currently missing the main subject property.

  • Multi-term support
  • Populating terms from label and aliases
  • User defined terms
  • Excluding items with a certain term (via CirrusSearch affix, see below)
  • Batch upload by sending the matches to QuickStatements
  • Support for multiple languages
  • Support for any subgraph
  • Nudging users to match subtopics before parent topics (e.g. "domestic violence" before "violence"), and excluding matches that already have a subtopic of the current topic when matching the parent topic. In this example, an article already matched to "domestic violence" is excluded when the tool is used to match articles to a parent topic such as "violence".
  • Three predefined subgraphs that can be selected:
      • Scientific articles
      • Journals
      • Riksdagen Documents
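
The matching described above can be pictured as a CirrusSearch query per term. The following is a minimal sketch, not the tool's actual code: the function name, the way the parts are combined, and the filter used for the "Scientific articles" subgraph are assumptions made for illustration. The `inlabel` and `haswbstatement` keywords are standard search keywords on Wikidata, and P921 is the main subject property.

```python
def build_search_query(term, subgraph_filter, affix=""):
    """Build a search string for items whose label contains `term`
    and that have no main subject (P921) statement yet."""
    parts = [
        f'inlabel:"{term}"',      # label must contain the term
        subgraph_filter,          # restrict the search to one subgraph
        "-haswbstatement:P921",   # skip items that already have a main subject
    ]
    if affix:
        parts.append(affix)       # optional user-supplied affix
    return " ".join(parts)


# Assumed filter for scientific articles:
# instance of (P31) scholarly article (Q13442814).
SCHOLARLY_ARTICLES = "haswbstatement:P31=Q13442814"

terms = ["parental alienation"]
queries = [
    build_search_query(term, SCHOLARLY_ARTICLES, "-inlabel:syndrome")
    for term in terms
]
print(queries[0])
```

With multi-term support, one such query would be built for each selected label, alias, or user-defined term.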

Suggested workflow

  • find a topic of interest
  • send it to the tool using the user script (see below)
  • choose the terms you want to include (recommended: only those longer than ~5 chars to avoid false positives)
  • inspect the titles of the matched articles to make sure you have a sufficiently narrow topic
  • inspect a few of the matches more closely (go to the item/full resource) to make sure they make sense
  • check all the items you want to match
  • log in to QuickStatements in a new tab
  • submit to QuickStatements
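
For orientation, the batch sent in the last step can be thought of as one QuickStatements command per checked item. This is a minimal sketch that assumes the tab-separated QuickStatements V1 command format (item, property, value); the function name and the QIDs are hypothetical, and the exact payload the tool submits may differ.

```python
def to_quickstatements(item_qids, topic_qid):
    """Return V1 commands that add main subject (P921) = topic to each item."""
    # One command per line: item <TAB> property <TAB> value
    return "\n".join(f"{qid}\tP921\t{topic_qid}" for qid in item_qids)


# Hypothetical QIDs, for illustration only.
batch = to_quickstatements(["Q100000001", "Q100000002"], "Q200000001")
print(batch)
```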

Excluding terms

Sometimes you want to exclude certain words from the labels of items. E.g. when matching on "parental alienation" you don't want "parental alienation syndrome" to appear. That can easily be achieved by entering "-inlabel:syndrome" as the affix or by adding &affix=-inlabel:syndrome to the URL.
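
If you build such URLs in a script, the affix can be appended with the standard library so that it is percent-encoded correctly. This is a small sketch: only the `affix` parameter comes from this page, while the base URL and the `qid` parameter in the example are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_affix(url, affix):
    """Return `url` with an extra affix=... query parameter appended."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("affix", affix))
    # urlencode percent-encodes the ":" in "-inlabel:syndrome"
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL; substitute the tool's real address and parameters.
url = add_affix("https://example.org/?qid=Q1", "-inlabel:syndrome")
print(url)
```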

User script

Consider using User:So9q/wikidata-topic-curator-link.js to send items to the start page conveniently using the link in the Toolbox. The tool will automatically fetch and populate the terms based on label and aliases of the item upon load.

Synia support

Synia topics now include links to this tool.

Deployment details

The tool is served via Gunicorn with 20 workers and a 45-second timeout from an OVH datacenter in Germany.

If a worker times out (e.g. because there are many subtopics), you get a 502 error. Please report URLs that return 502 in this issue.
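
The settings described above could be expressed in a Gunicorn configuration file roughly as follows. This is a sketch, not the tool's actual configuration: the file is hypothetical and the bind address is an assumption. `workers` and `timeout` are standard Gunicorn settings.

```python
# gunicorn.conf.py (hypothetical; the tool's real configuration is not shown here)
workers = 20             # number of worker processes
timeout = 45             # seconds a worker may stay silent before it is killed and restarted
bind = "127.0.0.1:8000"  # assumed address, e.g. behind a reverse proxy
```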

Roll your own

If you run the tool locally (e.g. via PyCharm or Docker) you can avoid all timeout limitations.

It's nice to do your own computing as it gives you freedom and control.

Source code

Impact

2024-02-18 week 3

  • Scientific articles: 24 079 010/41 447 254. Items missing at least one topic: 58.10%
  • Riksdagen documents: 183 820/263 831. Items missing at least one topic: 69.67%
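
The percentages can be reproduced from the counts, reading each pair as items missing a topic over total items in the subgraph (the helper name is only for illustration):

```python
def missing_share(missing, total):
    """Share of items missing a topic, as a percentage with two decimals."""
    return f"{100 * missing / total:.2f}%"


print(missing_share(24_079_010, 41_447_254))  # scientific articles
print(missing_share(183_820, 263_831))        # Riksdagen documents
```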

The tool is now hosted on a private VPS in Germany.

2024-02-01 week 1

Down to 24 153 666 scholarly articles missing any subject. Difference from the baseline: about -41k (24 194 625 - 24 153 666 = 40 959).

The tool has not yet been launched in a way that enables batches larger than 30 items because of limitations in Toolforge.

2024-01-24 baseline

The number of scholarly articles missing any subject is 24 194 625. That is a few million more items needing a topic than when I quit using ItemSubjector, and recommended that others do the same, in 2022.

The matching of journals to topics has changed by only a few hundred, to 85 080, since I worked on it with ItemSubjector. We really need that to be zero so that we can match better by selecting a subset of articles. E.g. when working on a physics topic we can exclude all articles not linked to a physics journal.