Wikidata:Wikidata curricula/Activities/Pywikibot/Missing label in target language

A very simple Pywikibot script that copies an item's label from a source language to target languages in which the label is missing.

Example: Select all Belgian persons (politicians, engineers, business people, ...) that are missing a label in a given language.

This was the first Python script I ever wrote.

At first I implemented it as a Wikidata query combined with QuickStatements. A fully automated Python script is much more effective, because no manual data manipulation with Excel is required.

Usage

You can run this script from the shell, or from PAWS:

./missing_person_label.py [input language] [output languages]...

Simplified script

The simplified script below gives you an idea of what the full script does.

Note: You can consult the complete script code history, including technical details, and more documentation.

Tips:

You can select the country, the source, and the target languages
I am running this on an always-on Raspberry Pi (actually a Piwikibot 🙂)
- Low power consumption, and it is serving other functions in the home anyway...
- My laptop does not need to stay online... saving electricity, and allows me to travel onto another network while the script runs on the Pi...

#!/usr/bin/python3

import sys
import time
import pywikibot
from datetime import datetime
from pywikibot import pagegenerators as pg

def wd_proc_all_items():
    QUERY = """# Search for Belgian citizens that have a label in the source language but none in the target language
SELECT DISTINCT ?item WHERE {
  VALUES ?instance { wd:Q5 }
  VALUES ?country {wd:Q31 }
  ?item wdt:P31 ?instance;
    wdt:P27 ?country;
    rdfs:label ?itemLabel.
  FILTER((LANG(?itemLabel)) = '""" + inlang + """')
  MINUS {
    ?item rdfs:label ?label.
    FILTER((LANG(?label)) = '""" + outlang + """')
  }
}
"""
    print(QUERY)
    wikidata_site = pywikibot.Site("wikidata", "wikidata")
    generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)

    i = 0
    errsleep = 0
    print('Getting data')
    now = datetime.now()

    for item in generator:
        i += 1
        status = 'OK'
        label = ''

        try:
            item.get()
            if inlang in item.labels:
                label = item.labels[inlang]

            if label == '':     # Label not available; skip update
                status = 'Ignore'
            elif outlang in item.labels: # Target label already updated; skip duplicate update
                status = 'Skip'
            else:               # Update the target item label

                item.editLabels( {outlang: label}, summary="Pwb copy " + inlang + " label" )
                errsleep = 0
        except KeyboardInterrupt:
            sys.exit(1)
        except Exception:       # Any other error: report it and continue with the next item
            status = 'Error'
            totsecs = int((datetime.now() - now).total_seconds())       # Calculate technical error penalty
            if totsecs >= 30:   # Technical error
                errsleep += totsecs * 5
            if errsleep > 0:    # Allow the servers to catch up
                print('%d seconds maxlag wait' % errsleep)
                time.sleep(errsleep)

        prevnow = now
        now = datetime.now()
        isotime = now.strftime("%Y-%m-%d %H:%M:%S")
        totsecs = (now - prevnow).total_seconds()
        print('%d\t%s\t%s\t%f\t%s\t%s' % (i, isotime, status, totsecs, item.getID(), label))

param = sys.argv                # Get the command parameters
if len(param) <= 2:             # Too few parameters; show the usage
    print('Usage: missing_person_label.py [input language] [output languages]...')
else:
    param.pop(0)                # Skip the name of the executable
    inlang = param.pop(0)       # P1 = Source language (mandatory parameter)

    for outlang in param:       # Loop over all target languages (at least one is mandatory)
        if inlang != outlang:   # Skip the input language
            wd_proc_all_items() # Process all items for one target language
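
The tips mention that the country and the languages can be selected, while the simplified script hard-codes Belgium (Q31) in the query. As a minimal sketch of how the query could be parameterised, the helper below builds the same SPARQL text from three arguments. The names QUERY_TEMPLATE and build_query are hypothetical and not part of the original script; the validation patterns are an assumption, added only to keep stray quotes or braces out of the query.

```python
import re

# Same query as in the simplified script, with placeholders.
# Doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """SELECT DISTINCT ?item WHERE {{
  ?item wdt:P31 wd:Q5;
    wdt:P27 wd:{country};
    rdfs:label ?itemLabel.
  FILTER(LANG(?itemLabel) = '{inlang}')
  MINUS {{
    ?item rdfs:label ?label.
    FILTER(LANG(?label) = '{outlang}')
  }}
}}
"""

def build_query(country, inlang, outlang):
    """Return the SPARQL query for one country and one language pair."""
    if not re.fullmatch(r"Q[1-9][0-9]*", country):
        raise ValueError("not a Q-id: %r" % country)
    for lang in (inlang, outlang):
        # Assumption: simple lowercase codes such as 'nl', 'en' or 'zh-hans'
        if not re.fullmatch(r"[a-z]{2,3}(-[a-z0-9]+)*", lang):
            raise ValueError("not a language code: %r" % lang)
    return QUERY_TEMPLATE.format(country=country, inlang=inlang, outlang=outlang)
```

For example, build_query("Q55", "nl", "en") would select citizens of the Netherlands that have a Dutch label but no English one.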

Prerequisites

You need to install and configure Pywikibot on a (virtual, private) Linux system, or use PAWS on a shared server.

Known problems

It is important to have a proper error handler so that the script can recover from errors in a single transaction. Without a proper error handler the script would fail (repeatedly) with a fatal error on the (same) first failing transaction and would not continue with the remaining transactions.
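
The idea can be sketched independently of Pywikibot: wrap each transaction in its own try block, record the status, wait a while after a failure, and move on. The function process_all and its parameters are hypothetical names chosen for this illustration. Note that it uses a simple doubling delay, which is a different technique from the elapsed-time penalty in the simplified script above.

```python
import time

def process_all(items, update, sleep=time.sleep, base_delay=30, max_delay=600):
    """Apply update(item) to every item; an error on one item never stops the run.

    Returns a list of (item, status) tuples, where status is 'OK' or 'Error'.
    """
    results = []
    delay = 0
    for item in items:
        try:
            update(item)
            status = 'OK'
            delay = 0           # A success resets the penalty
        except Exception as error:  # Does not catch KeyboardInterrupt, so Ctrl-C still works
            status = 'Error'
            # Double the wait after each consecutive failure, up to max_delay
            delay = min(max(base_delay, delay * 2), max_delay)
            print('%s: %s; waiting %d seconds' % (item, error, delay))
            sleep(delay)
        results.append((item, status))
    return results
```

Passing the sleep function as a parameter makes the handler easy to try out without actually waiting.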

Execute by item

Updates should be executed by item, instead of by language, so that each item is edited only once (this avoids multiple watchlist notifications).
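
A minimal sketch of the by-item approach: first work out all target languages that still lack a label for one item, then save them in a single edit. The helper labels_to_add is a hypothetical name, not taken from the complete script; it only needs a mapping of language code to label, such as item.labels.

```python
def labels_to_add(labels, inlang, outlangs):
    """Return {language: label} for every target language that has no label yet."""
    source = labels.get(inlang, '')
    if not source:              # No source label; nothing to copy
        return {}
    return {lang: source for lang in outlangs
            if lang != inlang and lang not in labels}

# Hypothetical use with a Pywikibot ItemPage named item:
#     new_labels = labels_to_add(item.labels, 'nl', ['en', 'fr', 'de'])
#     if new_labels:
#         item.editLabels(new_labels, summary='Pwb copy nl label')
```

With this structure the SPARQL query would have to select items missing a label in any of the target languages, rather than being run once per language.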

User errors

WARNING: Http response status 400
- Syntax error in the SPARQL code
  - Paste your query in the Wikidata Query Service to verify and correct it
ERROR: An error occurred for uri ... WARNING: Waiting 240 seconds before retrying.
- Too many items in query: add additional filters to reduce the number of items
- You can relax the filters after data gets processed
WARNING: Http response status 429
- LIMIT too high or missing

Data errors

WARNING: wikibase-form datatype is not supported yet.
WARNING: wikibase-lexeme datatype is not supported yet.
WARNING: API error modification-failed: Item Q682310 already has label "Gerrit Schimmelpenninck" associated with language code en, using the same description text.
- The target label remained empty
- Another item had the same label and description
  - Edit the 2 items to have a unique description (e.g. adding the birth/death date)
  - You should add different from (P1889) for both items
  - Possibly you might need to merge 2 identical items
WARNING: API error editconflict: Edit conflict. Could not patch the current revision.

Login failure

WARNING: API error badtoken: Invalid CSRF token: general problem with authentication (temporary)

Server errors

MaxlagTimeoutError: retry later (replication server busy)
OtherPageSaveError: ignore (the update was still made; verify the item update history)
ReadTimeoutError: retry later (HTTPS network error)
WARNING: API error failed-save: The save has failed. (general error)
Sleeping for 9.0 seconds, 2020-06-11 00:17:54
- The script runs slowly on purpose, so as not to overload the servers (about 10 transactions per minute; use put_throttle = 6)
- Set noisysleep = 60.0 to avoid too many "Sleeping" messages
- When a transaction error occurs, the script sleeps for some minutes (maxlag wait suspected)
- Create and configure a bot account (higher transaction speed allowed)
- You can increase the execution speed by assigning a lower value to put_throttle
Maximum retries attempted due to maxlag without success.
Maximum retries attempted without success.
WARNING: API error readonly: The database has been automatically locked while the replica database servers catch up to the master
WARNING: API error internal_api_error_JobQueueError
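
The two throttle settings mentioned under the "Sleeping" message are normally placed in the Pywikibot configuration file user-config.py. A minimal fragment, using only the values quoted above, might look like this:

```python
# user-config.py (fragment)
put_throttle = 6        # Minimum number of seconds between two page saves
noisysleep = 60.0       # Only show "Sleeping" messages for waits longer than this many seconds
```

With put_throttle = 6 the script makes at most 60 / 6 = 10 edits per minute, which matches the rate stated above.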

Network errors

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='query.wikidata.org', port=443): Read timed out.
requests.exceptions.ConnectionError: HTTPSConnectionPool (network error)
Remote end closed connection without response

PAWS

"Username unknown" problem with OAuth and sitelinks to special namespaces

Tracked in Phabricator: Task T168222 and Task T252306

Workaround: create the missing username with e.g. wikisource:Special:UserLogin

Notes

You should manually amend failed transactions.
Transactions that were skipped because of a transient error can be retried later.
You should wait until the transactions are replicated to the SPARQL reporting instance before re-executing the script, to avoid duplicate transactions.

External link

https://github.com/geertivp/Pywikibot/blob/main/copy_label.py (replaces the above script)