Wikidata:Wikidata curricula/Activities/Pywikibot/Missing label in target language

A very simple Pywikibot script to replicate missing language labels.

Example: Select all Belgian persons (politicians, engineers, business people, ...) having missing language labels.

This was the first Python script I ever wrote.

I first implemented it as a Wikidata script combined with QuickStatements, but a fully automated Python script is much more effective (no manual data manipulation in Excel required).

Usage

You can run this script from the shell or from PAWS:

./missing_person_label.py [input language] [output languages]...

Simplified script

The simplified script below gives you an idea of what it does.

Note: You can consult the complete script code history, including technical details, and more documentation.

Tips:

  • You can select the country, the source, and the target languages
  • I am running this on an always-on Raspberry Pi (actually a Piwikibot 🙂)
    • Low power consumption, and it is serving other functions in the home anyway...
    • My laptop does not need to stay online... saving electricity, and allows me to travel onto another network while the script runs on the Pi...
#!/usr/bin/python3

import sys
import time
import pywikibot
from datetime import datetime
from pywikibot import pagegenerators as pg

def wd_proc_all_items():
    QUERY = """# Search for Belgian citizens that have a label in the source language but not in the target language
SELECT DISTINCT ?item WHERE {
  VALUES ?instance { wd:Q5 }
  VALUES ?country {wd:Q31 }
  ?item wdt:P31 ?instance;
    wdt:P27 ?country;
    rdfs:label ?itemLabel.
  FILTER((LANG(?itemLabel)) = '""" + inlang + """')
  MINUS {
    ?item rdfs:label ?label.
    FILTER((LANG(?label)) = '""" + outlang + """')
  }
}
"""
    print(QUERY)
    wikidata_site = pywikibot.Site("wikidata", "wikidata")
    generator = pg.WikidataSPARQLPageGenerator(QUERY, site=wikidata_site)

    i = 0
    errsleep = 0
    print('Getting data')
    now = datetime.now()

    for item in generator:
        i += 1
        status = 'OK'
        label = ''

        try:
            item.get()
            if inlang in item.labels:
                label = item.labels[inlang]

            if label == '':     # Label not available; skip update
                status = 'Ignore'
            elif outlang in item.labels: # Target label already updated; skip duplicate update
                status = 'Skip'
            else:               # Update the target item label

                item.editLabels( {outlang: label}, summary="Pwb copy " + inlang + " label" )
                errsleep = 0
        except KeyboardInterrupt:
            sys.exit(1)
        except Exception:       # Any other error: report it and continue with the next item
            status = 'Error'
            totsecs = int((datetime.now() - now).total_seconds())       # Calculate technical error penalty
            if totsecs >= 30:   # Technical error
                errsleep += totsecs * 5
            if errsleep > 0:    # Allow the servers to catch up
                print('%d seconds maxlag wait' % errsleep)
                time.sleep(errsleep)

        prevnow = now
        now = datetime.now()
        isotime = now.strftime("%Y-%m-%d %H:%M:%S")
        totsecs = (now - prevnow).total_seconds()
        print('%d\t%s\t%s\t%f\t%s\t%s' % (i, isotime, status, totsecs, item.getID(), label))

param = sys.argv                # Get the command parameters
if len(param) <= 2:             # Too few arguments: show the usage
    print('Usage: missing_person_label.py [input language] [output languages]...')
else:
    param.pop(0)                # Skip the name of the executable
    inlang = param.pop(0)       # P1 = Source language (mandatory parameter)

    for outlang in param:       # Loop over all target languages (at least one is required)
        if inlang != outlang:   # Skip the input language
            wd_proc_all_items() # Execute all items for one language
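
The tip about selecting the country and the languages could be supported by moving the query construction into a small helper. The following is a minimal sketch, not part of the original script: the function name build_query and its parameters are hypothetical, and the SPARQL body is simply the one from the simplified script with the country made variable.

```python
def build_query(country_qid, inlang, outlang):
    """Return a SPARQL query for humans (Q5) with the given citizenship (P27)
    that have a label in inlang but none in outlang.

    build_query and its parameter names are hypothetical, not part of the
    original script.
    """
    return """SELECT DISTINCT ?item WHERE {
  VALUES ?instance { wd:Q5 }
  VALUES ?country { wd:%s }
  ?item wdt:P31 ?instance;
    wdt:P27 ?country;
    rdfs:label ?itemLabel.
  FILTER((LANG(?itemLabel)) = '%s')
  MINUS {
    ?item rdfs:label ?label.
    FILTER((LANG(?label)) = '%s')
  }
}
""" % (country_qid, inlang, outlang)


if __name__ == "__main__":
    # Q31 = Belgium, as in the simplified script
    print(build_query("Q31", "nl", "fr"))
```

The returned string can be passed to pg.WikidataSPARQLPageGenerator in the same way as QUERY in the script above.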

Prerequisites

You need to install and configure Pywikibot on a (virtual, private) Linux system, or use PAWS on a shared server.

Known problems

It is important to have a proper error handler so that the script can recover from errors in single transactions. Without one, the script would fail (repeatedly) with a fatal error on the same first failing transaction and would not continue with the remaining transactions.
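
The idea of recovering from single transaction errors can be sketched independently of Pywikibot. This is a minimal illustration under stated assumptions: process_items, update, max_wait and sleep are hypothetical names, and the doubling wait is a simple stand-in, not the penalty calculation used in the script above.

```python
import time


def process_items(items, update, max_wait=600, sleep=time.sleep):
    """Call update(item) for every item; one failure never stops the run.

    Returns the list of items whose update failed, so that they can be
    retried or amended by hand afterwards. All names are hypothetical.
    """
    failed = []
    wait = 0                        # Current penalty in seconds
    for item in items:
        try:
            update(item)
            wait = 0                # Success: reset the penalty
        except Exception as error:  # Does not catch KeyboardInterrupt (Ctrl-C)
            failed.append(item)
            wait = min(max_wait, max(5, wait * 2))
            print('%s failed (%s); waiting %d seconds' % (item, error, wait))
            sleep(wait)             # Allow the servers to catch up
    return failed
```

Returning the failed items instead of only printing them makes it easier to retry them in a later run.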

Execute by item

  • Updates should be executed per item rather than per language, so that each item is edited only once (this avoids multiple watchlist notifications)
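
A per-item update could collect all missing target labels first and then save them in a single edit. The sketch below shows only the collection step as a plain function; missing_labels is a hypothetical name, and labels stands for a mapping such as item.labels in Pywikibot.

```python
def missing_labels(labels, inlang, outlangs):
    """Return {language: label} for every target language without a label.

    labels maps language codes to labels, like item.labels in Pywikibot.
    The function name is hypothetical, not part of the original script.
    """
    source = labels.get(inlang, '')
    if not source:                  # No source label: nothing to copy
        return {}
    return {lang: source for lang in outlangs
            if lang != inlang and lang not in labels}


# With a Pywikibot ItemPage this would become one edit per item, e.g.:
#     new_labels = missing_labels(item.labels, inlang, outlangs)
#     if new_labels:
#         item.editLabels(new_labels, summary="Pwb copy " + inlang + " label")
```

The SPARQL query would then also have to select items that lack a label in any of the target languages; that change is not shown here.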

User errors

  • WARNING: Http response status 400
    • Syntax error in the SPARQL code
  • ERROR: An error occurred for uri ... WARNING: Waiting 240 seconds before retrying.
    • Too many items in query: add additional filters to reduce the number of items
    • You can relax the filters after data gets processed
  • WARNING: Http response status 429
    • LIMIT too high or missing
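
One way to keep the number of items per run small is to append a LIMIT clause to the query. This is a minimal sketch; with_limit is a hypothetical helper, and it assumes the query does not already end with a LIMIT clause.

```python
def with_limit(query, limit):
    """Append a LIMIT clause to a SPARQL SELECT query (hypothetical helper).

    Assumes the query does not already end with a LIMIT clause.
    """
    return query.rstrip() + "\nLIMIT %d\n" % limit
```

Because items drop out of the query result once their target label exists, repeated runs with a modest limit should work through the backlog step by step.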

Data errors

  • WARNING: wikibase-form datatype is not supported yet.
  • WARNING: wikibase-lexeme datatype is not supported yet.
  • WARNING: API error modification-failed: Item Q682310 already has label "Gerrit Schimmelpenninck" associated with language code en, using the same description text.
    • The target label remained empty
    • Another item had the same label and description
      • Edit the 2 items to have a unique description (e.g. adding the birth/death date)
      • You should add different from (P1889) for both items
      • Possibly you might need to merge 2 identical items
  • WARNING: API error editconflict: Edit conflict. Could not patch the current revision.

Login failure

  • WARNING: API error badtoken: Invalid CSRF token: general problem with authentication (temporary)

Server errors

  • MaxlagTimeoutError: retry later (replication server busy)
  • OtherPageSaveError: ignore (the update was still made; verify the item update history)
  • ReadTimeoutError: retry later (HTTPS network error)
  • WARNING: API error failed-save: The save has failed. (general error)
  • Sleeping for 9.0 seconds, 2020-06-11 00:17:54
    • The script deliberately runs slowly so as not to overload the servers (about 10 transactions per minute; use put_throttle = 6)
    • Set noisysleep = 60.0 to avoid too many "Sleeping" messages
    • When a transaction error occurs, the application sleeps for some minutes (maxlag wait suspected)
    • Create and configure a bot account (higher transaction speed allowed)
    • You can increase the execution speed by assigning a lower value to put_throttle
  • Maximum retries attempted due to maxlag without success.
  • Maximum retries attempted without success.
  • WARNING: API error readonly: The database has been automatically locked while the replica database servers catch up to the master
  • WARNING: API error internal_api_error_JobQueueError
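
The two throttle settings mentioned above would typically go into the Pywikibot user-config.py file. The fragment below only repeats the values from the list; whether they suit your account is an assumption you should check.

```python
# Fragment for user-config.py (values taken from the list above)
put_throttle = 6      # Seconds to wait between two edits
noisysleep = 60.0     # Only report sleeps of at least this many seconds
```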

Network errors

  • requests.exceptions.ConnectionError: HTTPSConnectionPool(host='query.wikidata.org', port=443): Read timed out.
  • requests.exceptions.ConnectionError: HTTPSConnectionPool (network error)
  • Remote end closed connection without response

PAWS

  • "Username unknown" problem with OAuth and sitelinks to special namespaces

Workaround: create the missing username with e.g. wikisource:Special:UserLogin

Notes

  1. You should manually amend failed transactions.
  2. Transactions that were skipped because of a transient error can be retried later.
  3. To avoid duplicate transactions, wait until the transactions are replicated to the SPARQL reporting instance before re-executing the script.

External link

See also