Wikidata:Pywikibot - Python 3 Tutorial/Data Harvest

This chapter will show you how to gather data. We will explore the Wikidata API and look at the different return values. After working through this chapter you should be competent enough to explore the return values yourself and print the desired data to the terminal.

Remember that all scripts are run inside the core directory using the following command:

$ python3 nameofscript.py

Getting labels

This example fetches the Wikidata item connected to the English Wikipedia page for Douglas Adams:

#!/usr/bin/python3
import pywikibot

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "Douglas Adams")
item = pywikibot.ItemPage.fromPage(page)

print(item)

Let us look at the output in detail. print(item) will output [[wikidata:Q42]], which is the string representation of an ItemPage object. The attributes and methods available on the object can be listed by adding print(dir(item)) to the script and running it again. The result will look like this:

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_cmpkey', '_content', '_cosmetic_changes_hook', '_defined_by', '_diff_to', '_getInternals', '_get_parsed_page', '_isredir', '_latest_cached_revision', '_link', '_namespace', '_normalizeData', '_normalizeLanguages', '_revid', '_revisions', '_save', 'addClaim', 'aliases', 'applicable_protections', 'aslink', 'autoFormat', 'backlinks', 'botMayEdit', 'canBeEdited', 'categories', 'change_category', 'claims', 'content_model', 'contributingUsers', 'contributors', 'coordinates', 'data_item', 'data_repository', 'defaultsort', 'delete', 'descriptions', 'editAliases', 'editDescriptions', 'editEntity', 'editLabels', 'editTime', 'embeddedin', 'encoding', 'exists', 'expand_text', 'extlinks', 'fromPage', 'fullVersionHistory', 'full_url', 'get', 'getCategoryRedirectTarget', 'getCreator', 'getDeletedRevision', 'getID', 'getLatestEditors', 'getMovedTarget', 'getOldVersion', 'getRedirectTarget', 'getReferences', 'getRestrictions', 'getSitelink', 'getTemplates', 'getVersionHistory', 'getVersionHistoryTable', 'getdbName', 'id', 'image_repository', 'imagelinks', 'interwiki', 'isAutoTitle', 'isCategory', 'isCategoryRedirect', 'isDisambig', 'isEmpty', 'isFlowPage', 'isImage', 'isIpEdit', 'isRedirectPage', 'isStaticRedirect', 'isTalkPage', 'is_flow_page', 'iterlanglinks', 'iterlinks', 'itertemplates', 'labels', 'langlinks', 'lastNonBotUser', 'latestRevision', 'latest_revision', 'latest_revision_id', 'linkedPages', 'loadDeletedRevisions', 'markDeletedRevision', 'mergeInto', 'move', 'moved_target', 'namespace', 'oldest_revision', 'pageAPInfo', 'permalink', 'preloadText', 'previousRevision', 'previous_revision_id', 'properties', 'protect', 'protection', 
'purge', 'put', 'put_async', 'removeClaims', 'removeImage', 'removeSitelink', 'removeSitelinks', 'replaceImage', 'repo', 'revision_count', 'revisions', 'save', 'section', 'sectionFreeTitle', 'setSitelink', 'setSitelinks', 'set_redirect_target', 'site', 'sitelinks', 'templates', 'text', 'title', 'titleForFilename', 'titleWithoutNamespace', 'toJSON', 'toggleTalkPage', 'touch', 'undelete', 'urlname', 'userName', 'version', 'watch']

Don't be overwhelmed by all the possibilities; just learn the methods one by one. For example, if we add print(item.exists()) to the script, we can already guess that it will print True, because the item exists.
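
The long listing is easier to scan once the names that start with an underscore are filtered out. The following is a minimal sketch of that idea. The helper public_names and the stand-in class Demo are hypothetical names introduced for illustration and are not part of Pywikibot; in the script above you would pass item instead of Demo().

```python
def public_names(obj):
    """Return the attribute and method names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class Demo:
    """Stand-in object so the sketch runs without a network connection."""

    def __init__(self):
        self.labels = {}
        self._content = None  # private by convention, filtered out below

    def exists(self):
        return True


print(public_names(Demo()))  # ['exists', 'labels']
```

Calling public_names(item) in the tutorial script should give the same list as before, minus the internal names such as '__class__' or '_content'.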

The next important method is item.get(), which loads the item's data and returns it as a dictionary. Add the following lines to the example and run it again:

item_dict = item.get()
print(item_dict.keys())

So now we know that the returned dictionary has the following keys:

dict_keys(['aliases', 'sitelinks', 'descriptions', 'claims', 'labels'])

And if we are interested in the labels we can add print(item_dict["labels"]) in the example. This will output:

{'gl': 'Douglas Adams', 'pl': 'Douglas Adams', 'gu': 'ડગ્લાસ એડમ્સ', 'wo': 'Douglas Adams', 'nl': 'Douglas Adams', 'is': 'Douglas Adams', 'vo': 'Douglas Adams', 'de-ch': 'Douglas Adams', 'eu': 'Douglas Adams', 'nb': 'Douglas Adams', 'frp': 'Douglas Adams', 'an': 'Douglas Adams', 'tr': 'Douglas Adams', 'vep': 'Adams Duglas', 'ak': 'Doglas Adams', 'mg': 'Douglas Adams', 'lb': 'Douglas Adams', 'zh-cn': '道格拉斯·亚当斯', 'ckb': 'دەگلاس ئادمز', 'vi': 'Douglas Adams', 'gd': 'Douglas Adams', 'sc': 'Douglas Adams', 'zh-sg': '道格拉斯·亚当斯', 'ka': 'დაგლას ადამსი', 'scn': 'Douglas Adams', 'da': 'Douglas Adams', 'be': 'Дуглас Адамс', 'kn': 'ಡಾಗ್ಲಸ್ ಆಡಮ್ಸ್', 'sh': 'Douglas Adams', 'de': 'Douglas Adams', 'ar': 'دوغلاس آدمز', 'pt-br': 'Douglas Adams', 'sl': 'Douglas Adams', 'li': 'Douglas Adams', 'hy': 'Դուգլաս Ադամս', 'zh': '道格拉斯·亚当斯', 'nds': 'Douglas Adams', 'sv': 'Douglas Adams', 'sr-el': 'Daglas Adams', 'bn': 'ডগলাস', 'zh-tw': '道格拉斯·亞當斯', 'ca': 'Douglas Adams', 'uk': 'Дуглас Адамс', 'ta': 'டக்ளஸ் ஆடம்ஸ்', 'la': 'Duglassius Adams', 'ja': 'ダグラス・アダムズ', 'el': 'Ντάγκλας Άνταμς', 'id': 'Douglas Adams', 'co': 'Douglas Adams', 'sk': 'Douglas Adams', 'nrm': 'Douglas Adams', 'lt': 'Douglas Adams', 'en-ca': 'Douglas Adams', 'ms': 'Douglas Adams', 'de-at': 'Douglas Adams', 'hu': 'Douglas Adams', 'als': 'Douglas Adams', 'zh-my': '道格拉斯·亚当斯', 'fur': 'Douglas Adams', 'io': 'Douglas Adams', 'rm': 'Douglas Adams', 'lij': 'Douglas Adams', 'et': 'Douglas Adams', 'jv': 'Douglas Adams', 'fr': 'Douglas Adams', 'nds-nl': 'Douglas Adams', 'ro': 'Douglas Adams', 'zh-mo': '道格拉斯·亞當斯', 'bg': 'Дъглас Адамс', 'kl': 'Douglas Adams', 'oc': 'Douglas Adams', 'az': 'Duqlas Noel Adams', 'kg': 'Douglas Adams', 'sr': 'Даглас Адамс', 'zh-hant': '道格拉斯·亞當斯', 'gsw': 'Douglas Adams', 'mrj': 'Адамс', 'bar': 'Douglas Adams', 'war': 'Douglas Adams', 'zh-hk': '道格拉斯·亞當斯', 'it': 'Douglas Adams', 'lmo': 'Douglas Adams', 'be-tarask': 'Дуглас Адамз', 'vec': 'Douglas Adams', 'si': 'ඩග්ලස් ඇඩම්ස්', 'br': 'Douglas Adams', 'wa': 'Douglas 
Adams', 'ko': '더글러스 애덤스', 'sr-ec': 'Даглас Адамс', 'pms': 'Douglas Adams', 'zh-hans': '道格拉斯·亚当斯', 'ia': 'Douglas Adams', 'te': 'డగ్లస్ ఆడమ్స్', 'be-x-old': 'Дуглас Адамс', 'hi': 'डग्लस अ\u200dडम्स', 'fi': 'Douglas Adams', 'nap': 'Douglas Adams', 'ne': 'डगलस एडम्स', 'ru': 'Дуглас Адамс', 'mk': 'Даглас Адамс', 'cs': 'Douglas Adams', 'ie': 'Douglas Adams', 'he': 'דאגלס אדאמס', 'vls': 'Douglas Adams', 'sco': 'Douglas Adams', 'es': 'Douglas Adams', 'min': 'Douglas Adams', 'fo': 'Douglas Adams', 'ml': 'ഡഗ്ലസ് ആഡംസ്', 'bs': 'Douglas Adams', 'eo': 'Douglas Adams', 'pt': 'Douglas Adams', 'sw': 'Douglas Adams', 'ast': 'Douglas Adams', 'arz': 'دوجلاس ادامز', 'or': 'ଡଗ୍\u200cଲାସ୍\u200c ଆଦାମ୍\u200cସ', 'en-gb': 'Douglas Adams', 'nn': 'Douglas Adams', 'en': 'Douglas Adams', 'zu': 'Douglas Adams', 'lv': 'Duglass Adamss', 'hr': 'Douglas Adams', 'pcd': 'Douglas Adams', 'ur': 'ڈگلس ایڈم', 'ga': 'Douglas Adams', 'rwr': 'डग्लस अ\u200dडम्स', 'mr': 'डग्लस अॅडम्स', 'cy': 'Douglas Adams', 'af': 'Douglas Adams', 'sq': 'Douglas Adams', 'fa': 'داگلاس آدامز'}

The curly braces tell you that you are getting a dictionary whose keys are language codes and whose values are the current labels in those languages. You can also see a big advantage of Python 3: Unicode characters are printed as they are, so the output is much more readable than the escaped form that Python 2 would print, for example for Farsi: {..., u'fa': u'\u062f\u0627\u06af\u0644\u0627\u0633 \u0622\u062f\u0627\u0645\u0632', ...}.
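
Since not every item has a label in every language, a script usually wants a fallback order rather than a single hard-coded key. The sketch below shows one way to do that, assuming the labels object behaves like a plain dictionary. The function pick_label and its default language order are hypothetical and only serve as an illustration; the sample data is a small excerpt of the output above.

```python
def pick_label(labels, preferred=("en", "de", "fr")):
    """Return the first label found for the preferred language codes, or None."""
    for code in preferred:
        if code in labels:
            return labels[code]
    return None


# A small excerpt of the labels dictionary printed above.
labels = {
    "de": "Douglas Adams",
    "ru": "Дуглас Адамс",
    "lv": "Duglass Adamss",
    "ja": "ダグラス・アダムズ",
}

print(pick_label(labels))                 # no "en" in the excerpt, so "de" is used
print(pick_label(labels, ("lv", "en")))   # Latvian label comes first
print(pick_label(labels, ("xx",)))        # unknown code, prints None
```

In the tutorial script you would call pick_label(item_dict["labels"]) instead of using the excerpt.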

Accessing the Wikidata item directly

It is also possible to load the Wikidata item directly. All that is needed is the Q-code of the item (its ID, such as Q42). The following code gives you the same kind of ItemPage object as before:

import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()
item = pywikibot.ItemPage(repo, "Q42")
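
When the Q-code comes from user input or a file, it can help to tidy it up before handing it to ItemPage. The following sketch assumes that an item ID is the letter Q followed by a positive integer. The names QID_PATTERN and normalize_qid are hypothetical helpers for illustration, not Pywikibot functions.

```python
import re

# Assumption: item IDs are the letter Q followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def normalize_qid(text):
    """Return a cleaned-up item ID such as 'Q42', or raise ValueError."""
    candidate = text.strip().upper()
    if not QID_PATTERN.fullmatch(candidate):
        raise ValueError(f"not a valid item ID: {text!r}")
    return candidate


print(normalize_qid(" q42 "))  # Q42
```

The cleaned value could then be used in place of the literal "Q42" in the script above.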

Getting claims

Next we look at how to extract claims from Wikidata. We will also continue our exploratory approach and learn how to work like a programmer in uncharted territory. Let's start out with this script, which loads the item electron (Q2225):

import pywikibot

# Get the item
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()
item = pywikibot.ItemPage(repo, "Q2225")

item_dict = item.get()  # Get the item dictionary
clm_dict = item_dict["claims"]  # Get the claim dictionary
print(clm_dict)

When we run this script as usual we will get a nice dictionary of all the statements the electron-item currently has:

{'P1014': [<pywikibot.page.Claim object at 0x7f38a8fbc438>], 'P1123': [<pywikibot.page.Claim object at 0x7f38ac53f9e8>], 'P227': [<pywikibot.page.Claim object at 0x7f38ac53fa90>], 'P31': [<pywikibot.page.Claim object at 0x7f38ac53fb00>], 'P517': [<pywikibot.page.Claim object at 0x7f38ac53fc50>, <pywikibot.page.Claim object at 0x7f38ac53fe48>], 'P461': [<pywikibot.page.Claim object at 0x7f38ac53ff98>], 'P1122': [<pywikibot.page.Claim object at 0x7f38ac4c65c0>], 'P279': [<pywikibot.page.Claim object at 0x7f38ac4c62e8>], 'P61': [<pywikibot.page.Claim object at 0x7f38ac4c6358>], 'P373': [<pywikibot.page.Claim object at 0x7f38ac4c64a8>], 'P910': [<pywikibot.page.Claim object at 0x7f38ac4c6128>], 'P575': [<pywikibot.page.Claim object at 0x7f38ac4c6668>], 'P508': [<pywikibot.page.Claim object at 0x7f38ac4c6748>], 'P349': [<pywikibot.page.Claim object at 0x7f38ac4c67f0>], 'P1051': [<pywikibot.page.Claim object at 0x7f38ac4c67b8>], 'P1036': [<pywikibot.page.Claim object at 0x7f38ac4c6a20>], 'P1360': [<pywikibot.page.Claim object at 0x7f38ac4c66d8>], 'P2069': [<pywikibot.page.Claim object at 0x7f38ac4c6a90>], 'P646': [<pywikibot.page.Claim object at 0x7f38ac4c6c18>]}

As you can see, the keys of the dictionary are property IDs stored as strings (e.g. "P1014") and the values are always lists, even if a list contains only one claim. This will be important for the next step, where we want to get the value and unit of the magnetic moment (P2069) claim. Add the following lines at the end of the script:

clm_list = clm_dict["P2069"]

for clm in clm_list:
    print(clm)

The code we added will iterate over the list, printing each claim that is made for the magnetic moment (P2069) property. As you might already expect, the variable clm is a Python object defined by the pywikibot framework. It is a Claim object. If you add another print call inside the for-loop, print(dir(clm)), you will again get the attributes and methods that are available on the object. This will output:

['SNAK_TYPES', 'TARGET_CONVERTER', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_formatDataValue', '_formatValue', '_type', 'addQualifier', 'addSource', 'addSources', 'changeRank', 'changeSnakType', 'changeTarget', 'fromJSON', 'getID', 'getRank', 'getSnakType', 'getSources', 'getTarget', 'getType', 'has_qualifier', 'hash', 'id', 'isQualifier', 'isReference', 'on_item', 'qualifierFromJSON', 'qualifiers', 'rank', 'referenceFromJSON', 'removeSource', 'removeSources', 'repo', 'setRank', 'setSnakType', 'setTarget', 'snak', 'snaktype', 'sources', 'target', 'target_equals', 'toJSON', 'type', 'types', 'value_types']

For a quick overview of each claim we can use the toJSON method. Change the example to the following and run it again:

import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()
item = pywikibot.ItemPage(repo, "Q2225")

item_dict = item.get()
clm_dict = item_dict["claims"]
clm_list = clm_dict["P2069"]

for clm in clm_list:
    print(clm.toJSON())

This will output a practical JSON representation of all the magnetic moment (P2069) statements that should look like this:

{'id': 'Q2225$edaaaf4e-48fd-6503-016c-27d857e55f40', 'type': 'statement', 'mainsnak': {'datavalue': {'value': {'lowerBound': -1.00115965218077, 'unit': 'http://www.wikidata.org/entity/Q737120', 'amount': -1.00115965218076, 'upperBound': -1.00115965218075}, 'type': 'quantity'}, 'property': 'P2069', 'snaktype': 'value', 'datatype': 'quantity'}, 'rank': 'normal'}
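
The result is a nested dictionary, so the interesting part can be reached with ordinary key lookups. The sketch below copies the output above into a variable named claim_json (a name chosen here for illustration) so that it runs without a network connection; in the script you would use clm.toJSON() instead.

```python
# The dictionary printed above, copied here so the sketch runs offline.
claim_json = {
    "id": "Q2225$edaaaf4e-48fd-6503-016c-27d857e55f40",
    "type": "statement",
    "mainsnak": {
        "datavalue": {
            "value": {
                "lowerBound": -1.00115965218077,
                "unit": "http://www.wikidata.org/entity/Q737120",
                "amount": -1.00115965218076,
                "upperBound": -1.00115965218075,
            },
            "type": "quantity",
        },
        "property": "P2069",
        "snaktype": "value",
        "datatype": "quantity",
    },
    "rank": "normal",
}

# Walk down the nesting: statement -> mainsnak -> datavalue -> value.
value = claim_json["mainsnak"]["datavalue"]["value"]
print(value["amount"])
print(value["unit"])
```

Note that this path only works when 'snaktype' is 'value'. Statements marked as "unknown value" or "no value" carry no 'datavalue' key, so a robust script would check the snak type first.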

But we want to get the value and unit directly, so we will change the example another time and continue our unpacking adventure:

for clm in clm_list:
    ...
    clm_trgt = clm.getTarget()
    print(clm_trgt)
    print(type(clm_trgt))
    print(dir(clm_trgt))

This will print:

{
    "amount": -1.00115965218076,
    "lowerBound": -1.00115965218077,
    "unit": "http://www.wikidata.org/entity/Q737120",
    "upperBound": -1.00115965218075
}

<class 'pywikibot.WbQuantity'>

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'amount', 'fromWikibase', 'lowerBound', 'toWikibase', 'unit', 'upperBound']

So now we see that the method getTarget() returns a WbQuantity object, which has attributes and methods of its own. The names amount and unit sound like good candidates for what we are looking for. So let us run this script to see if we are right:

import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()
item = pywikibot.ItemPage(repo, "Q2225")

item_dict = item.get()
clm_dict = item_dict["claims"]
clm_list = clm_dict["P2069"]

for clm in clm_list:
    clm_trgt = clm.getTarget()
    print(clm_trgt.amount)
    print(clm_trgt.unit)

And finally we have the information we are looking for. The value of the magnetic moment (P2069) claim in the item electron (Q2225) is -1.00115965218076 (a numeric value; depending on the Pywikibot version this is a Python float or a decimal.Decimal, which you can confirm with type()) and the unit is http://www.wikidata.org/entity/Q737120 (a Python string, which is the URI of the item Bohr magneton (Q737120)).
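
Because the unit is returned as a URI, a script that wants to look up the unit item needs only the last part of that string. The following sketch shows one simple way to cut it out. The function unit_to_qid is a hypothetical helper, and it assumes the unit is an entity URI of the form shown above.

```python
def unit_to_qid(unit_uri):
    """Return the last path segment of an entity URI, e.g. 'Q737120'."""
    return unit_uri.rstrip("/").rsplit("/", 1)[-1]


print(unit_to_qid("http://www.wikidata.org/entity/Q737120"))  # Q737120
```

The resulting ID could then be passed to pywikibot.ItemPage(repo, ...) in the same way as in the section on accessing an item directly, for example to read the labels of the unit.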

Conclusion

After these two examples you should have a good overview of how to start getting the data you need from Wikidata. Remember to check your return values, their types and, if they are objects, their methods!