Wikidata:WikiProject Cultural heritage/Reports/WLM on WD (Italy)

Introduction

This report describes the process of ingesting data about Italian heritage and cultural properties from Wiki Loves Monuments (WLM) project lists into Wikidata. The project was carried out in summer and fall 2016 by User:Nvitucci in coordination with User:Cristian Cenci (WMIT).

Objectives of the project

The main objectives of this project were:

  • to make the generation of WLM lists on Wikipedia easier;
  • to move as much valid information as possible to Wikidata, for easier management and interoperability with other data sources.

Source data

Our first step was to build a stable “internal” database from the lists compiled for previous editions of the Wiki Loves Monuments contest, to make data management easier. To do so, we had to:

  • deal with monuments that were included in some editions but whose authorization was revoked (or not renewed) for later editions;
  • deal with a legacy ID system, which could cause duplicate items;
  • clean the identifiers: some IDs had a -MIBAC suffix that was added in the past to support a custom template (see [1]), but had otherwise no special meaning.

The database was initially kept as an OpenOffice spreadsheet file in order not to disrupt the existing processes of (new) data insertion while retaining machine readability; it was later converted to a TSV file for easier processing and transformation. This database was (and is) our central source of data related to WLM, and was used to create WLM lists as well as create/update Wikidata items.
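As an illustration of the ID cleaning step, here is a minimal Python sketch, not the tool we actually used. The column names (`id`, `name`) and the sample IDs are hypothetical; only the -MIBAC suffix and the TSV format come from the description above.

```python
import csv
import io

MIBAC_SUFFIX = "-MIBAC"


def clean_wlm_id(raw_id: str) -> str:
    """Strip surrounding whitespace and a trailing -MIBAC suffix, if present."""
    wlm_id = raw_id.strip()
    if wlm_id.upper().endswith(MIBAC_SUFFIX):
        wlm_id = wlm_id[: -len(MIBAC_SUFFIX)]
    return wlm_id


def load_monuments(tsv_text: str) -> dict:
    """Read TSV rows into a dict keyed by cleaned ID, rejecting duplicates."""
    monuments = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        wlm_id = clean_wlm_id(row["id"])  # "id" is an assumed column name
        if wlm_id in monuments:
            # Two legacy IDs collapsing to the same value need a manual look.
            raise ValueError(f"duplicate monument ID after cleaning: {wlm_id}")
        monuments[wlm_id] = {**row, "id": wlm_id}
    return monuments


# Made-up sample data, for illustration only.
SAMPLE_TSV = (
    "id\tname\n"
    "08A1234567-MIBAC\tChiesa di San Rocco\n"
    "08A7654321\tPalazzo comunale\n"
)

if __name__ == "__main__":
    for wlm_id, row in load_monuments(SAMPLE_TSV).items():
        print(wlm_id, row["name"])
```

Raising an error on duplicates, rather than silently merging rows, keeps the decision about legacy IDs in human hands.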

List creation

We built a tool that generates (or updates) Wikitext, to help create the WLM 2016 lists from the newly created database; the tool made it easier to add custom information to selected sets of monuments (e.g. to add local prize fields to monuments from selected cities) without resorting to bots. Later on we extended the tool to create (or update) the Wikidata items related to the monuments.
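The idea of generating list rows from database records can be sketched as follows. This is only an illustration under stated assumptions: the template name `WLM monument row`, the field names and the `local_prize` parameter are hypothetical and do not reflect the actual templates used on Wikipedia.

```python
TEMPLATE_NAME = "WLM monument row"  # hypothetical template name


def render_row(monument, extra_fields=None):
    """Render one monument as a template call, one parameter per line."""
    fields = {**monument, **(extra_fields or {})}
    # Note: values are inserted as-is; pipes or braces are not escaped.
    params = "".join(f"|{key}={value}\n" for key, value in fields.items())
    return "{{" + TEMPLATE_NAME + "\n" + params + "}}"


def render_list(monuments, prize_by_city):
    """Render all rows, adding a local prize field only for selected cities."""
    rows = []
    for monument in monuments:
        prize = prize_by_city.get(monument["city"])
        extra = {"local_prize": prize} if prize else None
        rows.append(render_row(monument, extra))
    return "\n".join(rows)


if __name__ == "__main__":
    sample = [
        {"id": "0123456789", "name": "Torre civica", "city": "Bologna"},
        {"id": "0123456790", "name": "Fontana", "city": "Parma"},
    ]
    print(render_list(sample, {"Bologna": "yes"}))
```

Keeping the custom rule (here, the prize lookup by city) separate from the rendering function is what makes it possible to regenerate a whole list without a bot.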

Wikidata items update/creation

The update and creation of Wikidata items was carried out in phases.

  1. First of all we searched for existing Wikidata items, combining automated search (e.g. matching the monument's Wikipedia page and/or name) with manual checks. This was needed because in some cases the Wikipedia page, although it existed, was not about the monument itself but about a set of monuments or even the city where the monument is located; in other cases, the monument name was not found on Wikidata because it was not included in the label, or because an alias was used instead.
  2. We then updated the existing items via the QuickStatements tool; we created the statements to be loaded (see the #QuickStatements details section) using our extended tool.
  3. We identified the items that could be created automatically, again via the QuickStatements tool: the best showcase for this process is the list of Pompeii items (see the #QuickStatements details section), discussed extensively here.
  4. We created a template (still on a user page only) meant to fetch a single item’s information from Wikidata, with the intention of using it formally from the 2017 edition; we decided not to use it for the 2016 edition because it would have required several changes to the lists. The template can be found here: it "blends" the existing WLM template, where all the information about a monument is entered manually, with another template (wrapping a Lua module) meant to extract such information from Wikidata when a Wikidata identifier is available.
  5. We also developed a Web tool to make the direct insertion of monument data in Wikidata easier, but the tool is still experimental.
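The name-matching part of the first phase can be sketched as below. This is a simplified illustration and not our actual search code: it assumes that candidate items have already been fetched into plain dictionaries with hypothetical keys `qid`, `label` and `aliases`, and it only proposes candidates, which still need the manual check described above.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and collapse whitespace for comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def find_candidates(monument_name, items):
    """Return the QIDs whose label or any alias matches the monument name."""
    target = normalize(monument_name)
    matches = []
    for item in items:
        names = [item["label"], *item.get("aliases", [])]
        if any(normalize(name) == target for name in names):
            matches.append(item["qid"])
    return matches


if __name__ == "__main__":
    # Placeholder QIDs and made-up candidates, for illustration only.
    candidates = [
        {"qid": "Qxxxxxx", "label": "Basilica di San Petronio",
         "aliases": ["San Petronio"]},
        {"qid": "Qyyyyyyyy", "label": "Bologna"},
    ]
    print(find_candidates("san  petronio", candidates))
```

Comparing against aliases as well as labels addresses the case mentioned above where the monument name appears only as an alias.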

QuickStatements details

The QuickStatements tool is used to batch-insert content into Wikidata. QuickStatements is “unsupervised” (i.e. content is inserted with no formal verification other than a check of its format); since unverified data can be inserted, even in large amounts, care is required when adding content.

Schema

We mostly used classes and properties that already existed in Wikidata when we started (summer 2016), especially the existing WLM ID property (which comes with constraints on the use of other properties). We needed to create the Italian national heritage (Q26971668) item to use as the required value of the heritage designation (P1435) property, although we found that it might have some shortcomings (e.g. is it OK to use it for natural heritage as well? Can it be used even if there is no official heritage list?).

Updating existing items

The “safest” route is to only update existing items with further information, in our case a WLM ID, labels and (sometimes) addresses. Example:

Qxxxxxx Lit "name"
Qxxxxxx P2186 "0123456789" S143 Q19960422
Qxxxxxx P17 Q38 S143 Q19960422
Qxxxxxx P131 Qyyyyyyyy S143 Q19960422
Qxxxxxx P1435 Q26971668 S143 Q19960422

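These statements would update the item with Q number Qxxxxxx by:

  • setting its Italian label to "name";
  • adding the WLM ID (P2186) "0123456789";
  • setting country (P17) to Italy (Q38);
  • setting located in the administrative territorial entity (P131) to Qyyyyyyyy;
  • setting heritage designation (P1435) to Italian national heritage (Q26971668).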

All the statements except the label carry a source using imported from Wikimedia project (P143), stating that the information has been imported from Wiki Loves Monuments Italia (Q19960422).
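A generator for update statements like the ones in the example could look like the following Python sketch. It is an illustration only, not our extended tool. The function name and parameters are hypothetical, and the sketch assumes that fields are separated by tabs when the text is pasted into QuickStatements.

```python
# Source appended to each claim: imported from (S143) WLM Italia (Q19960422).
SOURCE = ("S143", "Q19960422")


def update_statements(qid, wlm_id, label_it=None, admin_qid=None):
    """Build tab-separated statements that enrich an existing item."""
    rows = []
    if label_it:
        # Labels take no source. Double quotes in the label are not escaped.
        rows.append([qid, "Lit", f'"{label_it}"'])
    rows.append([qid, "P2186", f'"{wlm_id}"', *SOURCE])  # WLM ID
    rows.append([qid, "P17", "Q38", *SOURCE])            # country: Italy
    if admin_qid:
        rows.append([qid, "P131", admin_qid, *SOURCE])   # administrative entity
    rows.append([qid, "P1435", "Q26971668", *SOURCE])    # heritage designation
    return "\n".join("\t".join(fields) for fields in rows)


if __name__ == "__main__":
    print(update_statements("Qxxxxxx", "0123456789",
                            label_it="name", admin_qid="Qyyyyyyyy"))
```

Making the label and the administrative entity optional reflects the fact that not every database row had all the fields filled in.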

Creating new items

The somewhat "riskier" route is to create a new Wikidata item, as we did for Pompeii buildings; one should first make sure that the item does not already exist, otherwise duplicates may be created. That said, here is an example:

CREATE
LAST Lit "{nome} ({regio}.{insula}.{pos})"
LAST Ait "Pompei {regio}.{insula}.{pos}"
LAST Aen "Pompeii {regio}.{insula}.{pos}"
LAST P17 Q38 S143 Q19960422
LAST P131 Q36471 S143 Q19960422
LAST P31 Q109607
LAST P1435 Q26971668
LAST P2186 "0123456789" S143 Q19960422
LAST P276 Q43332 S143 Q19960422
LAST P361 Qxxxxxx S143 Q19960422
LAST P528 "{regio}.{insula}.{pos}" P972 Q27055447

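These statements would create an item with:

  • an Italian label made of the building name followed by its code;
  • an Italian and an English alias built from the same code;
  • country (P17) set to Italy (Q38) and located in the administrative territorial entity (P131) set to Q36471;
  • instance of (P31) set to Q109607;
  • heritage designation (P1435) set to Italian national heritage (Q26971668);
  • a WLM ID (P2186);

plus some information specific to Pompeii items:

  • location (P276) set to Q43332;
  • part of (P361) set to Qxxxxxx;
  • catalog code (P528) set to the building code, qualified with catalog (P972) Q27055447.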

For Pompeii, inserting all the items (~2000) took several minutes. To avoid using up too many resources, we loaded the statements in chunks, one per regio (Q26912005) of the item (there are 9).
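The chunked loading can be sketched in Python as follows. This is an illustration under assumptions, not the code we ran: the dictionary keys (`nome`, `regio`, `insula`, `pos`, `wlm_id`, `part_of`) are hypothetical names for the database fields, and tab-separated output is assumed; the properties and Q numbers are the ones from the example above.

```python
from collections import defaultdict

SOURCE = ["S143", "Q19960422"]  # imported from WLM Italia


def create_block(b):
    """Build the CREATE block for one Pompeii building."""
    code = f'{b["regio"]}.{b["insula"]}.{b["pos"]}'
    label = f'{b["nome"]} ({code})'
    rows = [
        ["CREATE"],
        ["LAST", "Lit", f'"{label}"'],
        ["LAST", "Ait", f'"Pompei {code}"'],
        ["LAST", "Aen", f'"Pompeii {code}"'],
        ["LAST", "P17", "Q38", *SOURCE],
        ["LAST", "P131", "Q36471", *SOURCE],
        ["LAST", "P31", "Q109607"],
        ["LAST", "P1435", "Q26971668"],
        ["LAST", "P2186", f'"{b["wlm_id"]}"', *SOURCE],
        ["LAST", "P276", "Q43332", *SOURCE],
        ["LAST", "P361", b["part_of"], *SOURCE],
        ["LAST", "P528", f'"{code}"', "P972", "Q27055447"],
    ]
    return "\n".join("\t".join(row) for row in rows)


def batches_by_regio(buildings):
    """Group CREATE blocks by regio so each batch can be loaded separately."""
    grouped = defaultdict(list)
    for b in buildings:
        grouped[b["regio"]].append(create_block(b))
    return {regio: "\n".join(blocks) for regio, blocks in sorted(grouped.items())}


if __name__ == "__main__":
    # Made-up sample record with placeholder values.
    sample = [{"nome": "Casa A", "regio": "1", "insula": "2", "pos": "3",
               "wlm_id": "0000000001", "part_of": "Qxxxxxx"}]
    for regio, text in batches_by_regio(sample).items():
        print(f"# regio {regio}\n{text}")
```

Each value of the returned dictionary is one batch that can be pasted into QuickStatements on its own, so a failure in one regio does not affect the others.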

Results

There are now ca. 4,500 WLM items on Wikidata spanning more than 100 types, the five most represented being:

Most of these types were extracted directly from monument names (possibly with some text manipulation, e.g. character substitution or the use of synonyms such as "chiesa" and "chiesetta" or "palazzo comunale" and "palazzo municipale"). All the monuments belonging to each class were manually checked before insertion in order to avoid duplicates (with the exception of most of the Pompeii monuments, which were created from scratch).
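The type extraction from monument names can be illustrated with the following Python sketch. It is a simplified example, not the rules we actually applied: the synonym table and the list of known types are hypothetical and contain only the examples mentioned above plus one extra entry.

```python
import re

# Hypothetical synonym table: variant -> canonical type name.
TYPE_SYNONYMS = {
    "chiesetta": "chiesa",
    "palazzo municipale": "palazzo comunale",
}

# Hypothetical list of canonical types recognized at the start of a name.
KNOWN_TYPES = ["chiesa", "palazzo", "palazzo comunale"]


def extract_type(name):
    """Return the canonical type a monument name starts with, or None."""
    text = " ".join(name.casefold().split())
    for variant, canonical in TYPE_SYNONYMS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    # Try longer types first so "palazzo comunale" wins over "palazzo".
    for known in sorted(KNOWN_TYPES, key=len, reverse=True):
        if re.match(rf"{re.escape(known)}\b", text):
            return known
    return None


if __name__ == "__main__":
    for name in ["Chiesetta di San Rocco", "Palazzo Municipale",
                 "Fontana del Nettuno"]:
        print(name, "->", extract_type(name))
```

Names for which no type is found return `None` and would have to be typed by hand, in line with the manual check described above.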

Challenges

During the migration project we faced a number of challenges.

  1. The creation of stable IDs: since 2012 monument IDs have been created within the WLM project, because there is no comprehensive, unified, national monument database (although some effort is being made now);
  2. "Noisy" data: since any municipality can propose its own monuments for the list, we had (and still have) some items that are not actually "monuments" but rather points of interest or cultural properties (even though the definition of “monument” is sometimes vague).
    1. Relevance (as a consequence): is every item from WLM lists a "notable" Wikidata item? After some discussion we opted for the “Yes, because it supports the WLM project” route, although we still had doubts about treating hamlets, main squares, and similar elements as "monuments".
    2. Manual work (especially for verification) was often needed.
    3. Time needed to agree on data structure.
  3. Discussions about the creation of new items (especially for Pompeii):
    1. When several monuments are grouped together (e.g. arcades, buildings of a hamlet, or Pompeii buildings), is every single building a “monument” on its own or is it just a part of a larger monument (i.e. "arcades", or the Pompeii site)?
    2. Since there is no official list of codes for all the buildings in Pompeii, is it possible to use a non-official (but well documented) source?
    3. How should types (and English labels/descriptions) be obtained for a building? In some cases we could extract the type from the monument name; otherwise it has to be added afterwards (but adding items without at least a type is discouraged).
    4. Question about dates: should we use Julian or Gregorian dates?
  4. Address validation: we planned to add addresses after some further verification, possibly making use of geolocation.
    1. What is the "address" of a natural resource (e.g. a park or a river)? Can the address provided by a municipality (and inserted in the WLM lists) be always considered "valid"?

Conclusions

Our conclusion is that this migration process is not technically difficult per se, but it raises many decisions and questions that have to be answered from a more general point of view. Our main results were:

  • cleaner data (especially data from the Emilia-Romagna region, thanks to their monument database);
  • cleaner, stable IDs;
  • a way to (semi-)automate creation and update of WLM-related Wikidata items from a database, possibly with custom rules;
  • a way to make the creation and update of WLM lists easier, so that information does not have to be scattered and repeated.

We have already planned some fixes and updates for the 2017 edition of WLM.

2018 continuation

See https://github.com/synapta/wikidata-mibact-luoghi-cultura

See also