Wikidata:WikiProject Cultural heritage/Reports/WLM on WD (Italy)


This report describes the process of ingesting data about Italian heritage and cultural properties from Wiki Loves Monuments project lists into Wikidata. The project was initially carried out in the summer and fall of 2016 by User:Nvitucci in coordination with User:Cristian Cenci (WMIT) and Yiyi (volunteer), and was continued in 2018 by AlessioMela, Nemo_bis (volunteer) and Laurentius (volunteer).

Data model

To be eligible for Wiki Loves Monuments Italia, Wikidata items must have the following properties:

Whenever possible, they should also have the following properties to properly identify the object:

All objects which actually participate in Wiki Loves Monuments also have:

Additionally we strive to add statements which help improve coverage and illustration of the objects, such as:

Finally, many items use some of the following properties (see a list of examples):

Less used, but potentially useful, are many other properties such as Wikidata:List of properties/art and Wikidata property for Wikivoyage.

Additionally, the OpenStreetMap element about the monument should link to Wikidata, using the wikidata: key.

WLM identifiers

Since 2015, Italian WLM identifiers have been 10-character strings like 00A0000000, where:

For instance, the first monument registered in the municipality of Abbadia San Salvatore (Q91096) has been assigned the identifier 09A0060001: 09 stands for Tuscany, A006 for Abbadia San Salvatore, and 0001 is used because it is the first monument.
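
The structure of the example identifier can be checked mechanically. The following is only a minimal sketch: the pattern (two digits, one letter plus three digits, four digits) is inferred from the 09A0060001 example alone, and the names `WlmId` and `parse_wlm_id` are hypothetical, not part of any WLM tool.

```python
import re
from typing import NamedTuple

# Assumed layout, inferred from the 09A0060001 example: two digits for the
# region, a letter plus three digits for the municipality, four digits for
# the per-municipality counter.
WLM_ID_PATTERN = re.compile(r"^(\d{2})([A-Z]\d{3})(\d{4})$")


class WlmId(NamedTuple):
    region: str
    municipality: str
    counter: int


def parse_wlm_id(identifier: str) -> WlmId:
    """Split a post-2015 WLM identifier into its three parts."""
    match = WLM_ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a post-2015 WLM identifier: {identifier!r}")
    region, municipality, counter = match.groups()
    return WlmId(region, municipality, int(counter))


print(parse_wlm_id("09A0060001"))
```

Identifiers in other formats (for example the purely numeric ones assigned before 2015) are rejected by this sketch rather than guessed at.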

A code conversion table is available.

With a SPARQL query you can list the IDs that have already been assigned in a municipality, or the IDs that have a certain prefix (the query may be slow).
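
As an illustration of the prefix lookup, a query of this kind could be built as follows. This is a sketch and not the query linked above: it assumes the WLM ID is stored in P2186 (the property used in the QuickStatements examples in this report), and `build_prefix_query` is a hypothetical helper name.

```python
def build_prefix_query(prefix: str) -> str:
    """Return a SPARQL query listing items whose WLM ID starts with prefix."""
    # Reject anything that is not plain letters and digits, so the prefix
    # cannot break out of the quoted SPARQL string literal.
    if not prefix.isalnum():
        raise ValueError("prefix must be alphanumeric")
    return (
        "SELECT ?item ?wlmId WHERE {\n"
        "  ?item wdt:P2186 ?wlmId .\n"
        f'  FILTER(STRSTARTS(?wlmId, "{prefix}"))\n'
        "}\n"
        "ORDER BY ?wlmId"
    )


# Region 09 plus municipality A006, as in the Abbadia San Salvatore example.
print(build_prefix_query("09A006"))
```

The resulting text can be pasted into the Wikidata Query Service; filtering on a string prefix scans every WLM ID, which is consistent with the remark that the query may be slow.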

A different format is used for some specific categories of monuments and for identifiers defined before 2015, which were just the 6-digit ISTAT ID (P635) of the municipality followed by a 4-digit counter and were supposed to be unique across all monuments.

For veteran trees, a unique identifier was produced by prefixing 6 alphanumeric characters to the catalog code without slashes. A "bis" has been added in case of duplicate catalog codes.

Activities in 2016

Objectives of the project

The main objectives of this project were:

  • to make the generation of WLM lists on Wikipedia easier;
  • to move as much valid information as possible to Wikidata, for easier management and interoperability with other data sources.

Source data

Our first step was to build a stable “internal” database from the lists compiled over previous editions of the Wiki Loves Monuments contest, in order to make data management easier. To do so, we had to:

  • deal with monuments that were added for some editions and whose authorization was revoked (or not renewed) for later editions;
  • deal with a legacy ID system, which could possibly cause the duplication of items;
  • clean the identifiers: some IDs had a -MIBAC suffix that was added in the past to support a custom template (see commons:Template:Italy-MiBAC-disclaimer), but had otherwise no special meaning.
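
The identifier cleanup step can be sketched as below. The function name `clean_wlm_id` is hypothetical, and the case-insensitive match and whitespace trimming are assumptions added for robustness; the report only mentions a -MIBAC suffix.

```python
def clean_wlm_id(raw_id: str) -> str:
    """Strip whitespace and a trailing -MIBAC suffix from a list identifier."""
    cleaned = raw_id.strip()
    suffix = "-MIBAC"
    # Assumption: the suffix may appear in any letter case.
    if cleaned.upper().endswith(suffix):
        cleaned = cleaned[: -len(suffix)]
    return cleaned


print(clean_wlm_id("0123456789-MIBAC"))
```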

The database was initially kept as an OpenOffice spreadsheet file in order not to disrupt the existing processes of (new) data insertion while retaining machine readability; it was later converted to a TSV file for easier processing and transformation. This database was (and is) our central source of data related to WLM, and was used to create WLM lists as well as create/update Wikidata items.

List creation

We built a tool to generate (or update) wikitext, to help create the WLM 2016 lists from the newly created database; the tool made it easier to add custom information to selected sets of monuments (e.g. to add local prize fields to monuments from selected cities) without resorting to bots. Later on we extended the tool to create (or update) Wikidata elements related to the monuments.

Wikidata items update/creation

The update and creation of Wikidata elements was carried out in phases.

  1. First of all we searched for existing Wikidata items, with a combination of automated search (e.g. by matching the monument Wikipage and/or name) and manual check. This was needed because in some cases the Wikipage, although already existing, was not related to the monument directly but rather to a set of monuments or even to the city where the monument is located; in other cases, the monument name was not found on Wikidata because it was not included in the label, or because an alias was used instead.
  2. We then updated the existing elements via the QuickStatements tool; we created the statements to be loaded (see the #QuickStatements details section) using our extended tool.
  3. We identified the elements that could be created automatically, again via the QuickStatements tool: the best showcase for this process is the list of Pompeii items (see the #QuickStatements details section), discussed extensively here.
  4. We created a template (still on a user page only) meant to fetch a single item’s information from Wikidata, with the intention of using it formally from the 2017 edition; we decided not to use it for the 2016 edition because it would have required several changes to the lists. The template can be found here: it "blends" the existing WLM template, where all the information regarding a monument is inserted manually, with another template (wrapping a Lua module) meant to extract such information from Wikidata when a Wikidata identifier is available.
  5. We also developed a Web tool to make the direct insertion of monument data in Wikidata easier, but the tool is still experimental.

QuickStatements details

The QuickStatements tool is used to batch-insert content into Wikidata. QuickStatements is “unsupervised” (i.e. the content is inserted with no formal verification other than a check of its format); since it is possible to insert unverified data, even in large amounts, care is required when adding content.

Schema

We mostly used classes and properties that already existed in Wikidata when we started (summer 2016), especially the existing WLM ID property (which comes with constraints on the use of other properties). We needed to create the Italian national heritage (Q26971668) item to use as the required value for the heritage designation (P1435) property, although we found that it might have some shortcomings (e.g. is it OK to use it for natural heritage too? Can it be used even if there is no official heritage list?).

Updating existing items

The “safest” route is to only update existing elements with further information, in our case with a WLM ID, labels and (sometimes) addresses. Example:

Qxxxxxx Lit "name"
Qxxxxxx P2186 "0123456789" S143 Q19960422
Qxxxxxx P17 Q38 S143 Q19960422
Qxxxxxx P131 Qyyyyyyyy S143 Q19960422
Qxxxxxx P1435 Q26971668 S143 Q19960422

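These statements would update the element with Q number Qxxxxxx by:

  • setting its Italian label (Lit);
  • adding its WLM ID (P2186);
  • setting its country (P17) to Italy (Q38);
  • setting the administrative territorial entity where it is located (P131) to Qyyyyyyyy;
  • setting its heritage designation (P1435) to Italian national heritage (Q26971668).

All the statements (the label aside) carry an imported from Wikimedia project (P143) reference, written S143 in the QuickStatements syntax, to state that the information has been imported from Wiki Loves Monuments Italia (Q19960422).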
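
The update commands shown above can be generated from a database row. The sketch below is not the tool used in the project: `update_statements` is a hypothetical helper, and it assumes the tab-separated version 1 QuickStatements syntax.

```python
from typing import List

WLM_ITALIA = "Q19960422"  # Wiki Loves Monuments Italia, the P143 source


def update_statements(
    qid: str, name: str, wlm_id: str, municipality_qid: str
) -> List[str]:
    """Build tab-separated QuickStatements lines for an existing item."""
    if '"' in name:
        # A double quote would break the quoted label; escaping is left out
        # of this sketch.
        raise ValueError("double quotes in names are not handled")
    source = ["S143", WLM_ITALIA]
    rows = [
        [qid, "Lit", f'"{name}"'],                  # Italian label
        [qid, "P2186", f'"{wlm_id}"', *source],     # WLM ID
        [qid, "P17", "Q38", *source],               # country: Italy
        [qid, "P131", municipality_qid, *source],   # municipality
        [qid, "P1435", "Q26971668", *source],       # heritage designation
    ]
    return ["\t".join(row) for row in rows]


# Placeholder Q numbers, as in the example above.
print("\n".join(update_statements("Qxxxxxx", "name", "0123456789", "Qyyyyyyyy")))
```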

Creating new items

The somewhat "riskier" route is to create a new Wikidata element as we did for Pompeii buildings; one should first make sure that the element does not already exist, since the risk is to create duplicates. That said, here is an example:

CREATE
LAST Lit "{nome} ({regio}.{insula}.{pos})"
LAST Ait "Pompei {regio}.{insula}.{pos}"
LAST Aen "Pompeii {regio}.{insula}.{pos}"
LAST P17 Q38 S143 Q19960422
LAST P131 Q36471 S143 Q19960422
LAST P31 Q109607
LAST P1435 Q26971668
LAST P2186 "0123456789" S143 Q19960422
LAST P276 Q43332 S143 Q19960422
LAST P361 Qxxxxxx S143 Q19960422
LAST P528 "{regio}.{insula}.{pos}" P972 Q27055447

These statements would create an item with:

plus some information specific to Pompeii items:

For Pompeii, the insertion of all the items (~2000) took several minutes. In order not to use up too many resources, we loaded the statements in chunks, one for each regio (Q26912005) of the items (there are nine).
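
Splitting the load by regio could be prepared along these lines. This is a sketch under assumptions: `batches_by_regio` is a hypothetical helper, the sample rows are made up, the syntax is assumed to be tab-separated version 1 QuickStatements, and only the label and catalog code statements are generated (the other statements of the example above are omitted for brevity).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str, str]  # (regio, insula, pos, nome)


def batches_by_regio(items: Iterable[Row]) -> Dict[str, List[str]]:
    """Group item-creation commands into one batch per regio."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for regio, insula, pos, nome in items:
        code = f"{regio}.{insula}.{pos}"
        batches[regio].extend([
            "CREATE",
            f'LAST\tLit\t"{nome} ({code})"',
            f'LAST\tP528\t"{code}"\tP972\tQ27055447',
        ])
    return dict(batches)


# Made-up sample rows, for illustration only.
sample = [("I", "1", "1", "Casa A"), ("II", "2", "3", "Casa B"),
          ("I", "4", "5", "Casa C")]
for regio, commands in batches_by_regio(sample).items():
    print(regio, len(commands))
```

Each batch can then be submitted separately, so that no single run creates all the items at once.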

Results

There are now ca. 4,500 WLM items on Wikidata spanning more than 100 types, the five most represented being:

Most of these types were extracted directly from monument names (possibly with some text manipulation, e.g. character substitution or the use of synonyms such as "chiesa" and "chiesetta" or "palazzo comunale" and "palazzo municipale"). All the monuments belonging to each class were manually checked before insertion in order to avoid duplicates (with the exception of most of the Pompeii monuments, which were created from scratch).
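
A name-based type extraction of this kind might look like the sketch below. It is not the project's actual code: the synonym pairs come from the text, but the choice of canonical form, the short `KNOWN_TYPES` list, the helper name `extract_type` and the sample names are all assumptions.

```python
from typing import Optional

# Synonym pairs quoted in the text; which form is treated as canonical is
# an assumption made for this sketch.
SYNONYMS = {
    "chiesetta": "chiesa",
    "palazzo municipale": "palazzo comunale",
}
# Hypothetical subset of types, longest first so that more specific types
# are tried before generic ones.
KNOWN_TYPES = ("palazzo comunale", "chiesa")


def extract_type(name: str) -> Optional[str]:
    """Guess a monument type from the beginning of its Italian name."""
    normalized = " ".join(name.lower().split())
    for variant, canonical in SYNONYMS.items():
        normalized = normalized.replace(variant, canonical)
    for candidate in KNOWN_TYPES:
        if normalized.startswith(candidate):
            return candidate
    return None  # type must be assigned manually


for sample in ("Chiesetta di San Rocco", "Palazzo Municipale", "Fontana Maggiore"):
    print(sample, "->", extract_type(sample))
```

Names for which no type is found still need a manual decision, in line with the manual check described above.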

Challenges

During the migration project we faced a number of challenges.

  1. The creation of stable IDs: since 2012 monument IDs have been created within the WLM project, because there is no comprehensive, unified, national monument database (although some effort is being made now);
  2. "Noisy" data: since any municipality can propose its own monuments for the list, we had (and still have) some items that are not actually "monuments" but rather points of interest or cultural properties (even though the “monument” definition is sometimes vague).
    1. Relevance (as a consequence): is every item from WLM lists a "notable" Wikidata item? After some discussions we opted for the “Yes, because it supports the WLM project” route, although we still had doubts about hamlets, main squares, and other elements to be treated as "monuments".
    2. Manual work (especially for verification) is often needed.
    3. Time needed to agree on data structure.
  3. Discussions about the creation of new items (especially for Pompeii):
    1. When several monuments are grouped together (e.g. arcades, buildings of a hamlet, or Pompeii buildings), is every single building a “monument” on its own or is it just a part of a larger monument (i.e. "arcades", or the Pompeii site)?
    2. Since there is no official list of codes for all the buildings in Pompeii, is it possible to use a non-official (but well documented) source?
    3. How should types (and English labels/description) be obtained for a building? In some cases we could extract the type from a monument name, otherwise this should be added afterwards (but the addition of items without at least a type is discouraged).
    4. Question about dates: should we use Julian or Gregorian dates?
  4. Address validation: we planned to add addresses after some further verification, possibly making use of geolocation.
    1. What is the "address" of a natural resource (e.g. a park or a river)? Can the address provided by a municipality (and inserted in the WLM lists) be always considered "valid"?

Conclusions

Our conclusion is that this migration process is not technically difficult per se, but it raises many decisions and questions that need to be answered from a more general point of view. Our main results were:

  • cleaner data (especially data from the Emilia-Romagna region, thanks to their monument database);
  • cleaner and stable IDs;
  • a way to (semi-)automate creation and update of WLM-related Wikidata items from a database, possibly with custom rules;
  • a way to make the creation and update of WLM lists easier, so that information does not have to be scattered and repeated.

We already planned to make some fixes and updates for the 2017 edition of WLM.

2018 continuation

Luoghi della cultura

Bot codebase

See: https://github.com/synapta/wikidata-mibact-luoghi-cultura

All the code of the bot is published on GitHub, with documentation (in Italian) both at a high level, to explain the flow of data, and at the operational level, to explain how to run it. The bot can be run again in the future to upload to Wikidata any new data that MIBACT publishes. This is a realistic scenario, given that during the activity the number of so-called cultural places considered went from 26,899 to 27,513. (Some of them proved to be particularly poor or duplicated and were discarded by the bot.)

Before the activity, there were 13,310 Italian monuments on Wikidata according to the query:

https://query.wikidata.org/#SELECT%20%3FidWD%20%0A%20%20%20%20WHERE%20%7B%0A%20%20%20%20%20%20%20%20%7B%20%3FidWD%20wdt%3AP2186%20%3FidWLM%20.%20%7D%20UNION%0A%20%20%20%20%20%20%20%20%7B%20%3FidWD%20wdt%3AP1435%20wd%3AQ26971668%20.%0A%20%20%20%20%20%20%20%20%20%20%20%20MINUS%20%7B%20%3FidWD%20wdt%3AP2186%20%3FidWLM%20.%20%7D%0A%20%20%20%20%20%20%20%20%7D%0A%20%20%20%20%20%20%20%20%3FidWD%20wdt%3AP17%20wd%3AQ38%0A%20%20%20%20%7D

After the activity the number increased to 34,495. Of the 27,290 edits made by the bot, 20,085 were item creations and 7,205 were updates to existing items.
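
The query is carried in the URL fragment in percent-encoded form. For readers who want to inspect it, the following sketch decodes it with the Python standard library and also checks that the two bot edit counts add up to the reported total; the URL is copied from above and only split across lines for readability.

```python
from urllib.parse import unquote

QUERY_URL = (
    "https://query.wikidata.org/#SELECT%20%3FidWD%20%0A"
    "%20%20%20%20WHERE%20%7B%0A"
    "%20%20%20%20%20%20%20%20%7B%20%3FidWD%20wdt%3AP2186%20%3FidWLM%20.%20%7D%20UNION%0A"
    "%20%20%20%20%20%20%20%20%7B%20%3FidWD%20wdt%3AP1435%20wd%3AQ26971668%20.%0A"
    "%20%20%20%20%20%20%20%20%20%20%20%20MINUS%20%7B%20%3FidWD%20wdt%3AP2186%20%3FidWLM%20.%20%7D%0A"
    "%20%20%20%20%20%20%20%20%7D%0A"
    "%20%20%20%20%20%20%20%20%3FidWD%20wdt%3AP17%20wd%3AQ38%0A"
    "%20%20%20%20%7D"
)

# Everything after '#' is the percent-encoded SPARQL text.
_, _, fragment = QUERY_URL.partition("#")
sparql = unquote(fragment)
print(sparql)

# Creations plus updates should equal the total number of bot edits.
created, updated, total_edits = 20_085, 7_205, 27_290
print(created + updated == total_edits)
```

The decoded query selects items that either have a WLM ID (P2186) or are designated as Italian national heritage (Q26971668) without a WLM ID, restricted to items whose country (P17) is Italy (Q38).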

Frontend

See: https://wlm.synapta.io

At the address above we published an interface, with a search engine over the candidate monuments for WLM, that takes a more detailed version of the previous query as its data input. The data is updated automatically with a delay of a few minutes with respect to Wikidata, so future automatic or manual entries will also appear in that table.

In the search box you can enter a municipality, a province or a region to see the monuments of that place.

Wikipedia integration

See: w:it:Progetto:Wiki_Loves_Monuments_2018/Monumenti/Piemonte/Città_metropolitana_di_Torino.

In this example page on Wikipedia we used Template:Wikidata_list to automatically create lists that were previously generated by hand, like Progetto:Wiki Loves Monuments 2017/Monumenti/Piemonte/Città metropolitana di Torino.

Export

Use a Wikidata SPARQL query to export items and their WLM-related data to a spreadsheet.

Another query can be used to export a smaller dataset with the "codice catasto" of the municipality where the object is located.

Further tweaks to the data

Data is continuously being improved. Next steps include:

Tuscany

Tuscany has had a parallel organization since 2018, summarized here.

The region produces a considerable share of the national image uploads (25-40%). As has already been stated in the past, including within WMI, it is therefore statistically wrong to analyze the Italian data as a whole: where possible, Tuscany should be compared with Italy without Tuscany, because the processes are different.

One of the reasons for the separate organization is the need for a more "wiki" approach; another is the reduction of mistakes. Originally the concern was the cleanup of massive imports, but more generally there is a need for a constant check of the process. According to the data of local volunteers, circa 10-15% of the WLM information provided by local authorities is wrong (mix-ups of different concepts, wrong properties of the places, minor mistakes in addresses). The network of volunteers carefully checks this information and verifies it with the offices; this is considered a necessary step to avoid more time-consuming corrections later.

Tuscan items have in general more properties, more links to Commons categories and more IDs. IDs related to cultural heritage on Wikidata are often produced as a result of the Tuscan Wikidata activity in the framework of WLM: Art Bonus ID (P8564), Visit Tuscany ID (P8083), Arachne building ID (P6787), Pietre della Memoria ID (P5726), BeWeb church ID (P5611), TCI destination ID (P5601).

The system has been discussed abroad, for example at the WikiData Days 2019 in Portugal.

Since 2020 the Tuscan volunteer network has been improving the OTRS system used to store the permissions, by uploading them to Commons. Currently 80% of the permissions can be entirely handled by the community (with the exception of I.D. information), with a great reduction of costs and an increase in efficiency. It is also easier now to monitor the evolution of the competition and to use such information as a reliable source for statements.

Statistics and reports

Tips

SPARQL queries can help with various tasks:

See also the recent changes connected to the WLM lists.

Statistics

For various uses:

Reports for cleanup and data improvement

See #Data_model for additional information on why these properties are important for WLM. See /Property coverage for the coverage of main properties.

Issues that are likely to create problems for Wiki Loves Monuments:

Other issues:

To support improving the data:

Images, categories and links

You can use existing data to semi-automatically add image links (P18):

Several reports help find items ripe for improvement:

See also