Wikidata:WikiProject Performing arts/Reports/Ingesting Data about Professional Theatre Troupes in Switzerland
This report describes the process of ingesting data about professional theatre companies in Switzerland.
The data source was the dataset "Professional Performing Arts Companies" (14 April 2017) provided by the Swiss Theatre Collection (STC).
The procedure followed consists of nine steps:
- Step 1: Get a thorough understanding of the data (class diagram: what are the classes? what are the properties? what are the underlying definitions?)
- Step 2: Analyze to what extent Wikidata already has the required data structures (relevant classes and properties already defined?). If necessary, create the relevant classes and properties (community discussion required). Create a mapping between the source data and the target data structure on Wikidata.
- Step 3: Check whether data formats in the source data correspond to acceptable formats on Wikidata. If not, some data cleansing may be required.
- Step 4: Check whether there is an explicit or implicit unique identifier for the data entries. If not, think about creating one (how best to go about this needs clarification).
- Step 5: Check whether a part of the data already exists on Wikidata. If yes, create a mapping in order to add data to already existing items and to avoid creating double entries in Wikidata.
- Step 6: Model the data source in Wikidata (how best to go about this needs clarification).
- Step 7: Clean up existing data on Wikidata.
- Step 8: Ingest the data, providing the source for each statement added (avoid overwriting existing data). Log conflicting values. At the beginning, proceed slowly, step by step; check which aspects of the upload process can be automated using existing tools and which aspects need to be dealt with manually. Ask community members to review the first items ingested before batch ingesting.
- Step 9: Visualize the data using the Listeria tool or SPARQL queries in order to inspect the data (quality check).
This procedure was introduced by Beat Estermann in his report: Wikidata:WikiProject Cultural heritage/Reports/Ingesting Swiss heritage institutions (Estermann 2016). Other reports also provide guidelines.
The STC dataset has its origins in a physical archive. Because of this, the original dataset contained a lot of metadata that refers to the archive's holdings and is not relevant for Wikidata. Also, some of the data was not tidy and had to be cleaned up before it could be ingested. The data was analysed by the author. Figure 1 gives an overview of the original dataset, and Figure 2 shows the elements that can be ingested into Wikidata.
There are different reasons why certain data cannot be used in Wikidata: Some columns in the original data set were void or mostly void (Geodaten, Adresse, Verband); some had too many different or inconsistent values (Attribute, Byline, Description); and in one case data quality was regarded as insufficient for ingestion by the data provider (Verband).
During the analysis the following mapping was created:
|Property in datafile||Property in Wikidata||Refers to class in WD / possible values||Remarks|
|Name Corporate Body||main label||Crazy Hotel Company||The language used in the label had yet to be decided|
|Name Corporate Body alias||also known as (Alias)||Crazy Diner Show Company||Labels, aliases and descriptions are not real properties in Wikidata.|
|Frühere Namen||no existing property; official name can be used with a qualifier||(former name)||First, a new property called "former name" was proposed by the author. Following the community discussion, it was decided that "official name" with a qualifier should be used instead.|
|ID_Nr||inventory number (P217)||accession number (Q1417099)|
|Ort||located in the administrative territorial entity (P131)||municipality of Switzerland (Q70208)|
|Kanton||located in the administrative territorial entity (P131)||canton of Switzerland (Q23058)|
|Land||country (P17)||country (Q6256)|
|Webadresse||official website (P856)||official website (Q22137034)|
|(characteristics of the items)||instance of (P31)||theatre troupe (Q2416217)||The property is not stated in the STC dataset, but should be ingested into Wikidata anyway.|
Part of the untidy data had already been removed in step 1 of this case study. Still, inconsistencies were found in the remaining data. The following examples show what sort of inconsistencies occurred in the dataset:
|Section||Example||Remark / Solution|
|Ort||Aarau ?||The question mark indicates that the authors of the STC data set are not sure whether the location actually is Aarau. This data will be left out.|
|Ort||Aarau <vor 1989 Luzern>||The data in the <> brackets gives additional information; in this case it shows the location before 1989. The data will be left out.|
|Kanton||BK||Simple spelling mistake, the authors meant BL. The data will be corrected according to the municipality.|
|Kanton||“-” or “?”||Data is unknown and will be left out.|
|Ort||Basel; Birsfelden||Two values in one cell. The data will be separated and added into the second data sheet.|
|Ort||o.O. <Ursprung: Basel>||This indicates that there is no fixed place but that the entity's origins are in Basel. There is no existing property for cases like that; furthermore, it is not essential data.|
|Ort||Basel_||Unnecessary blank spaces after the actual data.|
|Webadresse||x||This indicates that the website is not known and that the STC does not aim to search for it. In cases where the website is not known and the STC aims to provide it in the future, there is just a blank space in the cell. The data will be left out.|
|Webadresse||bitebulletdance.com||The “www.” was missing and added manually as it is important for the ingest.|
|Webadresse||z.Z. noch unter situ, bald neue Webseite||The website is being created at the moment. Not relevant data.|
|Name Corporate Body||Zürcher Kammerorchester (inklusive ZKO Opera Box)||The addition in the brackets does not belong to the label but to the description of the data|
|Name Corporate Body||,xavance les scènes||Simple spelling mistake, data will be corrected|
|Name Corporate Body||Strassentheater diverse <ohne Auswertung>||The authors meant an aggregation of several small theatre groups. As this is not an official group, the data will be left out.|
In the column “Ort”, various combinations of the examples shown in the table above were found and corrected. In most cases it was easy to interpret what the authors of the dataset meant. However, working through every single case took quite some time. While detecting the different classes of inconsistencies took little time, correcting them was more time-consuming. Certain patterns could be detected and corrected with the help of OpenRefine; others had to be corrected manually. Eventually, it was decided to move the special cases (such as items with a second canton in the column “Kanton”) into a separate file and treat them manually. Handling all the special cases in a separate file made it possible to process the normal cases more efficiently, because they were easier to handle with tools such as OpenRefine or QuickStatements.
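The pattern-based part of this clean-up can be illustrated with a short Python sketch. It is not the procedure actually used (the report relied on OpenRefine and manual edits); the function name `clean_ort` and the exact rules are assumptions derived from the example table above.

```python
import re


def clean_ort(raw):
    """Return (places, is_special_case) for one raw "Ort" cell.

    Hypothetical helper: the rules mirror the example table in this report,
    not a documented STC or OpenRefine procedure.
    """
    value = raw.strip()  # remove stray blanks, e.g. "Basel " -> "Basel"
    # Drop annotations in angle brackets, e.g. "<vor 1989 Luzern>"
    value = re.sub(r"<[^>]*>", "", value).strip()
    # Unknown place ("-", "?") or no fixed place ("o.O."): leave the data out
    if not value or value in {"-", "?"} or value.startswith("o.O."):
        return [], False
    # A trailing question mark marks an uncertain location: leave it out
    if value.endswith("?"):
        return [], False
    # Several values in one cell are separated by semicolons
    places = [part.strip() for part in value.split(";") if part.strip()]
    # Cells with more than one place are flagged for the separate file
    return places, len(places) > 1
```

A call such as `clean_ort("Basel; Birsfelden")` yields both place names and flags the row as a special case, so it can be moved to the separate file that is treated manually.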
The unique identifier given by the STC could be used in Wikidata under the property inventory number, which acts as a general identifier for objects in a database. Most of the other columns in the STC dataset referred to data that already existed in Wikidata. These columns were:
- Ort (municipalities / property: located in the administrative territorial entity)
- Kanton (canton / property: located in the administrative territorial entity)
- Land (country / property: country)
To map each entry to its corresponding Wikidata Q-number, several steps were completed. First, the items were retrieved from Wikidata using an appropriate SPARQL query. The query retrieves every item that is a Swiss municipality and also returns its municipality code.
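The query itself is not reproduced in this report. The following Python sketch shows what such a query and its conversion to CSV might look like. It assumes that the municipality code is stored under Swiss municipality code (P771) and that the public Wikidata Query Service endpoint is used; the function names and the User-Agent string are illustrative.

```python
import csv
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: P771 (Swiss municipality code) holds the municipality code.
QUERY = """
SELECT ?item ?itemLabel ?code WHERE {
  ?item wdt:P31 wd:Q70208 .            # instance of: municipality of Switzerland
  OPTIONAL { ?item wdt:P771 ?code . }  # municipality code, if present
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,fr,it,en". }
}
"""


def fetch_results(query=QUERY):
    """Send the query to the endpoint and return the parsed JSON result."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "stc-ingest-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rows_from_results(results):
    """Flatten SPARQL JSON results into (qid, label, code) tuples."""
    rows = []
    for binding in results["results"]["bindings"]:
        qid = binding["item"]["value"].rsplit("/", 1)[-1]  # keep only "Q..."
        label = binding.get("itemLabel", {}).get("value", "")
        code = binding.get("code", {}).get("value", "")
        rows.append((qid, label, code))
    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file that can be loaded into OpenRefine."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["qid", "label", "municipality_code"])
        writer.writerows(rows)
```

Calling `write_csv(rows_from_results(fetch_results()), "municipalities.csv")` would produce a file of the kind described in the next step; the file name is only an example.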
The retrieved data was then converted into a CSV file and uploaded into OpenRefine. Finally, it could be matched with the existing STC data file using an OpenRefine command. The data concerning the municipalities was not tidy or unique enough, because the name of a municipality can sometimes be confused with that of another municipality (e.g. Rüti b. Büren and Rüti ZH). Therefore, the reconcile-csv tool was used in addition; it provides a fuzzy-matching algorithm called dice (OpenKnowledgeLabs 2017). The result was the STC dataset with the corresponding Wikidata Q-numbers for all its values, but not yet for the troupes themselves (where these were already present in Wikidata).
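To give an idea of how Dice-based fuzzy matching works, here is a minimal Python sketch of the Dice coefficient on character bigrams. It is an illustration under stated assumptions, not the reconcile-csv implementation, whose tokenisation and scoring details may differ; the function names and the threshold value are hypothetical.

```python
def bigrams(text):
    """Return the set of two-character substrings of the lowercased text."""
    lowered = text.lower()
    return {lowered[i:i + 2] for i in range(len(lowered) - 1)}


def dice(a, b):
    """Dice coefficient: 2 * |X & Y| / (|X| + |Y|) on bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings shorter than two characters have no bigrams
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def best_match(name, candidates, threshold=0.5):
    """Return the best-scoring candidate, or None if below the threshold.

    The threshold of 0.5 is an arbitrary illustrative value.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: dice(name, candidate))
    return best if dice(name, best) >= threshold else None
```

A score of 1.0 means that both strings share all their bigrams, and 0.0 means that they share none. Names such as "Rüti b. Büren" and "Rüti ZH" share some bigrams but not all, which is why near matches still need a manual check.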
First, a SPARQL query was executed that retrieved every item that was an instance of theatre troupe. However, the query returned just 40 results. Most of them were theatre troupes based in Russia or its predecessor states (i.e. the Soviet Union or the Russian Empire), and none of them were Swiss troupes that matched the STC dataset. However, the author knew that there was an existing Wikidata item for the Swiss comedy troupe Divertimento. The only statement it possessed was instance of "duo". While this statement is correct, it does not describe the item's main character, which is theatre troupe. It turned out that several items from the STC dataset were already in Wikidata. They were discovered using the following two approaches:
- using Wikipedia's category pages “Theaterensemble” and “Kabarett-Ensemble”: returned 21 items
- manual search for well-known names: returned 8 items
This made it possible to match at least some of the existing Wikidata items to the dataset. However, the method used does not guarantee that no duplicate entries were created. Duplicates could only have been ruled out by searching manually for every item from the STC dataset, which would have been very time-consuming.
The source of the STC dataset was the “Schweizerische Theatersammlung”. An item already existed, but it represents the institution and not the actual data. Therefore, a new item was created which represents the data source (i.e. the archive) in the Schweizerische Theatersammlung.
Most of the 29 existing Wikidata items detected were poorly described, often with just a single statement. Sometimes there was just the label. However, there was hardly any erroneous data, and it took only a small amount of time to correct it. The existing data was afterwards extended with further statements during step 8.
As in an earlier case study, the code for the QuickStatements tool was created using the Microsoft Mail Merge function. This time, however, QuickStatements V2 was used directly, which resulted in a small change of syntax.
The ingestion was again done iteratively, with a small number of items at first and larger batches in the subsequent steps, so that errors could be handled progressively. As shown in the picture on the right-hand side, QuickStatements V2 was unable to ingest items with the following combination of statements: the item has an official name, and the official name has an end time with an unknown value. When adding a statement with an end time in QuickStatements, it is mandatory to specify an exact time. While it is possible to add an official name with an end time of unknown value manually, this feature has not (yet) been implemented in QuickStatements V2.
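The mail-merge step can also be sketched in Python. The example below assumes the tab-separated QuickStatements command format (`CREATE` / `LAST`, `S248` for a "stated in" reference, year-precision dates ending in `/9`); the commands actually used in this ingest are not reproduced in the report. The function name, its parameters and the handling of former names are hypothetical, and `source_qid` stands for the item representing the data source.

```python
def troupe_commands(label, lang, inventory_no, source_qid, former_names=()):
    """Build QuickStatements lines for one new theatre troupe item.

    former_names is a sequence of (name, end_year) pairs; end_year may be None.
    Returns (command_lines, names_for_manual_entry).
    Note: double quotes inside labels or names are not escaped in this sketch.
    """
    ref = f"\tS248\t{source_qid}"  # reference: stated in <data source item>
    lines = [
        "CREATE",
        f'LAST\tL{lang}\t"{label}"',           # label in the given language
        f"LAST\tP31\tQ2416217{ref}",           # instance of: theatre troupe
        f'LAST\tP217\t"{inventory_no}"{ref}',  # inventory number
    ]
    manual = []
    for name, end_year in former_names:
        if end_year is None:
            # An end time of unknown value could not be ingested with
            # QuickStatements V2, so the name is kept for manual entry.
            manual.append(name)
            continue
        end_time = f"+{end_year:04d}-00-00T00:00:00Z/9"  # year precision
        lines.append(
            f'LAST\tP1448\t{lang}:"{name}"\tP582\t{end_time}{ref}'
        )
    return lines, manual
```

Separating the names without a known end year keeps the batch free of statements that QuickStatements V2 rejected, while preserving them for manual entry.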
Issues Related to the Ingest
- The references to the source data file should be corrected (see note regarding step 6).
- Some of the entries do not refer to theatre troupes (e.g. Radiotelevisione svizzera di lingua italiana (Q29948862) or Orchestra della Svizzera Italiana (Q29948851)); the entire list of ingested theatre troupes should be checked for erroneous attributions of <instance of> <theatre troupe>. Note that the original data file is supposed to contain professional performing arts companies, including dance troupes.