How can, do and should data quality aspects be integrated into current and future Wikidata workflows?

Data quality is a crucial feature of any collection of data. It has multiple dimensions that range from accuracy to completeness. In this contribution, I will consider how it maps onto workflows on and around Wikidata, for example the quality of:

  • data models, data, metadata and documentation at the source
  • the mapping of data models, data and metadata between the source and Wikidata, and the documentation thereof
  • data models, data, metadata and documentation in the respective area on Wikidata
  • workflows for any of the above, including quality checks and communication about issues with affected stakeholders

Since this talk is meant as a summary of the day's activities (which have yet to take place as I am preparing for it), there will likely be some quite last-minute changes to this page.

Tour plan

Some impressions from the tour

Screenshot of the Wikidata item Workshop on Data Quality Management in Wikidata (Q59426297). Note the constraint violation warning at the title statement.
Screenshot of constraint violation warning for title statement on item Workshop on Data Quality Management in Wikidata (Q59426297), as of 18 January 2019. Note the cache delay.
Screenshot of the Scholia event page for Workshop on Data Quality Management in Wikidata (Q59426297) as of 18 January 2019. Is the co-author network complete? Which related events are missing? Note that Wikidata (Q2013) and data quality (Q1757694) come up on top of the "Topic scores" panel, which makes sense. Let's have a look at them together.
Screenshot of Scholia topic comparison for Wikidata (Q2013) and data quality (Q1757694). All the presentations from the workshop are missing. Note the gap at "publication date" for our event, which hints at some issues in data modeling and/or the query. The only actual work that combines both topics and is listed here is Recoin: Relative Completeness in Wikidata (Q51902990). Checking out its Scholia profile provides a link to the full text but is not very informative otherwise, except for highlighting that the citation graph around this work is rather incomplete, so we'll skip it and look at the two topics separately.
Screenshot of the Scholia topic page for Wikidata (Q2013). Nice mix of publication types. Some islands in the co-author graph indicate potential for collaboration. Some items could be annotated in more detail in terms of their topics, which would enrich the co-occurring topics graph, which looks to be on the verge of becoming useful. Vienna as the only geolocation associated with Wikidata seems odd, but that data point is valid, as per Querying Wikidata About Vienna’s Tram Lines: An Introduction to SPARQL (Q49264352). The top venues do not include journals. Would something like a Wikidata Journal make sense? Moving on to data quality (Q1757694).
Screenshot of the Scholia topic page for data quality (Q1757694). The top panel shows that topic annotation is rather incomplete. Publication types are highly skewed towards journal articles. Authors are skewed towards people with ORCID iD (P496). The co-author graph is basically just a set of islands, with lots of connections presumably missing. The co-occurring topics mostly make sense, though Malawi is a bit of a surprise. The co-occurring topics graph also hints at rather incomplete topic annotation, but the map shows that some basic annotation is there, though much more should probably be expected. Between the top authors and top venues, there is a little "missing" link that leads to the topic's "missing" page with guidance on where curation is needed. Let's have a look.
Scholia's missing page for data quality (Q1757694). The "Katherine" and "Kathy" with the same surname are probably the same person, so those papers could be mapped to the same item. Needs a bit of checking though, so we'll skip that for now and go back to the topic profile for data quality (Q1757694) to look at the top venue and top author that were identified in the panels just above and below the "missing" link. Clicking on the link for the top venue leads to a basically empty page where it is displayed as an event series, which hints at some data modeling issue. Clicking from there onto the "venue" button leads to the venue page for Studies in Health Technology and Informatics (Q15817805).
Scholia venue page for Studies in Health Technology and Informatics (Q15817805). It clearly shows that lots of authors remain to be identified, and that many of the identified ones do not have images associated with them. data quality (Q1757694) barely makes it into the top 10 topics, but quite a few related topics (e.g. interoperability (Q749647)) do. The two panels on the bottom also highlight gender bias. Venue pages also have "missing" pages associated with them, although the feature here is rather experimental and not yet linked. Anyway, let's go there too.
Scholia's missing page for the venue of Studies in Health Technology and Informatics (Q15817805). The top ten author name strings occur at least 42 times, so there is a lot of need for author curation. Let's now go back to the topic profile for data quality (Q1757694) to look at the top author.
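A count of unresolved author name strings like the one above can be approximated with a query against the Wikidata Query Service. The following is a minimal sketch, not Scholia's actual query: the function name `author_name_string_counts_query` is hypothetical, and the query assumes that works are linked to the venue via published in (P1433) and carry unresolved authors as author name string (P2093). It only builds the query text, which can then be pasted into the query service.

```python
def author_name_string_counts_query(venue_qid: str, limit: int = 10) -> str:
    """Build a SPARQL query that counts author name strings for a venue.

    Hypothetical helper for illustration; Scholia's own query may differ.
    """
    # Accept only plain item identifiers such as "Q15817805".
    if not (venue_qid.startswith("Q") and venue_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {venue_qid!r}")
    return f"""
SELECT ?name (COUNT(DISTINCT ?work) AS ?works) WHERE {{
  ?work wdt:P1433 wd:{venue_qid} ;   # published in the venue
        wdt:P2093 ?name .            # author name string, not yet linked to an item
}}
GROUP BY ?name
ORDER BY DESC(?works)
LIMIT {limit}
""".strip()


# Studies in Health Technology and Informatics (Q15817805)
query = author_name_string_counts_query("Q15817805")
print(query)
```

Each row of the result is one name string with the number of works it appears on, so the most frequent strings are the most rewarding ones to resolve first.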
Scholia author page for Simon de Lusignan (Q38640632). Some publications (usually preprints) are missing a published in (P1433) statement, and none of their publications have the number of pages annotated. No image of the author themselves, but the images that do pop up seem relevant. No information about their academic tree, and nothing in terms of affiliations or awards in their timeline. Otherwise, the page seems to have some basic content, and data quality (Q1757694) features prominently, along with other data-related concepts and with medical ones. Author pages also have a "missing" page, and it is linked from the bottom, so let's go there next.
Scholia's missing page for the author Simon de Lusignan (Q38640632). 82 papers where author name strings need to be resolved. They all seem to refer to the same person, but come in several flavours, including a typo. Of note, the top panel lists 55 entries for the most frequent string, while the same string is listed 54 times in the third panel, which hints at the completeness of the citation graph around the author. Lots of co-authors and citing authors remain to be identified, and 171 of their works do not have a main subject (P921) statement, which means the topic visualizations on their author profile page are probably quite skewed.
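Gaps of this kind, such as works without a main subject (P921) or published in (P1433) statement, can be listed with a simple "missing statement" query. The sketch below is an illustration under stated assumptions rather than Scholia's implementation: the helper name `works_missing_property_query` is hypothetical, and the query only finds works already linked to the author item via author (P50), so works that still carry only an author name string are not covered.

```python
import re


def works_missing_property_query(author_qid: str, property_pid: str) -> str:
    """Build a SPARQL query listing an author's works that lack a property.

    Hypothetical helper for illustration; Scholia's own query may differ.
    """
    # Validate identifiers so that only well-formed Q/P IDs enter the query.
    if not re.fullmatch(r"Q\d+", author_qid):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    if not re.fullmatch(r"P\d+", property_pid):
        raise ValueError(f"not a Wikidata property ID: {property_pid!r}")
    return f"""
SELECT ?work ?workLabel WHERE {{
  ?work wdt:P50 wd:{author_qid} .                          # works by the author
  FILTER NOT EXISTS {{ ?work wdt:{property_pid} [] }}       # no statement for the property
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


# Works by Simon de Lusignan (Q38640632) without a main subject (P921)
missing_subject = works_missing_property_query("Q38640632", "P921")
print(missing_subject)
```

Calling the same function with "P1433" instead of "P921" would list the works lacking a published in statement.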

This page hosts a contribution to the Workshop on Data Quality Management in Wikidata (Q59426297) held on 18 January 2019 in Berlin (Q64).