How can, do and should data quality aspects be integrated into current and future Wikidata workflows?
Data quality is a crucial feature of any collection of data. It has multiple dimensions that range from accuracy to completeness, and in this contribution, I will consider how it maps onto workflows on and around Wikidata, e.g. the quality of
data models / data / metadata / documentation at source
data models / data / metadata mapping between source and Wikidata, and documentation thereof
data models / data / metadata / documentation in the respective area on Wikidata
workflows for any of the above, including quality checks and communication about issues with affected stakeholders
Since this talk is meant as a summary of the day's activities (which have yet to take place as I am preparing for it), there will likely be some quite last-minute changes to this page.
Screenshot of the Scholia topic page for data quality (Q1757694). Top panel shows that topic annotation is rather incomplete. Publication types are highly skewed towards journal articles. Authors are skewed towards people with ORCID iD (P496). The co-author graph is basically just a set of islands, with lots of connections presumably missing. The co-occurring topics mostly make sense, though Malawi is a bit of a surprise. The co-occurring topics graph also hints at rather incomplete topic annotation, but the map shows that some basic annotation is there, though much more should probably be expected. Between the top authors and top venues, there is a little "missing" link that leads to the topic's "missing" page with guidance on where curation is needed. Let's have a look.
Scholia's missing page for data quality (Q1757694). The "Katherine" and "Kathy" with the same surname are probably the same person, so those papers could be mapped to the same item. Needs a bit of checking though, so we'll skip that for now and go back to the topic profile for data quality (Q1757694) to look at the top venue and top author that were identified in the panels just above and below the "missing" link. Clicking on the link for the top venue leads to a basically empty page where it is displayed as an event series, which hints at some data modeling issue. Clicking from there onto the "venue" button leads to the venue page for Studies in Health Technology and Informatics (Q15817805).
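The "Katherine" / "Kathy" case above is the kind of candidate that a simple heuristic could flag for manual review. The following is a minimal sketch in Python, assuming the author name strings are available as plain strings. The function name `candidate_duplicates` and the sample names are hypothetical. Grouping by surname and first initial is just one possible heuristic and is not claimed to be what Scholia does.

```python
from collections import defaultdict


def candidate_duplicates(names):
    """Group name strings that share a surname and a first initial.

    Hypothetical helper: assumes 'Given [Middle] Surname' order and
    treats the last whitespace-separated token as the surname.
    """
    groups = defaultdict(set)
    for name in names:
        parts = name.split()
        if len(parts) < 2:
            continue  # a single token cannot be split into given name and surname
        key = (parts[-1].casefold(), parts[0][0].casefold())
        groups[key].add(name)
    # Only groups with more than one distinct spelling are worth a manual check.
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Made-up sample names, for illustration only.
sample = ["Katherine Example", "Kathy Example", "John Example", "Kathy Other"]
print(candidate_duplicates(sample))
```

Any group returned this way is only a suggestion: as noted above, it still needs a bit of checking before the papers are mapped to the same item.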
Scholia author page for Simon de Lusignan (Q38640632). Some publications (usually preprints) are missing a published in (P1433) statement, and none of their publications have the number of pages annotated. No image of the author themselves, but the images that do pop up seem relevant. No information about their academic tree, and nothing in terms of affiliations or awards in their timeline. Otherwise, the page seems to have some basic content, and data quality (Q1757694) features prominently, along with other data-related concepts and with medical ones. Author pages also have a "missing" page, and it is linked from the bottom, so let's go there next.
Scholia's missing page for the author Simon de Lusignan (Q38640632). 82 papers where author name strings need to be resolved. They all seem to refer to the same person, but come in several flavours, including a typo. Of note, the top panel lists 55 entries for the most frequent string, while the same string is listed 54 times in the third panel, which hints at the completeness of the citation graph around the author. Lots of co-authors and citing authors remain to be identified, and 171 of their works do not have a main subject (P921) statement, which means the topic visualizations on their author profile page are probably quite skewed.
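Gaps like the missing main subject (P921) or published in (P1433) statements can also be listed with a SPARQL query. The sketch below only builds the query text in Python and does not send it anywhere. It assumes that works are linked to the author via author (P50) and that the query is run on a service where the `wd:`, `wdt:`, `wikibase:` and `bd:` prefixes are predefined, as on the Wikidata Query Service. The function name `works_missing_property` and the default limit are hypothetical, and this is not claimed to be the query that Scholia itself uses.

```python
import re


def works_missing_property(author_qid, property_id, limit=200):
    """Build a SPARQL query for an author's works that lack a given property."""
    # Validate the identifiers so that nothing unexpected ends up in the query text.
    if not re.fullmatch(r"Q[1-9]\d*", author_qid):
        raise ValueError(f"not a Wikidata item ID: {author_qid!r}")
    if not re.fullmatch(r"P[1-9]\d*", property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return (
        "SELECT ?work ?workLabel WHERE {\n"
        f"  ?work wdt:P50 wd:{author_qid} .\n"
        f"  FILTER NOT EXISTS {{ ?work wdt:{property_id} [] }}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Works by Simon de Lusignan (Q38640632) without a main subject (P921) statement.
print(works_missing_property("Q38640632", "P921"))
```

Passing "P1433" instead of "P921" would list the works without a published in statement. Works where the author is still only recorded as a name string would not be matched by this sketch, so the result understates the gaps as long as those strings are unresolved.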