Wikidata:Events/Data Quality Days 2021
- The Data Quality Days were a series of gatherings that took place from September 8th to 15th, 2021, focusing on data quality on Wikidata. With presentations, discussions, editing sprints and more, the goals of this event were:
- to start some discussions about data quality and highlight this topic through various angles
- to explore what data quality means in different areas of Wikidata
- to bring together people who are working on data quality on Wikidata and who want to contribute
- to highlight and create tools that can be useful when working on data quality
- The Data Quality Days 2021 have now concluded. Below you can find an overview of the events with their descriptions, slides, notes and, where available, recordings. On the outcomes page, you can find the interesting things that happened during the event (if you participated, feel free to add yours!).
|📆 Day and time (GMT/UTC)||⏰ Duration||💬 Title & short description||ℹ️ Type||👥 Facilitator(s)||🌐 Main language||⏯️ Access/replay & 🖋 notes||Number of participants|
Data quality: what is it and why is it important?
|Presentation + discussion||Lydia Pintscher, Manuel Merz, Alessandro Piscopo||English||35|
|30min||Bibliometric-Enhanced Information Retrieval: A new alternative for the validation and enrichment of Wikidata Statements
In this brief presentation, I will explain how bibliographic metadata of scholarly publications can be used to verify, validate and enrich specialized knowledge in Wikidata through several practical examples. I will also show RefB, a bot that adds reference support to biomedical Wikidata statements based on PubMed Central queries, as a practical example of how Wikidata can be enriched and sustained by leveraging open bibliographic data.
|Presentation + discussion||Houcemeddine Turki||English||Notes||18|
|60min||Structuring the world’s knowledge: Socio-technical processes and data quality in Wikidata.
In his talk, Alessandro presents his research about the socio-technical fabric of Wikidata and how this affects the quality of its data, looking in particular at three aspects: quality of provenance and ontological data; algorithmic (bot) contributions; emerging editor activity patterns.
|Presentation + discussion||Alessandro Piscopo||English||19|
Editing session to expand Help:Deduplication, about identifying and removing duplicates. Contribute by adding more information to the help page, expanding its item, or raising open questions on its talk page.
|editing session||n/a (Jura1)||English||Help:Deduplication
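As an illustrative sketch (not part of the session materials), one common way to surface duplicate candidates is to look for pairs of items that share the same external identifier, here the VIAF ID (P214):

```sparql
# Find pairs of distinct items that share a VIAF ID (P214),
# a frequent symptom of duplicate items.
SELECT ?item1 ?item2 ?viaf WHERE {
  ?item1 wdt:P214 ?viaf .
  ?item2 wdt:P214 ?viaf .
  FILTER (STR(?item1) < STR(?item2))   # avoid listing each pair twice
}
LIMIT 100
```

The same pattern works with any identifier property that is expected to be unique per item.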
|60min||ORES: Using AI for quality control in Wikidata
In this session we will explain what ORES is and how it can be used for data quality work.
|Presentation + discussion||Lydia Pintscher, Ladsgroup||English||Notes||22|
|60min||EntitySchemas and Shape Expressions on Wikidata
This session consists of three parts: first, a 15-minute introduction to ShEx; then a 30-minute hands-on session in which a new entity schema will be drafted (we take requests); in the final 15 minutes, Eric will discuss possible directions.
|Presentation/Handson/Discussion||Andra Waagmeester, Eric Prud'hommeaux, Katherine Thornton, Jose Emilio Labra Gayo||English||Notes - Slides||29|
Editing session to review and expand Help:Ranking. Ranking is a key concept of Wikidata to integrate multiple and evolving views about reality. Contribute to expand the help page by adding more information or by commenting on its talk page.
|editing session||n/a (Jura1)||English||Help:Ranking
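For readers new to ranks, a sketch of how they surface in SPARQL (using Berlin, Q64, and population, P1082, as an arbitrary example): the truthy wdt: prefix returns only best-rank statements, while the p:/ps: path exposes every statement together with its rank.

```sparql
# All population statements for Berlin (Q64) with their ranks;
# wdt:P1082 would return only the best-rank ("truthy") value.
SELECT ?population ?rank WHERE {
  wd:Q64 p:P1082 ?statement .
  ?statement ps:P1082 ?population ;
             wikibase:rank ?rank .
}
```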
|60min||Mismatch finder and beyond: How can we incorporate feedback from our biggest data re-users at scale?
In this session we will take a look at why it is important to get more data quality feedback from outside and how we currently think about it. We'll show how the upcoming Mismatch Finder fits into it and discuss how to go from there.
|Presentation + discussion||Lydia Pintscher, Manuel Merz||English||Notes||9|
|45min||Bringing Czech authority files into 21st century: Integration with Wikidata
Fifteen years have already passed since the first collaboration between (then) Wikipedians and the Czech National Library. Their database of authority files stands at the crossroads of all data related to Czechia: bibliographical, personal, geographical, etc. Over the years, we have learned how to link their entries to Wikidata items, display mutual links and enrich authority files with automatic links to ISNI and ORCID, export MARC files into a CC0-licensed wikibase and run various events aimed at promoting the cross-pollination between the worlds of libraries and Wikidata. Many ideas can likely be replicated worldwide.
|30 min Presentation + 15 min Q&A||Vojtěch Dostál||English||Video|
Editing session to review and expand Help:Dates. Adding and querying dates may seem simple, but available precision and changing calendars add complexity. Contribute by adding more information to the help page or raising open questions on its talk page.
|editing session||n/a (Jura1)||English||Help:Dates
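As a sketch of the precision and calendar issues mentioned above, the full value node of a date exposes its stated precision (9 = year, 10 = month, 11 = day) and calendar model, here for Johann Sebastian Bach (Q1339), chosen as an arbitrary example:

```sparql
# Date of birth of J. S. Bach (Q1339) with its stated precision
# and calendar model (dates this old often use the Julian calendar).
SELECT ?dob ?precision ?calendar WHERE {
  wd:Q1339 p:P569/psv:P569 ?value .
  ?value wikibase:timeValue ?dob ;
         wikibase:timePrecision ?precision ;
         wikibase:timeCalendarModel ?calendar .
}
```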
|60min||Wikidata Live Editing
Property Constraints: What to do when you see warnings, how to improve the constraints and how to query using constraints.
|Exploration and how-to||Abbe98, Ainali||English||YouTube, Facebook|
|all day||Checks after upload
Editing session to review and expand Help:Checks after upload. Once a dataset is uploaded to Wikidata, what to do? Contribute to expand the help page by adding checks you found useful.
|editing session||n/a (Jura1)||English||Help:Checks after upload
Help talk:Checks after upload
Setting property constraints. Follows on from the 'Live Editing' session above. This will focus on hands-on work by all attendees (bring your favorite property ID!)
|Editing (with brief introduction at the start)||Abián, Mike Peel||English|
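As a possible starting point for attendees (a sketch, not part of the session plan): constraint definitions are ordinary statements using "property constraint" (P2302), so the constraints already set on a property, here date of birth (P569) as an arbitrary example, can themselves be queried:

```sparql
# List the constraint types defined on date of birth (P569).
SELECT ?constraintType ?constraintTypeLabel WHERE {
  wd:P569 p:P2302 ?statement .
  ?statement ps:P2302 ?constraintType .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```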
|45min||Periodic editathons as a way to improve data quality: an experiment in Italy
From March 2021 the Gruppo Wikidata per Musei, Archivi e Biblioteche (GWMAB) has been organizing a series of monthly editathons, aiming to involve Italian-speaking users in the improvement of the data quality of items of authors whose works are present in Italian libraries. This presentation will show the organization of the editathons and the results achieved; ideas and proposals of improvements are welcome in the discussion.
|30 min Presentation + 15 min Q&A||Epìdosis||English|
|60min||Overview of ontology issues
We looked into different types of ontology issues in our data and tried to come up with a classification. We'll present the current state and would love your feedback to understand if it is meaningful and helpful for further work.
|Presentation and discussion||Silvan Heintze, Lydia Pintscher||English||Notes||29|
|30min||Clarifying property application for effective SPARQL queries
Using SPARQL queries to inform and formulate best practices for applying properties and qualifiers in Wikidata.
At the Smithsonian we are trying to model top leadership with the job title "director", who may be the director of a named museum, a unit within a museum, or an independent organization such as the Center for Folklife and Cultural Heritage. Currently in Wikidata, people with similar positions are modeled under "museum director" (Q22132694) and "director" (Q1162163) interchangeably, in statements with properties including occupation (P106) and position held (P39), and as qualifiers under the employer (P108) organization. There are multiple ways of representing the same information depending on the editor's understanding of the definition of various related properties, and this complicates the writing of SPARQL queries. Attempts to compile a list of such directors proved very complex given the wide variety of ways these individuals are recorded in Wikidata. We are seeking to get a better sense of the consensus around how this type of individual's job title should be described in Wikidata.
|Short presentation then discussions||Jackie Shieh (ShiehJ), Smithsonian Libraries & Archives
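To illustrate the modeling problem described above (a sketch, not the group's agreed query), a query that must cover both interchangeable modelings can union over occupation (P106) and position held (P39):

```sparql
# People recorded as "museum director" (Q22132694) or "director"
# (Q1162163) under either of the two competing modelings.
SELECT DISTINCT ?person ?personLabel WHERE {
  VALUES ?role { wd:Q22132694 wd:Q1162163 }
  { ?person wdt:P106 ?role . }
  UNION
  { ?person wdt:P39 ?role . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
```

Every additional modeling variant adds another UNION branch, which is exactly why agreeing on one representation matters for query writers.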
|60min||Bug triage hour
We will be looking at tickets related to data quality and maintenance. Bring your tickets for bugs or new features you'd like to discuss.
|Discussion||Lydia Pintscher, Manuel Merz||English||Notes||10|
|60min||Cross-checking on-wiki: visibility, duplication, migration
We will talk about increasing data quality by increasing Wikidata's on-wiki visibility (e.g., through infoboxes). We will also discuss the benefits and drawbacks of duplicating data on Wikidata and other Wikimedia projects, and of migrating information to Wikidata, with a focus on Commons category links. This will include hands-on editing of some tricky links between enwiki and Commons via Wikidata.
|Presentation/discussion (20 mins) + hands-on (40+ mins)
|60min||Discover patrolling & quality tools
Let's talk about tools! You're welcome to present the tools that you use or develop, or join to discover new interesting tools. Presentations/demos should be short, maximum 10min per tool. If you plan to present a tool, feel free to add it here.
|Discussion||Mohammed Sadat + anyone who wants to show a tool||English||Notes||19|
What did we learn during the Data Quality Days? Any outcomes to share? How do we want to move forward from there? What are your ideas and wishes to improve data quality on Wikidata?
|Discussion||Lydia Pintscher, Manuel Merz||English||15|
Feel free to add more useful documents, pages or videos here.
- Wikidata:WikiProject Data Quality
- Wikidata quality a data consumers' perspective (WikidataCon 2017)
- Workshop on Data Quality Management in Wikidata (Q59426297)
- Data quality panel (WikidataCon 2019)
- Kartik Shenoy; Filip Ilievski; Daniel Garijo; Daniel Schwabe; Pedro Szekely (30 June 2021), A Study of the Quality of Wikidata, arXiv:2107.00156, Wikidata Q107425133
- data quality (Q1757694)
Bibliometric-Enhanced Information Retrieval
- Houcemeddine Turki (11 November 2017). "Citation analysis is also useful to assess the eligibility of biomedical research works for inclusion in living systematic reviews". Journal of Clinical Epidemiology. doi:10.1016/J.JCLINEPI.2017.11.002. ISSN 0895-4356. PMID 29138103. Wikidata Q50017108.
- Houcemeddine Turki; Mohamed Ali Hadj Taieb; Mohamed Ben Aouicha (July 2018). "MeSH qualifiers, publication types and relation occurrence frequency are also useful for a better sentence-level extraction of biomedical relations". Journal of Biomedical Informatics. 83: 217–218. doi:10.1016/J.JBI.2018.05.011. ISSN 1532-0464. PMID 29857138. Wikidata Q61939808.
- Houcemeddine Turki; Mohamed Ali Hadj Taieb; Mohamed Ben Aouicha; Grischa Fraumann; Christian Hauschke; Lambert Heller (2021). Philipp Mayr (ed.). "Enhancing Knowledge Graph Extraction and Validation From Scholarly Publications Using Bibliographic Metadata". Frontiers in Research Metrics and Analytics. 6: 6. doi:10.3389/FRMA.2021.694307. ISSN 2504-0537. PMC 8194279. PMID 34124535. Wikidata Q107291918.
If you plan to join one or several events, or to work on projects related to data quality during this period, you can sign up here!
Feel free to indicate your username, the languages that you speak, and the topics you're interested in.
- Lea Lacroix (WMDE) - fr, en, de, it - I'd like to understand more about how the Wikidata community maintains and improves data quality
- Epìdosis - it, en - as of now I'm involved in some projects for improving the connection to Wikidata of authority files from libraries in Italy and other parts of Europe
- PKM - en - I'm interested in using tools to improve network connections among Wikidata items and between Wikidata and Commons, which makes it easier to identify data problems
- so9q - en, sv - I'm interested in trying out the Mismatch Finder and finding mismatches with tools written in Python 😃
- GoranSM - en,sr - Data Scientist for Wikidata. I would like to contribute any data products that might help us understand data quality in Wikidata!
- Mike Peel (talk) - en (and a little es, pt) - I've added a possible session to the etherpad.
- Oravrattas - en - interested in improving how we keep political data up-to-date
- Bouzinac - fr, en - committed to keeping transport data up to date (airports, subways...) and to general consistency (finding/merging duplicates...)
- Csisc - aeb, ar, fr, en, it - I am interested in improving the data validation of Wikidata through several novel methods.
- Jura1 - en - TBD
- ShiehJ - en - interested in best practices for applying properties and constraints, data quality assurance workflows, and the potential of Wikidata:Schemas for more granular user account control
- LAP959 - en - interested in ensuring stable, harmonious and robust data that are inclusive, easy to use and visualise
- Akbarali - ml, ar, en - I would like to know more about how Wikidata files can be used in academia.
- Luckyz - en, it - TBD
- Vladimir_Alexiev - en,bg,ru - I've made 3.5M edits and I'm very frustrated with WD's update performance. There can be no quality without the ability to easily make edits! https://phabricator.wikimedia.org/T290061
- Kpjas - en,pl - I believe that referenced data is the crux of what Wikidata represents.
- Lydia Pintscher (WMDE) - en, de - I'd love to discuss all things data quality, show what we already have and especially understand what's still missing.
- Justin0x2004 - en - I'm interested in increasing modeling/representation uniformity.
- MisterSynergy - en, de - counter-vandalism, patrolling, ORES; ideas on how to discuss, as well as comprehensively document and visualize, preferable data models for a given problem/field
- Azertus - en, nl, fr - data quality, property constraints and documentation, meta documentation in general
- Score Beethoven - en, nl, fr - data quality, data bulk input, output, scripts for manipulation, etc
- Loz.ross - en, de, bg - interested in the use of Entity Schemas and related tools in maintaining data quality in Wikibase instances in general (as well as Wikidata); also: ontology standards; bulk input; synchronization across knowledge graph repos, etc
- Jmkeil - de, en - interested in RDF data quality, especially RDF dataset comparison for quality evaluation and want to learn about the mismatch finder tool to get an idea of possible incorporation with my (in development) RDF dataset comparison tool
- Ambrosia10 - en - I'm likely to just concentrate on actual editing, using tools such as the author disambiguator, mixnmatch and citeunseen gadgets to improve the quality of existing items. I'm interested in any other tools or gadgets I can use to improve the quality of existing Wikidata items.
- Dnshitobu - en, dag - interested in learning about data quality, bulk data modeling, and how to organize government institutional data to allow for free and accessible information
- Sradovsk - en - Interested in learning, period, through watching presentations, participating in hands-on activities.
- Jelabra - es, en - I am interested in the use of ShEx and Entity Schemas in Wikidata
- 99of9 - en - I want more identifier properties, which are all natural references and quality improvers. I very much use and value constraints, and want more of them. I want to be able to write a custom SPARQL query that will be run regularly at a set interval with the result tabulated in wikitext. I happily contribute and appreciate the quality contributions of others.
- Daniel Mietchen - ru, fr, de, en - interested in (i) internal consistency (logically and across languages or knowledge domains), (ii) consistency with external resources, (iii) workflows to propagate curation events between Wikidata and external resources, (iv) tools, workflows and documentation of any of that
- Mccoyle55 - en - TBD
- Sbae2020 - en - TBD
- Fuzheado - en - data quality on objects, specifically prints, photographs, statues and reproductions, where we are not doing so well
- Antoine2711 - fr, en - TBD
- Memathieu - fr, en, es - TBD
- Girassolei - pt, en
- GiFontenelle - pt-br, pt, en, es - GLAM-related sessions, especially with bibliographic data.
- Dbigwood - en - tools to use to ensure better bibliographic data
- Hjfocs - it, fr, es, en - feedback loops with data providers and re-users
- Aisha Khatun - en, bn - interested in understanding how ORES is used in Wikidata
- Manuel Merz (WMDE) - en, de - I would like to get to know other people working on data quality and better understand what WMDE can do to help.
- Oronsay - en - I am merging/adding P1889, adding gender to humans, doing mixnmatch and using the Bargioni add more identifiers script
- Hsarrazin - fr, en - working on biographic data and portraits, authorities, and book editions (cross-wiki with Wikisource).