Wikidata:Mismatch Finder/Collaboration/Purdue Summer of Data 2024
Wikidata, with over 2 billion edits, has become the most edited wiki globally. With its rapid growth have also come errors that are hard to keep track of. Discrepancies between Wikidata and trusted external sources have been identified, and these could propagate to downstream projects, including applications, national records, search engines and other Wikimedia projects like Wikipedia.
To tackle this, the Mismatch Finder was established as a space where downstream projects and the broader Wikimedia community can report disparities between Wikidata and other trusted data sources. Mismatches are then visible on Wikidata, allowing editors to verify and correct them on Wikidata or the source.
This project’s goal is to deliver new, useful mismatches for the Mismatch Finder.
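To make the idea of a mismatch concrete, the following minimal sketch compares values recorded on Wikidata with values from an external source and reports the pairs that differ. It is an illustration only, not the Mismatch Finder's actual implementation or import format: the `Mismatch` class, the `find_mismatches` function, the field names and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    """One disagreement between Wikidata and an external source (hypothetical shape)."""
    item_id: str
    property_id: str
    wikidata_value: str
    external_value: str


def find_mismatches(wikidata, external):
    """Return a Mismatch for every (item, property) key present in both
    mappings whose values differ after trimming surrounding whitespace.

    Both arguments map (item_id, property_id) tuples to string values.
    Keys found in only one of the two mappings are skipped, because a
    missing value is a gap rather than a disagreement.
    """
    mismatches = []
    for key, wikidata_value in sorted(wikidata.items()):
        external_value = external.get(key)
        if external_value is None:
            continue
        if wikidata_value.strip() != external_value.strip():
            item_id, property_id = key
            mismatches.append(
                Mismatch(item_id, property_id, wikidata_value, external_value)
            )
    return mismatches


# Made-up sample data; the item IDs and dates are placeholders, not real claims.
wikidata = {
    ("Q1", "P569"): "1900-01-01",
    ("Q2", "P569"): "1950-05-05",
}
external = {
    ("Q1", "P569"): "1900-01-01",
    ("Q2", "P569"): "1950-05-06",
    ("Q3", "P569"): "2000-01-01",
}

result = find_mismatches(wikidata, external)
for mismatch in result:
    print(mismatch)
```

A real comparison would also need to normalize values, for example date precision or name spelling variants, before deciding that two values disagree.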
The Data Mine Student Project
Data Science students at The Data Mine will work as a team to identify and address differences between Wikidata and external data sources. All work will be open source and released under open licenses. The project is supported by Wikimedia Deutschland.
Participants
- Manuel Merz (WMDE) (Mentor)
- Andrew McAllister (WMDE) (Mentor)
- Seth Deegan (Lectrician1) (Teaching Assistant @ Purdue)
- Meredith Steever (Msteever) (Project Member @ Purdue)
- Henry Lee (LofiTea) (Project Member @ Purdue)
- Ethan Dawes (Funblaster22) (Project Member @ Purdue)
- Maggie Gao (Mgaoann) (Project Member @ Purdue)
- Nanda Binod (Nandasbinod) (Project Member @ Purdue)
Timeline and updates
- Onboarding: January 8–21, 2024
  - Participant introductions
  - Setup and learning the tech
  - Exploring the data sources
- Weekly Sprints: January 22 – April 7, 2024
  - Deep dives into selected data sources of interest
  - Generating mismatches and modeling them against Wikidata’s data
  - Sending generated mismatches to the community for their feedback
- Closing: April 8–26, 2024
  - Documenting the mismatches found and the processes used to derive them
  - Writing a tutorial on how to find mismatches using what we’ve learned
  - Final project roundup
Please help us to find the right focus!
We invite the Wikidata community to help identify data sources that, when compared with Wikidata, could yield numerous and significant mismatches. Your insights will guide our focus and contribute to the success of the Mismatch Finder project.
We are looking for datasets that are free to use, easily accessible, and ideally helpful for data that is used on Wikipedia. These are potential data sources that we could work on (based on T304448):
Please let us know which of the datasets you would be most interested in, and what types of discrepancies we should focus on. Suggestions of other data sources would be welcome!
The Outcomes
Stay tuned for updates on the outcomes of the Purdue Summer of Data 2024: Mismatch Finder project. We will provide comprehensive details once the project concludes.