Wikidata talk:Events/Data Quality Days 2022

Discussions related to the Data Quality Days are welcome here. If you have any questions or need support (for example, to propose a session), feel free to contact Lea Lacroix (WMDE) directly (or @Auregann on Telegram).

Suggestions of topics for the facilitated discussions

During the Data Quality Days, we will have several slots in the program for discussions around data quality. During these sessions, facilitated by the Wikidata product managers (Lydia and Manuel), we will cover various topics, big or small, especially related to processes around data quality as well as identifying and fixing incorrect data on Wikidata. If you have a topic idea, question, or suggestion that you would like to see discussed, please add it to the list below, indicating your username. We will then group these discussion topics by broader themes and add them to the schedule.

  1. How can we fight vandalism more efficiently? (Lea Lacroix (WMDE))
  2. How can we improve our relationship with VIAF and big authority files, so that an efficient and uniform way of reporting their mistakes is created and we could avoid importing mistakes from their records once they get noticed for the first time? (Epìdosis)
  3. How can we reduce the number of duplicate items, both those deriving from one-Wikipedia-sitelink items and those deriving from items created in big data imports? (Epìdosis)
  4. How do we deal with concurrent uses of different properties (i.e. if we decide that a certain class of values should always be used with property X and not with property Y, we can discourage the use of the wrong property through constraints, but we don't have efficient ways of periodic replacement)? (Epìdosis) - see below #Concurrent ways to model data (mainly for humans)
  5. How do we encourage Wikipedia users to interact with Wikidata when they create an article for a subject that does not yet have a Wikidata item, so that it does not become a one-Wikipedia-sitelink item with 0 statements? (Epìdosis) - subtopic of #Discussion: Matching new Wikipedia articles to Wikidata items

Don't hesitate to add a discussion topic to the list, even if you're not yet sure you will attend the event! We will take notes and share the documentation of the discussions.

Template for session proposal

Here's a template you can use for proposing a session for the Data Quality Days 2022. Feel free to copy its content and create a new section below! (use the session title as the section title). Please note that sessions are not automatically accepted: because we have limited slots in the schedule, we will make a selection of sessions that can make it to the program of the event.

  • Session title: ...
  • Format: (suggestions: lightning talk (5min), presentation + discussion (50min), workshop (90min) - we encourage you to favour interactive formats, such as discussions and workshops)
  • Speaker(s) or facilitator(s): ...
  • Short description of the session: ...
  • Audience: (do participants need any specific knowledge or experience to join?)
  • Suggestions of time and date: (in UTC/GMT - on July 8-10, see the available slots in the schedule)
  • Language (if other than English)

Here you can add additional comments or requests, and ping @Lea Lacroix (WMDE): to make sure I reply to your proposal as soon as possible.

Related discussions

Discussion: Matching new Wikipedia articles to Wikidata items

  • Session title: Matching new Wikipedia articles to Wikidata items
  • Format: Discussion: perhaps a short (~5-minute) overview presentation followed by a 50-minute discussion?
  • Speaker(s) or facilitator(s): Mike Peel (facilitator)
  • Short description of the session: New Wikipedia articles are continually being created, but they frequently aren't matched up with existing Wikidata items, leading to missing interwiki links, and duplicates on Wikidata. This session will discuss the status quo - including existing matching mechanisms, mass new item creations, and merging work - and identify ways to improve this work in the future.
  • Audience: Experience with doing this kind of work is helpful, but new perspectives are very welcome!
  • Suggestions of time and date: Flexible - perhaps early afternoon in UTC on Saturday or Sunday?
  • Language (if other than English): English

(@Jura1, M2k~dewiki, GZWDer: I was wondering if you in particular might be interested in joining a discussion like this?) Thanks. Mike Peel (talk) 08:39, 7 June 2022 (UTC)

Oops, or should this be part of "Suggestions of topics for the facilitated discussions" above @Lea Lacroix (WMDE):? Thanks. Mike Peel (talk) 18:40, 7 June 2022 (UTC)
@Mike Peel: Thanks a lot for starting this session and pinging people! I'd say, depending on the topic and the scope, we can schedule what works best for you: either a 50min slot only for the discussions and experiment on the topic of matching WP new articles with WD, or including this topic in a broader discussion session.
If we end up booking a separate slot, one thing that is important to us is that the discussions are followed by suggestions on how to move forward with this topic, and what could be the next practical steps (typical examples of next steps: create/improve documentation, create a tool to solve a problem, start a community discussion to make a change on a process).
Let me know what you think :) Lea Lacroix (WMDE) (talk) 08:46, 8 June 2022 (UTC)
Hi @Mike Peel:, thanks again for your proposal! We'll be happy to dedicate a 50min slot for this discussion in the Data Quality Days program. To move forward, I have 2 questions for you:
  • For the day and time of the session, I can offer you Saturday 9 at 12:00 UTC (14:00 CEST) or 14:00 UTC (16:00 CEST). Do you have any preference?
  • For the documentation, do you prefer a video recording or a pad? For presentations, we would typically offer to record the session, but if it is mostly a discussion, people may be more comfortable if there is no video recording.
As soon as we have cleared these points, I'll add your session in the schedule. Thanks, Lea Lacroix (WMDE) (talk) 07:17, 21 June 2022 (UTC)
@Lea Lacroix (WMDE): Thanks! I think the 14h UTC slot would work best, but either should be OK. Let's go with a pad for documentation. Thanks. Mike Peel (talk) 07:26, 21 June 2022 (UTC)
@Mike Peel: Great, I added the session in the schedule at 14:00 UTC. Lea Lacroix (WMDE) (talk) 09:02, 22 June 2022 (UTC)

Might also be interesting for @ArthurPSmith, Kwgulden: and @Olaf Studt, Lantus, Bahnmoeller:.

My own experiences have been documented for example at

Interesting could be a general approach to process the backlog with unconnected articles on a regular basis, e.g.

Also see 5. (Autosuggest linking Wikidata item after creating an article) above. --M2k~dewiki (talk) 19:05, 7 June 2022 (UTC)

Sure, I guess I'm interested. I keep running into this for example with articles about physicists - dewiki has far more complete coverage of this area than enwiki, even for American physicists, and so when somebody creates a new enwiki article and they don't even think to try to link it to other language coverage or wikidata we get duplicates. But isn't the main problem that the language wikipedias are not giving users a clue that something for this may already exist elsewhere? ArthurPSmith (talk) 20:51, 7 June 2022 (UTC)
Yes, therefore it has been suggested to show the authors a pop-up after creating a new article like this:

--M2k~dewiki (talk) 20:55, 7 June 2022 (UTC)

Yes indeed, very interesting for me. Please keep me informed and pinged. —Lantus 05:21, 8 June 2022 (UTC)

Present a patrolling/vandalism-fighting workflow

Hello @Mike Peel, MisterSynergy, Emu, M2k~dewiki, ArthurPSmith:

During the Data Quality Days, we would love to present examples of patrolling workflows: how people involved in checking data quality or fighting vandalism operate on Wikidata, what are their routines, the tools they use, etc.

I was wondering if some of you would be interested in running such a session?

We could imagine different formats: one person sharing their screen and showing what they are doing, or an interview format, where we would ask you questions about how you're performing this or that task. The session could take place in a Jitsi video conference room, directly with participants who could ask questions and share their own tips. We would record the session for documentation purposes, but you wouldn't have to show your face, only use the microphone and share your screen.

If you're interested, I would gladly continue the discussion and go through the practical details with you :) Thanks in advance, Lea Lacroix (WMDE) (talk) 07:55, 10 June 2022 (UTC)

That's not something I can help with, sorry, I only tend to revert vandalism when I spot it while doing other things. Thanks. Mike Peel (talk) 11:14, 10 June 2022 (UTC)
Oh, and Wikidata:Requests_for_deletions, I could talk about that if that would fit in. Thanks. Mike Peel (talk) 22:06, 10 June 2022 (UTC)
Discussing Wikidata:Requests_for_deletions sounds very interesting to me! What did you have in mind? --Manuel (WMDE) (talk) 08:31, 14 June 2022 (UTC)
There is still one week until we reach the deadline for contributions (June 19th). I need a couple of days to figure out whether I can make it, or contribute something else for this topic. —MisterSynergy (talk) 09:03, 11 June 2022 (UTC)
Amazing! Let me know if I can help answering any questions. Lea Lacroix (WMDE) (talk) 13:57, 14 June 2022 (UTC)
Hi @MisterSynergy:, did you have the chance to think about your participation in the event, and some things from your patrolling workflow that you could present? Thanks a lot! Lea Lacroix (WMDE) (talk) 06:56, 20 June 2022 (UTC)
@Lea Lacroix (WMDE): Unfortunately, it looks like I can't make it this time, due to lack of time. I'm sorry.
In general, I would be willing to present a talk-like contribution regarding patrolling at an event such as this one. As one of the most prolific users of the patrol functionality, I have a deeper insight than most others here that would be valuable to share. Such a talk would roughly include (1) a general introduction to patrolling, (2) some high-level statistics regarding its use at Wikidata, (3) strategies and Wikidata-specific challenges in practice, and (4) an overview of useful tools. You can keep me on your list for the next event where you think something like this would be interesting, since I would consider attending. —MisterSynergy (talk) 23:22, 20 June 2022 (UTC)
@MisterSynergy: Thanks a lot for your reply, and no worries, we will have other occasions! I think the concept you presented sounds great, maybe we could plan something outside of the Data Quality Days. I'm thinking that this could be a good fit for the weekly Wikidata Editing livestream run by @Ainali, Abbe98:, but we could also organise a simple meetup on Jitsi to talk about patrolling. Lea Lacroix (WMDE) (talk) 16:04, 22 June 2022 (UTC)
@Epìdosis, Infovarius, Matěj Suchánek: Would you be interested in presenting your "patrolling routine" during the event? @Bencemac: would you consider presenting how cross-wiki vandalism fighting works? Thanks in advance! Lea Lacroix (WMDE) (talk) 07:01, 20 June 2022 (UTC)
My patrolling routine just consists in checking my watchlist one or more times a day, so maybe it's not the most interesting. Anyway, I'm surely available to show it for some minutes! --Epìdosis 07:04, 20 June 2022 (UTC)
Sorry, but I am not interested. In fact, I haven't patrolled Wikidata for quite some time. Still, when I am available and remember, I at least try to review changes done in languages I understand (cs, sk) using this tool (which implements the language filter not available in Special:RecentChanges, click "terms" and "enter language code"). --Matěj Suchánek (talk) 13:22, 20 June 2022 (UTC)
@Lea Lacroix (WMDE): I would be happy to present. Could you please clarify which format would suit best? I think I would focus on how to manage cross-wiki spam on Wikidata (RfD, nominating local pages for deletion, useful gadgets etc.) if that is what you are looking for. Bencemac (talk) 13:57, 21 June 2022 (UTC)
@Bencemac Amazing, thanks a lot!
We could either do a 50min session specifically about cross-wiki spam, where you present your workflow in detail, or something more general like "share your patrolling process, tips and tricks" where you could start with a 15-20min demo of your favorite tools, then we leave the floor open to other people to present spontaneously. What would you prefer?
Would the slot at 17:00 UTC on Saturday 9th work for you? Lea Lacroix (WMDE) (talk) 07:38, 22 June 2022 (UTC)
The time slot would be suitable for me. I think we can do the first (titled Dealing with cross-wiki spam on Wikidata?), and I would present the most important tools and how to use them in a maximum of 30 minutes, and then we would have time for questions. Bencemac (talk) 15:34, 22 June 2022 (UTC)
Sounds great! I booked the slot for you with a short description based on what you wrote above, feel free to edit it.
One more question: to document the session, do you prefer video recording (we would take care of it and publish it after the event), or collaborative note-taking on a pad?
I'm going to send some information and support to the speakers by email, would you mind sending me an email so I have your address? Thanks a lot! Lea Lacroix (WMDE) (talk) 15:59, 22 June 2022 (UTC)
Forgot the @Bencemac: Lea Lacroix (WMDE) (talk) 16:04, 22 June 2022 (UTC)
I've just sent you an e-mail. I think the second version would be better because, to find and try the mentioned resources, the participants would prefer clickable documentation. Regards, Bencemac (talk) 07:03, 23 June 2022 (UTC)
Received and noted, thanks! Lea Lacroix (WMDE) (talk) 07:35, 23 June 2022 (UTC)

Concurrent ways to model data (mainly for humans)

  • Session title: Concurrent ways to model data (mainly for humans)
  • Format: presentation + discussion (50min)
  • Speaker(s) or facilitator(s): Epìdosis
  • Short description of the session (see my point 4 in #Suggestions of topics for the facilitated discussions): the same information can sometimes be stored in different statements, which makes it very difficult to find through a single query; I will show some examples, mainly from instance of (P31) human (Q5) items, and I will survey the existing ways of dealing with the problem and possible future solutions
  • Audience: no specific knowledge, I think (examples of the issue will be shown)
  • Suggestions of time and date: July 8, 19-20 UTC; or July 9, 8-11 UTC; or July 10, 8-11 UTC
  • In English
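
To illustrate the querying problem described in the session description above, here is a minimal sketch (not part of the original proposal). It builds a SPARQL query that uses a property-path alternation so that items are found whichever of several concurrent properties was used. The helper name `build_union_query`, the choice of employer (P108) and affiliation (P1416) as the concurrent properties, and the placeholder value `Q12345` are assumptions made for illustration only; the discussion itself does not name specific properties.

```python
def build_union_query(class_qid, property_ids, value_qid, limit=100):
    """Build a SPARQL query for items of a class that link to a value
    through ANY of the given properties (hypothetical helper)."""
    if not property_ids:
        raise ValueError("at least one property ID is required")
    # SPARQL 1.1 property-path alternation, e.g. wdt:P108|wdt:P1416
    path = "|".join(f"wdt:{pid}" for pid in property_ids)
    return (
        "SELECT DISTINCT ?item WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} ;\n"
        f"        {path} wd:{value_qid} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Illustrative usage: humans (Q5) linked to an organisation either as
# employer (P108) or as affiliation (P1416). Q12345 is only a placeholder;
# substitute the item you are actually interested in.
query = build_union_query("Q5", ["P108", "P1416"], "Q12345", limit=10)
print(query)
```

A query like this only works around the problem on the reading side: every consumer has to know all concurrent properties in advance, which is why agreeing on one modelling per class of values remains the more robust fix.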

Note: the problem, with specific focus on curricula of academicians, was well analysed in 2021: Clarifying property application for effective SPARQL queries. --Epìdosis 07:32, 13 June 2022 (UTC)

Ciao @Epìdosis:, thanks a lot for your proposal! We'll be delighted to have it in the Data Quality Days program. To move forward, I have 3 questions for you:
  • For the day and time of the session, I can offer you Saturday 9 at 12:00 UTC (14:00 CEST) or 14:00 UTC (16:00 CEST). Do you have any preference?
  • I was thinking of slightly changing the title, to make it more catchy. I would take the phrasing you used in the questions section, and have something like: "How do we deal with concurrent uses of different properties? The example of modeling data for humans". What do you think?
  • Are you fine having your session recorded in video, so it can be published in replay later? If not, it's fine, we'll make sure to document the session in a written form instead.
As soon as we have cleared these points, I'll add your session in the schedule. Thanks, Lea Lacroix (WMDE) (talk) 07:12, 21 June 2022 (UTC)
@Lea Lacroix (WMDE): It's ok for changing the title and for the recording, of course. For the hour, I have a slight preference for the second slot. Thanks! --Epìdosis 07:16, 21 June 2022 (UTC)
@Epìdosis: Perfect, thanks! I added your session at 12:00 UTC and the one from Mike at 14:00 UTC, I hope it can work for you. Lea Lacroix (WMDE) (talk) 09:01, 22 June 2022 (UTC)
@Lea Lacroix (WMDE): Instead of using PPTs, I usually list the links that I want to show in a Wikidata page and then share my screen showing them one by one. Can I create a subpage of the event page (e.g. Wikidata:Events/Data Quality Days 2022/Modeling data or another title of your choice)? Otherwise I will use a user subpage, or a subpage of Wikidata:WikiProject Data Quality (e.g. Wikidata:WikiProject Data Quality/Modeling data or another title of your choice). Let me know your preference. Thanks, --Epìdosis 17:37, 22 June 2022 (UTC)
@Epìdosis: Of course, do as you please, slides are not mandatory. A subpage of Wikidata:Events/Data Quality Days 2022 is totally fine. Lea Lacroix (WMDE) (talk) 04:40, 23 June 2022 (UTC)

Using Scholia in curation workflows

  • Session title: Using Scholia in curation workflows
  • Format: workshop (90min)
  • Speaker(s) or facilitator(s): Daniel Mietchen
  • Short description of the session: We will use Scholia to engage in various curation workflows that address several kinds of data issues, from inconsistencies to incompleteness, lack of references and lack of updates.
  • Audience: anyone with an interest in curating research-related content
  • Suggestions of time and date: 16:15-17:45 UTC on July 9
  • Language: English

@Lea Lacroix (WMDE): Daniel Mietchen (talk) 18:44, 20 June 2022 (UTC)

Hi @Daniel Mietchen:, thanks a lot for your proposal, we would be glad to have it in the Data Quality Days program!
However, we're trying to keep the sessions at 60min maximum (50min if there's another session directly after), in order to keep online participants' attention.
Would you be fine with the 16:00-16:50 UTC slot on Saturday 9th? Or do you prefer starting at 16:15? Lea Lacroix (WMDE) (talk) 07:05, 21 June 2022 (UTC)
Thanks, @Lea Lacroix (WMDE): Yes, 16:00-16:50 UTC is fine. --Daniel Mietchen (talk) 21:20, 21 June 2022 (UTC)
@Daniel Mietchen: Great, thanks! I added the session to the schedule. One last question: do you prefer video recording or collaborative note-taking? Lea Lacroix (WMDE) (talk) 08:59, 22 June 2022 (UTC)

Selection of topics for the facilitated discussions

Hello all,

Thanks a lot for submitting proposals and discussion topics! We analyzed them and combined them with our own suggestions of topics that are important for data quality processes, and we selected 6 themes that we will present to you in more detail during the opening session.

At the start of the Data Quality Days today, you will be able to vote for your favorite topics (2 votes per person, vote closes at 20:00 UTC), and we will select the 3 most supported topics, that we will bring to our 3 "structured conversation" slots in the program (Saturday at 09:00 and 15:00 UTC, Sunday at 08:00 UTC).

  • A. How to tame rogue robots?
    • Bots and other tools are super helpful to edit Wikidata. But sometimes they are used in a way that is causing problems and decreases data quality.
    • Goals of the session:
      • Collect some problematic trends and patterns
      • Discuss possible solutions  
      • Find allies to improve the status-quo
  • B. Rules and anarchy
    • Some policies and guidelines on Wikidata are not fully enforced in practice (e.g. Wikidata:Bots, Wikidata:Notability). Ignoring important rules can have negative consequences for the Community and data quality.
    • Goals of the session:
      • Collect examples for policies and guidelines that are currently ignored with negative consequences
      • Discuss possible solutions
      • Find allies to improve the status-quo
  • C. Round-tripping data
    • There are excellent gold-standard sources out there that Wikidata can use, but even those make mistakes. The same is true if you use data from Wikidata for your own project. Therefore, syncs from and to Wikidata should ideally go in both directions (so called “data round-tripping”). Unfortunately, it is currently not as simple as it should be to set this up sustainably.
    • Goals of the session:
      • Collect existing hurdles for setting up round-tripping
      • Additional collection (likely no focus in the discussion)
        • Share examples of where this works particularly well already (we could use these for sharing best practices)
        • Collect examples where building up new syncs with external sources would be of great benefit to data quality
      • Discuss how we can help users who want to set-up sustainable round-tripping
      • Find allies to improve the status-quo
  • D. Why isn’t there more guidance on this?
    • Doing the right thing on Wikidata is often harder than necessary. Some defaults and processes are annoying to the best of us and for new editors a lack of guidance can even be an unsolvable problem.  
    • Goals of the session:
      • Collect examples where missing guidance and bad defaults cause editors problems which eventually leads to lower quality data
      • Discuss possible solutions
      • Find allies to improve the status-quo
  • E. Not everything Wikipedia works for Wikidata!
    • Some concepts and processes that Wikidata has taken from Wikipedia do not work well for Wikidata (e.g. recent changes and talk pages on Item level, requests for deletions process). Maybe there are alternatives?
    • Goals of the session:
      • Collect the biggest issues we currently face from using Wikipedia-solutions for Wikidata-problems
      • Discuss possible alternatives
      • Find allies to improve the status-quo
  • F. Too many ways to express the same thing (follow-up session)
    • This is a possible follow-up session to the session “How do we deal with concurrent uses of different properties?” about increasing standardization of the use of properties in Classes.
    • Goals of the session:
      • Collect practical steps towards better standardization on Wikidata
      • Find allies to improve the status-quo

If you have any questions, feel free to ask here, or to join the Data Quality Days program. See you there :) Lea Lacroix (WMDE) (talk) 06:38, 8 July 2022 (UTC)

Update: thanks a lot for your votes! The 3 topics we selected and assigned to a discussion slot are:
  • C. Round-tripping data -> Saturday 9th at 09:00 UTC
  • B. Rules and anarchy -> Saturday 9th at 15:00 UTC
  • D. Why isn’t there more guidance on this? -> Sunday 10th at 08:00 UTC
See you around! Lea Lacroix (WMDE) (talk) 20:37, 8 July 2022 (UTC)

End of the Data Quality Days 2022 - Thank you very much!

Hello all,

Once again, a big thank you to all participants, to the speakers and facilitators who contributed to the program, as well as people who helped with notes. Lydia, Manuel and I are really excited about all the interesting discussions that the event enabled, and the action it already generated.

You can find the slides, notes and videos (if any) directly in the schedule.

We collected the main outcomes of the event on the outcomes page. Feel free to contribute and add more details.

If you have any feedback on the event, or suggestions on how the format could be improved, feel free to reach out to me. Lea Lacroix (WMDE) (talk) 06:50, 18 July 2022 (UTC)