Topic on User talk:Magnus Manske


Reduce duplicates by mixnmatch_people_creator

Marsupium (talkcontribs)

I've proposed agreeing on a threshold for the duplicate rate of mass item creations at Wikidata:Project chat#What duplicate rates should we tolerate? That said, I think mixnmatch_people_creator is a great approach to automating item creation, which can take a lot of time when done manually. But I think duplicates have to be reduced by automatically performing the checks that a human editor would be expected to perform as well. All the duplicates I've found could have been prevented by:

  1. calculation of name variants:
    1. create a version from the first and last whitespace-separated parts of labels/aliases: would have prevented Q52158632 (in combination with 2.).
    2. for Spanish/Catalan names, create versions with and without "y" or "i" and with and without the matronymic, e.g. "Gregori Minobis" from "Gregori Minobis i Puntonet": would have prevented Q52158641, Q52158636.
    3. probably more
  2. import of name variants from databases (here CANTIC, Library of Congress authority, ULAN) into Mix'n'match, e.g. "Gregori Minobis i Puntonet": would have prevented Q52158641, Q52158636, Q52158634, Q52158623, Q52154873.
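
To make point 1 more concrete, here is a minimal Python sketch of how such name variants could be computed. It is only an illustration under my own assumptions and not how mixnmatch_people_creator works: the names `name_variants`, `may_be_same_person` and `CONJUNCTIONS` are hypothetical, and it only drops the conjunction or the matronymic; it does not try to insert a missing "y"/"i".

```python
# Hypothetical helper names; a sketch of the name-variant idea in point 1,
# not the actual mixnmatch_people_creator code.

# Assumption: these tokens join patronymic and matronymic in Spanish/Catalan names.
CONJUNCTIONS = {"y", "i"}


def name_variants(name: str) -> set[str]:
    """Return a set of plausible variants of a personal name."""
    parts = name.split()
    variants = {" ".join(parts)}  # the name itself, whitespace-normalised

    if len(parts) >= 2:
        # 1.1: keep only the first and last whitespace-separated parts
        variants.add(f"{parts[0]} {parts[-1]}")

    for idx, part in enumerate(parts):
        # 1.2: a conjunction only counts if it sits between other parts
        if part.lower() in CONJUNCTIONS and 0 < idx < len(parts) - 1:
            before, after = parts[:idx], parts[idx + 1:]
            variants.add(" ".join(before + after))  # without "y"/"i"
            variants.add(" ".join(before))          # without the matronymic

    return variants


def may_be_same_person(name_a: str, name_b: str) -> bool:
    """True if the two names share at least one variant (case-insensitive)."""
    a = {v.casefold() for v in name_variants(name_a)}
    b = {v.casefold() for v in name_variants(name_b)}
    return bool(a & b)
```

With this sketch, `may_be_same_person("Gregori Minobis", "Gregori Minobis i Puntonet")` is true, so a creator tool could flag the entry for a closer check instead of creating a new item. A shared variant is of course only a hint, not proof of identity.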

I'm aware that this isn't trivial and that step 1 might impair performance, but I think the current duplicate rate has to be reduced before mixnmatch_people_creator is run any further. Thank you for your work, M.
