At Wikidata:Project chat#What duplicate rates should we tolerate? I've proposed agreeing on a threshold for the duplicate rate of mass item creations. Nevertheless, I think mixnmatch_people_creator is a great approach to automating item creation, which can take a lot of time when done manually. But I think duplicates have to be reduced by automatically performing the checks that a human editor would be expected to do as well. All the duplicates I've found could have been prevented by:
1. Calculation of name variants:
   - create a variant consisting of only the first and last whitespace-separated parts of each label/alias: would have prevented Q52158632 (in combination with 2.).
   - for Spanish/Catalan names, create variants with and without "y" or "i" and with and without the matronymic, e.g. "Gregori Minobis" from "Gregori Minobis i Puntonet": would have prevented Q52158641, Q52158636.
   - probably more
2. Import of name variants from the source databases (here CANTIC, Library of Congress authority, ULAN) into Mix'n'match, e.g. "Gregori Minobis i Puntonet" from http://cantic.bnc.cat/registres/fitxa/44836: would have prevented Q52158641, Q52158636, Q52158634, Q52158623, Q52154873.
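To make the name-variant calculation more concrete, here is a rough Python sketch of what I have in mind. It is only an illustration under my own assumptions: the function name `name_variants` and the constant `CONNECTORS` are made up for this example and are not part of mixnmatch_people_creator, and it only covers removing the connector and the second surname, not adding a missing connector.

```python
# Hypothetical sketch, not the actual mixnmatch_people_creator code.
# Assumption: "y" / "i" between two surnames marks a Spanish/Catalan name.
CONNECTORS = {"y", "i"}


def name_variants(name: str) -> set[str]:
    """Return the name plus simple variants to check before creating an item."""
    parts = name.split()
    variants = {" ".join(parts)}  # whitespace-normalised original
    if len(parts) < 2:
        return variants

    # Variant made of the first and last whitespace-separated parts only.
    variants.add(f"{parts[0]} {parts[-1]}")

    # Spanish/Catalan variants: drop the connector, or drop the connector
    # together with everything after it (the second surname).
    for idx in range(1, len(parts) - 1):
        if parts[idx].lower() in CONNECTORS:
            before, after = parts[:idx], parts[idx + 1:]
            variants.add(" ".join(before + after))
            if len(before) >= 2:
                variants.add(" ".join(before))
    return variants


print(sorted(name_variants("Gregori Minobis i Puntonet")))
```

For "Gregori Minobis i Puntonet" this yields "Gregori Minobis", "Gregori Minobis Puntonet" and "Gregori Puntonet" in addition to the original form. Each variant would then be compared against existing labels and aliases before a new item is created.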
I'm aware that this isn't trivial and that 1. might impair performance, but I think that the current duplicate rate has to be reduced before mixnmatch_people_creator is run any further. Thank you for your work, M.