Topic on User talk:Magnus Manske

Nono314 (talk | contribs)

Hi Magnus,

Thank you very much for this: it helps a lot by automating the comparison I was doing manually ;-) Feeding person_dates remains almost the last task where we can't do without you, as I have been able for some time now to leverage the aux part of the scraper to inject external IDs and Wikipedia backlinks to boost the matching.

I'm a bit puzzled, however, about how the auxiliary matcher determined this. It doesn't look like the right thing to me...

Magnus Manske (talk | contribs)

I have thought about outsourcing the date parsing to users, but didn't find a nice solution, yet...

As to the automatcher results, that's probably an old bug, I'll see if I can chase it down.

Nono314 (talk | contribs)

Thanks for un-matching them, pending bug chasing ;-)

As much as I am missing the date parsing feature, I fully understand how sensitive it is, especially given it is not just used for matching but eventually added to WD by Reinheitsgebot... May I ask you if you can add regexes to catalogs such as 2059, 1809 or 1824?

Relations also have huge potential, but are currently under-used: using them would be a big improvement where name matching does not make sense at all, as for artworks. It's a pity that they get such poor matches even with aux containing everything needed to find the right one. We could then consider, for example, having RKD matching on steroids!

Magnus Manske (talk | contribs)

Catalog 2358 is the only one with both creator and inception date that I know of. I wrote a special script to do high quality automatches, running now.

Nono314 (talk | contribs)

Sorry for not having been too clear with that one. I meant using values in aux data to narrow the search, somewhat like the way you currently use the type to restrict the search to plausible entities. We have many cases where we could easily have creator and/or collection for that purpose (RKD is a good example, as relations could be leveraged with 13 and 2382).

I appreciate what you did but, sadly, it seems your special script is now ignoring the title, so I'm not sure this is really better than the previous iteration, except it is now clearly wrong and even a casual user wouldn't validate the match...

Looks like dates are not supported by CirrusSearch, but at least something like this is giving what I would call high quality results that could be offered to the user for picking instead of just selecting a random entry by this name as was previously the case. Hope you're not too fed up to try it ;-) If so I can work on providing more aux data for some catalogs. And by the way having P276|P195 in autodesc would also help assessing many matches!
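
To make the idea concrete, here is a minimal sketch of how a title search could be narrowed with aux values through the `haswbstatement` keyword that CirrusSearch offers on Wikidata. The helper names, the example title and the item ID are assumptions picked for illustration only; this is not how the automatcher currently works, and it only builds the request without sending it.

```python
from urllib.parse import urlencode

# Public MediaWiki API endpoint of Wikidata.
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_query(title, constraints):
    """Combine a free-text title with one haswbstatement clause per aux value.

    `constraints` is a list of (property, item) pairs, e.g. creator (P170)
    or collection (P195). Several clauses are combined with AND by the search.
    """
    parts = [title]
    for prop, item in constraints:
        parts.append(f"haswbstatement:{prop}={item}")
    return " ".join(parts)


def build_search_url(title, constraints, limit=10):
    """Return a list=search API URL for the narrowed query (hypothetical helper)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": build_search_query(title, constraints),
        "srlimit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


# Example with placeholder values: a title restricted to one creator.
example_query = build_search_query("Sunflowers", [("P170", "Q5582")])
example_url = build_search_url("Sunflowers", [("P170", "Q5582")])
```

The candidates returned by such a query could then be shown to the user for picking, rather than auto-selected.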

Reply to "automatching for catalogue 3024"