Topic on User talk:Hjfocs

Soweego bot adding invalid GND ids

Summary by Hjfocs

[soweego 2] MusicBrainz (Q14005) URLs validation: bad ID extraction

CamelCaseNick (talkcontribs)

VIAF contains GND entries, and the IDs listed there are usually the GND IDs themselves. However, some GND IDs contain hyphens, and for those VIAF substitutes numeric IDs that usually start with 0 or 9.

Those are not GND IDs, and they do not match the regular expression for GND IDs. Your bot is importing these invalid values from MusicBrainz (see e.g. Special:Diff/1483617303). The easiest approach would be to ignore them, and maybe list them somewhere so they can be fixed both here and in MusicBrainz.

A more advanced and complex approach would be to check for a VIAF-processed entry and extract the correct ID from it.
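The easy approach (skip every candidate that fails the regular expression and keep a list of the rejects for later fixing) could be sketched roughly as follows. This is only an illustration: `split_by_validity` is a hypothetical helper, not part of the bot, and `ASSUMED_GND_PATTERN` is an assumed pattern; the authoritative one is the format constraint on the GND ID property and may differ.

```python
import re

# ASSUMPTION: illustrative pattern only. Check the format constraint on the
# GND ID property for the authoritative regular expression.
ASSUMED_GND_PATTERN = re.compile(
    r"1[012]?\d{7}[0-9X]|[47]\d{6}-\d|[1-9]\d{0,7}-[0-9X]|3\d{7}[0-9X]"
)


def split_by_validity(candidate_ids, pattern=ASSUMED_GND_PATTERN):
    """Separate candidates that fully match the pattern from those that do not.

    Hypothetical helper: returns (valid, rejected) so that the rejected
    values can be listed somewhere and fixed at the source.
    """
    valid, rejected = [], []
    for candidate in candidate_ids:
        # fullmatch requires the whole string to match, not just a prefix
        if pattern.fullmatch(candidate):
            valid.append(candidate)
        else:
            rejected.append(candidate)
    return valid, rejected
```

Only the `valid` list would be uploaded; the `rejected` list could be posted on a wiki page so the values can be corrected here and reported to MusicBrainz.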

Hjfocs (talkcontribs)

Thanks again for your valuable help with these troublesome IDs: I have stopped the bot, will fix the bad IDs, and will prevent them from being added in future runs. I'll stick with the easy approach you propose, since I believe it will also create a useful feedback loop with the MusicBrainz maintainers. By the way, that is one of the main goals of the soweego project, see m:Grants:Project/Hjfocs/soweego_2#Goals. Cheers!

CamelCaseNick (talkcontribs)

I have found another malformed identifier: the LoC authority file web access uses the authority URI with an .html suffix, which shouldn't be part of the ID. See Special:Diff/1483536843.
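A minimal sketch of how the suffix could be dropped during ID extraction, assuming the ID is the last path segment of the URL. `loc_id_from_url` is a hypothetical helper, and the URL in the usage line is a made-up placeholder rather than a real record.

```python
from urllib.parse import urlsplit


def loc_id_from_url(url: str) -> str:
    """Return the last path segment of the URL without a trailing '.html'.

    Hypothetical helper; assumes the identifier is the final path segment.
    """
    last_segment = urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
    suffix = ".html"
    if last_segment.endswith(suffix):
        # Drop the web-access suffix, which is not part of the ID
        last_segment = last_segment[: -len(suffix)]
    return last_segment


# Placeholder URL for illustration only
print(loc_id_from_url("https://id.loc.gov/authorities/names/n00000000.html"))
```

URLs that do not end in .html pass through unchanged, so the same function can be applied to every extracted link.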

Hjfocs (talkcontribs)

I checked the bot datasets and found 2 such malformed IDs in total. It's great that you have already fixed them, thank you so much for your work! I deleted all the problematic GND IDs, too. Cheers!