Hi Magnus - will you be attending the WikiCite meeting in San Francisco the week after next? The final day is a "hack day" and I had some thoughts about potential improvements to SourceMD. Whether or not you're there, what would be a helpful process to implement improvements? In particular I was looking at the "new_resolve_authors" piece, whose status I'm not really sure of, but it looks like it is close to doing something significant for author disambiguation that I've been wanting to work on for ages...
Topic on User talk:Magnus Manske
Wikicite? and SourceMD?
Hi! Sorry, I won't be in SF. That said, I'm always happy to talk about new/improved tools, time permitting. "new_resolve_authors" is a crude solution to a real problem, but I'm not sure the codebase should be continued; rather, I'd like to see a different setup, based on better clustering. Also, ideally, a bot that, in the background, creates new high-confidence author items, and changes the statements in the publications accordingly. Having access to high-quality author databases other than ORCID would be a boon.
I don't think there's anything that compares to ORCID (VIAF and ISNI have some records for scholarly authors, but very incomplete from what I've seen). I expect there are internal publisher databases that have more useful data (like author email addresses) but I don't know how we get at those. The problem I'd like to address is handling authors from before ORCID existed, or who have never engaged with ORCID (and may now be deceased). I'd like to facilitate human curation; I'm not sure we're ready for bots to do this. My idea was just to add some simple features to help decide which papers are associated with an author: I believe your tool right now looks at names of coauthors, which is actually a good start. I'd like to add in journal title, publication date, and possibly citations and affiliations where we have any data on that.
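To make the idea concrete, here is a minimal sketch of how coauthor names, journal title and publication date might be combined into one score for ranking candidate papers for a human curator. It is not how "new_resolve_authors" works; the `Paper` class, the `score_candidate` function, the weights and the five-year window are all hypothetical choices for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Paper:
    """Hypothetical minimal record for a publication."""
    title: str
    coauthors: set = field(default_factory=set)  # coauthor name strings
    journal: str = ""
    year: Optional[int] = None


def score_candidate(candidate: Paper, known: list[Paper]) -> float:
    """Score how plausibly `candidate` shares an author with the `known` papers.

    The weights are arbitrary assumptions; higher means more plausible.
    """
    if not known:
        return 0.0
    known_coauthors = set().union(*(p.coauthors for p in known))
    known_journals = {p.journal for p in known if p.journal}
    known_years = [p.year for p in known if p.year is not None]

    score = 0.0
    # Shared coauthors are treated as the strongest signal.
    score += 2.0 * len(candidate.coauthors & known_coauthors)
    # Same journal as an already-attributed paper.
    if candidate.journal and candidate.journal in known_journals:
        score += 1.0
    # Publication year within (or close to) the known active period.
    if candidate.year is not None and known_years:
        if min(known_years) - 5 <= candidate.year <= max(known_years) + 5:
            score += 0.5
    return score
```

A score like this is only suited to sorting suggestions for review. A paper in an unrelated journal with no shared coauthors scores zero even when the authorship is genuine, so low scores should not be used to exclude anything automatically.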
The way I was thinking of proceeding was to clone the "new_resolve_authors" piece, edit it to just spit out a list of QuickStatements commands that can be run separately (rather than directly feeding it to the bot), and then try to recruit some people to test it and figure out ways to make it work better... Does that seem sensible? I guess I'll report back on how it goes...
Journal title etc. would work, especially for authors with "common" names, but it will miss out on some authorships (e.g. someone had a kitchen chat with you and puts you on their paper in some completely unrelated discipline/journal).
One reason I don't just start up QuickStatements myself is that after author item creation, the QID of that new item needs to be used as a value for the author statement changes. Theoretically that should work using the LAST keyword, but I'm not quite sure that works...
One could create the author item internally, and then run the rest in QuickStatements. Assuming there are no two identically named authors on that paper ;-)
Something that could improve "tasking" people with this could be to pre-generate likely candidates in a separate database, which would allow for quick serving of a set to work on.
Some heuristics relating to name frequency might also help... Anyway, good suggestions, thanks!
First stab at this is up here: https://tools.wmflabs.org/author-disambiguator/ and https://github.com/arthurpsmith/author-disambiguator/ - not really expecting you to do anything on this, just keeping you informed! Thanks. I am aware some things are broken, but there's some basic functionality which works so... good!
Nice! I tried it with myself, and even found a paper where I'm "just" a string author. The QS statement to add the P50 was generated correctly.
- This needs an additional QS command to remove the string author
- If the string author contains a reference (some do), that reference needs to be added to the P50 statement before removing the string author, otherwise people will start yelling at you (guess how I know that?)
- You can open QuickStatements with all the commands pre-filled by doing a POST request (recommended; GET requests get unpredictably chopped by the browser).
That will save the user from copy/pasting the whole thing, and looks a lot less messy :-)
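The command sequence from the points above (add P50 with the reference, then remove the string author) could be generated along these lines. This is a sketch only, assuming the tab-separated QuickStatements V1 text format, in which a leading "-" removes a statement and reference properties are written with an "S" prefix. The function name `author_commands` and all QIDs in the usage example are hypothetical placeholders, and this is not the code the tool actually runs.

```python
def author_commands(paper_qid, author_qid, name_string,
                    ordinal=None, references=()):
    """Build QuickStatements V1 lines replacing an author name string (P2093)
    with an author item statement (P50) on one paper.

    `references` is a sequence of (property, value) pairs copied from the
    original string statement, e.g. [("P248", "Q3")].
    """
    add = [paper_qid, "P50", author_qid]
    if ordinal is not None:
        # Carry over the series ordinal qualifier (P1545) as a quoted string.
        add += ["P1545", f'"{ordinal}"']
    for prop, value in references:
        # In the V1 format, reference properties use "S" in place of "P".
        add += ["S" + prop[1:], value]
    # The removal line comes second, so the reference is already attached
    # to the new P50 statement before the string author disappears.
    remove = ["-" + paper_qid, "P2093", f'"{name_string}"']
    return ["\t".join(add), "\t".join(remove)]


if __name__ == "__main__":
    # Placeholder QIDs for illustration only.
    for line in author_commands("Q1", "Q2", "Manske M",
                                ordinal=3, references=[("P248", "Q3")]):
        print(line)
```

Any other qualifiers on the original statement would need the same carry-over treatment as the series ordinal; this sketch does not handle them.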
Thanks! Yes, I was thinking about the references issue (and removing the string statements) ...
I've updated it to do the QuickStatements POST, and to add the references. Do you think we need to check for other qualifiers besides series ordinal on the original statement? I'll do a bit of testing and then add back in the delete statements...
FYI this was rather popular at WikiCite - I had half a dozen or so people trying it out - thanks for getting it started and your suggestions so far! We fixed some bugs and I've done a bunch of refactoring; I have a few more ideas on the clustering angle I'm going to try out too.