Topic on User talk:Magnus Manske

SourceMD problem with some DOIs?

ArthurPSmith (talkcontribs)

Ah, yes indeed, that does look like the issue - I didn't realize there was a special SourceMD talk page! Once the DOIs have been corrected, does SourceMD stop repeatedly trying to add them?

I've also run across something strange where it looks like the same article title and authors are repeated but with different DOIs - sometimes with a slight adjustment of the title (capitalization). At first I thought it was just the authors repeatedly submitting articles with the same title, but I've now seen too many like this to think that's true. If I run across it again I'll dig further, but I was wondering if that's also a known problem?

Sic19 (talkcontribs)

Correcting the DOI seems to stop the creation of duplicates. I'm not absolutely certain of this, but I haven't seen anything that gives me cause for concern about duplicates being created when the correct DOI exists.

Could the repeated titles be things like replies or errata that do not include that information in the label? @Trilotat has drawn our attention to an issue with titles being created that are truncated at the second or third colon.

Pintoch (talkcontribs)

Hi Magnus,

Given that this issue has generated repeated complaints on WD:ANI and elsewhere, I think it would be best to deactivate the tool until this is fixed. If you do not have time to do that, we can set up an abuse filter to block it.

Cheers,

ArthurPSmith (talkcontribs)

I plan to track down all the journals with '<' characters in DOIs to see how bad this problem is... it might not be too hard to fix.

ArthurPSmith (talkcontribs)

Going through the Crossref data - so far I've found 41,000 '<' DOIs out of 24 million DOIs checked. That's a higher percentage than I expected; I haven't tracked down all the associated journals yet.
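
For the record, the scan itself is trivial - just a substring test on the DOI field of each Crossref record. A rough sketch of the idea, not my actual script; the one-record-per-line layout and the filename are assumptions:

$count = 0 ;
$total = 0 ;
// hypothetical filename; assumes one JSON record per line
$fh = fopen ( 'crossref-works.ndjson' , 'r' ) ;
while ( ( $line = fgets ( $fh ) ) !== false ) {
    $rec = json_decode ( $line , true ) ;
    if ( !isset ( $rec['DOI'] ) ) continue ;
    $total++ ;
    if ( strpos ( $rec['DOI'] , '<' ) !== false ) $count++ ;
}
fclose ( $fh ) ;
echo "$count of $total DOIs contain '<'\n" ;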

Sic19 (talkcontribs)

Have you taken note of the publishers responsible for the DOIs you've identified? I suspect it is primarily a Wiley-related problem, but I could be wrong about that, of course. Another thing I noticed: this issue seems to be confined to DOIs for articles published before 2006 or 2007 - have you found any articles later than that?

ArthurPSmith (talkcontribs)

It's not just 10.1002 DOIs, if that's what you mean - I've seen 10.1175 (American Meteorological Society), 10.1130 (Geological Society of America), 10.1206 (BioOne?), 10.1666 (BioOne?), 10.1562 (BioOne?), 10.1672 (Springer?), 10.1899 (U. Chicago Press), etc.

ArthurPSmith (talkcontribs)

They do all seem to be 10+ years old, though.

ArthurPSmith (talkcontribs)

Final count is 431,753 DOIs with a '<' character (from the August 2018 Crossref metadata files), and 431,757 with a '>'. It may be a bit more work than I thought to sort out the relevant journal lists and link them to Wikidata IDs to see the full scope of what needs fixing...

ArthurPSmith (talkcontribs)

Ok, down to 423 journals on Wikidata potentially having this problem, and 18,432 items under those journals that I can find with DOIs that look like they have problems. I think I can fix them all...
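
Finding candidates per journal is also straightforward. A sketch of the kind of query involved - the "space in the DOI" heuristic is my assumption about what the damage leaves behind, and Q180445 (Nature) is purely an illustrative journal QID, not an affected one:

// list items in one journal whose stored DOI looks damaged
$journal = 'Q180445' ; // illustrative QID only
$sparql = "SELECT ?item ?doi WHERE {
  ?item wdt:P1433 wd:$journal ;  # published in
        wdt:P356 ?doi .          # DOI
  FILTER ( CONTAINS ( ?doi , ' ' ) )
}" ;
// WDQS rejects requests without a User-Agent
$ctx = stream_context_create ( [ 'http' => [ 'header' => "User-Agent: doi-fix-sketch/0.1\r\n" ] ] ) ;
$url = 'https://query.wikidata.org/sparql?format=json&query=' . urlencode ( $sparql ) ;
$data = json_decode ( file_get_contents ( $url , false , $ctx ) , true ) ;
foreach ( $data['results']['bindings'] as $b ) {
    echo $b['item']['value'] , "\t" , $b['doi']['value'] , "\n" ;
}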

Sic19 (talkcontribs)

Brilliant! If you want to run a test batch I would suggest using the Geology DOIs because Trilotat is working on them and would be able to report any problems.

I'm wondering, would it make sense to ingest all of the DOIs using this format that are not yet in Wikidata and then run one big fix for them all? In theory, no further duplicates could be created afterwards.

Ping me if you need any help.

ArthurPSmith (talkcontribs)

Batches to fix the DOIs are running now - however, these were based on partial DOI + title matches, and there were 426 cases where that was not sufficient to get a unique matching DOI (often short titles like "Errata" or "Reply", or titles that didn't match anything at all), so I think I'm going to have to do something fancier to match those - or maybe just go through them by hand.
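
The matching logic is roughly the following - a sketch of the idea rather than my actual code, with $crossref standing in for a full-DOI-to-title map built from the Crossref dump:

function findFullDoi ( $brokenDoi , $title , $crossref ) {
    // the import bug replaced the '<...>' segment with a single space
    $parts = explode ( ' ' , strtolower ( $brokenDoi ) , 2 ) ;
    if ( count ( $parts ) != 2 ) return null ; // not the damage pattern
    list ( $head , $tail ) = $parts ;
    $matches = [] ;
    foreach ( $crossref as $fullDoi => $crTitle ) {
        $doi = strtolower ( $fullDoi ) ;
        // partial DOI match: the text outside the '<...>' must agree
        if ( substr ( $doi , 0 , strlen ( $head ) ) != $head ) continue ;
        if ( substr ( $doi , - strlen ( $tail ) ) != $tail ) continue ;
        // title match, ignoring case
        if ( strcasecmp ( trim ( $crTitle ) , trim ( $title ) ) != 0 ) continue ;
        $matches[] = $fullDoi ;
    }
    // accept only a unique hit - the 426 ambiguous cases fall out here
    return count ( $matches ) == 1 ? $matches[0] : null ;
}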

No, I don't think it would be a good idea to ingest all of the remaining 400,000+ articles! SourceMD needs to be fixed to handle this correctly...

ArthurPSmith (talkcontribs)

Here's the problem in the SourceMD code - all the data is being run through this:

function fixStringLengthAndHTML ( $s ) {
    // turn non-breaking spaces into plain spaces
    $s = str_replace ( html_entity_decode('&#160;') , ' ' , $s ) ;
    // undo common HTML entities
    $s = str_replace ( '&amp;' , '&' , $s ) ;
    $s = str_replace ( '&quot;' , '"' , $s ) ;
    // strip anything that looks like an HTML tag - this is the culprit
    $s = preg_replace ( '/<.+?>/' , ' ' , $s ) ;
    // collapse whitespace runs, trim, and cap the length at 250
    $s = preg_replace ( '/\s+/' , ' ' , $s ) ;
    $s = trim ( $s ) ;
    if ( strlen($s) > 250 ) $s = substr ( $s , 0 , 250 ) ;
    return $s ;
}

- the fourth replacement (the preg_replace that treats anything matching '<.+?>' as an HTML tag and replaces it with a space) is what's killing these DOIs. This function should probably just be skipped for the DOI.
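
Something like the following would do it. A sketch only - the wrapper name and the flag are made up, not how SourceMD is actually structured:

// bypass the HTML cleanup for DOIs, which are identifiers, not HTML
function fixValue ( $s , $is_doi = false ) {
    if ( $is_doi ) return trim ( $s ) ; // a DOI only needs whitespace trimmed
    return fixStringLengthAndHTML ( $s ) ; // everything else as before
}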

Sic19 (talkcontribs)

So, I'd misread the count of these DOIs as 43,000 and suggested a complete ingest on that basis. I completely agree that it would not be sensible to try this with 400,000+ articles.

In the batches I have fixed, I encountered the same issue of partial DOI + title matches being insufficient. With the addition of the lead author and pagination it was possible to successfully match a good number of additional items to DOIs. The starting page number follows the '<' in the DOI (I am unsure whether this can be generalised to all DOIs with this format, though). A limitation appears to be pagination that uses both numerals and a letter, e.g. 742B, which is not always correct in Wikidata. Unfortunately, replies and comments etc. frequently use this format.
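
In code terms the extraction amounts to this - a sketch, and as noted, the assumption that the start page always follows the '<' may not hold everywhere:

function startPageFromDoi ( $doi ) {
    // digits optionally followed by one letter, e.g. 1091 or 742B
    if ( preg_match ( '/<(\d+[a-zA-Z]?)/' , $doi , $m ) ) return $m[1] ;
    return null ;
}
// with a made-up DOI for illustration:
// startPageFromDoi ( '10.1130/TEST(2000)28<742B:XY>2.0.CO;2' ) returns '742B'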

I'm going to start a new topic on the SourceMD talk page in an attempt to connect the problems that have been discussed in various places over recent weeks.

Trilotat (talkcontribs)

I'm excited to see this work. All, as an obsessed and low-tech DOI fixer (one at a time, with multiple copy/pastes to fix each), I sure appreciate it. I used SourceMD to load most of the Geology journal, so mea culpa. Thanks for the remedy.

Trilotat (talkcontribs)

I see many corrections have happened. Thanks! I still have many that didn't get resolved (or haven't been yet). I have started to work on them. They appear to be the DOI + title match problems described above. You've really helped, both of you, so thank you.

ArthurPSmith (talkcontribs)

Trilotat, it might help if you hold off - the last 7,000 fixes have only just started (I had to split them into 3 batches for QuickStatements).

Trilotat (talkcontribs)

Oops. Will do. Sorry if I goofed anything up with the edits I've just done. Thanks.

ArthurPSmith (talkcontribs)

It's ok - all done now anyway! Now I'm running a merge process to group together all the duplicates among the ones I just fixed (it might have made sense to run that first, but I didn't think of that...)

Trilotat (talkcontribs)

Okay. I'm back to plugging away at this list. There are only 607 in the list. I'm pretty sure they are all articles with title issues.

ArthurPSmith (talkcontribs)

Merging is completed. I'll probably take a look at handling the remaining bad DOI cases over the next few days; feel free to move ahead by hand on any of them though!

Trilotat (talkcontribs)

ArthurPSmith, might your batches merge articles with the same name even though they have different pages? I might have discovered one such merge (or it might be an example that I fouled up during my edits). I might have made it impossible to determine, since I've already fixed Q59569886 and Q59569882... sorry.

Trilotat (talkcontribs)

Aarrgh - look at Q59569842. I won't edit that. It has merged two different articles from two different issues. To be clear, even if these batches generate a few merges in error, I'm extraordinarily grateful for the work. I can clean them up manually if necessary.

ArthurPSmith (talkcontribs)

Yes - there can't be very many of them, but my assumption that the combination of "partial DOI" (the part outside the <>), title, and last name of the first author would be unique was apparently not quite true! Feel free to restore any others you see like that; I'll see if I can track them down too.

I'm hoping we are down to the very last of these - however, I discovered another problem with the SourceMD imports: it transforms the character '#' into '%23', which I believe would also cause the lookup check to fail and result in duplicates. There are 8,640 such DOIs in the full list I have, so I imagine there are at least a few hundred in Wikidata with this problem. I guess I'll be taking a look at that too.

@Trilotat I did notice that some of the DOIs you added were mixed case - that will also cause trouble for SourceMD. I believe any alphabetic characters need to be either all uppercase (which it defaults to when inserting) or all lowercase; mixed case will also fail to match. Some you entered looked like <567a:BR> - it needs to be <567A:BR> if you are going with uppercase. Note that the Crossref source data I have has DOIs in all lowercase, so I've left them that way where I've been replacing them.
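
Both problems come down to the stored string differing from the lookup string, so a single normalisation pass should cover them. A sketch, going with lowercase to match my Crossref data (DOIs are case-insensitive by definition, but the duplicate check evidently compares plain strings):

function normaliseDoi ( $doi ) {
    $doi = str_replace ( '%23' , '#' , $doi ) ; // undo the bad '#' encoding
    return strtolower ( trim ( $doi ) ) ; // one consistent case
}
// with a made-up DOI for illustration:
// normaliseDoi ( '10.1130/TEST(2000)28<567a:BR>2.0.CO;2' )
//   returns '10.1130/test(2000)28<567a:br>2.0.co;2'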

Trilotat (talkcontribs)

All, thanks so much for the assistance (doing the lion's share of the work, really). That list has no more bad DOIs in it. I cannot speak for the quality of the data within each item, but at least the DOIs aren't duplicated. I must now check that all the titles are accurate and that there are no constraint violations, e.g. two DOIs in one item, or two issues listed in one. Yippee... thanks so much.

Magnus Manske (talkcontribs)

I have changed the SourceMD code to treat DOIs differently. Thank you guys for cleaning up!

Trilotat (talkcontribs)

Thank YOU.

ArthurPSmith (talkcontribs)

Thanks Magnus!!