Topic on User talk:IagoQnsi

Peteforsyth (talkcontribs)

Hi, I'm delighted to see you have added so many items for local newspapers. I've found several so far that are duplicates of existing items (The Bend Bulletin, the Hood River Glacier, and the Klamath Falls Herald). Do you know of a good way to search for and resolve duplicates, or should I just continue searching for them manually?

Also, you may be interested in a campaign this connects to: News On Wiki. We are trying to create a few hundred new English Wikipedia articles about small newspapers by the end of February. -~~~~

IagoQnsi (talkcontribs)

Hi Pete. Apologies for those duplicates; it just wasn't feasible for me to go through and dedupe all the ones that the automatching missed, as there were some 19,000 newspapers iirc. I don't know of a good way to find and merge duplicates. I'd be happy to give you my OpenRefine project files if you think you could do something with those. I suspect the duplicates aren't too numerous, as many of the papers I imported have been defunct for decades, and many of the ones that still exist did not already have items anyway. I figured editors would gradually stumble upon the duplicates over time and whittle them away.

Peteforsyth (talkcontribs)

Hi, I've delved into this in some more depth now. I resolved most of the duplicates in the state of Washington, I've dabbled in Oregon and Wisconsin, and I think I have a pretty good sense of where things stand. It seems to me that probably well over half of the ~19k items you imported were duplicates. There are two things going on; the first is pretty straightforward, the second has more nuance and room for interpretation.

First, there appear to have been three major imports of newspaper items: Sk!dbot in 2013, 99of9 in 2018, and items for which a Wikipedia article existed (rolling). A whole lot of your items were duplicates of existing items that had come about through these processes. (An example is the Capital Times. But I've merged yours, so you'll have to look closely.) I know that the 2018 import was based on the USNPL database (and a handful of other web databases). I have no idea what Sk!dbot used as a source.

Second, there are items that some might consider duplicates, while others wouldn't. Consider a case like this:

  • The (a) Weekly Guard changed its name to the (b) Guard.
  • The Guard and the (c) Herald (which had both daily and weekly editions at various times, and went through three different owners) merged.
  • The (d) Herald-Guard has continued.

Many newspaper databases (Chronicling America, Newspapers.com, etc.) consider a, b, c, and d four or more distinct items, and may or may not do a great job of expressing the relationships among the items. In WikiProject Periodicals, we discussed it in 2018, and concluded that we should generally consider a, b, c, and d one item, and attach all four names to it (as alternate labels). See the items merged into the Peninsula Daily News for an example of how your items relate to this principle.
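
To see what that looks like in the data: after a merge, the former names simply live on as aliases of the surviving item. A rough query along these lines (just a sketch; it looks items up by the English label "Peninsula Daily News") lists the aliases an item carries:

SELECT ?item ?alias WHERE {
  ?item (wdt:P31/wdt:P279*) wd:Q11032 ;
        rdfs:label "Peninsula Daily News"@en .
  # former names attached as English aliases
  OPTIONAL { ?item skos:altLabel ?alias . FILTER(LANG(?alias) = "en") }
}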

Peteforsyth (talkcontribs)

Unfortunately, all this adds up to a setback to the News On Wiki campaign, which has relied on reasonably tidy and stable Wikidata data (and the in-progress PaceTrack software) to track our progress (in addition to having improvement to relevant Wikidata content as a core component of our mission). There are two impacts:

  • Prior to your import, this query returned about 6,000 results. It was a little cumbersome, but it was possible to scroll and zoom around. Now, it returns about 25,000 results, and it's sluggish.
  • Prior to your import, the green items (indicating that there was a Wikipedia article) were pretty prominent. But now, the map looks mostly red, making it less useful as a visual indicator of how thoroughly English Wikipedia covers U.S. newspapers.

The second problem results in part from some stuff that predates your import, and maybe I can figure out a way to address it. If a city has 4 papers and one of them has a Wikipedia article, it would be better to have a green dot than a red dot (indicating that at least one newspaper in that city has a Wikipedia article). But unfortunately it goes the other way. I bet I can adjust the code to make that change, or maybe even find a more graceful way of handling it than that.
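
To sketch the adjustment I have in mind (this is not the campaign's actual query, just a rough stand-in built on the generic newspaper class and place of publication): aggregate per city, and color the dot green as soon as at least one paper there has an English Wikipedia article, red otherwise.

#defaultView:Map
SELECT ?place ?placeLabel ?coords ?papers ?withArticle ?rgb WHERE {
  {
    # count papers per U.S. place of publication, and how many have an enwiki article
    SELECT ?place (COUNT(DISTINCT ?paper) AS ?papers) (SUM(?hasArticle) AS ?withArticle) WHERE {
      ?paper (wdt:P31/wdt:P279*) wd:Q11032 ;
             wdt:P291 ?place .
      ?place wdt:P17 wd:Q30 .
      OPTIONAL { ?article schema:about ?paper ;
                          schema:isPartOf <https://en.wikipedia.org/> . }
      BIND(IF(BOUND(?article), 1, 0) AS ?hasArticle)
    }
    GROUP BY ?place
  }
  ?place wdt:P625 ?coords .
  # green if at least one local paper has an article, red otherwise
  BIND(IF(?withArticle > 0, "008000", "CC0000") AS ?rgb)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}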

Anyway, just wanted to give you an overview of what I'd learned. I don't know whether you discussed this import at WikiProject Periodicals (or a similar venue) prior to performing it, but if not, I'd urge you to do that in the future, to at least have a chance of detecting these kinds of issues ahead of time. I know it's a learning process for all of us, so please don't take that as anything but a suggestion on how to improve future imports.

If you do have thoughts on how to address any of the issues I brought up, I'd be very interested to hear.

IagoQnsi (talkcontribs)

Hi Pete, thanks for the detailed message and the time you've put into this. My apologies for the high duplicate rate -- I had expected it to be much lower. I think the core issue is really that it's just hard to de-dupe newspapers due to how they're named; many newspapers have similar or identical names, and newspapers are often known by several name variants. My goal wasn't really to import new newspapers so much as to import newspapers.com links -- it just worked out that I wasn't able to automatically match that many of the links to existing items.

I don't know that there's an easy solution to this situation. Perhaps we could have better tooling for identifying likely duplicates, but I think this is fundamentally a problem that requires lots of manual cleanup.

IagoQnsi (talkcontribs)

I also do wonder if the rate of duplicates you've found stands up across the entire dataset, as Newspapers.com's collection isn't evenly distributed across the country. They seem to have a particular emphasis in the middle of the country -- in Kansas, Nebraska, and Oklahoma. When I was working on the import, I found that a lot of these newspapers were very obscure; perhaps they existed for two years in a town that existed for ten years but has now been abandoned for a century. I actually had to create a surprising number of new items for the towns these newspapers existed in, as they had not yet made their way into Wikidata. This is why I went forward with the import despite the volume of new items -- it seemed likely to me that a majority of them were indeed completely new.
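
If someone wanted to check that, a rough breakdown of the Newspapers.com-linked items by state might look something like this (it walks place of publication up the administrative hierarchy to the containing U.S. state, so it may be slow and could need a LIMIT):

SELECT ?state ?stateLabel (COUNT(DISTINCT ?paper) AS ?papers) WHERE {
  # every item carrying a Newspapers.com ID, rolled up to its U.S. state
  ?paper wdt:P7259 ?np ;
         wdt:P291 ?place .
  ?place wdt:P131* ?state .
  ?state wdt:P31 wd:Q35657 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en".
                           ?state rdfs:label ?stateLabel. }
}
GROUP BY ?state ?stateLabel
ORDER BY DESC(?papers)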

IagoQnsi (talkcontribs)

By the way, how are you finding all these duplicates? Do you just use that map? I'd be happy to help out in the de-duping process.

Matthias Winkelmann (talkcontribs)

I found a cool 400 duplicates with this very straightforward query: identical places of publication, identical labels, and no dates of inception/dissolution that would differentiate them:

SELECT DISTINCT ?item ?other ?itemIncept ?otherIncept ?itemPubLabel ?otherPubLabel ?np ?label ?otherLabel WHERE {

 # items from the import: anything carrying a Newspapers.com ID (P7259)
 ?item wdt:P7259 ?np.
 ?item rdfs:label ?label.
 FILTER(LANG(?label) = 'en').
 # any other newspaper item with exactly the same English label
 ?other rdfs:label ?label.
 ?other (wdt:P31/wdt:P279*) wd:Q11032 .
 FILTER(?other != ?item).
 # ...and the same place of publication (P291); inception (P571) is pulled in for manual review
 OPTIONAL { ?item wdt:P291 ?itemPub.
            ?other wdt:P291 ?otherPub. }
 OPTIONAL { ?item wdt:P571 ?itemIncept.
            ?other wdt:P571 ?otherIncept. }
 FILTER(?itemPub = ?otherPub)
 SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }

}

It will probably not show much after I've merged these. But you can find thousands more with a bit of creativity, such as dropping "The" from labels, etc. Example: Chadron Record (Q55667983) and The Chadron Recorder (Q100288438).
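
For instance, a variant that strips a leading "The " before comparing labels (still requiring a shared place of publication) could look like this. It still won't catch pairs like Record/Recorder, so treat it as a first pass only:

SELECT ?item ?other ?label ?otherLabel WHERE {
  ?item wdt:P7259 ?np ;
        rdfs:label ?label ;
        wdt:P291 ?pub .
  FILTER(LANG(?label) = "en")
  ?other (wdt:P31/wdt:P279*) wd:Q11032 ;
         rdfs:label ?otherLabel ;
         wdt:P291 ?pub .
  FILTER(LANG(?otherLabel) = "en")
  FILTER(?other != ?item)
  # compare labels after stripping a leading "The "
  BIND(REPLACE(STR(?label), "^The ", "") AS ?norm)
  BIND(REPLACE(STR(?otherLabel), "^The ", "") AS ?otherNorm)
  FILTER(?norm = ?otherNorm && STR(?label) != STR(?otherLabel))
}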

Checking for duplicates at that level is what I would consider the bare minimum level of care before creating thousands of new items.

Duplicate items are far worse than missing data, because they create an illusion of knowledge/truth, i.e. the consumer of such data will wrangle with "unknown unknowns" instead of "known unknowns". With that in mind, it's simply unacceptable to create tens of thousands of items when the work is shoddy enough to warrant a disclaimer in the description: "Created a new Item: adding Sports-Reference college basketball players (likely contains some duplicates)" (see here for 500 out of x usages of that: ).

About half of the duplicates I cleaned up were instances where you created both the original and the duplicate, meaning you didn't even deduplicate within your own data. "Simply sorting alphabetically would have made it easy to sort this out"... is something I would usually say. But many of these cases have (had) consecutive IDs, meaning they were already sorted alphabetically. You just didn't care enough to quickly scroll through the data?

IagoQnsi (talkcontribs)

You're right, I should have caught those. I assumed that Newspapers.com would have done such basic de-duping, as their website presents each newspaper as being a distinct and complete entity. Clearly I was mistaken in this assumption. Mea culpa.

The batch of basketball players I tagged as "likely contains some duplicates" was the set of players in which OpenRefine had found a potential match, but with a very low level of confidence. I manually checked a number of these and found that the matches were completely wrong most of the time, but occasionally there was a correct match. To me the rate of duplicates seemed fairly low and so I figured it was worth having rather than leaving a gap in the SRCBB data.

Although I agree that I could and should have done more to clean up the Newspapers.com batch, I disagree that no data is better than data with duplicates. Duplicates are easily merged, and de-duping is a job that is excellently handled by the crowdsourcing of a wiki. It's very difficult for 1 person to solve 5000 potential duplicates, but it's very easy for 5000 people to stumble upon 1 duplicate each and de-dupe them.

Matthias Winkelmann (talkcontribs)

I just noticed you also added 16k+ aliases that are identical to the labels, which is a complete waste of resources, among other things. As to who should clean that (and everything else) up, I disagree with the idea that "it's very easy for 5000 people to stumble upon 1 duplicate each and de-dupe them", except in the sense that it is easier for you. I'll also cite this:

"Users doing the edits are responsible for fixing or undoing their changes if issues are found."

(from Help:QuickStatements#Best practices)
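
The label-identical aliases, at least, are trivial to list; a query along these lines should surface them within the Newspapers.com batch:

SELECT ?item ?label WHERE {
  ?item wdt:P7259 ?np ;
        rdfs:label ?label ;
        skos:altLabel ?alias .
  # an alias that merely repeats the English label adds nothing
  FILTER(LANG(?label) = "en" && LANG(?alias) = "en")
  FILTER(STR(?label) = STR(?alias))
}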

IagoQnsi (talkcontribs)

Certainly I agree that problems should be dealt with by the original editor. I should have deduped more and I take responsibility for that. I was talking about deduping of the type that can't be done automatically -- things that require manual inspection to determine that they are the same item. I had thought when I uploaded the dataset that the duplicates would be primarily of this type. I'm sorry that my data upload was lower quality than I had thought and intended it to be.

IagoQnsi (talkcontribs)

@Matthias Winkelmann Here's an example of the kinds of complex cases that I didn't want to systematically apply merges to. The town of Cain City, Kansas has three similarly named newspapers on Newspapers.com: Cain City News, The Cain City News, and Cain-City News. I initially merged all three of these into one item, but upon further investigation, I discovered that only the 2nd and 3rd were duplicates; the 1st entry was a different newspaper. Cain City News (Q100252116) had its Vol. 1 No. 1 issue in 1889, while The Cain City News (Q100252118) had its Vol. 1 No. 1 in 1882 (and its archives end in 1886). In merging newspapers just on the basis of having the same title and the same location, you bulldoze over these sorts of cases. This is why I was so hesitant to merge seeming duplicates -- they often aren't in fact duplicates.
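
If we do want semi-automated candidates despite that, one compromise might be to only surface same-label, same-place pairs whose inception years don't contradict each other; a rough sketch:

SELECT ?item ?other ?label ?itemIncept ?otherIncept WHERE {
  ?item wdt:P7259 ?np ;
        rdfs:label ?label ;
        wdt:P291 ?place .
  FILTER(LANG(?label) = "en")
  ?other (wdt:P31/wdt:P279*) wd:Q11032 ;
         rdfs:label ?label ;
         wdt:P291 ?place .
  FILTER(STR(?item) < STR(?other))
  OPTIONAL { ?item wdt:P571 ?itemIncept. }
  OPTIONAL { ?other wdt:P571 ?otherIncept. }
  # drop pairs whose inception years clearly differ (e.g. the 1882 vs. 1889 Cain City papers)
  FILTER(!BOUND(?itemIncept) || !BOUND(?otherIncept) || YEAR(?itemIncept) = YEAR(?otherIncept))
}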

Peteforsyth (talkcontribs)

Oh my, I thought I had replied here long ago, but must have failed to save.

First, I want to say, while it's true this issue has been pretty frustrating to our campaign and ability to access information about our subject matter and our progress, I understand that it came about through good faith efforts. I do think there are some important lessons for next time (and I think it's worth some further discussion, here and maybe at a more general venue, to figure out exactly what those lessons are -- as they may not be 100% clear to anybody yet.)

Specifically in response to the comment immediately above, about the Cain City News: Personally, I strongly disagree with the conclusion; I understand that such merges would be sub-optimal, but in the grand scheme of things, if the choice is:

  • Create tens of thousands of items, of which maybe 40% are duplicates, or
  • De-dupe, potentially merging some hundreds or even thousands of items that are not actually the same, and then create far fewer items

I think the second option is VASTLY superior. These are items that did not previously exist in Wikidata; to the extent they are considered important, they will be fixed up by human editors and/or automated processes over time.

Furthermore, with many smart minds on the problem, it might be possible to substantially reduce the number of false-positives in the de-duping process, so that most items like the Cain City News get caught and fixed ahead of time. (Maybe.)

Which is all to say, it's my understanding that best practice in these cases is to consult meaningfully with other Wikidata editors prior to importing thousands of items, and to allow some time for the discussion to unfold and ideas to emerge. I think this particular instance really underscores that need. Maybe others would agree with your assessment about items like Cain City, or maybe they would agree with me; but without asking the question, how do we know? We don't. It's important in a collaborative environment to assess consensus prior to taking large-scale actions that are difficult to undo.

Anyway, I think it would be a good idea to bring some of the points in this discussion up at WikiProject Periodicals or similar, and get some other perspectives that could inform future imports.

Reply to "Newspaper items"