Topic on User talk:Hjfocs

Lots of error'd twitter handles

Summary by Hjfocs

[soweego 2] in situ evaluation, see M3.3. in m:Project/Hjfocs/soweego_2#Work_package

BrokenSegue (talkcontribs)

Can you explain how your bot matched Twitter handles to items? I'm seeing lots of very wrong matches. For example, in Q45526049 it added a Twitter handle to someone from ancient China. I've found hundreds of such examples.

BrokenSegue (talkcontribs)

Huh, the bot has even added the same Twitter handle to multiple different incorrect items, e.g. Q45502815 and Q45607716, which should be impossible.

BrokenSegue (talkcontribs)
Hjfocs (talkcontribs)

Hi BrokenSegue, thanks a lot for spotting and fixing those obvious errors: that's very valuable for the bot, as it can learn from the mistakes it makes. The bot uploads Twitter identifiers that are considered confident by the underlying system, SocialLink. More specifically, we tried to filter out non-living individuals in the process, but unfortunately death dates are not always available, which is probably the main reason behind the obvious errors in the Ming dynasty you pointed out. On the other hand, I agree that Alex Gough (Q2114948) looks like a more reasonable error. Before reverting the whole batch of edits, could you please share more detailed information on the errors? Do you have any references for the hundreds of examples you found? That would be really useful. Thanks again for your time. Cheers!

BrokenSegue (talkcontribs)

Ah, ok. Yeah, so I may have miscounted the number of obvious errors. I've reverted at least a hundred through QuickStatements though (e.g. this batch). There are also some really difficult cases like Q51166600, where we have a Twitter username ("WeiYinChen16") that is used across 10 items. I'm not sure what kind of ML SocialLink is using, but it's clearly being too aggressive, and I'm guessing it doesn't attempt any global optimizations (e.g. "well, this Twitter account matches these 5 items, but it matches this one best").
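To make the "global optimization" idea concrete, here is a minimal sketch of a one-handle-one-item post-filter. It is not how SocialLink or soweego works; the function name `best_item_per_handle`, the `min_margin` parameter, and the scores in the usage note are all hypothetical, and the only assumption is that each candidate match comes with a numeric confidence score.

```python
from collections import defaultdict


def best_item_per_handle(candidates, min_margin=0.0):
    """Keep at most one item per Twitter handle.

    candidates: iterable of (handle, item_id, score) tuples, where score is
    a hypothetical confidence value (higher means more confident).
    A handle is dropped entirely when its two best scores are within
    min_margin of each other, i.e. when there is no clear winner.
    """
    by_handle = defaultdict(list)
    for handle, item_id, score in candidates:
        by_handle[handle].append((score, item_id))

    resolved = {}
    for handle, scored in by_handle.items():
        scored.sort(reverse=True)  # best score first
        if len(scored) > 1 and scored[0][0] - scored[1][0] <= min_margin:
            continue  # ambiguous: leave the handle for manual review
        resolved[handle] = scored[0][1]
    return resolved
```

For instance, with made-up scores, `best_item_per_handle([("WeiYinChen16", "Q708040", 0.9), ("WeiYinChen16", "Q51166600", 0.6)])` would keep only Q708040, while a handle whose candidates all share the same score would not be uploaded at all.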


There's also a ton of Twitter account statements with the reference "stated in Twitter", which is very unhelpful. I see at one point you went back and swapped some out for a different reference, but there's still a ton with the old incorrect one (optimally the reference would mention SocialLink somehow). It would be helpful to know which entries were done using name matching / ML. Is there a reason they haven't all been changed?

Hjfocs (talkcontribs)

First, I'm really grateful that you reverted the ancient China batch.

Wei Chen (Q51166600) is probably wrong: as I can't understand Chinese, I can't judge the Twitter ID. But if we follow the Facebook link inside the Twitter one, the profile picture is clearly a baseball player, so Wei-Yin Chen (Q708040) should be the right one. I've removed the identifier from the other items.

Off the top of my head, this might be a corner case where SocialLink's confidence score is identical across the items. That said, while I agree we should avoid such 1-to-n links, Was a bee proposed an interesting alternative to keep them, see https://github.com/Wikidata/soweego/issues/374

In addition, there's an open ticket that aims at intercepting the constraint check reports, see https://github.com/Wikidata/soweego/issues/266.

Regarding the stated in (P248) reference node, it seems that the process in charge of converting it into based on heuristic (P887) stopped unexpectedly. I'll have to investigate why.

BrokenSegue (talkcontribs)

Here is an example of an item that had a date of birth in the 5th century but was still assigned a twitter handle: Q3734999. Clearly something went wrong.

Hjfocs (talkcontribs)

Totally agree, of course. This is due to the missing date of death: the current filter only excludes items that have one. I acknowledge that we should apply a less trivial filter.
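As a rough illustration of what a less trivial filter could look like, the sketch below also uses the date of birth when the date of death is missing. It is only an assumption about one possible rule, not the bot's actual code: the function name `may_be_living` and the 120-year threshold `MAX_PLAUSIBLE_AGE` are hypothetical.

```python
from datetime import date

# Hypothetical threshold: nobody born more than this many years ago
# is treated as possibly alive.
MAX_PLAUSIBLE_AGE = 120


def may_be_living(birth_year, death_year, today=None):
    """Return True only if the person could plausibly be alive.

    birth_year / death_year are integers or None when the item
    lacks the corresponding statement.
    """
    if death_year is not None:
        return False  # a known date of death rules the item out
    if birth_year is None:
        return False  # conservative choice: no dates, no upload
    current_year = (today or date.today()).year
    return 0 <= current_year - birth_year <= MAX_PLAUSIBLE_AGE
```

Under this rule, an item like Q3734999, with a 5th-century date of birth and no date of death, would be skipped. Items with no dates at all are also skipped, which trades some recall for fewer obvious errors.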