Wikidata:Requests for permissions/Bot/SilentSpikeBot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 19:48, 26 February 2020 (UTC)[reply]
SilentSpikeBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: SilentSpike (talk • contribs • logs)
Task/s: Add X numeric user ID (P6552), number of subscribers (P3744), has characteristic (P1552) and point in time (P585) qualifiers to X username (P2002) statements (Wikidata:Bot_requests#Add_numeric_id_to_Twitter_username_(P2002)).
Code: Not provided. Using Pywikibot to create SPARQL generator for items with P2002 with no P6552 qualifier, then fetching data (numeric ID, subscriber count and verified status) from Twitter API (developer account approved specifically for this task).
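Since no code was provided, the following is only a minimal sketch of what the SPARQL selection described above might look like: items that have an X username (P2002) statement lacking an X numeric user ID (P6552) qualifier. The function name `build_query` and the optional limit are hypothetical, not taken from the bot.

```python
def build_query(limit=None):
    """Return a SPARQL query for items whose P2002 statement has no P6552 qualifier.

    Hypothetical helper for illustration; the bot's real query was not published.
    """
    query = (
        "SELECT DISTINCT ?item WHERE {\n"
        "  ?item p:P2002 ?statement .\n"
        "  FILTER NOT EXISTS { ?statement pq:P6552 ?numericId . }\n"
        "}"
    )
    if limit is not None:
        # A limit is handy for small test runs such as the 50-edit trial.
        query += f"\nLIMIT {int(limit)}"
    return query


if __name__ == "__main__":
    print(build_query(limit=50))
```

With Pywikibot, a string like this could presumably be handed to `pagegenerators.WikidataSPARQLPageGenerator` to iterate over the matching items; that wiring is an assumption and is not shown here.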
Function details:
- Add missing numeric IDs (assumption that the username still maps to the account originally intended, see extended discussion below).
- Cleanup step: Replace any existing has characteristic (P1552)-qualifier value of verified badge (Q48799541) with verified account or profile (Q28378282) (more appropriate as per discussion below). Homogenises data.
- Cleanup step: If there's a source with a single retrieved (P813) statement, add this as a point in time (P585)-qualifier if no existing qualifier dates the data (minor assumption that when it was retrieved was when the data is true).
- If data is dated (P585, start time (P580) or end time (P582)) at this point, move on to next statement (step 1)
- Add significant follower counts (>500,000) and verified status to statements (replacing any of these as existing qualifiers since the information was not dated)
- Date the data with P585 so it is meaningful
- Write any difficult cases to userpage User:SilentSpikeBot/TwitterForReview for human review (suspended account, etc.)
--SilentSpike (talk) 01:46, 1 February 2020 (UTC)[reply]
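The function details above can be read as a per-statement decision procedure. The sketch below models one P2002 statement as plain Python data rather than Pywikibot objects. The `Statement` class, the `account` dictionary layout and the `process` function are hypothetical, and two points are assumptions rather than confirmed behaviour: existing undated values are only overwritten when new data is added, and P585 is only set when something was added.

```python
from dataclasses import dataclass, field

VERIFIED_BADGE = "Q48799541"     # value to be replaced
VERIFIED_ACCOUNT = "Q28378282"   # preferred value
FOLLOWER_THRESHOLD = 500_000
DATING_PROPS = ("P585", "P580", "P582")


@dataclass
class Statement:
    """Hypothetical stand-in for one X username (P2002) statement."""
    username: str
    qualifiers: dict = field(default_factory=dict)  # property ID -> list of values
    retrieved: list = field(default_factory=list)   # P813 dates found in sources


def process(stmt, account, today, review):
    """Apply the listed steps to one statement.

    `account` is a dict with "id", "followers" and "verified" keys (assumed
    shape of an API lookup), or None when the account is missing or suspended.
    """
    q = stmt.qualifiers
    # Difficult cases go to the review list instead of being edited.
    if account is None or len(q.get("P6552", [])) > 1:
        review.append(stmt.username)
        return stmt
    # Add the numeric ID where missing.
    if "P6552" not in q:
        q["P6552"] = [account["id"]]
    # Cleanup: verified badge -> verified account or profile.
    if "P1552" in q:
        q["P1552"] = [VERIFIED_ACCOUNT if v == VERIFIED_BADGE else v
                      for v in q["P1552"]]
    # Cleanup: a single retrieved date becomes a point-in-time qualifier.
    dated = any(p in q for p in DATING_PROPS)
    if not dated and len(stmt.retrieved) == 1:
        q["P585"] = [stmt.retrieved[0]]
        dated = True
    # Dated data is treated as historical and left alone.
    if dated:
        return stmt
    changed = False
    if account["followers"] > FOLLOWER_THRESHOLD:
        q["P3744"] = [account["followers"]]
        changed = True
    if account["verified"]:
        values = q.setdefault("P1552", [])
        if VERIFIED_ACCOUNT not in values:
            values.append(VERIFIED_ACCOUNT)
        changed = True
    # Date the newly added data so it is meaningful.
    if changed:
        q["P585"] = [today]
    return stmt
```

Keeping the decision logic separate from the editing code would also make it straightforward to collect all changes for an item and upload them together.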
Discussion:
Have successfully tested all expected functionality on test item https://test.wikidata.org/wiki/Q162167 with stand-in properties and item values. Will now perform a test run of 50 edits (starting with just 1 first) here and then link to results. --SilentSpike (talk) 15:12, 1 February 2020 (UTC)[reply]
- Sounds good. Thanks for doing this.
- Maybe for accounts above a threshold, you could create an item like Cristiano Ronaldo Instagram account (Q65676176) and link that. Such an item could then periodically receive new subscriber numbers. This would be option (D1) on Wikidata:Property_proposal/subscribers (note that participants there favored option (C), but I guess most contributors do (B)). Personally I still like (A).
- Good idea to complete has characteristic (P1552) as well. --- Jura 15:31, 1 February 2020 (UTC)[reply]
- Thanks for pointing me in the direction of that discussion, knew I'd seen something like this before. Will investigate better handling for existing subscriber count qualifiers. --SilentSpike (talk) 15:49, 1 February 2020 (UTC)[reply]
- Having read the discussion, this is a bit of a rabbit hole and current data is a bit of a mess since without having a point in time (P585) qualifier it's somewhat meaningless because even the username itself can change. It also seems like many existing statements are using a retrieved (P813) source in place of P585. I've updated the function details above to reflect a simple enough method of avoiding messing with existing historic data while also cleaning things up a bit. Probably easiest to handle bringing in new data for cases with existing data as a separate bot task. --SilentSpike (talk) 20:14, 1 February 2020 (UTC)[reply]
50 successful test edits can now be seen on the contributions page. Did illuminate that I need to be more specific when checking for an existing source as an "imported from" source prevented the bot adding a retrieval date for the subscribers on European Central Bank (Q8901) --SilentSpike (talk) 15:49, 1 February 2020 (UTC)[reply]
Have updated the bot slightly to be much more cautious about adding data (other than the numeric ID) and to only do so where the information is not clearly dated (and thus doesn't need to be preserved as historical). See discussion above for motivations. Went through the first 50 test edits manually and removed additions to reflect this change. Will run a few more test edits just to demonstrate the updated code. --SilentSpike (talk) 21:57, 1 February 2020 (UTC)[reply]
- Ran a few more to demonstrate, everything past François-Xavier Villain (Q12964) is added by the updated bot. --SilentSpike (talk) 22:35, 1 February 2020 (UTC)[reply]
- Looks ok. Personally, I'd probably drop or skip the "retrieve date" when "point in time" is added. I wonder what would be a good threshold for (D1). How about 1 million? --- Jura 12:55, 2 February 2020 (UTC)[reply]
- My understanding was that the source is needed to reflect when the data was retrieved (existing old values misused - which is why I'd convert those to qualifiers), whereas "point in time" is when the data was accurate. Just happens that those are both the same value for these cases. As for D1, I think it's currently too controversial and would result in this bot task being rejected. If there was some consensus somewhere then I'd be happy to follow it up as a separate task. --SilentSpike (talk) 13:26, 2 February 2020 (UTC)[reply]
- If you store the data somewhere on Wikidata (e.g. a user page), I can take care of it. --- Jura 01:26, 10 February 2020 (UTC)[reply]
- @Maxlath: what do you think? --- Jura 13:25, 3 February 2020 (UTC)[reply]
- @Jura1: as I think I mentioned in some previous discussion, I would now be in favour of **not** tracking subscribers count in qualifiers at all, as no elegant solution could be found imho. I'm thus in favour of deleting them (which I'm currently doing for P3984 statements). Having those count as statements' main snak seems acceptable, but that is only possible for items that are the social media account such as Cristiano Ronaldo Instagram account (Q65676176), which I don't find very useful, but that's another topic. -- Maxlath (talk) 13:50, 3 February 2020 (UTC)[reply]
What are you doing in the cases where a twitter handle 404s or is suspended? Should we mark these cases in some way? BrokenSegue (talk) 17:41, 6 February 2020 (UTC)[reply]
- @BrokenSegue: Good question, currently I just skip over them and do nothing --SilentSpike (talk) 00:53, 7 February 2020 (UTC)[reply]
- @SilentSpike: perhaps add the "end time" property as once you are running regularly you won't be far off Back ache (talk) 10:07, 19 February 2020 (UTC)[reply]
- @Back ache, BrokenSegue: What about a has characteristic (P1552) Q49776568 for suspended accounts? This does run into the same complication as follower counts and verified badges as any existing temporal data isn't necessarily valid for added qualifiers. Perhaps I should simplify this bot task for now and just focus on adding the numeric IDs and cleaning up existing data, then in future I could investigate a separate task to handle other metadata once temporal complications have an agreed upon solution. --SilentSpike (talk) 20:57, 19 February 2020 (UTC)[reply]
- @Back ache, SilentSpike: So yeah I think maybe just making a list of "suspended" accounts is the right action for now. There's another wrinkle which is that suspensions are sometimes temporary. Some are probably just the wrong username for various reasons (e.g. the person swapped to a different screen name and the old one got suspended). BrokenSegue (talk) 21:53, 19 February 2020 (UTC)[reply]
I've since added functionality to list complicated cases (suspended, non-existent, multiple numeric ID qualifiers, etc.) over at User:SilentSpikeBot/TwitterForReview for human review. You can see my test edit for this [1]. --SilentSpike (talk) 23:16, 21 February 2020 (UTC)[reply]
Latest Update: Have updated the bot task request and bot code to clean and add Twitter data in a way that is non-destructive to any existing dated data. Ran some test edits on Q8093#P2002 where you can see data was added and preserved where appropriate. Across my various batches of test edits I believe I have shown myself competent and cautious, and this task is now ready to run. Property discussion below is somewhat unresolved, but for the purposes of this task status quo data structuring can be used and everyone who has responded seems approving or indifferent to this bot task and finalised function details. --SilentSpike (talk) 23:48, 23 February 2020 (UTC)[reply]
Made the bot more efficient by uploading all Twitter data changes to an item as a single edit. See latest 3 bot contribution test edits. --SilentSpike (talk) 12:39, 24 February 2020 (UTC)[reply]
Relevant property discussion
@Jura1, Andrew_Gray, Trade, BrokenSegue, Back ache, Maxlath: Okay, so I'm pinging users who have been involved in discussion on talk pages of X username (P2002) and X numeric user ID (P6552) as it seems that the way these properties should be used was changed with little consensus. Also users who have commented here as more input is always good and they may be interested.
I'm going to simplify this bot task request to simply cleaning up existing twitter data. That is:
- Adding numeric IDs where missing
- Changing single "retrieved" source statements to "point in time" qualifiers (more appropriate)
- Changing "has quality" -> "verified badge" to "has quality" -> "verified account" (homogenise data)
- Listing encountered suspended/non-existent usernames on a userpage somewhere for review
Before I do this though, it seems to me that really the numeric twitter ID should be the main statement and the username a qualifier on that statement (rather than the other way round as currently done). Why? The numeric ID is the real identifier used by Twitter (as can be seen when using their API). The display name is not stable data and requires the addition of the user ID and a temporal qualifier ("point in time", "start date" or "end date") to be truly useful since it can be changed and there would be no way of knowing if it's still the same account (or when it was) without this data. I would go so far as to say it's not a true external identifier and rather just string data.
If the user ID becomes the main statement, it seems to me that life becomes easier in terms of data quality. If any of the account metadata (username, verified status, suspended status, follower count) changes then this can be tracked in time using temporal qualifiers and multiple statements for the same user ID. Whereas multiple statements for the same username could be different in both time and an actual different account (making life complicated). Plus you can always get the latest account metadata from the user ID alone for comparison to any qualifiers.
If we can get some consensus here that this approach seems more desirable (in terms of the data), then I would also update this bot task request to convert username statements into qualifiers on user ID statements (and personally update both properties for this usage). Please vote/discuss below. --SilentSpike (talk) 12:40, 20 February 2020 (UTC)[reply]
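To illustrate the proposed structure, here is a minimal sketch in which the numeric ID is the main value and the username is a dated qualifier, so that renames of one account can be tracked as several statements sharing the same ID. The class `IdStatement`, the function `username_at` and all sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IdStatement:
    """Hypothetical P6552 main statement with P2002 and P585 qualifiers."""
    user_id: str         # main value: X numeric user ID (P6552)
    username: str        # qualifier: X username (P2002)
    point_in_time: date  # qualifier: point in time (P585)


def username_at(statements, user_id, when):
    """Return the most recent recorded username for user_id on or before `when`."""
    candidates = [s for s in statements
                  if s.user_id == user_id and s.point_in_time <= when]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.point_in_time).username


if __name__ == "__main__":
    history = [
        IdStatement("1001", "old_name", date(2019, 1, 1)),
        IdStatement("1001", "new_name", date(2020, 1, 1)),
    ]
    print(username_at(history, "1001", date(2019, 6, 1)))
    print(username_at(history, "1001", date(2020, 2, 1)))
```

Because the main value never changes, every dated username unambiguously refers to the same account, which is the property the proposal relies on.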
- @Nikki, Mbch331, Pigsonthewing, Edoderoo, Shisma, Prefall: @Jeluang_Terluang, Deansfa, Jc86035, NMaia, Tinker_Bell, Mahir256: Also pinging users involved in creation of both properties for input on this task. --SilentSpike (talk) 13:36, 20 February 2020 (UTC)[reply]
- Why would you do "Changing 'has quality' -> 'verified account' to 'has quality' -> 'verified badge'"? The accounts are verified; they are not badges. Please do the reverse of this step. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:51, 20 February 2020 (UTC)[reply]
- @Pigsonthewing: To me it does not matter which way round the conversion is, would be happy to do the inverse so long as data is homogenised. Was just going off of the example property usages. Do you have thoughts on usage of the numeric ID as the main property? --SilentSpike (talk) 14:26, 20 February 2020 (UTC)[reply]
- I'm in favour of using X numeric user ID (P6552) as the main property instead of X username (P2002). The practice of using the latter as the main property has always baffled me from the very start. — Jeluang Terluang (talk) 14:08, 20 February 2020 (UTC)[reply]
- The logical thing to do would have been to include only the numeric id, but historically, we started out with the username. Besides, for most of our contributors, it might be hard to figure out the numeric one. Maybe we could have a bot running that regularly does the conversion. @Mbch331: who does or did that for YouTube channels.
I do think it's a bad idea to list username and numeric ID in separate statements. We wouldn't really know if they belong together, whether one can be used to update the other if it changes, or how they relate if there are several.--- Jura 14:28, 20 February 2020 (UTC)[reply]
- I think the fact that it may be hard for an editor to find should not influence the way we structure data. I would again appeal to the idea that the username is not truly an identifier, but just string metadata associated with the numeric identifier. This bot task in itself is inherently somewhat flawed as we cannot guarantee that the username data in Wikidata still matches the intended numeric ID retrieved using that username. Whereas if the data was the other way around this would not be an issue. The longer this takes to change the more stale the data becomes. --SilentSpike (talk) 14:39, 20 February 2020 (UTC)[reply]
- Agree. Can you advertise this debate on both property talk pages? --- Jura 14:47, 20 February 2020 (UTC)[reply]
- Done. If consensus here is to reverse these property roles then I will also mark old conversations as outdated. --SilentSpike (talk) 15:06, 20 February 2020 (UTC)[reply]
- Strongly disagree here on requiring a user id to enter a screen name. Basically nobody knows how to get a user id from a screen name and the swaps aren't that common. User experience in entering the data should be taken into consideration. I'd rather have great coverage with some bad data than bad coverage with perfect data. Running a bot constantly to fill in the ids seems tractable. I'm fine with any compromise solution that allows people to enter a screen name without knowing the ID (so entering "unknown" as the id or having two properties or the current situation or whatever). BrokenSegue (talk) 15:58, 20 February 2020 (UTC)[reply]
- I do understand that the user ID is harder to obtain as a human editor and would tend to agree that ideally we could set things up so that a display name can be entered and then a bot or similar automatically converts this into a user ID statement in quick fashion. This bot task would take care of that and I imagine I'd be running it pretty frequently, although I don't currently have a setup to continuously run it. That said, I think entering solely username data is problematic and strongly feel that we should be using IDs as the main statement since those are the true identifier. Even Twitter themselves suggest using User IDs:
- If you need to share Twitter content you obtained via the Twitter APIs with another party, the best way to do so is by sharing Tweet IDs, Direct Message IDs, and/or User IDs, which the end user of the content can then rehydrate (i.e. request the full Tweet, user, or Direct Message content) using the Twitter APIs. --SilentSpike (talk) 21:41, 20 February 2020 (UTC)[reply]
- I have no strong opinions about how this should be done, but I'm very happy for someone to go ahead and update/homogenise data so it's consistent and we have some kind of meaningful name<>id link. For has characteristic (P1552), I personally would find it really useful if you could add this as you go along, as well as standardising (it was in the original proposal, but not the revised one). But if you think this would be best as a subsequent dedicated bot run, fair enough during the campaign. Andrew Gray (talk) 19:48, 20 February 2020 (UTC)[reply]
- One important concern has not been raised yet: the reusers. P2002 is used by more than a hundred templates according to its talk page, although this list is almost certainly incomplete even for Wikimedia wikis, not to speak about third-party users. These templates must be updated to still display the data they currently display. There may be even templates that not only link to the Twitter profile, but also display the user name—these should still display the user name, and not the user ID, which may be challenging on smaller wikis that lack extensive Lua support for Wikidata. Also, why is it better to store the user ID in the main statement and the user name in qualifier instead of vice versa? For accurate information this statement should be qualified with dates and regularly updated anyway, just like if the user name would be the main statement. If we want to force users to provide the user ID as well, this can be done with a required qualifier constraint (forbidding the use of the user name as a main statement would be a constraint, too). —Tacsipacsi (talk) 22:30, 20 February 2020 (UTC)[reply]
- I can address why is it better to store the user ID in the main statement and the user name in qualifier instead of vice versa? first. The simple answer is because literally the user ID is an external identifier while the username is not (it is string metadata associated with the external identifier that changes in time). This means any main statement of username requires both a time qualifier and ID qualifier to actually be meaningful (without the ID you cannot know if the username still applies to the item by checking twitter, while without a time qualifier it cannot be assumed that the statement was ever true if twitter no longer matches). Meanwhile a main statement of the user ID does not require any qualifier, it is only enhanced by qualifiers. This also makes life easier specifically for twitter data where almost all commonly used qualifiers can change in time because they can then more easily be tracked with multiple statements knowing that the main value isn't also changing. This approach also does not rely on users having to add qualifiers because the main statement is now meaningful without them. --SilentSpike (talk) 23:09, 20 February 2020 (UTC)[reply]
- As humans work with names and not IDs, the account name will almost always be added anyway. As digging in the DOM tree is not for everyday users (most of them won’t even know what “inspect” means in Property talk:P6552#Finding the user ID on Twitter's new redesign), most probably they will have to enter “unknown” for the numeric ID, which will have to be subsequently cleaned up by a bot, adding the user ID. This means that most statements will have two qualifiers anyway, although it will be more inconvenient for humans to add the statement. —Tacsipacsi (talk) 20:59, 21 February 2020 (UTC)[reply]
- I'm not sure I follow the reasoning, because if we're going to have to rely on a bot to add the numeric ID either way then surely it makes sense to have the bot structure the data in the way that makes the data meaningful by default (i.e. the ID claim alone is meaningful, the username claim alone is not). I would even suggest that having an ID claim with "unknown" and a username qualifier is at least more useful as data than just adding a username string with no other qualifiers. --SilentSpike (talk) 22:58, 21 February 2020 (UTC)[reply]
- But recording data with setting unknown user ID is much less intuitive than simply pasting the user name string. That form can also be normalized by the bot, though. —Tacsipacsi (talk) 23:40, 22 February 2020 (UTC)[reply]
- As for One important concern has not been raised yet: the reusers. This is a very valid concern and something that has crossed my mind, but I only see this data becoming more stale as time goes on until this band-aid is ripped off (as the majority of statements do not have a numeric ID and point in time qualifier). --SilentSpike (talk) 23:12, 20 February 2020 (UTC)[reply]
- We have two groups of Twitter user names. The first group are the ones that are already dead: they are irreversibly gone, we can’t do anything about them. The second group are the ones that are not yet dead: if the bot adds the qualifiers, they won’t become irrecoverably dead. This is true regardless of which property is the main statement. —Tacsipacsi (talk) 20:59, 21 February 2020 (UTC)[reply]
- I do agree that adding the missing numeric IDs is the most pressing concern here. However, the third group are usernames which are not dead, but now mismatched to a different account, there's nothing we can do to fix this for existing statements and by adding the ID associated with the usernames we're really just hoping they're all still the same. However, we can prevent this for all future statements by making the ID the main claim which a bot can easily do (and it seems like a bot is needed either way). So why not have the bot take care of this while it's already processing this data? --SilentSpike (talk) 22:58, 21 February 2020 (UTC)[reply]
- Because of the reusers. It was noted below how easily the user name can be obtained on enwiki; but that’s enwiki, on smaller wikis it may not be possible. The main statement can be retrieved even without Lua, using the parser functions; getting a qualifier is impossible even with certain Lua modules. Smaller wikis often don’t have the capacity to import and maintain (regularly update, localize) complex Lua modules. (For example my home wiki, huwiki has a Lua module, but that’s incapable of returning the qualifier alone, and we don’t have the capacity to deal with yet another module.) Not to speak about non-wiki reusers like custom SPARQL queries, which are in theory doable using the qualifier method, but that’s more expensive, so more likely to time out; and also these queries are not listed anywhere, so they’ll just suddenly stop working, without any prior warnings. —Tacsipacsi (talk) 23:40, 22 February 2020 (UTC)[reply]
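For reference, the difference between the two query shapes mentioned above can be sketched as follows. Under the current model the username is available directly through the `wdt:` prefix, while under the proposed model a reuser would have to go through the full statement node to reach the qualifier. Both queries and the helper `query_for` are illustrative assumptions, and no claim is made here about their actual running cost.

```python
# Current model: username is the main value, reachable in one triple pattern.
CURRENT_MODEL = """
SELECT ?item ?username WHERE {
  ?item wdt:P2002 ?username .
}
"""

# Proposed model (hypothetical): numeric ID is the main value and the
# username is read from a qualifier on the statement node.
PROPOSED_MODEL = """
SELECT ?item ?userId ?username WHERE {
  ?item p:P6552 ?statement .
  ?statement ps:P6552 ?userId .
  OPTIONAL { ?statement pq:P2002 ?username . }
}
"""


def query_for(model):
    """Return the query text for 'current' or 'proposed' (hypothetical helper)."""
    queries = {"current": CURRENT_MODEL, "proposed": PROPOSED_MODEL}
    return queries[model].strip()


if __name__ == "__main__":
    print(query_for("current"))
    print(query_for("proposed"))
```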
- For smaller wikis, I would ask, is there a use case for the username which the numeric ID cannot fulfil? I would imagine most wikis are simply linking to it as the social media related to something, which the ID can fill the purpose of. As for data users downstream, surely they would rather use the ID anyway because it is more reliable data? The current up-to-date username can even be fetched from Twitter this way, whereas currently Wikidata is presenting the username to downstream users as if it is definitely correct information which isn't guaranteed. --SilentSpike (talk) 00:04, 23 February 2020 (UTC)[reply]
- Smaller wikis may for example want to display the user name (like “Twitter: realDonaldTrump”), but probably the most common issue is template parameters: if a template calls {{Twitter|{{{twitter_name|{{#property:P2002}}}}}}} today, the {{Twitter}} template (and all other templates in the template call chain) needs to be rewritten to have a different parameter for the user ID, and even with that it’s painful to recreate the exact behavior (i.e., empty and missing parameter being different). Third-party queries will have trouble when the query result depends on the user name (not only includes it; e.g. it’s in a FILTER statement; see the complex constraint on Property talk:P2002); otherwise they could call the query service a second time just to query the qualifiers, which avoids timeouts. —Tacsipacsi (talk) 17:24, 23 February 2020 (UTC)[reply]
- I think the concerns by Tacsipacsi are valid and I would generally raise similar ones.
However, if the stats mentioned by Andrew Gray at Property_talk:P2002#Renamed_accounts? are correct, we have a major problem with status quo: the data is in such shape that we need to consider if we should continue including it in such a form or delete it entirely. 25% is just too much. --- Jura 14:12, 21 February 2020 (UTC)[reply]
- I don't have exact data on what fraction of the usernames are valid but I would guess the 25% error rate figure is an overestimate. Political figures are more likely than most to change their screen name (based on my anecdotal review) because they often put their title in their name and are often verified which means they have access to rename functionality. BrokenSegue (talk) 20:33, 21 February 2020 (UTC)[reply]
- I too would suspect that political twitter accounts probably change display name significantly more often than most. It's likely that significantly less than 25% of all twitter data in Wikidata is bad, but it's also true that because we've been prioritising usernames, data is more likely to be bad than if we prioritised user IDs (where the only bad data would be due to incorrect data). --SilentSpike (talk) 23:09, 21 February 2020 (UTC)[reply]
- FWIW, I think the 25% was correct for that particular sample, but I would agree it wouldn't be representative for most other accounts; what probably happened was that we gathered the accounts during the election campaign and did not update afterwards when they all became JoeBloggsMP rather than JoeBloggsNorwich or something. Across all users... I don't know, but I would estimate it at much much lower. Andrew Gray (talk) 12:41, 22 February 2020 (UTC)[reply]
- With numeric ids, we can more easily gather such numbers. --- Jura 13:46, 22 February 2020 (UTC)[reply]
- @Jura1, Tacsipacsi: Perhaps a transitional compromise could be found where the bot can add new ID statements as main statements and the username statements be left alone (other than adding an ID qualifier so that in future the statements can be matched to the user ID statements). Once templates are moved over from P2002 to P6552 then the old data can be removed. --SilentSpike (talk) 23:09, 21 February 2020 (UTC)[reply]
- Disconnect the two? I'd rather not. Either we stay with the current model or switch the two. Aren't most uses on WP through Template:Twitter (Q6741634)? It shouldn't be too complicated to fix them. --- Jura 23:21, 21 February 2020 (UTC)[reply]
- Was thinking more along the lines of data duplication. Although I tend to agree with you - this seems impractical. I obviously pretty strongly favour restructuring this data since either way we're going to need this bot task to add numeric IDs when users fail to do so (I don't expect this to be a run and done situation). If we are going with the existing model then I would highly suggest we make P585 a mandatory qualifier along with the numeric ID. Unsure how we formally decide which approach to take though. Of those who've responded here so far it seems like we have 3 in favour of restructuring, 1 against and 1 impartial. --SilentSpike (talk) 23:34, 21 February 2020 (UTC)[reply]
- If we provide updates for a reasonable part of the Wikipedia uses, there shouldn't be much of an issue of converting them. For enwiki
{{#invoke:WikidataIB|getValue|P6552|qual=P2002|fetchwikidata=ALL|onlysourced=no|qualsonly=yes|noicon=yes}}
- should work. I tested that with Q136217#P6552 on preview at w:Erzincan.
- Also, we could proceed in two steps: first add P6552-statements with the qualifier P2002, then remove P2002-statements. The first step wouldn't impact users. --- Jura 10:27, 22 February 2020 (UTC)[reply]
- Nice! The two-step idea is what I had in mind above, because the first step is not destructive, only additive. --SilentSpike (talk) 10:54, 22 February 2020 (UTC)[reply]
- Given the problems with edit rates, it might not even be technically possible to do otherwise. --- Jura 12:00, 22 February 2020 (UTC)[reply]
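The additive first step of the two-step idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `invert_statement` is a hypothetical helper, statements are modelled as plain dicts rather than Pywikibot claims, and carrying the remaining qualifiers (such as P585) over to the new statement is an assumption, not something agreed in this discussion.

```python
def invert_statement(username, qualifiers):
    """Build a P6552 main statement from an existing P2002 statement.

    username:   the P2002 main value (hypothetical plain string)
    qualifiers: dict mapping property id -> list of qualifier values
    Returns a dict describing the new statement, or None when the
    P2002 statement does not carry exactly one numeric ID qualifier.
    """
    ids = qualifiers.get("P6552", [])
    if len(ids) != 1:
        # Nothing to invert yet, or ambiguous: leave for the ID-adding task.
        return None
    # Assumption: keep every other qualifier (dates, follower count, ...).
    new_qualifiers = {p: list(v) for p, v in qualifiers.items() if p != "P6552"}
    new_qualifiers["P2002"] = [username]
    return {"property": "P6552", "value": ids[0], "qualifiers": new_qualifiers}
```

Because the function only returns a statement to add and never touches the P2002 input, running it changes nothing for existing users of P2002; removal would be a separate, later pass.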
- @Jura1, Tacsipacsi: From our small sample size, it seems that there probably isn't consensus to change the structure of this data. However, it does seem as though we all agree that adding the missing numeric ID data is the most important task here. Everyone who has replied seems to be content with the data being cleaned up, so here is what I propose:
- We add a constraint to P2002 to require a qualifier of one of: P580, P582, P585 (start, end, point in time) because the username must be placed in time if it is to be useful
- We update the P2002 single best value constraint to only be separated by the numeric ID qualifier (since this is the thing which defines which account it actually is)
- We document these changes on the talk pages so that other users do not change them (hopefully)
- I update this bot task to:
- Add missing numeric IDs
- Cleanup retrieved sources into point in time qualifiers (if data is undated)
- Homogenise the verified badge -> account statements
- Add significant follower counts and verified status to statements if the data is undated (replacing any of these as existing qualifiers since the information is not dated)
- Date the data with P585 so it is meaningful
- Write any difficult cases to userpage User:SilentSpikeBot/TwitterForReview for human review (suspended account, etc.)
- This discussion can be continued elsewhere if needed. Perhaps a request for comment is needed to make such a large data change.
- If we agree on this then I will update the task and post this to the bureaucrats' noticeboard.
- Although there is still some weirdness with statements that have P580 or P582 and the follower count (because that data does not mix well with start/end time when it is a snapshot). Example Q76#P2002. --SilentSpike (talk) 00:47, 23 February 2020 (UTC)[reply]
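The per-statement order of operations in the proposed task can be sketched as follows. This is not the bot's code (none was published); `plan_actions`, the dict-based data shapes and the action tuples are hypothetical, and the 500,000 threshold is taken from the function details at the top of this request. The sketch only plans edits and does not call Pywikibot or the Twitter API.

```python
VERIFIED_BADGE = "Q48799541"
VERIFIED_ACCOUNT = "Q28378282"
DATE_PROPS = ("P585", "P580", "P582")
FOLLOWER_THRESHOLD = 500_000  # "significant" follower count per the task description


def plan_actions(qualifiers, retrieved_dates, api_user, today):
    """Plan edits for one P2002 statement.

    qualifiers:      dict mapping property id -> list of qualifier values
    retrieved_dates: P813 values found in the statement's references
    api_user:        dict with 'id', 'followers', 'verified', or None when
                     the lookup failed (suspended account, etc.)
    today:           date value to use for a new P585 qualifier
    Returns a list of (action, property, value) tuples.
    """
    if api_user is None:
        # Difficult case: hand over to human review instead of editing.
        return [("review", None, None)]

    actions = []
    if "P6552" not in qualifiers:
        actions.append(("add", "P6552", api_user["id"]))
    if VERIFIED_BADGE in qualifiers.get("P1552", []):
        actions.append(("replace", "P1552", VERIFIED_ACCOUNT))

    dated = any(p in qualifiers for p in DATE_PROPS)
    if not dated and len(retrieved_dates) == 1:
        # Assumption from the proposal: retrieval date ~ when the data was true.
        actions.append(("add", "P585", retrieved_dates[0]))
        dated = True
    if dated:
        # Dated data is left alone apart from the cleanup steps above.
        return actions

    if api_user["followers"] > FOLLOWER_THRESHOLD:
        actions.append(("set", "P3744", api_user["followers"]))
    if api_user["verified"]:
        actions.append(("set", "P1552", VERIFIED_ACCOUNT))
    actions.append(("add", "P585", today))
    return actions
```

Note that a statement with only P580 or P582 counts as dated here, so no follower count is added to it, which sidesteps the snapshot problem mentioned for Q76#P2002 without resolving it.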
- Now that we announced it everywhere, I think we should give it some time. I wouldn't expect a massive interest in this somewhat obscure question. I think there is an advantage of the main value not needing date qualifiers when being read. I'm sure we could find a solution for huwiki and SPARQL queries when needed. --- Jura 05:13, 23 February 2020 (UTC)[reply]
- This plan seems acceptable for me, except that AFAIK “either” mandatory qualifiers are impossible right now, so the only working solution is to pick one of the time-related properties (I’d say P585 because of the issues with follower count and the other two), and make that a non-required mandatory qualifier constraint. If there are too many violations, a complex constraint can be set up, which precisely describes this constraint (but doesn’t give warnings on the items when violated). —Tacsipacsi (talk) 17:24, 23 February 2020 (UTC)[reply]
- I think I'll go ahead with this solution for now as I'd like to get this bot task off the ground. I don't see that time will bring about many more replies since this is a somewhat obscure topic as you say Jura. Once the data is all in Wikidata a restructure can always happen later on. --SilentSpike (talk) 21:59, 23 February 2020 (UTC)[reply]
- I read the discussion as objections have been removed and plan to approve the bot in a couple of days. If objections have not been removed please make a more precise statement here.--Ymblanter (talk) 20:28, 24 February 2020 (UTC)[reply]