Wikidata:Requests for permissions/Bot/Wakebrdkid's bot

Wakebrdkid's bot

Wakebrdkid's bot
Operator: Wakebrdkid

Task/s: Adding claims based on Wikipedia infoboxes.

Function details: "Infobox film" typically contains information about a film's director, screenwriter, producer, main actors, cinematographer, and studio, all of which are supported properties for a film on Wikidata. I'll probably work on other infoboxes after this one. Here are examples of the updates the code makes: Special:Contributions/108.235.225.145. --Wakebrdkid (talk) 06:00, 2 April 2013 (UTC)

This user account is not registered. Please log in with the bot in the browser, and make sure you create a userpage for it. --Ricordisamoa 06:24, 2 April 2013 (UTC)
Over 30 edits per minute!?! Please read this. I had a similar problem, too, while getting approved on it.wiki, and so I didn't get the flag there. --Ricordisamoa 06:32, 2 April 2013 (UTC)
Ok, I'll add a 5 second pause between edits like the guidelines suggest. Wakebrdkid (talk) 06:37, 2 April 2013 (UTC)
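For readers following along, the 5-second pause could be sketched roughly as below. This is only an illustration in Python (the bot itself is written in Mathematica), and the class name `EditThrottle` and its parameters are made up for this example.

```python
import time


class EditThrottle:
    """Enforce a minimum delay between consecutive edits (hypothetical helper)."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between edits
        self._clock = clock               # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous edit, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last edit."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each edit keeps the rate at or below 12 edits per minute with the default interval; a larger `min_interval` (for example 12.0) would be needed for a 5 edits/minute limit.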
OK! What software do you use? Pywikipediabot? DotNetDataBot? Java? PHP? --Ricordisamoa 06:43, 2 April 2013 (UTC)
Mathematica. Wakebrdkid (talk) 06:47, 2 April 2013 (UTC)
Q81294?! --Ricordisamoa 06:54, 2 April 2013 (UTC)
Yes. I'm very productive with it. Wakebrdkid (talk) 07:00, 2 April 2013 (UTC)
Now you can put {{Bot|Wakebrdkid}} on its userpage, and then make ~50 test edits at max. 5 edits/minute. --Ricordisamoa 07:07, 2 April 2013 (UTC)
Ok, done. I found a bug where it was submitting "universe" (ID of 1) for items that it couldn't find in Wikidata. This was due to me mistakenly comparing a string to an integer. I'll go delete those claims where I added them. Wakebrdkid (talk) 08:20, 2 April 2013 (UTC)
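As a general illustration of this class of bug (not the bot's actual Mathematica code, which was not posted here), the Python sketch below shows how comparing a textual lookup result against an integer sentinel silently passes. The sentinel value `-1` and both function names are assumptions made up for the example.

```python
NOT_FOUND = -1  # hypothetical sentinel meaning "no matching item"


def buggy_found(raw_id):
    """Buggy check: raw_id is text such as "-1", so it never equals the int -1."""
    return raw_id != NOT_FOUND  # always True for a str argument


def parse_item_id(raw_id):
    """Return the numeric item ID, or None when the lookup found nothing.

    Raises ValueError if raw_id is not a number at all.
    """
    value = int(raw_id.strip())  # normalize to int before comparing
    return None if value == NOT_FOUND else value
```

Normalizing the type once, at the point where the lookup result enters the program, means every later comparison is between integers.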
That's why unflagged bots should edit slowly... but could you please link to an edit of those? --Ricordisamoa 08:27, 2 April 2013 (UTC)
Oh, I'm not saying the approval system should be changed. Ok, I finished deleting the Q1 claims. Special:Contributions/Wakebrdkid I can verify that the bug is fixed because the bot doesn't try to put them back after I delete them if I re-run it on a movie it already did. Wakebrdkid (talk) 08:44, 2 April 2013 (UTC)

Note that I've blocked this bot as it approached 2,000 edits on its test run. I'll unblock if people really feel we need even more test edits to judge from, but seeing as policy currently only demands 50, I think the current amount should be enough. Obviously this block shouldn't get in the way of this request, and I have no objection to the block being lifted if this request is closed as successful. -- PinkAmpers&(Je vous invite à me parler) 16:42, 2 April 2013 (UTC)

Comment: The bot has added "Q297255" (ASC) to Property:P344 in a number of places, for example at Q778161. "ASC" is a post-nominal used by cinematographers, not a cinematographer in its own right. Gabbe (talk) 22:41, 3 April 2013 (UTC)

That's exactly the kind of feedback I'm looking for. I looked through about a hundred examples before running the tests, but ASC didn't come up. I've updated the code to block attempts to add claims with ASC as the value. Wakebrdkid (talk) 08:14, 5 April 2013 (UTC)
Oh, sweet. I didn't know the "What links here" was working. I've removed ASC from the 3 films to which it was added. Wakebrdkid (talk) 08:20, 5 April 2013 (UTC)
I could probably automate detecting these types of issues for any type of infobox by making the bot stop if it ever detects it is setting a property to a disambiguation item. Then I could either block it or manually specify how it should resolve the disambiguation when it finds it again. Wakebrdkid (talk) 08:31, 5 April 2013 (UTC)
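The stop-on-disambiguation idea might look something like the following Python sketch. It assumes the "instance of" values have already been fetched into a plain dictionary, and it assumes disambiguation items are marked as instances of Q4167410 (Wikimedia disambiguation page). The function name, the exception, and the `resolutions` table are hypothetical.

```python
DISAMBIGUATION_CLASS = "Q4167410"  # Wikimedia disambiguation page


class DisambiguationTarget(Exception):
    """Raised when a claim value is an unreviewed disambiguation item."""


def resolve_value(value_id, instance_of, resolutions):
    """Return the item ID to use, None if the value is blocked, or raise.

    instance_of: dict mapping item ID -> set of "instance of" class IDs
    resolutions: dict of manual decisions, item ID -> replacement ID or None
    """
    if DISAMBIGUATION_CLASS not in instance_of.get(value_id, set()):
        return value_id                   # ordinary item, use as-is
    if value_id in resolutions:
        return resolutions[value_id]      # None means "never add this value"
    raise DisambiguationTarget(value_id)  # stop so a human can decide
```

On the first encounter the bot halts; once the operator records a decision in `resolutions`, later runs either substitute the chosen item or skip the value.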
Ditto for "British Society of Cinematographers" (Q924996), see Q17738 and Q181795. Gabbe (talk) 08:53, 5 April 2013 (UTC)
Ok, I've added more blocks for those and the associated disambiguation pages and removed the errant additions. I don't think Wikidata has enough data in it yet to warrant blocking every time it tries to, for example, add a cinematographer that doesn't have "instance of: person" set, but I could have it keep a running list of times it tried to set a property to a value that didn't have the expected type. Then I could review the list by hand and automate setting "instance of" for those items once I've reviewed the list. Wakebrdkid (talk) 17:05, 5 April 2013 (UTC)
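A running review list of this kind could be kept along the lines of the Python sketch below. It does not block anything; it only records values whose "instance of" set lacks the class expected for the property. The function name and the shape of the inputs are assumptions, and the expected class is passed in rather than hard-coded, since the right "person" item is a modelling decision.

```python
from collections import defaultdict


def collect_type_mismatches(claims, instance_of, expected):
    """Group claims whose value lacks the expected "instance of" class.

    claims: iterable of (item_id, property_id, value_id) tuples
    instance_of: dict mapping item ID -> set of class IDs already on Wikidata
    expected: dict mapping property ID -> class ID its values should have
    Returns a dict: value_id -> list of (item_id, property_id) to review.
    """
    review = defaultdict(list)
    for item_id, property_id, value_id in claims:
        wanted = expected.get(property_id)
        if wanted is None:
            continue  # no expectation recorded for this property
        if wanted not in instance_of.get(value_id, set()):
            review[value_id].append((item_id, property_id))
    return dict(review)
```

Grouping by value means each questionable item is reviewed once, however many films reference it.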
I agree with you, and I think it's hard to define conclusive "catch-all" criteria. Tip: Other post-nomials that you might find in infoboxes include Q463702 and Q1049326. I think it might be best to ignore any parameter that's a three-letter acronym. Sure, that would miss people like Q52447, but it would probably lead to a lot fewer inaccurate inclusions. Gabbe (talk) 20:00, 5 April 2013 (UTC)
I'll probably just go ahead and block all 3-letter acronyms. Keeping a list of those to review and potentially add would probably take less time than reviewing ones to delete. Wakebrdkid (talk) 17:31, 6 April 2013 (UTC)
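One possible way to implement the three-letter-acronym block, with the blocked values set aside for later review, is sketched below in Python. Treating "A.S.C." the same as "ASC" by dropping periods is an assumption of this example, and both function names are made up.

```python
import re

_THREE_CAPS = re.compile(r"[A-Z]{3}")


def is_three_letter_acronym(text):
    """True for values such as "ASC" or "A.S.C." (periods are ignored)."""
    return _THREE_CAPS.fullmatch(text.strip().replace(".", "")) is not None


def split_values(values):
    """Split infobox values into (kept, review) lists.

    Acronyms go to the review list instead of being submitted as claims.
    """
    kept, review = [], []
    for value in values:
        (review if is_three_letter_acronym(value) else kept).append(value)
    return kept, review
```

For example, `split_values(["Gregg Toland", "ASC"])` keeps the name and sends "ASC" to the review list.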
And to be perfectly clear, I'm not at all objecting to the approval of this bot, I'm merely trying to help weed out what I think are quite minor bugs. Gabbe (talk) 20:04, 5 April 2013 (UTC)
Another one: To the page Q220394, the bot added Property:P272 with the value Q51123, which seems odd. I guess it's because the corresponding infobox listed "David W. Griffith Corp." as studio, with a link to him. Gabbe (talk) 07:11, 6 April 2013 (UTC)
I saw a similar issue with w:Citizen Kane. The box says, "Produced by: Orson Welles for Mercury Productions, RKO Radio Pictures". In that case it recognizes the two studios because they are linked, but it doesn't catch Orson Welles because it tries to find an ID for "Orson Welles for". Wakebrdkid (talk) 17:31, 6 April 2013 (UTC)

I'm going to change the bot: instead of importing data from the film infobox, it will use the infobox only to find film articles and will get the data from IMDb, so that I can set references for all of the claims. Many of the items that correspond to films do not have their IMDb ID property set yet. I'm gathering a list of those now, and then I'll try to resolve them automatically. Wakebrdkid (talk) 05:00, 7 April 2013 (UTC)

SamoaBot is already importing IMDb codes (because it overlapped with VIAFbot), so please pay attention. --Ricordisamoa 07:39, 7 April 2013 (UTC)
I won't submit any duplicates. Perhaps the English Wikipedia is biased toward different films than the Russian and Dutch versions. I've looked at 30,000 film items so far and 28,000 still need IMDb IDs. Wakebrdkid (talk) 08:16, 7 April 2013 (UTC)
I'm not sure it's entirely legal to scrape the data from IMDb in the way you seem to be suggesting in your latest post, see this, for example. Then again, I'm not an expert when it comes to database licensing issues. Copying the data from the Wikipedia infoboxes, as you originally suggested, should still be OK. Gabbe (talk) 09:00, 7 April 2013 (UTC)
It seems like we fit all of their criteria except for the "personal" part. They don't make it easy to ask them questions, but here is my attempt. https://getsatisfaction.com/imdb/topics/referencing_imdb_on_wikipedia_film_infoboxes?rfm=1 I've put some more thought into how to do better parses of the infoboxes, and using the IMDb FTP data would be easier than mining their site, so I'm open to pursuing either option. If they say no, I'm probably going to boycott their site because I haven't used it much in recent years anyway. Wakebrdkid (talk) 09:53, 7 April 2013 (UTC)
An admin replied and said that we could probably justify using the IMDb data as fair use, but the English Wikipedia has decided to not give special treatment to IMDb. http://en.wikipedia.org/wiki/Template:Infobox_film#IMDb.2C_Allmovie.2C_and_other_external_links So my ideas to improve parsing our infoboxes are to first expand all of the references in the article to full form, then to recognize anything that has been linked anywhere in the article as a potential item. I didn't catch the "Orson Welles for ..." case because Orson Welles had already been linked previously in the infobox, so he was just given as plain text afterward as per Wikipedia's standards. Identifying David W. Griffith as a producer because David W. Griffith Corp. doesn't have its own article seems like the behavior we'd want, so I'm not planning to change that. I can't easily think of an example where piped text should take priority over the linked article. Wakebrdkid (talk) 16:30, 7 April 2013 (UTC)
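The "anything linked anywhere in the article is a potential item" idea can be sketched in Python as below. This is a rough illustration rather than a full wikitext parser: the regular expression only handles simple `[[Target]]` and `[[Target|label]]` links, the matching is a plain substring test (so short names can produce false positives), and both function names are hypothetical.

```python
import re

# Matches [[Target]], [[Target#Section]] and [[Target|label]] in wikitext.
_LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def linked_names(wikitext):
    """Map every link target and visible label to its target title."""
    names = {}
    for match in _LINK.finditer(wikitext):
        target = match.group(1).strip()
        label = (match.group(2) or target).strip()
        names.setdefault(target, target)
        names.setdefault(label, target)  # the linked article wins over piped text
    return names


def find_known_names(plain_text, names):
    """Return link targets whose name occurs in plain_text, longest names first."""
    found = []
    for name in sorted(names, key=len, reverse=True):
        if name in plain_text and names[name] not in found:
            found.append(names[name])
    return found
```

With the links collected from the whole article, an unlinked infobox value such as "Orson Welles for" would resolve to the "Orson Welles" article as long as he is linked somewhere else on the page.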
Please note that Wikidata is under CC-0: no fair use is allowed. --Ricordisamoa 23:00, 7 April 2013 (UTC)
And again, "fair use" is something that might apply to individual cases in isolation. If I say in an article that World War II ended in 1945, or quote a sentence or even a whole paragraph from a history book, that's fair use. But if I take an entire chapter, scan it, and put it on Wikimedia Commons, that would be an entirely different kettle of fish. Likewise, if I look at IMDb and manually enter individual facts to individual articles (like "Orson Welles directed Citizen Kane") based on it, that might well be entirely different from automatically scouring all IMDb entries en masse whose Wikipedia articles have an infobox. Gabbe (talk) 09:13, 8 April 2013 (UTC)

Support I feel ready to support this bot, provided that it operates by parsing the infoboxes on Wikipedia (as originally mentioned) and not by basing its edits on IMDb directly. I'd be willing to support the bot in the latter case as well, provided that someone versed in these topics vouches for the legality of doing so. Gabbe (talk) 07:47, 16 April 2013 (UTC)

Neutral I took a peek at your edits and they seem ok. However, I think that your bot should add a source to the statements it adds to Wikidata. --Snaevar (talk) 18:34, 16 April 2013 (UTC)

Thanks for all of the feedback, but I have to give my bot low priority for a while. I will still be raising awareness and enthusiasm about Wikidata, but I won't be able to maintain a bot for the time being. Wakebrdkid (talk) 03:17, 25 April 2013 (UTC)

That's fine. I'm going to untransclude from the main requests page for now. Once you feel like you have more time, feel free to re-transclude it and we can look at it again :) Legoktm (talk) 03:36, 25 April 2013 (UTC)