Wikidata talk:Bots


Mass-undo / mass-revert?

As we all know, mistakes can and will happen. Instead of merely delegating this issue to clean-up jobs after the fact, should it not be feasible for a Wikidata bot to itself undo what it has done? The structured data approach should entail that Wikidata is the ideal place for this kind of approach. (Although I admittedly have a limited knowledge about bot implementation, I haven't yet seen this kind of bot behavior when mistakes do happen; more often than not, semi-manual cleanups seem to be needed after mistakes occur.) -- Therefore, I would like to suggest making it a requirement for any bot that it is able to undo its last edit to any given item. (There could, for example, be a requirement that it should be able to accept a list of item IDs plus a filter argument, e.g. how many of its own last edits to undo, or a date cutoff point for indicating 'all my edits since a particular date'.) Such a requirement could be adopted both in the bot approval process, making it a requirement for bot adoption, as well as in the underlying software. So what do you think, is this possible currently, or would it require some changes to the API or underlying software before it is possible to realize such a level of control? If it is already possible, or in the works, that would of course be great! Fred Johansen (talk) 17:22, 7 December 2014 (UTC)

This seems like a good idea. It's also needed for Widar; see e.g. User_talk:Yamaha5#instance_of_.28P31.29. Emw (talk) 18:54, 7 December 2014 (UTC)
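
To make the suggestion above concrete, here is a minimal sketch of the selection step only: given a list of item IDs plus a filter (a number of last edits, a date cutoff, or both), pick which of the bot's own revisions would be undone. The Revision record and the function name are hypothetical and not part of any existing API; how the history is fetched and how each revision is actually reverted is left out, and edits by other users that came after the bot's edit may still cause conflicts.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Revision:
    """One edit in an item's history (hypothetical record, not an API type)."""
    item_id: str
    rev_id: int
    user: str
    timestamp: datetime


def select_revisions_to_undo(
    history: Iterable[Revision],
    bot_user: str,
    item_ids: Iterable[str],
    last_n: Optional[int] = None,
    since: Optional[datetime] = None,
) -> List[Revision]:
    """Return the bot's own revisions to undo, newest first.

    last_n limits how many of the bot's latest edits are taken per item;
    since keeps only edits made at or after the given cutoff.
    """
    wanted = set(item_ids)
    taken_per_item: Dict[str, int] = {}
    selected: List[Revision] = []
    # Walk from the newest revision to the oldest one.
    for rev in sorted(history, key=lambda r: r.rev_id, reverse=True):
        if rev.item_id not in wanted or rev.user != bot_user:
            continue  # not a listed item, or somebody else's edit
        if since is not None and rev.timestamp < since:
            continue  # older than the date cutoff
        taken = taken_per_item.get(rev.item_id, 0)
        if last_n is not None and taken >= last_n:
            continue  # already have enough edits for this item
        taken_per_item[rev.item_id] = taken + 1
        selected.append(rev)
    return selected
```

Undoing newest first keeps each undo applied against the state the bot itself left behind, which is why the result is ordered that way.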

GZWDer (flood) flooding too much!

GZWDer (flood) is a bot operated by GZWDer. It was approved here (by only one contributor?).

Firstly, this bot does not have a "proper" bot user page (with a "stop me" link). Secondly, what exactly this bot does is not really clear, apart from "semi-automatic tasks" (which is obvious for a bot).

More importantly, the user operating this bot has been warned many times about his behaviour. See: here, here, here and lastly here.

This bot has had many problems in the past:

  • Creating duplicates, which leads to a lot of manual checking. The user's replies were "You can manually merge them" and "Duplicates can not be prevented".
  • Creating items with no labels. The user just replied that another bot (Dexbot) would fix his own bot's edits.
  • Creating a bunch of Wikidata stub entries without even speaking to the related community first. No reply yet.

After so many warnings, I think it is about time to stop this flooding bot! Miniwark (talk) 12:18, 30 March 2015 (UTC)

There are many such bots without "stop me" in Wikidata. I own one too. It is obvious that bots create duplicate items; those should have been connected before they were imported. AFAIK no one has to ask any project if they may create items for their pages. (I haven't seen such a case since Wikidata was launched.)
So do you want to stop them all? Matěj Suchánek (talk) 12:52, 30 March 2015 (UTC)
It's not only the missing button but the way the bot is used. If you run your bot in a correct way the button is not needed, but when you do things without any agreement with the people involved in the topic and your bot is making mass bad edits, the missing button becomes a problem. Snipre (talk) 13:06, 30 March 2015 (UTC)

--GZWDer (talk) 05:11, 31 March 2015 (UTC)

there is a very big difference between wikipedia projects and wikisource projects : wikisource has very few contributors, and there is a reflection in progress about how to import texts, and exactly which texts are to be imported to wikidata : books, obviously... but for chapters, it's less obvious, and for pages it is clearly "N/A". Besides, the technical structure of wikisource is very different from that of wikipedia…
tonight I had to ask for the deletion of more than 120 pages from a book that you imported, because these were not entire texts, but texts of single pages, which could not be put in the Page namespace for technical reasons…
besides, data is needed in order to add properties, and it is much more difficult to get than from wp, because transclusion does not allow visualisation in wd, and categories are much less developed… all the work afterwards has to be done by hand and cannot be automated, which makes large blind imports very difficult to manage…
there is absolutely no urgency to import all of wikisource very quickly, especially if items stay without claims for months (or years) because there is no one to document them… a systematic import, by author for example (or by source for articles of periodicals), is more efficient and easier to document afterwards…
The consensus of wikisource projects is to import validated texts first, then corrected texts, and then see about the others… you imported blindly, when we were working on tools to do it properly, which is why I am so irritated about your bot… it works randomly, in languages it does not understand… that's crap… :(
if GZWDer just does not understand what he does, at least, just stop importing from wikisource… let ws-contributors work on it, as they are now, and let them do it properly - thanks --Hsarrazin (talk) 22:56, 13 May 2015 (UTC)


  • Properties can be added by autolist, and in the future by bot; but autolist doesn't create any new items
  • Already stopped importing from Wikisource but Dexbot will still create items for new Wikisource pages (while many items for old pages were created by me)

--GZWDer (talk) 05:00, 14 May 2015 (UTC)

BREAK - Deprecation & Removal of the ungroupedlist API parameter

The ungroupedlist api parameter will be deprecated from the wbgetentities and wbgetclaims modules on Wikidata as of the next deployment on Wednesday 29th July.

Please see the list announcement and phabricator issue ·addshore· talk to me! 18:05, 24 July 2015 (UTC)

Reinforcement of data analysis before any modification of the DB

The database is changing and we have to modify the requirements for modifying data using a bot. An analysis of the data before any modification is now mandatory to avoid damage on the work of other contributors. I propose that all bots should include a test to be able to work on WD. The test has to match the following checks (assuming that the bot is importing values from an internal DB):

  1. no WD value -> create statement
  2. WD value is different and no source -> replace value, add source
  3. WD value is different and WP as source -> replace value, add source
  4. WD value is different and sourced with "stated in" -> add new statement
  5. WD value is the same and no source -> add source
  6. WD value is the same and WP as source -> replace source
  7. WD value is the same and sourced with "stated in", source is the same as bot source -> do nothing
  8. WD value is the same and sourced with "stated in", source is different from bot source -> add new statement

Finally, bots should follow the source structure described in help:sources in order to allow citation on WP with link creation when this is possible (URL or element allowing URL building, title,...). Information about the language of the sources is very helpful for data users in order to select the data they want to display in WP articles. The source data should be sufficient to build citations similar to en:Template:Cite_book or en:Template:Cite web. Comments? Snipre (talk) 11:10, 13 August 2015 (UTC)
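
The eight checks listed above can be read as a small decision table. The following is only a sketch of that table under simplifying assumptions: it looks at a single existing statement, it models just the three source situations named in the list (no source, WP as source, "stated in"), and all names (SourceKind, decide) are made up for illustration. Other reference types, such as a plain URL, are not covered by the list and are not modelled here.

```python
from enum import Enum
from typing import Optional


class SourceKind(Enum):
    """How the existing WD statement is sourced (illustrative categories)."""
    NONE = "no source"
    WIKIPEDIA = "WP as source"
    STATED_IN = "stated in"


def decide(
    wd_value: Optional[str],
    wd_source_kind: SourceKind,
    wd_source: Optional[str],
    bot_value: str,
    bot_source: str,
) -> str:
    """Return the action from the proposed list for one existing statement."""
    if wd_value is None:
        return "create statement"                      # check 1
    if wd_value != bot_value:
        if wd_source_kind is SourceKind.STATED_IN:
            return "add new statement"                 # check 4
        return "replace value, add source"             # checks 2 and 3
    # From here on the WD value equals the bot value.
    if wd_source_kind is SourceKind.NONE:
        return "add source"                            # check 5
    if wd_source_kind is SourceKind.WIKIPEDIA:
        return "replace source"                        # check 6
    if wd_source == bot_source:
        return "do nothing"                            # check 7
    return "add new statement"                         # check 8
```

A bot could call decide() once per existing statement and log the returned action in a dry run before making any edit.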

Ranking is more complex than the case you provide so I prefer to keep that outside of the scope of the current discussion. Snipre (talk) 12:58, 13 August 2015 (UTC)
  • " An analysis of the data before any modification is now mandatory to avoid damage on the work of other contributors." Citation needed Multichill (talk) 13:08, 13 August 2015 (UTC)
@Multichill. I think it is common sense to respect the work of other contributors and I don't think we need a written policy to be polite. And if you want to lose time discussing the term mandatory, we can start to discuss whether respect for contributors' work is optional in a collaborative project. Snipre (talk) 13:32, 13 August 2015 (UTC)
Can you provide some examples of where present statements have been overwritten by bots? I have been forced to wipe out my watchlist, so I cannot see if/when I or somebody else have been overwritten. I have myself removed or changed many GeoNames ID (P1566)-statements, but nobody have complained this far. -- Innocent bystander (talk) 15:59, 13 August 2015 (UTC)
See that diff for example. Look for PubChem CID or some other identifiers. Happily the values are the same, but I spent some time providing detailed information about the sources I used, and the change provides less detailed info. I spoke with the bot operator and from this discussion I realized that nothing is required of bot actions to prevent data erasure. I know the problem because I am working on bot code and that kind of test is quite boring to program. Snipre (talk) 20:13, 13 August 2015 (UTC)
As the bot developer and operator who caused this discussion, let me add my thoughts here. The bot touched 1,857 items, primarily checking if the values are correct and adding external references for all correct(ed)/replaced values. Unfortunately, I overwrote the rich citations on 8 items that Snipre had already added, which is not optimal. I will replace the citations my bot added with more detailed citations, as suggested by Snipre and on Help:Sources. But still, I think that adding only the reference 'stated in' is sufficient. As the WD item already has the external identifier as a property, adding it for every citation again is just introducing a lot of redundancy with no real gain in provenance. This certainly relies on the assumption that all data which has the same reference on different property values of a WD item can also be found on the same page of the external resource, using the identifier provided as a property on the WD item. In this case (FDA approved drugs), all data referenced can be found on the single external page which is accessible via the external ID present as a property on the WD item, being either Drugbank ID (P715), PubChem ID (CID) (P662) or ChEMBL ID (P592). Sebotic (talk) 19:08, 18 August 2015 (UTC)
@Sebotic: Sorry to use your edit as an example because, as I said, your change was not replacing correct values with wrong values, so the damage was zero for WD. But this edit was good at showing that bot work should include more data analysis, especially when people are working to improve WD. It is really a bad point when people see that their work (which can be the result of a long search and data comparison) is completely replaced by a bot with poor analytical skills.
For the question of what details should be added to a reference, it is better to start that discussion on the talk page of Help:Sources. But here we have to think about data citation, and your solution doesn't provide enough information to create a citation. I can only suggest that you read some WP articles to see how they cite their sources, to understand what we should provide. Snipre (talk) 08:01, 19 August 2015 (UTC)
  • Two points: #2 and #3 are problematic. These values could be correct, but your source says something different. "stated in" is only one possibility to add a reference. Often only a plain URL is given. --Succu (talk) 20:29, 13 August 2015 (UTC)
    • Only "url" would be as good as "stated in", I guess. The most often problematic references are "imported from Wiki". But also other bot-added references can have major flaws. No matter what the source is, a bot can easily misunderstand the content of two items. One bot I know, could not see the difference between Prosper-Mathieu Henry (Q1084042), Paul-Pierre Henry (Q12015708) and Paul Henry and Prosper Henry (Q302840). The data was imported from an external database, but it used Wikipedia to know which name in the database corresponds to which item, and was directed wrong. -- Innocent bystander (talk) 07:28, 14 August 2015 (UTC)
    • We have to get rid of all unsourced data in order to reach the objective of a DB composed of sourced data (just have a look at the objectives of WD). So point #2 can't be challenged in my opinion. The problem is not the replacement but the importation of these values. For point #3 we face the problem of the value being split from its source and the fact that WP is not a source. Curating all these data will take a lot of time, and the longer we wait to treat these values, the less likely we are to find the source on WP, because WP is changing, articles are modified and data are replaced. Snipre (talk) 08:29, 14 August 2015 (UTC)
      • Oppose to #3. There may be sources in the "source" WP; only a manual check can confirm or reject that. Thus a bot should preserve a WP-based value. It's not a problem for wikidata -- any project can just filter out values with sources it doesn't like. I.e. a project can ignore claims without sources or claims with only sources like "imported from enwiki". Any quality measurements can take it into account. -- Vlsergey (talk) 08:52, 14 August 2015 (UTC)
        • @Vlsergey: We can leave the reference "imported from WP" on the data, but at the same time we need some work to check these values and to eliminate the use of "imported from WP" as a source. If we already plan to filter these data because few people trust them, then we have to be coherent and organize the replacement of these values. Snipre (talk) 15:14, 17 August 2015 (UTC)
          • The way you propose to delete data just because they are from WP and not from a source the bot owner trusts is incorrect. It is okay to manually replace them, but not via bot work. It turns out the GND system has a lot of errors. Much more than WP. According to #3, correct data from WP will be replaced by incorrect data from GND. Just because we didn't import the correct source from WP yet. I can't agree with this. -- Vlsergey (talk) 15:38, 17 August 2015 (UTC)
  • Regarding #7 bot can update "check date". -- Vlsergey (talk) 08:52, 14 August 2015 (UTC)
  • Let's simplify this whole discussion a bit. The proposed rules can lead to essentially the following outcomes:
Action A.1: Add your statement as a new statement.
Action A.2: Add your reference to an existing statement (which otherwise agrees in content) and remove all WP references from that statement.
Action B: Delete one or more existing statements
We can treat A.1 and A.2 as one action:
Action A: When adding a new statement, a bot should first check if there is already a statement with the same property, value, and qualifiers. If yes, then the bot should just add its new source to the existing statement (if not present yet), and remove all WP references from the statement. If no, then it should add a new statement with its new source.
Looking at the proposal above, we can now see that a bot should always perform Action A. Note that (7) is a special case where Action A does not change anything. Also note that "replace" can be implemented as "delete+create". Therefore Action A already covers the proposed cases (1), (4), (5), (7), and (8). For the other cases, the main question is which statements to delete. For example, for case 2 above we would delete all statements with different value and no source (Action B) and then add the new statement (Action A). This view makes it easier to implement (and to discuss!) the suggested policy. Moreover, this view is much clearer in situations where there are multiple statements for a property (the above proposal uses language like "replace value" that suggests that there is only one unique statement where the value can be replaced in; in reality, there might be many, all of which should be deleted when the new statement is added).
A bot implementer now needs to write a method that decides if an existing statement should be deleted in the light of the new statement (a function that takes an existing statement as input and returns a boolean value). The proposed rules can be summarised as follows:
A statement should be deleted if it has the same property as the new statement and
(a) it has a different value and no reference
(b) it uses WP as reference
That's easy enough to implement (though we might still discuss if this is what we want). Rule (a) covers the proposed case (2). Rule (b) covers the proposed cases (3) and (6). Nothing needs to be deleted in cases (1), (4), (5), (7), and (8).
I suggest to disentangle the discussion of Action A (how to update references) and Action B (which statements are deleted). This will simplify the discussion (and later implementation). --Markus Krötzsch (talk) 15:38, 17 August 2015 (UTC)
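
As a rough illustration of the two actions described above, here is a minimal sketch. The Statement record, the WP_REFERENCE marker and both function names are hypothetical, and references are reduced to plain strings. Rule (b) is implemented literally, so a statement carrying a WP reference next to another reference is deleted as well, which may or may not be what is wanted.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

WP_REFERENCE = "imported from WP"  # hypothetical marker for a Wikipedia reference


@dataclass(frozen=True)
class Statement:
    """Simplified statement: property, value, qualifiers, reference labels."""
    prop: str
    value: str
    qualifiers: Tuple[str, ...] = ()
    references: Tuple[str, ...] = ()


def should_delete(existing: Statement, new: Statement) -> bool:
    """Action B: decide whether an existing statement should be deleted."""
    if existing.prop != new.prop:
        return False
    rule_a = existing.value != new.value and not existing.references
    rule_b = WP_REFERENCE in existing.references
    return rule_a or rule_b


def apply_update(statements: List[Statement], new: Statement) -> List[Statement]:
    """Run Action B, then Action A, and return the resulting statement list."""
    kept = [s for s in statements if not should_delete(s, new)]
    result: List[Statement] = []
    merged = False
    for s in kept:
        same = (s.prop, s.value, s.qualifiers) == (new.prop, new.value, new.qualifiers)
        if same and not merged:
            # Action A, matching statement: drop WP references, add new source.
            refs = [r for r in s.references if r != WP_REFERENCE]
            refs += [r for r in new.references if r not in refs]
            s = replace(s, references=tuple(refs))
            merged = True
        result.append(s)
    if not merged:
        # Action A, no matching statement: add the new statement as it is.
        result.append(new)
    return result
```

Keeping should_delete() separate from apply_update() mirrors the suggestion to discuss Action A and Action B independently: changing the deletion policy only touches the predicate.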
  • Oppose #2 and #3, per discussion at d:Wikidata:Project_chat#How_to_indicate_doubt_when_a_comparison_with_an_external_source_does_not_match. Sourced statements can be wrong; unsourced statements can be right. We create a higher quality resource if we do not hide such dissonances, but preserve them. Retention (i) preserves the warning signal to the user that the sourced statement may not be beyond question and (ii) acts as a flag to encourage further research from additional sources to make a fuller assessment of the correct value or the true range of possible values for the statement. Jheald (talk) 15:54, 17 August 2015 (UTC)
    • You are making the same error as GerardM: you suppose that something can be wrong or right. It is not the job of WD to distinguish between right and wrong. Let the data user do the job of choosing his right or wrong data. The only duty of WD is to provide the information needed to filter data, meaning the sources.
    • You confuse data quality and true data. What is data quality? If we assume that the job of WD is not to find the truth, there is only one quality scale: data without source < data imported from WP < data with external source.
    • By deleting the unsourced data we don't assume that the new sourced data is correct; we just provide a way for users to find what is correct or wrong according to their criteria.
    • What does unsourced data bring? No possibility to analyze the data, so no interesting data for data users who come to WD for data they can use, analyze and compare. Unlike unsourced data, which stays unnoticed because it is filtered out, "wrong data" that is sourced will be detected by users: I prefer to leave that kind of job to them because there are more of them than contributors. So no worry about data that is sourced but wrong. Snipre (talk) 19:00, 17 August 2015 (UTC)
  • I would have thought that nothing should be overwritten. Given for example the Geonames referenced data that has been overwritten, it is possible to gain an insight into the level of trust that should be given to a Geonames sourced piece of data. Moreover by adding sources it is possible to hypothesise which source relies on which other source. This is regularly and explicitly done in classical and biblical scholarship, and is becoming a desirable skill for WMF project work. For example the statement that "Tale of Two Cities" is the best selling novel of all time was shown to be a case of citeogenesis. All the best: Rich Farmbrough 21:26, 17 August 2015 (UTC).
    @Rich Farmbrough: The problem I have found with GeoNames, is not that the source itself is wrong. (I have detected such things too, but that has never been the reason why I have removed such statements.) The problem is rather that the linking between the items in the GeoNames Database has been wrongly connected to the items in the Wikidata Database. Many here have problems with separating the Province of Härjedalen from the Municipality of Härjedalen, that is true also within the Wikipedia-family. Neither of these database-items are wrong in the GeoNames database, but they were wrongly matched with the corresponding items at Wikidata. When we detect such mistakes within Wikipedia, we call it "interwiki conflict" and replace the sitelinks. We do not mark such sitelinks as "deprecated" and let them stay as they are forever. -- Innocent bystander (talk) 09:19, 18 August 2015 (UTC)
    That is a distinction worth drawing. There is still value in knowing the types of mistakes we make but that is a different form of meta-knowledge. All the best: Rich Farmbrough 01:17, 22 August 2015 (UTC).

API breaking change (9th September)

Hi all! Please note there is an API breaking change (mainly for XML) scheduled for 9th September 2015. To read about it please see User:Addshore/API_Break_September_2015. This has also been announced to the relevant mailing lists and will be announced in the weekly summary. Any questions just ask! ·addshore· talk to me! 14:39, 27 August 2015 (UTC)