Wikidata talk:Bots



BREAK - Deprecation & Removal of the ungroupedlist API parameter

The ungroupedlist API parameter will be deprecated from the wbgetentities and wbgetclaims modules on Wikidata as of the next deployment on Wednesday 29th July.

Please see the list announcement and Phabricator issue. ·addshore· talk to me! 18:05, 24 July 2015 (UTC)

Reinforcement of data analysis before any modification of the DB

The database is changing and we have to change the requirements for modifying data with a bot. An analysis of the data before any modification is now mandatory to avoid damage on the work of other contributors. I propose that every bot should include a test before it is allowed to work on WD. The test has to cover the following checks (assuming that the bot is importing values from an internal DB):

  1. no WD value -> create statement
  2. WD value is different and no source -> replace value, add source
  3. WD value is different and WP as source -> replace value, add source
  4. WD value is different and sourced with "stated in" -> add new statement
  5. WD value is the same and no source -> add source
  6. WD value is the same and WP as source -> replace source
  7. WD value is the same and sourced with "stated in", source is the same as bot source -> do nothing
  8. WD value is the same and sourced with "stated in", source is different from bot source -> add new statement

Finally, bots should follow the source structure described in Help:Sources so that citations with links can be created on WP whenever possible (URL or an element from which a URL can be built, title, ...). Information about the language of the sources is very helpful for data users when selecting the data they want to display in WP articles. The source data should be sufficient to build a citation similar to en:Template:Cite_book or en:Template:Cite web. Comments? Snipre (talk) 11:10, 13 August 2015 (UTC)
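
The eight checks listed above can be read as a small decision table. The following is only a minimal sketch of the proposal as written, not an agreed policy and not code from any existing bot. The names `SourceKind`, `Action`, `Existing` and `decide` are hypothetical, and it assumes a single existing statement whose source is either absent, a Wikipedia import, or a "stated in" reference.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class SourceKind(Enum):
    """How the existing WD statement is sourced (simplified assumption)."""
    NONE = auto()
    WIKIPEDIA = auto()   # e.g. "imported from" a Wikipedia edition
    STATED_IN = auto()   # sourced with "stated in"


class Action(Enum):
    CREATE_STATEMENT = auto()
    REPLACE_VALUE_ADD_SOURCE = auto()
    ADD_NEW_STATEMENT = auto()
    ADD_SOURCE = auto()
    REPLACE_SOURCE = auto()
    DO_NOTHING = auto()


@dataclass(frozen=True)
class Existing:
    """Hypothetical, simplified view of one existing WD statement."""
    value: str
    source_kind: SourceKind
    source_id: Optional[str] = None  # item cited by "stated in", if any


def decide(existing: Optional[Existing], bot_value: str, bot_source: str) -> Action:
    """Map the proposed checks 1-8 to an action."""
    if existing is None:
        return Action.CREATE_STATEMENT                 # check 1
    kind = existing.source_kind
    if existing.value != bot_value:
        if kind in (SourceKind.NONE, SourceKind.WIKIPEDIA):
            return Action.REPLACE_VALUE_ADD_SOURCE     # checks 2 and 3
        return Action.ADD_NEW_STATEMENT                # check 4
    if kind is SourceKind.NONE:
        return Action.ADD_SOURCE                       # check 5
    if kind is SourceKind.WIKIPEDIA:
        return Action.REPLACE_SOURCE                   # check 6
    if existing.source_id == bot_source:
        return Action.DO_NOTHING                       # check 7
    return Action.ADD_NEW_STATEMENT                    # check 8
```

Writing the checks as one pure function keeps the analysis separate from the code that edits WD, so the table can be tested offline and adjusted if the rules change.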

Ranking is more complex than the case you provide so I prefer to keep that outside of the scope of the current discussion. Snipre (talk) 12:58, 13 August 2015 (UTC)
  • " An analysis of the data before any modification is now mandatory to avoid damage on the work of other contributors." Citation needed Multichill (talk) 13:08, 13 August 2015 (UTC)
@Multichill. I think it is common sense to respect the work of other contributors, and I don't think we need a written policy to be polite. And if you want to lose time discussing the term "mandatory", we can start by discussing whether respect for contributors' work is optional in a collaborative project. Snipre (talk) 13:32, 13 August 2015 (UTC)
Can you provide some examples of existing statements that have been overwritten by bots? I have been forced to wipe out my watchlist, so I cannot see if or when I or somebody else has been overwritten. I have myself removed or changed many GeoNames ID (P1566) statements, but nobody has complained so far. -- Innocent bystander (talk) 15:59, 13 August 2015 (UTC)
See that diff for example. Look for PubChem CID or some other identifiers. Happily the values are the same, but I had spent some time providing detailed information about the sources I used, and the change provides less detailed information. I spoke with the bot operator, and from this discussion I realized that nothing is required of bots to prevent data from being erased. I know the problem because I am working on bot code myself, and that kind of test is quite boring to program. Snipre (talk) 20:13, 13 August 2015 (UTC)
As the bot developer and operator who caused this discussion, let me add my thoughts here. The bot touched 1,857 items, primarily checking if the values are correct and adding external references for all correct(ed)/replaced values. Unfortunately, I overwrote the rich citations on 8 items Snipre had already added, which is not optimal. I will replace the citations my bot added with more detailed citations, as suggested by Snipre and on Help:Sources. But still, I think that adding only the reference 'stated in' is sufficient. As the WD item already has the external identifier as a property, adding it for every citation again is just introducing a lot of redundancy with no real gain in provenance. This certainly relies on the assumption that all data which has the same reference on different property values of a WD item can also be found on the same page of the external resource, using the identifier provided as a property on the WD item. In this case (FDA approved drugs), all data referenced can be found on the single external page which is accessible via the external ID present as a property on the WD item, being either Drugbank ID (P715), PubChem ID (CID) (P662) or ChEMBL ID (P592). Sebotic (talk) 19:08, 18 August 2015 (UTC)
@Sebotic: Sorry to use your edit as an example because, as I said, your change did not replace correct values with wrong ones, so the damage to WD was zero. But this edit shows well that bot work should include more data analysis, especially when people are working to improve WD. It is really bad when people see that their work (which can be the result of a long search and data comparison) is completely replaced by a bot with poor analytical skills.
As for which details should be added to a reference, it is better to start that discussion on the talk page of Help:Sources. But here we have to think about data citation, and your solution doesn't provide enough information to create a citation. I can only suggest that you read some WP articles and look at how they cite their sources to understand what we should provide. Snipre (talk) 08:01, 19 August 2015 (UTC)
  • Two points: #2 and #3 are problematic. These values could be correct even though your source says something different. "stated in" is only one way to add a reference. Often only a plain URL is given. --Succu (talk) 20:29, 13 August 2015 (UTC)
    • Only "url" would be as good as "stated in", I guess. The most often problematic references are "imported from Wiki". But other bot-added references can also have major flaws. No matter what the source is, a bot can easily misunderstand the content of two items. One bot I know could not see the difference between Prosper-Mathieu Henry (Q1084042), Paul-Pierre Henry (Q12015708) and Paul Henry and Prosper Henry (Q302840). The data was imported from an external database, but the bot used Wikipedia to work out which name in the database corresponds to which item, and was misdirected. -- Innocent bystander (talk) 07:28, 14 August 2015 (UTC)
    • We have to get rid of all unsourced data in order to reach the objective of a DB composed of sourced data (just have a look at the objectives of WD). So point #2 can't be challenged, in my opinion. The problem is not the replacement but the importation of these values. For point #3 we face the problem that the value has been split from its source, and the fact that WP is not a source. Curating all these data will take a lot of time, and the longer we wait to treat these values, the less likely we are to find the source on WP, because WP is changing: articles are modified and data are replaced. Snipre (talk) 08:29, 14 August 2015 (UTC)
      • Oppose #3. There may be sources in the "source" WP; only a manual check can confirm or reject that. Thus a bot should preserve WP-based values. It's not a problem for Wikidata: any project can just filter out values with sources it doesn't like. E.g. a project can ignore claims without sources or claims with only sources like "imported from enwiki". Any quality measurements can take this into account. -- Vlsergey (talk) 08:52, 14 August 2015 (UTC)
        • @Vlsergey: We can leave the reference "imported from WP" on the data, but at the same time we need some work to check these values and to eliminate the use of "imported from WP" as a source. If we already plan to filter these data because few people trust them, then we have to be coherent and organize the replacement of these values. Snipre (talk) 15:14, 17 August 2015 (UTC)
          • The way you propose to delete data just because they come from WP and not from the source the bot owner trusts is incorrect. It is okay to replace them manually, but not via bot work. It turns out the GND system has a lot of errors, many more than WP. According to #3, correct data from WP will be replaced by incorrect data from GND, just because we haven't imported the correct source from WP yet. I can't agree with this. -- Vlsergey (talk) 15:38, 17 August 2015 (UTC)
  • Regarding #7, the bot can update the "check date". -- Vlsergey (talk) 08:52, 14 August 2015 (UTC)
  • Let's simplify this whole discussion a bit. The proposed rules can lead to essentially the following outcomes:
Action A.1: Add your statement as a new statement.
Action A.2: Add your reference to an existing statement (which otherwise agrees in content) and remove all WP references from that statement.
Action B: Delete one or more existing statements
We can treat A.1 and A.2 as one action:
Action A: When adding a new statement, a bot should first check if there is already a statement with the same property, value, and qualifiers. If yes, then the bot should just add its new source to the existing statement (if not present yet), and remove all WP references from the statement. If no, then it should add a new statement with its new source.
Looking at the proposal above, we can now see that a bot should always perform Action A. Note that (7) is a special case where Action A does not change anything. Also note that "replace" can be implemented as "delete+create". Therefore Action A already covers the proposed cases (1), (4), (5), (7), and (8). For the other cases, the main question is which statements to delete. For example, for case 2 above we would delete all statements with different value and no source (Action B) and then add the new statement (Action A). This view makes it easier to implement (and to discuss!) the suggested policy. Moreover, this view is much clearer in situations where there are multiple statements for a property (the above proposal uses language like "replace value" that suggests that there is only one unique statement in which the value can be replaced; in reality, there might be many, all of which should be deleted when the new statement is added).
A bot implementer now needs to write a method that decides if an existing statement should be deleted in the light of the new statement (a function that takes an existing statement as input and returns a boolean value). The proposed rules can be summarised as follows:
A statement should be deleted if it has the same property as the new statement and either
(a) it has a different value and no reference, or
(b) it uses WP as reference.
That's easy enough to implement (though we might still discuss if this is what we want). Rule (a) covers the proposed case (2). Rule (b) covers the proposed cases (3) and (6). Nothing needs to be deleted in cases (1), (4), (5), (7), and (8).
I suggest disentangling the discussion of Action A (how to update references) and Action B (which statements are deleted). This will simplify the discussion (and later implementation). --Markus Krötzsch (talk) 15:38, 17 August 2015 (UTC)
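
The split into Action A and Action B described above could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing bot or library API: `Reference`, `Statement`, `should_delete` and `apply_update` are hypothetical names, statements are plain in-memory objects, and "uses WP as reference" is read here as "at least one reference is a Wikipedia import", which is only one possible reading of rule (b).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

WIKIPEDIA = "wikipedia"  # hypothetical marker for "imported from WP" references


@dataclass
class Reference:
    kind: str          # e.g. "wikipedia", "stated_in", "url"
    target: str = ""   # cited item or URL


@dataclass
class Statement:
    prop: str
    value: str
    qualifiers: Tuple[str, ...] = ()
    references: List[Reference] = field(default_factory=list)


def should_delete(existing: Statement, new: Statement) -> bool:
    """Action B: decide whether an existing statement is deleted."""
    if existing.prop != new.prop:
        return False
    # Rule (a): different value and no reference at all.
    different_unsourced = existing.value != new.value and not existing.references
    # Rule (b): uses WP as reference (assumed: any reference is a WP import).
    uses_wp = any(r.kind == WIKIPEDIA for r in existing.references)
    return different_unsourced or uses_wp


def apply_update(statements: List[Statement], new: Statement) -> List[Statement]:
    """Run Action B, then Action A; returns the resulting statement list."""
    kept = [s for s in statements if not should_delete(s, new)]
    for s in kept:
        same = (s.prop, s.value, s.qualifiers) == (new.prop, new.value, new.qualifiers)
        if same:
            # Action A, matching statement found: drop WP references and
            # add the new references that are not present yet (in place).
            s.references = [r for r in s.references if r.kind != WIKIPEDIA]
            for r in new.references:
                if r not in s.references:
                    s.references.append(r)
            return kept
    # Action A, no matching statement: add the new statement.
    kept.append(new)
    return kept
```

With this reading of rule (b), a statement that carries a WP reference is always removed in Action B, so the removal of WP references inside Action A only matters when Action A is used on its own. Changing the deletion policy then means editing `should_delete` alone.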
  • Oppose #2 and #3, per discussion at d:Wikidata:Project_chat#How_to_indicate_doubt_when_a_comparison_with_an_external_source_does_not_match. Sourced statements can be wrong; unsourced statements can be right. We create a higher quality resource if we do not hide such dissonances, but preserve them. Retention (i) preserves the warning signal to the user that the sourced statement may not be beyond question and (ii) acts as a flag to encourage further research from additional sources to make a fuller assessment of the correct value or the true range of possible values for the statement. Jheald (talk) 15:54, 17 August 2015 (UTC)
    • You are making the same error as GerardM: you suppose that something can be wrong or right. It is not the job of WD to distinguish between right and wrong. Let the data users do the job of choosing their right or wrong data. The only duty of WD is to provide the information needed to filter data, meaning the sources.
    • You confuse data quality with true data. What is data quality? If we assume that the job of WD is not to find the truth, there is only one quality scale: data without source < data imported from WP < data with external source.
    • By deleting the unsourced data we don't assume that the new sourced data is correct; we just give users the means to find out what is correct or wrong according to their criteria.
    • What do unsourced data bring? No possibility of analysing the data, so nothing of interest for data users who come to WD for data they can use, analyse and compare. Unlike unsourced data, which stay unnoticed because they are filtered out, "wrong data" with a source will be detected by users. I prefer to leave that kind of job to them because there are more of them than contributors. So there is no need to worry about data that are sourced but wrong. Snipre (talk) 19:00, 17 August 2015 (UTC)
  • I would have thought that nothing should be overwritten. Given, for example, the GeoNames-referenced data that has been overwritten, it is possible to gain an insight into the level of trust that should be given to a GeoNames-sourced piece of data. Moreover, by adding sources it is possible to hypothesise which source relies on which other source. This is regularly and explicitly done in classical and biblical scholarship, and is becoming a desirable skill for WMF project work. For example, the statement that "Tale of Two Cities" is the best selling novel of all time was shown to be a case of citogenesis. All the best: Rich Farmbrough 21:26, 17 August 2015 (UTC).
    @Rich Farmbrough: The problem I have found with GeoNames, is not that the source itself is wrong. (I have detected such things too, but that has never been the reason why I have removed such statements.) The problem is rather that the linking between the items in the GeoNames Database has been wrongly connected to the items in the Wikidata Database. Many here have problems with separating the Province of Härjedalen from the Municipality of Härjedalen, that is true also within the Wikipedia-family. Neither of these database-items are wrong in the GeoNames database, but they were wrongly matched with the corresponding items at Wikidata. When we detect such mistakes within Wikipedia, we call it "interwiki conflict" and replace the sitelinks. We do not mark such sitelinks as "deprecated" and let them stay as they are forever. -- Innocent bystander (talk) 09:19, 18 August 2015 (UTC)
    That is a distinction worth drawing. There is still value in knowing the types of mistakes we make but that is a different form of meta-knowledge. All the best: Rich Farmbrough 01:17, 22 August 2015 (UTC).
  • Agree with the approach, but Oppose #2, #3 and #6. One example: date of birth (P569) in Joseph Stalin (Q855). The statements from the authorities Integrated Authority File (Q36578) and Bibliothèque nationale de France (Q193563) give the official date, which seems to be wrong and has been deprecated. As written in WP, the real date of birth seems to be the one stated with normal rank in Wikidata, which has WP as source. So against #3. #2 is similar, and imho the best decision, replacement or not, has to be taken by a human. Now imagine that a cataloguer working in an institution which provides authority data sees the Wikipedia article and the references in the article, and rationally decides to change the date of birth. The authority data for the date of birth is then not established from Wikipedia but "very influenced" by Wikipedia. And if after that we import from this source and decide to replace WP as source with the authority as source, we will have a circular reference (~the devil) and lose the initial source needed to see it. WP is de facto a source even with all its known issues; it is better to admit it rather than trying to erase it. (Just for information, there is a case in SUDOC (Q2597810) where the date of death is referenced from Wikipedia.) So against #6. For #4 I think it depends; personally, if there is already a statement and the value has to be unique, I prefer setting a flag locally and doing nothing, for a later look. The existing statement can be a subclass or a similar, more precise value, the authority data could be wrong (it happens sometimes), or something we don't know may need a look. But sometimes #4 could be good practice. Again, thank you for this proposal. I prefer to see it as a recommendation, not an injunction. I think bots already have similar decision trees and it's a good thing to discuss them. Best regards --Shonagon (talk) 02:29, 31 October 2015 (UTC)
  • I would like to add to what has already been discussed that #8 should not add a new statement. If the data is exactly the same, I think it is better to add a new source to the existing statement instead of having more than one statement with the same value and different sources. This wouldn't break the single-value constraints. -- Agabi10 (talk) 12:28, 31 October 2015 (UTC)

API breaking change (9th September)

Hi all! Please note there is an API breaking change (mainly for XML) scheduled for 9th September 2015. To read about it please see User:Addshore/API_Break_September_2015. This has also been announced to the relevant mailing lists and will be announced in the weekly summary. Any questions just ask! ·addshore· talk to me! 14:39, 27 August 2015 (UTC)

Bot request page broken?

I added a bot request at Wikidata:Requests_for_permissions/Bot/APSbot but it's not showing up on the main Wikidata:Requests_for_permissions/Bot page. Something broken? ArthurPSmith (talk) 21:11, 19 October 2015 (UTC)

You have to add the link for a new request manually to Wikidata:Requests_for_permissions/Bot. [1] --Pasleim (talk) 21:15, 19 October 2015 (UTC)
Thanks, that works! Maybe the instructions on the request page should be updated? ArthurPSmith (talk) 21:29, 19 October 2015 (UTC)
When creating a new request, you see {{Wikidata:Requests for permissions/Editintro/Bot}} --Pasleim (talk) 21:39, 19 October 2015 (UTC)

Interwiki bot

Hi,

Is there a bot somewhere that adds interwiki links in Wikidata?

I would like for instance to link "fr:Catégorie:XXXX en littérature" to "en:Category:XXXX in literature" for all XXXX years.

Thank you for your help.

Vargenau (talk) 16:20, 2 December 2015 (UTC)

Interwiki bot in sa.wikipedia

In sa.wikipedia there are 8000+ categories that don't have interwiki links. For example, 2000 die should be connected with २००० मरणम्. Can someone help me? NehalDaveND (talk) 06:07, 25 March 2016 (UTC)