Shortcuts: WD:RFBOT, WD:BRFA, WD:RFP/BOT

Wikidata:Requests for permissions/Bot

To request a bot flag, or approval for a new task, in accordance with the bot approval process, create a request page named after your bot (followed by the task number if your bot is already approved for other tasks) and transclude that page onto this page, like this: {{Wikidata:Requests for permissions/Bot/RobotName}}.

Old requests go to the archive.

Once consensus has been obtained in favor of granting the bot flag, please post the request at the bureaucrats' noticeboard.

Bot Name Request created Last editor Last edited
UallvBot 2 2020-07-07, 11:48:00 Epìdosis 2020-07-07, 16:00:48
ISSNBot 2020-06-09, 20:15:07 Tfrancart 2020-07-10, 08:59:48
Dexbot 13 2020-06-27, 21:21:44 GZWDer 2020-07-07, 19:22:04
NicereddyBot 3 2020-06-26, 06:01:11 Ymblanter 2020-07-03, 18:45:11
GeertivpBot 2020-06-24, 07:25:51 Ymblanter 2020-07-07, 18:21:10
t cleanup bot 2020-06-21, 17:39:23 GZWDer 2020-06-21, 20:02:27
The Anomebot 3 2 2020-06-14, 11:29:42 Ymblanter 2020-06-19, 19:29:26
Alch Bot 2020-06-12, 18:24:34 Lymantria 2020-06-17, 05:28:32
OlafJanssenBot 2020-06-11, 21:45:35 Lymantria 2020-06-26, 08:07:22
IngeniousBot 1 2020-06-07, 18:55:13 Ymblanter 2020-06-12, 18:08:18
Recipe Bot 2020-05-20, 14:21:59 U+1F360 2020-05-24, 17:47:01
LullyArkBNFBot 2020-05-20, 09:35:21 DannyS712 2020-07-07, 12:01:55
CovidDatahubBot 2020-05-14, 21:28:33 Jura1 2020-06-29, 08:34:41
LouisLimnavongBot 2020-05-14, 13:09:17 LouisLimnavong 2020-05-14, 13:09:17
BsivkoBot 3 2020-05-08, 13:25:37 Bsivko 2020-05-08, 13:28:25
BsivkoBot 2 2020-05-08, 12:50:25 Jura1 2020-05-19, 10:37:06
ZoranBot 2020-04-30, 00:57:10 Ymblanter 2020-05-06, 18:40:21
EuropeanCommissionBot 2020-04-17, 16:43:42 DannyS712 2020-07-07, 12:02:39
DeepsagedBot 1 2020-04-14, 06:16:52 DeepsagedBot 2020-05-28, 17:29:12
Uzielbot 2 2020-04-07, 23:49:11 Jura1 2020-05-16, 13:23:42
WordnetImageBot 2020-03-18, 12:17:03 DannyS712 2020-07-07, 12:03:42
Lamchuhan-hcbot 2020-03-24, 08:06:07 GZWDer 2020-03-24, 08:06:07
GZWDer (flood) 3 2018-07-23, 23:08:28 Lymantria 2020-04-11, 17:23:40
MusiBot 2020-02-28, 01:01:19 Premeditated 2020-03-18, 09:43:03
AitalDisem 2020-01-14, 15:48:04 Lymantria 2020-05-16, 08:34:23
BsivkoBot 2019-12-28, 19:38:23 Bsivko 2020-05-08, 12:35:10
KnuthuehneBot 2019-12-26, 19:12:42 Ymblanter 2019-12-27, 19:52:15
BandiBot 2019-04-17, 10:14:19 Lymantria 2019-12-11, 15:48:03
MidleadingBot 3 2019-12-09, 14:51:01 GZWDer 2019-12-11, 08:54:03
TedNye 2019-09-18, 03:16:13 Tednye 2019-09-30, 10:26:17
TidoniBot 2019-08-30, 20:07:51 Jc3s5h 2019-09-27, 20:39:14
Antoine2711bot 2019-07-02, 04:25:58 Antoine2711 2020-03-07, 17:12:16
CoRepoBot 2019-03-18, 16:22:12 Jura1 2019-07-04, 14:26:39
EpiskoBot_2 2019-06-26, 18:56:25 Looperz 2019-07-12, 13:11:21
EbeBot 2019-04-06, 20:39:24 Yair rand 2019-05-13, 04:55:18
NMBot 2019-03-23, 20:33:46 Jura1 2019-05-12, 19:43:16
SmhiSwbBot 2018-12-19, 09:02:48 Ymblanter 2019-01-20, 20:47:11
MewBot 2018-09-22, 09:38:20 Pamputt 2018-10-30, 21:58:48
GZWDer (flood) 2 2018-07-16, 13:56:24 Liuxinyu970226 2018-09-15, 22:41:50
Crossref bot 2018-04-19, 21:12:41 GZWDer 2019-07-04, 12:52:14
Wolfgang8741 bot 2018-06-18, 02:17:10 DannyS712 2020-05-10, 18:28:47
CanaryBot 2 2018-05-10, 23:46:00 Ivanhercaz 2018-05-14, 18:26:33
Maria research bot 2018-03-13, 06:15:42 GZWDer 2019-07-04, 12:55:37
AmpersandBot 2 2018-02-22, 01:43:22 Jura1 2018-03-12, 10:18:09
Taiwan democracy common bot 2018-02-09, 07:09:27 GZWDer 2019-07-04, 13:01:33
Newswirebot 2018-02-08, 13:00:18 Multichill 2020-03-15, 20:23:07
KlosseBot 2017-11-17, 20:40:22 DannyS712 2020-07-07, 12:06:11
Neonionbot 2017-10-19, 06:15:18 GZWDer 2019-07-04, 12:56:13
Handelsregister 2017-10-16, 07:39:42 Pasleim 2018-02-09, 08:46:30
Jntent's Bot 2017-06-30, 23:37:11 Matěj Suchánek 2018-08-03, 09:21:28
ZacheBot 2017-03-04, 23:29:38 Zache 2017-07-11, 11:13:15
YULbot 2017-02-21, 18:05:13 DannyS712 2020-05-10, 18:29:36
YBot 2017-01-12, 16:43:19 Lymantria 2020-04-18, 11:48:55

UallvBot 2

UallvBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Uallv (talk • contribs • logs)

Task/s: This bot will load data from Beweb Library (Q77541206) onto existing Wikidata entities, with the correct reference. If the data are already present, only a reference will be added.

Code: https://gist.github.com/DavideAllavena/1a1022a20ff6c1505b4c1b5e77ff6dd2

Function details: Data added to Wikidata (or references, where the data are already present) are manually curated and relate to the following properties:

--Uallv (talk) 11:47, 7 July 2020 (UTC)

ISSNBot

ISSNBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Tfrancart (talk • contribs • logs)

Task/s: Adds the free fields from the ISSN Portal (Q70460099) to Wikidata, on entries that already have an ISSN. The bot is operated by the ISSN International Center (ISSN International Centre (Q12131129)).

Code: https://github.com/CIEPS/ISSNBot

Function details:

ISSNBot adds statements on entries that already have a value for ISSN (P236).

See Data donation from ISSN Register - Feedback welcome and Data_donation_from_ISSN_Register_(Followup) for details.

Each serial record is described with metadata from the set already freely available from the ISSN Portal:

Note that ISSNBot:

  • Only updates items in which ISSN (P236) has a value
  • Will be used to sync data from the ISSN Register to Wikidata, so values can be modified/deleted/deprecated if they were previously synced but have since changed in the Register. The precise comparison algorithm the bot uses to create/update/delete data in Wikidata is documented at https://github.com/CIEPS/ISSNBot/wiki/ISSNBot-update-algorithm.

--Thomas Francart (talk) 20:15, 9 June 2020 (UTC)

@Tfrancart: Could you do a test run so we can review how you are doing this? How will you handle cases where some of these properties already exist on the item (and are different)? How will you handle items that have multiple ISSN's? ArthurPSmith (talk) 21:13, 3 July 2020 (UTC)
@ArthurPSmith:
  • The test run is actually already done at https://editgroups.toolforge.org/b/ISSNBot/1a3ca2ec026c4bec/. The precise code is visible in this class in the code repo: https://github.com/CIEPS/ISSNBot/blob/master/issnbot-lib/src/main/java/org/issn/issnbot/WikidataSerial.java, but it deserves better documentation, which I will do shortly. In short: if a value for a property already exists and is different, a new value is inserted, with a proper reference, and the existing value remains untouched; if the existing value is the same, a reference is added to it (a short sketch of this rule follows below). An item that has multiple ISSNs is not a problem as long as these ISSNs identify different formats of the same serial (e.g. print vs. electronic). We can detect whether there is an inconsistency because in the ISSN database multiple ISSNs are associated with an "ISSN-L" (Linking ISSN) that groups together the ISSNs identifying the same serial. If an item in Wikidata has multiple ISSNs that are actually linked to different ISSN-Ls in the ISSN database, we don't sync the data of those entries. We are working under the assumption that A SERIAL ITEM IN WIKIDATA WILL HAVE A SINGLE VALUE FOR ISSN-L (P7363), and identifies a serial independently of its format.
  • Also note that we have added the capacity in ISSNBot to clean the data synced previously, in case a value for ISSN (P236) is removed or modified on the item. In that case, the title, language and place of publication on that item having an ISSN reference can be deleted by the bot.
    • Support Ok, those edits look good to me, looking forward to seeing this completed! ArthurPSmith (talk) 18:16, 7 July 2020 (UTC)
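
Note for reviewers: a minimal sketch of the comparison rule described above, written with pywikibot (the bot itself is Java; see the WikidataSerial.java class linked above), with placeholder names only:

    import pywikibot

    def add_value_or_reference(repo, item, prop, new_value, source_claim):
        """If new_value already exists on prop, just attach the ISSN Portal
        reference to it; otherwise add a new statement with that reference."""
        for claim in item.claims.get(prop, []):
            if claim.getTarget() == new_value:
                # same value already present: only add the reference
                claim.addSources([source_claim], summary='add ISSN Portal reference')
                return
        # value absent or different: add a new statement, leave existing ones untouched
        claim = pywikibot.Claim(repo, prop)
        claim.setTarget(new_value)
        item.addClaim(claim, summary='add value from ISSN Portal')
        claim.addSources([source_claim], summary='add ISSN Portal reference')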

John Vandenberg (talk) 09:30, 2 December 2013 (UTC) Aubrey (talk) 12:15, 11 December 2013 (UTC) Daniel Mietchen (talk) 12:47, 11 December 2013 (UTC) Micru (talk) 13:09, 11 December 2013 (UTC) DarTar (talk) 01:37, 15 January 2014 (UTC) Maximilianklein (talk) 00:23, 28 March 2014 (UTC) Mvolz (talk) 08:10, 20 July 2014 (UTC) Andy Mabbett (Pigsonthewing); Talk to Andy 22:17, 27 July 2014 (UTC) Mattsenate (talk) 17:26, 14 August 2014 (UTC) author  TomT0m / talk page JakobVoss (talk) 14:25, 16 June 2016 (UTC) Mahdimoqri (talk) 08:04, 5 April 2018 (UTC) Jsamwrites Dig.log Sic19 (talk) 22:46, 12 July 2017 (UTC) Andreasmperu Nomen ad hoc Pete F (talk) 99of9 Mfchris84 (talk) 09:02, 26 November 2018 (UTC) Runner1928 (talk) 17:22, 1 December 2018 (UTC) Wittylama (talk) 09:55, 22 December 2018 (UTC) Jneubert (talk) 07:30, 22 February 2019 (UTC) --Juandev (talk) 20:28, 27 April 2019 (UTC) VIGNERON (talk) Uomovariabile (talk to me) 08:46, 24 June 2019 (UTC) SilentSpike (talk) Ecritures (talk) Tfrancart (talk) Dick Bos (talk) 10:47, 30 January 2020 (UTC) --Rdmpage (talk) 09:56, 15 May 2020 (UTC) Notified participants of WikiProject Periodicals: Does anybody else from WikiProject Periodicals want to review this?

Dexbot 13

Dexbot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Ladsgroup (talk • contribs • logs)

Task/s: Deleting items whose former sitelink has been deleted (and since removed) and that don't have any claims or backlinks from any items or properties.

Code: here

Function details: A lot of admin work consists of deleting non-notable items that existed only for a brief moment because someone created an article on a Wikipedia and it got deleted. To make sure it doesn't delete anything that satisfies WD:N, it only deletes items that don't have any statements or backlinks from any other item or property, and no sitelink either. I have deleted some as an example that you can find here --Amir (talk) 21:21, 27 June 2020 (UTC)
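
A minimal sketch, assuming pywikibot, of the kind of emptiness check described above; the actual bot code (linked above) may differ:

    import pywikibot

    def is_deletable(item):
        """True only if the item has no statements, no sitelinks and
        no pages on Wikidata (items, properties, ...) linking to it."""
        item.get()
        if item.claims or item.sitelinks:
            return False
        # any backlink on Wikidata (including usage as a statement value) keeps the item
        return not any(item.backlinks(total=1))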

Discussion
  • Comment Would you include the item's label in the edit summary? (This is now standard for manual deletions by admins and it would be odd if it weren't present when done by bot) --- Jura 11:31, 29 June 2020 (UTC)
  • Support; regarding the label, I agree that it would be better to report at least one label (if there is only one; if there are more, I don't know what the best way would be to choose which one). --Epìdosis 14:16, 29 June 2020 (UTC)

@Jura1, Epìdosis: Done. Does this look good? Amir (talk) 17:18, 3 July 2020 (UTC)

Yes in my opinion! --Epìdosis 20:23, 3 July 2020 (UTC)
Agree. Maybe the rest of the summary could be shorter. How does it determine that there are no sitelinks, backlinks or statements? --- Jura 10:43, 6 July 2020 (UTC)
You should also check whether the item is used (subscribed) in wikis other than Wikidata.--GZWDer (talk) 19:22, 7 July 2020 (UTC)

t cleanup bot

t cleanup bot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Jura1 (talk • contribs • logs)

Task/s: clean up leftovers from the last incident, once the discussion is closed

Code:

Function details: Help cleanup as needed ----- Jura 17:39, 21 June 2020 (UTC)

Non-admins can not delete items.--GZWDer (talk) 20:02, 21 June 2020 (UTC)

OlafJanssenBot

OlafJanssenBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: OlafJanssen (talk • contribs • logs)

Function: Replace dead, outdated or non-persistent links to websites of the KB (national library of the Netherlands) in Wikidata with up-to-date and/or persistent URLs

Code: https://github.com/KBNLwikimedia/WikimediaKBURLReplacement and https://github.com/KBNLwikimedia/WikimediaKBURLReplacement/tree/master/ScriptsMerlijnVanDeen/scripts,

Function details: This article explains what the bot currently does on the Dutch Wikipedia (bot edits on WP:NL are listed here). I want to be able to do the same URL replacements in Wikidata, for which I'm requesting this bot flag. The bot flag for this type of task is already enabled on the Dutch Wikipedia; see here for the approval.

--OlafJanssen (talk) 21:45, 11 June 2020 (UTC)
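
A minimal sketch, using pywikibot and assuming the replacements are simple prefix rewrites (the actual rules live in the linked repository); the URL mapping and the use of described at URL (P973) are illustrative only:

    import pywikibot

    # hypothetical mapping of outdated KB URL prefixes to persistent ones
    URL_MAP = {
        'http://www.kb.nl/old-path/': 'https://www.kb.nl/new-path/',
    }

    def fix_kb_urls(item, prop='P973'):  # P973 (described at URL) used as an example
        item.get()
        for claim in item.claims.get(prop, []):
            url = claim.getTarget()
            for old, new in URL_MAP.items():
                if url and url.startswith(old):
                    claim.changeTarget(url.replace(old, new, 1),
                                       summary='replace outdated KB URL with persistent URL')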

I will approve this task in a couple of days, provided that no objections will be raised. Lymantria (talk) 09:45, 20 June 2020 (UTC)

@Lymantria, OlafJanssen:

  • I don't really see it doing useful edits. It's somewhat pointless to edit Listeria lists ([5], etc.) and one should avoid editing archive pages [6][7]. --- Jura 10:16, 24 June 2020 (UTC)
  • Discussion should take place at User talk:OlafJanssen. Lymantria (talk) 10:21, 24 June 2020 (UTC)
  • I think it should be un-approved. Shall I make a formal request? --- Jura 10:26, 24 June 2020 (UTC)
    • No. Let's reopen this discussion. Lymantria (talk) 08:07, 26 June 2020 (UTC)

Recipe Bot

Recipe Bot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: U+1F360 (talk • contribs • logs)

Task/s: Crawl https://www.allrecipes.com and insert items into Wikidata with structured data (ingredients and nutrition information).

Code: TBD. I haven't written the bot yet. I would like to get feedback before doing so.

Function details:

  • Crawl https://www.allrecipes.com and retrieve the structured data (example) for a recipe (a sketch of this step appears below the signature).
  • Parse the list of ingredients and nutrition information; halt if any items are not parsed cleanly.
  • See if a Wikidata item already exists (unlikely, but a good safety check).
  • Create an item for the recipe with the title, structured information (ingredients and nutrition information), and URL to the full work.

--U+1F360 (talk) 14:21, 20 May 2020 (UTC)
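
Since the bot is not written yet, here is a minimal sketch of the crawling step referenced above: extracting the schema.org Recipe JSON-LD embedded in a page. The exact markup on allrecipes.com may differ, so treat the selector as an assumption:

    import json
    import re
    import requests

    def fetch_recipe_jsonld(url):
        """Return the first schema.org JSON-LD object of type Recipe found on the page."""
        html = requests.get(url, headers={'User-Agent': 'RecipeBot/0.1 (test)'}).text
        for block in re.findall(r'<script type="application/ld\+json">(.*?)</script>',
                                html, re.DOTALL):
            data = json.loads(block)
            objects = data if isinstance(data, list) else [data]
            for obj in objects:
                if obj.get('@type') == 'Recipe':
                    return obj   # contains name, recipeIngredient, nutrition, ...
        return None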

Admittedly I'm not familiar with WD's bot policy, but this does not seem useful: creating empty items would be useless, and there is no place on any project where we should be mass-posting recipes. Praxidicae (talk) 13:05, 22 May 2020 (UTC)
@Praxidicae: The items wouldn't be empty, they would contain metadata about the recipes. It would allow users to query recipes based on the ingredients, nutrition information, cook time, etc. U+1F360 (talk) 13:42, 22 May 2020 (UTC)
When I said empty, I meant empty to other projects. Wikidata shouldn't serve as a cookbook. This is basically creating a problem that doesn't exist. Praxidicae (talk) 13:44, 22 May 2020 (UTC)
@Praxidicae: Most items on Wikidata do not have any sitelinks. I asked in the project chat if adding recipes was acceptable, and at least on a conceptual level that seems fine? I believe it would meet point #2 under Wikidata:Notability. I'm not sure how it's any different from Wikidata:WikiProject_sum_of_all_paintings. U+1F360 (talk) 13:52, 22 May 2020 (UTC)
I fundamentally disagree I guess. This is effectively using Wikidata as a mini project imo. Praxidicae (talk) 13:53, 22 May 2020 (UTC)
I feel like the ship has sailed on that question (unless I'm missing something). U+1F360 (talk) 13:55, 22 May 2020 (UTC)
It wasn't a question, it's me registering my objection to this request. Which I assume is allowed...Praxidicae (talk) 13:57, 22 May 2020 (UTC)
@Praxidicae: Of course it is. :) I guess my point is that, the "problem" is that our coverage of recipes is basically non-existent. I'd like to create a bot to expand that coverage. A recipe is a valuable creative work. Of course I don't expect people to write articles about recipes (seems rather silly). In the same way, we are adding every (notable) song to Wikidata... that's a lot of music. U+1F360 (talk) 14:01, 22 May 2020 (UTC)
Which is what I find problematic. There have been proposals in the past to start a recipe based project and they have been rejected each time by the community. This is effectively circumventing that consensus. Not to mention this already exists and I also have concerns about attribution when wholesale copying from allrecipes. Praxidicae (talk) 14:03, 22 May 2020 (UTC)
What about the copyright side? Their Terms of Use specifies that the copyrights are held by the copyright owners (users) and there is no indication of free license in the website. Recipes are not mere facts, numbers and/or IDs. Also, there is no indication of "why Wikidata needs this info". — regards, Revi 14:18, 22 May 2020 (UTC)
I'll kick myself for asking, but U+1F360 (talkcontribslogs), sell this to me. Explain the copyright details, explain the instructional sections, explain how alternative ingredients will work, explain how differences in measurement units in different countries will work. This is your opportunity. Sell it to all of us. Nick (talk) 15:01, 22 May 2020 (UTC)
Let me attempt to answer "all" the questions. :) For some background, I was recently trying to find recipes based on the ingredients I have on hand. Sure, you can do a full-text search on Google, but if you have 2 potato (Q16587531), it doesn't tell you if the recipes require 2 or fewer potato (Q16587531), just that it mentions the word. :/ Also, not to mention all the other ingredients you may need that you may not have (especially during a global pandemic). I was looking for just a database of recipes (not the recipes themselves), and as far as I could find, that doesn't exist (at least not in a structured form). I also thought of many other questions which are difficult (if not impossible) to answer without such a dataset, like: What is the most common ingredient in English-language recipes? What percentage of recipes are vegetarian? Questions like this are unanswerable without a dataset of known recipes. As far as copyright is concerned, according to the US Copyright Office:

A mere listing of ingredients is not protected under copyright law. However, where a recipe or formula is accompanied by substantial literary expression in the form of an explanation or directions, or when there is a collection of recipes as in a cookbook, there may be a basis for copyright protection. Note that if you have secret ingredients to a recipe that you do not wish to be revealed, you should not submit your recipe for registration, because applications and deposit copies are public records. See Circular 33, Works Not Protected by Copyright.

At least in the United States, the "metadata" about a recipe (ingredients, nutrition information, cook time, etc.) cannot be copyrighted and therefore exists in the public domain. Since it's unclear whether the directions of a recipe are under copyright or not, I think it's safest to leave all directions in the source. As an example, let's say we have a cookbook like How to Cook Everything (Q5918527): should we not catalog every recipe from the book in Wikidata? I would think this would be valuable information, no? In my mind this is the same difference as an album like Ghosts V: Together (Q88691681) which has a list of tracks like Letting Go While Holding On (Q93522041). I am not suggesting that we create a wiki of freely licensed recipes. As @Praxidicae: mentioned, that has been proposed and rejected many times. This is the same thing as music albums with songs or TV shows with episodes. Now we could make up a threshold of notability for recipes. Does it need to be printed in a book? Does it need at least 3 reviews if on allrecipes? I'm not sure what makes a recipe notable or not, but in my mind they are valuable works of art that should be cataloged. U+1F360 (talk) 17:05, 22 May 2020 (UTC)
I realized I missed a few questions in there. Alternative ingredients should be marked with a qualifier of some kind. Measurements should remain in whatever unit is in the referenced source (as we do with all other quantities on Wikidata). The measurements could be converted when a query is performed or a recipe is retrieved. U+1F360 (talk) 17:51, 23 May 2020 (UTC)
I manually created a little example, Oatmeal or Other Hot Cereal (Q95245657), from a cookbook that I own. Open to suggestions on the data model! U+1F360 (talk) 23:02, 23 May 2020 (UTC)
Here is another example: Chef John's Buttermilk Biscuits (Q95382239). Please let me know what you think and what should change (if anything). U+1F360 (talk) 17:47, 24 May 2020 (UTC)


CovidDatahubBot

CovidDatahubBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: TiagoLubiana (talk • contribs • logs)

Task/s: Auto-update of death and case counts for outbreak items that are not regularly updated by the community.

Code:

Function details: --TiagoLubiana (talk) 21:28, 14 May 2020 (UTC)

Covid-19 country case, death and recovery counts bot

This bot will:

  • Update only items for country outbreaks that are neither known to be manually updated by dedicated editors, nor updated in the last 23 hours. This limitation is set to avoid conflicting with items that are maintained by different users across the globe.

Items will be updated using the following schema (based on E188):

Every statement will be referenced by

The code is available at this GitHub repo. Currently it is generating QuickStatements claims that are verified and manually added as a batch.

If approved, a backend using WikidataIntegrator would be implemented for daily updates.

The license of data to import is CC0 (https://datahub.io/core/covid-19)
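
A minimal sketch of one such update using pywikibot (the bot currently generates QuickStatements batches and would later use WikidataIntegrator); the item, count, date and source URL are illustrative only:

    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    def update_case_count(qid, count, date, source_url):
        item = pywikibot.ItemPage(repo, qid)
        claim = pywikibot.Claim(repo, 'P1603')            # number of cases
        claim.setTarget(pywikibot.WbQuantity(amount=count, site=repo))
        item.addClaim(claim, summary='update COVID-19 case count')
        when = pywikibot.Claim(repo, 'P585')              # point in time qualifier
        when.setTarget(pywikibot.WbTime(year=date.year, month=date.month, day=date.day))
        claim.addQualifier(when)
        ref = pywikibot.Claim(repo, 'P854')               # reference URL
        ref.setTarget(source_url)
        claim.addSources([ref])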

Specifically, these pages have discussions that are key for the matter:

Discussion

  • Comment It would be good if you could do a few test edits. --Daniel Mietchen (talk) 02:56, 15 May 2020 (UTC)
  • @Daniel Mietchen: We ran the code a couple of times, and it worked fine. When the information remains the same for a few days, the qualifiers are stacked (see 2020 COVID-19 pandemic in Angola (Q88082534) for example) , but I'm not sure if that is undesirable. Notably, besides this effort, the last update for case counts in the Angola item is from more than a month ago. TiagoLubiana (talk) 23:13, 17 May 2020 (UTC)
  • Is there are a way to include a reference to the actual source (instead of an aggregator)? --- Jura 10:04, 16 May 2020 (UTC)
  • @Jura1: The aggregator uses many different sources, each with its own format, specification, and update rate, so having references to the original sources would require an enormous effort. I have tried to use the WHO reports for this goal, but they are severely delayed in comparison to the JHU and the national sources. Countries that have items that are updated by the Wikidata community with local sources are not touched by the bot. TiagoLubiana (talk) 23:13, 17 May 2020 (UTC)
@TiagoLubiana: If you are still interested in this task please register the bot.--Ymblanter (talk) 18:10, 12 June 2020 (UTC)
@Ymblanter:. Hello, thanks for commenting. What do you mean by registering the bot? Making the bot account itself and start running? TiagoLubiana (talk) 15:31, 13 June 2020 (UTC)
Yes, not to start running, but to make some test edits from the bot account.--Ymblanter (talk) 15:32, 13 June 2020 (UTC)
  • Hi folks, just a note to say that myself and some others are very interested in this bot and its potential to help us out with some dynamic data updating on enwiki (see en:Template talk:COVID-19 pandemic data/Per capita#List of countries whose cases per capita is noticeably incorrect). I'm a developer myself, and a botop on enwiki, so if there's anything I can do to help speed the process along, I'm happy to help out! Naypta (talk) 20:05, 13 June 2020 (UTC)
    As soon as the bot has been registered and made some test edits (which still has not happened yet), assuming the edits are good, I will ask the previous commenters and will approve the bot in several days provided no objections have been made.--Ymblanter (talk) 18:33, 14 June 2020 (UTC)
    @Ymblanter: @Naypta: Great! I will create the account and do a couple test edits tomorrow. I will be able to figure out which points would require more development tomorrow, and then I report here. Is that ok? Thank you, TiagoLubiana (talk) 23:38, 14 June 2020 (UTC)
    Yes, please. The policy says it should be 50 test edits, we are not going to count them but please make the selection representative so that the users could judge.--Ymblanter (talk) 05:40, 15 June 2020 (UTC)
  • @Ymblanter: I have done a few test edits now (not 50 yet). The current implementation of the bot uses Wikidata Integrator. It is currently removing the previous values for number of deaths (P1120), number of recoveries (P8010) and number of cases (P1603) and leaving only the current, most up-to-date one. In Wikidata:WikiProject COVID-19 we are converging on the idea of having legacy data as tabular case data and making sure that the current numbers are actually correct. Otherwise, it becomes a nightmare to make sure that the numbers are accurate. The bot only updates items that have not been detected as being manually curated by any community, so as to avoid clashing with concurrent efforts. CovidDatahubBot (talk) 14:07, 16 June 2020 (UTC)
    Thanks. Pinging the previous participants, @Daniel Mietchen, Jura1:--Ymblanter (talk) 14:15, 16 June 2020 (UTC)
    It deleted data. Please block the bot until it's repaired. --- Jura 14:35, 16 June 2020 (UTC)
    @Jura1: Okay. As I mentioned, this was partially intended, due to the consensus being built around tabular case data for legacy info. But sure, I understand. I will fix it so it only adds data. Thanks. If it is blocked, however, it would be harder to test. Can I just refrain from editing before this is fixed? TiagoLubiana (talk) 01:27, 19 June 2020 (UTC)
    Can you point to the discussion where this is "being built"? P8204 was clearly not intended for that and if you want to add tabular info, you need to run this bot on Commons, not here. --- Jura 06:06, 19 June 2020 (UTC)
  • @Jura1: I can point to the log of the open project meetings at Wikidata:WikiProject_COVID-19/Project_meeting/Archive. I also would like to invite you for the meeting, if you would like to. It is really spread throughout Wikidata. There is something in that direction in the Property proposal itself, for example. I believe that @Mxn:'s work at creating P8204 has been motivated by the changes in historical data when case counts are updated. I do not intend to add tabular info yet, but it is currently the best way of keeping this info in a clean, coherent format, as far as I am aware of. TiagoLubiana (talk) 01:32, 29 June 2020 (UTC)
  • To elaborate, many jurisdictions are revising past days' case and death counts based on updated information (since tests can take some time to come back and investigation is often needed to determine whether the person is actually a resident of that jurisdiction). For example, this jurisdiction currently reports 437 cases and 10 deaths as of March 19, whereas on March 19 they reported only 196 cases and eight deaths. Unless the secondary source and this bot goes back and updates past days' figures to match the primary source, it's better to delete the old figures than to keep them around. – Minh Nguyễn 💬 05:05, 29 June 2020 (UTC)
    The question about historical data is understood; it's just not the Wikidata way to re-write history.
    Anyway, if you just want to upload a table of current values for all countries, the place to do that is Commons. --- Jura 08:34, 29 June 2020 (UTC)
  • Hi all! What is the status of this currently? We've had to remove the per capita data table (which sources from Wikidata) from en-Wikipedia due to the number of outdated items, but we're ready to bring it back once this is running. {{u|Sdkb}}talk 21:06, 28 June 2020 (UTC)
  • @Sdkb: Thanks! I really want to get this done, but it is not easy (currently) to just update statements using the WikidataIntegrator infrastructure. Doing it in my spare time. If someone wants to take over and just make this faster, it would be great. TiagoLubiana (talk) 01:32, 29 June 2020 (UTC)
@TiagoLubiana: Thanks for the update! And apologies if the above came across as rushing — I recognize from the discussion so far that it's a complex task that'll take some time. Cheers, {{u|Sdkb}}talk 01:38, 29 June 2020 (UTC)

LouisLimnavongBot

LouisLimnavongBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: LouisLimnavong (talk • contribs • logs)

Task/s: Bot to get birthplace and nationality for a list of artists.

Code:

    import pywikibot

    # get the Wikidata item linked to the English Wikipedia article "Khalid"
    site = pywikibot.Site("en", "wikipedia")
    page = pywikibot.Page(site, "Khalid")
    item = pywikibot.ItemPage.fromPage(page)

Function details: --LouisLimnavong (talk) 13:08, 14 May 2020 (UTC)

BsivkoBot

BsivkoBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Bsivko (talk • contribs • logs)

Task/s:

  • The term "видеоигра" ("video game") is not the correct common description for a video game in Russian; "компьютерная игра" ("computer game") is the term used on ruwiki, as can be seen for Q7889, its root (and child) categories, and other cases. However, despite the language difference, the first term was imported in bulk by bot(s?) (example). To fix this mistake we can use the same bot approach, and at the same time fill in the description for items where it is empty.

Examples of the first case and of the second one.

Code:

  • I use pywikibot, and there is a function which checks for the presence of the mistake and fixes it. For items with no Russian description, it prepares a short one:

def process_wikidata_computergame(title):
    # get_wikidata_item is a local helper returning the ItemPage (with data loaded)
    # for the given ruwiki article title
    item = get_wikidata_item("ru", title)
    if not item:
        return
    if 'ru' in item.descriptions:
        # fix the wrong term in an existing Russian description
        if "видеоигра" in item.descriptions['ru']:
            item.descriptions['ru'] = item.descriptions['ru'].replace("видеоигра", "компьютерная игра")
            item.editDescriptions(descriptions=item.descriptions,
                                  summary=u'"компьютерная игра" is the common term for "video game" in Russian')
    else:
        # no Russian description yet: add one if the item is an instance of video game (Q7889)
        claims = item.claims.get('P31', [])
        if claims and claims[0].target and claims[0].target.id == 'Q7889':
            item.descriptions['ru'] = "компьютерная игра"
            item.editDescriptions(descriptions=item.descriptions,
                                  summary=u'added Russian description')


Function details: --Bsivko (talk) 13:25, 8 May 2020 (UTC)

  • The bot works in the background alongside its other article processing, and it doesn't do a broad scan. Bsivko (talk) 13:25, 8 May 2020 (UTC)

BsivkoBot

BsivkoBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Bsivko (talk • contribs • logs)

Task/s: Mark external-identifier claims (P2924, P4342, P6081) as deprecated when the linked site reports that the article is absent.

Code:

  • I use pywikibot, and there is a piece of software which takes a property, makes a request to the corresponding URL, gets the page text, recognizes article absence, and switches the claim to deprecated if the keywords indicating absence were found:
import pywikibot
import requests

def url_checking(title, page):
   try:
       item = pywikibot.ItemPage.fromPage(page)
   except pywikibot.exceptions.NoPage:
       return
   if item:
       item.get()
   else:
       return
   if not item.claims:
       return
   id_macros = "##ID##"
   cfg = [
       {
           'property': 'P2924',
           'url': 'https://bigenc.ru/text/' + id_macros,
           'empty_string': 'Здесь скоро появится статья',
           'message': 'Article in Great Russian Encyclopedia is absent'
       },
       {
           'property': 'P4342',
           'url': 'https://snl.no/' + id_macros,
           'empty_string': 'Fant ikke artikkelen',
           'message': 'Article in Store norske leksikon is absent'
       },
       {
           'property': 'P6081',
           'url': 'https://ria.ru/spravka/00000000/' + id_macros + '.html',
           'empty_string': 'Такой страницы нет на ria.ru',
           'message': 'Article in RIA Novosti is absent'
       },
   ]
   for single in cfg:
       if single['property'] in item.claims:
           for claim in item.claims[single['property']]:
               rank = claim.getRank()
               if rank == 'deprecated':
                   continue
               value = claim.getTarget()
               url = single['url'].replace(id_macros, value)
               print("url:" + url)
               r = requests.get(url=url)
               print("r.status_code:" + str(r.status_code))
               if r.status_code == 200:
                   if single['empty_string'] in r.text:
                       claim.changeRank('deprecated',
                                        summary=single['message'] + " (URL: '" + url + "').")
               pass
   pass


Function details:

  • The bot works in the background while processing other articles on ruwiki, so it doesn't do a broad scan. Also, there are not that many bad URLs, and therefore the activity is low (a few contributions per month). Bsivko (talk) 12:49, 8 May 2020 (UTC)

ZoranBot

ZoranBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Zoranzoki21 (talk • contribs • logs)

Task/s: Removing old interwiki links from Wikimedia projects and migrating them here

Code: I will use Pywikibot script interwikidata.py

Function details: The bot will remove interwiki links from pages on Wikimedia projects and migrate them here (for example when I process this, or remove interwiki links on template- and category-related pages on the Serbian Wikipedia). --Zoranzoki21 (talk) 00:56, 30 April 2020 (UTC)

Please make some test edits.--Ymblanter (talk) 18:40, 6 May 2020 (UTC)

DeepsagedBot 1

DeepsagedBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Deepsaged (talk • contribs • logs)

Task/s: Import Russian lexemes and senses from ru.wiktionary.org

Code:

Function details: --Deepsaged (talk) 06:16, 14 April 2020 (UTC)

@Deepsaged: please make some test edits--Ymblanter (talk) 19:06, 14 April 2020 (UTC)
@Ymblanter: done: создать (L297630), сотворить (L301247), небо (L301348) DeepsagedBot (talk) 17:26, 28 May 2020 (UTC)

Uzielbot 2

Uzielbot 2 (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Uziel302 (talk • contribs • logs)

Task/s: mark broken links as deprecated

Code: https://github.com/Uziel302/wikidatauploadjson/blob/master/deprecatebrokenlinks

Function details: Simple wbeditentity calls to mark broken official links as deprecated. I did a few examples on my bot account; all the proposed edits are of the same nature. I detect broken links based on the HTTP response (no response/400/404 are considered broken). --Uziel302 (talk) 23:49, 7 April 2020 (UTC)
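
For reference, a minimal sketch of the kind of response check described; the retry policy discussed below is deliberately not part of this sketch:

    import requests

    BROKEN = {400, 404}

    def looks_broken(url):
        """One-shot check; per the discussion below, a link should be
        re-checked several times before it is actually marked deprecated."""
        try:
            r = requests.head(url, allow_redirects=True, timeout=30)
        except requests.RequestException:
            return True          # no response
        return r.status_code in BROKEN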

No response/400/404 may have multiple reasons, including temporary maintenance, content that is only accessible when signed in, content only accessible in some countries, internet censorship, etc.--GZWDer (talk) 06:14, 8 April 2020 (UTC)
GZWDer, which of these edge cases are not relevant in manual checking? How is it possible to really detect broken links? And if no such option exists, should we ban "reason for deprecation: broken link"? Uziel302 (talk) 17:23, 8 April 2020 (UTC)
This means you should not flag them as broken links without checking them multiple times.--GZWDer (talk) 17:26, 8 April 2020 (UTC)
GZWDer, no problem, how many is multiple? Uziel302 (talk) 21:45, 8 April 2020 (UTC)
I am the main bot writer on the Hebrew Wikipedia and have written over 500 bots over the years. I can testify that broken links are a big problem and we need to resolve it at the source. I discussed it with Uziel302 prior to him writing here and I am convinced the method suggested here is the preferred one. Let's move forward to clean up these broken links so they do not bother us any more. בורה בורה (talk) 09:18, 13 April 2020 (UTC)
@GZWDer: Would you react to the question? Is there a benchmark for considering a link broken? Repeated checks with a minimal number of checks and a minimal time span? Lymantria (talk) 08:30, 16 May 2020 (UTC)
I don't think it should set them to deprecated. You could add "end cause" = "404". --- Jura 13:23, 16 May 2020 (UTC)

WordnetImageBot

WordnetImageBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: WordnetImageBot (talk • contribs • logs)

Task/s:

This bot is part of a Final Degree Project in which I link the offsets/codes of WordNet words with the words and images of Wikidata.

Code:

To be done.

Function details: --WordnetImageBot (talk) 12:16, 18 March 2020 (UTC)

Link words and images with the words of WordNet, that is, add an exact match (P2888) URL to those words that don't yet have a link to WordNet. If a word in Wikidata doesn't have an image, this bot will add the image.
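
A minimal sketch, assuming pywikibot, of adding such an exact match (P2888) statement; the WordNet URI passed in is a placeholder built by the caller from the offset/code:

    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    def add_wordnet_match(qid, wordnet_uri):
        """wordnet_uri: hypothetical URI derived from the WordNet offset/code."""
        item = pywikibot.ItemPage(repo, qid)
        item.get()
        existing = [c.getTarget() for c in item.claims.get('P2888', [])]
        if wordnet_uri in existing:
            return
        claim = pywikibot.Claim(repo, 'P2888')   # exact match (URL datatype)
        claim.setTarget(wordnet_uri)
        item.addClaim(claim, summary='link item to WordNet')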

Please, make some test edits and create the bot's user page containing {{Bot}}. Lymantria (talk) 06:36, 27 April 2020 (UTC)
@Andoni723: reminder to make the test edits --DannyS712 (talk) 12:03, 7 July 2020 (UTC)

Taigiholic.adminbot

Taigiholic.adminbot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Taigiholic (talk • contribs • logs)

Task/s: interwiki linking and revising

Code: using pywikibot scripts

Function details:

  • The account is already an adminbot on nanwiki, where its work is mainly interwiki linking and revising in a semi-automatic and batch way.
  • The operator owns another bot named User:Lamchuhan-bot, which mainly works on interwiki linking for NEW ARTICLES only (such work will mainly be done by this new requesting account once it gets the flag), with no revising work.
  • Only a "normal bot" flag is requested here at this site, not an "adminbot" flag.

Thanks.--Lamchuhan-hcbot (talk) 00:17, 16 March 2020 (UTC)

Thanks.--Lamchuhan (talk) 00:19, 16 March 2020 (UTC)

I think I sort of understand what the task is, but could you please be more specific?--Jasper Deng (talk) 06:59, 16 March 2020 (UTC)

@Jasper Deng: The bot will run by using pywikibot scripts on nanwiki. Some of the tasks will have to run interwiki actions such as:

item.setSitelink(sitelink={'site': 'zh_min_nanwiki', 'title': 'XXXXX'}, summary=u'XXXXX')

Thanks.--Lamchuhan (talk) 07:34, 16 March 2020 (UTC)

Please register the bot account and make some test edits.--Ymblanter (talk) 20:24, 18 March 2020 (UTC)

GZWDer (flood) 3

GZWDer (flood) (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: GZWDer (talk • contribs • logs)

Task/s: Creating items for all Unicode characters

Code: Unavailable for now

Function details: Creating items for 137,439 characters (probably excluding those not in Normalization Forms):

  1. Label in all languages (if the character is printable; otherwise only Unicode name of the character in English)
  2. Alias in all languages for U+XXXX and in English for Unicode name of the character
  3. Description in languages with a label of Unicode character (P487)
  4. instance of (P31)Unicode character (Q29654788)
  5. Unicode character (P487)
  6. Unicode hex codepoint (P4213)
  7. Unicode block (P5522)
  8. writing system (P282)
  9. image (P18) (if available)
  10. HTML entity (P4575) (if available)
  11. For characters in Han script also many additional properties; see Wikidata:WikiProject CJKV character

For characters with existing items the existing items will be updated.
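
A minimal sketch of how the label and items 5–6 of the list above can be derived per character with Python's unicodedata; the Unicode block, script and multilingual labels need external data and are not shown:

    import unicodedata

    def character_record(codepoint):
        """Basic per-character values: P487 (Unicode character),
        P4213 (Unicode hex codepoint) and the English label rule."""
        ch = chr(codepoint)
        name = unicodedata.name(ch, None)        # e.g. 'GREEK CAPITAL LETTER OMEGA'
        return {
            'P487': ch,
            'P4213': 'U+%04X' % codepoint,
            # label: the character itself if printable, otherwise its Unicode name
            'label_en': ch if ch.isprintable() else name,
        }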

Question: Do we need only one item for characters with the same normalized forms, e.g. Ω (U+03A9, GREEK CAPITAL LETTER OMEGA) and Ω (U+2126, OHM SIGN)?--GZWDer (talk) 23:08, 23 July 2018 (UTC)

CJKV characters belonging to CJK Compatibility Ideographs (Q2493848) and CJK Compatibility Ideographs Supplement (Q2493862), such as 著 (U+FA5F) (Q55726748) and 著 (U+2F99F) (Q55738328), will need to be split from their normalized form, e.g. (Q54918611), as each of them has different properties. KevinUp (talk) 14:03, 25 July 2018 (UTC)

Request filed per suggestion on Wikidata:Property proposal/Unicode block.--GZWDer (talk) 23:08, 23 July 2018 (UTC)

Support I have already expressed my wish to import such a dataset. Matěj Suchánek (talk) 09:25, 25 July 2018 (UTC)
Support @GZWDer: Thank you for initiating this task. Also, feel free to add yourself as a participant of Wikidata:WikiProject CJKV character. [8] KevinUp (talk) 14:03, 25 July 2018 (UTC)
Support Thank you for your contribution. If possible, I hope you will also add other code (P3295) values such as JIS X 0213 (Q6108269) and Big5 (Q858372) to the items you create or update. --Okkn (talk) 16:35, 26 July 2018 (UTC)
  • Oppose the use of the flood account for this. Given the problems with the unapproved defective bot run under the "GZWDer (flood)" account, I'd rather see this being done with a new account named "bot" as per policy.
    --- Jura 04:50, 31 July 2018 (UTC)
  • Perhaps we could do a test run of this bot with some of the 88,889 items required by Wikidata:WikiProject CJKV character and take note of any potential issues with this bot. @GZWDer: You might want to take note of the account policy required. KevinUp (talk) 10:12, 31 July 2018 (UTC)
  • This account has had a bot flag for over four years. While most bot accounts contain the word "bot", there is nothing in the bot policy that requires it, and a small number of accounts with the bot flag have different names. As I understand it, there is also no technical difference between an account with a flood flag and an account with a bot flag, except for who can assign and remove the flags. - Nikki (talk) 19:14, 1 August 2018 (UTC)
  • The flood account was created and authorized for activities that aren't actually bot activities, while this new task is one. Given that defective bot tasks have already been run with the flood account, I don't think any actual bot tasks should be authorized. It's sufficient that I already had to clean up tens of thousands of GZWDer's edits.
    --- Jura 19:46, 1 August 2018 (UTC)
I am ready to approve this request, after a (positive) decision is taken at Wikidata:Requests for permissions/Bot/GZWDer (flood) 4. Lymantria (talk) 09:11, 3 September 2018 (UTC)
  • Wouldn't these fit better into Lexeme namespace? --- Jura 10:31, 11 September 2018 (UTC)
    There is no language with all Unicode characters as lexemes. KaMan (talk) 14:31, 11 September 2018 (UTC)
    Not really a problem. language codes provide for such cases. --- Jura 14:42, 11 September 2018 (UTC)
    I'm not talking about language code but language field of the lexeme where you select q-item of the language. KaMan (talk) 14:46, 11 September 2018 (UTC)
    Which is mapped to a language code. --- Jura 14:48, 11 September 2018 (UTC)
Note: I'm going to be inactive due to real-life issues, so this request is On hold for now. Comments are still welcome, but I'm not able to answer until January 2019.--GZWDer (talk) 12:08, 13 September 2018 (UTC)


MusiBot

MusiBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Joaquinserna (talk • contribs • logs)

Task/s: Query the Genius.com API and add Genius artist ID (P2373) and Genius artist numeric ID (P6351) statements to the respective artist in case they haven't been added before.

Code: Not provided. Using WikidataJS for SPARQL querying and claim editing.

Function details: Sequentially query Genius.com for every possible ArtistID, search Wikidata for any singer (Q177220) or musical group (Q215380) with the same label as the Genius artist name, check if it has Genius artist ID (P2373) and Genius artist numeric ID (P6351) statements, then add them if necessary.
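
A minimal sketch of the Genius lookup step in Python (the bot itself uses WikidataJS); the endpoint shape and response fields are assumptions, and the token is a placeholder:

    import requests

    GENIUS_TOKEN = 'YOUR_API_TOKEN'   # placeholder

    def genius_artist_name(artist_id):
        """Return the artist name for a Genius numeric artist ID, or None on failure."""
        r = requests.get('https://api.genius.com/artists/%d' % artist_id,
                         headers={'Authorization': 'Bearer ' + GENIUS_TOKEN})
        if r.status_code != 200:
            return None
        return r.json()['response']['artist']['name']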

Discussion:

Already did a successful test here, forcing Muntader Saleh (Q4115189) to be the updated ID; Genius ArtistID no. 1 is Cam'ron (Q434913), which already has Genius artist ID (P2373) and Genius artist numeric ID (P6351).

Joaquín Serna (talk) 01:11, 28 February 2020 (UTC)

Could you please make a bit more test edits and on real items?--Ymblanter (talk) 20:00, 3 March 2020 (UTC)
: Done, you can check it out here Joaquín Serna (talk)
Add Genius artist numeric ID (P6351) as a qualifier to Genius artist ID (P2373). If you gather the data from Genius API, use Genius API (Q65660713) as reference. Optionally if you could also add has quality (P1552)verified account (Q28378282) for "Verified Artist" that would be great. - Premeditated (talk) 09:42, 18 March 2020 (UTC)

AitalDisem

AitalDisem (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Aitalvivem (talk • contribs • logs)

Task/s: This bot is made to create a sense (with an item) for every Occitan lexeme in Wikidata. This is the direct continuation of AitalvivemBot. It is a web application presented as a game. The program will get (for each lexeme) its French translation from Lo Congres's data (unfortunately those are private data, so we can't insert them into Wikidata) and look for every item having the Occitan word or its translation in its label. Then it will use the collaborative work of the community to select the correct senses and, once validated, insert them into Wikidata. This program has the same goal as Michael Schoenitzer's MachtSinn but uses a translation database. I am also trying to make this program simple to adapt to other languages, and to provide complete documentation.

Code: You can find my code and documentation here

Function details: It would take very long to list every function of this program (you can find them in the documentation here and here) but overall this bot will:

  • get information about lexemes, senses and items
  • create senses
  • add items to senses using Property:P5137

The program will also verify that it has enough positive responses from users before inserting a sense. All the details about how a game proceeds, the test of a user's reliability and the verification before inserting a sense are in the documentation.

--Aitalvivem (talk) 15:48, 14 January 2020 (UTC)

  • Support This seems like a nice approach to collaborative translation. ArthurPSmith (talk) 17:45, 15 January 2020 (UTC)
  • Can you provide some test edits (say 50-100)? Lymantria (talk) 10:30, 18 January 2020 (UTC)
    • @Lymantria: Hi, I did two test runs on 30 lexemes; for each lexeme I did two edits: one to add a sense and the other to add an item to this sense.
Here is the list of the lexemes : Lexeme:L41768, Lexeme:L44861, Lexeme:L57835, Lexeme:L57921, Lexeme:L235215, Lexeme:L235216, Lexeme:L235217, Lexeme:L235219, Lexeme:L235221, Lexeme:L235222, Lexeme:L235223, Lexeme:L235225, Lexeme:L235226, Lexeme:L235227, Lexeme:L235228, Lexeme:L235229, Lexeme:L235231, Lexeme:L235232, Lexeme:L235234, Lexeme:L235235, Lexeme:L235236, Lexeme:L235239, Lexeme:L235240, Lexeme:L235242, Lexeme:L235243, Lexeme:L235244, Lexeme:L235245, Lexeme:L235246, Lexeme:L235247, Lexeme:L235248
The first test failed because of a stupid mistake of mine in the configuration file of the program. For the second test I had a problem when adding the item for Lexeme:L235226 because there were quotation marks in the description of the item, so I fixed the problem, ran it again and everything went well.--Aitalvivem (talk) 10:11, 21 January 2020 (UTC)
I take it the test edits are the ones by the IP? Lymantria (talk) 08:31, 22 January 2020 (UTC)
Yes, I used the bot account to connect to the API but I don't know why it shows the IP instead of the bot's account--Aitalvivem (talk) 09:53, 22 January 2020 (UTC)
I would like to see you succeed to do so. Lymantria (talk) 12:57, 26 January 2020 (UTC) (@Aitalvivem: 07:02, 29 January 2020 (UTC))
@Aitalvivem: Any progress? Lymantria (talk) 08:34, 16 May 2020 (UTC)
  • Note for closing bureaucrats: IPBE granted for 6 months per Special:Permalink/1102840151#IP blocked; please switch to permanent IPBE when you approve it. (Or the bot master should consider using Wikimedia Cloud Services where you don't get any IP blocks and a server env for use.) — regards, Revi 14:14, 22 January 2020 (UTC)

BsivkoBot

BsivkoBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Bsivko (talk • contribs • logs)

I already have BsivkoBot at ruwiki. Do I need to register another account here, or should I link the existing bot? (Right now, the user account "BsivkoBot" is not registered here.)

Task/s:

  • Check Britannica URL (P1417) and clean up invalid claims. The reason is that there are a lot of claims which link to nowhere. For instance, Q1143358 has P1417 set to sports/shortstop, which goes to https://www.britannica.com/sports/shortstop, where we get "Britannica does not currently have an article on this topic".

Code:

  • I use pywikibot, and currently I have a piece of software which gets P1417, makes a request to the URL, gets the page text, recognizes article absence and stops with a permissions exception:
    # fragment: 'page' is the ruwiki page being processed (needs: import pywikibot, requests)
    item = pywikibot.ItemPage.fromPage(page)
    if item:
        item.get()
        if item.claims:
            if 'P1417' in item.claims:
                brit_value = item.claims['P1417'][0].getTarget()
                brit_url = "https://www.britannica.com/" + brit_value
                r = requests.get(url=brit_url)
                if r.status_code == 200:
                    if "Britannica does not currently have an article on this topic" in r.text:
                        item.removeClaims(item.claims['P1417'],
                                          summary=f"Article in Britannica is absent (URL: '{brit_url}').")

Afterwards, I'm going to make test runs and integrate it with the other bot functions (I work with external sources at ruwiki, and in some cases auto-captured links from Wikidata are broken, which leads to user complaints).

Function details: --Bsivko (talk) 19:37, 28 December 2019 (UTC)

Please create an account for your bot here and make some test edits.--Ymblanter (talk) 21:26, 28 December 2019 (UTC)
I logged in as BsivkoBot via ruwiki and went to Wikidata. This created the account (the user BsivkoBot exists here now). After that, I made a couple of useful actions by hand (not with the bot), so BsivkoBot can do something on the project. Next, I tried to run the code above and the exception changed to a different one:

{'error': {'code': 'failed-save', 'info': 'The save has failed.', 'messages': [{'name': 'wikibase-api-failed-save', 'parameters': [], 'html': {'*': 'The save has failed.'}}, {'name': 'abusefilter-warning-68', 'parameters': ['new editor removing statement', 68], 'html': {'*': 'Warning: The action you are about to take will remove a statement from this entity. In most cases, outdated statements should not be removed but a new statement should be added holding the current information. The old statement can be marked as deprecated instead.'}}], 'help': 'See https://www.wikidata.org/w/api.php for API usage. ..

I checked that it is possible to remove the claim. So, the problem is on the bot side. Could you please help me: is it a permission problem, or should the code be different? (As I see, it requires write rights, but I do not see any such rights now.) Bsivko (talk) 00:15, 29 December 2019 (UTC)
I changed the logic to setting the deprecated rank instead, and it was a success! The bot changed the rank and the value disappeared for users in our article. After a test run, the code is the following:
       if item.claims:
           if 'P1417' in item.claims:
               for claim in item.claims['P1417']:
                   brit_value = claim.getTarget()
                   brit_url = "https://www.britannica.com/" + brit_value
                   r = requests.get(url=brit_url)
                   if r.status_code == 200:
                       if "Britannica does not currently have an article on this topic" in r.text:
                           claim.changeRank('deprecated', summary="Article in Britannica is absent (URL: '" + brit_url + "').")
               pass

Currently, it works. I'll integrate it into production. Bsivko (talk) 11:58, 29 December 2019 (UTC)

@Bsivko: The above error means you will require a confirmed flag for your bot.--GZWDer (talk) 21:03, 29 December 2019 (UTC)
Ok, I've got it, thank you for the explanation! I already implemented the function and rank changing is enough, it resolved the problem. Bsivko (talk) 21:10, 29 December 2019 (UTC)
Note your edits may be controversial. You should reach a consensus for such edits. (I don't support such edits, but someone may.)--GZWDer (talk) 21:48, 29 December 2019 (UTC)
I understand. I just started the discussion on chat. Please, join. Bsivko (talk) 00:20, 30 December 2019 (UTC)
  • Strong oppose per my comments when this was discussed in 2016. These are not "invalid claims". Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:07, 30 December 2019 (UTC)
  • @Bsivko: BsivkoBot appears to be editing without authorization; there isn't support for more edits here, and I don't see another permissions request. Please stop the bot, or I will have to block it --DannyS712 (talk) 02:58, 8 May 2020 (UTC)
  • As I see, current topic is still under discussion, and functions above are off till that moment. For the extra stuff I'll open another branch. Bsivko (talk) 12:35, 8 May 2020 (UTC)

KnuthuehneBot

KnuthuehneBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Knuthuehne (talk • contribs • logs)

Task/s: Import suicide rate data published by the World Health Organization into countries' pages.

Code: https://tools.wmflabs.org/quickstatements/#/batch/16246 Data was downloaded from WHO and prepared with OpenRefine, then exported to QuickStatements.

Function details: The import will add the crude suicide rate (both genders) for the years 2000, 2005, 2010, 2015 and 2016 where available. It will also add a source qualifier to all statements. A test run made using my account can be seen here: https://tools.wmflabs.org/quickstatements/#/batch/16246

--Knuthuehne (talk) 19:12, 26 December 2019 (UTC)

@Knuthuehne: Is the data CC-0?--Ymblanter (talk) 19:52, 27 December 2019 (UTC)

BandiBot

BandiBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Vsuarezp (talk • contribs • logs)

Task/s: The main tasks are: add missing labels, descriptions and aliases for properties in Asturian language.

Code: No public repository, own script based on Wikimate.

Function details: The Asturian Wikipedia's templates depend a lot on Wikidata, so the goal of this bot is to add missing translations to entities (labels, descriptions and aliases). BandiBot is flagged as a bot on the Asturian Wikipedia, where it has performed more than 85,000 edits. --BandiBot (talk) 10:14, 17 April 2019 (UTC)
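
A minimal sketch of the label-filling step; the bot itself is based on Wikimate (PHP), so this pywikibot version is only illustrative:

    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    def add_asturian_label(qid, label):
        item = pywikibot.ItemPage(repo, qid)
        item.get()
        if 'ast' not in item.labels:          # only fill in missing labels
            item.editLabels({'ast': label}, summary='add missing Asturian label')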

@Vsuarezp: Any ambition to use the bot, which has been blocked since April? Lymantria (talk) 15:47, 11 December 2019 (UTC)

MidleadingBot 3

MidleadingBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Midleading (talk • contribs • logs)

Task/s: This bot will import courtesy name (P1782), temple name (P1785), posthumous name (P1786) and art-name (P1787) from Wikipedia articles. These properties are basic properties of historical Chinese people, yet their usage is extremely low. Almost every such person should have courtesy name (P1782) set.

Code: The code is not public because it must be reprogrammed before each run to adapt to the varied composition of Wikipedia articles.

Function details: It may import the data from anywhere in a Wikipedia article using regular expressions. The preprocessing is manual and usually begins with an export of Wikipedia articles, followed by manual checking, pattern matching and testing; the final data are then imported and sourced to the Wikipedia article they were imported from.

It may also import from Wikisource text, referenced with imported from Wikimedia project (P143) Chinese Wikisource (Q19822573). --Midleading (talk) 14:50, 9 December 2019 (UTC)
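
A minimal sketch of the regex-based extraction described above; the infobox parameter name is hypothetical and, as the request notes, the pattern would be adjusted for each run:

    import re

    # hypothetical parameter name in a Chinese-Wikipedia infobox; adjust per run
    COURTESY_RE = re.compile(r'\|\s*字\s*=\s*([^|\n}]+)')

    def extract_courtesy_name(wikitext):
        """Return the courtesy name found in the article wikitext, or None."""
        m = COURTESY_RE.search(wikitext)
        return m.group(1).strip() if m else None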

You may find CBDB to be another possible source to import from.--GZWDer (talk) 13:18, 10 December 2019 (UTC)
I found people have already imported information from CBDB as aliases, but they just create new items instead of checking for existing items linked to Wikipedia. I imported about 2000 courtesy name (P1782) values, and they contributed to nearly 2000 duplicate items found by matching imports from Wikipedia and CBDB. I merged about 1000 of them without sex or gender (P21) using this bot, and those with sex or gender (P21) will be merged later, due to Wikidata Query Service lag followed by the QuickStatements incident. Also, I don't trust CBDB that much, as they create duplicate records with so little information to link them to a Wikidata item, and the effort to link them would be better spent importing from trusted text on Wikisource.--Midleading (talk) 14:59, 10 December 2019 (UTC)

TedNye

TedNye (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Tednye (talk • contribs • logs)

I am studying migration patterns of people from Europe to USA. I want to query Wikidata using PHP.

Code:

Function details: --Tednye (talk) 03:16, 18 September 2019 (UTC)

For queries no bot flag is needed. That's for (mass) edits. Lymantria (talk) 10:49, 18 September 2019 (UTC)

My file_get_contents calls are denied because I don't have sufficient permissions. What is the remedy for this?

What's the URL you're trying to retrieve? Mbch331 (talk) 20:19, 21 September 2019 (UTC)

Using PHP, file_get_contents('https://www.wikidata.org/wiki/Q22686') will not return the page. I can extract the page using a tool, but I want to do it programmatically.

What is the IP-address of the server you are using? I tested it and I can get the content of the page. Mbch331 (talk) 07:29, 22 September 2019 (UTC)

My home IP address is 99.226.235.1.
The GoDaddy servers are 184.168.205.1 and 184.168.40.1.

My usage will be moderate, less than 100,000 queries per month.


http://hanf.cash/extract2019/ produces error: SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed in $buffer = file_get_contents('https://www.wikidata.org/wiki/Q22686');

Can someone guide me here? Is it my server's IP address, 184.168.205.1, that is the problem?
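As a sketch of the usual remedy (shown here in Python rather than PHP): fetch the entity's JSON from the Special:EntityData endpoint and send a descriptive User-Agent header, instead of scraping the HTML page. The User-Agent string below is a placeholder, not a registered tool name.

    import requests

    url = 'https://www.wikidata.org/wiki/Special:EntityData/Q22686.json'
    headers = {'User-Agent': 'MigrationResearch/0.1 (mailto:example@example.org)'}  # placeholder contact
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    entity = resp.json()['entities']['Q22686']
    print(entity['labels']['en']['value'])  # the item's English label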

TidoniBot[edit]

TidoniBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Tidoni (talk • contribs • logs)

Task/s: Import birth dates and death dates from identifiers already available on the item, for example Library of Congress authority ID (P244) or Elite Prospects player ID (P2481).

Code:

Function details: Check all entries containing P2481, check the corresponding website and, if a birth date or death date is available, add it.

Reliable Sources:

  • Library of Congress authority ID (P244)
  • IAAF athlete ID (P1146)

Unreliable Sources:

  • Find A Grave memorial ID (P535)

Reliability to be discussed:

  • Elite Prospects player ID (P2481)
  • IMDb ID (P345)
  • Soccerway player ID (P2369)
  • WorldFootball.net player ID (P2020)

Examples: [9] [10] [11]

--Tidoni (talk) 20:07, 30 August 2019 (UTC)
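A hedged sketch of how the items to process could be selected for the Elite Prospects case described above: every item with an Elite Prospects player ID (P2481) but no date of birth (P569). The query and the pywikibot generator are illustrative, not the operator's actual code.

    import pywikibot
    from pywikibot import pagegenerators

    QUERY = """
    SELECT ?item WHERE {
      ?item wdt:P2481 ?eliteProspectsId .
      FILTER NOT EXISTS { ?item wdt:P569 ?dateOfBirth . }
    }
    LIMIT 50
    """

    repo = pywikibot.Site('wikidata', 'wikidata').data_repository()
    for item in pagegenerators.WikidataSPARQLPageGenerator(QUERY, site=repo):
        print(item.id)  # each item would then be checked against the external site before editing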

Info: Please note Wikidata:Administrators' noticeboard#Unapproved TidoniBot adding erroneous dates. --Succu (talk) 21:24, 30 August 2019 (UTC)
Considering the bot's proven history of being unable to determine whether the Gregorian or Julian calendar is used, I suggest the bot be forbidden from adding any date before 15 February 1923, the date the last country, Greece, switched from the Julian to the Gregorian calendar. (Other countries switched later, but from calendars that are very unlikely to be mistaken for Gregorian, such as Islamic calendars.) I also suggest the quality of each source be considered, and that the approval be only for sources specifically approved in this request.
A further condition should be manual reversion of all edits already made by the bot that cannot be substantiated with reliable sources. Jc3s5h (talk) 22:29, 30 August 2019 (UTC)
w:Wikipedia:Reliable sources/Perennial sources contains a list of sources which have been extensively discussed at the English Wikipedia with respect to reliability. There is an entry for IMDb; the summary result, in the form of an icon, is that the source is generally unreliable. The main discussion is located at w:Wikipedia:Reliable sources/Noticeboard/Archive 267#RfC: IMDb. A point I consider particularly significant is that IMDb has been found to contain material copied from Wikipedia; using it would create circular referencing. Jc3s5h (talk) 12:46, 31 August 2019 (UTC)
Comment. I unblocked the bot so that it can now perform test edits.--Ymblanter (talk) 18:48, 31 August 2019 (UTC)
@Jc3s5h: Did the test edits yield any comments from your side? In particular, does the Julian/Gregorian problem seem to be tackled correctly? Lymantria (talk) 06:22, 11 September 2019 (UTC)
I only reviewed edits until I found an error; I did not attempt to find all the errors in the test edits. In the edits to Alexander Borodin (Q164004) the unreliable source Find A Grave is used to assert that Borodin was born 13 October 1833 Gregorian and died 15 February 1887 Gregorian. But according to Encyclopedia Britannica these are Julian calendar dates. The quotation from Britannica is "Aleksandr Borodin, in full Aleksandr Porfiryevich Borodin, (born Oct. 31 [Nov. 12, New Style], 1833, St. Petersburg, Russia—died Feb. 15 [Feb. 27], 1887, St. Petersburg)".
Therefore the bot fails on two counts: using an unreliable source and misinterpreting the date contained in the source. Jc3s5h (talk) 16:54, 11 September 2019 (UTC)
  • I think the approach discussed with Andrew Gray in the now archived AN discussion should work. We would have four types of dates: (1) safe ones (from when all had Gregorian), (2) safe ones (from before Gregorian), (3) assumed Gregorian (between the two), except for: (4) assumed Julian (date before the conversion to Gregorian in a country likely relevant for the person). --- Jura 18:31, 11 September 2019 (UTC)
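A small Python sketch of that four-way classification, as one reading of the proposal above; the 1582 and 1923 cutoffs and the per-country switch date are illustrative assumptions (the switch date could eventually come from Gregorian calendar start date (P7295), mentioned later in this discussion).

    from datetime import date

    GREGORIAN_FIRST_ADOPTED = date(1582, 10, 15)  # earliest Gregorian adoption (assumed cutoff)
    LAST_EUROPEAN_SWITCH = date(1923, 2, 15)      # date used in this discussion for Greece

    def classify(d, local_switch=None):
        """Classify a source date into the four groups proposed above."""
        if d >= LAST_EUROPEAN_SWITCH:
            return 'safe: Gregorian everywhere'
        if d < GREGORIAN_FIRST_ADOPTED:
            return 'safe: Julian everywhere'
        if local_switch is not None and d < local_switch:
            return 'assumed Julian (relevant country had not yet switched)'
        return 'assumed Gregorian'

    # Example: a date in 1887 Russia, assuming a switch date of 14 February 1918
    print(classify(date(1887, 2, 15), local_switch=date(1918, 2, 14)))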
One criterion I use in evaluating bot behavior is whether the bot does exactly what the request for approval says it will do. The current version of the request for approval acknowledges Find A Grave as an unreliable source. The bot added statements from that source anyway. Therefore the bot is defective. The fact that the edits were erroneous just makes it more badly broken. Jc3s5h (talk) 19:03, 11 September 2019 (UTC)
The source is available for everyone to evaluate and clearly indicated .. The edit as such seems to copy the data accurately. --- Jura 19:09, 11 September 2019 (UTC)
The bot adds false information to the dates stated in Find A Grave. The source gives dates with no explicit calendar; implicitly, since the events occurred in St. Petersburg, Russia, during a period when the Julian calendar was in effect, the best interpretation of the source is that they are Julian calendar dates. We do not copy sources, we read sources. We understand the context of statements in a source and interpret the statements in context. A bot should be confined to a narrow domain where mere copying will result in correct statements. Since this bot is not so confined, approval should be denied.
@Tidoni: Please, comment. Lymantria (talk) 05:27, 12 September 2019 (UTC)
The bot in its current state only adds dates after 1923, so only Gregorian dates are added. So from the technical side, this should work now. I am not sure whether I should start adding dates from Find A Grave memorial ID after 1923, or whether these edits are not wanted because of the unreliability of the source. In my understanding every piece of information should be added, but if a better one is available, it should be marked as deprecated. --Tidoni (talk) 11:38, 12 September 2019 (UTC)
I think "In my understanding every piece of information should be added, but if a better one is available, it should be marked as deprecated" is completely wrong. It should not be a goal to find and add every source that verifies a statement, only enough good sources to be confident the statement is correct. There could also be merit in adding a free on-line source when a good book is already cited, so people will not have to go to the library to verify the statement. Marking statements and sources as deprecated would only be appropriate if an erroneous statement is a wide-spread misconception that needs to be publicly repudiated. Jc3s5h (talk) 17:24, 12 September 2019 (UTC)
@Tidoni: Please perform a second batch of test edits, showing no dates before 15 February 1923 (unlike this one) and sticking to what you have mentioned as reliable sources. Lymantria (talk) 05:40, 18 September 2019 (UTC)
  • Personally, I think we should include it if there is no other day-precision data available. As for dates before 1923, I'd apply the three groups outlined above. Gregorian calendar start date (P7295) is still a work in progress, but it should allow some checks to be done. BTW, I don't think any source has guaranteed reliability. We wouldn't need Wikidata if that were so. --- Jura 07:47, 18 September 2019 (UTC)
@Lymantria: From which of the sources should I try a test run? Library of Congress authority ID and Find A Grave memorial ID were said to be unreliable and shouldn't be added. The other ones weren't discussed. Should I add examples from the other sources? --Tidoni (talk) 09:07, 18 September 2019 (UTC)
@Tidoni: I was not aware that Library of Congress authority ID was considered unreliable? Not above here at least. Neither is IAAF athlete ID. I'd consider doing a test run on these two. Lymantria (talk) 09:14, 18 September 2019 (UTC)
Some of the LOC entries have the same source as Finda.. : Wikipedia. --- Jura 21:07, 19 September 2019 (UTC)
Still, does that make LOC unreliable to extract the birth and death dates from? Lymantria (talk) 09:51, 25 September 2019 (UTC)
I think reliability varies from reference to reference in relation to a statement. Find A Grave has the advantage that it generally reproduces the primary source used, something that is obviously superior to the use of secondary or tertiary references. Obviously, even a primary source can be wrong, but we have ranks to indicate that. From my personal experience, I came across more incorrect or doubtful statements from BdF than from the other two, but this is probably due to the high number we have from them. None listed the primary or secondary reference used. BTW, none of the statements about reliability above have references and as such could be seen as defamatory. I think we'd better refrain from making such generalizing claims. --- Jura 10:28, 26 September 2019 (UTC)
Articles in Wikipedia and items in Wikidata require reliable sources. Discussion pages do not require reliable sources. Discussion of the reliability of sources is essential. Any publication that indiscriminately uses Wikipedia or Wikidata as a source is itself unreliable. The only situation where I would accept the use of Wikipedia or Wikidata as a source in an outside publication is if the author is an expert in the field, references a specific version of a Wikimedia article or item, and explains why it's correct in that specific situation. Jc3s5h (talk) 20:31, 27 September 2019 (UTC)
Ideally, a primary source that is mentioned in Find a Grave should be cited directly, perhaps being described as "as quoted in" the Find a Grave item. For birth and death dates, primary and secondary sources both have their place. For older items, it may be necessary to browse through the primary source from a time when dates were certainly Julian, locate the discontinuity associated with the changeover, and also take note of the date the year is incremented. Some sources will not be extensive or consistent enough to do this (maybe each gravestone carver does his own thing).
A good secondary source will have worked through all the calendar confusion and make it crystal clear what calendar system is being used to state the dates. Jc3s5h (talk) 20:39, 27 September 2019 (UTC)

antoine2711bot[edit]

antoine2711bot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Antoine2711 (talk • contribs • logs)

Task/s: This robot will add data in the context of Digital discoverability project of the RDIFQ (Q62382524).

Code: It's work done with OpenRefine, maybe a bit of QuickStatements, and maybe some API calls from Google Sheets.

Function details: Transfer data for 280 movies from an association of distributors.

--Antoine2711 (talk) 04:25, 2 July 2019 (UTC)

@Antoine2711: Is your request still supposed to be active? Do you have test/example edits? Lymantria (talk) 07:18, 17 August 2019 (UTC)
@Lymantria: No, it's mainly batch operations that I do myself. There is nothing automated yet, and there won't be for the project I've been working on for the last 9 months. --Antoine2711 (talk) 05:15, 2 March 2020 (UTC)
  • @Antoine2711, Lymantria: It seems to be unfinished, many items created are somewhat empty and not used by any other item: [12]. @‎Nomen ad hoc: had listed one of them for deletion. If the others aren't used or completed either, I think we should delete them. Other than that: lots of good additions. --- Jura 12:57, 8 September 2019 (UTC)
@Lymantria, Jura: The data I added all comes from a clean data set provided by distributors. I tried to do my best, but I might not have done everything perfectly. Someone spotted an empty item, and I added the missing data. If there are any others, I will make the same corrections.
My request for the bot is still pertinent as I will do other additions. What information do I need to provide for my bot request? --Antoine2711 (talk) 16:44, 8 September 2019 (UTC)
@Jura1: Sorry for not responding earlier. These people are team members on a movie, and I needed to create a statement with P3092 and also a qualifier, object has role (P3831); in the case of Ronald Fahm (Q65116570), he's a hairdresser (Q55187). Ideally, I should be able to also push that into the description of this person. But I must be careful. I created around 1,500 persons, and I might have 200 still not connected. Do you see anything else? --Antoine2711 (talk) 03:37, 25 February 2020 (UTC)
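For clarity, a hedged pywikibot sketch of the statement shape being described (the actual workflow uses OpenRefine and QuickStatements, not this code): a P3092 statement with an object has role (P3831) qualifier. Q4115189, the Wikidata sandbox, stands in for the film item; the other items are taken from the example above.

    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()
    film = pywikibot.ItemPage(repo, 'Q4115189')             # sandbox item standing in for the film

    claim = pywikibot.Claim(repo, 'P3092')                  # film crew member
    claim.setTarget(pywikibot.ItemPage(repo, 'Q65116570'))  # Ronald Fahm, from the example above
    film.addClaim(claim, summary='Add crew member')

    role = pywikibot.Claim(repo, 'P3831')                   # object has role
    role.setTarget(pywikibot.ItemPage(repo, 'Q55187'))      # hairdresser, from the example above
    claim.addQualifier(role, summary='State the crew member role')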
Yesterday I looked into this request again and noticed that the problems I had identified 5 months ago were still not fixed. If you need help finding all of them, I could do so. --- Jura 08:46, 25 February 2020 (UTC)
@Jura1: Yes, anything you see that I didn't do well, tell me, and I'll correct it. If I create 500 items and I make just 1% errors, that's still 5 bad item creations. So even if I'm careful, I'm still learning and making mistakes. I try to correct them as fast as I can, and if you can help me pinpoint problems, I'll fix them, like I did with everyone here. If you have SPARQL queries (or other ways of finding lots of data), let me know and don't hesitate to share them with me. Regards, Antoine --Antoine2711 (talk) 06:37, 2 March 2020 (UTC)
  • There's a deletion request for one of these items at Wikidata:Requests for deletions#Q65119761. I've mentioned a likely identifier for that one. Instead of creating empty items it would be better to find identifiers and links between items before creating them. For example Peter James (Q65115398) could be any of 50 people listed on IMDB - possibly nm6530075, the actor in Nuts, Nothing and Nobody (Q65055294)/tt3763316 but the items haven't been linked and they are possibly not notable enough for Wikidata. Other names in the credits there include Élise de Blois (Q65115717) (probably the same person as the Wikidata item) and Frédéric Lavigne (Q65115798) (possibly the same one but I'm not certain) and several with no item so I'm not sure if this is the source of these names. With less common names there could be one item that is then assumed to be another person with the same name. Peter James (talk) 18:29, 9 September 2019 (UTC)
Hi @Peter James: I think that most of these are now linked. For a few hundred, I still need to add the occupation. I also think that I'm going to state, for the ones with little information, the movies on which they worked; that might help to identify those persons. I've also created links to given name and surname, but that doesn't help much for identification. Note that I added a lot of IMDb IDs, and those are excellent. Do you have suggestions for me? Regards, Antoine --Antoine2711 (talk) 04:53, 2 March 2020 (UTC)
  • I suggest we block this bot until we see a plan of action for cleanup of the problems already identified. --- Jura 09:44, 24 February 2020 (UTC)
Cleanup seems to be ongoing. --- Jura 09:36, 25 February 2020 (UTC)
  • Comment: I fixed writing system (P282) on several hundred family name items created yesterday, e.g. at [13]. --- Jura 09:42, 1 March 2020 (UTC)
I didn't know that "alphabet latin" (Latin alphabet (Q41670)) and "alphabet latin" (Latin script (Q8229)) were actually two different things. Thank you for pointing that out. --Antoine2711 (talk) 04:25, 2 March 2020 (UTC)
  • When doing that, I came across a few "last" names that aren't actually family names, e.g. H. Vila (for Rodrigo H. Vila), and listed them for deletion. You might want to double-check all the others. --- Jura 10:02, 1 March 2020 (UTC)
Yes, thanks for pointing that out. I'm also cleaning that up. --Antoine2711 (talk) 04:25, 2 March 2020 (UTC)
@Antoine2711: you seem to be running an unauthorized bot that is doing weird edits. Please explain. Multichill (talk) 15:46, 7 March 2020 (UTC)
@Multichill: Hi, yes, I did see those errors. I was cleaning that up yesterday and will continue today. This was an edit batch with a 3% error rate, on a lot of 3,000. Even if 3% is not a lot, at those quantities I must be very careful. Unfortunately, I'm still learning. Please note that everything this bot does is supervised and launched by a human decision (which may be imperfect…). Regards, Antoine --Antoine2711 (talk) 17:12, 7 March 2020 (UTC)

CoRepoBot[edit]

CoRepoBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Rubenwolff (talk • contribs • logs)

Task/s: Insertion and updating of metadata collected from online companies. Data is taken from the respective homepages or news articles. Columns include: company name, company website, founding date, legal address, government company ID, founders, number of employees.

Code:

Function details: --Rubenwolff (talk) 16:22, 18 March 2019 (UTC) The Open Company Repository has the mission to become the universal search engine for companies. In the process we have collected not only descriptive text but also structured information like company name, company website, founding date, legal address, government company ID, founders and number of employees. We aggregate many sources and cross-reference the information, filtering out any companies with inconsistencies. For newer companies we also attempt to directly contact the founders and verify the structured information with them.

Update:

  • For startups we have added funding information.
  • For data validation we have added WHOIS and SSL (EV/OV) certificate information. The WHOIS system seems to be defunct; it is almost entirely full of anonymized addresses. But EV/OV certificates are high-quality information sources for company_name, country and city.
  • Legal address has been surprisingly difficult to parse out of HTML. People are very non-uniform in how they format addresses, even if you just take one country and try to build a regex for that. We are looking into NLP-based solutions now.
  • Because of the address difficulty, the government ID effort was also stalled. Once the SSL crawlers have completed, we will restart the work to integrate with government records, and then we could consider these companies to be of very high data quality.

The bot should be able to continuously update Wikidata with new companies and attributes of those company entities.

Discussion:

  • Thanks for creating this request. I think it would be great to have this data inside Wikidata. Can you specify here how many items you want to create? For how many entries do you have information about the number of employees?
Is this about companies located anywhere or only in the US? Can you correspondingly say more about what's in the government ID field and how you expect that to be modelled in Wikidata?
When it comes to founders, that's interesting information, but I would expect that you only have their names; is that accurate? Maybe we can model that with 'unknown value' and 'stated as'. ChristianKl❫ 18:29, 18 March 2019 (UTC)
  1. How much I want to insert: more than 1,000,000 but fewer than 10,000,000 companies. I would suggest we start with a test set of 100 companies, then insert the confidently non-small companies, defined as >=30 backlinks and >50 employees, which would be 100,367 companies. Then I would say we should define additional requirements on the age of small companies before inserting them. There are companies which do great work and have global impact but stay below 50 employees and have little online presence. Here I would look at verification of the age of the company, but it warrants further community discussion. Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
  2. Currently the counts for companies which have employee information are as follows (but these numbers change daily; sometimes they go up because the crawlers found new companies, sometimes they go down because we added a new data quality filter). Count for companies with 1-50 employees: 2,302,172; 51-200 employees: 305,004; 201-500 employees: 89,283; 501-1000 employees: 42,372; 1001-5000 employees: 7,476; 5001-10000 employees: 8,907; and >=10001 employees: 7,476. Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
  3. These companies are located anywhere, but my crawlers start in the English web, so it is indeed a lot of US companies. There are 1,553,420 US companies compared to 1,195,446 non-US, non-empty companies. Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
  4. The government ID field is their registration number in their corresponding country. I have started this work with UK companies because the UK government provides a nice API: https://developer.companieshouse.gov.uk/api/docs/. The IDs are unique per country or country+province. For example, gtn.ai is named "GTN LTD" with UK government ID "10775593", hence I would give it the ID "gb/10775593". For other countries like Germany or the US, the government IDs are only unique per region; for example, gtn-online.de is named "Geothermie Neubrandenburg GmbH" and has the German government ID "Neubrandenburg HRB 1249". The cool guys at OpenCorporates have put a lot of thought into the unification of these IDs, so I will coordinate with them. OpenCorporates is also the only copyleft source for many company registries (for example, the German Handelsregister I can only get from them). Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
  5. For the founders I indeed do not have Wikidata QIDs. In some cases I have Twitter or LinkedIn accounts, but I doubt this helps. What additional information would be required to disambiguate against existing person QIDs? Rubenwolff (talk) 11:12, 19 March 2019 (UTC)
It would be worthwhile to try to match the founders with Twitter/LinkedIn accounts, given that those are sometimes available on Wikidata. When only the name of the founder is known but not any ID, I would advocate saving the data as "unknown value" with the qualifier stated as (P1932). ChristianKl❫ 17:28, 21 March 2019 (UTC)
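A minimal pywikibot sketch of that modelling suggestion, as an assumption of how it could look rather than an agreed workflow: founded by (P112) with an "unknown value" snak and the name kept in a stated as (P1932) qualifier. Q4115189 (the Wikidata sandbox) and the name are placeholders.

    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()
    item = pywikibot.ItemPage(repo, 'Q4115189')  # sandbox item as a safe stand-in

    claim = pywikibot.Claim(repo, 'P112')        # founded by
    claim.setSnakType('somevalue')               # rendered as "unknown value" in the UI
    item.addClaim(claim, summary='Record founder as unknown value')

    qualifier = pywikibot.Claim(repo, 'P1932')   # stated as
    qualifier.setTarget('Jane Doe')              # placeholder: the name as given in the source
    claim.addQualifier(qualifier, summary='Name under which the founder is stated')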

Franzsimon, Kopiersperre, Jklamo, ArthurPSmith, S.K., Givegivetake, fnielsen, rjlabs, ChristianKl, Vladimir Alexiev, User:Pintoch, Parikan, User:Cardinha00, User:zuphilip, MB-one, User:Simonmarch, User:Jneubert, Mathieudu68, User:Kippelboy, User:Datawiki30, User:PKM, User:RollTide882071, Kristbaum, Andber08, Sidpark, SilentSpike, Susanna Ånäs (Susannaanas), User:Johanricher, User:Celead


Notified participants of WikiProject Companies. ChristianKl❫ 11:40, 20 March 2019 (UTC)

@Rubenwolff: are you proposing to only create new items, or also to update existing items about companies? How are you matching your companies to Wikidata? I would suggest to first start with the companies that already exist on Wikidata. What tools are you using for this upload? I think it is important to make the code/workflow open. − Pintoch (talk) 13:24, 20 March 2019 (UTC)
    • Yeah, I am new to Wikidata, so no idea what tools. I was hoping to get suggestions from the community (going to the next London meetup). Rubenwolff (talk)
  • In general I'm glad somebody's working on something like this, but a test needs to be run and reviewed. The test sample should include a range of companies, some of which have entries already in Wikidata, and some of which don't. It should demonstrate how you plan to handle parent/subsidiary/business division relations, etc. ArthurPSmith (talk) 14:49, 20 March 2019 (UTC)
  • @Rubenwolff: A few questions about the dataset itself:
  • Is there, in the dataset, a persistent identifier for each company? Could that identifier be introduced as a Wikidata external-id property and used for Mix-n-match with existing companies?
  • Is there an intellectual curation process within your database to eliminate duplicates from crawling (e.g., one company with multiple homepages/domain names)?
  • What is the license for the dataset? (Could you perhaps link to an according web page?)

-- Jneubert (talk) 15:25, 20 March 2019 (UTC)

      • My data is oriented around domain names. If a company changes their domain name, they will 302 it to their new domain. Additionally, if a smaller company gets acquired by a larger one, they also 302 to the new owner. In the case of any redirect, the destination is considered the new truth and the old company/domain is deleted. (I am working on storing this relationship graph explicitly.)
      • I am running a trial asking companies to review their profile, but with the current response rate I don't think it would scale. So probably not by humans. But we can create requirements that any piece of information that is not from a company homepage must have at least 2 citations, for example.
      • Undecided about the license; I am considering GPL or MIT.
  • @Rubenwolff: Great project.
    • In the case of UK companies, would your bot populate Companies House ID (P2622) in addition to the ID in the format "gb/10775593" you mentioned in your example? - PKM (talk) 20:07, 20 March 2019 (UTC)
      • Oh cool, you already have this property; yeah, I'll put it on the task list. Rubenwolff (talk)
  • I looked at an example of a large corporation in Germany: BMW (Q26678). The data in CoRepo about it seems weak: a) the name is not BMW Motorcycles, b) "more than 10,000 employees" seems rough; it is stated to be more than 120,000. I am skeptical about importing such data into Wikidata. --Zuphilip (talk) 18:45, 24 March 2019 (UTC)
      • We have categories because the employee count fluctuates and for non-public companies it is never a precise number. It is common to stop at 10,000+ because there are very few companies that have more than 100,000. So yeah, this should be an ENUM, not an INT. Rubenwolff (talk)
  • What would updates look like? Presumably you would add a new statement with a different "point in time" qualifier? You wrote "the destination is considered the new truth and the old company/domain is deleted": how would this look in Wikidata, as we don't delete the "old truth"? --- Jura 14:26, 4 July 2019 (UTC)

EpiskoBot 2[edit]

EpiskoBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Looperz (talk • contribs • logs)

Task/s:

Code:

Function details:

--Looperz (talk) 18:55, 26 June 2019 (UTC)

  • Yeah, please do that. I fixed it on Q450675. --- Jura 03:59, 29 June 2019 (UTC)
  • For the moment I am waiting for a decision, because this question affects about 34,000 pages, often with more than one claim for consecrator (P1598). --Looperz (talk) 10:14, 30 June 2019 (UTC)
    • I think you should stop misusing the qualifier. You probably added some 8000 since the problem was mentioned to you. If you think your approach is correct, you might want to seek additional feedback on project chat. The bot approval is more technical in nature. Personally, I tend to oppose additional requests by users who are known not to cleanup their bot tasks. --- Jura 10:18, 30 June 2019 (UTC)
Thank you for your opinion, Jura. Where should I go to get other opinions? Since it is "just" one single opinion, this is no reason for a full edit stop or change. --Looperz (talk) 23:56, 30 June 2019 (UTC)
I know it's just your single opinion, but even so I don't think you bring much to support it either. Project chat is at Wikidata:Project chat. --- Jura 00:28, 1 July 2019 (UTC)
Your proposal to use object has role (P3831) is just a single opinion, too. subject has role (P2868) is at least an auto-suggested qualifier. The sentence "I tend to oppose additional requests by users who are known not to cleanup their bot tasks" is an accusation I have to contradict. I have offered to make the change, even by mass edit, as soon as there is a common decision on that subject-object confusion. --Looperz (talk) 03:40, 1 July 2019 (UTC)
Well, I noticed you ignored Ahoerstemeier's opinion, and the autosuggestion might just come from your bad edits. --- Jura 08:09, 1 July 2019 (UTC)

Meanwhile, object has role (P3831) has the majority, and I am really looking forward to a change of that autosuggestion thing :-) --Looperz (talk) 13:10, 12 July 2019 (UTC)

EbeBot[edit]

EbeBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Ebe123 (talk • contribs • logs)

Task/s:
Code:
Function details:

Takes election data from Elections Canada and the Library of Parliament that I process in Excel to create new items or add candidacy in election (P3602) statements. Could the flooder flag be provided for the 3rd point? Ebe123 (talk | contributions) 20:39, 6 April 2019 (UTC)

  • I don't think P4100 should be used for the list a candidate is included on. Parliamentary groups would be the ones that are formed once a candidate is actually elected. Most of the time this would match the list one, but it needn't. Neither is the parliamentary group necessarily identical to the political party. --- Jura 04:10, 7 April 2019 (UTC)
    I don't understand your point about lists or about parliamentary groups, but most candidates run as a member of a party (parliamentary group). Which party they run as is what is recorded, and it is identical to the political party... Ebe123 (talk | contributions) 12:29, 7 April 2019 (UTC)
    Most, but you are amalgamating three different things. Is the Canadian term for "parliamentary group" "caucus"? --- Jura 13:00, 7 April 2019 (UTC)
    Yes. Ebe123 (talk | contributions) 13:15, 7 April 2019 (UTC)
    News about the Trudeau/Wilson-Raybould/Philpott story is sometimes a bit sketchy, but isn't the expulsion from the caucus, not the party? --- Jura 14:42, 7 April 2019 (UTC)
    They ran as Liberals, which would be the value used for P4100, but they can't run as Liberals in the next election. Ebe123 (talk | contributions) 20:46, 7 April 2019 (UTC)
    @Jura1:, are you satisfied with the answers?--Ymblanter (talk) 19:04, 10 April 2019 (UTC)
    I still think the planned property use is incorrectly combining different aspects. To me, it seems clear from the Canadian sample. In countries where there are more than 2 or 3 parties, parliamentary groups generally combine several parties, but members still get elected on a list for a specific party. --- Jura 07:14, 13 April 2019 (UTC)
    You're talking about coalitions, which are formed after the election and use political coalition (P5832). What property do you think better represents the party a candidate/nominee runs with? Ebe123 (talk | contributions) 13:45, 13 April 2019 (UTC)
  • Support. -SixTwoEight (talk) 00:46, 10 April 2019 (UTC)
  • @Rhadamante‎, Serpicozaure: who seem to be working on parliamentary groups. --- Jura 08:25, 13 April 2019 (UTC)
    I agree with Jura, P4100 is not the thing to use here. Parliamentary groups, parties, coalitions and electoral lists are different things, even for countries with a Westminster system.
    • Parliamentary groups are exclusively for elected people sitting together in a parliamentary assembly. Most of the time they are from the same party, but sometimes not: one can have been elected as an independent, or from an allied party that does not have the minimum required number of elected members to form its own group (for this case, I think what is going on in the local parliament in Québec illustrates the concept pretty well).
    • Coalitions can take different forms. It can be an electoral coalition of parties making a common electoral list together while still being separate; e.g. in Spain, that was the case for Unidos Podemos (Q24039754), in Greece for SYRIZA (Q222897) (before they transformed themselves into a party); in France, for the upcoming European election, there is even a coalition Socialist Party (Q170972)/Place publique (Q58366009)/New Deal (Q15629523) saying that their future deputies will decide individually in which parliamentary group they will sit... It can also be an alliance afterwards, of political parties elected separately, sometimes with the clear goal of governing together in the end (that was at some point the case with the LibDems/Conservatives in the UK or the CDU/FDP in Germany in the noughties), sometimes not (for instance the current coalition Conservative Party (Q9626)/Conservative Party (Q9626) in the UK, or Five Star Movement (Q47817)/Lega Nord (Q47750) in Italy, not to mention Greece, Spain, Portugal, Germany, the Netherlands, Austria, Belgium...).
    But back to the subject: no, P4100 cannot be used for electoral lists. Either use the parties, coalitions, custom items, or even nothing, but not P4100. Rhadamante (talk) 20:24, 13 April 2019 (UTC)
    I don't quite understand your point; here is an example of how I use (or would use) the property: (Justin Trudeau (Q3099714))
candidacy in election: 2015 Canadian federal election (normal rank)
  votes received: 26,391
  electoral district: Papineau
  parliamentary group: Liberal Party of Canada
  (0 references)
  • Canada does not have an electoral list system (unless you count one candidate per party per riding as a list). How would you represent (with which property) the relationship between party and candidate, together with a riding, for their candidacy in an election? Ebe123 (talk | contributions) 00:27, 14 April 2019 (UTC)
    Let me unpack your point about Québec (even if I am not very familiar with it): if you are talking about the 2018 Quebec general election (Q17001196), the Coalition Avenir Québec (Q2348226) is a party and not a coalition of several parties (it is provincial, not local). Moreover, the minimum number of elected members for a party to be considered "official" is a federal rule (12, so Parti Québécois (Q950356) would not qualify in that parliament), but that is not very relevant, since I use the parties registered with the elections authority, which has no success criteria. It would be revisionism to change the details of a candidacy because of how the parliament is formed afterwards.
    I had changed the property from represents (P1268) to parliamentary group (P4100) because I found there would be less confusion. Do you think P1268 is more appropriate? Ebe123 (talk | contributions) 01:03, 14 April 2019 (UTC)
    I never spoke about the CAQ. My point about coalitions had no other purpose than to address your misuse of the term in the discussion above. And concerning Québec politics, I was referring in particular to Catherine Fournier leaving the PQ, which should normally have led to the disappearance of her parliamentary group. But never mind. The example given above is a perfect example of a misinterpretation: the Liberal Party of Canada (Q138345) is not a parliamentary group. Like the other parties, it must never be used to fill P4100. And in any case, as I said above, it makes no sense to use P4100 to indicate the affiliation of a candidate in an election to their party, or at least to the party whose nomination they obtained. P4100 is there to record in which parliamentary group an elected member sits during the legislature for which they were elected, which may moreover change, cf. the case of Catherine Fournier. And then, following that logic, how would you qualify Jean-Martin Aussant or the two QS members elected in 2012, since there was never an ON or QS parliamentary group in the legislature that followed?
    represents (P1268) does not seem adequate to me either. An elected member represents the population of the territory (or the "territory", whatever that means, in the case of senators, for example in France or the United States) where they are elected, not their party. Why not simply use member of political party (P102)?
    Rhadamante (talk) 04:50, 14 April 2019 (UTC)
    member of political party (P102) works for me; it is just that it was not permitted with candidacy in election (P3602) (I will add it). Apart from that, would there be a situation where parliamentary group (P4100) is compatible with candidacy in election (P3602)? Ebe123 (talk | contributions) 18:55, 14 April 2019 (UTC)
    I doubt it. It would be nice if we had a summary of the variants and the applicable local terminology. Maybe the IPU can help us. --- Jura 19:05, 14 April 2019 (UTC)
    Absolutely never. A parliamentary group is part of the internal organisation of a parliament, hence of people who are already elected. It has nothing to do with the election. Rhadamante (talk) 22:03, 14 April 2019 (UTC)
    That is what I thought. I have added the constraint against the use of this property.
  • I have put back the third point, about the transfer of qualifiers (??). Are you satisfied with my bot request? Ebe123 (talk | contributions) 00:32, 20 April 2019 (UTC)
    @Jura1:@Rhadamante:--Ymblanter (talk) 18:13, 5 May 2019 (UTC)
    • How about a couple of test edits? --- Jura 18:32, 5 May 2019 (UTC)
  • I don't think that the national election should be the target of candidacy in election (P3602). Each candidate runs in an election in their own riding. Each riding's election should have its own item, and those items should be the targets of candidacy in election (P3602). --Yair rand (talk) 23:59, 8 May 2019 (UTC)
    Each riding only forms a part of the full election, and so electoral district (P768) represents well what part of the election has been contested by the person. Ebe123 (talk | contributions) 03:40, 13 May 2019 (UTC)
    Each riding's election has a distinct electorate (P1831), a number of ballots cast (P1868), a set of candidates, a successful candidate (P991), a number of valid votes, and so on. I think it's quite clear that to store the relevant information, each election which is part of the broader election requires its own item, which should be the target of candidature statements. --Yair rand (talk) 04:55, 13 May 2019 (UTC)
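A minimal pywikibot sketch of the qualifier structure this discussion converges on: candidacy in election (P3602) qualified with electoral district (P768), votes received (P1111) and member of political party (P102) rather than parliamentary group (P4100). The target items below are assumptions for illustration only: Q4115189 is the Wikidata sandbox standing in for the candidate, the election item and the riding, while Q138345 (Liberal Party of Canada) is taken from the example above.

    import pywikibot

    repo = pywikibot.Site('wikidata', 'wikidata').data_repository()
    person = pywikibot.ItemPage(repo, 'Q4115189')               # sandbox item standing in for the candidate

    # Main statement: candidacy in election (P3602)
    claim = pywikibot.Claim(repo, 'P3602')
    claim.setTarget(pywikibot.ItemPage(repo, 'Q4115189'))       # placeholder for the (riding-level) election item
    person.addClaim(claim, summary='Add candidacy in election')

    # Qualifiers as discussed: district, vote count, and party (P102) instead of P4100
    qualifiers = [
        ('P768', pywikibot.ItemPage(repo, 'Q4115189')),         # placeholder for the electoral district item
        ('P1111', pywikibot.WbQuantity(26391, site=repo)),      # votes received, from the example above
        ('P102', pywikibot.ItemPage(repo, 'Q138345')),          # member of political party: Liberal Party of Canada
    ]
    for pid, target in qualifiers:
        qualifier = pywikibot.Claim(repo, pid)
        qualifier.setTarget(target)
        claim.addQualifier(qualifier)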

NMBot[edit]

NMBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Notme1560 (talk • contribs • logs)

Task/s: Remove the brackets indicating translation from around the English labels and titles of scholarly articles.

Code: A GitLab repository has the scripts and queries to be copied into a Pywikibot installation.

Function details:

Requested December 2018.

Selects scholarly article (Q13442814) items with a PubMed ID (P698) claim and an English label starting with "[" and ending with "]". The English label is converted from "[XXX]." to "XXX.".

If no title claim exists, it currently exits (uses the content of the old title to build the new title) but this can be refactored later. If multiple title claims exist, it also exits (doesn't handle deprecating multiple previous claims) but this can also be refactored later. Otherwise, the existing title claim is set to deprecated and a new claim with the correct format is added. --Notme1560 (talk) 20:33, 23 March 2019 (UTC)
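A condensed Python/pywikibot sketch of the transformation described above (the actual scripts live in the GitLab repository; the regular expression and the single-claim check here are illustrative):

    import re
    import pywikibot

    BRACKETED = re.compile(r'^\[(.+)\]\.$')

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    def fix_item(qid):
        item = pywikibot.ItemPage(repo, qid)
        item.get()
        label = item.labels.get('en', '')
        if not BRACKETED.match(label):
            return
        item.editLabels({'en': BRACKETED.sub(r'\1.', label)},
                        summary='Remove translation brackets from English label')
        title_claims = item.claims.get('P1476', [])
        if len(title_claims) != 1:
            return  # no title or multiple titles: skipped for now, as described above
        old = title_claims[0]
        old_title = old.getTarget()                      # WbMonolingualText
        old.changeRank('deprecated')
        new = pywikibot.Claim(repo, 'P1476')             # title
        new.setTarget(pywikibot.WbMonolingualText(BRACKETED.sub(r'\1.', old_title.text),
                                                  old_title.language))
        item.addClaim(new, summary='Add title without translation brackets')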

How do you want to reflect a) that this is a translation and b) from which language? What about title (P1476)? --Succu (talk) 20:52, 23 March 2019 (UTC)
All these articles are from PubMed, but I'm not sure who imported them or when they were imported. The bracketed titles in PubMed are supposed to indicate that the displayed title has been translated to English, but this bot doesn't have access to the untranslated/original titles, since it doesn't pull from PubMed's API (it only edits existing items). The original language/title information should be on the PubMed page accessible through the PubMed ID (P698) claim, which links to the site: e.g. "[Article in Portuguese]" is shown on the site, so it can be seen there. Other than that, I'm not sure. --Notme1560 (talk) 21:11, 23 March 2019 (UTC)
I know. Hence Symbol oppose vote.svg Oppose --Succu (talk) 21:58, 23 March 2019 (UTC)
The original title and language can be retrieved from the XML version. In this case <Language>por</Language> <VernacularTitle>Estudo caso-controle com resposta multinomial: uma proposta de análise.</VernacularTitle>. Emijrp (talk) 09:22, 24 March 2019 (UTC)
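A short sketch of retrieving that XML through NCBI's E-utilities and reading the two elements mentioned above (the PMID below is a placeholder):

    import xml.etree.ElementTree as ET
    import requests

    EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    params = {'db': 'pubmed', 'id': '1234567', 'retmode': 'xml'}  # placeholder PMID
    resp = requests.get(EFETCH, params=params, timeout=30)
    resp.raise_for_status()

    root = ET.fromstring(resp.content)
    language = root.findtext('.//Language')           # e.g. 'por'
    vernacular = root.findtext('.//VernacularTitle')  # original-language title, if present
    print(language, vernacular)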
Thanks Emijrp, I guess I will have to integrate the PubMed API now and I guess I can pull other missing article data now as well. I guess this can be closed and I can create a new request with the new tasks and details later. --Notme1560 (talk) 19:09, 24 March 2019 (UTC) (sig added hours later, forgot to sign)
I don't see anything wrong with fixing these English labels that are clearly wrong. There's no assertion anywhere that the label is the actual original title of the paper, we have other properties to state that sort of thing. That said, it would be nice to get title in the original language as well. It would also be nice if somebody could fix the rather substantial number of these which have been added with NO label in any language! I'm not sure how they even did that... ArthurPSmith (talk) 17:36, 25 March 2019 (UTC)
Sometimes CrossRef provides no title information... Does enWP prefer translated titles as labels? My question above remains unanswered. ([Case-control studies with multinomial responses: a proposal for analysis]. (Q27687073)). --Succu (talk) 20:21, 25 March 2019 (UTC)
  • I don't think it's Crossref that's the problem - here are examples with only a Pubmed ID: Q58595485 and Q61049189. SourceMD must be doing some over-filtering and then somehow creating items with no label at all!? ArthurPSmith (talk) 12:17, 26 March 2019 (UTC)
  • Somehow I missed this request. Thanks for doing it! --- Jura 19:43, 12 May 2019 (UTC)

SmhiSwbBot[edit]

SmhiSwbBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: SmhiSwbBot (talk • contribs • logs)

Task/s: SMHI (the Swedish Meteorological and Hydrological Institute) wants to upload surface water body information to Wikidata. Apart from uploading, the data needs to be updated regularly as well.

Code: We reuse the following code: https://github.com/lokal-profil/WFD_import

Function details: --SmhiSwbBot (talk) 09:01, 19 December 2018 (UTC)

@SmhiSwbBot:, please make some test edits. I assume the database you are planning to upload is licensed as CC-0.--Ymblanter (talk) 20:47, 20 January 2019 (UTC)

MewBot[edit]

MewBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Rua (talk • contribs • logs)

Task/s: Importing lexemes from en.Wiktionary in specific languages

Code:

Function details: The bot will be used to parse entries from the English Wiktionary using pywikibot and mwparserfromhell, and then either create lexemes on Wikidata or add information to existing lexemes. Care is taken not to duplicate information: the script checks whether the lexeme exists and already has the desired properties, and only adds anything if not. In case of doubt (e.g. multiple matching lexemes already exist), it skips the edit. I made some test edits using my own user account; they can be seen from [14] to [15]. Today I did a few on the MewBot account.

Individual imports will be proposed with the lexicographical data project first, as the project leaders have said to be careful with imports at first. The current proposal is for Proto-Samic and Proto-Uralic lexemes, seen at Wikidata talk:Lexicographical data#Requesting permission for bot import: Proto-Uralic and Proto-Samic lexemes. Once the project leaders give the OK for all imports, permission will no longer be needed for individual imports. Planned future imports are for Dutch and the modern Sami languages. --—Rua (mew) 09:37, 22 September 2018 (UTC)
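As an illustration of the parsing step (not MewBot's actual code), this is roughly how an English Wiktionary page can be read with pywikibot and its templates inspected with mwparserfromhell; the page title and template name below are assumptions for the example.

    import mwparserfromhell
    import pywikibot

    site = pywikibot.Site('en', 'wiktionary')
    page = pywikibot.Page(site, 'Reconstruction:Proto-Samic/kuolë')  # assumed example page
    wikicode = mwparserfromhell.parse(page.text)

    for template in wikicode.filter_templates():
        if template.name.matches('head') and template.has(1) and template.has(2):
            # {{head|<language code>|<part of speech>|...}}
            language_code = str(template.get(1).value)
            part_of_speech = str(template.get(2).value)
            print(language_code, part_of_speech)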

I am ready to approve this request in a couple of days, provided that no objections will be raised meanwhile. Lymantria (talk) 05:27, 25 September 2018 (UTC)
I just noticed that Wikidata:Bots says I need to indicate where the bot copied the data from. How do I indicate that the data came from Wiktionary? —Rua (mew) 10:51, 25 September 2018 (UTC)
Could you run your bot on a few entries in order to evaluate it? Thanks in advance. Pamputt (talk) 10:59, 26 September 2018 (UTC)
I did, already. Do I need to do more? —Rua (mew) 11:02, 26 September 2018 (UTC)
Oppose. Ah sorry, I did not check before asking. For all reconstructed forms, I think a reference is mandatory. As these "words" do not exist, they come from specialists' work and have to be sourced. Two linguists may reconstruct different forms. That said, I am not sure about the copyright issue for reconstructed forms. It probably belongs to the public domain as scientific work, but it would be better to be sure. Pamputt (talk) 21:42, 26 September 2018 (UTC)
Not all reconstructions on Wiktionary can be sourced to some external work. Some were reconstructed by Wiktionary editors. This is because not all reconstructed forms are available in external works, and we have to fill the gaps ourselves. The bot adds links to Álgu and Uralonet if one exists. —Rua (mew) 22:26, 26 September 2018 (UTC)
I strongly disagree with importing reconstructed forms that do not come from scientific works. One needs criteria to accept such forms, and an academic paper is a good one. Otherwise, anyone can guess their own form. So if you run your bot, please import only "validated" forms. Pamputt (talk) 14:18, 27 September 2018 (UTC)
I agree with that. Only sourced reconstructed forms should be imported. Unsui (talk) 15:50, 27 September 2018 (UTC)
Wiktionary's goal is to be an alternative to existing dictionaries, including etymological dictionaries, not to be dependent on them. The criterion used by Wiktionary is that forms follow established sound laws. Some reconstructions from linguistic sources don't pass that criterion. It fits with the general policy in Wiktionary of not blindly copying from dictionaries but making sure that forms make sense. Reconstructions that are questionable, whether from an external source or not, can be discussed and deleted if found to be invalid. If you have doubts about any of the reconstructions in Wiktionary, you should discuss it there.
That said, what should be done if words in different languages come from a common source, but there is no source that gives a reconstruction? Can lemmas be empty? —Rua (mew) 15:54, 27 September 2018 (UTC)
Here are some cases where Wiktionary has had to correct errors and omissions in sources. I provide a link to Wiktionary, and a link to Álgu, which gives its source.
...and many more. So you see if we have to rely on sources, we become vulnerable to errors, whereas we can correct those errors on Wiktionary, making it more reliable. If Wikidata can't apply the same level of scientific rigour then that is rather worrying. —Rua (mew) 16:42, 27 September 2018 (UTC)
"Wiktionary's goal is to be an alternative to existing dictionaries, including etymological dictionaries, not to be dependent on them."
This may be the case on the English Wiktionary, but on the French Wiktionary original work for etymology is not allowed; every piece of etymological information has to be sourced. Yet Wikidata has to define its own criteria, and for reconstructed forms nothing has been decided so far. About your question "what do we do when a source gives wrong information", I would say that in this case we set a deprecated rank. Pamputt (talk) 19:05, 27 September 2018 (UTC)
You say, for example, "North Sami requires final *ā". OK, but why not *ö? Because linguists have defined laws for this language. It is always linguists' work. Hence, it is possible to put a reference. Otherwise, anything may be created as a reconstructed form. Unsui (talk) 07:16, 28 September 2018 (UTC)
That's nonsense. It still has to stand up to scrutiny. —Rua (mew) 10:02, 28 September 2018 (UTC)
  • For how many new ones is this? --- Jura 11:11, 26 September 2018 (UTC)
  • Oppose for now. It's unclear how many would be imported and we need to solve the original research question first. --- Jura 08:03, 27 September 2018 (UTC)
    Can you elaborate? I don't see what the problem is. —Rua (mew) 10:07, 27 September 2018 (UTC)
    Apparently, you don't know how many you plan to import. --- Jura 10:12, 27 September 2018 (UTC)
    I gave a link to the categories in the other discussion. —Rua (mew) 10:20, 27 September 2018 (UTC)
    • Can you make a reliable statement? Categories tend to evolve and change subcategories. --- Jura 10:22, 27 September 2018 (UTC)
    wikt:Category:Proto-Samic lemmas currently contains 1303 entries. —Rua (mew) 10:25, 27 September 2018 (UTC)
  • I've made a post regarding the import and the conflict in Wiktionary vs Wikidata's policies: wikt:WT:Beer parlour/2018/September#What is Wiktionary's stance on reconstructions missing from sources?. —Rua (mew) 17:36, 27 September 2018 (UTC)
  • Is there any news on this? —Rua (mew) 10:08, 17 October 2018 (UTC)
    @Jura1:, are you fine now with the approval of this bot?--Ymblanter (talk) 13:01, 21 October 2018 (UTC)
    • I will try to write something tomorrow. --- Jura 18:21, 21 October 2018 (UTC)
    • First: sorry for the delay. The question of what to do with lexemes reconstructed at Wiktionary remains open. In general, we would only import information from other WMF sites when we know or can assume that it can be referenced to other quality sources. This isn't the case here. One could argue that Wiktionary is an independent dictionary website and should be considered a reference on its own. Whether or not this is the case depends on how Wikidata and the various Wiktionaries will work going forward: the closer Wiktionary and Wikidata work together, the less we can consider it as such. --- Jura 04:14, 25 October 2018 (UTC)
      • The majority of the Proto-Samic entries on Wiktionary do have an Álgu ID (P5903). Proto-Uralic entries mostly have Uralonet ID (P5902), but the lemma is not always identical to the form given on Uralonet, for which User:Tropylium is mostly responsible as the primary Uralic expert on Wiktionary. Would it be acceptable to import only those entries that have one of these IDs?
      • If so, that leaves the question of what to do with the remainder. It would be a shame if these can't be included in Wikidata, and would mean that Wiktionary is always more complete than Wikidata can be. Words that have an etymology on Wiktionary would have none on Wikidata, because of the Proto-Samic ancestral form being missing. —Rua (mew) 18:43, 30 October 2018 (UTC)
    @Rua: Yes, importing lexemes that have Álgu ID (P5903) or Uralonet ID (P5902) is fine with me. However, the lexemes for which the lemma is not identical to the form given on Uralonet should not be imported, because they are not verifiable. They have to be similar to what the source says. Pamputt (talk) 21:58, 30 October 2018 (UTC)
  • Now pinging @Pamputt: as well.--Ymblanter (talk) 20:02, 21 October 2018 (UTC)
    I have not changed my opinion, because this bot wants to import reconstructed forms without any academic references. If the bot uses academic work as a source, it is fine with me; if not, I oppose (and the discussion shows that we are in this case). Pamputt (talk) 20:08, 21 October 2018 (UTC)

GZWDer (flood) 2[edit]

GZWDer (flood) (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: GZWDer (talk • contribs • logs)

Task/s: Create new items and improve existing items from cebwiki and srwiki

Code: Run via various Pywikibot scripts (probably together with other tools)

Function details: The work includes several steps:

  1. Create items from w:ceb:Kategoriya:Articles without Wikidata item (plan to do together with step 2)
  2. Import GeoNames ID (P1566) for pages from w:ceb:Kategoriya:GeoNames ID not in Wikidata
  3. Import coordinate location (P625) for pages from w:ceb:Kategoriya:Coordinates not on Wikidata‎
  4. Add country (P17) for cebwiki items
  5. Add instance of (P31) for cebwiki items
  6. (probably) Add located in the administrative territorial entity (P131) for cebwiki items
  7. (probably) Add located in time zone (P421) for cebwiki items
  8. Add descriptions in Chinese and English for cebwiki items (only if steps 4 and 5 are completed)

For srwiki, the actions are similar.

--GZWDer (talk) 13:56, 16 July 2018 (UTC)

Note: until phab:T198396 is fixed, this can only be done step by step, not multiple tasks at a time.--GZWDer (talk) 14:02, 16 July 2018 (UTC)
Support. Thank you for your elaboration! Keeping to my word now. Mahir256 (talk) 13:59, 16 July 2018 (UTC)
@Mahir256: Please unblock the bot account; I'm not going to import more statements from cebwiki (and srwiki) until the discussion is closed, and I have several other (low-speed) uses for the bot account.--GZWDer (talk) 14:01, 16 July 2018 (UTC)
Yes, I did that, as I said I would do. Although @GZWDer: what will differ in your procedure with regard to the srwiki items? A lot of those places might have eswiki article equivalents (with the same INEGI code (Q5796667)); do you plan to link these if they exist? Mahir256 (talk) 14:02, 16 July 2018 (UTC)
The harvest_template script cannot check for duplicates, and duplicates can only be found after the data is imported (this may be a bug, though).--GZWDer (talk) 14:04, 16 July 2018 (UTC)
@Pasleim: Would this functionality be easy to add to the tool? It certainly seems desirable, especially with regard to GeoNames IDs. Mahir256 (talk) 14:06, 16 July 2018 (UTC)
See phab:T199698. I do not use Pasleim's harvest template tool because the tool stops automatically when it meets errors (it should retry the edit; if it hits the rate limit, it should retry after some time).--GZWDer (talk) 14:10, 16 July 2018 (UTC)
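One way such a duplicate check could be wired in before item creation (a sketch, not part of harvest_template or any existing tool): query the SPARQL endpoint for items that already carry the GeoNames ID (P1566) about to be imported, and attach the sitelink there instead of creating a new item. The GeoNames ID and the User-Agent string below are placeholders.

    import requests

    SPARQL_ENDPOINT = 'https://query.wikidata.org/sparql'

    def items_with_geonames_id(geonames_id):
        query = 'SELECT ?item WHERE { ?item wdt:P1566 "%s" . }' % geonames_id
        resp = requests.get(SPARQL_ENDPOINT,
                            params={'query': query, 'format': 'json'},
                            headers={'User-Agent': 'cebwiki-duplicate-check/0.1 (placeholder)'},
                            timeout=60)
        resp.raise_for_status()
        bindings = resp.json()['results']['bindings']
        return [b['item']['value'].rsplit('/', 1)[-1] for b in bindings]

    existing = items_with_geonames_id('2661604')  # placeholder GeoNames ID
    if existing:
        print('Attach the cebwiki sitelink to', existing[0], 'instead of creating a new item')
    else:
        print('No existing item found; creation may be safe')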
Oppose. cebwiki is, as many users have noted with concern, the black hole of wikis. This so-called "data" has too many mistakes. --Liuxinyu970226 (talk) 14:15, 16 July 2018 (UTC)
  • Oppose. Needs to do far more checking as to whether related items already exist, to add the information and sitelink to existing items if possible, and to appropriately relate the new item to existing items if not. If other items already have any matching identifiers (but are e.g. linked to a different cebwiki item), or there is any other reason to think it may be a duplicate, then any new item should be marked instance of (P31) Wikimedia duplicated page (Q17362920) as its only P31, and be linked to the existing item by said to be the same as (P460). Jheald (talk) 14:19, 16 July 2018 (UTC)
    • Duplicates are easier to find after they are imported to Wikidata than on cebwiki.--GZWDer (talk) 14:24, 16 July 2018 (UTC)
@Jheald: It may be worth our time (or worth the time of those who already make corrections on cebwiki) to go to GeoNames and correct things our(them)selves so that in the event Lsjbot returns it doesn't recreate these duplicates. Mahir256 (talk) 14:34, 16 July 2018 (UTC)
@GZWDer: I try bloody hard to avoid creating new items that are duplicates, going to considerable lengths with off-line scripts and augmenting existing data to avoid doing so; and doing my level best to clear up any that have slipped online, as quickly as I can. I don't see why I should expect less from anybody else. Jheald (talk) 14:45, 16 July 2018 (UTC)
  • Comment: Given the capacity problems of Wikidata and the fact that cebwiki is practically dormant, I don't think this should be done. Somehow I doubt the operator will do any of the announced maintenance, as I think they announced that a couple of months back and then left it to other Wikidata users. So no, not another 600,000 items. For the general discussion, see Wikidata:Project_chat#Another_cebwiki_flood?.
    --- Jura 20:18, 16 July 2018 (UTC)
    • cebwiki is not dormant as the articles are still being maintained.--GZWDer (talk) 00:30, 17 July 2018 (UTC)
    • Is there a way to see this on ceb:? I take it that any user on ceb:Special:Recent changes without a local user page isn't really active there.
      --- Jura 04:41, 26 July 2018 (UTC)
  • Oppose. Per Jheald. Planning on the basis that it "is much easier to find such duplicates if the data is stored in a structured way", i.e. deliberately importing duplicates (which won't be merged within a very short time), is an abuse of Wikidata and our resources. Resources spent on cleaning up a mess from one source are missing elsewhere for bringing high-quality data to other wikis. The duplicates are a big problem; they pop up in searches, queries etc. Sitelinks might be added after the data is cleaned off-Wikidata (if cleaning is feasible at all; perhaps deletion of articles on cebwiki is a better solution than importing cebwiki sitelinks here). --Marsupium (talk) 23:26, 18 July 2018 (UTC)
    • Duplicates already exist everywhere in Wikidata, so it is not guaranteed that different items refer to different concepts (though that is usually the case), and nobody should use search or query results directly without care. Searches are not intended to be used directly by third-party users. For queries, if data consumers really think duplicates in Wikidata query results are an issue, they can choose to exclude cebwiki-only items from the query results.--GZWDer (talk) 23:45, 18 July 2018 (UTC)
  • Oppose. Thanks a lot for your work on other wikis, it is immensely useful, but this workflow is really not appropriate for cebwiki. Creating new cebwiki items without being certain that they do not duplicate existing items creates a significant strain on the community. It is not okay to expect people to find ways to exclude cebwiki-only items from query results as a result: these items should not be created in the first place. − Pintoch (talk) 09:55, 19 July 2018 (UTC)
    • probably 90% of entries are unique to cebwiki. It may be wise to import these unique entries first.--GZWDer (talk) 16:38, 20 July 2018 (UTC)
      • Well, whatever the actual percentage is, many of us have painfully experienced that it is way too low for our standards. It may be wise to be more considerate to your fellow contributors, and stop hammering the server too. A lot of people have complained about cebwiki item creations, and it is really a shame that a block was necessary to actually get you to stop. So I really stand by my oppose. − Pintoch (talk) 07:34, 21 July 2018 (UTC)
    • The approach outlined above doesn't really address any of the problems with the data.
      --- Jura 04:41, 26 July 2018 (UTC)

Plan 2

This plan does only the following:

  1. Create items for pages in w:ceb:Kategoriya:Articles without Wikidata item (planned to be done together with step 2)
  2. Import GeoNames ID (P1566) for the pages

Therefore:

  1. It becomes easier to find articles that also exist in other Wikipedias via search and projectmerge (and possibly Mix'n'match and other tools)
  2. It is also possible to find entries from their GeoNames ID, and vice versa
  3. As no other data will be imported in plan 2, it will not pollute query results or OpenRefine (unless a query specifically uses GeoNames ID)
  4. Others may still import other data into these items, but only if they are confident doing so; coordinates etc. should preferably be imported from a more reliable database (e.g. GEOnet Names Server)

--GZWDer (talk) 06:09, 26 July 2018 (UTC)
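As a purely illustrative aside (not part of the proposal and not GZWDer's code), the two steps of Plan 2 roughly correspond to a pywikibot call sequence like the following; the page title and GeoNames ID below are placeholder values:

    # Minimal pywikibot sketch of Plan 2 (placeholder values; not the operator's actual code).
    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    def create_item_for_cebwiki_page(page_title, geonames_id):
        """Create a new item with a cebwiki sitelink and a GeoNames ID (P1566) claim."""
        item = pywikibot.ItemPage(repo)  # new, not-yet-created item
        item.editEntity({
            'labels': {'ceb': page_title.split(' (')[0]},  # strip a parenthetical disambiguator
            'sitelinks': {'cebwiki': {'site': 'cebwiki', 'title': page_title}},
        }, summary='Create item for cebwiki page (sketch)')
        claim = pywikibot.Claim(repo, 'P1566')  # GeoNames ID
        claim.setTarget(geonames_id)
        item.addClaim(claim, summary='Import GeoNames ID (sketch)')
        return item

    # Example with placeholder values:
    # create_item_for_cebwiki_page('Example (lungsod)', '1234567')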

Symbol oppose vote.svg Oppose I just oppose your *cebwiki* importing; you are free to import Special:UnconnectedPages links from wikis other than this one. --Liuxinyu970226 (talk) 04:45, 31 July 2018 (UTC)
  • @Pasleim: seems to have done quite a lot of maintenance on cebwiki sitelinks. I'm curious what his view is on this.
    --- Jura 06:39, 31 July 2018 (UTC)
Symbol oppose vote.svg Oppose, this still pollutes OpenRefine results - especially when reconciling via GeoNames ID, which should be the preferred way when this ID is available in the table. I don't see how voluntarily keeping the items basically blank would be a solution at all; it makes it harder to find duplicates. − Pintoch (talk) 11:54, 5 August 2018 (UTC)
Do you have experience with matching based on existing GeoNames IDs then? I still see items on a regular basis which have the wrong ID thanks to bots which imported lots of bad matches years ago (e.g. Weschnitz (Q525148) and River Coquet (Q7337301)), so it would be great if you could explain what you did to avoid mismatches so that bots can do the same. If bots assume that our GeoNames IDs are correct, they'll add sitelinks/statements/descriptions/etc to the wrong items and make a mess that's much harder to clean up than duplicates are. - Nikki (talk) 20:09, 5 August 2018 (UTC)
@Pintoch: Wikidata QIDs are designated as persistent identifiers; they remain valid when items are merged, but it should not be assumed that any item (whether bot-created or not) will never be merged or redirected. There are plenty of mismatches between cebwiki and Wikidata (which should be solved), but creating new items will not bring any new mismatches. Also, why do you think that leaving cebwiki pages unconnected makes it easier to find duplicates?--GZWDer (talk) 09:28, 6 August 2018 (UTC)
@Nikki: Yes I have experience with matching based on GeoNames IDs, and it generally gives very bad results because many items get matched to cebwiki items instead of the canonical item. I don't have any good strategy to avoid mismatches and that is the reason why I regret that these cebwiki items have been created without the appropriate checks for existing duplicates. I understand that cebwiki imports are not the only imports responsible for the unreliability of GeoNames ids in Wikidata, but in my experience the majority of errors came from cebwiki. I am not sure I fully get your point: are you arguing that it is fine to create duplicate cebwiki items because GeoNames IDs in Wikidata are already unreliable? I don't see how existing errors are an excuse for creating more of them. − Pintoch (talk) 09:02, 12 August 2018 (UTC)
@Pintoch: I am arguing that we need to avoid linking the cebwiki pages to the wrong items because merges are vastly better than splits, and that will involve some duplicates. Duplicate IDs continue being valid and will point to the right item even after a merge. The same is not true of splitting and you never know who is already using the ID. I agree that it would be nice to reduce the number of duplicates it creates, but nobody seems to have any idea how it should do that without creating even more bad matches, which is why I was hoping you might have some tips. - Nikki (talk) 13:12, 12 August 2018 (UTC)
@Nikki: okay, I get your point, thanks. So, no I haven't looked into the problem myself. If I had time I would first try to clean up the current items rather than creating new ones (and you have worked on this: thanks again!). I don't think there is any rush to empty w:ceb:Kategoriya:Articles without Wikidata item, so that's why I oppose this bot request. − Pintoch (talk) 18:24, 12 August 2018 (UTC)
@GZWDer: "creating new items will not bring any new mismatches": creating new items will create new duplicates, and that is what disrupts our workflows. I personally don't care about the Wikidata <-> cebwiki mapping. If you care about this mapping, then please improve it without creating duplicates (that is, with reliable heuristics to match the cebwiki articles to existing items). If you do not have the tools to do this import without being disruptive to other Wikidata users, then don't do it. If someone else files a bot request to do this task, with convincing evidence that their import process is more reliable than yours, I will happily support it. − Pintoch (talk) 09:02, 12 August 2018 (UTC)
@Pintoch: Your argument is basically "creating new duplicates is harmful in any case" - but duplicates already exist everywhere, created by different users. They may eventually be merged, and their IDs remain valid. There are many more cases where no match is found, and no items would be created for those pages in the foreseeable future (as it is not possible to handle all 500,000 pages manually).
@GZWDer: there are three differences between other users' duplicates and your duplicates: the first is the scale (500,000 items for this proposal), the second is the absence of any satisfactory checks for existing duplicates (which is unacceptable), the third is the domain (geographical locations are pivotal items that many other domains rely on - creating a mess there is more disruptive than in other areas). This is about creating 500,000 new geographical items with no reconciliation heuristics to check for existing duplicates. This is really detrimental to the project, and I am not the only one complaining about it. − Pintoch (talk) 10:31, 19 August 2018 (UTC)
Also, what about first creating items only for pages that have no existing items with the same labels (this is the default setting of PetScan)?--GZWDer (talk) 20:12, 13 August 2018 (UTC)
I think the checks need to be more thorough than that, for example because cebwiki article titles often include disambiguation information in brackets. For instance, these heuristics would fail to match https://ceb.wikipedia.org/wiki/Amsterdam_(lungsod_sa_Estados_Unidos,_Montana) with Amsterdam-Churchill (Q614935). − Pintoch (talk) 10:31, 19 August 2018 (UTC)
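To make the limitation concrete, here is a small illustrative snippet (a hypothetical helper, not a heuristic anyone in this discussion has proposed): even after stripping the bracketed disambiguator, plain label comparison still misses this kind of duplicate.

    # Hypothetical normalization helper, for illustration only.
    import re
    import unicodedata

    def normalized_title(title):
        """Strip a trailing parenthetical disambiguator and normalize for comparison."""
        base = re.sub(r'\s*\([^)]*\)\s*$', '', title)
        return unicodedata.normalize('NFC', base).casefold()

    print(normalized_title('Amsterdam (lungsod sa Estados Unidos, Montana)'))  # -> 'amsterdam'
    # 'amsterdam' still does not equal 'amsterdam-churchill', so the duplicate with
    # Amsterdam-Churchill (Q614935) would not be caught by label matching alone.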
  • Oppose. Although I'm not aware of this being a policy so far, I believe new items should be created from the encyclopedia that is likely to have the best information on them. A bot shouldn't create new items from a Russian Wikipedia article about a US state or a US politician, and a bot shouldn't create new items about a Russian city or politician from an English Wikipedia article. This restriction wouldn't necessarily apply to items that are not firmly connected to any particular country, such as algebra for example. Jc3s5h (talk) 16:18, 30 August 2018 (UTC)
    • No, this isn't a policy and it never could be. One of Wikidata's main functions is to support other Wikimedia projects by providing interwiki links and structured data. Requiring links to a particular Wikipedia before an item is considered notable would cripple Wikidata. We also can't control which Wikipedias people copy data from. We can refuse to allow a bot to run but that doesn't stop people from doing it manually or with tools like Petscan and Harvest Templates. - Nikki (talk) 12:08, 31 August 2018 (UTC)
  • @Ivan_A._Krestinin: In the meantime, KrBot seems to be doing this. --- Jura 10:28, 11 September 2018 (UTC)
  • Have no time to read the discussion. My bot is importing country (P17), coordinate location (P625), GeoNames ID (P1566) from cebwiki now. — Ivan A. Krestinin (talk) 21:24, 11 September 2018 (UTC)
    • @Ivan_A._Krestinin: There is a lot of opposition to mass-creating new items for cebwiki items (see above), so you should create a new request for permissions before continuing. - Nikki (talk) 12:05, 12 September 2018 (UTC)
      • Ok, I disabled new item creation. I have code for connecting pages from different wikis, but it does not work without item creation because it is based on this scheme: import data, find duplicate items, analyze data conflicts, labels, etc., then merge items. — Ivan A. Krestinin (talk) 20:07, 12 September 2018 (UTC)
        • Thanks. The main issue is that people don't want duplicates. If you can explain what your bot does to avoid duplicates when you create a new request for permissions, it will hopefully be enough to change people's minds. :) - Nikki (talk) 09:00, 13 September 2018 (UTC)

If someone is creating items for all cebwiki articles, I still plan to add statements and descriptions to them. However, for real-life reasons I'd like to place the request Time2wait.svg On hold until January-February 2019 and see what happens. Comments and questions are still welcome, but I will probably not be able to answer them anytime soon.--GZWDer (talk) 06:10, 12 September 2018 (UTC)

@GZWDer: Since there are too many oppose comments, and privacy concerns have already been raised with WMF Trust & Safety, it's unlikely that your request can be approved, so why not withdraw it? --Liuxinyu970226 (talk) 22:41, 15 September 2018 (UTC)

crossref bot

crossref bot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Mahdimoqri (talk • contribs • logs)

Task/s: Add missing journals from the Crossref API

Code: https://github.com/moqri/wikidata_scientific_citations/tree/master/add_journal/crossref

Function details: add missing journals from crossref --Mahdimoqri (talk) 21:12, 19 April 2018 (UTC)

See the discussion here and the data import request and workflow here

@DarTar, Daniel_Mietchen, Fnielsen, John_Cummings, Mahir256: any thoughts or feedback?

Wolfgang8741 bot

Wolfgang8741 bot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Wolfgang8741 (talk • contribs • logs)

Task/s: OpenRefine imports to Wikidata.

Code: N/A

Function details: Data imports from OpenRefine datasets --Wolfgang8741 (talk) 02:16, 18 June 2018 (UTC)

Pictogram voting comment.svg Comment What kind of data from what source?  – The preceding unsigned comment was added by Matěj Suchánek (talk • contribs).
@Matěj Suchánek: Sorry I missed this comment. This is not a fully automated bot, but the human-assisted tool OpenRefine used for larger imports, starting with small-scale tests before larger application. The current focus is on the GNIS import at this time. Yes, the import description and process need to be built out a bit more, and I'm not using this until I refine the process and get community approval for the import. The initial learning curve and orientation to the Wikidata processes are in progress. Wolfgang8741 (talk) 15:49, 5 September 2018 (UTC)
@Wolfgang8741: sorry for the delayed response. Are you still interested in pursuing this, and do you still want a bot flag? --DannyS712 (talk) 18:28, 10 May 2020 (UTC)

CanaryBot 2

CanaryBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Ivanhercaz (talk • contribs • logs)

Task/s: set labels, descriptions and aliases in Spanish on properties that lack them.

Code: the code is available in PAWS as a Jupyter IPython notebook. When I have time I will upload the .ipynb and .py files to the CanaryBot repo on GitHub.

Function details: Well, the task that I am requesting is very similar to the first task that I requested, but I asked Ymblanter for an opinion about it, and they recommended that I request a new task because I am going to use new code: in this case I am going to set labels, descriptions and aliases on properties, not on items as I did in my last task.

In addition, this script works differently: I extracted all the properties without a label or a description in Spanish (or without both), and then merged them into one CSV in which I am filling the cells with their respective translations. When all the cells are filled I will run the script, which will read each row, check whether the property has labels, descriptions and aliases in Spanish, and, if not, add the content of the respective cells.

I still have to test and improve some things in the script. It is very basic, but it works for what I want to do. I log everything so that I know how to solve any error that happens.

Well, I await your answers and opinions. Thanks in advance!

Regards, Ivanhercaz Plume pen w.png (Talk) 23:45, 10 May 2018 (UTC)
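For readers following along, here is a minimal sketch of the CSV-driven workflow described above; the file name and column names are hypothetical, and this is not the actual PAWS notebook:

    # Hypothetical sketch of the CSV-driven workflow described above (not the actual PAWS notebook).
    import csv
    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    with open('properties_es.csv', newline='', encoding='utf-8') as f:  # hypothetical file and columns
        for row in csv.DictReader(f):
            prop = pywikibot.PropertyPage(repo, row['property'])  # e.g. 'P1234'
            prop.get()
            data = {}
            if row.get('label_es') and 'es' not in prop.labels:
                data['labels'] = {'es': row['label_es']}
            if row.get('description_es') and 'es' not in prop.descriptions:
                data['descriptions'] = {'es': row['description_es']}
            if row.get('aliases_es'):
                new_aliases = [a.strip() for a in row['aliases_es'].split('|') if a.strip()]
                # Append to the existing aliases instead of replacing them (the bug discussed below).
                data['aliases'] = {'es': prop.aliases.get('es', []) + new_aliases}
            if data:
                prop.editEntity(data, summary='Add missing Spanish terms (sketch)')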

I improved the code: some stats, log report fixed... I think it is ready to run without problems. What I need now is to finish the translations of the properties. I await your opinions about this task. Regards, Ivanhercaz Plume pen w.png (Talk) 15:54, 12 May 2018 (UTC)
I am ready to approve the bot task in a couple of days, provided that no objections will be raised. Lymantria (talk) 06:54, 13 May 2018 (UTC)
  • Could you link the test edits? I only find Portuguese.
    --- Jura 16:24, 13 May 2018 (UTC)
    Of course, Jura. I thought I had shared the contributions on test.wikidata, but I had not, excuse me. I think you are referring to the edits in Portuguese that my bot made with its first task; in this case I only work with Spanish labels, descriptions and aliases. You can check my latest contributions on test.wikidata. I noticed the edit summary was wrong because it said "setting es-label" for everything, even when the bot changed a description or an alias; I just fixed it and now it shows the correct summary, as you can see in the last three edits. But I have found a bug that I have to fix: if you check this diff, you can see how the bot replaced the existing aliases with the new ones, whereas I want to append the new aliases and keep the old ones, so I have to fix that.
    I am not worried about the timing or whether the task is accepted now or in the future; I just wanted to propose it and talk about how it would work. But, to be honest, I still have to fill in the CSV file, so I have plenty of time to fix this type of error and improve the script. For that reason I requested another task.
    Regards, Ivanhercaz Plume pen w.png (Talk) 17:19, 13 May 2018 (UTC)
    For bot approvals, operators generally do about 50 or more edits here at Wikidata. These "test edits" are expected to be of standard quality.
    --- Jura 17:23, 13 May 2018 (UTC)
    I know, Jura, but I did not want to make the test edits on Wikidata without authorization or a request from someone, because this task is not approved. Well, as you are requesting these test edits, once the aliases bug has been solved I will run the bot on Wikidata and report here whether it works fine or not. Regards, Ivanhercaz Plume pen w.png (Talk) 17:29, 13 May 2018 (UTC)
    I fixed the bug of the aliases, as you can check here. I will notify you, Jura, when I have done the test edits in Wikidata and not in test.wikidata. Regards, Ivanhercaz Plume pen w.png (Talk) 18:26, 13 May 2018 (UTC)
  • @Jura1, Ymblanter: Today I could only make a few test edits. I will make more in the next days to check it better. Regards, Ivanhercaz Plume pen w.png (Talk) 18:15, 14 May 2018 (UTC)
    I forgot to share with you the log and if you check the notebook you can see the generated graph. Regards, Ivanhercaz Plume pen w.png (Talk) 18:26, 14 May 2018 (UTC)

maria research bot

maria research bot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Mahdimoqri (talk • contribs • logs)

Task/s: add missing articles and citations information for articles listed on PubMed Central

Code: https://github.com/moqri/wikidata_scientific_citations

Function details: --Mahdimoqri (talk) 06:15, 13 March 2018 (UTC)

Symbol support vote.svg Support Mahir256 (talk) 22:37, 13 March 2018 (UTC)
Pictogram voting comment.svg Comment This Fatameh-based script is useful for most of phase 1 and works fine for PubMed IDs and for some Crossref IDs as well but it does not address the citation part from phase 2 onwards. --Daniel Mietchen (talk) 13:27, 14 March 2018 (UTC)
Thanks Daniel Mietchen, I modified the description of the task here to confirm what the bot does at the moment. Mahdimoqri (talk) 15:52, 14 March 2018 (UTC)
Symbol support vote.svg Support That looks good to me. --Daniel Mietchen (talk) 19:54, 14 March 2018 (UTC)
Symbol support vote.svg Support The Fatameh edits from this bot seem fine so far. It is a nice simple script. I note some Fatameh artifacts in the titles, e.g., "*." in BOOKWORMS AND BOOK COLLECTING (Q50454030). But I suppose we have to live with that... — Finn Årup Nielsen (fnielsen) (talk) 18:44, 14 March 2018 (UTC)
I was going to write the same thing. Can we remove the trailing full stop (".") ? I'm sure some bot could clean up the existing ones as well.
--- Jura 20:37, 14 March 2018 (UTC)
Thanks Finn Årup Nielsen (fnielsen) and Jura, I would be happy to add another script to remove asterisks or to fix any other issues you find, after the PMC items are added. Mahdimoqri (talk) 23:10, 14 March 2018 (UTC)
For the final dot, can you remove it before adding it to the label/title statement?
--- Jura 23:17, 14 March 2018 (UTC)
Thanks Jura! Unfortunately, as far as I know, Fatameh does not have any out-of-the-box option for such changes. I'd recommend that a separate script be written just for this purpose, since there are currently 14 million other articles that have this problem (https://www.wikidata.org/w/index.php?tagfilter=OAuth+CID%3A+843&limit=50&days=7&title=Special:RecentChanges&urlversion=2). Daniel Mietchen might be interested in such a script too. Mahdimoqri (talk) 02:51, 15 March 2018 (UTC)
@T Arrow, Tobias1984: could you fix Fatameh?
--- Jura 07:21, 15 March 2018 (UTC)
There is a task for it here: https://phabricator.wikimedia.org/T172383 Mahdimoqri (talk) 15:08, 15 March 2018 (UTC)
Do any of the people who wrote the code actually follow phabricator? I tried to find the part of the code where the dot gets added/should be removed, but I was probably in the wrong module. Any ideas?
--- Jura 05:16, 16 March 2018 (UTC)
I'm just not checking it all that regularly. I've replied to the ticket. Fatameh relies on WikidataIntegrator to do most of the heavy lifting. This uses PubMed as the data source and (unfortunately?) they actually report all the titles as ending in a period (or other punctuation). I think we need to find a reference for the titles without the period rather than just changing all the existing statements. There was a short discussion on the WikiCite mailing list as well. I'm happy to work on a solution but I'm not really sure what is the best way forward. T Arrow (talk) 09:26, 16 March 2018 (UTC)
Jura, I added the fix for the trailing dots and asterisks in a separate script (fatameh_sister_bot). Any other issues that I can address to have your support? Mahdimoqri (talk) 06:22, 17 March 2018 (UTC)
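For context, the kind of cleanup being discussed can be as simple as the following sketch (illustrative only; it is not the actual fatameh_sister_bot script):

    # Illustrative sketch of the label cleanup discussed above (not the actual fatameh_sister_bot script).
    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    def cleaned(text):
        """Strip asterisks and the trailing period that PubMed-derived titles carry."""
        return text.replace('*', '').rstrip('.').strip()

    def fix_english_label(qid):
        item = pywikibot.ItemPage(repo, qid)
        item.get()
        old = item.labels.get('en')
        if old and cleaned(old) != old:
            item.editLabels({'en': cleaned(old)}, summary='Strip trailing dot/asterisks (sketch)')

    # Example: fix_english_label('Q50454030')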

Thanks all for providing feedback and offering solutions/help to address the issue with Fatameh. It seems it will be a fix either for Fatameh or a separate script. In either case, it is to be applied to all article items, which I believe could be done independently of this bot. Meanwhile, could you support and accept this bot so I can get it started and maybe set up a new bot for fixing other issues? Mahdimoqri (talk) 21:12, 16 March 2018 (UTC)

Symbol oppose vote.svg Oppose I don't think we should approve another Fatameh based bot until major concerns are fixed. --Succu (talk) 21:24, 16 March 2018 (UTC)
Thanks for your feedback Succu. I just created a bot (Fatameh_sister_bot) that fixes the issue with the label for the items created using Fatameh. I'll make sure I run it on everything maria research bot creates to address the concern with the titles. Are there any other issues that I can address? Mahdimoqri (talk) 06:04, 17 March 2018 (UTC)
@Succu: I also fixed this issue from the root in Fatameh source code here so new items are created without the trailing dot.
Title statements would need the same fix and some labels have already been duplicated into other languages (maybe this is taken care of, but I haven't seen any in the samples).
--- Jura 09:35, 18 March 2018 (UTC)
Thanks for the feedback Jura. The translated labels (if any) are added to labels. I will take care of the title statement now.
@Jura1: the titles are also fixed and the code has been updated (https://github.com/moqri/wikidata_scientific_citations/blob/master/fatameh_sister_bot/fix_labels_and_titles.py). Any other issues that I can address to have your support for the bot?
I think the cleanup bot/task can be authorized.
--- Jura 12:30, 21 March 2018 (UTC)
@Jura1: wonderful! this is the request for the cleanup bot: fatameh_sister_bot. Could you please state your support there, for a bot flag?
I don't think edits like this one are OK, Mahdimoqri, because you are ignoring the reference given. And please wait with this kind of correction until you get the flag. --Succu (talk) 22:36, 22 March 2018 (UTC)
@Succu: the title in the reference is not exactly correct. Please refer to this reference or this reference for the correct title. Would you like the bot to change the reference as well?  – The preceding comment was left unsigned.
The cleanup should be fine. It just strips an artifact PubMed adds.
--- Jura 09:24, 23 March 2018 (UTC)
Translated titles are enclosed within brackets. This should be changed. The current version overwrites existing page(s) (P304) with incomplete values. --Succu (talk) 10:08, 18 March 2018 (UTC)
@Succu: thanks for the feedback! I could not find any instance of either of the issues! Could you please reply with one instance of each of these two issues that is created by my bot so that I can address them? Mahdimoqri (talk) 03:17, 19 March 2018 (UTC)
[Sexually-transmitted infection in a high-risk group from Montería, Colombia]. (Q50804547) is an example for the first issue. Removing the brackets only is not the solution. --Succu (talk) 22:30, 22 March 2018 (UTC)
I will not import any items with translated titles (until there is a consensus on what is the solution on this). Mahdimoqri (talk) 14:06, 30 March 2018 (UTC)
We should try to figure out how to handle them (e.g. import the original language and delete the "title" statement, possibly find the original title and add that as the title and label in that language). For new imports, it would just need to skip adding the title statement and add a language of work or name (P407).
--- Jura 09:24, 23 March 2018 (UTC)
Or use original language of film or TV show (P364). Anyway, it should be made clear that the original title is not English. -- JakobVoss (talk) 14:45, 24 March 2018 (UTC)
Should we attempt to add a statement that identifies them as not being in English before we actually manage to determine the original language?
--- Jura 21:12, 24 March 2018 (UTC)

AmpersandBot

AmpersandBot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: PinkAmpersand (talk • contribs • logs)

Task/s: Generate descriptions for village items in the format of "village in <place>, <place>, <country>"

Code: https://github.com/PinkAmpersand/AmpersandBot/blob/master/village.py

Function details: With my first approved task (approved in July 2016, but not completed until recently), I set descriptions for about 20,000 Ukrainian villages based on their country (P17), instance of (P31), and located in the administrative territorial entity (P131) values. Now, I would like to use the latter two values to generalize this script to—ominous music—every village in the world!

The script works as follows:

  1. It pulls up 5,000 items backlinking to village (Q532)
  2. It checks whether an item is instance of (P31) village (Q532)
  3. It then labels items as follows:
    1. It removes disambiguation from labels in any language:
      1. It runs a RegEx search for ,| \(
      2. It removes those characters and any following ones
      3. It sets the old label as an alias for the given language
      4. If the alias is in Unicode, it creates an ASCII version and sets that as an alias as well
      5. It compiles a new list of labels and aliases for the relevant languages, and updates the item with all of them at once
    2. It sets labels in all Latin-script languages:
      1. It checks if the current Latin-script languages all use the same label.
      2. If they don't, it does nothing except log the item for further review.
      3. If they do, it sets that label as the label for all other Latin-script languages, using a list of 196 (viewable in the source code)
      4. If the label is in Unicode, it also sets an ASCII version of the label as an alias
      5. It compiles a new list of labels and aliases for the relevant languages, and updates the item with all of them at once
  4. And describes items as follows:
    1. It checks whether the item either a) lacks an English description or b) has an English description that merely says "village in <country>" or "village in <region>". (I've manually coded into the RegEx the names of every multi-word country. This still leaves a blind spot for multi-word entities other than countries. I welcome advice on how to fix this.)
    2. If so, it gets the item's parent entity. If that entity is a country, it describes the item as "village in <parent>"
    3. If the parent entity is not a country, it checks the grandparent entity. If that is a country, it describes the item as "village in <parent>, <grandparent>"
    4. Next onto the great-grandparent entity. "village in <parent>, <grandparent>, <great-grandparent>"
    5. For the great-great-grandparent entity, only the top three levels are used: "village in <grandparent>, <great-grandparent>, <great-great-grandparent>". This is slightly more likely to result in dupe errors, but the code handles those.
    6. Ditto the thrice-great-grandparent entity.
    7. If even the thrice-great-grandparent is not a country, the item is logged for further review. If people think I should go deeper, I am willing to; I may do so of my own initiative if the test run turns up too many of these errors.
  5. After 5,000 items have been processed, another 5,000 are pulled. The script continues until there are no backlinks left to describe.

Does this sound good? — PinkAmpers&(Je vous invite à me parler) 01:43, 22 February 2018 (UTC) Updated 22:17, 3 March 2018 (UTC)
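For reviewers, here is a rough sketch of the description step outlined above (simplified: it omits the label/alias handling and duplicate checks, and it is not the actual village.py):

    # Rough, simplified sketch of the description logic outlined above (not the actual village.py).
    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    COUNTRY = 'Q6256'  # country

    def is_country(entity):
        entity.get()
        return any(c.getTarget() and c.getTarget().id == COUNTRY
                   for c in entity.claims.get('P31', []))

    def parent_chain(item, depth=5):
        """Walk up located in the administrative territorial entity (P131), up to `depth` levels."""
        chain, current = [], item
        for _ in range(depth):
            current.get()
            claims = current.claims.get('P131', [])
            if not claims or not claims[0].getTarget():
                break
            current = claims[0].getTarget()
            chain.append(current)
            if is_country(current):
                break
        return chain

    def english_description(village_item):
        chain = parent_chain(village_item)
        if not chain or not is_country(chain[-1]):
            return None  # logged for further review, as described above
        names = [p.labels.get('en', p.title()) for p in chain]
        return 'village in ' + ', '.join(names[-3:])  # only the top three levels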

Test run here. The only issue that arose was some items, like Koro-ni-O (Q25694), being listed in my command line as updated, but not actually updating. It's a bug, and I'll look into it, but its only effect is to limit the bot's potential, not to introduce any unwanted behavior. — PinkAmpers&(Je vous invite à me parler) 02:16, 22 February 2018 (UTC)
I will approve the bot in a couple of days provided no objections have been raised.--Ymblanter (talk) 08:39, 25 February 2018 (UTC)
Cool, thanks! But actually, I'm working on a few more things for the bot to do to these village items while it's "in the neighborhood", so would you mind holding off until I can post a second test run? — PinkAmpers&(Je vous invite à me parler) 00:23, 26 February 2018 (UTC)
This is fine, no problem.--Ymblanter (talk) 10:42, 26 February 2018 (UTC)
@Ymblanter:. Okay. I'm all done. I've updated the bot's description above. Diff of changes here. New test run here. There was one glitch in this test run, namely that the bot failed to add ASCII aliases for Unicode labels while performing the Latin-script label unanimity function. This was due to a stray space before the word aliases in line 247. I fixed that here, and ran a test edit here to check that that worked. But I'm happy to run a few dozen more test edits if you want to see that fix working in action. — PinkAmpers&(Je vous invite à me parler) 22:17, 3 March 2018 (UTC)
Concerning the Latin-script languages, not all of them use the same spelling. For example, here I am sure that in lv it is not Utvin (most likely Utvins), in lt it is not Utvin, and possibly in some other languages it is not Utvin (for example, crh uses phonetic spelling; Utvin may be fine, but other names will not be). I would suggest restricting this part of the task to major languages (say German, French, Spanish, Portuguese, Italian, Danish, Swedish, maybe a couple more) and doing some research for the others - I have no idea, for example, what Navajo uses. The rest seems to be fine.--Ymblanter (talk) 07:48, 4 March 2018 (UTC)
I'm concerned about exonyms too. Even if a language uses the same name variant as other Latin-script languages for most settlements, then there are particular settlements for which it may not do so. 90.191.81.65 14:30, 4 March 2018 (UTC)
I considered that, 90.191.81.65, but IMHO it's not a problem. The script will never overwrite an earlier label, and indeed won't change the labels unless all existing Latin-script labels are in agreement. So the worst-case scenario here is that an item would go from having no label in one language to having one that is imperfect but not incorrect. An endonym will always be a valid alias, after all. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
I'm not sure that all languages consider an endonym as a valid alias if there's an exonym too. And if it is considered technically not incorrect then for some cases an endonym would still be rather odd. My concern on this is similar to one currently brought up in project chat. 90.191.81.65 07:58, 5 March 2018 (UTC)
I would think that an endonym is by definition a valid alias. The bar for "valid alias" is pretty low, after all. So if there isn't consensus to use endonyms as labels, I can set them as aliases instead. — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
Also, all romanized names are probably problematic. Many languages may use the same romanization system (the same as in English or the one recommended by the UN) for particular foreign language, but there are also languages which have their own romanization system. So a couple of the current Latin-script languages using the same romanization would be merely a coincidence. 90.191.81.65 14:49, 4 March 2018 (UTC)
I'm confused about your concern here. The only romanization that the script does is in setting aliases, not labels. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
All Ukrainian, Georgian, Arabic etc. place names, apart from exonyms, are romanized in Latin-script languages. And there are different romanization systems, some specific to a particular language, e.g. the Ukrainian-Estonian transcription. For instance, currently all four Latin labels for Burhunka (Q4099444) happen to be "Burhunka", but that wouldn't be correct in Estonian. 90.191.81.65 07:58, 5 March 2018 (UTC)
Well that's part of why I'm using a smaller set of languages now. Can you give me examples of languages within the set that have this same problem? — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
Thanks for the feedback, Ymblanter. I've pared back the list, and posted at project chat asking for help with re-expanding it. See Wikidata:Project chat § Help needed with l10n for bot. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)

I note that here the bot picks up the name of a former territorial entity, though preferred rank is set for the current parish. Also, is the whole territorial hierarchy really necessary in the description if there's no need to disambiguate from other villages with the same name in the same country? For a small country like Estonia I'd prefer simpler descriptions. 90.191.81.65 14:30, 4 March 2018 (UTC)

The format I'm using is standard for English-language labels. See Help:Description § Go from more specific to less specific. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
The section you refer to concerns the order in which you go from more specific to less specific in a description. As for how specific you should go, it leaves that open, apart from saying in the section above that adding one subregion of a country is common and giving two examples where the whole administrative hierarchy is not shown. 90.191.81.65 07:58, 5 March 2018 (UTC)
To me, the takeaway from Help:Description is that using a second-level subregion is not required, but also not discouraged. It comes down to an individual editor's choice. — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
  • Pictogram voting comment.svg Comment I'm somewhat concerned about the absence of a plan to maintain this going forward. If descriptions in 200 languages for 100,000s of items are being added, this becomes virtually impossible to correct manually. Descriptions can need to be maintained if the name changes, or if the P131 is found to be incorrect or irrelevant. Already now, default labels for items that may seem static (e.g. categories/lists) aren't maintained once they are added; this would just add another chunk of redundant data that isn't maintained. The field already suffers from the absence of maintenance of cebwiki imports, so please don't add more to it. Maybe one would want to focus on English descriptions and native label statements instead.
    --- Jura 10:16, 12 March 2018 (UTC)

taiwan democracy common bot

taiwan democracy common bot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Twly.tw (talk • contribs • logs)

Task/s: Input Taiwanese politician data; it's a project from mySociety

Code:

Function details: follow these steps to input politician data, mainly position held (P39) statements and the related terms, constituencies and political parties.

The operator can't be the bot itself. So who's going to operate the bot? Mbch331 (talk) 14:46, 18 February 2018 (UTC)
Operator will be Twly.tw (talk • contribs • logs), bot: taiwan democracy common bot.
This is Twly.tw (talk • contribs • logs), based on Wikidata:Requests for permissions/Bot/taiwan democracy common. --Ymblanter (talk) 09:42, 25 February 2018 (UTC)
I would like to get some input from uninvolved users here before we can proceed.--Ymblanter (talk) 18:56, 1 March 2018 (UTC)
  • The bot might need a fix for date precision (9→7). It seems that everybody is born on January 1: Q19825688, Q8274933, Q8274088, Q8350110. As these items already had more precise dates, it might want to skip them.
    --- Jura 11:00, 12 March 2018 (UTC) Fixed, thanks.
@Jura1:, can we proceed here?--Ymblanter (talk) 21:06, 21 March 2018 (UTC)
I have a hard time trying to figure out what it's trying to do. Maybe some new test edits could help. Is the date precision for the start date in the qualifier of Q19825688 correct?
--- Jura 21:37, 21 March 2018 (UTC)

Newswirebot

Newswirebot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Dhx1 (talk • contribs • logs)

Task/s:

  1. Create items for news articles that are published by a collection of popular/widespread newspapers around the world.

Code:

  • To be developed.

Function details:

Purpose:

  • New items created by this bot can be used in described by source (P1343) and other references within Wikidata.
  • New items created by this bot can be referred to in Wikinews articles.

Process:

  1. For each candidate news article, check whether a Wikidata item of the same title exists with a publication date (P577) +/- 1 day.
    1. If an existing Wikidata item is found, check whether publisher (P123) is a match as well.
    2. If publisher (P123) matches, ignore the candidate news article.
  2. For each candidate news article, check whether an existing Wikidata item has the same official website (P856) (full URL to the published news article).
    1. If official website (P856) matches, ignore the candidate news article.
  3. If no existing Wikidata item is found, create a new item.
  4. Add a label in English which is the article title.
  5. Add descriptions in multiple languages in the format of "news article published by PUBLISHER on DATE".
  6. Add statement instance of (P31) news article (Q5707594).
  7. Add statement language of work or name (P407) English (Q1860).
  8. Add statement publisher (P123).
  9. Add statement publication date (P577).
  10. Add statement official website (P856).
  11. Add statement author name string (P2093) which represents the byline (Q1425760). Note that this could be the name of a news agency, or a combination of news agency and publisher, if the writer is not identified.
  12. Add statement title (P1476) which represents the headline (Q1313396).

Example sources and copyright discussions:

--Dhx1 (talk) 13:00, 8 February 2018 (UTC)
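To illustrate the creation step (points 3-10 above), here is a minimal pywikibot sketch; the duplicate checks, multi-language descriptions, P2093 and P1476 are omitted for brevity, all values are placeholders, and this is not code from the proposal:

    # Minimal sketch of the item-creation step described above (placeholder values; not the proposed bot code).
    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    def create_news_article_item(title, url, publisher_qid, pub_date):
        item = pywikibot.ItemPage(repo)  # new item
        date_str = f'{pub_date.year}-{pub_date.month:02d}-{pub_date.day:02d}'
        item.editEntity({
            'labels': {'en': title},
            'descriptions': {'en': f'news article published on {date_str}'},  # simplified description
        }, summary='Create news article item (sketch)')

        targets = {
            'P31': pywikibot.ItemPage(repo, 'Q5707594'),  # instance of: news article
            'P407': pywikibot.ItemPage(repo, 'Q1860'),    # language of work or name: English
            'P123': pywikibot.ItemPage(repo, publisher_qid),
            'P577': pub_date,                             # publication date (a WbTime)
            'P856': url,                                  # official website (full article URL)
        }
        for prop, target in targets.items():
            claim = pywikibot.Claim(repo, prop)
            claim.setTarget(target)
            item.addClaim(claim)
        return item

    # Example with placeholder values:
    # create_news_article_item('Example headline', 'https://example.org/article',
    #                          '<publisher QID>', pywikibot.WbTime(year=2018, month=2, day=8))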

Interesting initiative. How many articles do you plan to create per day? --Pasleim (talk) 08:44, 9 February 2018 (UTC)
I was thinking of programming the bot to regularly check Grafana and/or Special:DispatchStats or a similar statistics endpoint, raising or lowering the rate of edits up to a predefined limit. It appears that larger publishers may publish around 300 articles per day, so if the bot was developed to work with 10 sources, that is around 3,000 new articles per day, or one new article every 30 seconds. For the initial import, an edit rate of 1 article creation per second (what User:Research_Bot seems to use at the moment) would allow 86,400 articles to be processed per day, or approximately 30 days' worth of archives processed per day. At that rate, it might take 4-5 months to complete the initial import. Dhx1 (talk) 10:12, 9 February 2018 (UTC)
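A quick back-of-the-envelope check of the rates mentioned above (using only the figures already given):

    # Back-of-the-envelope check of the rates mentioned above.
    sources = 10
    per_source_per_day = 300
    ongoing = sources * per_source_per_day       # 3,000 new articles per day
    print(86400 / ongoing)                       # ~28.8 s between articles, i.e. roughly one every 30 s

    backfill_per_day = 1 * 86400                 # 1 creation per second -> 86,400 items per day
    print(backfill_per_day / ongoing)            # ~28.8, i.e. roughly 30 days of archives per day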
We probably need the code and test edits to continue this discussion.--Ymblanter (talk) 08:31, 25 February 2018 (UTC)
@Dhx1: What do you think about Zotero translators? Could they be somehow used in order to speed up the process?--Malore (talk) 16:09, 20 September 2018 (UTC)
@Malore: I have been using scrapy, which is trivial to use for crawling and extracting information. The trickier part at the moment is finding matching Wikidata items that already exist, and writing to Wikidata. Pywikibot doesn't seem to allow writing a large Wikidata item at once with many claims, qualifiers and references. The API allows it, however, and the WikidataIntegrator bot also allows it, albeit with little documentation to make it clear how it works. Zotero could be helpful if a large community forms around it for news website metadata scraping (for bibliographies). Dhx1 (talk) 11:53, 23 September 2018 (UTC)

@Dhx1:, how is the work getting the bot started going? --Trade (talk) 18:42, 15 March 2020 (UTC)

  • Symbol oppose vote.svg Oppose having a push system like this will mean we end up with thousands, maybe millions of unused items. Not the right approach. I suggest you implement a pull based system where a user can easily import it when needed as a reference. Multichill (talk) 20:22, 15 March 2020 (UTC)

neonionbot

neonionbot (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: Jkatzwinkel (talk • contribs • logs)

Task/s: Map semantic annotations made with the annotation software neonion to Wikidata statements in order to submit either bibliographical evidence, additional predicates or new entities to Wikidata. The annotation software neonion is used for collaborative semantic annotation of academic publications. If a text resource being annotated is an open-access publication and is linked to a Wikidata item page holding bibliographical metadata about the corresponding open-access publication, verifiable contributions can be made to Wikidata by one of the following:

  1. For a semantic annotation, identify an equivalent wikidata statement and provide bibliographical reference for that statement, linking to the item page representing the publication in which the semantic annotation has been created.
  2. If a semantic annotation provides new information about an entity represented by an existing wikidata item page, create a new statement for that item page containing the predicate introduced by the semantic annotation. Attach bibliographic evidence to the new statement analogously to scenario #1.
  3. If a semantic annotation represents a fact about an entity not yet represented by a wikidata item page, create an item page and populate it with at least a label and a P31 statement in order to meet the requirements for scenario #2. Provide bibliographical evidence as in scenario #1.


Code: Implementation of this feature will be published on my neonion fork on github.

Function details: Prerequisite: Map the model of neonion's controlled vocabulary to terminological knowledge extracted from Wikidata. Analysis of Wikidata instance/class relationships ensures that concepts of the controlled vocabulary can be mapped to item pages representing Wikidata classes.

Task 1: Identify item pages and possibly statements on wikidata that are equivalent to the information contained in semantic annotations made in neonion.

Task 2: Based on the results of task 1, determine if it is appropriate to create additional content on wikidata in form of new statements or new item pages. For the statements at hand, provide an additional reference representing bibliographical evidence referring to the wikidata item page representing the open access publication in which neonion created the semantic annotation.

What data will be added? The proposed scenario is meant to be tried first on articles published in the scientific open-access journal Apparatus. --Jkatzwinkel (talk) 06:15, 19 October 2017 (UTC)
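For scenario 1, the kind of edit involved might look roughly like the following pywikibot sketch; the use of stated in (P248) and all identifiers are placeholders chosen for illustration, not the planned neonion implementation:

    # Hypothetical sketch of scenario 1 (placeholder IDs; not the planned neonionbot implementation).
    import pywikibot

    site = pywikibot.Site('wikidata', 'wikidata')
    repo = site.data_repository()

    def add_publication_reference(subject_qid, prop, publication_qid):
        """Attach a stated in (P248) reference pointing at the open-access publication item."""
        item = pywikibot.ItemPage(repo, subject_qid)
        item.get()
        for claim in item.claims.get(prop, []):  # for illustration, reference every claim of this property
            source = pywikibot.Claim(repo, 'P248', is_reference=True)
            source.setTarget(pywikibot.ItemPage(repo, publication_qid))
            claim.addSources([source], summary='Add bibliographical evidence (sketch)')

    # Example: add_publication_reference('<subject QID>', '<property>', '<publication QID>')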

I find this proposal very hard to understand without seeing an example - can you run one or mock one (or several) up using the neonionbot account so we can see what it would likely do? ArthurPSmith (talk) 13:12, 19 October 2017 (UTC)

Handelsregister

Handelsregister (talk • contribs • new items • SUL • block log • user rights log • user rights • xtools)
Operator: SebastianHellmann (talk • contribs • logs)

Task/s: Crawl https://www.handelsregister.de/rp_web/mask.do, then go to UT (Unternehmenstraeger) and add an entry for each German organisation into Wikidata with the basic info, especially the registering court and the ID assigned by that court.

Code: The code is a fork of https://github.com/pudo-attic/handelsregister (small changes only)

Function details:

Task 1, prerequisite for Task 2: Find all current organisations in Wikidata that are registered in Germany and find the corresponding Handelsregister entry. Then add the data to the respective Wikidata items.

What data will be added? The Handelsregister collects information from all German courts, where all organisations in Germany are obliged to register. The data is passed from the courts to a private company running the Handelsregister, which makes part of the information public (i.e. UT - Unternehmenstraegerdaten, core data) and sells the other part. Each organisation can be uniquely identified by the registering court and the number assigned by this court (the number alone is not enough, as two courts might assign the same number). Here is an example of the data:

  • Saxony District court Leipzig HRB 32853 – A&A Dienstleistungsgesellschaft mbH
  • Legal status: Gesellschaft mit beschränkter Haftung
  • Capital: 25.000,00 EUR
  • Date of entry: 29/08/2016
  • (When entering date of entry, wrong data input can occur due to system failures!)
  • Date of removal: -
  • Balance sheet available: -
  • Address (subject to correction): A&A Dienstleistungsgesellschaft mbH
  • Prager Straße 38-40
  • 04317 Leipzig

Most items are stable, i.e. each organisation is registered when it is founded and assigned a number by the court: Saxony District court Leipzig HRB 32853. After that, only the address and the status can change. For Wikidata, it is no problem to keep companies that no longer exist, as they should be preserved for historical purposes.

Maintenance should be simple: Once a Wikidata item contains the correct court and the number, the entry can be matched 100% to the entry in Handelsregister. This way Handelsregister can be queried once or twice a year to update the info in Wikidata.
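To illustrate the combined key idea (registering court plus court-assigned number), here is a tiny sketch; how such a key would actually be stored in Wikidata is still open, since the properties discussed below do not exist yet:

    # Tiny sketch of the combined court + number key described above.
    # How this would be modelled in Wikidata is open; no matching properties exist yet.

    def handelsregister_key(court, register_number):
        """Combine registering court and court-assigned number into one unambiguous key."""
        return f'{court} | {register_number}'

    # Two different courts may assign the same number, so the number alone is ambiguous:
    a = handelsregister_key('Amtsgericht Leipzig', 'HRB 32853')
    b = handelsregister_key('Amtsgericht Dresden', 'HRB 32853')
    assert a != b
    print(a)  # Amtsgericht Leipzig | HRB 32853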

Question 1: bot or other tool - how is the data added? I am keeping the bot request, but I will look at Mix'n'match first. Maybe that tool is better suited for task 1.

Question 2: modeling - which properties should be used in Wikidata? I am particularly looking for a property for the court as the registering organisation, i.e. the body that has the authority to define the identity of an organisation, and then also for the number (HRB 32853). The types, i.e. legal statuses, can be matched to existing Wikidata entries; most exist in the German Wikipedia. Any help with the other properties is appreciated.

Question 3: legal - I still need to read up on the legal situation for importing crawled data. Here is a hint given on the mailing list:

https://en.wikipedia.org/wiki/Sui_generis_database_rights You'd need to check whether in Germany it applies to official acts and registers too... https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights

Task 2: Add all missing identifiers for the remaining organisations in the Handelsregister. Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.

It should meet notability criterion 2: https://www.wikidata.org/wiki/Wikidata:Notability

  • 2. It refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references. If there is no item about you yet, you are probably not notable.

The reference is the official German business registry, which is serious and public. Organisations are also by definition clearly identifiable legal entities.

--SebastianHellmann (talk) 07:39, 16 October 2017 (UTC)

Could you make a few example entries to illustrate what the items you want to create will look like? What strategy will you use to avoid creating duplicate items? ChristianKl (talk) 12:38, 16 October 2017 (UTC)
I think this is a good idea, but I agree there needs to be a clear approach to avoiding creating duplicates - we have hundreds of thousands of organizations in wikidata now, many of them businesses, many from Germany, so there certainly should be some overlap. Also I'd like to hear how the proposer plans to keep this information up to date in future. ArthurPSmith (talk) 15:13, 16 October 2017 (UTC)
There was a discussion on the mailing list. It would be easier to complete the info for existing entries in Wikidata at first. I will check mix and match for this or other methods. Once this space is clean, we can rediscuss creating new identifiers. SebastianHellmann (talk) 16:01, 16 October 2017 (UTC)
Is there an existing ID that you plan to use for authority control? Otherwise, do we need a new property? ChristianKl (talk) 20:40, 16 October 2017 (UTC)
I think that the ID needs to be combined, i.e. registering court and register number. That might be two properties. SebastianHellmann (talk) 16:05, 29 November 2017 (UTC)
  • Given that this data is fairly frequently updated, how is it planned to maintain it?
    --- Jura 16:38, 16 October 2017 (UTC)
  • The frequency of updates is indeed large: a search for deletion announcements alone in the limited timeframe of 1.9.-15.10.17 finds 6682 deletion announcements (which legally is the most serious change and makes up approx. 10% of all announcements). So within one year, more than 50,000 companies are deleted - which should certainly be reflected in the corresponding Wikidata entries. Jneubert (talk) 15:44, 17 October 2017 (UTC)
Hi all, I updated the bot description, trying to answer all questions from the mailing list and here. I still have three questions, which I am investigating. Help and pointers highly appreciated. SebastianHellmann (talk) 23:36, 16 October 2017 (UTC)
  • Given that German is the default language in Germany I would prefer the entry to be "Sachsen Amtsgericht Leipzig HRB 32853" instead of "Saxony District court Leipzig HRB 32853". Afterwards we can store that as an external ID and make a new property for that (which would need a property proposal). ChristianKl (talk) 12:33, 17 October 2017 (UTC)
Thanks for the updated details here. It sounds like a new identifier property may be needed (unless one of the existing ones like Legal Entity Identifier (P1278) suffices, but I suspect most of the organizations in this list do not have LEI's (yet?)). Ideally an identifier property has some way to turn the identifiers into a URL link with further information on that particular identified entity, that de-referenceability makes it easy to verify - see "formatter URL" examples on some existing identifier properties. Does such a thing exist for the Handelsregister? ArthurPSmith (talk) 14:58, 17 October 2017 (UTC)

Franzsimon Kopiersperre Jklamo ArthurPSmith S.K. Givegivetake fnielsen rjlabs ChristianKl Vladimir Alexiev User:Pintoch Parikan User:Cardinha00 User:zuphilip MB-one User:Simonmarch User:Jneubert Mathieudu68 User:Kippelboy User:Datawiki30 User:PKM User:RollTide882071 Kristbaum Andber08 Sidpark SilentSpike Susanna Ånäs (Susannaanas) User:Johanricher User:Celead


Pictogram voting comment.svg Notified participants of WikiProject Companies for input.

@SebastianHellmann: for task 1, you might also be interested in OpenRefine (make sure you use the German reconciliation interface to get better results). See https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation for details of its reconciliation features. I suspect your dataset might be a bit big though: I think it would be worth trying only on a subset (for instance, filter out those with a low capital). − Pintoch (talk) 14:52, 20 October 2017 (UTC)

Concerning Task 2, I'm a bit worried about the companies' notability (or lack thereof), since the Handelsregister includes any and all companies. Not just the big ones where there's a good chance that Wikipedia articles, other sources, external IDs, etc. exist, but also tiny companies and even one-person companies, like someone selling stuff on eBay or some guy selling Christmas trees in his village. So it would be very hard to find any data on these companies outside the Handelsregister and the phone book. --Kam Solusar (talk) 05:35, 21 October 2017 (UTC)

Agreed. Do we really need to be a complete copy of the Handelsregister? What for? How about concentrating on a meaningful subset instead that addresses a clear usecase? --LydiaPintscher (talk) 10:35, 21 October 2017 (UTC)
That of course is true. A strict reading of Wikidata:Notability could be seen as that at least two reliable sources are required. But then, that could be the phone book. Do we have to make those criteria more strict? That would require a RfC. Lymantria (talk) 07:58, 1 November 2017 (UTC)
I would at least try an RfC, but I am not immediately sure what to propose.--Ymblanter (talk) 08:05, 1 November 2017 (UTC)
If there's an RfC I would say that it should say that for data-imports of >1000 items the decision whether or not we import the data should be done via a request for bot permissions. ChristianKl (talk) 12:35, 4 November 2017 (UTC)
@SebastianHellmann: is well-intended, but I agree not all companies are notable. Even worse than 1-man shops are inactive companies that nobody bothered to close yet. Just "comes from reputable source" is not enough: eg OpenStreetMaps is reputable, and it would be ok to import all power-stations (eg see Enipedia) but imho not ok to import all recyclable garbage cans. We got 950k BG companies at http://businessgraph.ontotext.com/ but we are hesitant to dump them on Wikidata. Unfortunately official trade registers usually lack measures of size or importance...
It's true the Project Companies has not gelled yet and there's no clear Community of Use for this data. On the other hand, if we don't start somewhere and experiment, we may never get big quantities of company data. So I'd agree to this German data dump by way of experiment --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)

Franzsimon Kopiersperre Jklamo ArthurPSmith S.K. Givegivetake fnielsen rjlabs ChristianKl Vladimir Alexiev User:Pintoch Parikan User:Cardinha00 User:zuphilip MB-one User:Simonmarch User:Jneubert Mathieudu68 User:Kippelboy User:Datawiki30 User:PKM User:RollTide882071 Kristbaum Andber08 Sidpark SilentSpike Susanna Ånäs (Susannaanas) User:Johanricher User:Celead


Pictogram voting comment.svg Notified participants of WikiProject Companies Pictogram voting comment.svg Comment As best I know, Project Companies has yet to gel up a workable (for the immediate term) notability standard, so the area remains fuzzy. Here is my current thinking [[30]]. I very much like the above automation of updates. Hopefully the fetching scripts for Germany can be generalizable to work in most developed countries that publish structured data on public companies. I would love to find Wikidata consensus on notability vs. its IT capacity and stomach for volumes of basically tabular data. Rjlabs (talk) 16:47, 3 November 2017 (UTC)

  • @Rjlabs: That hope is not founded because each jurisdiction does its own thing. OpenCorporates has a bunch of web crawling scripts (some of them donated) that they consider a significant IP. And as @SebastianHellmann: wrote their data is sorta open but not really. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • I Symbol support vote.svg Support importing the data. Having the data makes it easier to enter the employer when we create items for new people. Companies also engage into other actions that leave marks in databases such as registering patents or trademarks and it makes it easier to import such data when we already have items for the companies. The ability to run queries about the companies that are located in a given area is useful. ChristianKl (talk) 17:20, 3 November 2017 (UTC)
    • @ChristianKl: at least half of the 200M or so companies world-wide will never have notable employees nor patents, so "let's import them just in case" is not a good policy --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • When it comes to these mass imports, I would only want to mass-import datasets about companies from authoritative sources. If we talk about a country like Uganda, I think it would be great to have an item for all companies that truly exist in Uganda. People in Uganda care about the companies that exist in their country, and their government might not have the capability to host that data in a user-friendly way. An African app developer could profit from the existence of a unique identifier that's the same for multiple African countries.
When it comes to the concern about data not being up to date, there were multiple cases where I would have really liked data about 19th-century companies while doing research in Wikidata. Having data that's kept up to date is great, but having old data is also great. ChristianKl () 20:11, 13 December 2017 (UTC)
  • @Rjlabs: We did go back and forth with a lot of ideas on how to set some sort of criteria for company notability. I think any public company with a stock market listing should be considered notable, as there's a lot of public data available on those. For private companies we talked about some kind of size cutoff, but I suppose the existence of 2 or more independent reference sources with information about the company might be enough? ArthurPSmith (talk) 18:01, 3 November 2017 (UTC)
  • @ArthurPSmith:@Denny:@LydiaPintscher: Arthur, let's make it so that any public company that trades on a recognized stock exchange, anywhere worldwide, with a continuous bid and ask quote, and that actually trades at least once per week, is automatically considered "notable" for Wikidata inclusion. This is by virtue of the fact that real people wrote real checks to buy shares, there is sufficient continuing trading interest in the stock to make it trade at least once per week, and some exchange somewhere endows that firm with a listing. We should also note that passing this hurdle means that SOME data on that firm is automatically allowable on Wikidata, provided the data is regularly updated. Rjlabs (talk) 19:35, 3 November 2017 (UTC)
    • @Rjlabs, Denny, LydiaPintscher: Public companies are a no-brainer because there are only about 60k in the world (across roughly 2.6k exchanges); compare that to about 200M companies worldwide. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • "Some data" means (for right now) information like LEI, name, address, phone, industry code(s), and a brief text description of what they do, plus about 10 high-level fields that cover the most frequently needed company data (such as: sales, employees, assets, principal exchange(s) down to where at least 20% of the volume is traded, unique symbol on that exchange, CEO, URL to the investor relations section of the website where detailed financial statements may be found, and Central Index Key (or equivalent) with a link to regulatory filings / structured data in the primary country where it is regulated). For now that is all that should be "automatically allowable". No detailed financial statements, line by line, going back 10-20 years, with adjustments for stock splits, etc. No bid/offer/last-trade time series. Consensus on further detail has to wait for things to settle. I ping Lydia and Denny here to be sure they are good with this potential volume of linked data. (I think it would be great, a good start, and limited. I especially like it if it MANDATES an LEI, where one is available.) Moving down from here (after 100% of public companies that are alive enough to actually trade) there is of course much more. However, it is a very murky area. Two or more independent reference sources with information about the company might be too broad, causing Wikidata capacity issues, or it may be too burdensome if someone has a structured data source that is much more reliable than Wikidata to feed in, but lacks that "second source". Even if there were one absolutely assured good-quality source, and Wikidata capacity were not an issue, I'd like to see a "sustainability" requirement up front: load no private company data that isn't AUTOMATICALLY updated or expired out. Again, it would be great to have further Denny/Lydia input here on any capacity concern. Rjlabs (talk) 19:35, 3 November 2017 (UTC)
    • "A modicum of data" as you describe above is a good criterion for any company. --Vladimir Alexiev (talk)
  • At WikidataCon there was a question from the audience about whether Wikidata would be okay with importing the 400 million entries about items in museums that are currently managed by various museums. User:LydiaPintscher answered by saying that her main concerns aren't technical but whether our community does well at handling a huge influx of items. Importing data like the Handelsregister will mean that there will be a lot of items that won't be touched by humans, but I don't think that's a major concern for our community. Having more data means more work for our community, but it also means that new people get interested in interacting with Wikidata. When we make decisions like this, however, technical capabilities matter. I think it would be great if a member of the development team would write a longer blog post that explains the technical capabilities, so that we can better factor them into our policy decisions. ChristianKl (talk) 12:35, 4 November 2017 (UTC)
I agree with Lydia. The issue is hardly the scalability of the software - the software is designed in such a way that there *should* not be problems with 400M new items. The question is do we have a story as a community to ensure that these items don't just turn into dead weight. Do we ensure that items in this set are reconciled with existing items if they should be? That we can deal with attacks on that dataset in some way, with targeted vandalism? Whether the software can scale, I am rather convinced. Whether the community can scale, I think we need to learn that.
Also, for the software, I would suggest not to grow 10x at once, but rather to increase the total size of the database with a bit more measure, and never to more than double it in one go. But this is just, basically, for stress-testing it, and to discover, if possible, early unexpected issues. But the architecture itself should accommodate such sizes without much ado (again - "should" - if we really go for 10x, I expect at least one unexpected bug to show up). --Denny (talk) 23:25, 5 November 2017 (UTC)
Speaking of the community being able to handle dead weight, it seems we mostly lack the tools to do so. Currently we are somewhat flooded by items from cebwiki, and despite efforts by individual users to deal with one or another problem, we still haven't tackled them systematically; this has led to countless items with unclear scope, complicating every other import.
--- Jura 07:00, 6 November 2017 (UTC)
I don't think we should just add 400M new items in one go either. I also don't think that the amount of vandalism Wikidata faces scales directly with the number of items we host: if we double the number of items, we don't double the amount of vandalism.
As far as the cebwiki items go, the problem isn't just that there are many items. The problem is that a lot of the items have unclear scope. For me that means that when we allow massive data imports, we have to make sure that the imported data is of high enough quality that the scope of every item is clear. This means that having a bot approval process for such data imports is important, and it suggests to me that we should also get clear about the necessity of a bot approval for creating a lot of items via QuickStatements.
Currently, we are importing a lot of items via WikiCite and it seems to me that process is working without significant issues.
I agree that scaling the community should be a higher priority than scaling the number of items. One implication of that is that it makes sense to have higher standards for mass imports via bots than for items added by individuals (a newbie is more likely to become involved in our community when we don't greet them by deleting the items they created).
Another implication is that the metric we celebrate shouldn't be focused on the number of items or statements per item, but on the number of active editors. ChristianKl () 09:58, 20 November 2017 (UTC)

Now what?

Lots of good discussion above. Would anyone care to summarize, and how do we move to a decision? --Vladimir Alexiev (talk) 15:10, 5 December 2017 (UTC)

  • Some seem to consider it too granular. Maybe a test could be done with a subset. If no other criteria can be determined, maybe a start could be made with companies with a capital of more than EUR 100 million.
    --- Jura 20:21, 13 December 2017 (UTC)

Jntent's Bot 1

Jntent's Bot (talk | contribs | new items | SUL | Block log | User rights log | User rights | xtools)
Operator: Jntent (talk | contribs | logs)

Task/s:

The task is to add assertions about airports from template pages.

Code:

The code is based on pywikibot's harvest_templates.py  under scripts in https://github.com/wikimedia/pywikibot-core

Function details:


I added some constraints for literal values with regular expressions to parse "Infobox Airport" and similar templates in other languages. See the table of property constraints below.

I hope to scrape the airport templates from a few languages. Since the "Infobox Airport" template contains links to pages about airport codes, those links can be used to find the airports (see below). An example of the template:

{{Infobox airport
| name         = Denver International Airport
| image        = Denver International Airport Logo.svg
| image-width  = 250
| image2       = DIA Airport Roof.jpg
| image2-width = 250
| IATA         = DEN
| ICAO         = KDEN
| FAA          = DEN
| WMO          = 72565
| type         = Public
| owner        = City & County of Denver Department of Aviation
| operator     = City & County of Denver Department of Aviation
| city-served  = [[Denver]], the [[Front Range Urban Corridor]], Eastern Colorado, Southeastern Wyoming, and the [[Nebraska Panhandle]]
| location     = Northeastern [[Denver]], [[Colorado]], U.S.
| hub          =
...
}}

I will use links to pages about airport codes to find airports. One example is:

https://en.wikipedia.org/wiki/International_Air_Transport_Association_airport_code

Template element    Property        Constraining regex (from properties)
IATA                Property:P238   [A-Z]{3}
ICAO                Property:P239   ([A-Z]{2}|[CKY][A-Z0-9])[A-Z0-9]{2}
FAA                 Property:P240   [A-Z0-9]{3,4}
coordinates         Property:P625   six numbers and two cardinal directions separated by "|" in the coord template, e.g. {{coord|39|51|42|N|104|40|23|W|region:US-CO|display=inline,title}}
city-served         Property:P931   the first valid link (standard harvest_template.py behavior)

 – The preceding unsigned comment was added by Jntent (talk • contribs).
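
A minimal sketch, not the bot's actual code (the request says that is based on harvest_templates.py), of how the mapping table above could be applied: template parameters are validated against the constraining regexes before anything would be written as a claim. The function name, the use of mwparserfromhell, and the example page title are assumptions added here for illustration.

# Hypothetical sketch only: validate "Infobox airport" parameters against the
# regex constraints listed above before turning them into Wikidata claims.
# Assumes pywikibot and mwparserfromhell are installed; names are illustrative.
import re

import mwparserfromhell
import pywikibot

# Template parameter -> (Wikidata property, constraining regex), per the table above.
MAPPING = {
    "IATA": ("P238", re.compile(r"[A-Z]{3}")),
    "ICAO": ("P239", re.compile(r"([A-Z]{2}|[CKY][A-Z0-9])[A-Z0-9]{2}")),
    "FAA":  ("P240", re.compile(r"[A-Z0-9]{3,4}")),
}

def harvest_airport_codes(page_title):
    """Yield (property, value) pairs whose values pass the regex constraints."""
    site = pywikibot.Site("en", "wikipedia")
    page = pywikibot.Page(site, page_title)
    wikicode = mwparserfromhell.parse(page.text)
    for template in wikicode.filter_templates():
        if not template.name.matches("Infobox airport"):
            continue
        for param, (prop, pattern) in MAPPING.items():
            if not template.has(param):
                continue
            value = str(template.get(param).value).strip()
            if pattern.fullmatch(value):          # enforce the constraint up front
                yield prop, value

if __name__ == "__main__":
    # Example run against the infobox shown above.
    for prop, value in harvest_airport_codes("Denver International Airport"):
        print(prop, value)

The actual claim writing (and the coordinate and city-served handling) would follow harvest_template.py's standard behaviour; the sketch only illustrates the pre-write validation.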

  • Comment I think there were some problems with these infoboxes in one language. Not sure which one it was. Maybe Innocent bystander recalls (I think he mentioned it once).
    --- Jura 11:28, 8 July 2017 (UTC)
    Well, I am not sure if I (today) remember any such problems. But it could be worth mentioning that these codes can also be found in sv:Mall:Geobox and ceb:Plantilya:Geobox, which are used in the Lsjbot articles. These templates are not specially adapted to airports, but Lsj used the same template for this group of articles as well. The Swedish template has special parameters for this ("IATA-kod" and "ICAO-kod"), while the cebwiki articles use the parameters "free" and "free_type". (It could be worth checking free1, free2 too.) See ceb:Coyoles (tugpahanan) as an example. -- Innocent bystander (talk) 15:17, 8 July 2017 (UTC)
  • @Jntent: in this edit I see the bot replaced FDKS with FDKB, while the en.wp infobox and lead section give two values for the ICAO code: FDKS/FDKB. I would suggest not changing any existing values, or, if changed, they should probably be checked manually. The safest approach here would be to just add missing values. XXN, 14:07, 17 July 2017 (UTC)
  • @Jntent: Still interested? Matěj Suchánek (talk) 09:21, 3 August 2018 (UTC)


ZacheBot

ZacheBot (talk | contribs | new items | SUL | Block log | User rights log | User rights | xtools)
Operator: Zache (talk | contribs | logs)

Task/s: Import data from pre-created CSV lists.

Code: based on Pywikibot (Q15169668), sample import scripts [31]

Function details:

--Zache (talk) 23:29, 4 March 2017 (UTC)
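
As an illustration only (not the operator's sample scripts linked above), a minimal pywikibot-based sketch of an import from a pre-created CSV list; the column names, the file name, and the property used (P2046, area) are assumptions made for this example.

# Illustrative sketch only: read a pre-created CSV list and add one statement
# per row. Column names, the file name and the property (P2046, area) are
# assumptions, not part of the actual request.
import csv

import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

with open("lakes.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):             # expected columns: qid, area_km2
        item = pywikibot.ItemPage(repo, row["qid"])
        item.get()
        if "P2046" in item.claims:            # simple consistency check: skip existing values
            continue
        claim = pywikibot.Claim(repo, "P2046")
        claim.setTarget(pywikibot.WbQuantity(amount=row["area_km2"], site=repo))
        item.addClaim(claim, summary="Import from pre-created CSV list")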

@Zache: could you please make a couple of test edits? I do not see any lakes in the contributions of the bot.--Ymblanter (talk) 21:20, 14 March 2017 (UTC)
@Zache: Are you still planning to do this task? If so, please provide a few test edits. --Pasleim (talk) 08:13, 11 July 2017 (UTC)
Hi, I did the vaalidatahack without bot permissions, so that one is done already. The lake thing is an ongoing project, currently done using QuickStatements for single lakes, and the CC0 licence screening for larger imports is still the same. Most likely there will also be WLM-related data imports by me this summer, but I am not sure how big (most likely under 2000 items, some of which are updates to existing items and some new). User Susannaanas started this and I am continuing by filling in the details for the WLM targets. Most likely this WLM work will be done using pywikibot instead of QuickStatements, because I can do consistency checks in the code. --Zache (talk) 11:12, 11 July 2017 (UTC)

YULbot

YULbot (talk | contribs | new items | SUL | Block log | User rights log | User rights | xtools)
Operator: YULdigitalpreservation (talk | contribs | logs)

Task/s:

  • YULbot has the task of creating new items for pieces of software that do not yet have items in Wikidata.
  • YULbot will also make statements about those newly-created software items.

Code: I haven't written this bot yet.

Function details:

This bot will set the English language label for these items and create statements using publisher (P123), ISBN-13 (P212), ISBN-10 (P957), place of publication (P291), publication date (P577). --YULdigitalpreservation (talk) 18:04, 21 February 2017 (UTC)
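
Since the bot has not been written yet, here is only a hedged sketch of what the item-creation step could look like with pywikibot; the function name and all example values are placeholders, and only two of the listed properties are shown.

# Hypothetical sketch: the bot described above does not exist yet, and every
# name and value here is a placeholder.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def create_software_item(label_en, publisher_qid, pub_year):
    """Create a new software item with an English label and two of the listed claims."""
    item = pywikibot.ItemPage(repo)              # no ID given: creates a brand-new item
    item.editLabels({"en": label_en}, summary="Creating item for a software title")

    publisher = pywikibot.Claim(repo, "P123")    # publisher
    publisher.setTarget(pywikibot.ItemPage(repo, publisher_qid))
    item.addClaim(publisher)

    pub_date = pywikibot.Claim(repo, "P577")     # publication date
    pub_date.setTarget(pywikibot.WbTime(year=pub_year))
    item.addClaim(pub_date)
    return item

# usage (placeholder values): create_software_item("Some software title", "Q<publisher>", 1995)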

It would be good to run a test with a few examples so we can see what you're planning! ArthurPSmith (talk) 20:46, 22 February 2017 (UTC)
Interesting. Where does the data come from? Emijrp (talk) 12:04, 25 February 2017 (UTC)
The data is coming from the pieces of software themselves. These are pieces of software that are in the Yale Library collection. We could also supplement with data from oldversions.com. YULdigitalpreservation (talk) 13:07, 28 February 2017 (UTC)
Please let us know when the bot is ready for approval.--Ymblanter (talk) 21:12, 14 March 2017 (UTC)
@YULdigitalpreservation: any update? --DannyS712 (talk) 18:29, 10 May 2020 (UTC)

YBot

YBot (talk | contribs | new items | SUL | Block log | User rights log | User rights | xtools)
Operator: Superyetkin (talk | contribs | logs)

Task/s: import data from Turkish Wikipedia

Code: The bot, currently active on trwiki, uses the Wikibot framework.

Function details: The code imports data (properties and identifiers) from trwiki, aiming to ease the path to Wikidata Phase 3 (to have items that store the data served on infoboxes) --Superyetkin (talk) 16:42, 12 January 2017 (UTC)

It would be good if you could check for constraint violations instead of just blindly copying data from trwiki. These violations are probably all caused by the bot. --Pasleim (talk) 19:26, 15 January 2017 (UTC)
Yes, I am still interested in this. --Superyetkin (talk) 12:20, 4 March 2018 (UTC)
@Superyetkin: If that is the case, can you address the concerns raised by Pasleim by showing how you'll avoid the constraint violations? Lymantria (talk) 13:53, 31 May 2018 (UTC)
I think I can check for constraint violations using the related API method --Superyetkin (talk) 17:55, 1 June 2018 (UTC)
@Pasleim: Would that be sufficient? Lymantria (talk) 09:10, 3 June 2018 (UTC)
That API method works only for statements which have already been added to Wikidata. It would be good if some consistency check could be made prior to adding a statement. For example, the unique-value constraint of YerelNet village ID (P2123) can be checked by downloading all current values [32], importing them into an array, and then, prior to saving a statement, having the bot check whether the value is already in the array. The format constraint can be implemented in PHP with preg_match(). Item constraints don't need to be checked because they only indicate missing data, not wrong data. --Pasleim (talk) 17:52, 3 June 2018 (UTC)
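
A sketch of the pre-save checks described here, in Python rather than the bot's actual PHP (Wikibot) code, purely to illustrate the logic; the numeric format regex for P2123 is an assumption.

# Illustration only: the actual bot is PHP-based, and the format regex is assumed.
# The set of existing values would be filled once from a download of all current
# P2123 values (e.g. the query linked above).
import re

FORMAT = re.compile(r"[1-9]\d*")   # assumed format constraint for P2123 values

def should_add(value, existing_values):
    """Check the format and unique-value constraints before saving a statement."""
    if not FORMAT.fullmatch(value):
        return False                    # would violate the format constraint
    if value in existing_values:
        return False                    # would violate the unique-value constraint
    existing_values.add(value)          # reserve the value for the rest of the run
    return True

# Example: the first use of a well-formed value passes; duplicates and bad formats do not.
seen = {"12345", "67890"}
print(should_add("13579", seen))   # True
print(should_add("12345", seen))   # False: already present
print(should_add("007", seen))     # False: fails the assumed format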