Shortcuts: WD:RFBOT, WD:BRFA, WD:RFP/BOT

Wikidata:Requests for permissions/Bot

To request a bot flag, or approval for a new task, in accordance with the bot approval process, please input your bot's name into the box below, followed by the task number if your bot is already approved for other tasks.

Old requests go to the archive.

Once consensus is obtained in favor of granting the bot flag, please post requests at the bureaucrats' noticeboard.



Bot Name Request created Last editor Last edited
JneubertAutomated 3 2018-11-19, 07:41:42 Jneubert 2018-11-20, 06:32:44
MewBot 2 2018-11-13, 12:01:06 ArthurPSmith 2018-11-20, 14:57:01
Abbe98 Bot 2 2018-11-16, 12:29:40 Multichill 2018-11-18, 17:12:57
Soweego_bot_2 2018-11-05, 18:09:45 Jura1 2018-11-20, 06:50:04
JonHaraldSøbyWMNO-bot 2018-10-25, 13:00:07 Jura1 2018-10-25, 13:54:57
MewBot 2018-09-22, 09:38:20 Pamputt 2018-10-30, 21:58:48
soweego_bot 2018-09-12, 10:58:31 Lymantria 2018-09-18, 05:24:48
zbmathAuthorID 2018-08-27, 16:09:16 Ymblanter 2018-08-30, 17:46:59
ScorumMEBot 2 2018-08-06, 14:39:27 Lymantria 2018-09-01, 06:04:00
GZWDer (flood) 3 2018-07-23, 23:08:28 GZWDer 2018-09-13, 12:08:46
GZWDer (flood) 2 2018-07-16, 13:56:24 Liuxinyu970226 2018-09-15, 22:41:50
crossref bot 2018-04-19, 21:12:41 Crossref bot 2018-04-20, 14:38:57
WikiBot 2018-06-17, 15:10:10 Matěj Suchánek 2018-08-03, 09:09:13
PricezaBot 2018-06-14, 09:18:09 Praxidicae 2018-06-14, 19:29:22
schieboutct 2018-04-22, 01:39:47 Matěj Suchánek 2018-08-03, 09:09:43
wikidata get 2018-06-15, 10:51:58 Matěj Suchánek 2018-08-03, 09:15:02
Wolfgang8741 bot 2018-06-18, 02:17:10 Wolfgang8741 2018-09-05, 15:51:10
CanaryBot 2 2018-05-10, 23:46:00 Ivanhercaz 2018-05-14, 18:26:33
maria research bot 2018-03-13, 06:15:42 Mahdimoqri 2018-03-30, 14:06:13
AmpersandBot 2 2018-02-22, 01:43:22 Jura1 2018-03-12, 10:18:09
Arasaacbot 2018-01-15, 12:28:44 Matěj Suchánek 2018-08-08, 11:24:07
taiwan democracy common bot 2018-02-09, 07:09:27 Jura1 2018-03-21, 21:37:02
Newswirebot 2018-02-08, 13:00:18 Dhx1 2018-09-23, 11:53:12
KlosseBot 2017-11-17, 20:40:22 Matěj Suchánek 2018-08-03, 09:19:57
NIOSH bot 2017-11-14, 05:59:08 Ymblanter 2018-08-26, 20:33:45
neonionbot 2017-10-19, 06:15:18 ArthurPSmith 2017-10-19, 13:12:49
Handelsregister 2017-10-16, 07:39:42 Pasleim 2018-02-09, 08:46:30
Jntent's Bot 2017-06-30, 23:37:11 Matěj Suchánek 2018-08-03, 09:21:28
WikiProjectFranceBot 2017-05-08, 20:01:48 Lymantria 2018-05-31, 13:51:32
Jefft0Bot 2017-04-17, 15:16:29 Matěj Suchánek 2018-08-03, 09:18:22
MexBot 2 2017-06-08, 03:00:53 ValterVB 2017-06-25, 14:32:26
Emijrpbot 8 2017-03-25, 11:42:28 Matěj Suchánek 2017-06-09, 06:47:22
ZacheBot 2017-03-04, 23:29:38 Zache 2017-07-11, 11:13:15
YULbot 2017-02-21, 18:05:13 YULdigitalpreservation 2018-03-06, 13:15:37
YBot 2017-01-12, 16:43:19 Pasleim 2018-06-03, 17:52:12


JneubertAutomated 3

JneubertAutomated (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Jneubert (talk • contribs • logs)


Task/s:

Create items missing in Wikidata from 20th century press archives (Q36948990) (PM20) metadata

Code:

https://github.com/jneubert, https://github.com/zbw/sparql-queries

Function details:

A federated SPARQL query creates a list of PM20 entries (currently companies; the same approach could work for persons and other types), which is consumed by a script that transforms it into QuickStatements. The query makes sure that the PM20 folder ID (P4293) does not already exist in Wikidata, and it can restrict the entries to a subset that has already been checked in a Mix'n'match catalog.

Example items are Ruberoidwerke (Q58712849) or Banque d'Anvers (Q58718259). According to this discussion, the script has been extended to include official name (P1448).
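A minimal sketch of such a pipeline (not the operator's actual code, which lives in the repositories linked above) might look like this; it assumes the list of PM20 entries has already been retrieved from the PM20 endpoint, and the P31 target used for companies is an illustrative assumption:

```python
# Illustrative sketch only. Input: PM20 company entries (folder ID, English label)
# already fetched via the federated SPARQL query described above.
import requests

WDQS = "https://query.wikidata.org/sparql"

def pm20_id_exists(folder_id):
    """Check whether some item already carries this PM20 folder ID (P4293)."""
    query = 'ASK { ?item wdt:P4293 "%s" }' % folder_id
    r = requests.get(WDQS, params={"query": query, "format": "json"})
    r.raise_for_status()
    return r.json()["boolean"]

def to_quickstatements(entries):
    """Emit QuickStatements v1 lines for entries not yet in Wikidata."""
    for folder_id, label_en in entries:
        if pm20_id_exists(folder_id):
            continue  # already present, skip
        yield "CREATE"
        yield 'LAST\tLen\t"%s"' % label_en
        yield 'LAST\tP31\tQ4830453'            # instance of: business (assumption)
        yield 'LAST\tP4293\t"%s"' % folder_id  # PM20 folder ID

if __name__ == "__main__":
    sample = [("co/000123", "Example AG")]     # hypothetical entry
    print("\n".join(to_quickstatements(sample)))
```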

MewBot 2

MewBot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Rua (talk • contribs • logs)

Task/s: Importing attested lexemes from en.wiktionary

Code:

Function details: Since Wikidata:Requests for permissions/Bot/MewBot doesn't seem to be going anywhere, I'd like to start with importing attested lexemes instead. I would like to import Northern Sami lemmas (Wiktionary's term for lexeme) onto Wikidata. Only the lexeme itself will be imported; no senses, no forms and no etymology. It will import the Álgu ID (P5903) property if possible, though. This is again a feasibility study, to see how well it works. A problem that I foresee is that of homographs having the same part of speech. When there are already one or more lexemes with the same spelling and part of speech on Wikidata, the bot has no way of telling which of them belongs to which Wiktionary lemma. Moreover, if they were imported, there'd be no way to distinguish them on Wikidata either and they'd look like duplicate lexemes until senses and/or etymology are added. The bot will skip these, but it will mean that some lexemes cannot be imported easily. A possible solution would be to add the Wikidata lexeme ID into the wikicode on Wiktionary's side, but I doubt the people on Wiktionary would like that as they seem to be a bit allergic to Wikidata.
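A hedged sketch of the homograph check described above (this is not MewBot's code): before creating a lexeme, ask the query service whether a lexeme with the same lemma and lexical category already exists; Q33947 (Northern Sami) and Q1084 (noun) are used as example values.

```python
# Sketch only: skip lemmas for which one or more matching lexemes already exist.
import requests

WDQS = "https://query.wikidata.org/sparql"

def existing_lexemes(lemma, language_qid, category_qid):
    query = """
    SELECT ?lexeme WHERE {
      ?lexeme dct:language wd:%s ;
              wikibase:lemma ?lemma ;
              wikibase:lexicalCategory wd:%s .
      FILTER(STR(?lemma) = "%s")
    }""" % (language_qid, category_qid, lemma)
    r = requests.get(WDQS, params={"query": query, "format": "json"})
    r.raise_for_status()
    return [b["lexeme"]["value"] for b in r.json()["results"]["bindings"]]

# Example: Northern Sami (Q33947) noun (Q1084) "giella"
if existing_lexemes("giella", "Q33947", "Q1084"):
    print("possible homograph already on Wikidata, skipping")
```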

The code will be adapted from the first proposal, so it has already been demonstrated to work. I only need to change the language and remove the code that imports the etymology. I would like to import other languages using this method later, so that Wikidata will have a good set of lexemes to start with and other users can then add the data to them as they see fit. Having the lexemes already present also makes adding etymologies easier.

Copyright shouldn't be an issue, as lemmas and parts of speech don't seem like copyrightable things to begin with.

--—Rua (mew) 12:00, 13 November 2018 (UTC)

  • Support Can't you distinguish homographs via the ID property? ArthurPSmith (talk) 15:20, 19 November 2018 (UTC)
    • On Wikidata, yes, but how do you know which of them a particular Wiktionary lemma belongs to? —Rua (mew) 15:40, 19 November 2018 (UTC)
      • I meant Álgu ID (P5903) - doesn't that uniquely identify each homograph? But I guess you indicated it was not available for all of them. ArthurPSmith (talk) 14:57, 20 November 2018 (UTC)

Abbe98_Bot

Abbe98_Bot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Abbe98 (talk • contribs • logs)

Task/s: Import of IIIF manifests for artworks from the Nationalmuseum (Sweden). Example edits: https://www.wikidata.org/w/index.php?limit=50&title=Special%3AContributions&contribs=user&target=Abbe98+Bot&namespace=&tagfilter=&topOnly=1&start=2018-11-16&end=2018-11-16

Code: https://gist.github.com/Abbe98/67f6702478a4b556609cb186162a60a6

Function details: Imports IIIF Manifests for artworks already in Wikidata together with a source claim. --Abbe98 (talk) 12:29, 16 November 2018 (UTC)
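For illustration only, a hedged pywikibot sketch of this kind of import (the actual code is in the gist above); the manifest property and source item IDs below are placeholders, not confirmed values:

```python
# Sketch with placeholder IDs -- not the code from the gist above.
import pywikibot

repo = pywikibot.Site("wikidata", "wikidata").data_repository()

P_MANIFEST = "P0000"     # placeholder for the property holding the IIIF manifest URL
P_STATED_IN = "P248"     # stated in, used for the source reference
Q_SOURCE = "Q0000"       # placeholder for the Nationalmuseum source item

def add_manifest(qid, manifest_url):
    item = pywikibot.ItemPage(repo, qid)
    claim = pywikibot.Claim(repo, P_MANIFEST)
    claim.setTarget(manifest_url)                      # URL-valued statement
    item.addClaim(claim, summary="Adding IIIF manifest")
    source = pywikibot.Claim(repo, P_STATED_IN)        # source claim, as described above
    source.setTarget(pywikibot.ItemPage(repo, Q_SOURCE))
    claim.addSources([source])
```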

I renamed the request because we already had Wikidata:Requests for permissions/Bot/Abbe98 Bot. Looks good to me. Multichill (talk) 17:12, 18 November 2018 (UTC)

Feedback

  • @Lymantria, Hjfocs: in the sample edits, it duplicated "official website" and IMDb ID to "described at URL". --- Jura 19:33, 18 November 2018 (UTC)
@Jura1: thank you very much for your feedback. I have posted a reply here to avoid further modification of this archived page: User_talk:Jura1#Thanks_for_your_feedback_on_User:Soweego_bot_task_2


Dear Jura1 (talkcontribslogs),

In Wikidata:Requests_for_permissions/Bot/Soweego_bot_2, you mentioned 2 important points. Let me explain.

  1. IMDb: the bot sees raw URLs from target sources like MusicBrainz (Q14005) and tries its best to convert them to known external identifiers. To do so, it attempts to match each given input URL against all formatter URL (P1630) values of external identifier properties, and to extract the correct identifier through format as a regular expression (P1793). This is done via SPARQL queries (a sketch of this matching logic follows below the list). Unfortunately, IMDb ID (P345) seems to have an exotic formatter URL marked as preferred statement: https://tools.wmflabs.org/wikidata-externalid-url/?p=345&url_prefix=https://www.imdb.com/&id=$1. It will never match an IMDb input URL, which is why you are seeing them added as is. Do you have any suggestions to avoid this? Of course, consider that building a custom rule for each exception is not a sustainable solution;
  2. official website (P856) vs. described at URL (P973): I totally understand this point and will implement an extra check for official website (P856) values.
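A hedged sketch of the matching logic described in point 1 (not the actual soweego code): build one regular expression per external-ID property from its formatter URL (P1630) and format as a regular expression (P1793), then try each against a raw input URL.

```python
# Sketch only: resolve raw URLs to (property, identifier) pairs where possible.
import re
import requests

WDQS = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?property ?formatter ?pattern WHERE {
  ?property wikibase:propertyType wikibase:ExternalId ;
            wdt:P1630 ?formatter ;
            wdt:P1793 ?pattern .
}"""

def load_resolvers():
    data = requests.get(WDQS, params={"query": QUERY, "format": "json"}).json()
    resolvers = []
    for b in data["results"]["bindings"]:
        formatter, id_pattern = b["formatter"]["value"], b["pattern"]["value"]
        # Escape the formatter URL, then put the ID pattern back in place of "$1".
        url_regex = re.escape(formatter).replace(re.escape("$1"), "(%s)" % id_pattern)
        resolvers.append((b["property"]["value"], re.compile(url_regex)))
    return resolvers

def resolve(url, resolvers):
    for prop, regex in resolvers:
        match = regex.fullmatch(url)
        if match:
            return prop, match.group(1)
    return None   # no formatter matched: fall back to a plain URL statement

# An exotic formatter URL, like the preferred one for IMDb ID (P345), will simply
# never match an input URL -- exactly the failure mode described in point 1.
```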

Thanks again for your precious comments.

Cheers,
Hjfocs (talk) 09:46, 19 November 2018 (UTC)

Sounds good. BTW, I added the comment above as well, as it might be easier for other people to find. Further, I included the recommended "new section" header. --- Jura 06:49, 20 November 2018 (UTC)

JonHaraldSøbyWMNO-bot

JonHaraldSøbyWMNO-bot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Jon Harald Søby (WMNO) (talk • contribs • logs)

Task/s: Add items (and keep them up-to-date) from the Sami bibliography from the National Library of Norway to Wikidata.

Code: Not published yet (but will be eventually, see phab:T205631 et al)

Function details: As part of Wikimedia Norge's Northern Sami project, we have prepared an import of the data from the Sami bibliography to Wikidata. The Sami bibliography is a listing of all works published in Sami languages or about Sami people/culture in Norway. It contains around 26,000 work editions with plenty of metadata, and the items will be structured according to the standards laid out in Wikidata:WikiProject Books. I am also planning to write a script to keep the data up-to-date, but the first priority is doing the import. --Jon Harald Søby (WMNO) (talk) 12:59, 25 October 2018 (UTC)

  • interesting. Please do some tests once ready. --- Jura 13:54, 25 October 2018 (UTC)

MewBot

MewBot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Rua (talk • contribs • logs)

Task/s: Importing lexemes from en.Wiktionary in specific languages

Code:

Function details: The bot will be used to parse entries from English Wiktionary using pywikibot and mwparserfromhell, and then either create lexemes on Wikidata, or add information to existing lexemes. Care is taken to not duplicate information: the script checks if the lexeme exists and already has the desired properties and only adds anything if not. In case of doubt (e.g. multiple matching lexemes already exist) it skips the edit. I made some test edits using my own user account, they can be seen from [3] to [4]. Today I did a few on the MewBot account.
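A hedged sketch of the parsing step described above (not MewBot's actual code), using pywikibot and mwparserfromhell as mentioned:

```python
# Sketch only: fetch an en.wiktionary page and extract its headword templates.
import pywikibot
import mwparserfromhell

enwikt = pywikibot.Site("en", "wiktionary")

def headword_templates(title):
    page = pywikibot.Page(enwikt, title)
    wikicode = mwparserfromhell.parse(page.text)
    # "head" is the generic headword template; language-specific headword
    # templates would need their own handling.
    return [t for t in wikicode.filter_templates() if t.name.matches("head")]

# The import step would then look up existing lexemes with the same lemma and
# lexical category and skip the edit in case of doubt, as described above.
```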

Individual imports will be proposed at the lexicographical data project first, as the project leaders have asked for care with imports at first. The current proposal is for Proto-Samic and Proto-Uralic lexemes, see Wikidata talk:Lexicographical data#Requesting permission for bot import: Proto-Uralic and Proto-Samic lexemes. Once the project leaders give the OK for all imports, permission will no longer be needed for individual imports. Planned future imports are for Dutch and the modern Sami languages. --—Rua (mew) 09:37, 22 September 2018 (UTC)

I am ready to approve this request in a couple of days, provided that no objections are raised in the meantime. Lymantria (talk) 05:27, 25 September 2018 (UTC)
I just noticed that Wikidata:Bots says I need to indicate where the bot copied the data from. How do I indicate that the data came from Wiktionary? —Rua (mew) 10:51, 25 September 2018 (UTC)
Could you run your bot on few entries in order to evaluate it? Thanks in advance. Pamputt (talk) 10:59, 26 September 2018 (UTC)
I did, already. Do I need to do more? —Rua (mew) 11:02, 26 September 2018 (UTC)
Oppose Ah sorry, I did not check before asking. For all reconstructed forms, I think a reference is mandatory. As these "words" do not exist, they come from specialists' work and have to be sourced. Two linguists may reconstruct different forms. That said, I am not sure about the copyright status of reconstructed forms. They probably belong to the public domain as scientific work, but it would be better to be sure. Pamputt (talk) 21:42, 26 September 2018 (UTC)
Not all reconstructions on Wiktionary can be sourced to some external work. Some were reconstructed by Wiktionary editors. This is because not all reconstructed forms are available in external works, and we have to fill the gaps ourselves. The bot adds links to Álgu and Uralonet if one exists. —Rua (mew) 22:26, 26 September 2018 (UTC)
I strongly disagree with importing reconstructed forms that do not come from scientific works. One needs criteria to accept such forms, and an academic paper is a good one. Otherwise, anyone can guess their own form. So if you run your bot, please import only "validated" forms. Pamputt (talk) 14:18, 27 September 2018 (UTC)
I agree with that. Only sourced reconstructed forms should be imported. Unsui (talk) 15:50, 27 September 2018 (UTC)
Wiktionary's goal is to be an alternative to existing dictionaries, including etymological dictionaries, not to be dependent on them. The criterion used by Wiktionary is that reconstructions follow established sound laws. Some reconstructions from linguistic sources don't pass that criterion. It fits with the general policy on Wiktionary of not blindly copying from dictionaries but making sure that forms make sense. Reconstructions that are questionable, whether from an external source or not, can be discussed and deleted if found to be invalid. If you have doubts about any of the reconstructions on Wiktionary, you should discuss it there.
That said, what should be done if words in different languages come from a common source, but there is no source that gives a reconstruction? Can lemmas be empty? —Rua (mew) 15:54, 27 September 2018 (UTC)
Here are some cases where Wiktionary has had to correct errors and omissions in sources. I provide a link to Wiktionary, and a link to Álgu, which gives its source.
...and many more. So you see if we have to rely on sources, we become vulnerable to errors, whereas we can correct those errors on Wiktionary, making it more reliable. If Wikidata can't apply the same level of scientific rigour then that is rather worrying. —Rua (mew) 16:42, 27 September 2018 (UTC)
Wiktionary's goal is to be an alternative for existing dictionaries, including etymological dictionaries, not to be dependent on them.
This is maybe the case on the English Wiktionary, but on the French Wiktionary original work on etymology is not allowed; all etymological information has to be sourced. Still, Wikidata has to define its own criteria, and regarding reconstructed forms nothing has been decided so far. About your question "what do we do when a source gives wrong information", I would say that in this case we set a deprecated rank. Pamputt (talk) 19:05, 27 September 2018 (UTC)
You say, for example, "North Sami requires final *ā". OK, but why not *ö? Because linguists have defined laws for this language. It is always linguists' work. Hence, it is possible to put a reference. Otherwise anything may be created as a reconstructed form. Unsui (talk) 07:16, 28 September 2018 (UTC)
That's nonsense. It still has to stand up to scrutiny. —Rua (mew) 10:02, 28 September 2018 (UTC)
  • For how many new ones is this? --- Jura 11:11, 26 September 2018 (UTC)
  • Oppose for now. It's unclear how many would be imported and we need to solve the original research question first. --- Jura 08:03, 27 September 2018 (UTC)
    Can you elaborate? I don't see what the problem is. —Rua (mew) 10:07, 27 September 2018 (UTC)
    Apparently, you don't know how many you plan to import. --- Jura 10:12, 27 September 2018 (UTC)
    I gave a link to the categories in the other discussion. —Rua (mew) 10:20, 27 September 2018 (UTC)
    • Can you make a reliable statement? Categories tend to evolve and change subcategories. --- Jura 10:22, 27 September 2018 (UTC)
    wikt:Category:Proto-Samic lemmas currently contains 1303 entries. —Rua (mew) 10:25, 27 September 2018 (UTC)
  • I've made a post regarding the import and the conflict in Wiktionary vs Wikidata's policies: wikt:WT:Beer parlour/2018/September#What is Wiktionary's stance on reconstructions missing from sources?. —Rua (mew) 17:36, 27 September 2018 (UTC)
  • Is there any news on this? —Rua (mew) 10:08, 17 October 2018 (UTC)
    @Jura1:, are you fine now with the approval of this bot?--Ymblanter (talk) 13:01, 21 October 2018 (UTC)
    • I will try to write something tomorrow. --- Jura 18:21, 21 October 2018 (UTC)
    • First: sorry for the delay. The question what to do with lexemes reconstructed at Wiktionary remains open. In general, we would only import information from other WMF sites when we know or can assume that it can be referenced to other quality sources. This isn't the case here. One could argue that Wiktionary is an independent dictionary website and should be considered a reference on its own. Whether or not this is the case depends on how Wikidata and the various Wiktionaries will work going forward. The closer Wiktionary and Wikidata would work together going forward the less we can consider it as such. --- Jura 04:14, 25 October 2018 (UTC)
      • The majority of the Proto-Samic entries on Wiktionary does have an Álgu ID (P5903). Proto-Uralic entries mostly have Uralonet ID (P5902), but the lemma is not always identical to the form given on Uralonet, for which User:Tropylium is mostly responsible as the primary Uralic expert on Wiktionary. Would it be acceptable to import only those entries that have one of these IDs?
      • If so, that leaves the question of what to do with the remainder. It would be a shame if these can't be included in Wikidata, and would mean that Wiktionary is always more complete than Wikidata can be. Words that have an etymology on Wiktionary would have none on Wikidata, because of the Proto-Samic ancestral form being missing. —Rua (mew) 18:43, 30 October 2018 (UTC)
    @Rua: yes, importing lexemes that have Álgu ID (P5903) or Uralonet ID (P5902) is fine with me. However, lexemes for which the lemma is not identical to the form given on Uralonet should not be imported, because they are not verifiable. They have to match what the source says. Pamputt (talk) 21:58, 30 October 2018 (UTC)
  • Now pinging @Pamputt: as well.--Ymblanter (talk) 20:02, 21 October 2018 (UTC)
    I did not change my opinion, because this bot wants to import reconstructed forms without any academic references. If the bot uses academic works as sources, it is fine with me; if not, I oppose (and the discussion shows that we are in the latter case). Pamputt (talk) 20:08, 21 October 2018 (UTC)

zbmathAuthorID

zbmathAuthorID (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Zbmath authorid (talk • contribs • logs)

Task/s: Add the external ID zbmath author ID (P1556) to Wikidata items of mathematicians, based on manually checked data curation at zbmath.org

Code:

Function details: The mathematical bibliographic database zbmath.org (Q18241050) maintains links to several services and databases, among others to Wikidata (e.g. https://zbmath.org/authors/?q=ai:dieudonne.jean-alexandre links to Q371957). It currently has 11,000 such links, half of which have been established manually. I would like to register a bot that, for each of these zbmath→Wikidata links, stores the corresponding back link Wikidata→zbmath on the Wikidata side, i.e. adds one P1556 claim. It would run on a daily basis, with a very low load (approximately 5 edits a day).
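A hedged sketch of such a daily back-link job (not the operator's code); input is assumed to be (QID, zbMATH author ID) pairs exported from zbmath.org:

```python
# Sketch only: add zbMATH author ID (P1556) where it is not yet present.
import pywikibot

repo = pywikibot.Site("wikidata", "wikidata").data_repository()

def add_backlink(qid, zbmath_id):
    item = pywikibot.ItemPage(repo, qid)
    item.get()
    if "P1556" in item.claims:     # back link already stored
        return
    claim = pywikibot.Claim(repo, "P1556")
    claim.setTarget(zbmath_id)
    item.addClaim(claim, summary="Adding zbMATH author ID from zbmath.org")

# Example pair taken from the description above (Jean Dieudonné):
add_backlink("Q371957", "dieudonne.jean-alexandre")
```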

--Zbmath authorid (talk) 16:09, 27 August 2018 (UTC)

Please make some test edits.--Ymblanter (talk) 17:46, 30 August 2018 (UTC)

ScorumMEBot

ScorumMEBot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: ScorumME (talk • contribs • logs)

Task/s: Creation and updating of football statistics on Wikidata; at the moment these are wins, losses, draws, goals scored and goals conceded in the league for certain teams. The bot will work in semi-manual mode. I am ready to take responsibility for all edits made by the bot.

Function details: The bot is written in Node.js and uses the Wikidata Edit and Wikidata SDK libraries by Maxlath (https://www.wikidata.org/wiki/Wikidata:Tools/External_tools). The server listens to a feed that provides real-time football match statistics and, after parsing it, sends requests to update the corresponding Wikidata data. All request rate limits are respected.

Information about the identifiers of the Wikidata records is stored in the bot's database.

Question: How long should we wait for your decision on our request? – The preceding unsigned comment was added by ScorumMEBot (talk • contribs).
Please perform some test edits. Lymantria (talk) 06:03, 1 September 2018 (UTC)

GZWDer (flood) 3

GZWDer (flood) (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: GZWDer (talk • contribs • logs)

Task/s: Creating items for all Unicode characters

Code: Unavailable for now

Function details: Creating items for 137,439 characters (probably excluding those not in Normalization Forms):

  1. Label in all languages (if the character is printable; otherwise only Unicode name of the character in English)
  2. Alias in all languages for U+XXXX and in English for Unicode name of the character
  3. Description in languages with a label of Unicode character (P487)
  4. instance of (P31)Unicode character (Q29654788)
  5. Unicode character (P487)
  6. Unicode hex codepoint (P4213)
  7. Unicode block (P5522)
  8. writing system (P282)
  9. image (P18) (if available)
  10. HTML entity (P4575) (if available)
  11. For characters in Han script also many additional properties; see Wikidata:WikiProject CJKV character

For characters with existing items the existing items will be updated.
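As a hedged illustration of where much of this per-character data can come from, Python's unicodedata module already provides the name, printable form and hex codepoint used in the list above; block, script and image statements would need additional data sources:

```python
# Sketch only: derive a few of the statements listed above for one codepoint.
import unicodedata

def character_statements(codepoint):
    char = chr(codepoint)
    name = unicodedata.name(char, "")        # e.g. "GREEK CAPITAL LETTER OMEGA"
    return {
        "label": char if char.isprintable() else name,
        "aliases": ["U+%04X" % codepoint, name],
        "P31": "Q29654788",                  # instance of: Unicode character
        "P487": char,                        # Unicode character
        "P4213": "%04X" % codepoint,         # Unicode hex codepoint
    }

print(character_statements(0x03A9))          # Ω (GREEK CAPITAL LETTER OMEGA)
```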

Question: Do we need only one item for characters with the same normalized forms, e.g. Ω (U+03A9, GREEK CAPITAL LETTER OMEGA) and Ω (U+2126, OHM SIGN)?--GZWDer (talk) 23:08, 23 July 2018 (UTC)

CJKV characters belonging to CJK Compatibility Ideographs (Q2493848) and CJK Compatibility Ideographs Supplement (Q2493862), such as 著 (U+FA5F) (Q55726748) and 著 (U+2F99F) (Q55738328), will need to be split from their normalized form, e.g. (Q54918611), as each of them has different properties. KevinUp (talk) 14:03, 25 July 2018 (UTC)

Request filed per suggestion on Wikidata:Property proposal/Unicode block.--GZWDer (talk) 23:08, 23 July 2018 (UTC)

Support I have already expressed my wish to import such a dataset. Matěj Suchánek (talk) 09:25, 25 July 2018 (UTC)
Support @GZWDer: Thank you for initiating this task. Also, feel free to add yourself as a participant of Wikidata:WikiProject CJKV character. [14] KevinUp (talk) 14:03, 25 July 2018 (UTC)
Support Thank you for your contribution. If possible, I hope you will also add other code (P3295) values such as JIS X 0213 (Q6108269) and Big5 (Q858372) to the items you create or update. --Okkn (talk) 16:35, 26 July 2018 (UTC)
  • Oppose the use of the flood account for this. Given the problems with unapproved, defective bot runs under the "GZWDer (flood)" account, I'd rather see this being done with a new account named "bot" as per policy.
    --- Jura 04:50, 31 July 2018 (UTC)
  • Perhaps we could do a test run of this bot with some of the 88,889 items required by Wikidata:WikiProject CJKV character and take note of any potential issues with this bot. @GZWDer: You might want to take note of the account policy required. KevinUp (talk) 10:12, 31 July 2018 (UTC)
  • This account has had a bot flag for over four years. While most bot accounts contain the word "bot", there is nothing in the bot policy that requires it, and a small number of accounts with the bot flag have different names. As I understand it, there is also no technical difference between an account with a flood flag and an account with a bot flag, except for who can assign and remove the flags. - Nikki (talk) 19:14, 1 August 2018 (UTC)
  • The flood account was created and authorized for activities that aren't actually bot activities, while this new task is one. Given that defective bot tasks have already been run with the flood account, I don't think any actual bot tasks should be authorized. It's enough that I already had to clean up tens of thousands of GZWDer's edits.
    --- Jura 19:46, 1 August 2018 (UTC)
I am ready to approve this request, after a (positive) decision is taken at Wikidata:Requests for permissions/Bot/GZWDer (flood) 4. Lymantria (talk) 09:11, 3 September 2018 (UTC)
  • Wouldn't these fit better into Lexeme namespace? --- Jura 10:31, 11 September 2018 (UTC)
    There is no language with all Unicode characters as lexemes. KaMan (talk) 14:31, 11 September 2018 (UTC)
    Not really a problem. language codes provide for such cases. --- Jura 14:42, 11 September 2018 (UTC)
    I'm not talking about language code but language field of the lexeme where you select q-item of the language. KaMan (talk) 14:46, 11 September 2018 (UTC)
    Which is mapped to a language code. --- Jura 14:48, 11 September 2018 (UTC)
Note: I'm going to be inactive due to real-life issues, so this request is On hold for now. Comments are still welcome, but I will not be able to answer them until January 2019.--GZWDer (talk) 12:08, 13 September 2018 (UTC)

GZWDer (flood) 2

GZWDer (flood) (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: GZWDer (talk • contribs • logs)

Task/s: Create new items and improve existing items from cebwiki and srwiki

Code: Run via various Pywikibot scripts (probably together with other tools)

Function details: The work include several steps:

  1. Create items from w:ceb:Kategoriya:Articles without Wikidata item (plan to do together with step 2)
  2. Import GeoNames ID (P1566) for pages from w:ceb:Kategoriya:GeoNames ID not in Wikidata
  3. Import coordinate location (P625) for pages from w:ceb:Kategoriya:Coordinates not on Wikidata‎
  4. Add country (P17) for cebwiki items
  5. Add instance of (P31) for cebwiki items
  6. (probably) Add located in the administrative territorial entity (P131) for cebwiki items
  7. (probably) Add located in time zone (P421) for cebwiki items
  8. Add descriptions in Chinese and English for cebwiki items (only if steps 4 and 5 are completed)

For srwiki, the actions are similar.
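For illustration, a hedged sketch of steps 1-2 with a pre-creation duplicate check on the GeoNames ID (this is not the operator's script, and extract_geonames_id is a hypothetical helper):

```python
# Sketch only: iterate unconnected cebwiki pages and skip likely duplicates.
import re
import pywikibot
from pywikibot import pagegenerators

ceb = pywikibot.Site("ceb", "wikipedia")
repo = pywikibot.Site("wikidata", "wikidata").data_repository()

def extract_geonames_id(wikitext):
    """Hypothetical helper: pull a GeoNames ID out of the article wikitext."""
    m = re.search(r"geonames\.org/(\d+)", wikitext)
    return m.group(1) if m else None

def item_with_geonames_id(geonames_id):
    """Return an item already carrying this GeoNames ID (P1566), if any."""
    query = 'SELECT ?item WHERE { ?item wdt:P1566 "%s" } LIMIT 1' % geonames_id
    for item in pagegenerators.WikidataSPARQLPageGenerator(query, site=repo):
        return item.title()
    return None

category = pywikibot.Category(ceb, "Kategoriya:Articles without Wikidata item")
for page in category.articles():
    geonames_id = extract_geonames_id(page.text)
    if geonames_id and item_with_geonames_id(geonames_id):
        continue   # probably a duplicate of an existing item: skip or flag for merging
    # ...otherwise create the item and add P1566, as in steps 1-2 above
```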

--GZWDer (talk) 13:56, 16 July 2018 (UTC)

Note: until phab:T198396 is fixed, this can only be done step by step, not multiple tasks at a time.--GZWDer (talk) 14:02, 16 July 2018 (UTC)
Support Thank you for your elaboration! Keeping to my word now. Mahir256 (talk) 13:59, 16 July 2018 (UTC)
@Mahir256: Please unblock the bot account. I'm not going to import more statements from cebwiki (and srwiki) until the discussion is closed, and I have several other (low-speed) uses of the bot account.--GZWDer (talk) 14:01, 16 July 2018 (UTC)
Yes, I did that, as I said I would do. Although @GZWDer: what will differ in your procedure with regard to the srwiki items? A lot of those places might have eswiki article equivalents (with the same INEGI code (Q5796667)); do you plan to link these if they exist? Mahir256 (talk) 14:02, 16 July 2018 (UTC)
The harvest_template script can not check duplicates and duplicates can only be found after data is imported (this may be a bug, though).--GZWDer (talk) 14:04, 16 July 2018 (UTC)
@Pasleim: Would this functionality be easy to add to the tool? It certainly seems desirable, especially with regard to GeoNames IDs. Mahir256 (talk) 14:06, 16 July 2018 (UTC)
See phab:T199698. I do not use Pasleim's harvest template tool because the tool stops automatically when it encounters errors (it should retry the edit; if it hits the rate limit, it should retry after some time)--GZWDer (talk) 14:10, 16 July 2018 (UTC)
Oppose cebwiki is, as many users have noted, the black hole of wikis. This so-called "data" has too many mistakes. --Liuxinyu970226 (talk) 14:15, 16 July 2018 (UTC)
  • Oppose Needs to do far more checking as to whether related items already exist, to add the information and sitelink to existing items if possible, and to appropriately relate the new item to existing items if not. If other items already have any matching identifiers (but are e.g. linked to a different ceb-wiki item), or there is any other reason to think it may be a duplicate, then any new item should be marked instance of (P31) Wikimedia duplicated page (Q17362920) as its only P31, and be linked to the existing item by said to be the same as (P460). Jheald (talk) 14:19, 16 July 2018 (UTC)
    • Duplicates are easier to find after they are imported to Wikidata than on cebwiki.--GZWDer (talk) 14:24, 16 July 2018 (UTC)
@Jheald: It may be worth our time (or worth the time of those who already make corrections on cebwiki) to go to GeoNames and correct things our(them)selves so that in the event Lsjbot returns it doesn't recreate these duplicates. Mahir256 (talk) 14:34, 16 July 2018 (UTC)
@GZWDer: I try bloody hard to avoid creating new items that are duplicates, going to considerable lengths with off-line scripts and augmenting existing data to avoid doing so; and doing my level best to clear up any that have slipped online, as quickly as I can. I don't see why I should expect less from anybody else. Jheald (talk) 14:45, 16 July 2018 (UTC)
  • Comment Given the capacity problems of Wikidata and the fact that cebwiki is practically dormant, I don't think this should be done. Somehow I doubt the operator will do any of the announced maintenance, as I think they announced that a couple of months back and then left it to other Wikidata users. So no, not another 600,000 items. For the general discussion, see Wikidata:Project_chat#Another_cebwiki_flood?.
    --- Jura 20:18, 16 July 2018 (UTC)
    • cebwiki is not dormant as the articles are still being maintained.--GZWDer (talk) 00:30, 17 July 2018 (UTC)
    • Is there a way to see this on ceb:? I take it that any user on ceb:Special:Recent changes without a local user page isn't really active there.
      --- Jura 04:41, 26 July 2018 (UTC)
  • Oppose Per Jheald. Planning on the basis that it "is much easier to find such duplicates if the data is stored in a structured way", and thus deliberately importing duplicates (which won't be merged within a very short time), is an abuse of Wikidata and our resources. Resources spent on cleaning the mess of some origin are missing at other places to bring high-quality data to other wikis and elsewhere. The duplicates are a big problem; they pop up in search and queries etc. Sitelinks might be added after the data is cleaned off-Wikidata (if cleaning is feasible at all; perhaps deletion of articles on cebwiki is a better solution than importing cebwiki sitelinks here). --Marsupium (talk) 23:26, 18 July 2018 (UTC)
    • Duplicates already exist everywhere in Wikidata, so it is not guaranteed that different items refer to different concepts (though that is usually the case), and nobody should use search and query results directly without care. Searches are not intended to be used directly by third-party users. For queries, if data consumers really think duplicates in Wikidata query results are an issue, they can choose to exclude cebwiki-only items from the query result.--GZWDer (talk) 23:45, 18 July 2018 (UTC)
  • Oppose Thanks a lot for your work on other wikis, it is immensely useful, but this workflow is really not appropriate for cebwiki. Creating new cebwiki items without being certain that they do not duplicate existing items creates a significant strain on the community. It is not okay to expect people to find ways to exclude cebwiki-only items in query results as a result: these items should not be created in the first place. − Pintoch (talk) 09:55, 19 July 2018 (UTC)
    • Probably 90% of entries are unique to cebwiki. It may be wise to import these unique entries first.--GZWDer (talk) 16:38, 20 July 2018 (UTC)
      • Well, whatever the actual percentage is, many of us have painfully experienced that it is way too low for our standards. It may be wise to be more considerate to your fellow contributors, and stop hammering the server too. A lot of people have complained about cebwiki item creations, and it is really a shame that a block was necessary to actually get you to stop. So I really stand by my oppose. − Pintoch (talk) 07:34, 21 July 2018 (UTC)
    • The approach outlined above doesn't really address any of the problems with the data.
      --- Jura 04:41, 26 July 2018 (UTC)

Plan 2

The plan only does:

  1. Create items from w:ceb:Kategoriya:Articles without Wikidata item (plan to do together with step 2)
  2. Import GeoNames ID (P1566) for pages

Therefore:

  1. It is easier to find articles that exist in other Wikipedias by search and projectmerge (and possibly Mix'n'match and other tools)
  2. It is also possible to find entries from the GeoNames ID, and vice versa
  3. As no other data will be imported in plan 2, it will not pollute query results and OpenRefine (unless one specifically queries the GeoNames ID)
  4. Others may still import other data to these items, but only if they're confident to do so; they had better import coordinates etc. from a more reliable database (e.g. GEOnet Names Server)

--GZWDer (talk) 06:09, 26 July 2018 (UTC)

Oppose I just oppose your *cebwiki* importing; you are free to import Special:UnconnectedPages links from wikis other than this one. --Liuxinyu970226 (talk) 04:45, 31 July 2018 (UTC)
  • @Pasleim: seems to have done quite a lot of maintenance on cebwiki sitelinks. I'm curious what his view is on this.
    --- Jura 06:39, 31 July 2018 (UTC)
Oppose, this still pollutes OpenRefine results - especially when reconciling via GeoNames ID, which should be the preferred way when this id is available in the table. I don't see how voluntarily keeping the items basically blank would be a solution at all; it makes it harder to find duplicates. − Pintoch (talk) 11:54, 5 August 2018 (UTC)
Do you have experience with matching based on existing GeoNames IDs then? I still see items on a regular basis which have the wrong ID thanks to bots which imported lots of bad matches years ago (e.g. Weschnitz (Q525148) and River Coquet (Q7337301)), so it would be great if you could explain what you did to avoid mismatches so that bots can do the same. If bots assume that our GeoNames IDs are correct, they'll add sitelinks/statements/descriptions/etc to the wrong items and make a mess that's much harder to clean up than duplicates are. - Nikki (talk) 20:09, 5 August 2018 (UTC)
@Pintoch: Wikidata QIDs are designated as persistent identifiers; they are still valid when items are merged, but no guarantees should be assumed that any item (whether bot-created or not) is never merged or redirected. There are plenty of mismatches between cebwiki and Wikidata (which should be solved), but creating new items will not bring any new mismatches. Also, why do you think that leaving cebwiki pages unconnected makes it easier to find duplicates?--GZWDer (talk) 09:28, 6 August 2018 (UTC)
@Nikki: Yes I have experience with matching based on GeoNames IDs, and it generally gives very bad results because many items get matched to cebwiki items instead of the canonical item. I don't have any good strategy to avoid mismatches and that is the reason why I regret that these cebwiki items have been created without the appropriate checks for existing duplicates. I understand that cebwiki imports are not the only imports responsible for the unreliability of GeoNames ids in Wikidata, but in my experience the majority of errors came from cebwiki. I am not sure I fully get your point: are you arguing that it is fine to create duplicate cebwiki items because GeoNames IDs in Wikidata are already unreliable? I don't see how existing errors are an excuse for creating more of them. − Pintoch (talk) 09:02, 12 August 2018 (UTC)
@Pintoch: I am arguing that we need to avoid linking the cebwiki pages to the wrong items because merges are vastly better than splits, and that will involve some duplicates. Duplicate IDs continue being valid and will point to the right item even after a merge. The same is not true of splitting and you never know who is already using the ID. I agree that it would be nice to reduce the number of duplicates it creates, but nobody seems to have any idea how it should do that without creating even more bad matches, which is why I was hoping you might have some tips. - Nikki (talk) 13:12, 12 August 2018 (UTC)
@Nikki: okay, I get your point, thanks. So, no I haven't looked into the problem myself. If I had time I would first try to clean up the current items rather than creating new ones (and you have worked on this: thanks again!). I don't think there is any rush to empty w:ceb:Kategoriya:Articles without Wikidata item, so that's why I oppose this bot request. − Pintoch (talk) 18:24, 12 August 2018 (UTC)
@GZWDer: creating new items will not bring any new mismatches: creating new items will create new duplicates, and that is what disrupts our workflows. I personally don't care about the Wikidata <-> cebwiki mapping. If you care about this mapping, then please improve it without creating duplicates (that is, with reliable heuristics to match the cebwiki articles to existing items). If you do not have the tools to do this import without being disruptive to other Wikidata users, then don't do it. If someone else files a bot request to do this task, with convincing evidence that their import process is more reliable than yours, I will happily support it. − Pintoch (talk) 09:02, 12 August 2018 (UTC)
@Pintoch: Your argument is basically "creating new duplicates is harmful in any case" - but duplicates already exist everywhere, created by different users. They may eventually be merged, and their IDs will still be valid. There are many more cases where no match is found, and no items will be created for them in the foreseeable future (as it is not possible to handle all 500,000 pages manually).
@GZWDer: there are three differences between other users' duplicates and your duplicates: the first is the scale (500,000 items for this proposal), the second is the absence of any satisfactory checks for existing duplicates (which is unacceptable), the third is the domain (geographical locations are pivotal items that many other domains rely on - creating a mess there is more disruptive than in other areas). This is about creating 500,000 new geographical items with no reconciliation heuristics to check for existing duplicates. This is really detrimental to the project, and I am not the only one complaining about it. − Pintoch (talk) 10:31, 19 August 2018 (UTC)
Also, what about first creating items only for pages without existing items with the same labels (this is the default setting of PetScan)?--GZWDer (talk) 20:12, 13 August 2018 (UTC)
I think checks need to be more thorough than that, for instance because cebwiki article titles often include disambiguation information in brackets. For instance, these heuristics would fail to identify https://ceb.wikipedia.org/wiki/Amsterdam_(lungsod_sa_Estados_Unidos,_Montana) and Amsterdam-Churchill (Q614935). − Pintoch (talk) 10:31, 19 August 2018 (UTC)
  • Oppose. Although I'm not aware of this being a policy so far, I believe new items should be created from the encyclopedia that is likely to have the best information on them. A bot shouldn't create new items from a Russian Wikipedia item about a US state or a US politician, and a bot shouldn't create new items about a Russian city or politician from an English Wikipedia article. This restriction wouldn't necessarily apply to items that are not firmly connected to any particular country, such as algebra for example. Jc3s5h (talk) 16:18, 30 August 2018 (UTC)
    • No, this isn't a policy and it never could be. One of Wikidata's main functions is to support other Wikimedia projects by providing interwiki links and structured data. Requiring links to a particular Wikipedia before an item is considered notable would cripple Wikidata. We also can't control which Wikipedias people copy data from. We can refuse to allow a bot to run but that doesn't stop people from doing it manually or with tools like Petscan and Harvest Templates. - Nikki (talk) 12:08, 31 August 2018 (UTC)
  • @Ivan_A._Krestinin: In the meantime, KrBot seems to be doing this. --- Jura 10:28, 11 September 2018 (UTC)
  • Have no time to read the discussion. My bot is importing country (P17), coordinate location (P625), GeoNames ID (P1566) from cebwiki now. — Ivan A. Krestinin (talk) 21:24, 11 September 2018 (UTC)
    • @Ivan_A._Krestinin: There is a lot of opposition to mass-creating new items for cebwiki items (see above), so you should create a new request for permissions before continuing. - Nikki (talk) 12:05, 12 September 2018 (UTC)
        • Ok, I disabled new item creation. I have code for connecting pages from different wikis. But it does not work without item creation, because it is based on the scheme: import data, find duplicate items, analyze data conflicts, labels, etc., merge items. — Ivan A. Krestinin (talk) 20:07, 12 September 2018 (UTC)
        • Thanks. The main issue is that people don't want duplicates. If you can explain what your bot does to avoid duplicates when you create a new request for permissions, it will hopefully be enough to change people's minds. :) - Nikki (talk) 09:00, 13 September 2018 (UTC)

If someone is creating items for all cebwiki articles, I still plan to add statements and descriptions to them. However, due to real-life issues I'd like to place the request On hold until January-February 2019 and see what happens. Comments and questions are still welcome, but I am probably not able to answer them anytime soon.--GZWDer (talk) 06:10, 12 September 2018 (UTC)

@GZWDer: Since there are too many oppose comments, and privacy concerns have already been raised with WMF Trust & Safety, it's unlikely that your work can be approved, so why not withdraw it? --Liuxinyu970226 (talk) 22:41, 15 September 2018 (UTC)

crossref bot

crossref bot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Mahdimoqri (talk • contribs • logs)

Task/s: to add missing journals from crossref api

Code: https://github.com/moqri/wikidata_scientific_citations/tree/master/add_journal/crossref

Function details: add missing journals from crossref --Mahdimoqri (talk) 21:12, 19 April 2018 (UTC)
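A hedged sketch of what such an import could look like (the bot's actual code is in the linked repository): page through Crossref's /journals endpoint and report titles whose ISSN (P236) is not yet on any Wikidata item.

```python
# Sketch only: find Crossref journals whose ISSN is missing from Wikidata.
import requests

CROSSREF = "https://api.crossref.org/journals"
WDQS = "https://query.wikidata.org/sparql"

def issn_in_wikidata(issn):
    query = 'ASK { ?item wdt:P236 "%s" }' % issn
    r = requests.get(WDQS, params={"query": query, "format": "json"})
    return r.json()["boolean"]

def missing_journals(rows=20, offset=0):
    data = requests.get(CROSSREF, params={"rows": rows, "offset": offset}).json()
    for journal in data["message"]["items"]:
        issns = journal.get("ISSN", [])
        if issns and not any(issn_in_wikidata(i) for i in issns):
            yield journal["title"], issns

for title, issns in missing_journals():
    print(title, issns)
```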

See the discussion here and the data import request and workflow here

@DarTar, Daniel_Mietchen, Fnielsen, John_Cummings, Mahir256: any thoughts or feedback?

WikiBot

WikiBot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: 1succes2012 (talk • contribs • logs)

Task/s: access and parse data from Wikipedia

Code: To be developed

Function details: get article summaries, get data like links and images from a page and return it back to my users --1succes2012 (talk) 15:09, 17 June 2018 (UTC)

Comment For accessing data, a bot account is not necessary (unless you are about to hit security limits). Matěj Suchánek (talk) 09:09, 3 August 2018 (UTC)

PricezaBot

PricezaBot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Pricezabot (talk • contribs • logs)

Task/s: Add price to wikidata commercial products (e.g. phone, electronics, camera, etc)

Code:

Function details: --Pricezabot (talk) 09:18, 14 June 2018 (UTC) Priceza is a price comparison engine in SEA; we have a lot of pricing data for commercial products, and this bot will create statements in Wikidata with pricing details from our website.

Comment If you're going to be importing data from your own aggregate website, this would quite literally be a spambot... Chrissymad (talk) 19:29, 14 June 2018 (UTC)

schieboutct

schieboutct (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Ctschiebout (talk • contribs • logs)

Task/s: create bot to add missing demonyms

Code:

Function details: --Ctschiebout (talk) 01:39, 22 April 2018 (UTC)

Comment Code? Source? Matěj Suchánek (talk) 09:09, 3 August 2018 (UTC)

wikidata get

wikidata get (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: 27.4.240.118 (talk • contribs • logs)

Task/s:

Code:

Function details: --27.4.240.118 10:51, 15 June 2018 (UTC)

Comment Please expand this request. Matěj Suchánek (talk) 09:15, 3 August 2018 (UTC)

Wolfgang8741 bot

Wolfgang8741 bot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Wolfgang8741 (talk • contribs • logs)

Task/s: Openrefine imports to Wikidata.

Code: N/A

Function details: Data imports from Openrefine datasets --Wolfgang8741 (talk) 02:16, 18 June 2018 (UTC)

Comment What kind of data from what source? – The preceding unsigned comment was added by Matěj Suchánek (talk • contribs).
@Matěj Suchánek: Sorry I missed this comment. This is not a fully automated bot; it is the human-assisted tool OpenRefine used for larger imports, starting with small-scale tests before larger application. The current focus is on the GNIS import at this time. Yes, the import description and process need to be built out a bit more; I'm not using this until I refine the process and get community approval for the import. Initial learning curve and orientation to the Wikidata processes are in progress. Wolfgang8741 (talk) 15:49, 5 September 2018 (UTC)

CanaryBot 2

CanaryBot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Ivanhercaz (talk • contribs • logs)

Task/s: set labels, descriptions and aliases in Spanish on properties that lack them in Spanish.

Code: the code is available in PAWS as a Jupyter IPython notebook. When I have time I will upload the .ipynb and .py files to the CanaryBot repository on GitHub.

Function details: Well, the task that I am requesting is very similar to the first task that I requested, but I asked Ymblanter for an opinion about it, and they recommended that I request a new task because, as Ymblanter said, I am going to use new code: in this case I am going to set labels, descriptions and aliases on properties, not on items as I did in my last task.

In addition, this script works differently: I extracted all the properties without a Spanish label, a Spanish description, or both, and merged them into one CSV in which I am filling the cells with their respective translations. When all the cells are filled I will run the script, which will read each row, check whether the property has labels, descriptions and aliases in Spanish and, if not, add the content of the respective cells.
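A hedged sketch of the workflow described above (not the actual PAWS notebook); the CSV column names are assumptions, and aliases would be handled analogously:

```python
# Sketch only: add Spanish labels/descriptions from a translation CSV
# when the property does not have them yet.
import csv
import pywikibot

repo = pywikibot.Site("wikidata", "wikidata").data_repository()

with open("property_translations.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):   # assumed columns: property, label_es, description_es
        prop = pywikibot.PropertyPage(repo, row["property"])
        prop.get()
        if row["label_es"] and "es" not in prop.labels:
            prop.editLabels({"es": row["label_es"]}, summary="Adding Spanish label")
        if row["description_es"] and "es" not in prop.descriptions:
            prop.editDescriptions({"es": row["description_es"]}, summary="Adding Spanish description")
```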

I still have to test and improve some things in the script. It is very basic, but it works for what I want to do. I log everything so that I know how to solve an error if it happens.

Well, I await your answers and opinions. Thanks in advance!

Regards, Ivanhercaz (Talk) 23:45, 10 May 2018 (UTC)

I improved the code: added some stats, fixed the log report... I think it is ready to run without problems. What I need now is to finish the translations of the properties. I await your opinions about this task. Regards, Ivanhercaz (Talk) 15:54, 12 May 2018 (UTC)
I am ready to approve the bot task in a couple of days, provided that no objections are raised. Lymantria (talk) 06:54, 13 May 2018 (UTC)
  • Could you link the test edits? I only find Portuguese.
    --- Jura 16:24, 13 May 2018 (UTC)
    Of course, Jura. I thought I had shared the contributions on test.wikidata, but I had not, excuse me. I think you are referring to the edits in Portuguese that my bot made with its first task; in this case I only work with Spanish labels, descriptions and aliases. You can check my latest contributions on test.wikidata. I noticed the edit summary was wrong because it said "setting es-label" for everything, i.e. also when the bot changed a description or an alias; I just fixed it and now it shows the correct summary, as you can see in the last three edits. But I have found a bug that I have to fix: if you check this diff, you can see how the bot replaced the existing alias with the new one, while what I want is to append the new aliases and keep the old ones, so I have to fix it.
    I am not worried about the time, or whether the task is accepted now or in the future; I just wanted to propose it and talk about how it would work. But, to be honest, I still have to fill the CSV file, so I have plenty of time to fix this type of error and improve it. For that reason I requested another task.
    Regards, Ivanhercaz (Talk) 17:19, 13 May 2018 (UTC)
    For bot approvals, operators generally do about 50 or more edits here at Wikidata. These "test edits" are expected to be of standard quality.
    --- Jura 17:23, 13 May 2018 (UTC)
    I know, Jura, but I thought I could not make the test edits on Wikidata without authorization or someone requesting them, because this task is not approved yet. Well, as you are requesting these test edits, once the aliases bug has been solved I will run the bot on Wikidata and report here whether it works fine or not. Regards, Ivanhercaz (Talk) 17:29, 13 May 2018 (UTC)
    I fixed the aliases bug, as you can check here. I will notify you, Jura, when I have done the test edits on Wikidata and not on test.wikidata. Regards, Ivanhercaz (Talk) 18:26, 13 May 2018 (UTC)
  • @Jura1, Ymblanter: Today I could only make a few test edits. I will make more in the next days to check it better. Regards, Ivanhercaz (Talk) 18:15, 14 May 2018 (UTC)
    I forgot to share the log with you; if you check the notebook you can see the generated graph. Regards, Ivanhercaz (Talk) 18:26, 14 May 2018 (UTC)

maria research bot

maria research bot (talk • contribs • new items • SUL • Block log • User rights log • User rights)
Operator: Mahdimoqri (talk • contribs • logs)

Task/s: add missing articles and citations information for articles listed on PubMed Central

Code: https://github.com/moqri/wikidata_scientific_citations

Function details: --Mahdimoqri (talk) 06:15, 13 March 2018 (UTC)

Support Mahir256 (talk) 22:37, 13 March 2018 (UTC)
Comment This Fatameh-based script is useful for most of phase 1 and works fine for PubMed IDs and for some Crossref IDs as well, but it does not address the citation part from phase 2 onwards. --Daniel Mietchen (talk) 13:27, 14 March 2018 (UTC)
Thanks Daniel Mietchen, I modified the description of the task here to confirm what the bot does at the moment. Mahdimoqri (talk) 15:52, 14 March 2018 (UTC)
Support That looks good to me. --Daniel Mietchen (talk) 19:54, 14 March 2018 (UTC)
Support The Fatameh edits from this bot seem fine so far. It is a nice simple script. I note some Fatameh artifacts in the titles, e.g., "*." in BOOKWORMS AND BOOK COLLECTING (Q50454030). But I suppose we have to live with that... — Finn Årup Nielsen (fnielsen) (talk) 18:44, 14 March 2018 (UTC)
I was going to write the same thing. Can we remove the trailing full stop (".") ? I'm sure some bot could clean up the existing ones as well.
--- Jura 20:37, 14 March 2018 (UTC)
Thanks Finn Årup Nielsen (fnielsen) and Jura, I would be happy to add another script to remove asterisks or to fix any other issues you find, after the PMC items are added. Mahdimoqri (talk) 23:10, 14 March 2018 (UTC)
For the final dot, can you remove this before adding it to label/title statement?
--- Jura 23:17, 14 March 2018 (UTC)
Thanks Jura! Unfortunately, as far as I know, Fatameh does not have any out-of-the-box option for such changes. I'd recommend that a separate script be written just for this purpose, since there are currently 14 million other articles with this problem (https://www.wikidata.org/w/index.php?tagfilter=OAuth+CID%3A+843&limit=50&days=7&title=Special:RecentChanges&urlversion=2). Daniel Mietchen might be interested in such a script too. Mahdimoqri (talk) 02:51, 15 March 2018 (UTC)
@T Arrow, Tobias1984: could you fix Fatameh?
--- Jura 07:21, 15 March 2018 (UTC)
There is a task for it here: https://phabricator.wikimedia.org/T172383 Mahdimoqri (talk) 15:08, 15 March 2018 (UTC)
Do any of the people who wrote the code actually follow phabricator? I tried to find the part of the code where the dot gets added/should be removed, but I was probably in the wrong module. Any ideas?
--- Jura 05:16, 16 March 2018 (UTC)
I'm just not checking it all that regularly. I've replied to the ticket. Fatameh relies on wikidatintegrator to do most of the heavy lifting. This uses PubMed as the data source and (unfortunately?) they actually report all the titles as ending in a period (or other punctuation). I think we need to find a reference for the titles without the period rather than just changing all the existing statements. There was a short discussion on the WikiCite Mailing List as well. I'm happy to work on a solution but I'm not really sure what is the best way forward. T Arrow (talk) 09:26, 16 March 2018 (UTC)
Jura, I added the fix for the trailing dots and asterisks in a separate script (fatameh_sister_bot). Any other issues that I can address to have your support?Mahdimoqri (talk) 06:22, 17 March 2018 (UTC)
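A hedged sketch of this kind of cleanup (not the actual fatameh_sister_bot code from the linked repository): strip the surrounding asterisks and a single trailing period from the English label.

```python
# Sketch only: clean PubMed title artifacts ("*", trailing ".") from labels.
import re
import pywikibot

repo = pywikibot.Site("wikidata", "wikidata").data_repository()

def cleaned(title):
    return re.sub(r"\.$", "", title.strip("*").strip())

def fix_label(qid):
    item = pywikibot.ItemPage(repo, qid)
    item.get()
    old = item.labels.get("en", "")
    new = cleaned(old)
    if new and new != old:
        item.editLabels({"en": new}, summary="Removing trailing dot/asterisk artifact")
```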

Thanks all for providing feedback and offering solutions/help to address the issue with Fatameh. It seems there will be a fix either in Fatameh or in a separate script. In either case, it is to be applied to all article items, which I believe could be done independently of this bot. Meanwhile, could you support and accept this bot so I can get it started, and maybe set up a new bot for fixing other issues? Mahdimoqri (talk) 21:12, 16 March 2018 (UTC)

Oppose I don't think we should approve another Fatameh based bot until major concerns are fixed. --Succu (talk) 21:24, 16 March 2018 (UTC)
Thanks for your feedback Succu. I just created a bot (Fatameh_sister_bot) that fixes the issue with the label for the items created using Fatameh. I'll make sure I run it on everything maria research bot creates to address the concern with the titles. Are there any other issues that I can address? Mahdimoqri (talk) 06:04, 17 March 2018 (UTC)
@Succu: I also fixed this issue from the root in Fatameh source code here so new items are created without the trailing dot.
Title statements would need the same fix and some labels have already been duplicated into other languages (maybe this is taken care of, but I haven't seen any in the samples).
--- Jura 09:35, 18 March 2018 (UTC)
Thanks for the feedback Jura. The translated labels (if any) are added to labels. I will take care of the title statement now.
@Jura1: the titles are also fixed and the code has been updated (https://github.com/moqri/wikidata_scientific_citations/blob/master/fatameh_sister_bot/fix_labels_and_titles.py). Any other issues that I can address to have your support for the bot?
I think the cleanup bot/task can be authorized.
--- Jura 12:30, 21 March 2018 (UTC)
@Jura1: wonderful! this is the request for the cleanup bot: fatameh_sister_bot. Could you please state your support there, for a bot flag?
I don't think edits like this one are OK, Mahdimoqri, because you are ignoring the reference given. And please wait with this kind of correction until you get the flag. --Succu (talk) 22:36, 22 March 2018 (UTC)
@Succu: the title in the reference is not exactly correct. Please refer to this reference or this reference for the correct title. Would you like the bot to change the reference as well? – The preceding comment was added without a signature.
The cleanup should be fine. It just strips an artifact PMD adds.
--- Jura 09:24, 23 March 2018 (UTC)
Translated titles are enclosed within brackets. This should be changed. The current version overwrites existing page(s) (P304) with incomplete values. --Succu (talk) 10:08, 18 March 2018 (UTC)
@Succu: thanks for the feedback! I could not find any instance of either of the issues! Could you please reply with one instance of each of these two issues that is created by my bot so that I can address them? Mahdimoqri (talk) 03:17, 19 March 2018 (UTC)
[Sexually-transmitted infection in a high-risk group from Montería, Colombia]. (Q50804547) is an example for the first issue. Removing the brackets only is not the solution. --Succu (talk) 22:30, 22 March 2018 (UTC)
I will not import any items with translated titles (until there is a consensus on what is the solution on this). Mahdimoqri (talk) 14:06, 30 March 2018 (UTC)
We should try to figure out how to handle them (e.g. import the original language and delete the "title" statement, possibly find the original title and add that as title and label in that language). For new imports, it would just need to skip adding the title statement and add a language of work or name (P407).
--- Jura 09:24, 23 March 2018 (UTC)
Or use original language of work (P364). Anyway, it should be made clear that the original title is not English. -- JakobVoss (talk) 14:45, 24 March 2018 (UTC)
Should we attempt to add a statement that identifies them as not being in English before we actually manage to determine the original language?
--- Jura 21:12, 24 March 2018 (UTC)

AmpersandBot[edit]

AmpersandBot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: PinkAmpersand (talkcontribslogs)

Task/s: Generate descriptions for village items in the format of "village in <place>, <place>, <country>"

Code: https://github.com/PinkAmpersand/AmpersandBot/blob/master/village.py

Function details: With my first approved task (approved in July 2016, but not completed until recently), I set descriptions for about 20,000 Ukrainian villages based on their country (P17), instance of (P31), and located in the administrative territorial entity (P131) values. Now, I would like to use the latter two values to generalize this script to—ominous music—every village in the world!

The script works as follows:

  1. It pulls up 5,000 items backlinking to village (Q532)
  2. It checks whether an item is instance of (P31)  village (Q532)
  3. It then labels items as follows:
    1. It removes disambiguation from labels in any language:
      1. It runs a RegEx search for ,| \(
      2. It removes those characters and any following ones
      3. It sets the old label as an alias for the given language
      4. If the alias contains non-ASCII characters, it creates an ASCII version and sets that as an alias as well
      5. It compiles a new list of labels and aliases for the relevant languages, and updates the item with all of them at once
    2. It sets labels in all Latin-script languages:
      1. It checks if the current Latin-script languages all use the same label.
      2. If they don't, it does nothing except log the item for further review.
      3. If they do, it sets that label as the label for all other Latin-script languages, using a list of 196 languages (viewable in the source code)
      4. If the label contains non-ASCII characters, it also sets an ASCII version of the label as an alias
      5. It compiles a new list of labels and aliases for the relevant languages, and updates the item with all of them at once
  4. And describes items as follows:
    1. It checks whether the item either a) lacks an English description or b) has an English description that merely says "village in <country>" or "village in <region>". (I've manually coded into the RegEx the names of every multi-word country. This still leaves a blind spot for multi-word entities other than countries. I welcome advice on how to fix this.)
    2. If so, it gets the item's parent entity. If that entity is a country, it describes the item as "village in <parent>"
    3. If the parent entity is not a country, it checks the grandparent entity. If that is a country, it describes the item as "village in <parent>, <grandparent>"
    4. Next onto the great-grandparent entity. "village in <parent>, <grandparent>, <great-grandparent>"
    5. For the great-great-grandparent entity, only the top three levels are used: "village in <grandparent>, <great-grandparent>, <great-great-grandparent>". This is slightly more likely to result in dupe errors, but the code handles those.
    6. Ditto the thrice-great-grandparent entity.
    7. If even the thrice-great-grandparent is not a country, the item is logged for further review. If people think I should go deeper, I am willing to; I may do so of my own initiative if the test run turns up too many of these errors.
  5. After 5,000 items have been processed, another 5,000 are pulled. The script continues until there are no backlinks left to describe.
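To make steps 3.1 and 4 above more concrete, here is a rough sketch of the label cleanup and the description climb. It is illustrative only, not the actual village.py code; it assumes Pywikibot plus the third-party unidecode package for the ASCII aliases, and the country test via instance of (P31) country (Q6256) is an assumption.

# Illustrative sketch of steps 3.1 and 4, not the actual village.py code.
# Assumes Pywikibot and the third-party unidecode package.
import re
import pywikibot
from unidecode import unidecode

site = pywikibot.Site('wikidata', 'wikidata')
repo = site.data_repository()

DISAMBIG = re.compile(r',| \(')  # the RegEx from step 3.1.1

def clean_labels(item):
    # Step 3.1: cut labels at the first comma or " (" and keep the old label as an alias.
    new_labels, new_aliases = {}, {}
    for lang, label in item.labels.items():
        match = DISAMBIG.search(label)
        if not match:
            continue
        new_labels[lang] = label[:match.start()]
        aliases = [label]
        if unidecode(label) != label:
            aliases.append(unidecode(label))  # ASCII alias for non-ASCII labels
        new_aliases[lang] = aliases
    if new_labels:  # one update per item, as in step 3.1.5
        item.editEntity({'labels': new_labels, 'aliases': new_aliases},
                        summary='strip disambiguation from village labels')

def is_country(entity):
    # Assumed check: instance of (P31) country (Q6256).
    return any(c.getTarget().getID() == 'Q6256' for c in entity.claims.get('P31', []))

def describe(village):
    # Step 4: walk up P131 until a country is reached, then keep at most the top three levels.
    chain, current = [], village
    for _ in range(6):  # parent .. thrice-great-grandparent
        parents = current.claims.get('P131', [])
        if not parents:
            return None  # log for further review
        parent = parents[0].getTarget()
        parent.get()
        chain.append(parent.labels.get('en', ''))
        if is_country(parent):
            return 'village in ' + ', '.join(chain[-3:])
        current = parent
    return None  # no country within six levels: log for further review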

Does this sound good? — PinkAmpers&(Je vous invite à me parler) 01:43, 22 February 2018 (UTC) Updated 22:17, 3 March 2018 (UTC)

Test run here. The only issue that arose was some items, like Koro-ni-O (Q25694), being listed in my command line as updated, but not actually updating. It's a bug, and I'll look into it, but its only effect is to limit the bot's potential, not to introduce any unwanted behavior. — PinkAmpers&(Je vous invite à me parler) 02:16, 22 February 2018 (UTC)
I will approve the bot in a couple of days provided no objections have been raised.--Ymblanter (talk) 08:39, 25 February 2018 (UTC)
Cool, thanks! But actually, I'm working on a few more things for the bot to do to these village items while it's "in the neighborhood", so would you mind holding off until I can post a second test run? — PinkAmpers&(Je vous invite à me parler) 00:23, 26 February 2018 (UTC)
This is fine, no problem.--Ymblanter (talk) 10:42, 26 February 2018 (UTC)
@Ymblanter:. Okay. I'm all done. I've updated the bot's description above. Diff of changes here. New test run here. There was one glitch in this test run, namely that the bot failed to add ASCII aliases for Unicode labels while performing the Latin-script label unanimity function. This was due to a stray space before the word aliases in line 247. I fixed that here, and ran a test edit here to check that that worked. But I'm happy to run a few dozen more test edits if you want to see that fix working in action. — PinkAmpers&(Je vous invite à me parler) 22:17, 3 March 2018 (UTC)
Concerning the Latin-script languages, not all of them use the same spelling. For example, here I am sure that in lv it is not Utvin (most likely Utvins), in lt it is not Utvin, and possibly in some other languages it is not Utvin (for example, crh uses phonetic spelling; Utvin may be fine, but other names will not be). I would suggest restricting this part of the task to major languages (say German, French, Spanish, Portuguese, Italian, Danish, Swedish, maybe a couple more) and doing some research for the others - I have no idea, for example, what Navajo uses. The rest seems to be fine.--Ymblanter (talk) 07:48, 4 March 2018 (UTC)
I'm concerned about exonyms too. Even if a language uses the same name variant as other Latin-script languages for most settlements, there may still be particular settlements for which it does not. 90.191.81.65 14:30, 4 March 2018 (UTC)
I considered that, 90.191.81.65, but IMHO it's not a problem. The script will never overwrite an earlier label, and indeed won't change the labels unless all existing Latin-script labels are in agreement. So the worst-case scenario here is that an item would go from having no label in one language to having one that is imperfect but not incorrect. An endonym will always be a valid alias, after all. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
I'm not sure that all languages consider an endonym as a valid alias if there's an exonym too. And if it is considered technically not incorrect then for some cases an endonym would still be rather odd. My concern on this is similar to one currently brought up in project chat. 90.191.81.65 07:58, 5 March 2018 (UTC)
I would think that an endonym is by definition a valid alias. The bar for "valid alias" is pretty low, after all. So if there isn't consensus to use endonyms as labels, I can set them as aliases instead. — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
Also, all romanized names are probably problematic. Many languages may use the same romanization system (the same as in English or the one recommended by the UN) for particular foreign language, but there are also languages which have their own romanization system. So a couple of the current Latin-script languages using the same romanization would be merely a coincidence. 90.191.81.65 14:49, 4 March 2018 (UTC)
I'm confused about your concern here. The only romanization that the script does is in setting aliases, not labels. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
All Ukrainian, Georgian, Arabic etc. place names apart from exonyms are romanized in Latin-script languages. And there are different romanization systems, some specific to a particular language, e.g. Ukrainian-Estonian transcription. For instance, currently all four Latin labels for Burhunka (Q4099444) happen to be "Burhunka", but that wouldn't be correct in Estonian. 90.191.81.65 07:58, 5 March 2018 (UTC)
Well that's part of why I'm using a smaller set of languages now. Can you give me examples of languages within the set that have this same problem? — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
Thanks for the feedback, Ymblanter. I've pared back the list, and posted at project chat asking for help with re-expanding it. See Wikidata:Project chat § Help needed with l10n for bot. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)

I note that here the bot picks up the name of a former territorial entity, though preferred rank is set for the current parish. Also, is the whole territorial hierarchy really necessary in the description if there's no need to disambiguate from other villages with the same name in the same country? For a small country like Estonia I'd prefer simpler descriptions. 90.191.81.65 14:30, 4 March 2018 (UTC)

The format I'm using is standard for English-language labels. See Help:Description § Go from more specific to less specific. — PinkAmpers&(Je vous invite à me parler) 21:37, 4 March 2018 (UTC)
The section you refer to concerns the order in which you go from more specific to less specific in a description. As for how specific you should go, it leaves that open, apart from saying in the above section that adding one subregion of a country is common and giving two examples where the whole administrative hierarchy is not shown. 90.191.81.65 07:58, 5 March 2018 (UTC)
To me, the takeaway from Help:Description is that using a second-level subregion is not required, but also not discouraged. It comes down to an individual editor's choice. — PinkAmpers&(Je vous invite à me parler) 17:51, 5 March 2018 (UTC)
  • Pictogram voting comment.svg Comment I'm somewhat concerned about the absence of a plan to maintain this going forward. If descriptions in 200 languages for hundreds of thousands of items are being added, this becomes virtually impossible to correct manually. Descriptions can need to be maintained if the name changes, or if the P131 is found to be incorrect or irrelevant. Already now, default labels for items that may seem static (e.g. categories/lists) aren't maintained once they are added; this would just add another chunk of redundant data that isn't maintained. The field already suffers from the absence of maintenance of cebwiki imports, so please don't add more to it. Maybe one would want to focus on English descriptions and native label statements instead.
    --- Jura 10:16, 12 March 2018 (UTC)

Arasaacbot[edit]

Arasaacbot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Lmorillas (talkcontribslogs)

Task/s: Search info and taxonomies, and translate our images at http://arasaac.org

Code: https://github.com/lmorillas/aradata

Function details: Search image names at wikidata and get info about them --Arasaacbot (talk) 12:28, 15 January 2018 (UTC)

@Arasaacbot, Lmorillas: Your GitHub repository doesn't have any actual code in it. It would be helpful if you could upload the source code. Also, can we please see a test run of 50-250 edits? Assuming you still plan on using this bot. — PinkAmpers&(Je vous invite à me parler) 23:27, 23 February 2018 (UTC)
@Arasaacbot, Lmorillas: Still interested? Matěj Suchánek (talk) 09:28, 3 August 2018 (UTC)
@Matěj Suchánek, PinkAmpersand: Sorry for the delay. I want to use Wikidata to improve the content of our image service. I asked a friend who uses Wikidata and he said that if we only need read access, no special permissions are needed, are they? Lmorillas (talk) 09:43, 7 August 2018 (UTC)
No, except in some situations like big queries etc. Matěj Suchánek (talk) 11:24, 8 August 2018 (UTC)

taiwan democracy common bot[edit]

taiwan democracy common bot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Twly.tw (talkcontribslogs)

Task/s: Input Taiwanese politician data; it is a project from mySociety

Code:

Function details: follow this step to input politician data, mainly position held (P39) statements with the related term, constituency and political party.

The operator can't be the bot itself. So who's going to operate the bot? Mbch331 (talk) 14:46, 18 February 2018 (UTC)
Operator will be Twly.tw (talkcontribslogs), bot: taiwan democracy common bot.
This is Twly.tw (talkcontribslogs), based on Wikidata:Requests for permissions/Bot/taiwan democracy common. --Ymblanter (talk) 09:42, 25 February 2018 (UTC)
I would like to get some input from uninvolved users here before we can proceed.--Ymblanter (talk) 18:56, 1 March 2018 (UTC)
  • The bot might need a fix for date precision (9→7). It seems that everybody is born on January 1: Q19825688, Q8274933, Q8274088, Q8350110. As these items already had more precise dates, it might want to skip them.
    --- Jura 11:00, 12 March 2018 (UTC) Fixed, thanks.
@Jura1:, can we proceed here?--Ymblanter (talk) 21:06, 21 March 2018 (UTC)
I have a hard time trying to figure out what it's trying to do. Maybe some new test edits could help. Is the date precision for the start date in the qualifier of Q19825688 correct?
--- Jura 21:37, 21 March 2018 (UTC)

Newswirebot[edit]

Newswirebot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Dhx1 (talkcontribslogs)

Task/s:

  1. Create items for news articles that are published by a collection of popular/widespread newspapers around the world.

Code:

  • To be developed.

Function details:

Purpose:

  • New items created by this bot can be used in described by source (P1343) and other references within Wikidata.
  • New items created by this bot can be referred to in Wikinews articles.

Process:

  1. For each candidate news article, check whether a Wikidata item of the same title exists with a publication date (P577) +/- 1 day.
    1. If an existing Wikidata item is found, check whether publisher (P123) is a match as well.
    2. If publisher (P123) matches, ignore the candidate news article.
  2. For each candidate news article, check whether an existing Wikidata item has the same official website (P856) (full URL to the published news article).
    1. If official website (P856) matches, ignore the candidate news article.
  3. If no existing Wikidata item is found, create a new item.
  4. Add a label in English which is the article title.
  5. Add descriptions in multiple languages in the format of "news article published by PUBLISHER on DATE".
  6. Add statement instance of (P31) news article (Q5707594).
  7. Add statement language of work or name (P407) English (Q1860).
  8. Add statement publisher (P123).
  9. Add statement publication date (P577).
  10. Add statement official website (P856).
  11. Add statement author name string (P2093) which represents the byline (Q1425760). Note that this could be the name of a news agency, or a combination of news agency and publisher, if the writer is not identified.
  12. Add statement title (P1476) which represents the headline (Q1313396).
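A rough sketch of the duplicate check in steps 1-2, querying the Wikidata Query Service with requests; the exact-title match and the one-day filter are simplifications, not the final bot logic.

# Illustrative duplicate check for steps 1-2 above; the exact-title comparison is a
# simplification. Uses the public Wikidata Query Service via requests.
import requests

WDQS = 'https://query.wikidata.org/sparql'

def find_existing(title, date_iso, url):
    # Items with the same title (P1476) published (P577) within one day,
    # or with the candidate URL as official website (P856).
    query = '''
    SELECT ?item WHERE {
      {
        ?item wdt:P1476 ?t ; wdt:P577 ?d .
        FILTER(STR(?t) = "%s")
        FILTER(ABS(?d - "%s"^^xsd:dateTime) <= 1)   # WDQS date difference is in days
      } UNION {
        ?item wdt:P856 <%s> .
      }
    } LIMIT 5
    ''' % (title.replace('"', '\\"'), date_iso, url)
    r = requests.get(WDQS, params={'query': query, 'format': 'json'},
                     headers={'User-Agent': 'Newswirebot-sketch/0.1 (example)'})
    r.raise_for_status()
    return [b['item']['value'] for b in r.json()['results']['bindings']]

# e.g. find_existing('Example headline', '2018-02-08T00:00:00Z', 'https://example.org/article')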

Example sources and copyright discussions:

--Dhx1 (talk) 13:00, 8 February 2018 (UTC)

Interesting initiative. How many articles do you plan to create per day? --Pasleim (talk) 08:44, 9 February 2018 (UTC)
I was thinking of programming the bot to regularly check Grafana and/or Special:DispatchStats or a similar statistics endpoint, raising or lowering the rate of edits up to a predefined limit. It appears that larger publishers may publish around 300 articles per day, so if the bot was developed to work with 10 sources, that is around 3000 new articles per day, or one new article every 30 seconds. For the initial import, an edit rate of 1 article creation per second (what User:Research_Bot seems to use at the moment) would allow 86,400 articles to be processed per day, or approximately 30 days' worth of archives processed per day. At that rate, it might take 4-5 months to complete the initial import. Dhx1 (talk) 10:12, 9 February 2018 (UTC)
We probably need the code and test edits to continue this discussion.--Ymblanter (talk) 08:31, 25 February 2018 (UTC)
@Dhx1: What do you think about Zotero translators? Could they be somehow used in order to speed up the process?--Malore (talk) 16:09, 20 September 2018 (UTC)
@Malore: I have been using scrapy, which is trivial to use for crawling and extracting information. The trickier part at the moment is finding matching Wikidata articles that already exist, and writing to Wikidata. Pywikibot doesn't seem to allow writing a large Wikidata item at once with many claims, qualifiers and references. The API allows it, however, and the WikidataIntegrator bot also allows it, albeit with little documentation to make it clear how it works. Zotero could be helpful if a large community forms around it for news website metadata scraping (for bibliographies). Dhx1 (talk) 11:53, 23 September 2018 (UTC)
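For what it's worth, writing a whole item in one go with WikidataIntegrator looks roughly like the sketch below; the values are placeholders and the wdi_core/wdi_login calls are from memory of that library's documented API, so treat the exact signatures as assumptions to verify.

# Rough sketch of creating a news-article item in a single write with WikidataIntegrator.
# Placeholder values; the wdi_core API is assumed as documented upstream.
from wikidataintegrator import wdi_core, wdi_login

login = wdi_login.WDLogin(user='Newswirebot', pwd='...')   # bot credentials

statements = [
    wdi_core.WDItemID(value='Q5707594', prop_nr='P31'),    # instance of: news article
    wdi_core.WDItemID(value='Q1860', prop_nr='P407'),      # language of work or name: English
    wdi_core.WDItemID(value='Q00000', prop_nr='P123'),     # publisher (placeholder QID)
    wdi_core.WDMonolingualText(value='Example headline', language='en', prop_nr='P1476'),
    wdi_core.WDString(value='Jane Doe', prop_nr='P2093'),  # author name string
    wdi_core.WDTime('+2018-02-08T00:00:00Z', prop_nr='P577'),
    wdi_core.WDUrl(value='https://example.org/article', prop_nr='P856'),
]

item = wdi_core.WDItemEngine(data=statements, new_item=True)  # new_item: assumed flag to force creation
item.set_label('Example headline', lang='en')
item.set_description('news article published by Example Publisher on 8 February 2018', lang='en')
print(item.write(login))  # returns the QID of the written item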

KlosseBot[edit]

KlosseBot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Walter Klosse (talkcontribslogs)

Task/s: The bot makes mass edits with Widar.

Function details: This bot will mass-create items with QuickStatements, predominantly about comic book characters, with information from sites such as marvel.com or the Marvel Database.

--Walter Klosse (talk) 20:40, 17 November 2017 (UTC)

Please be more precise about which task you are going to perform with your bot. Permission should be asked for task by task. See also Wikidata:Bots. Lymantria (talk) 18:45, 25 November 2017 (UTC)
I am confused. Edits your bot has made since this request do not fall within the scope you described above, but seem to focus on programming languages. Lymantria (talk) 13:03, 8 December 2017 (UTC)
My bad, originally I was thinking that this bot would only work on comic book characters, but now I make edits on more topics. --Walter Klosse (talk) 21:17, 15 December 2017 (UTC)
Please keep in mind that permission should be requested task by task. See also Wikidata:Bots. But if your tasks are "small", perhaps a bot flag is not needed. Lymantria (talk) 15:17, 17 December 2017 (UTC)
@Walter Klosse: Still interested? Matěj Suchánek (talk) 09:19, 3 August 2018 (UTC)

NIOSH bot[edit]

NIOSH bot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Harej (talkcontribslogs)

Task/s: Synchronize Wikidata with the NIOSHTIC-2 research database.

Code: https://github.com/harej/niosh2wikidata

Function details: NIOSHTIC-2 is a database of occupational safety and health research published by NIOSH and/or supported by NIOSH staff. As part of my work with NIOSH I have developed scripts to make sure NIOSHTIC has corresponding entries in Wikidata (but, where possible, it will not create duplicates of entries that already exist on Wikidata). This allows NIOSH's data to be part of a greater network of data, for instance by including data from other sources such as PubMed. Better indexing of this data is part of a longer-term effort to make it easier for Wikipedia editors to discover these reliable resources. --Harej (talk) 05:59, 14 November 2017 (UTC)

Please make some test edits.--Ymblanter (talk) 11:51, 19 November 2017 (UTC)
@Harej: Still interested? Matěj Suchánek (talk) 09:19, 3 August 2018 (UTC)
@Matěj Suchánek: In principle yes; however, I'm currently in the process of reworking my scripts so that they will work for Wikidata at its current size. Harej (talk) 15:09, 3 August 2018 (UTC)

Ymblanter, Matěj Suchánek, I have made some test edits. Please let me know if you have any questions. Harej (talk) 17:17, 12 August 2018 (UTC)

I am fine with the test edits and can approve the bot in several days provided there have been no objections raised.--Ymblanter (talk) 21:22, 12 August 2018 (UTC)
@Harej: For the first two test edits I get a „We are sorry, the page you are looking for was not found.“ message. --Succu (talk)
Succu, I generally find that happens when an entry is new enough in the NIOSHTIC database that it has a listing in the search engine but not a corresponding static page. However if you search NIOSHTIC for the date range during which the article was published, the article will still show up in the search results. I would link to search results, but it's one of those search engines where the results expire. (Frustrating, I know.) Harej (talk) 05:21, 13 August 2018 (UTC)
Why not omit them until the changes are online? Sorry for the delayed answer, Harej. --Succu (talk) 19:06, 26 August 2018 (UTC)
Succu, I have no way of distinguishing between which ones are online and which ones aren't. They show up in the search engine results anyway, so I would consider them valid assigned numbers. Harej (talk) 19:42, 26 August 2018 (UTC)
Load the page and look for „We are sorry, the page you are looking for was not found.“ I think if this string is not present all is fine. --Succu (talk) 19:47, 26 August 2018 (UTC)
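That screening could be a couple of lines, e.g. the sketch below; the URL builder is hypothetical, since the exact NIOSHTIC-2 static-page URL pattern isn't shown here.

# Sketch of the "page not found" screening suggested above.
import requests

SENTINEL = 'We are sorry, the page you are looking for was not found.'

def nioshtic_page_is_live(url):
    # Fetch the NIOSHTIC-2 static page and check for the "not found" sentinel string.
    response = requests.get(url, timeout=30)
    return response.ok and SENTINEL not in response.text

# usage sketch: skip (or flag) the entry until its static page is online
# if not nioshtic_page_is_live(build_nioshtic_url(nioshtic_number)):  # build_nioshtic_url is hypothetical
#     continue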

Ymblanter, do you have any further questions or concerns regarding NIOSH bot? Harej (talk) 19:03, 26 August 2018 (UTC)

I am going to sleep now, I hope you will resolve the above issue by tomorrow, and then I will approve the bot.--Ymblanter (talk) 20:33, 26 August 2018 (UTC)

neonionbot[edit]

neonionbot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Jkatzwinkel (talkcontribslogs)

Task/s: Map semantic annotations made with annotation software neonion to wikidata statements in order to submit either bibliographical evidence, additional predicates or new entities to wikidata. Annotation software neonion is used for collaborative semantic annotating of academic publications. If a text resource being annotated is an open access publication and linked to a wikidata item page holding bibliographical metadata about the corresponding open access publication, verifiable contributions can be made to wikidata by one of the following:

  1. For a semantic annotation, identify an equivalent wikidata statement and provide bibliographical reference for that statement, linking to the item page representing the publication in which the semantic annotation has been created.
  2. If a semantic annotation provides new information about an entity represented by an existing wikidata item page, create a new statement for that item page containing the predicate introduced by the semantic annotation. Attach bibliographic evidence to the new statement analogously to scenario #1.
  3. If a semantic annotation represents a fact about an entity not yet represented by a wikidata item page, create an item page and populate it with at least a label and a P31 statement in order to meet the requirements for scenario #2. Provide bibliographical evidence as in scenario #1.
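As an illustration of scenario #1 (with the fall-through into scenarios #2/#3), attaching such a bibliographical reference with Pywikibot could look roughly like this; the use of stated in (P248) and all QIDs are illustrative assumptions, not a description of neonion's actual implementation.

# Illustrative sketch of scenario #1: attach a reference pointing to the item of the
# annotated open access publication. Not neonion code; P248 and the QIDs are assumptions.
import pywikibot

site = pywikibot.Site('wikidata', 'wikidata')
repo = site.data_repository()

def add_publication_reference(subject_qid, prop, target_qid, publication_qid):
    subject = pywikibot.ItemPage(repo, subject_qid)
    subject.get()
    for claim in subject.claims.get(prop, []):
        target = claim.getTarget()
        if not isinstance(target, pywikibot.ItemPage) or target.getID() != target_qid:
            continue
        ref = pywikibot.Claim(repo, 'P248')  # stated in: the publication item
        ref.setTarget(pywikibot.ItemPage(repo, publication_qid))
        claim.addSources([ref], summary='add bibliographical evidence from annotated publication')
        return True       # scenario #1: statement existed, reference added
    return False          # scenario #2/#3: statement (or item) has to be created first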


Code: Implementation of this feature will be published on my neonion fork on github.

Function details: Prerequisite: Map model of neonion's controlled vocabulary to terminological knowledge extracted from wikidata. Analysis of wikidata instance/class relationships ensures that concepts of controlled vocabulary can be mapped to item pages representing wikidata classes.

Task 1: Identify item pages and possibly statements on wikidata that are equivalent to the information contained in semantic annotations made in neonion.

Task 2: Based on the results of task 1, determine if it is appropriate to create additional content on wikidata in form of new statements or new item pages. For the statements at hand, provide an additional reference representing bibliographical evidence referring to the wikidata item page representing the open access publication in which neonion created the semantic annotation.

What data will be added? The proposed scenario is meant to be tried first on articles published in the scientific open-access journal Apparatus. --Jkatzwinkel (talk) 06:15, 19 October 2017 (UTC)

I find this proposal very hard to understand without seeing an example - can you run one or mock one (or several) up using the neonionbot account so we can see what it would likely do? ArthurPSmith (talk) 13:12, 19 October 2017 (UTC)

Handelsregister[edit]

Handelsregister (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: SebastianHellmann (talkcontribslogs)

Task/s: Crawl https://www.handelsregister.de/rp_web/mask.do, then go to UT (Unternehmenstraeger) and add an entry for each German organisation with the basic info, especially the registering court and the ID assigned by the court, into Wikidata.

Code: The code is a fork of https://github.com/pudo-attic/handelsregister (small changes only)

Function details:

Task 1, prerequisite for Task 2 Find all current organisations in Wikidata that are registered in Germany and find the corresponding Handelsregister entry. Then add the data to the respective Wikidata items.

What data will be added? The Handelsregister collects information from all German courts, where all organisations in Germany are obliged to register. The data is passed from the courts to a private company running the Handelsregister, which makes part of the information public (i.e. UT - Unternehmenstraegerdaten, core data) and sells the other part. Each organisation can be uniquely identified by the registering court and the number assigned by this court (the number alone is not enough, as two courts might assign the same number). Here is an example of the data:

  • Saxony District court Leipzig HRB 32853 – A&A Dienstleistungsgesellschaft mbH
  • Legal status: Gesellschaft mit beschränkter Haftung
  • Capital: 25.000,00 EUR
  • Date of entry: 29/08/2016
  • (When entering date of entry, wrong data input can occur due to system failures!)
  • Date of removal: -
  • Balance sheet available: -
  • Address (subject to correction): A&A Dienstleistungsgesellschaft mbH
  • Prager Straße 38-40
  • 04317 Leipzig

Most items are stable, i.e. each org is registered when it is founded and assigned a number by the court: Saxony District court Leipzig HRB 32853. After that, only the address and the status can change. For Wikidata, it is no problem to keep companies that no longer exist, as they should be preserved for historical purposes.

Maintenance should be simple: once a Wikidata item contains the correct court and the number, the entry can be matched 100% to the entry in the Handelsregister. This way the Handelsregister can be queried once or twice a year to update the info in Wikidata.
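In code terms, the reconciliation could hinge on a composite key, along the lines of the sketch below; the normalisation rules are only an illustration, and the court/number values are assumed to come from the crawler and from a query over the (yet to be decided) Wikidata properties.

# Sketch of matching on the composite key (registering court, register number).
# Normalisation rules are illustrative; the properties from Question 2 do not exist yet.
def handelsregister_key(court, number):
    # e.g. ("Amtsgericht Leipzig", "HRB 32853") and ("amtsgericht leipzig", "HRB32853") match
    return (court.strip().lower(), number.replace(' ', '').upper())

def build_index(wikidata_rows):
    # wikidata_rows: iterable of (qid, court, number) extracted from existing items.
    return {handelsregister_key(court, number): qid for qid, court, number in wikidata_rows}

def match(index, crawled_entries):
    # crawled_entries: iterable of dicts produced by the Handelsregister crawler.
    for entry in crawled_entries:
        qid = index.get(handelsregister_key(entry['court'], entry['number']))
        yield entry, qid  # qid is None when no Wikidata item carries this key yet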

Question 1: bot or other tool. How is the data added? I am keeping the bot request, but I will look at Mix and Match first. Maybe this tool is better suited for task 1.

Question 2: modeling. Which properties should be used in Wikidata? I am particularly looking for the property for the court as the registering organisation, i.e. the one that has the authority to define the identity of an org, and then also for the number (HRB 32853). The types, i.e. legal status, can be matched to existing Wikidata entries; most exist in the German Wikipedia. Any help with the other properties is appreciated.

Question 3: legal. I still need to read up on the legal situation for importing crawled data. Here is a hint given on the mailing list:

https://en.wikipedia.org/wiki/Sui_generis_database_rights You'd need to check whether in Germany it applies to official acts and registers too... https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights

Task 2 Add all missing identifiers for the remaining orgs in the Handelsregister. Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.

It should meet notability criteria 2: https://www.wikidata.org/wiki/Wikidata:Notability

  • 2. It refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references. If there is no item about you yet, you are probably not notable.

The reference is the official German business registry, which is serious and public. Orgs are also by definition clearly identifiable legal entities.

--SebastianHellmann (talk) 07:39, 16 October 2017 (UTC)

Could you make a few example entries to illustrate what the items you want to create will look like? What strategy will you use to avoid creating duplicate items? ChristianKl (talk) 12:38, 16 October 2017 (UTC)
I think this is a good idea, but I agree there needs to be a clear approach to avoiding creating duplicates - we have hundreds of thousands of organizations in wikidata now, many of them businesses, many from Germany, so there certainly should be some overlap. Also I'd like to hear how the proposer plans to keep this information up to date in future. ArthurPSmith (talk) 15:13, 16 October 2017 (UTC)
There was a discussion on the mailing list. It would be easier to complete the info for existing entries in Wikidata at first. I will check mix and match for this or other methods. Once this space is clean, we can rediscuss creating new identifiers. SebastianHellmann (talk) 16:01, 16 October 2017 (UTC)
Is there an existing ID that you plan to use for authority control? Otherwise, do we need a new property? ChristianKl (talk) 20:40, 16 October 2017 (UTC)
I think that the ID needs to be combined, i.e. registering court and register number. That might be two properties. SebastianHellmann (talk) 16:05, 29 November 2017 (UTC)
  • Given that this data is fairly frequently updated, how is it planned to maintain it?
    --- Jura 16:38, 16 October 2017 (UTC)
* The frequency of updates is indeed high: a search for deletion announcements alone in the limited timeframe of 1.9.-15.10.17 finds 6682 deletion announcements (which legally is the most serious change and makes up approx. 10% of all announcements). So within one year, more than 50,000 companies are deleted - which for sure should be reflected in the corresponding Wikidata entries. Jneubert (talk) 15:44, 17 October 2017 (UTC)
Hi all, I updated the bot description, trying to answer all questions from the mailing list and here. I still have three questions, which I am investigating. Help and pointers highly appreciated. SebastianHellmann (talk) 23:36, 16 October 2017 (UTC)
  • Given that German is the default language in Germany I would prefer the entry to be "Sachsen Amtsgericht Leipzig HRB 32853" instead of "Saxony District court Leipzig HRB 32853". Afterwards we can store that as an external ID and make a new property for that (which would need a property proposal). ChristianKl (talk) 12:33, 17 October 2017 (UTC)
Thanks for the updated details here. It sounds like a new identifier property may be needed (unless one of the existing ones like Legal Entity ID (P1278) suffices, but I suspect most of the organizations in this list do not have LEI's (yet?)). Ideally an identifier property has some way to turn the identifiers into a URL link with further information on that particular identified entity, that de-referenceability makes it easy to verify - see "formatter URL" examples on some existing identifier properties. Does such a thing exist for the Handelsregister? ArthurPSmith (talk) 14:58, 17 October 2017 (UTC)

Kopiersperre Jklamo ArthurPSmith S.K. Givegivetake fnielsen rjlabs ChristianKl Vladimir Alexiev User:Pintoch Parikan User:Cardinha00 User:zuphilip MB-one User:Simonmarch User:Jneubert Mathieudu68 User:Kippelboy User:Datawiki30

Pictogram voting comment.svg Notified participants of WikiProject Companies for input.

@SebastianHellmann: for task 1, you might also be interested in OpenRefine (make sure you use the German reconciliation interface to get better results). See https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation for details of its reconciliation features. I suspect your dataset might be a bit big though: I think it would be worth trying only on a subset (for instance, filter out those with a low capital). − Pintoch (talk) 14:52, 20 October 2017 (UTC)

Concerning Task 2, I'm a bit worried about the companies' notability (or lack thereof), since the Handelsregister includes any and all companies. Not just the big ones where there's a good chance that Wikipedia articles, other sources, external IDs, etc. exist, but also tiny companies and even one-person companies, like someone selling stuff on Ebay or some guy selling Christmas trees in his village. So it would be very hard to find any data on these companies outside the Handelsregister and the phone book. --Kam Solusar (talk) 05:35, 21 October 2017 (UTC)

Agreed. Do we really need to be a complete copy of the Handelsregister? What for? How about concentrating on a meaningful subset instead that addresses a clear usecase? --LydiaPintscher (talk) 10:35, 21 October 2017 (UTC)
That of course is true. A strict reading of Wikidata:Notability could be seen as requiring at least two reliable sources. But then, that could be the phone book. Do we have to make those criteria more strict? That would require an RfC. Lymantria (talk) 07:58, 1 November 2017 (UTC)
I would at least try an RfC, but I am not immediately sure what to propose.--Ymblanter (talk) 08:05, 1 November 2017 (UTC)
If there's an RfC I would say that it should say that for data-imports of >1000 items the decision whether or not we import the data should be done via a request for bot permissions. ChristianKl (talk) 12:35, 4 November 2017 (UTC)
@SebastianHellmann: is well-intended, but I agree not all companies are notable. Even worse than one-man shops are inactive companies that nobody bothered to close yet. Just "comes from a reputable source" is not enough: e.g. OpenStreetMap is reputable, and it would be OK to import all power stations (e.g. see Enipedia) but imho not OK to import all recyclable garbage cans. We got 950k BG companies at http://businessgraph.ontotext.com/ but we are hesitant to dump them on Wikidata. Unfortunately official trade registers usually lack measures of size or importance...
It's true that Project Companies has not gelled yet and there's no clear community of use for this data. On the other hand, if we don't start somewhere and experiment, we may never get big quantities of company data. So I'd agree to this German data dump by way of experiment. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)

Kopiersperre Jklamo ArthurPSmith S.K. Givegivetake fnielsen rjlabs ChristianKl Vladimir Alexiev User:Pintoch Parikan User:Cardinha00 User:zuphilip MB-one User:Simonmarch User:Jneubert Mathieudu68 User:Kippelboy User:Datawiki30

Pictogram voting comment.svg Notified participants of WikiProject Companies Pictogram voting comment.svg Comment As best I know, Project Companies has yet to gel a workable (for the immediate term) notability standard, so the area remains fuzzy. Here is my current thinking [[22]]. Very much like the above automation of updates. Hopefully the fetching scripts for Germany can be generalized to work in most developed countries that publish structured data on public companies. Would love to find WikiData consensus on notability vs. its IT capacity and stomach for volumes of basically tabular data. Rjlabs (talk) 16:47, 3 November 2017 (UTC)

  • @Rjlabs: That hope is not founded because each jurisdiction does its own thing. OpenCorporates has a bunch of web crawling scripts (some of them donated) that they consider a significant IP. And as @SebastianHellmann: wrote their data is sorta open but not really. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • I Symbol support vote.svg Support importing the data. Having the data makes it easier to enter the employer when we create items for new people. Companies also engage in other actions that leave marks in databases, such as registering patents or trademarks, and it makes it easier to import such data when we already have items for the companies. The ability to run queries about the companies that are located in a given area is useful. ChristianKl (talk) 17:20, 3 November 2017 (UTC)
    • @ChristianKl: at least half of the 200M or so companies world-wide will never have notable employees nor patents, so "let's import them just in case" is not a good policy --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • When it comes to these mass imports I would only want to mass import datasets about companies from authoritative sources. If we talk about a country like Uganda, I think it would be great to have an item for all companies that truly exist in Uganda. People in Uganda care about the companies that exist in their country, and their government might not have the capability to host that data in a user-friendly way. An African app developer could profit from the existence of a unique identifier that's the same for multiple African countries.
When it comes to the concern about data not being up-to-date, there were multiple cases where I would have really liked data about 19th century companies while doing research in Wikidata. Having data that's kept up-to-date is great, but having old data is also great. ChristianKl () 20:11, 13 December 2017 (UTC)
  • @Rjlabs: We did go back and forth with a lot of ideas on how to set some sort of criteria for company notability. I think any public company with a stock market listing should be considered notable, as there's a lot of public data available on those. For private companies we talked about some kind of size cutoff, but I suppose the existence of 2 or more independent reference sources with information about the company might be enough? ArthurPSmith (talk) 18:01, 3 November 2017 (UTC)
  • @ArthurPSmith:@Denny:@LydiaPintscher: Arthur, let's make it any public company that trades on a recognized stock exchange, anywhere worldwide, with a continuous bid and ask quote, and that actually trades at least once per week, automatically considered "notable" for WikiData inclusion. This is by virtue of the fact that real people wrote real checks to buy shares, there is sufficient continuing trading interest in the stock to make it trade at least once per week, and some exchange somewhere allows that firm to be listed. We should also note that passing this hurdle means that SOME data on that firm is automatically allowable on WikiData, provided the data is regularly updated. Rjlabs (talk) 19:35, 3 November 2017 (UTC)
    • @Rjlabs, Denny, LydiaPintscher: Public Companies are a no-brainer because there's only 60k in the world (there are about 2.6k exchanges); compare to about 200M companies world-wide. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)
  • Some data means (for right now) information like LEI, name, address, phone, industry code(s), a brief text description of what they do, plus about 10 high-level fields that cover the most frequently needed company data (such as: sales, employees, assets, principal exchange(s) down to where at least 20% of the volume is traded, unique symbol on that exchange, CEO, URL to the investor relations section of the website where detailed financial statements may be found, Central Index Key (or equivalent) with a link to regulatory filings / structured data in the primary country where it's regulated). For now that is all that should be "automatically allowable". No detailed financial statements, line by line, going back 10-20 years, with adjustments for stock splits, etc. No bid/offer/last trade time series. Consensus on further detail has to wait for further gelling up. I ping Lydia and Denny here to be sure they are good with this potential volume of linked data. (I think it would be great, a good start and limited. I especially like it if it MANDATES LEI, if one is available.) Moving down from here (after 100% of public companies that are alive enough to actually trade) there is of course much more. However it's a very murky area. >=2 independent reference sources with information about the company might be too broad, causing WikiData capacity issues, or it may be too burdensome if someone has a structured data source that is much more reliable than WikiData to feed in, but lacks that "second source". Even if it was one absolutely assured good-quality source, and WikiData capacity was not an issue, I'd like to see a "sustainability" requirement up front. Load no private company data where it isn't AUTOMATICALLY updated or expired out. Again, would be great to have further Denny/Lydia input here on any capacity concern. Rjlabs (talk) 19:35, 3 November 2017 (UTC)
    • "A modicum of data" as you describe above is a good criterion for any company. --Vladimir Alexiev (talk)
    • On WikidataCon there was a question from the audience about whether Wikidata would be okay with importing the 400 million entries about items in museums that are currently managed by various museums. User:LydiaPintscher answered by saying that her main concerns aren't technical but whether our community would do well with handling a huge influx of items. Importing data like the Handelsregister will mean that there will be a lot of items that won't be touched by humans, but I don't think that's a major concern for our community. Having more data means more work for our community, but it also means that new people get interested in interacting with Wikidata. When we make decisions like this, however, technical capabilities do matter. I think it would be great if a member of the development team would write a longer blog post that explains the technical capabilities, so that we can better factor them into our policy decisions. ChristianKl (talk) 12:35, 4 November 2017 (UTC)
I agree with Lydia. The issue is hardly the scalability of the software - the software is designed in such a way that there *should* not be problems with 400M new items. The question is do we have a story as a community to ensure that these items don't just turn into dead weight. Do we ensure that items in this set are reconciled with existing items if they should be? That we can deal with attacks on that dataset in some way, with targeted vandalism? Whether the software can scale, I am rather convinced. Whether the community can scale, I think we need to learn that.
Also, for the software, I would suggest not to grow 10x at once, but rather to increase the total size of the database with a bit more measure, and never to more than double it in one go. But this is just, basically, for stress-testing it, and to discover, if possible, early unexpected issues. But the architecture itself should accommodate such sizes without much ado (again - "should" - if we really go for 10x, I expect at least one unexpected bug to show up). --Denny (talk) 23:25, 5 November 2017 (UTC)
Speaking of the community being able to handle dead weight, it seems we mostly lack the tools to do so. Currently we are somewhat flooded by items from cebwiki, and despite efforts by individual users to deal with one or the other problem, we still haven't tackled them systematically, and this has led to countless items with unclear scope complicating every other import.
--- Jura 07:00, 6 November 2017 (UTC)
I don't think we should just add 400M new items in one go either. I don't think that the amount of vandalism that Wikidata faces scales directly with the number of items that we host: if we double the number of items, we don't double the amount of vandalism.
As far as the cebwiki items go, the problem isn't just that there are many items. The problem is that there's unclear scope for a lot of the items. For me that means that when we allow massive data imports we have to make sure that the imported data is of high enough quality that the scope of every item is clear. This means that having a bot approval process for such data imports is important, and suggests to me that we should also get clear about the necessity of having a bot approval for creating a lot of items via QuickStatements.
Currently, we are importing a lot of items via WikiCite and it seems to me that process is working without significant issues.
I agree that scaling the community should be a higher priority than scaling the number of items. One implication of that is that it makes sense to have higher standards for mass imports via bots than for items added by individuals (a newbie is more likely to become involved in our community when we don't greet them by deleting the items they created).
Another implication is that the metric we celebrate shouldn't be focused on the number of items or statements per item but on the number of active editors. ChristianKl () 09:58, 20 November 2017 (UTC)

Now what?[edit]

Lots of good discussion above. Would anyone care to summarize, and how do we move to a decision? --Vladimir Alexiev (talk) 15:10, 5 December 2017 (UTC)

  • Some seem to consider it too granular. Maybe a test could be done with a subset. If no other criteria can be determined, maybe a start could be with companies with a capital > EUR 100 mio.
    --- Jura 20:21, 13 December 2017 (UTC)

Jntent's Bot 1[edit]

Jntent's Bot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Jntent (talkcontribslogs)

Task/s:

The task is to add assertions about airports from template pages.

Code:

The code is based on pywikibot's harvest_templates.py  under scripts in https://github.com/wikimedia/pywikibot-core

Function details:


I added some constraints for literal values with regular expressions to parse "Infobox Airport" and similar templates in other languages. See the constraints listed below.

I hope to scrape the airport templates from a few languages. The "Infobox Airport" template contains links to pages about airport codes. Here is an example of the template:

{{Infobox airport
| name         = Denver International Airport
| image        = Denver International Airport Logo.svg
| image-width  = 250
| image2       = DIA Airport Roof.jpg
| image2-width = 250
| IATA         = DEN
| ICAO         = KDEN
| FAA          = DEN
| WMO          = 72565
| type         = Public
| owner        = City & County of Denver Department of Aviation
| operator     = City & County of Denver Department of Aviation
| city-served  = [[Denver]], the [[Front Range Urban Corridor]], Eastern Colorado, Southeastern Wyoming, and the [[Nebraska Panhandle]]
| location     = Northeastern [[Denver]], [[Colorado]], U.S.
| hub          =
...
}}

I will use links to pages about airport codes to find airports. One example is:

https://en.wikipedia.org/wiki/International_Air_Transport_Association_airport_code

Template element, property and constraining regex (from the properties):
  • IATA: Property:P238, regex [A-Z]{3}
  • ICAO: Property:P239, regex ([A-Z]{2}|[CKY][A-Z0-9])[A-Z0-9]{2}
  • FAA: Property:P240, regex [A-Z0-9]{3,4}
  • coordinates: Property:P625, 6 numbers and 2 cardinalities surrounded by "|" from the coord template: {{coord|39|51|42|N|104|40|23|W|region:US-CO|display=inline,title}}
  • city-served: Property:P931, the first valid link (standard harvest_templates.py behavior)
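A sketch of how those constraints could be enforced on the harvested values (the regexes are copied from the table above; the validate helper itself is illustrative and not part of harvest_templates.py):

# Illustrative validation of harvested infobox values against the constraints above;
# not part of harvest_templates.py itself.
import re

CONSTRAINTS = {
    'P238': r'[A-Z]{3}',                             # IATA
    'P239': r'([A-Z]{2}|[CKY][A-Z0-9])[A-Z0-9]{2}',  # ICAO
    'P240': r'[A-Z0-9]{3,4}',                        # FAA
}

def validate(prop, value):
    # Accept a harvested value only if the whole string matches the constraining regex.
    pattern = CONSTRAINTS.get(prop)
    return pattern is None or re.fullmatch(pattern, value.strip()) is not None

assert validate('P238', 'DEN')
assert validate('P239', 'KDEN')
assert not validate('P238', 'DEN, [[Denver]]')  # reject unparsed infobox leftovers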

 – The preceding unsigned comment was added by Jntent (talk • contribs).

  • Pictogram voting comment.svg Comment I think there were some problems with these infoboxes in one language. Not sure which one it was. Maybe Innocent bystander recalls (I think he mentioned it once).
    --- Jura 11:28, 8 July 2017 (UTC)
    Well, I am not sure if I (today) remember any such problems. But it could be worth mentioning that these codes can also be found in sv:Mall:Geobox and ceb:Plantilya:Geobox, which are used in the Lsjbot articles. These templates are not specially adapted to airports, but Lsj used the same template for this group of articles as well. The Swedish template has special parameters for this ("IATA-kod" and "ICAO-kod") while the cebwiki articles use a parameter "free" and "free_type". (Could be worth checking free1, free2 too.) See ceb:Coyoles (tugpahanan) as an example. -- Innocent bystander (talk) 15:17, 8 July 2017 (UTC)
  • @Jntent: in this edit I see the bot replaced FDKS with FDKB, while in the en.wp infobox and lead section there are two values for the ICAO code: FDKS/FDKB. I would suggest not changing any existing value, or, if changed, these should probably be checked manually. The safest way to act here would be to just add missing values. XXN, 14:07, 17 July 2017 (UTC)
  • @Jntent: Still interested? Matěj Suchánek (talk) 09:21, 3 August 2018 (UTC)


WikiProjectFranceBot[edit]

WikiProjectFranceBot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Alphos (talkcontribslogs)

Task/s: Replace all located in the administrative territorial entity (P131) statements pointing from communes of France to cantons of France by territory overlaps (P3179) statements pointing from the same communes to the same cantons, including qualifiers (there are currently only date qualifiers), and adding a P794 qualifier on each new statement to indicate the subclass of canton.

Code: Partially available (for the first step) on GitHub

Function details: As has been the plan of WikiProject France since we proposed properties to better reflect the relationship between communes and cantons of France, we're now getting to actually push all the statements corresponding to these relationships from located in the administrative territorial entity (P131) to territory overlaps (P3179), and add the exact kind of P3179 this represents as qualifiers to said statements, without removing the original statements at first. Roughly 80 000 edits are to be expected.

At a later date, after checking everything went fine on the first pass, we plan on removing the (faulty) P131 statements between communes and cantons entirely, which will also be done by this bot.
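A rough sketch of the per-statement move, assuming Pywikibot and driven by the SPARQL results shown further down in this request; only the end time (P582) qualifier is copied here, and the P794 value would be canton of France (Q18524218) or canton of France (until 2015) (Q184188) depending on the canton type.

# Sketch of moving one commune-canton relation from P131 to P3179 with a P794 qualifier.
# Assumes Pywikibot; not the exact WikiProjectFranceBot code.
import pywikibot

site = pywikibot.Site('wikidata', 'wikidata')
repo = site.data_repository()

def move_statement(commune_qid, canton_qid, canton_type_qid, end_time=None):
    commune = pywikibot.ItemPage(repo, commune_qid)
    commune.get()
    new_claim = pywikibot.Claim(repo, 'P3179')            # territory overlaps
    new_claim.setTarget(pywikibot.ItemPage(repo, canton_qid))
    commune.addClaim(new_claim, summary='move commune/canton relation from P131 to P3179')
    kind = pywikibot.Claim(repo, 'P794')                  # "as": kind of canton
    kind.setTarget(pywikibot.ItemPage(repo, canton_type_qid))
    new_claim.addQualifier(kind)
    if end_time is not None:                              # copy the date qualifier, if any
        until = pywikibot.Claim(repo, 'P582')             # end time
        until.setTarget(end_time)                         # a pywikibot.WbTime
        new_claim.addQualifier(until)
    # the original P131 statement is only removed in the later second pass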

VIGNERON
Mathieudu68
Ayack
Aga
Ash Crow
Tubezlob
PAC2
Thierry Caro
Pymouss
Pintoch
Alphos
Nomen ad hoc
GAllegre
Jean-Frédéric
Manu1400
Thibdx
Pictogram voting comment.svg Notified participants of WikiProject France

--Alphos (talk) 20:00, 8 May 2017 (UTC)

@Alphos: Could you provide an example please? Thanks. — Ayack (talk) 09:05, 9 May 2017 (UTC)
Of course.
Nielles-lès-Bléquin (Q1000003) located in the administrative territorial entity (P131) canton of Lumbres (Q1726007)
would be replaced by :
Nielles-lès-Bléquin (Q1000003) territory overlaps (P3179) canton of Lumbres (Q1726007) (no label (P794) canton of France (Q18524218))
and
Sainte-Croix (Q1002122) located in the administrative territorial entity (P131) canton of Montluel (Q1726339) (end time (P582) 2015-03-21)
would be replaced by :
Sainte-Croix (Q1002122) located in the administrative territorial entity (P131) canton of Montluel (Q1726339) (end time (P582) 2015-03-21 ; no label (P794) canton of France (until 2015) (Q184188))
Other "examples" (in fact the whole list) can be found here :
The following query uses these:
  • Properties: subclass of (P279), instance of (P31), located in the administrative territorial entity (P131)

    SELECT DISTINCT ?commune ?canton ?qualProp ?time ?precision ?timezone ?calendar WHERE {
      ?commune p:P31/ps:P31/wdt:P279* wd:Q484170 .
      ?commune p:P131 ?cantonStmt .
      ?cantonStmt ps:P131 ?canton .
      ?canton wdt:P31 ?cantonType .
      VALUES ?cantonType { wd:Q18524218 wd:Q184188 } .
      OPTIONAL {
        ?cantonStmt ?qualifier ?qualVal .
        ?qualProp wikibase:qualifierValue ?qualifier .
        ?qualVal wikibase:timePrecision ?precision ;
                 wikibase:timeValue ?time ;
                 wikibase:timeTimezone ?timezone ;
                 wikibase:timeCalendarModel ?calendar ;
      }
    }
    ORDER BY ASC(?commune) ASC(?canton)

(which is what the bot works on)
Alphos (talk) 09:44, 9 May 2017 (UTC)
Symbol support vote.svg Support The query seems good to me. Can you run a sample batch? -Ash Crow (talk) 18:26, 14 May 2017 (UTC)
The query is undeniably good, but I noticed an issue with edge cases on cantons with double status, working on it and running a small batch (LIMIT 20 or maybe a small french departement), probably later this week. Alphos (talk) 00:05, 16 May 2017 (UTC)
Symbol support vote.svg SupportAyack (talk) 09:02, 16 May 2017 (UTC)
Please, let the bot run a couple of test edits. Besides, please, create the user page of the bot account (e.g. {{bot|Alphos}}). Lymantria (talk) 20:40, 25 June 2017 (UTC)
@Alphos: Any progress to be expected? Lymantria (talk) 13:51, 31 May 2018 (UTC)

Jefft0Bot[edit]

Jefft0Bot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Jefft0 (talkcontribslogs)

Task/s: Add references to external ontologies

Code:

Function details: Add equivalent class (P1709) for an external ontology when that ontology already defines mappings to Wikipedia or Wikidata.
For example, Umbel version 1.50 has mappings to Wikipedia here: https://raw.githubusercontent.com/structureddynamics/UMBEL/d3d1d6c0a566fed335fecfadb75f5501437f9163/External%20Ontologies/wikipedia.n3
such as
<http://umbel.org/umbel/rc/MaoriLanguage> umbel:isRelatedTo <http://wikipedia.org/wiki/Māori_language> .
and that Wikipedia page links to Wikidata item Māori (Q36451). So this item should have equivalent class (P1709) to http://umbel.org/umbel/rc/MaoriLanguage with a reference URL (P854) to the file above. --Jefft0Bot (talk) 15:15, 17 April 2017 (UTC)
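A sketch of that lookup chain with Pywikibot; the N3 parsing is left out and the helper below just takes an already-extracted pair of Umbel URI and Wikipedia title, so treat it as illustrative only.

# Illustrative sketch: resolve a Wikipedia title to its Wikidata item and add
# equivalent class (P1709) with a reference URL (P854). Not the final bot code.
import pywikibot

MAPPING_FILE = ('https://raw.githubusercontent.com/structureddynamics/UMBEL/'
                'd3d1d6c0a566fed335fecfadb75f5501437f9163/External%20Ontologies/wikipedia.n3')

repo = pywikibot.Site('wikidata', 'wikidata').data_repository()
enwiki = pywikibot.Site('en', 'wikipedia')

def add_equivalent_class(umbel_uri, wikipedia_title):
    page = pywikibot.Page(enwiki, wikipedia_title)
    item = pywikibot.ItemPage.fromPage(page)     # follow the sitelink to the Wikidata item
    item.get()
    claim = pywikibot.Claim(repo, 'P1709')       # equivalent class (URL datatype)
    claim.setTarget(umbel_uri)
    item.addClaim(claim, summary='add equivalent class from UMBEL wikipedia.n3 mapping')
    ref = pywikibot.Claim(repo, 'P854')          # reference URL
    ref.setTarget(MAPPING_FILE)
    claim.addSources([ref])

# e.g. add_equivalent_class('http://umbel.org/umbel/rc/MaoriLanguage', 'Māori language')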

Please make several test edits.--Ymblanter (talk) 19:48, 28 July 2017 (UTC)
@Jefft0: Still interested? Matěj Suchánek (talk) 09:18, 3 August 2018 (UTC)

MexBot 2[edit]

MexBot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: MarcAbonce (talkcontribslogs)

Task/s: Add official population data for Mexican municipalities.

Code: https://gitlab.com/a01200356/MexBot/blob/master/poblaciones.py

Function details:
The script finds all Mexican municipalities with an INEGI municipality ID and gets all the official population data available from the API of INEGI (the Mexican public institute that conducts the census).
It will either add or update this data, with INEGI as the source.
It will also add census as the determination method for years ending in 0, when the census is carried out.
MarcAbonce (talk) 21:45, 8 June 2017 (UTC)
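For comparison, adding one such population value with Pywikibot would look roughly like the sketch below (poblaciones.py has its own implementation; the reference here is reduced to a stated in (P248) claim, and the QIDs marked as placeholders are not the real items for "census" and INEGI).

# Rough sketch of one population (P1082) update with qualifiers and an INEGI reference.
# Not the poblaciones.py code; placeholder QIDs must be replaced with the real items.
import pywikibot

site = pywikibot.Site('wikidata', 'wikidata')
repo = site.data_repository()

def add_population(item_qid, population, year, is_census):
    item = pywikibot.ItemPage(repo, item_qid)
    item.get()
    claim = pywikibot.Claim(repo, 'P1082')                    # population
    claim.setTarget(pywikibot.WbQuantity(amount=population, site=repo))
    item.addClaim(claim, summary='import population from INEGI API')
    when = pywikibot.Claim(repo, 'P585')                      # point in time
    when.setTarget(pywikibot.WbTime(year=year, precision=9))  # year precision
    claim.addQualifier(when)
    if is_census:                                             # census years end in 0
        method = pywikibot.Claim(repo, 'P459')                # determination method
        method.setTarget(pywikibot.ItemPage(repo, 'Q00001'))  # placeholder QID for "census"
        claim.addQualifier(method)
    source = pywikibot.Claim(repo, 'P248')                    # stated in
    source.setTarget(pywikibot.ItemPage(repo, 'Q00002'))      # placeholder QID for INEGI
    claim.addSources([source])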

Symbol support vote.svg Support --PokestarFan • Drink some tea and talk with me • Stalk my edits • I'm not shouting, I just like this font! 23:16, 8 June 2017 (UTC)
Pictogram voting comment.svg Comment: Under which license INEGI publishes population data? XXN, 14:41, 9 June 2017 (UTC)
Not explicitly stated, but it is like CC BY; see point f in the section "Del libre uso de la información del INEGI" of the Términos de uso. I don't think it is compatible. --ValterVB (talk) 17:35, 9 June 2017 (UTC)
Indeed, it only requires attribution, which is precisely what my script intends to add. Why would it be incompatible? Most of this data has already been manually added by people, and apparently by a Wikipedia scraping script too, but it's mostly unsourced. --MarcAbonce (talk)
Here we use CC0; if data here needs citation, the data is incompatible with the license. --ValterVB (talk) 05:47, 11 June 2017 (UTC)
Can census data even be licensed, though? As far as I know, facts cannot be licensed anywhere. If this is the case, this license would only be enforceable with the statistical data they generate (which I'm not using) but it wouldn't be enforceable for a simple, "natural" fact such as a total population.
Also, as I mentioned, this data is already allowed in practice. Wikipedia importing bots have added census data into Wikidata by claiming Wikipedia as the source (which is also CC0 incompatible, by the way), but this data is not generated by Wikipedia, but rather taken from INEGI and imported without source.
So, unless you actually plan to delete all the unsourced and Wikipedia sourced Mexican population data from this site, the most reasonable thing to do would be to treat this data the way it has been treated so far, for the sake of consistency.
--MarcAbonce (talk)
Symbol support vote.svg Support Mexico is outside of the EU and thus there are no sui generis database right concerns. Population data itself is about facts that by their nature aren't protected by copyright. ChristianKl (talk) 09:31, 25 June 2017 (UTC)
The license does not depend on whether Mexico is in or out of the EU. Wikidata uses CC0; INEGI explicitly asks that one "must give credit for the INEGI as an author", so for me they aren't compatible. --ValterVB (talk) 14:32, 25 June 2017 (UTC)

Emijrpbot 8[edit]

Emijrpbot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Emijrp (talkcontribslogs)

Task/s:

The bot adds imported from Wikimedia project (P143) references to Wikinews article (Q17633526) items. In particular, it adds references to the instance of (P31) and language of work or name (P407) statements. See example.

Code: not coded yet

Function details:

The bot uses the sitelink to detect which language version of Wikinews hosts the article, and adds the imported from Wikimedia project (P143) reference accordingly. When there is more than one sitelink, it picks just one (the largest Wikinews edition), based on the number of articles.

--Emijrp (talk) 11:42, 25 March 2017 (UTC)
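Sketched in Pywikibot, the selection logic described above could look roughly like the following; this is not the bot's code (none is written yet, per above), the mapping of Wikinews editions to their Wikidata items is left as a placeholder, and claims that already carry an imported-from reference are skipped.

    # Illustrative sketch only: pick the sitelinked Wikinews edition with the
    # most articles and add an imported from Wikimedia project (P143) reference
    # to the item's P31 and P407 claims.
    import pywikibot

    WIKINEWS_ITEMS = {}   # to fill in: site dbname -> item for that Wikinews edition

    repo = pywikibot.Site('wikidata', 'wikidata').data_repository()

    def article_count(dbname):
        lang = dbname.replace('wikinews', '')
        return pywikibot.Site(lang, 'wikinews').siteinfo['statistics']['articles']

    def add_imported_from(item_id):
        item = pywikibot.ItemPage(repo, item_id)
        item.get()
        links = [db for db in item.sitelinks if db.endswith('wikinews')]
        if not links:
            return
        largest = max(links, key=article_count)
        source_item = pywikibot.ItemPage(repo, WIKINEWS_ITEMS[largest])
        for prop in ('P31', 'P407'):
            for claim in item.claims.get(prop, []):
                if any('P143' in source for source in claim.sources):
                    continue            # already has an imported-from reference
                ref = pywikibot.Claim(repo, 'P143')
                ref.setTarget(source_item)
                claim.addSources([ref])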

For my opinion, see my comment in the previous request for permission. Matěj Suchánek (talk) 17:53, 25 March 2017 (UTC)
  • Comment It's good to add "imported from" as a "source" when importing data from Wikipedia (or Wikinews here), but I don't think it adds much in terms of references. To calculate ratios, one might as well ignore it. For P31, such ratios probably don't add much anyway.
    --- Jura 18:19, 25 March 2017 (UTC)
  • @Matěj Suchánek, Jura1:, are we ready for approval given that the previous one was withdrawn?--Ymblanter (talk) 16:04, 7 April 2017 (UTC)

ZacheBot[edit]

ZacheBot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Zache (talkcontribslogs)

Task/s: Import data from pre-created CSV lists.

Code: based on Pywikibot (Q15169668), sample import scripts [23]

Function details:

--Zache (talk) 23:29, 4 March 2017 (UTC)
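For reference, here is a generic sketch of what a CSV-driven Pywikibot import can look like; the column names and the item-valued property are assumptions for illustration only, not the linked sample scripts [23].

    # Generic, illustrative sketch of a CSV-driven import. Assumes a CSV with
    # columns item,property,value where value is another item ID; the real
    # import scripts may differ substantially.
    import csv
    import pywikibot

    repo = pywikibot.Site('wikidata', 'wikidata').data_repository()

    with open('import.csv', newline='', encoding='utf-8') as fh:
        for row in csv.DictReader(fh):
            item = pywikibot.ItemPage(repo, row['item'])
            item.get()
            if row['property'] in item.claims:
                continue            # simple consistency check: skip if already set
            claim = pywikibot.Claim(repo, row['property'])
            claim.setTarget(pywikibot.ItemPage(repo, row['value']))
            item.addClaim(claim, summary='Import from pre-created CSV list')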

@Zache:, could you please make a couple of test edits? I do not see any lakes in the contributions of the bot.--Ymblanter (talk) 21:20, 14 March 2017 (UTC)
@Zache: Are you still planning to do this task? If so, please provide a few test edits. --Pasleim (talk) 08:13, 11 July 2017 (UTC)
Hi, I did the vaalidatahack import without bot permissions, so that one is done already. The lake import is an ongoing project; it is currently done using QuickStatements for single lakes, and the CC0 licence screening for larger imports is still the same. Most likely there will also be some WLM-related data imports from me this summer, but I am not sure how big (most likely under 2000 items, some of which are updates to existing items and some new). User Susannaanas started this, and I am continuing by filling in the details for the WLM targets. Most likely this WLM work will be done with Pywikibot instead of QuickStatements, because I can do consistency checks in the code. --Zache (talk) 11:12, 11 July 2017 (UTC)

YULbot[edit]

YULbot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: YULdigitalpreservation (talkcontribslogs)

Task/s:

  • YULbot has the task of creating new items for pieces of software that do not yet have items in Wikidata.
  • YULbot will also make statements about those newly-created software items.

Code: I haven't written this bot yet.

Function details:

This bot will set the English-language label for these items and create statements using publisher (P123), ISBN-13 (P212), ISBN-10 (P957), place of publication (P291), and publication date (P577). --YULdigitalpreservation (talk) 18:04, 21 February 2017 (UTC)
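A hedged Pywikibot sketch of that item-creation step might look like the following; the bot has not been written yet (as noted below), and all argument values here are placeholders.

    # Illustrative sketch only: create a new software item, set its English
    # label, and add the listed statements. All argument values are placeholders.
    import pywikibot

    repo = pywikibot.Site('wikidata', 'wikidata').data_repository()

    def create_software_item(label, publisher_qid, place_qid, isbn13, year):
        item = pywikibot.ItemPage(repo)                 # new, empty item
        item.editEntity({'labels': {'en': label}},
                        summary='Create item for software title')

        def add(prop, target):
            claim = pywikibot.Claim(repo, prop)
            claim.setTarget(target)
            item.addClaim(claim)

        add('P123', pywikibot.ItemPage(repo, publisher_qid))   # publisher
        add('P291', pywikibot.ItemPage(repo, place_qid))        # place of publication
        add('P212', isbn13)                                      # ISBN-13
        add('P577', pywikibot.WbTime(year=year))                 # publication date
        return item.getID()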

It would be good to run a test with a few examples so we can see what you're planning! ArthurPSmith (talk) 20:46, 22 February 2017 (UTC)
Interesting. Where does the data come from? Emijrp (talk) 12:04, 25 February 2017 (UTC)
The data is coming from the pieces of software themselves. These are pieces of software in the Yale Library collection. We could also supplement with data from oldversions.com. YULdigitalpreservation (talk) 13:07, 28 February 2017 (UTC)
Please let us know when the bot is ready for approval.--Ymblanter (talk) 21:12, 14 March 2017 (UTC)

YBot[edit]

YBot (talkcontribsnew itemsSULBlock logUser rights logUser rights)
Operator: Superyetkin (talkcontribslogs)

Task/s: import data from Turkish Wikipedia

Code: The bot, currently active on trwiki, uses the Wikibot framework.

Function details: The code imports data (properties and identifiers) from trwiki, aiming to ease the path to Wikidata Phase 3 (having items store the data served in infoboxes) --Superyetkin (talk) 16:42, 12 January 2017 (UTC)

It would be good if you could check for constraint violations instead of just blindly copying data from trwiki. These violations are probably all caused by the bot. --Pasleim (talk) 19:26, 15 January 2018 (UTC)
Yes, I am still interested in this. --Superyetkin (talk) 12:20, 4 March 2018 (UTC)
@Superyetkin: If that is the case, can you take away concerns as indicated by Pasleim, by showing how you'll avoid the constraint violations? Lymantria (talk) 13:53, 31 May 2018 (UTC)
I think I can check for constraint violations using the related API method. --Superyetkin (talk) 17:55, 1 June 2018 (UTC)
@Pasleim: Would that be sufficient? Lymantria (talk) 09:10, 3 June 2018 (UTC)
That API method works only for statements which have already been added to Wikidata. It would be good if some consistency check could be made prior to adding a statement. For example, the unique value constraint of YerelNet village ID (P2123) can be checked by downloading all current values [24], importing them into an array, and then, before saving a statement, checking whether the value is already in the array. The format constraint can be checked in PHP with preg_match(). Item constraints don't need to be checked because they only indicate missing data, not wrong data. --Pasleim (talk) 17:52, 3 June 2018 (UTC)
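To make that suggestion concrete, here is a small sketch of the two pre-checks. It is written in Python rather than the bot's PHP, and the format pattern is a placeholder that would have to match the property's actual format constraint.

    # Illustrative sketch of the suggested pre-checks: load all existing
    # YerelNet village ID (P2123) values once, then test format and uniqueness
    # before saving a statement. FORMAT_RE is a placeholder pattern.
    import re
    import requests

    FORMAT_RE = re.compile(r'^\d+$')     # placeholder; use the real format constraint
    QUERY = 'SELECT ?value WHERE { ?item wdt:P2123 ?value }'

    response = requests.get('https://query.wikidata.org/sparql',
                            params={'query': QUERY, 'format': 'json'})
    existing_values = {b['value']['value']
                       for b in response.json()['results']['bindings']}

    def may_add(value):
        """Return True only if the value is well-formed and not yet in use."""
        return bool(FORMAT_RE.match(value)) and value not in existing_values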