Wikidata:Requests for permissions/Bot/Cewbot 4
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 06:17, 3 March 2022 (UTC)[reply]
Cewbot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Kanashimi (talk • contribs • logs)
Task/s: Import new articles from online resources.
Code: https://github.com/kanasimi/wikibot/blob/master/routine/20210701.import_PubMed_to_wikidata.js
Function details: Please refer to Wikidata:Bot requests/Archive/2020/11#weekly import of new articles (periodic data import). The task will import new articles from PubMed Central, about 30K articles every week. It may also import from other resources in the future. --Kanashimi (talk) 03:49, 27 November 2020 (UTC)[reply]
- What steps are taken to avoid creating duplicates? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 11:49, 27 November 2020 (UTC)[reply]
- PubMed publication ID (P698) will be used to avoid duplicates for articles from PubMed Central. For other resources, the identifier, article title, and author(s) will be checked. --Kanashimi (talk) 20:04, 27 November 2020 (UTC)[reply]
- 1. You should check the DOI too (but some articles do not have a DOI). 2. What source will you use to resolve the authors? Many do not provide enough information (e.g. ORCID) to resolve them.--GZWDer (talk) 05:09, 1 December 2020 (UTC)[reply]
- Thanks for doing these. I don't think complex author resolution is needed, but if it can be done, why not. OTH Inclusion of journal or other publication venue would be useful. Previous imports sometimes skipped them when an item wasn't created (meaning the bot or its operator needs to create it when one is encountered). User:Research_Bot/issues lists a few past problems. GZWDer talk page has a few others. --- Jura 17:12, 1 December 2020 (UTC)[reply]
- @Jura1: Thank you. User:Research_Bot/issues is very useful. @GZWDer: I will also try DOI (P356). If there is no author information, I will skip that check. --Kanashimi (talk) 00:24, 2 December 2020 (UTC)[reply]
- @Jura1: Sorry for the delay. I have coded most of it, but I find the biggest problem is still the volume. For example, there were 15,148 articles on 2021/05/01 (131,359 in total during May 2021). Maybe we could import only some kinds of articles? Or is it better to import them all? --Kanashimi (talk) 07:23, 30 June 2021 (UTC)[reply]
- Comment: I found that LargeDatasetBot also imports data from PubMed, but it is stopped now, so the volume does not seem to be a problem at the moment. I will start some test edits. --Kanashimi (talk) 02:50, 1 July 2021 (UTC)[reply]
- @GZWDer Is LargeDatasetBot still working? If it is, I think this task is not necessary. Kanashimi (talk) 03:34, 1 July 2021 (UTC)[reply]
- Related discussion: Wikidata:Requests for permissions/Bot/So9qBot... Looks a bit serious... --Kanashimi (talk) 22:26, 4 July 2021 (UTC)[reply]
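Editorial note: the duplicate check discussed above (PubMed ID first, then DOI, then title and authors, skipping the author comparison when no author information is available) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code Cewbot runs: the `Article` class, `find_existing`, and the in-memory `existing` dictionary are hypothetical names, and treating a title-only match as a duplicate when authors are missing is one possible reading of the thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Article:
    """Hypothetical record holding only the fields named in the discussion."""
    title: str
    pmid: Optional[str] = None   # PubMed publication ID (P698)
    doi: Optional[str] = None    # DOI (P356)
    authors: List[str] = field(default_factory=list)


def norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not hide a match."""
    return " ".join(text.lower().split())


def find_existing(candidate: Article, existing: Dict[str, Article]) -> Optional[str]:
    """Return the key of a matching existing item, or None if the candidate looks new."""
    # 1. PubMed ID: the primary key for articles from PubMed Central.
    if candidate.pmid:
        for key, item in existing.items():
            if item.pmid == candidate.pmid:
                return key
    # 2. DOI, compared case-insensitively; some articles have no DOI.
    if candidate.doi:
        for key, item in existing.items():
            if item.doi and item.doi.upper() == candidate.doi.upper():
                return key
    # 3. Title, plus authors when both sides have author information.
    for key, item in existing.items():
        if norm(item.title) != norm(candidate.title):
            continue
        if not candidate.authors or not item.authors:
            return key  # assumption: with no authors to compare, the title decides
        if {norm(a) for a in candidate.authors} & {norm(a) for a in item.authors}:
            return key
    return None
```

A real run would look items up through a query against Wikidata rather than a local dictionary; the dictionary only keeps the sketch self-contained.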
@Ymblanter, Lymantria: Could you look at this proposal please? It seems to be pending approval to make test edits per [1]. Thanks. Mike Peel (talk) 21:51, 18 January 2022 (UTC)[reply]
- it is certainly fine for any (non-blocked) bot to make about 50 test edits, bots do not need our permission for this.--Ymblanter (talk) 21:55, 18 January 2022 (UTC)[reply]
- OK, @Kanashimi: go ahead with the test edits? Thanks. Mike Peel (talk) 21:57, 18 January 2022 (UTC)[reply]
- Thank you. My main concern is the feasibility. Given the situation described in Wikidata:Requests for permissions/Bot/So9qBot, it seems that even if the code is fine, Wikidata cannot handle the workload... Kanashimi (talk) 22:02, 18 January 2022 (UTC)[reply]
- @Kanashimi: I think the workload needs a phabricator ticket, per my comment on the other bot request - but that should probably go in parallel to the test edits here. Thanks. Mike Peel (talk) 22:04, 18 January 2022 (UTC)[reply]
- It sounds good Kanashimi (talk) 22:05, 18 January 2022 (UTC)[reply]
- @Jura1 Since it has been half a year, please tell me if there is still demand, thank you. Kanashimi (talk) 22:10, 18 January 2022 (UTC)[reply]
- As far as I am concerned, this would be great. Let's see the test edits first and decide based on those. Vojtěch Dostál (talk) 07:50, 19 January 2022 (UTC)[reply]
- Thank you. Coding... Kanashimi (talk) 19:36, 19 January 2022 (UTC)[reply]
I will make a duplicate list in User:Cewbot/log/20210701/PubMed ID duplicates. Is there a better method to handle this? --Kanashimi (talk) 20:26, 19 January 2022 (UTC)[reply]
- @Kanashimi Interesting! Are you sure these are duplicates? They have different PubMed Central IDs and one was published in March and the other in May.
- To your question: I think true duplicates should be merged instantly. If you don't have time for that, keep that list and post a link to it to eg. Wikidata talk:WikiProject Source MetaData Vojtěch Dostál (talk) 08:39, 20 January 2022 (UTC)[reply]
- @Vojtěch Dostál I will list items that share the same title, PubMed ID, PMC ID, etc. By the way, I see there are warnings on Selenium supplementation to improve bone health in postmenopausal women: the SeMS three-arm RCT (Q110624332). Is there a way to resolve the warnings? Kanashimi (talk) 10:09, 20 January 2022 (UTC)[reply]
- @Jura1@Vojtěch Dostál I tested 50 pages and fixed the edits that were not ideal. It looks OK now. Here are 10 new tests: PubMed ID 33513662, 33513663, 33513664, 33513665, 33513666, 33513667, 33513668, 33513669, 33513670, 33513671. If anything needs to be corrected, please let me know. Thank you. Kanashimi (talk) 22:42, 22 January 2022 (UTC)[reply]
- It looks perfect to me, but I may not be qualified to say that. Can people who have done similar imports of papers have a look, please? Tagging @Daniel Mietchen, @Mahdimoqri, @Sic19. Thank you! Vojtěch Dostál (talk) 08:40, 23 January 2022 (UTC)[reply]
- Thanks for doing this. It's crucial to do this, otherwise our corpus would go stale and become useless. Looks good afaik. Some minor stuff: [2] --- Jura 08:52, 23 January 2022 (UTC)[reply]
- Thank you. Just fixed. --Kanashimi (talk) 09:14, 23 January 2022 (UTC)[reply]
- I think I can also start a new thread to run from PubMed ID 1 to fill the missing data. Kanashimi (talk) 12:02, 24 January 2022 (UTC)[reply]
- I am also planning to create researcher items from the ORCID API. Kanashimi (talk) 05:50, 3 February 2022 (UTC)[reply]
- @Ymblanter, Mike Peel: Hi. Can you take a look at the test edits listed above? Thank you. --Kanashimi (talk) 21:21, 4 February 2022 (UTC)[reply]
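Editorial note: the duplicate list mentioned in this thread (items sharing the same title, PubMed ID, or PMC ID) amounts to grouping items by one field and keeping the groups with more than one member. The sketch below is a hedged illustration only; `duplicate_groups`, the field names, and the sample records are hypothetical and are not taken from the bot's code or from real items.

```python
from collections import defaultdict
from typing import Dict, List


def duplicate_groups(items: Dict[str, dict], field_name: str) -> Dict[str, List[str]]:
    """Group item keys by the value of one field and keep groups larger than one."""
    groups = defaultdict(list)
    for key, record in items.items():
        value = record.get(field_name)
        if value:  # items without this field cannot be compared on it
            groups[value].append(key)
    return {value: sorted(keys) for value, keys in groups.items() if len(keys) > 1}


# Made-up sample: two items share a PubMed ID but carry different PMC IDs.
sample = {
    "item_a": {"pmid": "100", "pmcid": "900", "title": "Example article"},
    "item_b": {"pmid": "100", "pmcid": "901", "title": "Example article"},
    "item_c": {"pmid": "101", "title": "Another article"},
}
by_pmid = duplicate_groups(sample, "pmid")
by_pmcid = duplicate_groups(sample, "pmcid")
```

Running the grouping once per field makes cases like the one raised above visible: a pair can collide on PubMed ID or title while still differing on PMC ID, which is a reason to review such pairs before merging.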