Wikidata:Requests for permissions/Bot/KlimatkollenGarboBot 1
KlimatkollenGarboBot
Operator: Klimatfrida, NL-Moritz
Task/s: upload carbon footprint data from Klimatkollen to Wikidata
Change Frequency: Changes are only pushed when new reports are passed; this mostly happens in Q2 of each year. There will be at most one change per report per company.
Code: https://github.com/Klimatbyran/ Developed in fork: https://github.com/Klimatbyran/garbo/compare/main...okis-netlight:garbo:feat/wikidata-update
Function details: After carbon footprint data is verified by a human from Klimatkollen, the bot will push this data to the carbon footprint section of the corresponding company entity on Wikidata. If this section already contains data for the specific reporting period and scope, the bot will update that data; otherwise it will create a new data point. --KlimatkollenGarboBot (talk) 09:19, 20 February 2025 (UTC)
- Looks good. @Klimatfrida, can the bot make around 50 test edits so that we can verify that it is working as intended? Ainali (talk) 08:34, 21 February 2025 (UTC)
- Yes, we are trying this today. :) Klimatfrida (talk) 08:57, 21 February 2025 (UTC)
- We did the requested edits, also we underestimated the number of new datapoints for AstraZeneca a bit, so the bot did a bit more than 50 edits, we hope that is not a problem. NL-Moritz (talk) 12:58, 21 February 2025 (UTC)
- While the result looks good, those 69 edits should be grouped into one single edit. Please rewrite the bot to make it easier to follow. Ainali (talk) 14:05, 21 February 2025 (UTC)
- For reference, it is the API function wbeditentity that can make bundled edits to the same item. Ainali (talk) 19:01, 21 February 2025 (UTC)
- Thanks for the input, I implemented this change. The only thing is that we have to do the edits for scope 1 + 2 and scope 3 in two groups, as scope 3 has an additional qualifier and the library has problems if this qualifier is marked to be compared when it is actually missing for scope 1 and 2. Currently, I have grouped the edits for each scope, but will combine 1 + 2 shortly and hope that this is a good solution. NL-Moritz (talk) 12:40, 24 February 2025 (UTC)
- Sure, let us know when you have implemented the changes and have made a few more test edits with the new implementation. Ainali (talk) 15:12, 24 February 2025 (UTC)
- @Klimatfrida @NL-Moritz: Something is going wrong. The bot has added a second set of statements on Inter IKEA Holding, duplicating the existing ones, see this combined diff. Please clean that up and fix your code before doing any further edits. Ainali (talk) 16:04, 24 February 2025 (UTC)
- Thanks for making us aware. The bot itself works correctly in the sense that it does not create duplicates of the same data for the identical time period, scope and category. However, the error exposed an issue in our data: the values of the datapoints are verified, but the reporting period in some cases was not. We will tackle the issue directly to solve it asap. I will clean up the wrong datapoints in the meantime. Is it okay if we go for another test run once we have checked our data for this error, as I think the bot itself functions correctly? NL-Moritz (talk) 16:31, 24 February 2025 (UTC)
- The IKEA Holding is cleaned up, I removed the duplicated values and changed the reporting periods for new datapoints that were not there before and also verified everything using the latest report. NL-Moritz (talk) 17:11, 24 February 2025 (UTC)
- I am not sure what you mean by it is not duplicates. When looking at, for example, these two added statements at H&M (Q188326) they look exactly the same to me. 1: Q188326#Q188326$55F6B2DF-4571-477F-9DE1-E24432D72F5C, and 2: Q188326#Q188326$FE3EBB72-A414-4463-9409-89D9660B5909. (This item also needs cleaning.)
- But to your last question, yes, but please make only a few edits one at a time to verify that any error (regardless if it is in the code or the data) is not propagated too far and in a scale small enough to clean up. Ainali (talk) 19:15, 24 February 2025 (UTC)
- Ok we did some more testing and feel now quite comfortable that the bot works as intended with just a single edit per entity. We also restricted the data we want to upload to the most recent one (last reporting period) as we currently don't know if the community wants all of the historic data or not. This restriction leads to the problem that our bot currently cannot add anything new in most cases, so finding entities to make edits is a bit hard. I did run one successful edit on Holmen AB Q1467848. The thing is that there is a statement which shows the carbon footprint for two different scopes at once (Scope 1 and Scope 3) as the value is the same. I personally find this statement ambiguous and would prefer the separate statements our bot added, it also shows that the bot relies on the qualifiers to distinguish between different statements and cannot detect these special cases. As it is quite hard to make test runs in the live system as most of the recent data is already there, I also did some in the sandbox https://test.wikidata.org/w/index.php?title=Q238638&action=history if this is viable. NL-Moritz (talk) 07:53, 28 February 2025 (UTC)
- Thanks for pondering the conundrum with multiple timepoints! Indeed, all historical data may be a bit much for now (as there is a hard limit of the size of an item). For that, we should rather look into storing the data as .tab files in the Data namespace on Wikimedia Commons. However, just adding one new year at a time going forward should be fairly safe, as the growth rate is very limited.
- On Holmen AB (Q1467848), I believe it was this edit going wrong in an OpenRefine batch: https://www.wikidata.org/w/index.php?title=Q1467848&diff=next&oldid=2180340963 and I think the "Scope 3" qualifier can just be removed there.
- Yes, it is certainly viable to do test edits there when there are no current updates to make here. The edits there look good. I can't see any duplicated statements either, so I guess you have checked for that, is that correct? Ainali (talk) 09:48, 28 February 2025 (UTC)
- Yes, so I do a comparison between the items already in the carbon footprint statements and the items we have. Items describe the same datapoint if the start and end date of the reporting period and the scope are equal. Additionally, for scope 3 the category also has to be the same. If there is a match, I check if the value is the same; if so, no update is done, and if not, I update the value. If I find items of ours that don't have a match to one of the existing items, I add this item to the statement. NL-Moritz (talk) 10:54, 28 February 2025 (UTC)
- Just a general question regarding the process. Are we currently expected to do more test runs or are there any other requested changes pending on our side? NL-Moritz (talk) 07:23, 10 March 2025 (UTC)
- @NL-Moritz The latest edit from the bot on Systembolaget (Q1476113) looks weird: all the different scope 3 values were replaced by the same one. Was there some change in your code creating this error? Ainali (talk) 13:27, 16 March 2025 (UTC)
- @Ainali I think that was a previous bug and the issue is fixed now. I will look into this edit to fix any faults caused by the bug. Apart from that the newest runs of the bot were done in the sandbox https://test.wikidata.org/wiki/Special:Contributions/KlimatkollenGarboBot as wikidata is pretty much up to date and the bot cannot contribute something new. NL-Moritz (talk) 08:27, 17 March 2025 (UTC)
- @NL-Moritz Ok! Nice that you also added edit summaries. Could it be made more granular, so that it from the summary is clear if it is a pure addition or if existing statements are getting updated, or both (or removed as I saw some edits on test.wikidata)? Ainali (talk) 09:49, 17 March 2025 (UTC)
- @NL-Moritz Slightly related question, but thinking about that you don't have any more edits to do as tests right now, what is the estimated number of edits for the bot per year? Ainali (talk) 09:03, 18 March 2025 (UTC)
- @Ainali A rough estimate would be that we update the data of the companies we have once a year. If we estimate that a company has around 10 carbon footprint datapoints and we currently have around 300 companies, this would lead to 3000 edits a year. But we want to increase the number of companies in the future, so the number of edits will scale with it. NL-Moritz (talk) 09:24, 18 March 2025 (UTC)
- Thanks for the estimation! Even with a ten or hundred fold increase, it seems like a reasonable amount. Ainali (talk) 09:28, 18 March 2025 (UTC)
- I will look into this to make clear which data points were newly added and which was just replaced with newer data. NL-Moritz (talk) 09:26, 18 March 2025 (UTC)
- @Ainali I had a deeper look into the summary. Initially I thought I could add a change summary to every claim, but as I do the change in one edit, that is not possible. Therefore, do you want me to split the edits up into one for additions, one for updates and one for removals, or should I try to write a longer change summary which covers everything in one edit? NL-Moritz (talk) 12:30, 20 March 2025 (UTC)
- @NL-Moritz I don't think that is needed, if the edit is a combination of additions and updates, the current summary is fine. I was more thinking about the future, when there might be "pure" additions. Ainali (talk) 17:02, 20 March 2025 (UTC)
- @Ainali After adding a claim for the total emissions to the majority of the companies we track, I noticed that there are still a few gaps in the emissions data of some companies. I would love to fill these with our current data using the bot, as after a lot of testing I feel quite comfortable that with some supervision it should work fine. One thing I am a bit unsure of is the removal of older claims. We discussed that the number of claims per entity is limited and therefore only the most recent data should be in the entity. The bot will therefore remove/replace older data. An example can be seen here: https://www.wikidata.org/w/index.php?title=Q47508289&action=history. If this is okay, I would go through with updating all companies with our data. NL-Moritz (talk) 12:10, 16 April 2025 (UTC)
- @NL-Moritz Please don't delete already added statements! Especially this edit, which claims to be an update but is a pure deletion, is not good. There's plenty of room if you just update once per year. The comment about limits was more of a safeguard: if you were to add historical data stretching back a couple of decades, that might reach the limits. Ainali (talk) 14:21, 16 April 2025 (UTC)
- @Ainali Sorry for that, I guess we had a misunderstanding early on regarding this. I reverted the change and will update the bot to not remove data from previous years. NL-Moritz (talk) 14:52, 16 April 2025 (UTC)
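For illustration, the matching rule described earlier in this thread (same reporting period and scope, plus the same category for scope 3), together with the request to never remove existing statements and to make the edit summary distinguish additions from updates, could be sketched roughly as follows. This is a minimal sketch and not the bot's actual code: the `Datapoint` class, its field names, `plan_changes` and `edit_summary` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Datapoint:
    """One emissions value; all field names here are hypothetical."""
    start: str                       # reporting period start, e.g. "2023-01-01"
    end: str                         # reporting period end, e.g. "2023-12-31"
    scope: str                       # "scope1", "scope2" or "scope3"
    value: float                     # reported emissions
    category: Optional[str] = None   # only meaningful for scope 3

    def key(self):
        # Two datapoints describe the same thing if period and scope match;
        # for scope 3 the category has to match as well.
        category = self.category if self.scope == "scope3" else None
        return (self.start, self.end, self.scope, category)


def plan_changes(existing, incoming):
    """Sort incoming datapoints into additions, updates and unchanged ones.

    Existing statements without a match are left alone: nothing is removed.
    """
    by_key = {dp.key(): dp for dp in existing}
    plan = {"add": [], "update": [], "unchanged": []}
    for dp in incoming:
        match = by_key.get(dp.key())
        if match is None:
            plan["add"].append(dp)
        elif match.value != dp.value:
            plan["update"].append(dp)
        else:
            plan["unchanged"].append(dp)
    return plan


def edit_summary(plan):
    """Say in the summary whether the edit adds, updates, or does both."""
    parts = []
    if plan["add"]:
        parts.append(f"added {len(plan['add'])}")
    if plan["update"]:
        parts.append(f"updated {len(plan['update'])}")
    if not parts:
        return "no changes"
    return "Carbon footprint: " + ", ".join(parts) + " statement(s)"
```

Because the comparison relies only on the qualifiers, a statement that carries two scopes at once (as on Holmen AB) would not be recognised as a match by a rule like this.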
Notified participants of WikiProject Climate Change as we are using that emissions model. Ainali (talk) 09:59, 28 February 2025 (UTC)
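As a rough illustration of how several carbon footprint statements can be bundled into one `wbeditentity` request, as asked for above, the request parameters might be built like this. It is a sketch under assumptions, not the bot's implementation: the property and unit identifiers are assumed and should be checked against the emissions model, the helper names are hypothetical, and the scope item ID is left to the caller. No network call is made; a CSRF token still has to be added before the request is sent.

```python
import json

# Assumed identifiers; check each one against Wikidata before relying on it.
CARBON_FOOTPRINT = "P5991"  # assumed: carbon footprint
START_TIME = "P580"         # assumed: start time
END_TIME = "P582"           # assumed: end time
SCOPE_ROLE = "P3831"        # assumed: qualifier that holds the scope item
UNIT = "http://www.wikidata.org/entity/Q57084755"  # assumed: tonne of CO2 equivalent
GREGORIAN = "http://www.wikidata.org/entity/Q1985727"


def snak(prop, value, value_type):
    """A value snak in the Wikibase JSON format."""
    return {"snaktype": "value", "property": prop,
            "datavalue": {"value": value, "type": value_type}}


def day(date):
    """Wikibase time value with day precision (11) for an ISO date string."""
    return {"time": f"+{date}T00:00:00Z", "timezone": 0, "before": 0,
            "after": 0, "precision": 11, "calendarmodel": GREGORIAN}


def claim(amount, start, end, scope_item):
    """One carbon footprint statement with period and scope qualifiers."""
    scope = {"entity-type": "item", "id": scope_item}
    quantity = {"amount": f"{amount:+}", "unit": UNIT}
    return {
        "type": "statement",
        "rank": "normal",
        "mainsnak": snak(CARBON_FOOTPRINT, quantity, "quantity"),
        "qualifiers": {
            START_TIME: [snak(START_TIME, day(start), "time")],
            END_TIME: [snak(END_TIME, day(end), "time")],
            SCOPE_ROLE: [snak(SCOPE_ROLE, scope, "wikibase-entityid")],
        },
    }


def build_request(entity_id, claims, summary):
    """Parameters for a single wbeditentity call carrying all claims at once.

    The caller must add a CSRF token as "token" before sending the POST.
    """
    return {
        "action": "wbeditentity",
        "id": entity_id,
        "data": json.dumps({"claims": claims}),
        "summary": summary,
        "bot": "1",
        "format": "json",
    }
```

Sending all claims for an item in one request is what keeps the page history at one revision per company, instead of one revision per datapoint.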
Comment I do not see issues in terms of the emissions model but I am wondering about the references. Instead of a reference URL (P854) statement pointing to a PDF of the report, we might want to go for a stated in (P248) statement pointing to an item about the report, with a link to the PDF and an Internet Archive copy. However, I have no idea how diverse the references are that the bot would be citing. If they are all essentially PDFs, maybe the above workflow would be useful. If it's sometimes a PDF, sometimes a URL, sometimes something else, then I'd keep the bot's settings for now. --Daniel Mietchen (talk) 13:20, 28 February 2025 (UTC)
- The reference documents (reports) should all be PDFs so your proposed structure would work. We aligned our structure of the datapoints so far to this model Wikidata:WikiProject Climate Change/Models#Emissions. Implementing your changes would be possible. As the name for the reports I would suggest "<company name> GHG Protocol <year>" to uniquely identify these reports to avoid any duplicates. Regarding the linking of the file, we also want to try to host a copy of every report at klimatkollen.se to ensure that these reports are available long-term and we could also think about a backup of the reports at the Internet Archive. If it is okay for everybody we would do this as an ongoing process with first linking to the original source, as soon as we store a copy at our site to this copy and add a link to a copy at the Internet Archive in the future.
- One more thing about the properties in the reference: in the model I referenced before there is also the property determination method or standard, with our AI garbo for extracting data from the written reports as a value. Should we also fill out this property? If so, our plan is to only upload data after it is verified by a human, so using garbo as the value would not fit that well. Instead, we would need another entity describing this method. Any idea what to call that? NL-Moritz (talk) 13:42, 4 March 2025 (UTC)
- @NL-Moritz If all data will be manually verified, we can just skip the determination method as I modeled it in the model example, because that is then just like how we normally do. I modeled that when I was expecting a totally automated process. Ainali (talk) 14:20, 4 March 2025 (UTC)
- @Daniel Mietchen I am not sure we want to create items for each annual report, that seems excessive. I'd rather keep the reference URLs as is. Ainali (talk) 14:17, 4 March 2025 (UTC)
- @Ainali Do you have input here? Klimatfrida (talk) 11:46, 17 March 2025 (UTC)
- @Klimatfrida I am a bit uncertain what you refer to, since you replied to the input I had. Ainali (talk) 18:02, 17 March 2025 (UTC)
- Hello! I am currently working with @Klimatfrida on developing this bot. I agree with @Daniel Mietchen that it would be excessive to create new items for each report. In my opinion the current way of presenting the report URL is a good start, and later on we could just switch to using URLs that point to some archive if we find that necessary. Oliver-NL (talk) 12:53, 18 March 2025 (UTC)
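To make the reference option that was kept in this thread concrete (a reference URL (P854) pointing at the report, with an archived copy possibly added later), a reference block in the Wikibase JSON format might look like the sketch below. The helper names are hypothetical, and using archive URL (P1065) for the Internet Archive copy is an assumption rather than something agreed in the discussion.

```python
REFERENCE_URL = "P854"  # reference URL, as named in the discussion
ARCHIVE_URL = "P1065"   # assumed: archive URL, for an archived copy


def url_snak(prop, url):
    """A value snak for a URL-typed property; URL values use the string type."""
    return {"snaktype": "value", "property": prop,
            "datavalue": {"value": url, "type": "string"}}


def build_reference(report_url, archive_url=None):
    """Reference block citing the report, optionally with an archived copy."""
    snaks = {REFERENCE_URL: [url_snak(REFERENCE_URL, report_url)]}
    if archive_url:
        snaks[ARCHIVE_URL] = [url_snak(ARCHIVE_URL, archive_url)]
    return {"snaks": snaks, "snaks-order": list(snaks)}
```

A block like this would go into the `references` list of each statement, so switching from the original link to an archived copy later only changes the arguments passed in.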
Ping @NL-Moritz: Please see this discussion on Swedish Wikipedia about some odd values added in the testing: w:sv:Wikipediadiskussion:Projekt_klimatförändringar#Misstänkt_felräkning/dubbelräkning_av_koldioxidavtryck_inlagda_av_KlimatkollenGarboBot. Ainali (talk) 06:20, 19 April 2025 (UTC)