Wikidata:Requests for permissions/Bot/Bekicot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Ymblanter (talk) 02:16, 15 August 2017 (UTC)
bekicot
Operator: Yana agun
Task/s: Add references to coordinates that were imported by the ceb Wikipedia bot. Scoped to Indonesian islands.
Code: https://github.com/Wikimedia-ID/anak-pulau/blob/ceb-updater/indonesian_island_geonames_updater/
Function details: The Cebuano (ceb) Wikipedia is known for having a large amount of bot-created content. One set of such content that affects Indonesia is the category of Indonesian islands (https://ceb.wikipedia.org/wiki/Kategoriya:Kapuloan_sa_Indonesia). Mass-importing this data without adding references to the statements can be problematic for future improvements. One example of the problem is islands that share the same name: see the Indonesian Small Islands Directory (https://en.wikipedia.org/wiki/Indonesian_Small_Islands_Directory), which contains many duplicate names (e.g. Burung, Babi). It is hard to check whether a coordinate is correct without knowing its source.
To fix this issue, the bot will add references to coordinate statements that currently have no references.
Scope of the task: update islands that were imported from ceb Wikipedia whose coordinate has no references and is an “Exact Match” with the coordinate provided by GeoNames. This means 5596 entities will be affected. The count is obtained using:
SELECT DISTINCT ?item ?itemLabel ?coordinate ?geonamesId WHERE {
  ?item wdt:P31/wdt:P279* wd:Q23442 .
  ?item wdt:P17 wd:Q252 .
  ?item wdt:P625 ?coordinate .
  ?sitelink schema:about ?item .
  ?sitelink schema:inLanguage 'ceb' .
  ?item wdt:P1566 ?geonamesId
  SERVICE wikibase:label { bd:serviceParam wikibase:language "id,en". }
}
ORDER BY ?item
From this result, the islands that do not have a matching coordinate, or that have a matching coordinate but already have a reference to GeoNames, are subtracted.
Running the query returns 5829 items. Subtracting 233 items (those that do not have a coordinate matching GeoNames) gives the final result of 5596.
The same query is also used for the verification process.
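As an illustration of the counting step, here is a minimal Python sketch that reads a result of the query above in the standard SPARQL JSON results format and counts the distinct items left after exclusions. The function names and the `excluded_qids` argument are hypothetical, and this is not taken from the bot's Ruby scripts.

```python
import re

# WDQS returns P625 values as WKT literals such as "Point(120.5 -8.25)".
# WKT order is longitude first, then latitude.
_NUM = r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
_POINT = re.compile(r"^Point\((" + _NUM + r") (" + _NUM + r")\)$")


def parse_bindings(results):
    """Turn a SPARQL JSON result into a list of plain dicts."""
    rows = []
    for binding in results["results"]["bindings"]:
        match = _POINT.match(binding["coordinate"]["value"])
        if match is None:
            continue  # skip values that are not simple points
        lon, lat = match.groups()
        rows.append({
            # The item is returned as an entity URI; keep only the Q-id.
            "qid": binding["item"]["value"].rsplit("/", 1)[-1],
            "geonames_id": binding["geonamesId"]["value"],
            "lat": lat,  # kept as strings so no precision is lost
            "lon": lon,
        })
    return rows


def target_count(rows, excluded_qids):
    """Count distinct items that remain after removing the excluded ones."""
    return len({row["qid"] for row in rows} - set(excluded_qids))
```

Counting distinct Q-ids rather than rows matters because an item with more than one coordinate or GeoNames ID appears in several result rows.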
The Algorithm
Pre-processing
- Retrieve the GeoNames dataset from http://download.geonames.org/export/dump/ID.zip
- Remove every entity in the GeoNames dataset that is not an island, to speed up lookups, and save the result as “geonames_islands.csv”
- Retrieve the island data using the query above
- Save the island data as “indonesian_islands_from_sparql.json”
- Split the retrieved island IDs into arrays of 50 elements
- Bulk-retrieve all claims for the islands, 50 at a time, using the “wbgetentities” endpoint, and save them to disk as “1_indonesian_island_entities.json”
- For each Wikidata entity, keep only those whose coordinate is an “Exact Match” with the corresponding GeoNames entity
- Save the filtered entities to disk as “2_matching_indonesian_islands_with_geonames.json”
- Add references (“stated in=geonames”) to the claims in “2_claims_with_geonames.json” and save the result as “3_claims_updated_references.json”
Post upstream
- Upload each claim in “3_claims_updated_references.json” to Wikidata
- Verify the data by re-retrieving it with the same query that was used to retrieve it
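The pre-processing steps above (keeping only islands, batching IDs by 50, and the exact-match filter) could look roughly like the following Python sketch. It is an illustration, not the bot's Ruby code. It assumes the tab-separated column layout documented for the GeoNames dump (geonameid in column 0, latitude and longitude in columns 4 and 5, feature code in column 7), and it assumes that "island" means the GeoNames feature code `ISL` and that "Exact Match" means numeric equality with no tolerance. All function names are hypothetical.

```python
import csv
import io
from decimal import Decimal

BATCH_SIZE = 50  # batch size named in the steps above


def chunks(ids, size=BATCH_SIZE):
    """Split a list of item IDs into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def load_geonames_islands(tsv_text, feature_code="ISL"):
    """Index the island rows of a GeoNames country dump by geonameid."""
    islands = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: only rows with feature code ISL count as islands.
        if len(row) > 7 and row[7] == feature_code:
            islands[row[0]] = (Decimal(row[4]), Decimal(row[5]))
    return islands


def is_exact_match(wikidata_lat, wikidata_lon, geonames_id, islands):
    """True only if both coordinates are numerically identical."""
    expected = islands.get(geonames_id)
    if expected is None:
        return False  # not an island in the GeoNames index
    actual = (Decimal(str(wikidata_lat)), Decimal(str(wikidata_lon)))
    return actual == expected
```

Comparing through `Decimal` treats `-8.25` and `-8.25000` as equal while still rejecting any real difference in value, which avoids depending on how either side formats trailing zeros.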
Preprocessing script: https://github.com/Wikimedia-ID/anak-pulau/blob/ceb-updater/indonesian_island_geonames_updater/prepare_data.rb
Post upstream script: https://github.com/Wikimedia-ID/anak-pulau/blob/ceb-updater/indonesian_island_geonames_updater/post_to_wikidata.rb
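For the upload step, a reference of the form "stated in = GeoNames" can be attached to an existing claim with the `wbsetreference` API module. The Python sketch below only builds the POST parameters and sends nothing. It is not the linked Ruby script; the GeoNames item ID `Q830106` is an assumption that should be checked before use, and the function names are hypothetical.

```python
import json

GEONAMES_QID = "Q830106"  # assumed Wikidata item for GeoNames; verify before use


def stated_in_snaks(source_qid=GEONAMES_QID):
    """Build the snaks for a reference 'stated in (P248) = source'."""
    return {
        "P248": [{
            "snaktype": "value",
            "property": "P248",
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {
                    "entity-type": "item",
                    "numeric-id": int(source_qid[1:]),
                },
            },
        }]
    }


def wbsetreference_params(claim, csrf_token):
    """POST parameters for one wbsetreference call on an existing claim."""
    return {
        "action": "wbsetreference",
        "statement": claim["id"],  # the claim GUID from wbgetentities
        "snaks": json.dumps(stated_in_snaks()),
        "token": csrf_token,
        "bot": "1",
        "format": "json",
    }
```

Only the reference is added; the main value of the P625 claim is never part of the request, which is consistent with the statement that the coordinate itself is left unchanged.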
Success Indicator
The project is successful when, at its end, the number of Indonesian islands with ceb imports whose coordinate exactly matches the GeoNames coordinate but has no references is 0.
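The success indicator can be checked mechanically once the entities have been re-retrieved. A minimal Python sketch, assuming `entities` is the "entities" object of a `wbgetentities` response and `matched_qids` is the list of items whose coordinate exactly matched GeoNames (both names are hypothetical):

```python
def unreferenced_matches(entities, matched_qids):
    """Return the matched items that still have a P625 claim without references."""
    remaining = []
    for qid in matched_qids:
        claims = entities.get(qid, {}).get("claims", {}).get("P625", [])
        # A claim with a missing or empty "references" list is still unreferenced.
        if any(not claim.get("references") for claim in claims):
            remaining.append(qid)
    return remaining
```

Under these assumptions the task is complete when `unreferenced_matches(entities, matched_qids)` returns an empty list.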
- Licenses of the data and scripts being used
- GeoNames: Creative Commons Attribution 3.0 License (although this is a CC BY license, the bot does NOT modify the value of property P625 based on the GeoNames data; the license of the existing data is preserved as is)
- Bot scripts: MIT
Full Script: https://github.com/Wikimedia-ID/anak-pulau/blob/ceb-updater/indonesian_island_geonames_updater/ --Yana agun (talk) 12:51, 2 August 2017 (UTC)
- Are you sure that the Creative Commons Attribution 3.0 License used by GeoNames is compatible with the CC0 license used in Wikidata? --ValterVB (talk) 18:23, 2 August 2017 (UTC)
- That was probably a wording mistake on my part: I had written "License of the data for import" where I meant "License of the data", and I have now fixed it. The bot itself does not import data from GeoNames into Wikidata. It matches the two datasets; if a coordinate matches, it adds a "Stated In GeoNames" reference to it. Yana agun (talk) 06:51, 3 August 2017 (UTC)
- The coordinate data already exists in Wikidata, added by a previous bot.
- The bot known to have made these additions is Mr.Ibrahembot.
- Examples of coordinates added by Mr.Ibrahembot that will receive a "Stated In" reference:
- https://www.wikidata.org/w/index.php?title=Q24819276&diff=514807388&oldid=442577162
- https://www.wikidata.org/w/index.php?title=Q24819397&diff=514817409&oldid=442482627
- https://www.wikidata.org/w/index.php?title=Q24897201&diff=next&oldid=442547143
- See more at https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%3Fsitelink%20%3FgeonamesId%20%3Fcoordinate%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ23442.%0A%20%20%3Fitem%20wdt%3AP17%20wd%3AQ252%20.%0A%20%20%3Fsitelink%20schema%3Aabout%20%3Fitem%20.%0A%20%20%3Fsitelink%20schema%3AinLanguage%20'ceb'%20.%0A%20%20%3Fitem%20wdt%3AP625%20%3Fcoordinate%20.%0A%20%20%3Fitem%20wdt%3AP1566%20%3FgeonamesId%20.%0A%20%20MINUS%20%7B%0A%20%20%20%20%3Fitem%20p%3AP625%20%3FcoordinateStatement%20.%0A%20%20%20%20%3FcoordinateStatement%20prov%3AwasDerivedFrom%20%3Fref%20.%0A%20%20%20%20%3Fref%20%3FanyReference%20%3Fvalue%20.%0A%20%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22id%2Cen%22.%20%7D%0A%7D%0AORDER%20BY%20%3Fitem
- See more of Mr.Ibrahembot's contributions at https://www.wikidata.org/wiki/User:Mr.Ibrahembot
- The aim is to add confidence by showing that the data is also stated, with the same value, in another source. The coordinate itself is not modified. Therefore, the license still derives from the existing license of the data, which (for the islands I checked) was made available by Mr.Ibrahembot.
- Other authoritative resources also give the same values as those used by Wikidata (adding them as "Stated In" references is not part of this project):
- http://geonames.nga.mil/namesgaz/ for https://www.wikidata.org/wiki/Q148440
- This also shows that much of the data in GeoNames in fact matches the public-domain data published by http://geonames.nga.mil/namesgaz/
- Another example
- Pulau Jawa, Pulau Bali, Pulau Sumatra, Borneo, Sulawesi
- I am going to approve the bot in a couple of days provided there have been no objections raised.--Ymblanter (talk) 01:53, 12 August 2017 (UTC)
- I have found items marked as an island in Wikidata that are actually groups of islands according to the GeoNames entry linked from the Wikidata item; the ceb and sv Wikipedia texts describe them as island groups as well. I have corrected the data by changing instance of (P31) (e.g. https://www.wikidata.org/wiki/Q24827438). The total number of islands with a ceb wiki sitelink is now 5657, and the total number of exact matches (without a reference to GeoNames) is 5433. That is now my target. Yana agun (talk) 03:36, 15 August 2017 (UTC)
- This task is now completed. Waiting for Wikidata:Requests for permissions/Bot/Bekicot 2 - Yana agun (talk) 11:05, 16 August 2017 (UTC)