Wikidata:Requests for permissions/Bot/Heinmabot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 20:00, 11 January 2023 (UTC)[reply]
Heinmabot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Heinmaman (talk • contribs • logs)
Task/s: Add GOV ids to Polish places.
Code: https://github.com/heinmade/wd_gov_refs
Function details: This bot adds GOV ids to the P2503 property of Wikidata items representing geographical/administrative entities in Poland, mainly settlements. The Wikidata-GOV references were derived beforehand by combining references present in various datasets (e.g. GOV, GND, Geonames, TERYT, Wikidata). Afterwards, additional plausibility checks between the linked Wikidata and GOV datasets were performed (mainly based on name similarity and geographical proximity). If a Wikidata item already contains a GOV reference, it is left unchanged.
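To illustrate the write step, a minimal sketch assuming a pywikibot setup; the QID/GOV-id pair and the helper name are placeholders, and the actual implementation is in the repository linked above:

```python
# Minimal sketch of the write step, assuming pywikibot. The property is
# GOV id (P2503); the example pair below is a placeholder.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

def add_gov_id(qid: str, gov_id: str) -> None:
    item = pywikibot.ItemPage(repo, qid)
    item.get()
    # If the item already contains a GOV reference, leave it unchanged.
    if "P2503" in item.claims:
        return
    claim = pywikibot.Claim(repo, "P2503")
    claim.setTarget(gov_id)
    item.addClaim(claim, summary="Adding GOV id")

# add_gov_id("Q12345", "SOME_GOV_ID")  # hypothetical pair
```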
--Heinmaman (talk) 13:26, 2 January 2023 (UTC)[reply]
- Were the plausibility checks done by hand? Is there code available? Could we see some sample edits? BrokenSegue (talk) 18:09, 2 January 2023 (UTC)[reply]
- Automated plausibility checks were run across all linked Wikidata-GOV pairs, using the PostgreSQL levenshtein function for name similarity and PostGIS for geographical proximity; high name similarity and close geographical proximity of a pair back its reference claim (a minimal sketch of such a check is shown below).
- The code is available here.
- The bot already made 50 sample edits. Heinmaman (talk) 11:21, 3 January 2023 (UTC)[reply]
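- For illustration, one such check, here run from Python instead of psql (the table and column names are assumptions, not our literal schema or queries):

```python
# Minimal sketch of one plausibility check, run from Python rather than
# psql. Table/column names (same_as_pairs, wd_label, gov_label, wd_geom,
# gov_geom) are hypothetical; levenshtein() needs the fuzzystrmatch
# extension, the geography cast needs PostGIS.
import psycopg2

conn = psycopg2.connect("dbname=gov_refs")  # placeholder DSN
with conn.cursor() as cur:
    # Flag pairs whose names differ strongly (Levenshtein distance) or
    # whose coordinates lie more than ~5 km apart (ST_Distance in metres).
    cur.execute("""
        SELECT wikidata_id, gov_id
        FROM same_as_pairs
        WHERE levenshtein(lower(wd_label), lower(gov_label)) > 3
           OR ST_Distance(wd_geom::geography, gov_geom::geography) > 5000
    """)
    suspicious = cur.fetchall()
print(f"{len(suspicious)} pairs need manual review")
```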
- Do you need more information to support the request? Heinmaman (talk) 08:33, 9 January 2023 (UTC)[reply]
- Looking at the source code, I'm not seeing the Levenshtein or GIS logic you mentioned? BrokenSegue (talk) 18:06, 10 January 2023 (UTC)[reply]
- We did the plausibility checks in psql, the PostgreSQL CLI, so they are not part of the code. The code just writes the references into Wikidata that we had collected, combined and checked beforehand.
- Let me illustrate the whole process in a bit more detail:
- 1. First, "same as"/reference data was aggregated from the authority sources (GOV, GND, Geonames, Wikidata) by various means (e.g. Wikidata SPARQL queries, database dumps) and imported into PostgreSQL, resulting in a collection of "same as" statements. These statements were then combined transitively (e.g. Wikidata.X - same as - Geonames.Y and GND.Z - same as - Geonames.Y yield Wikidata.X - same as - GND.Z; a sketch of this combination step follows after this list).
- 2. From this data set we extracted the Wikidata - same as - GOV pairs (about 30,000), joined them with the data needed for comparison (mainly the corresponding names and coordinates) and checked their similarity via Levenshtein and PostGIS in psql with various queries. This (actually optional) step confirmed that the data from the authority sources was of high quality, as expected.
- 3. The bot then writes this information into the corresponding (about 30,000) Wikidata items. Heinmaman (talk) 20:32, 10 January 2023 (UTC)[reply]
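- For illustration, the combination logic of step 1 sketched in Python (the actual step used SQL joins in PostgreSQL; the sample statements are the ones from the example above):

```python
# Minimal sketch of step 1's combination logic, in Python for
# readability; the real step was done with SQL joins in PostgreSQL.
from collections import defaultdict

# "same as" statements gathered from the authority sources (samples).
same_as = [
    ("Wikidata.X", "Geonames.Y"),
    ("GND.Z", "Geonames.Y"),
]

# Group all identifiers that point at the same target.
by_target = defaultdict(set)
for source, target in same_as:
    by_target[target].add(source)

# Any two identifiers sharing a target are themselves "same as":
# Wikidata.X - Geonames.Y and GND.Z - Geonames.Y yield
# Wikidata.X - same as - GND.Z.
derived = set()
for sources in by_target.values():
    for a in sources:
        for b in sources:
            if a < b:
                derived.add((a, b))

print(sorted(derived))  # [('GND.Z', 'Wikidata.X')]
```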
- Alright, OK. I'll Support then. Sorry for the delay. Ideally the entire pipeline would be open-sourced, though. BrokenSegue (talk) 00:55, 11 January 2023 (UTC)[reply]