Wikidata:Requests for permissions/Bot/Josh404Bot 3
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Not done. No follow-up on the request to see if this is still active. @Josh404: feel free to re-open this if you want to follow up on it (revert this edit, add it back to the list of bot requests). Thanks. Mike Peel (talk) 21:24, 4 February 2022 (UTC)
Josh404Bot 3
Josh404Bot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Josh404 (talk • contribs • logs)
Task/s:
Fill in missing TMDB TV series ID (P4983) statements on items that have an associated IMDb ID (P345) or TheTVDB series ID (P4835), via the TMDb API.
Code:
https://github.com/josh/wikidatabots/compare/d2dec28...4cd7cea
SELECT ?item ?imdb ?tvdb ?random WHERE {
  # Items with either IMDb or TVDB IDs
  { ?item wdt:P4835 []. }
  UNION
  { ?item wdt:P345 []. }

  # P4983's type constraint
  VALUES ?classes {
    wd:Q15416
  }
  ?item (wdt:P31/(wdt:P279*)) ?classes.

  # Get IMDb and TVDB IDs
  OPTIONAL { ?item wdt:P345 ?imdb. }
  OPTIONAL { ?item wdt:P4835 ?tvdb. }

  # Exclude items that already have a TMDB TV series ID (P4983)
  OPTIONAL { ?item wdt:P4983 ?tmdb. }
  FILTER(!(BOUND(?tmdb)))

  # Generate random sorting key
  BIND(MD5(CONCAT(STR(?item), STR(RAND()))) AS ?random)
}
ORDER BY ?random
LIMIT 1000
Function details:
This is a follow-up to Wikidata:Requests for permissions/Bot/Josh404Bot 2 and Wikidata:Requests for permissions/Bot/Josh404Bot 1. The task operates on TMDb-related external IDs and shares similar code. The main difference is that this bot task also cross-references TheTVDB series ID (P4835) in addition to IMDb ID (P345).
- Via SPARQL, find items that have either an IMDb ID (P345) or a TheTVDB series ID (P4835) but DO NOT have a TMDB TV series ID (P4983). Accumulate results and remove duplicates.
- Use the TMDb API to look up the TV show ID by either IMDb or TVDB ID, whichever is present.
- For any matches, add new statements to the item. This MAY add multiple distinct statements for a given item if the IMDb and TVDB IDs conflict with each other or when multiple IMDb IDs exist on a single item.
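The three steps above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the bot's actual code: `find_tmdb_tv_ids`, `Lookup` and `fake_lookup` are hypothetical names, all IDs are made-up placeholders, and the TMDb lookup is injected as a function so the example runs without network access.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# One SPARQL result row: (item QID, IMDb ID or None, TVDB ID or None).
Row = Tuple[str, Optional[str], Optional[str]]
# Hypothetical lookup signature: (source, external ID) -> TMDb TV series IDs.
Lookup = Callable[[str, str], List[int]]


def find_tmdb_tv_ids(rows: Iterable[Row], lookup: Lookup) -> Dict[str, Set[int]]:
    """Group external IDs per item, then resolve them to TMDb TV series IDs."""
    # Step 1: accumulate rows in a dictionary; sets drop duplicate rows.
    external: Dict[str, Set[Tuple[str, str]]] = {}
    for qid, imdb, tvdb in rows:
        ids = external.setdefault(qid, set())
        if imdb:
            ids.add(("imdb_id", imdb))
        if tvdb:
            ids.add(("tvdb_id", tvdb))

    # Steps 2 and 3: look up each external ID and collect distinct matches.
    matches: Dict[str, Set[int]] = {}
    for qid, ids in external.items():
        found: Set[int] = set()
        for source, value in sorted(ids):
            found.update(lookup(source, value))
        if found:
            matches[qid] = found
    return matches


# Stand-in for the TMDb API (assumption: the real bot queries TMDb here).
FAKE_TMDB = {("imdb_id", "tt0000001"): [101], ("tvdb_id", "5001"): [101, 202]}


def fake_lookup(source: str, value: str) -> List[int]:
    return FAKE_TMDB.get((source, value), [])


rows = [
    ("Q1", "tt0000001", "5001"),
    ("Q1", "tt0000001", "5001"),  # duplicate row from the SPARQL result
    ("Q2", "nm0000001", None),    # person-style ID: no TV match
]
result = find_tmdb_tv_ids(rows, fake_lookup)
```

Here the first item ends up with two distinct TMDb IDs because its IMDb and TVDB IDs resolve differently, which corresponds to the "multiple distinct statements" case described above.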
Recapping some notes that came up in previous reviews:
- SPARQL results are accumulated client-side in a Python dictionary to better handle items with multiple IMDb IDs or TVDB IDs.
- The TMDB API rejects invalid IDs on lookup, which handles the theoretical case of an "nm" IMDb ID present on a TV series item.
- TMDB TV IDs are accumulated in a set to remove duplicates before generating statements.
- Statements are submitted in batches via QuickStatements, which also acts as a failsafe for preventing duplicate statements.
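As a rough illustration of the last note, the batch could be generated like this. The sketch assumes QuickStatements' tab-separated command format (item, property, quoted string value); `quickstatements_lines` is a hypothetical helper name and the item and TMDb IDs are placeholders, not real matches.

```python
from typing import Dict, List, Set

TMDB_TV_PROPERTY = "P4983"  # TMDB TV series ID, as named in this request


def quickstatements_lines(matches: Dict[str, Set[int]]) -> List[str]:
    """Turn {QID: {TMDb TV IDs}} into one tab-separated command per statement."""
    lines = []
    for qid in sorted(matches):
        # Each set is already free of duplicates; sorting keeps output stable.
        for tmdb_id in sorted(matches[qid]):
            lines.append(f'{qid}\t{TMDB_TV_PROPERTY}\t"{tmdb_id}"')
    return lines


batch = quickstatements_lines({"Q2": {202, 101}, "Q1": {101}})
print("\n".join(batch))
```

External identifiers are string-valued, hence the double quotes around each ID.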
--Josh404 (talk) 19:01, 16 May 2021 (UTC)
- Mentioning User:BrokenSegue, since they've given great feedback on past requests. Thanks! Josh404 (talk) 19:08, 16 May 2021 (UTC)
- Support. Instead of doing OPTIONAL followed by a FILTER on not BOUND, you can use FILTER NOT EXISTS, but it's probably equivalent and fine either way. Otherwise looks good to me. BrokenSegue (talk) 00:40, 17 May 2021 (UTC)
- Thanks for the review again!
- I recall originally writing this type of query with FILTER NOT EXISTS, but I started running into performance issues. I then saw some suggestions to try the OPTIONAL/FILTER approach. I'm not sure why the latter is often faster. Maybe the FILTER NOT EXISTS approach evaluates the subquery in its entirety rather than for each matching statement? I really haven't dug into the query optimizer that much. - 16 secs https://w.wiki/3LXH vs 22 secs https://w.wiki/3LXK. The total number of records is at least small for this property, but I've even seen timeouts on larger sets. Josh404 (talk) 01:18, 17 May 2021 (UTC)
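For readers comparing the two exclusion styles discussed in this thread, the sketch below builds a reduced version of the query with either pattern. It is illustrative only: `build_query`, `EXCLUSIONS` and the trimmed-down template are assumptions made for the example, and it says nothing about which form the query service runs faster.

```python
# Two ways to express "item has no TMDB TV series ID (P4983)".
EXCLUSIONS = {
    # Pattern used in the request: try to bind ?tmdb, keep rows where it is unbound.
    "optional_bound": "OPTIONAL { ?item wdt:P4983 ?tmdb. }\n  FILTER(!(BOUND(?tmdb)))",
    # Pattern suggested in review.
    "not_exists": "FILTER NOT EXISTS { ?item wdt:P4983 ?tmdb. }",
}

# Reduced template (assumption: only the IMDb branch is shown, for brevity).
QUERY_TEMPLATE = """SELECT ?item ?imdb WHERE {
  ?item wdt:P345 ?imdb.
  %s
}
LIMIT 1000"""


def build_query(style: str) -> str:
    """Return the reduced query using the chosen exclusion pattern."""
    return QUERY_TEMPLATE % EXCLUSIONS[style]


for style in EXCLUSIONS:
    print(f"# {style}\n{build_query(style)}\n")
```

Both strings can be pasted into the query service to time them side by side, as was done with the two short links above.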
@Josh404: This seems to be stale, is this still active? Perhaps @Ymblanter, Lymantria: could comment? Thanks. Mike Peel (talk) 22:00, 18 January 2022 (UTC)