Topic on User talk:Magnus Manske: how to add 1.7M claims with QS?

Vladimir Alexiev (talkcontribs)

I'm trying to add 1.7M WorldCat Identities ID (P7859) statements (amounting to 81 MB together with the VIAF ID as source). The browser UI of https://tools.wmflabs.org/quickstatements seems to take forever, and the API "invoke by URL" passes the data in the URL, which won't work at that size.

Magnus Manske (talkcontribs)
Vladimir Alexiev (talkcontribs)

Thanks @Magnus Manske! Before trying it at such a large scale, I tried to add a few statements the interactive way and got a bunch of errors: https://imgur.com/3mVLKlC.

  • I was able to add one statement
  • The row about Henri Leveneaux initially gave me 2 errors but now gives 1
  • "Try to reset errors" and then re-run fixed a couple but left 5 errors...

So it's really unreliable. Maybe that's related to Topic:Vf67sq5dyh5ql2j9 or some other WD outage?

I understand from https://www.wikidata.org/w/index.php?title=Topic:Vbgypuu9k0q1pvz5&topic_showPostId=vfpvgr5lebir3v3m#flow-post-vfpvgr5lebir3v3m that QS batches use a different data path than interactive QS. Would you recommend I try a batch even though the interactive path is unreliable?

Vladimir Alexiev (talkcontribs)

Hi @Magnus Manske! Could you please take a look at the following:

curl https://tools.wmflabs.org/quickstatements/api.php \
  -d action=import -d submit=1 -d username=Vladimir_Alexiev -d format=csv \
  -d "batchname=1.7M WorldCat Identities" \
  --data-raw "token=%24..." \
  --data-urlencode data@wd-identities-QS.csv
Vladimir Alexiev (talkcontribs)

After about 15 min I got {"status":"OK"} but no "batch_id":ID_OF_THE_NEW_BATCH. And I can't see a new batch added at https://tools.wmflabs.org/quickstatements/#/batches/Vladimir%20Alexiev.

The data was parsed OK, because when I tried a broken file and a wrong token I got the appropriate errors (a response-checking sketch follows the list):

  • {"status":"No commands"}
  • {"status":"User name and token do not match"}
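(A minimal sketch, not part of the original exchange: the same submission wrapped so the reply is checked for a batch_id automatically. jq, the batch name and the small sample file are my assumptions.)

resp=$(curl -s https://tools.wmflabs.org/quickstatements/api.php \
  -d action=import -d submit=1 -d username=Vladimir_Alexiev -d format=csv \
  -d "batchname=test batch" \
  --data-raw 'token=...' \
  --data-urlencode data@sample.csv)
# a successful import should report "status":"OK" *and* a numeric batch_id
echo "$resp" | jq -r '"status: \(.status), batch_id: \(.batch_id // "MISSING")"'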
Magnus Manske (talkcontribs)

Can you try a file in the default format instead of CSV? Maybe there is a bug in there.

Magnus Manske (talkcontribs)

Huh. Batch was submitted successfully but the edits didn't run. Investigating.

Magnus Manske (talkcontribs)
Vladimir Alexiev (talkcontribs)

Tried a couple of in-browser ("Run") batches; again got a mix of Done (10) and errors (24).

Tried a "Run in background" batch. It parses the CSV, asks me for a batch name, starts running, and THEN says "Not logged in": https://imgur.com/j9T4O7N.

Tried the big batch again, this time using single quotes around 'token=...'. Again got {"status":"OK"} but no batch ID, and https://tools.wmflabs.org/quickstatements/#/batches/Vladimir%20Alexiev doesn't show any recent ones.

I've shared the file (wd-identities-QS.csv) at https://drive.google.com/file/d/1IbEFZdgl5_gwl-3dZ9BOGAY6y4vaRJHM/view , maybe you could try it?

Magnus Manske (talkcontribs)

Got it running:

  • Your CSV file contained Mac OS 9 (!) line endings. Converted them to Unix (a one-liner for this is sketched below).
  • I only used the first 8000 lines. Apparently it silently fails if it's too big.
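(For anyone repeating this: classic Mac OS line endings are bare CR characters, so a one-liner along these lines should do the conversion; the output file name is just an example.)

# replace every CR with LF to get Unix line endings
tr '\r' '\n' < wd-identities-QS.csv > wd-identities-QS-unix.csv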
Vladimir Alexiev (talkcontribs)

Thanks @Magnus Manske, that is progress.

  • So should I split it into 8k-line parts? That would be ~212 parts, quite doable.
  • Your run had only one error, and Q10309969#Q10309969$086C7454-80FB-48DF-AB68-6D6D95A492D3 was left without a reference. But that's a VERY minor problem.
  • OAuth:
  • On a fresh login to QS, I get "Error retrieving token: mwoauthdatastore-request-token-not-found".
  • On reload it shows me the New Batch buttons but no user name.
  • On a third reload it shows my user name.

Tomorrow I'll post an update on how it goes with 8k-line pieces.

Magnus Manske (talkcontribs)

Try larger pieces first; 8k was just a number I pulled out of my virtual hat.

Vladimir Alexiev (talkcontribs)

I converted the line endings to Unix and split the file into 20k-line pieces. Got the error below; will try again.

head -n 1  wd-identities-QS.csv > wd-identities-head.csv
tail -n +2 wd-identities-QS.csv | split --filter='cat wd-identities-head.csv - > split/$FILE' -d -l 20000 - wd-identities-
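# note: the split/ directory must already exist, and each chunk gets the CSV header line prepended
# submit one chunk to the QuickStatements import API (token elided):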
curl https://tools.wmflabs.org/quickstatements/api.php \
  -d action=import -d submit=1 -d username=Vladimir_Alexiev -d format=csv \
  -d "batchname=1.7M WorldCat Identities" \
  --data-raw 'token=...' \
  --data-urlencode data@split/wd-identities-01

<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>openresty/1.15.8.1</center>
</body>
</html>
Vladimir Alexiev (talkcontribs)

Split into 10k-line pieces; same result.

Vladimir Alexiev (talkcontribs)

Hi @Magnus Manske, is a speed of 30 s per statement normal? My two batches mentioned above are just crawling :-( Perhaps QS needs a restart?

Jura1 (talkcontribs)
ArthurPSmith (talkcontribs)

@Vladimir Alexiev Are you seeing 30 s per statement as the edits go one by one, or is that an average speed over many statements? If the former, it suggests you are editing items that are very large (which may cause other trouble, but can certainly happen). If the latter, and QS occasionally runs your edits quickly but takes long breaks in between, that's because QS follows the "maxlag" limit that stops bots from editing Wikidata too fast. Since November, WDQS lag has been included in maxlag, which lets the WDQS servers keep up, but recently it has also meant that bots can only edit for a small percentage of the time.
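(To make the maxlag mechanism concrete, a minimal sketch that is not from this thread: any API request can carry a maxlag parameter, and 5 seconds is the conventional bot default.)

curl -s 'https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&format=json&maxlag=5'
# When the current lag is above 5 s the API does not perform the action; it
# returns {"error":{"code":"maxlag", ...}} plus a Retry-After header, and a
# well-behaved client (QS included) is expected to sleep and retry.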

Jheald (talkcontribs)

I do think something may have happened with QS -- or maybe there have been more jobs in contention recently.

I started a batch last night at about 00:30 GMT (batch report), adding an external ID + a "named as" qualifier + a "retrieved" reference, i.e. three triples each for about 945 items, so a total of 2835 edits to make. It's the kind of small job I would expect to take no more than a couple of hours at most. But it's now 21:15, and there are still 643 edits to go. Maxlag has been bouncing around all day, but there was a clear window from 01:00 last night to 07:30 this morning when it was only about 1 to 2 seconds, so I am not sure quite why things have taken so long.

But something needs to be done to address this by the Dev Team, because this kind of prolonged lock-out from editing is just not sustainable.

Jura1 (talkcontribs)
Jheald (talkcontribs)

One additional thing I note: no errors. Previously, when maxlag throttling has been severe, anything up to 30 to 40% of the edits have sometimes had to be re-run. This time, even though the batch is going very, very slowly, it does all seem to be going through, which is a good thing. So @Magnus, if this is down to a recent change of yours, with slightly more aggressive backing off but then not giving up on edits, thank you!

Hogü-456 (talkcontribs)

I asked Vladimir Alexiev if he has thought about using a bot. He tried it and it did not work. I have a batch that has been running for a few days now. Vladimir said that he could upload this number of claims into a normal RDF repository with a SPARQL Update in a few seconds. I think this is an interesting thing, and maybe it is possible to do that in Wikidata too. It should also be possible to use a bot for big uploads like this, so that other users can use QuickStatements. I think it would be great if there were scripts for that which are easy to understand.

Vladimir Alexiev (talkcontribs)

@ArthurPSmith I think the extreme slowness happens on all updates. Currently only 2 of my batches are running, and it seems to be doing one triple per minute. All these items are people, so some of them are pretty large, but why should an update of a large item take more time? That's a serious problem in the architecture of Wikidata, cc @Lea Lacroix (WMDE), @Lucas Werkmeister (WMDE), @Lydia_Pintscher_(WMDE).

@Jheald and @Magnus Manske: the error rate on some of the batches is very bad; the batch below is at around 50%. Curiously, all the "WorldCat id" statements have gone through, and all the "from VIAF id" references have failed:

#25587 Vladimir Alexiev wd-identities-03 DONE:27254, ERROR:12739

@Jura1 Nice joke :-) It definitely is lohi-lohi.

@Hogü-456 I haven't tried with a bot, but I think QS itself is a sort of bot, and any bot that respects maxlag will face the same problems. Using SPARQL Update is not possible on WD because its main store is relational, not RDF. Each of these statements is recorded as a diff, can be discussed, I can be thanked for it... so WD has a lot more features compared to a normal triplestore in this respect. But still: the extreme slowness of Wikidata will be its doom if the dev team cannot fix it soon.

ArthurPSmith (talkcontribs)

@Vladimir Alexiev Updates of large items take more time because of how QuickStatements works. Before it updates anything, it reads in the entire current state of the item and checks whether what you are trying to add is already there (if so, it does nothing and proceeds to the next command). If you are adding qualifiers or sources to a statement, it first has to find the statement you are modifying. That also explains the 50% error rate: when it proceeds to the next command, the previous edit probably hasn't completed yet, or at least isn't visible yet, so adding the reference fails because there is no statement yet to attach it to.

For that reason a custom bot could probably do the work much faster: if you know what you are adding is definitely not there, you don't need to check for it; and you can add references and qualifiers together with the statement in one step rather than the multi-step process that QuickStatements uses (a sketch follows below).
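(To illustrate that, a rough sketch of my own, not ArthurPSmith's code or QuickStatements internals: one wbeditentity call that writes the claim together with its VIAF reference. The item ID, the two values, the cookie jar and the CSRF token are placeholders, and login, error handling and maxlag retries are left out.)

# The statement and its reference go in as a single edit, so there is no window
# in which the reference can fail because the statement is not visible yet.
curl -s 'https://www.wikidata.org/w/api.php' \
  -b cookies.txt \
  --data-urlencode 'action=wbeditentity' \
  --data-urlencode 'id=Q42' \
  --data-urlencode 'format=json' \
  --data-urlencode 'token=CSRF_TOKEN_PLACEHOLDER' \
  --data-urlencode 'data={"claims":[{"type":"statement","rank":"normal",
    "mainsnak":{"snaktype":"value","property":"P7859",
      "datavalue":{"type":"string","value":"EXAMPLE-WORLDCAT-ID"}},
    "references":[{"snaks":{"P214":[{"snaktype":"value","property":"P214",
      "datavalue":{"type":"string","value":"EXAMPLE-VIAF-ID"}}]}}]}]}'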

Vladimir Alexiev (talkcontribs)