Wikidata talk:Database download

truncated dumps

I downloaded latest-all.json.bz2 on 6th May and the last entity is truncated. It starts

{"type":"item","id":"Q89247542","labels":{"en":{"language":"en","value":"LEDA 2599117"}, ...

and ends

... ,"datatype":"quantity"},"type":"sta

Is the code used to create the data dumps available (e.g. on GitHub)? If so, it could be maintained by the community.

Also, as a suggestion, maybe it would be good to consider modern alternatives to bzip2. For example, xz potentially offers an equivalent or better compression ratio and, more importantly for users, much better decompression speeds, although memory use and compression time can be higher. Zstandard (developed within (Evil) Facebook) also reports excellent performance.  – The preceding unsigned comment was added by Neilireson (talk • contribs) at 12:49, 14 May 2021‎ (UTC).

I noticed the latest gzip-compressed N-Triples file is 41 bytes at the time of writing this. If you are dumping the archives straight to a folder that is web-accessible, would you please consider prefixing the filename with a dot to make it hidden and then making it visible once it is complete (a rough sketch of this follows below)?
I also support ditching bzip2. gzip is nice because it is widely supported and a good compromise between speed and space. But for more space-efficient compression, xz is a better choice than bzip2; AFAIK it supports multiple streams and is also widely available. Infrastruktur (talk) 12:55, 21 July 2022 (UTC)
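
For what it's worth, the hide-then-rename idea above amounts to the usual write-to-a-temporary-name-then-rename pattern; here is a minimal sketch in Python (the function and file names are placeholders, not how the dump scripts actually work):

import os

def publish_dump(write_dump, final_path):
    """Write a dump under a hidden temporary name, then rename it into place."""
    directory, name = os.path.split(final_path)
    hidden_path = os.path.join(directory, "." + name + ".tmp")
    with open(hidden_path, "wb") as f:
        write_dump(f)                        # caller streams the dump into the file
    os.replace(hidden_path, final_path)      # the file becomes visible only when complete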

frequency of dumps

The Wikidata database should be available right from the start for experimenting and for building one's own Wikidata clients. What dump frequency does the community prefer? Conny (talk) 08:54, 4 November 2012 (UTC).

It's likely to be done only at the same frequency as the dumps of the other projects, and the first one is already available, though it contains little/no content [1]. Reedy (talk) 01:02, 5 November 2012 (UTC)
Full history of Q1-Q100 is 76.1 MB; 7z turns this into 887 KB. Reedy (talk) 01:18, 5 November 2012 (UTC)
If you want one, a dump as of earlier today is available from strongspace (hosted by me) or archive.org. 1.4 GB uncompressed, 27 MB when 7zipped. Reedy (talk) 22:46, 5 November 2012 (UTC)

Multilicense

In view of the fixed text, we have to change the German text under the edit window:

By clicking the "Save page" button, you agree to the Terms of Use and irrevocably license your edit under the Creative Commons "Attribution/Share Alike 3.0" license and the GFDL. You agree that a hyperlink or URL to the page is sufficient for the required attribution under the Creative Commons license.

I think this is not so easy, because content in articles up to now has also been under the GFDL... --Conny (talk) 21:36, 4 November 2012 (UTC).

Unfortunately MediaWiki:Wikimedia-copyright and MediaWiki:Wikimedia-copyrightwarning differ - they need to be synchronised, urgently. Jdforrester (talk) 06:17, 6 November 2012 (UTC)

Question about data

In the file wikidatawiki-latest-abstract.xml, why aren't the full titles of some pages given? For example, in Q13, the Japanese title given for 'triskaidekaphobia' is '13', whereas the actual Wikipedia title is '13 (忌み数)'. The title is displayed correctly on this page, but not in the data dump (unless I'm missing something). Any reason for this? — 70.26.26.153 17:05, 6 July 2013 (UTC)

Broken link to incremental download

The link to the incremental download is broken: http://releases.wikimedia.org/other/incr/wikidatawiki/ . Are incremental downloads being provided? Jefft0 (talk)

I figured out the correct link and fixed it. https://www.wikidata.org/w/index.php?title=Wikidata%3ADatabase_download&diff=162829416&oldid=147713105 Jefft0 (talk)

Short explanation on kind of dumps and structure of dumps

I am slightly confused by having so many files. Is there a page anywhere that explains what to find in which file? It's a bit hard to download several gigabytes only to check what's in a file. Is it broadly XML vs. SQL formats? Is there any redundancy between the XML dumps? XamDe (talk) 14:58, 7 November 2014 (UTC)

Incremental JSON dumps

Would it be possible to have incremental JSON dumps, perhaps daily, in the same format as the existing JSON dumps, but containing only the entries that have been changed or added since the previous incremental dump? This would greatly speed some of the bot work I'm currently doing. -- The Anome (talk) 19:31, 27 July 2015 (UTC)

You can use the recent changes API to find out which entities changed and then Special:EntityData (like https://www.wikidata.org/wiki/Special:EntityData/Q42.json ) to get the dump for each entity. This will work as long as you are not 30 days or more behind. - Jan Zerebecki 23:01, 27 July 2015 (UTC)
I also added Wikidata:Data_access#Incremental_updates. - Jan Zerebecki 23:29, 27 July 2015 (UTC)[reply]
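For illustration, a minimal sketch of that polling approach in Python (assuming the third-party requests library; error handling, deduplication and filtering to entity pages only are left out):

import requests

API = "https://www.wikidata.org/w/api.php"

def changed_entity_ids(start_timestamp):
    """Yield IDs of items edited since start_timestamp (ISO 8601, e.g. '2015-07-27T00:00:00Z')."""
    params = {
        "action": "query", "list": "recentchanges", "format": "json",
        "rcnamespace": 0,            # main (item) namespace
        "rcstart": start_timestamp,  # oldest change to include
        "rcdir": "newer",
        "rclimit": "max",
    }
    while True:
        data = requests.get(API, params=params).json()
        for change in data["query"]["recentchanges"]:
            yield change["title"]    # e.g. "Q42"; the same ID may appear more than once
        if "continue" not in data:
            break
        params.update(data["continue"])

def fetch_entity(qid):
    """Fetch the current JSON for a single entity via Special:EntityData."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    return requests.get(url).json()["entities"][qid]
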
Thanks for letting me know. I could certainly do that, but it seems like a lot of work to generate thousands of API calls, a process that would take many hours, rather than just download a single change-file, a process that would most likely take only minutes.

Since we have no guarantee of either dump ordering or JSON serialization stability, I would imagine that we can't just use diff, but assuming we can use timestamps to detect changes, there is an obvious format for these delta files, which would be the same as the previous file format, but with just the content of the latest revisions of each item that had been changed (and perhaps a special stub entry of some sort for those items that had been deleted). This would be really, really useful for bot operators. I'd be happy to write the code for a program to generate it, given two dumpfiles, as well as a program to apply the pseudo-diff to one dumpfile to create another. -- The Anome (talk) 14:31, 9 May 2016 (UTC)
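
For what it's worth, a rough sketch of such a pseudo-diff generator in Python, comparing raw serializations (so it would over-report changes if the serialization is not stable, as noted above; the deleted-item stub format is made up for illustration, and holding one hash per entity in memory is itself a sizeable cost):

import gzip, hashlib, json

def entity_lines(path):
    """Yield (id, raw_line) for each entity in a gzipped JSON dump (one entity per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)["id"], line

def write_delta(old_dump, new_dump, out_path):
    """Write entities that are new or changed, plus stubs for deleted ones."""
    old = {eid: hashlib.sha1(raw.encode()).hexdigest()
           for eid, raw in entity_lines(old_dump)}
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for eid, raw in entity_lines(new_dump):
            seen.add(eid)
            if old.get(eid) != hashlib.sha1(raw.encode()).hexdigest():
                out.write(raw + "\n")
        for eid in old.keys() - seen:                      # deleted since the old dump
            out.write(json.dumps({"id": eid, "deleted": True}) + "\n")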

There are no daily dumps, so we can't just diff two dumps. There was phab:T72246; please reopen it, add a link to this discussion, and tell me there for what bot work you want to use these incremental dumps. (There are some different ways of doing incremental dumps, with different features, so we need to know which features these incremental dumps would need to have.) -- Jan Zerebecki 18:48, 9 May 2016 (UTC)

Discussion pages dumps

Hi everyone. I am using the XML dumps to carry out some analyses on Wikidata; does anybody know where (under which tag) I can find the revisions of the discussion pages? In particular, I am looking for the discussion pages of properties, where constraints are specified. Are they also in the XML dumps or should I use other dumps? Thanks,--Ale Batacchi (talk) 15:06, 12 August 2015 (UTC)AleBatacchi

Replying to my own question: the discussion pages are in the XML dumps, as are all the other pages. My script just skipped them. Ale Batacchi (talk) 11:09, 24 August 2015 (UTC)AleBatacchi

"latest" links for JSON dumps[edit]

Would it be possible to create a 'latest' link in the wikidatawiki/entities/ folder, like there is in the wikidatawiki/ folder? A permanent link to the latest JSON download would be handy. HYanWong (talk) 10:01, 10 November 2015 (UTC)

Likely yes. Would you be so kind as to open a ticket on phabricator.wikimedia.org for this? --Lydia Pintscher (WMDE) (talk) 12:13, 10 November 2015 (UTC)
Now at https://phabricator.wikimedia.org/T118457. Thanks HYanWong (talk) 11:40, 12 November 2015 (UTC)

Reading the JSON dump with Python

Hello! How can I read the JSON dump using Python? The .readlines() function loads the whole file into memory, which isn't efficient. The .readline() function only returns 18 lines and then stops (probably it doesn't work because of the parallel compression). I know that there is a reader in PHP, but I prefer to use Python. So, who can help me? Emijrp (talk) 10:10, 15 November 2016 (UTC)

@Emijrp: I can't give you any advice for reading line by line in Python offhand, but regarding the parallel (multistream) compression: this is only a "problem" with the gzip dumps; the bzip2 ones are in a single stream. Most (all productively used?) gzip clients support multiple streams, while only some bzip2 ones do; that's why we have chosen to use multistream gzip but single-stream bzip2. Cheers, Hoo man (talk) 10:41, 15 November 2016 (UTC)

Here is a solution to stream-decompress the .gz: https://stackoverflow.com/a/40548567/140837
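
For later readers, a minimal sketch of streaming the dump line by line in Python without loading it into memory (this relies on the dump being a JSON array with one entity per line, which is how the current dumps are laid out):

import bz2, gzip, json

def iter_entities(path):
    """Stream entities from a Wikidata JSON dump (.json.gz or .json.bz2)."""
    opener = gzip.open if path.endswith(".gz") else bz2.open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line.rstrip(","))

for entity in iter_entities("latest-all.json.gz"):
    print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))
    break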

lastrevid and modified not present in JSON dumps

I was going through the JSON dumps and wanted to check the freshness of each entity. I noticed that entities don't have lastrevid and modified fields in the JSON dumps. Is this expected? Kaaro01 (talk) 09:21, 14 December 2016 (UTC)

Yes, currently they only contain the actual data and no metadata. Cheers, Hoo man (talk) 09:44, 15 December 2016 (UTC)

Redirects info in the JSON dumps is missing

IMHO: As far as I can see, the "Wikidata JSON dumps" are missing the information about Wikidata ID redirects, and maybe the same is true for deleted Wikidata IDs (Wikidata:Requests for deletions). Context: I need this type of information for validating Wikidata IDs in an external 3rd-party database.

  • Are there any suggestions for getting this type of information in the JSON dumps (included, or in separate files)?
  • What quick-and-dirty temporary solution would you suggest for processing this information (XML or RDF dumps)?

--ImreSamu (talk) 13:10, 10 November 2017 (UTC)

JSON-LD

Will there be en:JSON-LD? 85.179.161.128 09:42, 18 May 2018 (UTC)

Possible to change formats of RDF dumps in nt and ttl?

- In the RDF TTL (Turtle) format dump, is it possible to put each entity (terminated with '.') on one line?

Currently, there are line breaks at every ',' or ';'. This makes it difficult to analyze using any buffered readers. For example, with Spark, the file will be split into partitions and loaded in chunks. It would require additional logic to make sure the partitioning doesn't happen within one entity.

I can understand that with the current line breaks the file is easier for humans to read, but I guess most people would want to analyze this 300 GB+ file with tools rather than by eye.


- Also, in the NT (N-Triples) dump, it would be nicer if the subjects and predicates were separated by a TAB ('\t').

Currently, it uses a space (' '), which makes reading complicated because some of the subjects contain spaces.

I would not expect a specific kind of Turtle or N-Triples serialization; this is not specified and not stable. Better to use a parser that can fully handle the serialization (e.g. one that understands both spaces and tabs as separators). If your consuming application cannot handle this efficiently, you may pipe the RDF dump through a parser that normalizes to your requirements. -- JakobVoss (talk) 09:13, 29 November 2018 (UTC)
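
As an illustration of the "use a parser" advice, a minimal sketch in Python that splits an N-Triples line without being confused by spaces inside literals; for anything beyond quick analysis, a real RDF parser (e.g. rdflib or Serd) is preferable:

def split_ntriple(line):
    """Split one N-Triples line into (subject, predicate, object).

    Subject and predicate terms are IRIs or blank nodes and contain no
    unescaped spaces, so splitting on the first two spaces is safe; the
    object (often a literal) may contain spaces and is taken up to the
    terminating ' .'.
    """
    line = line.rstrip()
    if not line or line.startswith("#"):
        return None
    subject, rest = line.split(" ", 1)
    predicate, obj = rest.split(" ", 1)
    if obj.endswith("."):
        obj = obj[:-1].rstrip()
    return subject, predicate, obj

print(split_ntriple(
    '<http://www.wikidata.org/entity/Q42> '
    '<http://schema.org/name> "Douglas Adams"@en .'))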

Who is responsible for dumps at Archive.org?

RDF and JSON dumps are available at archive.org, but it is unclear when, how often and by whom this is done. By now there is always a lag of at least half a year: dumps are neither available anymore at https://dumps.wikimedia.org/wikidatawiki/entities/ nor available yet at the Internet Archive. -- JakobVoss (talk) 09:13, 29 November 2018 (UTC)

@Hoo man: :) --Lydia Pintscher (WMDE) (talk) 17:34, 30 November 2018 (UTC)

Latest Wikidata dump files are "empty"

My apologies if this is not the right forum. If not, any guidance on where to best report this will be much appreciated.

When looking at the latest dump files on https://dumps.wikimedia.org/wikidatawiki/entities/ like "latest-all.json.bz2" and "latest-all.json.gz", they are both only a few bytes.

The latest "dated" dump is in the https://dumps.wikimedia.org/wikidatawiki/entities/20190701/ directory. There the two files have an extra extension .not for some reason.

Just wondering whether you are already aware of the issue and whether there is anything I can do to help? Thanks, --Anders-sandholm (talk) 11:08, 5 July 2019 (UTC)

@Anders-sandholm: Please note "No JSON dumps for the weeks of June 24th and July 1st". --Succu (talk) 12:20, 5 July 2019 (UTC)
@Succu: Thanks for the reference. I hadn't seen that. That explains everything. --Anders-sandholm (talk) 12:59, 5 July 2019 (UTC)

Pages Meta-History Names

The complete revision history of Wikipedia articles is distributed across multiple bzip2 dumps. Is there any way to know which bzip2 file contains which article? Is there any mapping for this? Descentis (talk) 18:00, 6 September 2019 (UTC)

Item's history: are XML dumps the only solution?

Hi all,

I'm doing analysis on the history of Wikidata items. If I understand correctly, the JSON dumps just contain a snapshot (current versions). The only historical versions that I've seen so far are the ones in the XML files, but those dumps are "not recommended". Is that the only way to get the full history of all Wikidata items? Diego (WMF) (talk) 04:23, 24 March 2020 (UTC)

ShEx schema

Are the ShEx schema pages available as a separate download file? As far as I can see, the only options for downloading all schemas are via the API or the big database dump. — Finn Årup Nielsen (fnielsen) (talk) 11:41, 28 May 2021 (UTC)
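
Not a dump, but in the meantime here is a rough sketch of collecting all schemas via the API in Python (assuming the requests library; the EntitySchema namespace ID of 640 and the Special:EntitySchemaText page are assumptions worth double-checking):

import requests

API = "https://www.wikidata.org/w/api.php"

def schema_ids():
    """Yield EntitySchema IDs (e.g. 'E10') by listing the EntitySchema namespace."""
    params = {"action": "query", "list": "allpages", "format": "json",
              "apnamespace": 640, "aplimit": "max"}
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"].split(":", 1)[1]      # "EntitySchema:E10" -> "E10"
        if "continue" not in data:
            break
        params.update(data["continue"])

def schema_text(eid):
    """Fetch the raw ShExC text of one schema."""
    url = f"https://www.wikidata.org/wiki/Special:EntitySchemaText/{eid}"
    return requests.get(url).text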

Dumps cannot be decompressed in parallel

This page states that "the files are using parallel compression, which means that some decompressors cannot reliably unpack the files". However, when analyzing the bzip2 dumps, I see that they are likely not created in parallel, as they only contain a single stream. This means that it takes hours to decompress the larger dumps, making them time-consuming to work with.

This can be verified by downloading latest-lexemes.json.bz2 and decompressing it with both bunzip2 and pbzip2:

$ time bunzip2 -c -k latest-lexemes.json.bz2 > serial-1.json

real	0m49,269s
user	0m47,266s
sys	0m1,515s

$ time pbzip2 -d -k -c latest-lexemes.json.bz2 > serial-2.json

real	0m45,491s
user	0m45,354s
sys	0m2,050s

Compressing the file in parallel then decompressing it in parallel makes a huge difference in decompression time:

pbzip2 -z < serial-1.json > parallel.json.bz2

time pbzip2 -d -k -c parallel.json.bz2 > parallel.json

real	0m4,238s
user	1m3,728s
sys	0m2,207s

User:Bigbump (talk) 17:46, 09 June 2021 (UTC)

It seems only gzip is multistream while bzip2 is single-stream. See this answer. Mitar (talk) 05:17, 20 June 2021 (UTC)
In fact, it seems the problem is in pbzip2: it is unable to decompress in parallel files that were not compressed with pbzip2. You should use lbunzip2 instead, which seems able to decompress any bzip2 file in parallel, including the Wikidata dumps. See here. Mitar (talk) 04:04, 21 June 2021 (UTC)
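If you need the parallel decompression from inside a script, here is a sketch of piping a dump through lbunzip2 and streaming entities in Python (assuming lbunzip2 is installed and accepts -c for writing to stdout, as lbzip2 does):

import json, subprocess

def iter_entities_parallel(path):
    """Decompress with lbunzip2 (multi-threaded) and stream entities from the dump."""
    proc = subprocess.Popen(["lbunzip2", "-c", path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            line = raw.decode("utf-8").strip()
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line.rstrip(","))
    finally:
        proc.stdout.close()
        proc.wait()

for entity in iter_entities_parallel("latest-lexemes.json.bz2"):
    print(entity["id"])
    break
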
See https://phabricator.wikimedia.org/T222985 --Sascha (talk) 12:45, 8 May 2023 (UTC)

problem in magnet link value

The magnet link leads to nowhere. Is the syntax OK? Can somebody correct the value? Thanks. -- Christian 🇫🇷 FR (talk) 20:47, 4 February 2022 (UTC)

Unnecessary text is displayed in the translated text

The markup "<span id="JSON_dumps_(recommended)_">" is incorrectly displayed. Afaz (talk) 02:20, 13 December 2022 (UTC)

Dead link to documentation

The link to "the JSON structure documentation" is dead; it should be to https://doc.wikimedia.org/Wikibase/master/php/docs_topics_json.html. Alex Chamberlain (talk) 15:20, 8 July 2023 (UTC)[reply]

I figured out how to fix this, so I've just done it. Alex Chamberlain (talk) 08:09, 11 August 2023 (UTC)

Torrent download

The first torrent

   You can currently download a fairly recent dump using a torrent. wikidata-20220103-all.json.gz (109.04 GiB) on academictorrents.com ( magnet)

can now be replaced (or supplemented) by https://academictorrents.com/details/0852ef544a4694995fcbef7132477c688ded7d9a, which is a dump of Wikidata from January 1st, 2024. Jjkoehorst (talk) 11:02, 4 February 2024 (UTC)

An official (monthly renewable) torrent file and magnet link with the full archive

Does anyone in the WMDE team care to implement the [subj]? :-)

Value: a huge improvement in download speed for the clients and lower utilization of the current HTTP server.

All it takes is to (once) install and keep running a torrent app on a trusted machine.

And (once) create a script (roughly sketched below) that would monthly:

- generate a .torrent file and a magnet link

- publish them on the Dumps page on schedule.

Some of the community members (including me) would be happy to download it on a regular basis and remain a seeding source.

Nikolay Komarov (talk) 10:49, 6 February 2024 (UTC)[reply]
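
Purely as an illustration, such a script could look roughly like this in Python (the third-party torf library, the file name and the tracker URL are placeholders/assumptions; publishing the result on the Dumps page is left out):

from torf import Torrent          # third-party: pip install torf

DUMP = "wikidata-20240129-all.json.gz"                          # placeholder file name

torrent = Torrent(
    path=DUMP,
    trackers=["udp://tracker.opentrackr.org:1337/announce"],    # example tracker
    comment="Wikidata JSON dump",
)
torrent.generate()                    # hash the pieces (takes a while for 100+ GB)
torrent.write(DUMP + ".torrent")      # the .torrent file to publish
print(torrent.magnet())               # the magnet link to publish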

All right, so I managed to download the file wikidata-20240129-all.json.gz; it took around 20 hours to download 122 GB.
In the morning (CET) the speed was about 2-4 MB/s; in the evening it dropped to 0.5-1 MB/s.
I created a torrent file and a magnet link: magnet:?xt=urn:btih:58281b1f749ea9c3c2f21858d62cb140b302e886&dn=wikidata-20240129-all.json.gz Feel free to use it to download the file wikidata-20240129-all.json.gz (122 GB)! :-)
Note: after downloading and unpacking you might want to check that the MD5 or SHA-1 of the file is the same as the MD5 or SHA-1 from the official source; that will confirm the JSON file is unchanged. Nikolay Komarov (talk) 22:21, 7 February 2024 (UTC)
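For reference, a minimal way to compute such a hash in Python without loading the whole file into memory (compare the printed digest with the value published alongside the dump):

import hashlib

def file_sha1(path, chunk_size=1 << 20):
    """Compute the SHA-1 of a large file in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

print(file_sha1("wikidata-20240129-all.json.gz"))
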
https://academictorrents.com/ is a good tracker to upload dumps to. Nikolay Komarov (talk) 15:37, 8 February 2024 (UTC)