Wikidata talk:Database download

frequency of dumps[edit]

The Wikidata database should be available for download right from the start, for experimenting and for building one's own Wikidata clients. What dump frequency does the community prefer? Conny (talk) 08:54, 4 November 2012 (UTC).

It's likely to be done only at the same frequency as the dumps of the other projects, and the first one is already available, though it contains little/no content [1]. Reedy (talk) 01:02, 5 November 2012 (UTC)
Full history of Q1-Q100 is 76.1MB, 7z turns this into 887KB. Reedy (talk) 01:18, 5 November 2012 (UTC)
If you want a dump as of earlier today, get it from strongspace (hosted by me) or archive.org. 1.4GB uncompressed, 27MB when 7zipped. Reedy (talk) 22:46, 5 November 2012 (UTC)

Multilicense[edit]

Given the fixed text, we have to change the German text under the edit window:

By clicking the "Save page" button, you agree to the Terms of Use and irrevocably license your edit under the Creative Commons "Attribution/Share-Alike 3.0" license and the GFDL. You agree that a hyperlink or URL to the page is sufficient attribution as required by the Creative Commons license.

I think this is not so easy, because until now the content in articles has also been under the GFDL... --Conny (talk) 21:36, 4 November 2012 (UTC).

Unfortunately MediaWiki:Wikimedia-copyright and MediaWiki:Wikimedia-copyrightwarning differ - they need to be synchronised, urgently. Jdforrester (talk) 06:17, 6 November 2012 (UTC)

Question about data[edit]

In the file wikidatawiki-latest-abstract.xml, why aren't the full titles of some pages given? For example, in Q13, the Japanese title given for 'triskaidekaphobia' is '13', whereas the actual Wikipedia title is '13 (忌み数)'. The title is displayed correctly on this page, but not in the data dump (unless I'm missing something). Any reason for this? — 70.26.26.153 17:05, 6 July 2013 (UTC)

Broken link to incremental download[edit]

The link to the incremental download is broken: http://releases.wikimedia.org/other/incr/wikidatawiki/ . Are incremental downloads being provided? Jefft0 (talk)

I figured out the correct link and fixed it. https://www.wikidata.org/w/index.php?title=Wikidata%3ADatabase_download&diff=162829416&oldid=147713105 Jefft0 (talk)

Short explanation on kind of dumps and structure of dumps[edit]

I am slightly confused by having so many files. Is there a page anywhere that explains what to find in which file? It's a bit hard to download several gigabytes only to check what's in a file. Is it broadly XML vs. SQL formats? Is there any redundancy between the XML dumps? XamDe (talk) 14:58, 7 November 2014 (UTC)

Incremental JSON dumps[edit]

Would it be possible to have incremental JSON dumps, perhaps daily, in the same format as the existing JSON dumps, but containing only the entries that have been changed or added since the previous incremental dump? This would greatly speed some of the bot work I'm currently doing. -- The Anome (talk) 19:31, 27 July 2015 (UTC)

You can use the recent changes API to find out which entities changed and then Special:EntityData (like https://www.wikidata.org/wiki/Special:EntityData/Q42.json ) to get the dump for each entity. This will work as long as you are not 30 days or more behind. - Jan Zerebecki 23:01, 27 July 2015 (UTC)
I also added Wikidata:Data_access#Incremental_updates. - Jan Zerebecki 23:29, 27 July 2015 (UTC)
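The workflow Jan describes can be sketched in Python. The endpoint and parameters are the standard MediaWiki action API ones; the helper function names are my own, and this is only an illustrative sketch, not a production-ready updater.

```python
# Sketch of the incremental-update approach: poll the MediaWiki
# recent-changes API for edited pages, then fetch each entity's JSON
# via Special:EntityData.
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def recent_changes_url(start_timestamp, limit=500):
    """Build an API request for pages changed since start_timestamp."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": start_timestamp,
        "rcdir": "newer",          # oldest first
        "rcnamespace": "0",        # items live in the main namespace
        "rclimit": str(limit),
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def entity_data_url(entity_id):
    """Per-entity JSON dump, e.g. .../Special:EntityData/Q42.json"""
    return "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % entity_id

def fetch_changed_entities(start_timestamp):
    """Yield the current JSON of every entity changed since the timestamp."""
    with urllib.request.urlopen(recent_changes_url(start_timestamp)) as r:
        changes = json.load(r)["query"]["recentchanges"]
    seen = set()
    for change in changes:
        qid = change["title"]
        if qid not in seen:
            seen.add(qid)
            with urllib.request.urlopen(entity_data_url(qid)) as r:
                yield json.load(r)["entities"][qid]
```

A real client would also have to page through the recent-changes results and deduplicate across batches.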
Thanks for letting me know. I could certainly do that, but it seems like a lot of work to generate thousands of API calls, a process that would take many hours, rather than just downloading a single change-file, which would most likely take only minutes.

Since we have no guarantee of either dump ordering or JSON serialization stability, I would imagine that we can't just use diff, but assuming we can use timestamps to detect changes, there is an obvious format for these delta files, which would be the same as the previous file format, but with just the content of the latest revisions of each item that had been changed (and perhaps a special stub entry of some sort for those items that had been deleted). This would be really, really useful for bot operators. I'd be happy to write the code for a program to generate it, given two dumpfiles, as well as a program to apply the pseudo-diff to one dumpfile to create another. -- The Anome (talk) 14:31, 9 May 2016 (UTC)
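The pseudo-diff idea above can be sketched as follows, assuming both inputs are line-per-entity JSON dumps (one entity object per line, as in the real dumps, minus the surrounding brackets and trailing commas). The deletion-stub format here is invented purely for illustration, as the comment above leaves it open.

```python
# Emit only the entities whose content differs between two dumps,
# plus a stub entry for each deleted ID.
import json

def parse_dump(lines):
    """Map entity ID -> canonical JSON string for one dump."""
    entities = {}
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        # Re-serialize with sorted keys so mere key-order differences
        # don't register as changes (serialization order is unstable).
        entities[entity["id"]] = json.dumps(entity, sort_keys=True)
    return entities

def delta(old_lines, new_lines):
    """Yield changed/new entities, then stubs for deleted ones."""
    old, new = parse_dump(old_lines), parse_dump(new_lines)
    for qid, body in new.items():
        if old.get(qid) != body:
            yield body                                  # new or changed
    for qid in old.keys() - new.keys():
        yield json.dumps({"id": qid, "deleted": True})  # hypothetical stub
```

Applying the delta to an old dump is then just a dictionary update followed by removal of the stubbed IDs.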

There are no daily dumps, so we can't just diff two dumps. There was phab:T72246, please reopen it, add a link to this discussion and tell me there for what bot work you want use these incremental dumps. (There are some different ways of doing incremental dumps, with different features, so we need to know which features these incremental dumps would need to have.) -- Jan Zerebecki 18:48, 9 May 2016 (UTC)

Discussion pages dumps[edit]

Hi everyone. I am using the XML dumps to carry out some analyses on Wikidata, does anybody know where (under which tag) I can find the revisions of the discussion pages? In particular, I am looking for the discussion pages of properties, where constraints are specified. Are they also in the XML dumps or should I use other dumps? Thanks,--Ale Batacchi (talk) 15:06, 12 August 2015 (UTC)AleBatacchi

Replying to my own question: the discussion pages are in the XML dumps, like all the other pages. My script just skipped them. Ale Batacchi (talk) 11:09, 24 August 2015 (UTC)AleBatacchi

"latest" links for JSON dumps[edit]

Would it be possible to create a 'latest' link in the wikidatawiki/entities/ folder, like there is in the wikidatawiki/ folder? A permanent link to the latest JSON download would be handy. HYanWong (talk) 10:01, 10 November 2015 (UTC)

Likely yes. Would you be so kind as to open a ticket on phabricator.wikimedia.org for this? --Lydia Pintscher (WMDE) (talk) 12:13, 10 November 2015 (UTC)
Now at https://phabricator.wikimedia.org/T118457. Thanks HYanWong (talk) 11:40, 12 November 2015 (UTC)

Reading the JSON dump with Python[edit]

Hello! How can I read the JSON dump using Python? The .readlines() function loads the whole file into memory and isn't efficient. The .readline() function only returns 18 lines and then stops (it probably doesn't work because of the parallel compression). I know that there is a reader in PHP, but I prefer to use Python. So, who can help me? Emijrp (talk) 10:10, 15 November 2016 (UTC)

@Emijrp: I can't give you any advice for reading line by line in Python offhand, but regarding the parallel (multistream) compression: this is only a "problem" with the gzip dumps; the bzip2 ones are in a single stream. Most (all productively used?) gzip clients support multiple streams, while only some bzip2 ones do; that's why we have chosen to use multistream gzip, but single-stream bzip2. Cheers, Hoo man (talk) 10:41, 15 November 2016 (UTC)
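Given the single-stream bzip2 dump mentioned above, streaming it line by line in Python could look like the sketch below. It assumes the dump's conventional layout (one entity's JSON per line, wrapped in `[` and `]`, each line ending in a comma); the filename is only an example.

```python
# Stream a bzip2-compressed Wikidata JSON dump without loading it
# into memory: bz2.open yields a line iterator, and each line (after
# stripping the trailing comma and skipping the "[" / "]" lines)
# is one entity's JSON.
import bz2
import json

def iter_entities(path):
    """Yield one parsed entity dict per dump line."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)

# Example (hypothetical filename):
# for entity in iter_entities("wikidata-20161114-all.json.bz2"):
#     print(entity["id"])
```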

lastrevid and modified not present in JSON dumps[edit]

I was going through the JSON dumps and wanted to see the freshness of each entity. I noticed that entities don't have lastrevid and modified fields in the JSON dumps. Is this expected? Kaaro01 (talk) 09:21, 14 December 2016 (UTC)

Yes, currently they only contain the actual data and no meta data. Cheers, Hoo man (talk) 09:44, 15 December 2016 (UTC)

Redirects info in the JSON dumps is missing[edit]

imho: As far as I can see, the "Wikidata JSON dumps" are missing the information about wikidata-id redirects, and maybe the same is true for deleted wikidata-ids (Wikidata:Requests for deletions). Context: I need this type of information for validating wikidata-ids against an external 3rd-party database.

  • Are there any suggestions for getting this type of JSON dump? (included, or in separate files?)
  • What quick & dirty temporary solution would you suggest for extracting this information in the meantime? (XML or RDF dumps?)

--ImreSamu (talk) 13:10, 10 November 2017 (UTC)
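One quick-and-dirty answer to the second question, sketched under the assumption that redirected items appear in the pages XML dump with the standard MediaWiki `<redirect title="..."/>` element: stream the dump with `iterparse` and collect a source-QID to target-QID map. The namespace string matches export schema 0.10 and may differ for other dump versions.

```python
# Extract (source, target) redirect pairs from a MediaWiki pages XML dump.
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # may vary per dump version

def iter_redirects(xml_file):
    """Yield (source_title, target_title) for every redirect page."""
    for _, elem in ET.iterparse(xml_file):
        if elem.tag == NS + "page":
            redirect = elem.find(NS + "redirect")
            if redirect is not None:
                yield elem.find(NS + "title").text, redirect.get("title")
            elem.clear()  # keep memory bounded while streaming
```

For the multi-gigabyte real dumps you would wrap the file in a bz2 reader, but the parsing logic stays the same.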

JSON-LD[edit]

Will there be en:JSON-LD? 85.179.161.128 09:42, 18 May 2018 (UTC)

Possible to change formats of RDF dumps in nt and ttl?[edit]

- In the RDF Turtle (.ttl) dump, is it possible to put each entity (terminated with '.') on one line?

Currently, there are line breaks at every ',' or ';'. This makes it difficult to analyze with any buffered reader. For example, with Spark, the file will be split into partitions and loaded in chunks. It would require additional logic to make sure the partitioning doesn't happen within one entity.

I can understand that the current line breaks make it easier for humans to read, but I guess most people would want to analyze this 300GB+ file with tools rather than by eye.


- Also, in the NT triples dump, it would be nicer if the subjects and predicates were separated by a TAB ('\t').

Currently, it uses a space (' '), which makes reading tasks complicated because some of the values contain spaces.

I would not expect a specific kind of Turtle or NT serialization; this is not specified and is unstable. Better to use a parser that can fully handle the serialization (e.g. one that understands both spaces and tabs as separators). If your consuming application cannot handle this efficiently, you may pipe the RDF dump through a parser that normalizes it to your requirements. -- JakobVoss (talk) 09:13, 29 November 2018 (UTC)
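The parsing advice above in a minimal stdlib sketch: rather than relying on a particular separator, split each N-Triples line into subject, predicate and object with at most two splits, so that spaces inside object literals survive. This deliberately ignores escapes and other edge cases; a real RDF parser is still preferable, as the comment above says.

```python
# Naive N-Triples line splitter: subjects and predicates are IRIs or
# blank nodes and contain no raw spaces, so only the object (which may
# be a literal with spaces) needs the remainder of the line.
def split_ntriple(line):
    """Split one N-Triples statement into (subject, predicate, object),
    or return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    subject, predicate, obj = line.split(" ", 2)
    return subject, predicate, obj.rstrip(" .")  # drop statement terminator
```

In a Spark job this would run per line after partitioning, since N-Triples (unlike Turtle) guarantees one complete statement per line.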

Who is responsible for dumps at Archive.org?[edit]

RDF and JSON dumps are available at archive.org, but it is unclear when, how often, and by whom this is done. Right now there is always a gap of at least half a year: dumps that are no longer available at https://dumps.wikimedia.org/wikidatawiki/entities/ are not yet available at the Internet Archive. -- JakobVoss (talk) 09:13, 29 November 2018 (UTC)

@Hoo man: :) --Lydia Pintscher (WMDE) (talk) 17:34, 30 November 2018 (UTC)