Talk:Q475488

From Wikidata

Autodescription — EPUB (Q475488)

description: e-book file format standard by the International Digital Publishing Forum


Untitled

@Toto256: I recently noticed that mimetypeapplication/epub+zip is listed as a ffid for this item. Is this info from PRONOM? Is it possible that this is meant as a container signature? A colleague brought this up to me in work to extend Siegfried (Q59982409) to reuse data from Wikidata. We are interested in discussing how to structure container signatures in Wikidata. We'd like to get your input on modeling this data, if you're interested. YULdigitalpreservation (talk) 19:33, 9 July 2020 (UTC)

Indeed, a well-formed epub is supposed to have as its first item a text file containing the mimetype with no compression (see https://www.w3.org/publishing/epub3/epub-ocf.html#sec-zip-container-mime). So the string mimetypeapplication/epub+zip is a magic string for this format. Toto256 (talk) 20:49, 9 July 2020 (UTC)
Hi @Toto256:, it would be great to have you involved in a further discussion about how to structure that particular signature. My question is: do you think there is enough information encoded in the Wikidata record here? We know there's going to be a file with mimetype/... as the magic number that we can find in the EPUB structure. But will everyone arriving at this record know that they have to process a ZIP file first to reach that file and read the magic number? As Kat says, we have been discussing it a bit over the last few weeks. I need to sit down to write a shape expression. If you're interested in reviewing that, it would be great. Beet keeper (talk) 00:05, 13 July 2020 (UTC)
Hi @Beet keeper:, sorry for not answering sooner... In fact, in the case of an epub, we are talking about a "normal" signature. Indeed, the specifications require that the file with the mimetype is the first one and that it MUST NOT be compressed (it is stored directly). So there is no need to process any ZIP (the ZIP processing is only needed when you need to access the actual content). You just need to jump to byte offset 30 and read the string directly. Moreover, these signatures are referenced in the specifications under "Magic numbers". I hope my explanations are clear and that they answer your questions. Toto256 (talk) 13:01, 15 July 2020 (UTC)
Furthermore, be aware that this is quite a unique case among formats using a ZIP container. Usually, as for docx, xlsx, and others, you really need to process the ZIP container to find specific patterns in order to identify the format. This is a matter to investigate, and it probably needs another property like "container signature" to be handled properly... Cheers Toto256 (talk) 13:08, 15 July 2020 (UTC)
Thanks for the clarification @Toto256:. I spent a bit of time with the signature and some samples today and can see how this appears in the byte-stream. Beet keeper (talk) 20:07, 25 July 2020 (UTC)
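
For readers arriving at this thread, the byte-level check described above can be sketched roughly as follows. This is only an illustration, not how Siegfried or any other tool implements it: the names looks_like_epub and build_sample are hypothetical, and the fixed offset of 30 assumes the first local file header carries no extra field, so that the entry name "mimetype" and its stored content sit directly after the 30-byte header.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"            # hex 50 4B 03 04
EPUB_MAGIC = b"mimetypeapplication/epub+zip"  # entry name followed by its stored content
MAGIC_OFFSET = 30                           # size of a local file header before name/extra


def looks_like_epub(data: bytes) -> bool:
    """Check the leading bytes only; the ZIP container is never unpacked."""
    return (
        data[:4] == ZIP_LOCAL_HEADER
        and data[MAGIC_OFFSET:MAGIC_OFFSET + len(EPUB_MAGIC)] == EPUB_MAGIC
    )


def build_sample(first_entry: str, payload: bytes) -> bytes:
    """Hypothetical helper: a tiny ZIP whose first entry is stored uncompressed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(zipfile.ZipInfo(first_entry), payload,
                    compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


if __name__ == "__main__":
    print(looks_like_epub(build_sample("mimetype", b"application/epub+zip")))
    print(looks_like_epub(build_sample("word/document.xml", b"<w/>")))
```

A docx-like container fails this check even though it is a valid ZIP, which is the difference from a true container signature: there the archive would have to be opened and its entries inspected.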

@Toto256: Another question. What do you think of the way the file format identification pattern is recorded here in Wikidata? It looks like the references need updating so that both statements reference Kessler's format page. Do you agree? The relativity is also set only for the ASCII statement. Would it be more consistent to have it set for both? What do you think about how usable this is in general? Could it be better to record a signature like this using PRONOM-like syntax, for example fmt/483: 504B0304{26}6D696D65747970656170706C69636174696F6E2F657075622B7A6970? I'd argue it captures the EPUB specification for the magic number a little more accurately. We wouldn't have to parse and take into account the two different encoding values. Beet keeper (talk) 20:07, 25 July 2020 (UTC)

Hi @Beet keeper: you are correct in saying that the single PRONOM-like signature is probably a better choice: indeed, I don't think there is consensus on whether two signature statements are linked by an OR or an AND. If I remember correctly, when I inserted the statements it was also a way to demonstrate different ways of expressing them. So feel free to make it better ;-) Toto256 (talk) 08:38, 26 July 2020 (UTC)
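
To illustrate why a single PRONOM-like string may be easier to consume, here is a minimal sketch that turns the fmt/483 sequence quoted above into a byte matcher. It is an assumption-laden toy, not PRONOM's or Siegfried's parser: compile_signature is a hypothetical name, and it understands only plain hex bytes and fixed-length {n} gaps, not the rest of the PRONOM syntax.

```python
import re

# Sequence as quoted in this thread for fmt/483.
EPUB_SIG = "504B0304{26}6D696D65747970656170706C69636174696F6E2F657075622B7A6970"


def compile_signature(sig: str) -> re.Pattern:
    """Compile hex runs and fixed {n} gaps into a bytes regex (toy subset only)."""
    if not re.fullmatch(r"(?:[0-9A-Fa-f]{2}|\{\d+\})+", sig):
        raise ValueError("unsupported signature syntax: " + sig)
    parts = []
    for hex_run, gap in re.findall(r"((?:[0-9A-Fa-f]{2})+)|\{(\d+)\}", sig):
        if hex_run:
            # Literal bytes, escaped so that none act as regex metacharacters.
            parts.append(re.escape(bytes.fromhex(hex_run)))
        else:
            # Fixed-length gap: exactly n arbitrary bytes.
            parts.append(b".{%d}" % int(gap))
    return re.compile(b"".join(parts), re.DOTALL)


if __name__ == "__main__":
    pattern = compile_signature(EPUB_SIG)
    sample = b"PK\x03\x04" + bytes(26) + b"mimetypeapplication/epub+zip"
    # match() anchors at the beginning of the data, i.e. a BOF-relative signature.
    print(pattern.match(sample) is not None)
```

With this form a consumer handles one statement and one encoding, instead of combining a hexadecimal statement and an ASCII statement whose AND/OR relationship is not defined.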