Wikidata:Property proposal/magic numbers


File format magic numbers

Description: magic numbers used to incorporate file format metadata, in the form of a string-coded hexadecimal number (usual encoding, "0" = 0 and "F" = 15, spaces ignored). Qualifiers can specify an offset and a padding value for this number.
Data type: String
Template parameter: Template:Infobox file format (Q10986167), magic number parameter
Domain: file format (Q235557)
Example: GIF (Q2192) -> 47 49 46 38 39 61
Source: Gary Kessler's File Signatures Table
Planned use: I plan to add magic numbers to Wikidata items for the corresponding file formats.
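
To make the proposed value format concrete, here is a minimal Python sketch of how a consumer might decode a string-coded hexadecimal value into raw bytes. The function name `parse_magic` is hypothetical and not part of the proposal; the only rule taken from the description is that spaces are ignored.

```python
def parse_magic(value: str) -> bytes:
    """Decode a string-coded hexadecimal magic number into raw bytes.

    Hypothetical helper: whitespace is dropped first, as the proposal
    says spaces are ignored, then each pair of hex digits becomes a byte.
    """
    compact = "".join(value.split())  # remove all whitespace
    return bytes.fromhex(compact)     # raises ValueError on malformed input


# The GIF (Q2192) example value from the proposal
gif_magic = parse_magic("47 49 46 38 39 61")
print(gif_magic)  # b'GIF89a'
```

Because spaces carry no meaning, "474946383961" decodes to the same bytes as the spaced form.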
Motivation

Magic numbers are constant numerical or text values used to identify file formats. Having this data in Wikidata will help make file format information more complete. Magic numbers are used to verify file signatures and in forensic computing. This is also a parameter of the Infobox:File format. If we create this property, it will be possible to transfer all of the magic numbers stored in infoboxes to Wikidata. There is also the list [List of file signatures], which we could transfer to Wikidata. YULdigitalpreservation (talk) 15:43, 17 October 2016 (UTC)

Qualifier


(Additions to the proposal by TomT0m)

Offset
   Done: offset (P4153) (Talk and documentation)
Description: qualifier of "magic number" giving the number of bytes that precede the magic number to be searched for in a file
Data type: Number (not available yet)
Example: Modelling the "RVT" entry of the source table,
    [512 (0x200) byte offset]
    00 00 00 00 00 00 00 00    ........
    RVT    Revit Project File subheader
gives
    ⟨ RVT ⟩ magic number ⟨ 00 00 00 00 00 00 00 00 ⟩
    offset (P4153) ⟨ 512 ⟩
Source: Gary Kessler's File Signatures Table
Planned use: qualifier for the property above
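
As an illustration of how the offset qualifier would be consumed, the following Python sketch checks a magic number at a given byte offset. The function name `matches_signature` and the synthetic buffer are assumptions made for this example; only the RVT values (eight zero bytes at offset 512) come from the proposal.

```python
def matches_signature(data: bytes, magic: bytes, offset: int = 0) -> bool:
    """Return True if `magic` appears in `data` starting at byte `offset`.

    Hypothetical helper: `offset` plays the role of the offset (P4153)
    qualifier, i.e. the number of bytes before the magic number.
    """
    return data[offset:offset + len(magic)] == magic


# Synthetic buffer standing in for a file: 512 non-zero bytes,
# then the eight zero bytes from the RVT example.
rvt_magic = bytes(8)
sample = b"\x01" * 512 + rvt_magic + b"rest of file"

print(matches_signature(sample, rvt_magic, offset=512))  # True
print(matches_signature(sample, rvt_magic, offset=0))    # False
```

A buffer shorter than the offset simply fails the comparison, because the slice comes back shorter than the magic number.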

Talk

  •  Support  Wait. author  TomT0m / talk page 17:31, 17 October 2016 (UTC)
  •  Support but there are several other kinds of "magic numbers", so I think the name needs to be more descriptive - maybe "file format magic numbers"? Also, with a string value, isn't there some room for ambiguity in how the numbers are to be represented here? ArthurPSmith (talk) 18:18, 17 October 2016 (UTC)
    • Good point. If the numbers are string-coded hexadecimal, this should be made explicit. The spaces also seem totally irrelevant and add a burden to parsing. I can also see that the source table also specifies offsets: [11 byte offset] and [512 (0x200) byte offset]. In a structured data project, this could be handled better than with an unspecified string format. Also consider whether the value could hold the non-hex version, that is, the string itself. For example, for
    46 41 58 43 4F 56 45 52 FAXCOVER
    2D 56 45 52 	        -VER
    
    it should be possible to store "FAXCOVER-VER" directly and more efficiently, maybe with an offset as a qualifier, and maybe a "padding value" also as a qualifier, something like
    ⟨ format ⟩ magic string ⟨ FAXCOVER-VER ⟩
    offset (P4153) ⟨ 0 byte ⟩
    padding ⟨ 0 ⟩
    . author  TomT0m / talk page 18:32, 17 October 2016 (UTC)
  •  Support It will be very useful for data regarding file type identification. CC0 (talk) 11:45, 28 October 2016 (UTC)
  •  Comment Could this property be specified to contain values which are Perl Compatible Regular Expressions (PCRE), allowing for more advanced signatures to be specified if desired? For example, "\x89PNG\x0D\x0A\x1A\x0A" for the PNG family, "\x00\x01\x00\x00Standard Jet DB" for Microsoft Access MDB, "GIF8[79]a" for the GIF family, etc. The advantages are: for ASCII-only signatures (GIF), it's human-readable; for signatures containing binary/non-ASCII data (PNG), it's in a readily usable format (C/C++ strings, for example); and for optionally complex signatures, it's in a format ready to use with a PCRE-compliant parser. Pixeldomain (talk) 02:44, 17 November 2016 (UTC)
     Comment The offset could be identified in the PCRE expression. As an example, "(?s)^\x00\x01\x02.{39}ANSWERTOEVERYTHING" would look from the start of the file for \x00\x01\x02, then skip 39 bytes to offset 42 in the file, where it would look for "ANSWERTOEVERYTHING". More advanced expressions could look at bytes from the end of the file (ZIP archives have a central directory tacked on the end of the file), perform negative look-aheads, etc. Whilst there is extra complexity with PCRE, it does not have to be used, and the fall-back is a simple C/C++ string representing binary data. Pixeldomain (talk) 03:09, 17 November 2016 (UTC)
  •  Comment Also worth taking a look at is how the magic file of the "file" command stores file type signatures: https://github.com/file/file/tree/master/magic/Magdir Pixeldomain (talk) 03:32, 17 November 2016 (UTC)
     Comment Also take a look at the FIDO PRONOM database at https://raw.githubusercontent.com/openpreserve/fido/af3fc47791855ad7b955eb4272411113bfcff54d/fido/conf/formats-v88.xml which uses PCRE to define signatures for each file type. Pixeldomain (talk) 04:04, 17 November 2016 (UTC)
  • @Pixeldmain, cc0, YULdigitalpreservation, TomT0m, ArthurPSmith: what is the status of this proposal? Thryduulf (talk) 16:32, 22 April 2017 (UTC)
    • @Pixeldomain, CC0: (fixing pings) - obviously there was some debate here about the string format for this property. Of the proposals for format, I think the PCRE idea has a lot of merit. But I'd be ok with the original space-separated hexadecimal pairs too. No strong preference. ArthurPSmith (talk) 13:24, 24 April 2017 (UTC)
      •  Comment @ArthurPSmith: My current view is that magic numbers or patterns are not a good property for a file format. See the use of described at URL (P973) on GIF (Q2192) for an example of an alternative approach that I prefer for the identification and description of file formats. Pixeldomain (talk) 01:31, 26 April 2017 (UTC)
        • That means relying on a third party to provide the actual details of the format, but in some cases that may be all we have, so it's at least a good option to use. I still think a Wikidata property specifically for something like this is useful though. ArthurPSmith (talk) 13:37, 26 April 2017 (UTC)
        • An alternative I have also considered is detailing the structure of file formats on Wikidata by creating items for each data structure and field within each format. This moves the data from external sources into Wikidata, whilst allowing external references and sources (typically international standards, RFCs, etc.) to be used to describe each new item. Do you have any thoughts on this possible approach? Pixeldomain (talk) 00:58, 27 April 2017 (UTC)
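
For readers weighing the regular-expression option discussed in this thread, here is a rough Python sketch. It uses Python's built-in `re` module on bytes, which is not PCRE itself but accepts the constructs quoted in the comments above. The dictionary `SIGNATURES`, the function `identify`, and the test buffers are assumptions made for illustration. Note that a three-byte prefix followed by `.{n}` places the next literal at offset 3 + n, so reaching offset 42 takes n = 39.

```python
import re
from typing import List

# Patterns quoted in the discussion, written as Python bytes patterns.
SIGNATURES = {
    "PNG": rb"\x89PNG\x0D\x0A\x1A\x0A",
    "Microsoft Access MDB": rb"\x00\x01\x00\x00Standard Jet DB",
    "GIF": rb"GIF8[79]a",
}


def identify(data: bytes) -> List[str]:
    """Return the names of all signatures matching at the start of `data`."""
    # re.match anchors at offset 0; DOTALL lets "." match any byte.
    return [name for name, pattern in SIGNATURES.items()
            if re.match(pattern, data, re.DOTALL)]


# Offset encoded in the expression: 3-byte prefix, 39 arbitrary bytes,
# then the literal at offset 42.
OFFSET_42 = re.compile(rb"(?s)^\x00\x01\x02.{39}ANSWERTOEVERYTHING")

sample = b"\x00\x01\x02" + b"\xff" * 39 + b"ANSWERTOEVERYTHING"
print(identify(b"GIF89a" + bytes(10)))       # ['GIF']
print(bool(OFFSET_42.match(sample)))         # True
print(sample.index(b"ANSWERTOEVERYTHING"))   # 42
```

The plain hexadecimal form remains a special case of this approach, since a fixed byte sequence is a regular expression with no operators.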

WikiProject Informatics has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead. Are there additional opinions about whether we should implement this property? ChristianKl (talk) 20:32, 24 May 2017 (UTC)