Property talk:P4152


Documentation

file format identification pattern
pattern or string which is used to identify a file as having a particular known format
Description: magic numbers used to incorporate file format metadata, in the form of a hexadecimal number coded as a string (usual encoding, "0" = 0 and "F" = 15, spaces ignored). Qualifiers can specify an offset and a padding value for this number.
Represents: magic number (Q284099)
Data type: String
Template parameter: Template:Infobox file format (Q10986167), magic number parameter
Domain:
According to this template: file format (Q235557)
According to statements in the property: file format (Q235557) or file format family (Q26085352)
When possible, data should only be stored as statements
Examples:
GIF (Q2192) → 47 49 46 38 39 61
eXtensible Business Reporting Language (Q959950) → 3C
FOAF (Q1389366) → 3C
Source: http://www.garykessler.net/library/file_sigs.html
Tracking (usage): Category:Pages using Wikidata property P4152 (Q38536992)
See also: offset (P4153), PRONOM file format ID (P2748)
Lists
Proposal discussion: Proposal discussion
Current uses:
Total: 9,496
Main statement: 9,490 (>99.9% of uses)
Qualifier: 5 (<0.1% of uses)
Reference: 1 (<0.1% of uses)
Type “file format (Q235557), file format family (Q26085352)”: item must contain property “instance of (P31)” with classes “file format (Q235557), file format family (Q26085352)” or their subclasses (defined using subclass of (P279)). (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P4152#Type Q235557, Q26085352, SPARQL
Required qualifier “encoding (P3294)”: this property should be used with the listed qualifier. (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P4152#mandatory qualifier, SPARQL
Scope is “as main value (Q54828448)”: the property must be used in the specified way only. (Help)
List of violations of this constraint: Database reports/Constraint violations/P4152#Scope, hourly updated report, SPARQL
This property is being used by:

Please notify projects that use this property before big changes (renaming, deletion, merge with another property, etc.)
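As an illustration of the value format described above, here is a minimal Python sketch of how a consumer might turn a P4152 string into bytes and compare it against a file at a given offset. The function names `parse_signature` and `matches` are hypothetical, the offset is assumed to come from an offset (P4153) qualifier, and the padding qualifier is not handled.

```python
def parse_signature(value: str) -> bytes:
    """Convert a hex string such as '47 49 46 38 39 61' into raw bytes."""
    # bytes.fromhex skips the spaces between byte pairs, matching the
    # "spaces ignored" rule in the property description.
    return bytes.fromhex(value)


def matches(data: bytes, signature: str, offset: int = 0) -> bool:
    """Check whether the signature occurs in data at exactly this offset."""
    pattern = parse_signature(signature)
    return data[offset:offset + len(pattern)] == pattern


GIF89A = "47 49 46 38 39 61"
print(parse_signature(GIF89A))               # b'GIF89a'
print(matches(b"GIF89a\x01\x00", GIF89A))    # True
print(matches(b"<svg/>", "3C"))              # True
print(matches(b"<rdf:RDF/>", "3C"))          # True
```

The last two calls show why a value such as 3C is weak on its own: any file beginning with "<" satisfies it.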

Value for XML-based formats

The file format signatures expressed by this property rely on identifying a binary hex pattern at a certain offset. This is inadequate for XML-based formats (e.g., SVG), whose accurate identification requires analysing the root element name and namespace.

Indeed, for some XML-based formats P4152 was filled with values from the TrID tool, most of which are not specific enough to identify the format.

I see two solutions:

  • 1: We consider that this property does not apply to XML-based formats, and we recommend using Property:P7510 as the best way of identifying an XML-based file format;
  • 2: We try to provide good-enough binary signatures for XML-based formats that take into account the root element name and namespace, the presence or absence of an XML header, the use of double or single quotes around namespaces, etc. Some guidance is needed to produce signatures that address these peculiarities.

Dipsode87 (talk) 10:09, 13 December 2022 (UTC)
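For reference, the root-element analysis mentioned in the comment above could look roughly like the following Python sketch. It uses the standard library's `xml.etree.ElementTree.iterparse`; the function name `root_name_and_namespace` and the sample documents are illustrative assumptions, not part of any existing tool.

```python
import io
import xml.etree.ElementTree as ET


def root_name_and_namespace(data: bytes) -> tuple[str, str]:
    """Return (local name, namespace URI) of the document's root element."""
    # The first "start" event reported by iterparse belongs to the root element.
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
        tag = elem.tag  # ElementTree spells qualified names as "{namespace}local"
        if tag.startswith("{"):
            namespace, local = tag[1:].split("}", 1)
            return local, namespace
        return tag, ""
    raise ValueError("no root element found")


# Same format, different bytes: XML header present or absent,
# double or single quotes, extra whitespace between attributes.
svg_a = b'<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg" width="1"/>'
svg_b = b"<svg   width='1'\n   xmlns='http://www.w3.org/2000/svg'></svg>"

print(root_name_and_namespace(svg_a))  # ('svg', 'http://www.w3.org/2000/svg')
print(root_name_and_namespace(svg_b))  # ('svg', 'http://www.w3.org/2000/svg')
```

Both documents give the same answer even though they share no fixed byte sequence at a fixed offset, which is the difficulty a binary signature under solution 2 would have to work around.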

Isn't file format identification pattern (P4152) only for magic numbers at the start of files?
I think that trying to come up with such magic numbers for XML-based formats is a futile endeavor because XML documents can generally contain an arbitrary amount of whitespace between attributes.
--Push-f (talk) 10:25, 13 December 2022 (UTC)
file format identification pattern (P4152) isn't described as being just for sequences at the beginning of the file @Push-f, though that is an understandable position: many magic numbers started out this way. Consider this query https://w.wiki/66va using something called a PRONOM internal signature. These introduce a regular-expression-like syntax and can be anchored relative to the beginning or end of a file, or at variable positions. Beet keeper (talk) 11:14, 13 December 2022 (UTC)
Ah ok I didn't know about PRONOM internal signature (Q35432091). That format is apparently described here. I am gonna opt out of this discussion because I am not familiar with PRONOM signatures. Sidenote: Can't we just use PRONOM file format ID (P2748) instead? --Push-f (talk) 11:50, 13 December 2022 (UTC)
Good question @Dipsode87. Continuing a previous discussion, I can envision some rules here to reduce false negatives and false positives: using "3C 3F 78 6D 6C" (<?xml) as a beginning-of-file anchor, followed by a series of variably positioned sequences describing namespaces. For example, if an XML file format of note typically used three namespaces, something like: "3C3F786D6C{m-n}<variable-seq-1>{m-n}<variable-seq-2>{m-n}<variable-seq-3>". This can be achieved a few ways, e.g. with (seq1|seq2|seq3) style regex inline, or for DROID/PRONOM using the "variable" offset. Wikidata doesn't have PRONOM's variable offset implemented (https://w.wiki/66vd). I wonder if something like this can be combined with your second suggestion? Beet keeper (talk) 11:33, 13 December 2022 (UTC)
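A rough Python sketch of that idea, using the `re` module on bytes rather than PRONOM syntax. The helper name `build_pattern`, the gap bound of 256 bytes, and the choice of the SVG and XLink namespace URIs as the "variable sequences" are all assumptions made for illustration.

```python
import re


def build_pattern(anchor_hex: str, sequences: list[bytes], max_gap: int) -> re.Pattern[bytes]:
    """BOF anchor followed by each sequence in order, separated by bounded gaps."""
    parts = [re.escape(bytes.fromhex(anchor_hex))]
    for seq in sequences:
        # ".{0,max_gap}?" stands in for the {m-n} wildcard gap in the comment above.
        parts.append(b".{0,%d}?" % max_gap)
        parts.append(re.escape(seq))
    # \A anchors at the beginning of the file; DOTALL lets a gap span newlines.
    return re.compile(rb"\A" + b"".join(parts), re.DOTALL)


pattern = build_pattern(
    "3C 3F 78 6D 6C",  # "<?xml", the anchor quoted in the discussion
    [b"http://www.w3.org/2000/svg", b"http://www.w3.org/1999/xlink"],
    max_gap=256,
)

doc = (b'<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg"\n'
       b'     xmlns:xlink="http://www.w3.org/1999/xlink"/>')
no_header = b'<svg xmlns="http://www.w3.org/2000/svg"/>'

print(bool(pattern.match(doc)))        # True
print(bool(pattern.match(no_header)))  # False
```

This ordered form misses files without an XML header and files that declare the namespaces in a different order, so it would need the alternation or variable-offset variants mentioned above to be useful in practice.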
@Beet keeper I suppose creating a new subclass of Q1156822 for a variable offset is a possibility. But that would mean that we look anywhere in the file, and wrapped snippets would match (e.g., METS files would be identified as Q624610 or Q2108820 XML data), wouldn't they? Shouldn't we specify an offset of 0-1024 from BOF instead? Dipsode87 (talk) 15:10, 13 December 2022 (UTC)
@Dipsode87 excellent point. Plus one on that! Beet keeper (talk) 08:30, 2 January 2023 (UTC)
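To show the difference between searching anywhere and searching within 0-1024 bytes of BOF, here is a small Python sketch. The function name `found_near_bof` and the sample data are hypothetical; the 2000 bytes of padding simply stand in for whatever a wrapper format puts before the embedded content.

```python
def found_near_bof(data: bytes, sequence: bytes, max_offset: int = 1024) -> bool:
    """True if `sequence` starts at an offset between 0 and max_offset from BOF."""
    # bytes.find only reports matches that lie fully inside data[0:end],
    # so the end bound is the last allowed start offset plus the sequence length.
    return data.find(sequence, 0, max_offset + len(sequence)) != -1


SVG_NS = b"http://www.w3.org/2000/svg"
plain_svg = b'<?xml version="1.0"?><svg xmlns="' + SVG_NS + b'"/>'
wrapped = (b'<?xml version="1.0"?><mets xmlns="http://www.loc.gov/METS/">'
           + b" " * 2000
           + b'<svg xmlns="' + SVG_NS + b'"/></mets>')

print(found_near_bof(plain_svg, SVG_NS))  # True
print(SVG_NS in wrapped)                  # True  (search anywhere in the file)
print(found_near_bof(wrapped, SVG_NS))    # False (search limited to 0-1024)
```

A wrapped snippet that happens to sit within the first 1024 bytes would still match, so the window reduces false positives rather than eliminating them.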
A further question: can we collect good examples of some of the formats most of interest to folk? Beet keeper (talk) 11:33, 13 December 2022 (UTC)
I shared this discussion with the PRONOM team last Friday and it was received positively. We also learned that, more generally, standardizing "patterns" for writing file format signatures is of interest and in the works. The next step for this XML work looks like this: we write up a summary of things to think about, here or perhaps in an online doc, then share some of those initial points with the team at the next meetup and discuss them. Beet keeper (talk) 08:02, 9 January 2023 (UTC)
@Dipsode87 I will bring this to the discussion at tomorrow's PRONOM meetup. Please add comments as you see fit: https://docs.google.com/document/d/1jMXcbFHVtw8mdNNJNZED7DweXhDvOQ5K2SW_ZVfNJi0/edit#heading=h.5vt8lej7674m Beet keeper (talk) 13:10, 15 February 2023 (UTC)