Wikidata talk:WikiProject Informatics/Structures/File formats

Nice job, and I am really interested in joining the project! Nevertheless, I am not convinced by the transformation of every "file format" instance into a "file format family" instance. These "file formats" are almost always versions of a specific file format (indeed, TIFF is a file format, not a file format family!). And by doing so, you create Wikidata items with no associated Wikipedia article, which is not good practice. Instead, you can easily handle assertions about a specific version of a file format by using the qualifier "software version" to state the version for which the assertion is true. I would prefer this solution. – The preceding unsigned comment was added by Dipsode87 (talk • contribs).

Hi, I feel like Dipsode87. As stated in Notability, an item should be linked to a valid sitelink. Using the concept of "file format family" for things like TIFF (Q215106) makes the sitelinks weaker, especially if we want to use the information stated in Wikidata in infoboxes (the "real" information will be in the related "file format"). Toto256 (talk) 12:31, 15 October 2016 (UTC)

@Dipsode87, Toto256: I only just saw these comments, so this is the first opportunity to respond. I'm interested in suggestions for improving how file formats and "groups of file formats" are differentiated in Wikidata. If we take Portable Document Format (Q42332) for example, there is one Wikipedia article covering all versions of the PDF file format. This seems correct as each version of a file format is not necessarily notable enough to warrant a separate article in Wikipedia. However, I would argue that there is a structural need to have a different Wikidata item for each version of a file format. Some reasons include:

A particular version of a file format (such as Portable Document Format, version 1.7 (Q26085317)) may be defined by a different specification, a different Internet Media Type and a different file extension. If all versions of a file format were handled by a single Wikidata item, properties such as described by source (P1343) could be ambiguous to use, as the qualifying property edition number (P393) or software version identifier (P348) could be interpreted to refer to either the file format version the statement applies to, or the version of the specification referred to.
If in the future someone wanted to create a list in Wikipedia of the specific versions of a file format, it would be easier to query Wikidata if each file format version had a unique Wikidata item associated. This could extend to writing queries which count the number of specific file format versions (think OOXML with dozens of specific versions), list all specific file formats with an extension starting with "p", etc.
Different versions of software may be capable of reading and writing different versions of a file format. Some software may fully or partially comply with the required file format specification. In the case of OOXML implementations by Microsoft, there is a ~1000 page erratum explaining how Microsoft Office products don't fully meet the required standard. The qualifying properties has characteristic (P1552) and corrigendum / erratum (P2507) would be useful to denote how well a particular version of software can read or write a particular version of a file format. If all versions of a file format were represented by a single Wikidata item, it would make readable file format (P1072) and writable file format (P1073) harder to use, and possibly prevent more detailed has characteristic (P1552) and corrigendum / erratum (P2507) qualifiers from being used.
External tools or websites would be able to easily and uniquely identify specific file formats in a way not currently possible. I am thinking along the lines of a "file" command which returns a Wikidata item ID for the file being queried. If all versions of a file format were represented by a single Wikidata item, unique references would look something like "Q42332 v1.7" instead of "Q26085317".
Notability has three criteria, of which (2--"clearly identifiable conceptual or material entity") and (3--"fulfills some structural need") are on the list and support the creation of a Wikidata item for each version of a file format. The intention is that separate Wikidata items for each file format version make it clearer, less ambiguous and easier for Wikipedia articles and other external websites, tools, etc. to understand file formats, the related specifications and which software can be used.

Any thoughts or feedback on the above? Pixeldomain (talk) 07:35, 27 October 2016 (UTC)
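The "file" command idea above can be sketched roughly as follows. This is only a minimal illustration, not an existing tool: the function name `identify` and the lookup table are hypothetical, the two item IDs are the ones quoted in this thread, and the sketch assumes the version can be read from the `%PDF-x.y` header at the start of the file (a real tool would need many more signatures and checks).

```python
import re
from typing import Optional

# Item IDs quoted in this discussion; the table is illustrative, not exhaustive.
PDF_GENERAL_QID = "Q42332"               # Portable Document Format (all versions)
PDF_VERSION_QIDS = {"1.7": "Q26085317"}  # Portable Document Format, version 1.7

# A PDF file starts with a header such as b"%PDF-1.7".
_PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def identify(data: bytes) -> Optional[str]:
    """Return a Wikidata item ID for the file content, or None if unrecognised."""
    match = _PDF_HEADER.match(data[:16])
    if match is None:
        return None
    version = match.group(1).decode("ascii")
    # Prefer the version-specific item; fall back to the general PDF item.
    return PDF_VERSION_QIDS.get(version, PDF_GENERAL_QID)


if __name__ == "__main__":
    print(identify(b"%PDF-1.7\n%rest of file"))
```

With one item per version, the result is a single identifier; with one item for all versions, the caller would have to carry the version string alongside the item ID.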

@Dipsode87, Toto256, Pixeldomain: Thank you for raising this issue for discussion. I hold the opinion that it is important to have separate items for different versions of file formats. If someone wanted to use data from Wikidata to describe software, the precision of having an individual URI (which we get from the individual item) for each version of a file format would be very helpful. I agree with Pixeldomain's point that the notability criteria allow for the approach of creating individual items for different versions of file formats. YULdigitalpreservation (talk) 19:01, 1 November 2016 (UTC)

Adding to the discussion of structural/notability reasons for having file format versions included as data items in Wikidata: research (which I led) at Archives New Zealand indicates that software packages can define format version variants when their developers interpret the format version standard in ways that differ from either how the standard was written or how other software developers interpreted the standard. This is important to people working in digital preservation as we need to be able to identify exactly how files are structured in order to ascertain both:

what software to use to interact with the content in files (e.g. via emulation in the future) and how it will present content to users (which may differ depending on the format version variant)
how to migrate content from the existing format to a new reusable format (understanding the structure of the files is essential to creating/identifying effective migration tools)

If we don't know how files are structured, we can't do either of those. With the scale of the digital preservation challenge (trillions of files to be preserved) we need to automate solutions to these problems. In order to automate identifying these format version variants/implementations, the first step is to uniquely and persistently identify them and document them (e.g. in Wikidata).

For these reasons I would like, in addition to endorsing the previous arguments, to propose that the most useful solution in Wikidata would be to identify not just file format versions, but also file format version variants (or file format version "implementations"). For example, Open Document Spreadsheet version 1.0 files as created by Open Office v'x' or Open Document Spreadsheet version 1.0 files as created by Microsoft Office 2007. See the "variants" section [1] for evidence of the difference between those two variants and its significance. Cnaue (talk) 20:18, 3 November 2016 (UTC)

@Cnaue: Thanks for the insight and link to the research paper. It makes an interesting read, and presents a significant challenge of how to capture information about software version specific implementations of file format standards, and software version specific rendering of the same file. The difficult part for Wikidata will likely be ensuring that all information is referenced properly--a challenge when a lot of the information is currently in the format of comments on websites (Stackoverflow, etc) and various bug trackers, wikis, blogs, etc. Are you aware of any ontologies or other prior work which could be used to inform Wikidata's approach to capturing this information? Pixeldomain (talk) 03:31, 4 November 2016 (UTC)

@Pixeldomain: Here is a list of ontologies and vocabularies that could inform our work:

I'll continue to add more as I learn about them. YULdigitalpreservation (talk) 17:50, 8 December 2016 (UTC)

There was a recent blog post, "Wikidata as a digital preservation knowledgebase", which provides further information on current work and aspirations of the digital preservation community to use Wikidata. A good read! Pixeldomain (talk) 03:31, 4 November 2016 (UTC)

A critique of Wikipedia articles as concepts: "Problematizing and Addressing the Article as Concept Assumption in Wikipedia". YULdigitalpreservation (talk) 15:21, 21 November 2016 (UTC)

Have people seen this project, "Minimal metadata schemas for science software and code, in JSON and XML"? It looks like there are a number of schemas we have not yet discussed. YULdigitalpreservation (talk) 12:12, 3 May 2017 (UTC)

Contains code

@Pixeldomain: I think that this diff, where the "programming language" property is added, is incorrect. I think the correct property should be has part, as "has part" could mean "an instance of HTML contains JavaScript", for example. This assumes that a file format represents the class of all files that conform to this format. author TomT0m / talk page 11:23, 11 November 2016 (UTC)

@TomT0m: How would we differentiate between "part of" being used to state "this file format contains a header section" or "this ZIP file format contains a central directory", and "part of" being used to also state "this file format contains JavaScript code" or "this file format contains DEFLATE compressed data"? Is this differentiation necessary? I was envisaging that someone could potentially build a graph using the "part of" property, showing nested data structures within a file format. Pixeldomain (talk) 03:30, 14 November 2016 (UTC)
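A rough sketch of the graph idea above, to make the question concrete. Everything here is illustrative: the `HAS_PART` table, the `nested_parts` helper and especially the "structure"/"content" tag are hypothetical stand-ins for whatever qualifier or separate property would carry the distinction, and the sketch assumes the part graph has no cycles.

```python
from typing import Dict, List, Tuple

# (part, kind) pairs per item. The "kind" tag is hypothetical: it marks whether
# a part is structural or embedded content, which plain "has part" cannot do.
HAS_PART: Dict[str, List[Tuple[str, str]]] = {
    "ZIP": [
        ("central directory", "structure"),
        ("DEFLATE compressed data", "content"),
    ],
    "central directory": [("file header", "structure")],
    "HTML": [("JavaScript code", "content")],
}


def nested_parts(item: str, depth: int = 0) -> List[str]:
    """List the parts of an item depth-first, indented by nesting level."""
    lines: List[str] = []
    for part, kind in HAS_PART.get(item, []):
        lines.append("  " * depth + f"{part} [{kind}]")
        lines.extend(nested_parts(part, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(nested_parts("ZIP")))
```

Without some tag of this sort, the nested view would still work, but structural sections and embedded payloads would be indistinguishable in the output.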

Files (formats) with a distinctive filename

There exists a class of file formats/files (the line blurs) which have one or several distinctive filenames (e.g. Windows thumbnail cache (Q930281) or FILE_ID.DIZ (Q1952708)). I have modeled their filename using native label (P1705). I believe it is quite a good fit, but maybe a new property would be in order? It would only be used on a handful of items, but it might be better than someone later down the line taking offense at the use of native label. title (P1476) might be an option as well. The advantage of a new specific property would be that we could safely use it on e.g. robots.txt (Q80776), to which I'm currently hesitant to add a native label statement.

Thoughts? See here for a selection of these kinds of files. --Azertus (talk) 19:25, 16 December 2016 (UTC)

How to express relations between formats

I wonder about best practice for expressing relations between file formats. For instance, an OpenDocument Text file format family (Q29167477) file is a ZIP (Q136218), and MARCXML (Q28770433) uses Extensible Markup Language (Q2115) and MARC standards (Q722609). Should I just use based on (P144) for all cases and differentiate later if needed? -- JakobVoss (talk) 16:27, 25 October 2017 (UTC)

I would use based on (P144) in this case, considering the aliases of the property like 'derived from' etc. John Samuel 20:30, 25 October 2017 (UTC)

File formats seen as the class of the files meeting the file format specification

A proposal for an ontology of file formats and file format families: file formats are a way to discriminate between files. Each format defines a class of files; for example, from the XML specifications we can create an algorithm which will answer the question « is this specific file an XML file? ». The file format specification can then be seen as a predicate to decide what is or is not XML, the same way that, given the definition of a screwdriver and an object, we can answer the question « is this a screwdriver? ».

With this logic, we get both:

⟨ XML ⟩ subclass of (P279) ⟨ file ⟩
⟨ SVG ⟩ subclass of (P279) ⟨ XML ⟩

Then we have the concept of a file format itself, which does not fit in this hierarchy: a zip file instance is not a file format itself. The class of all « file formats » is then best modeled as a metaclass:

⟨ SVG ⟩ instance of (P31) ⟨ file format ⟩
⟨ XML ⟩ instance of (P31) ⟨ file format ⟩
⟨ zip ⟩ instance of (P31) ⟨ file format ⟩
…

When it comes to families like XML dialects, where the file types are hierarchically ordered by subclass (all SVG files are XML files), we can go a little bit further and express that XML is also a file format family: some XML files are instances of a more specific file format, for example SVG is an « XML dialect ». We can express this by:

⟨ XML ⟩ instance of (P31) ⟨ file format family (Q26085352) ⟩
⟨ SVG ⟩ instance of (P31) ⟨ XML dialect (Q57695955) ⟩

It’s possible to generalize the « XML dialect » class a bit by creating a class for all file formats which are a dialect of a more general file format: « dialect of a file format ». Indeed, the statement « a(n instance of) XML dialect like SVG is a(n instance of) dialect » should not be very shocking. We get:

⟨ XML dialect ⟩ subclass of (P279) ⟨ dialect of a file format ⟩
and then we can express that there is a « subclass of » relationship between a dialect (such as SVG) and its more general file type (XML in this case) with the property « metasubclass of »:
⟨ dialect of a file format (Q57696248) ⟩ metasubclass of (P2445) ⟨ file format family (Q26085352) ⟩
which expresses that a file format dialect instance (like SVG) should be a subclass of a file format family instance (like XML), which is the starting point of this discussion (SVG is a subclass of XML, as any SVG file is an XML file).

What I have not figured out yet is the relationship, if any, between « file format » and « file format family » in this framework. For example, we could define a class for all « compressed files », which would be an instance of « file format family » but not an instance of « file format » itself. ZIP would be an instance of a class « compressed file format », which would itself be a subclass of « file format »; but « compressed file format » would still be a metasubclass of « file format family », as any zip file is also an instance of « compressed file ». As a result, there is always a « subclass of » relationship between a « compressed file format » instance and a « file format family » instance (« compressed file »).

It seems that intuitions like « XML is a file format » are respected here, while XML is also allowed to be a file format family, under the initial assumption, which might be the most difficult part to grasp. author TomT0m / talk page 12:12, 24 October 2018 (UTC)
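The proposal above can be checked mechanically on a toy set of statements. The following is only a sketch under stated assumptions: the triples use plain labels instead of item IDs, the helper names (`objects`, `superclasses`, `classes_of`, `satisfies_metasubclass`) are made up for illustration, and « metasubclass of » is read as described in the comment above (an instance of the dialect metaclass should be a subclass of some instance of the family metaclass).

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Statements from the proposal, written with labels instead of item IDs.
TRIPLES: Set[Triple] = {
    ("XML", "subclass of", "file"),
    ("SVG", "subclass of", "XML"),
    ("SVG", "instance of", "file format"),
    ("XML", "instance of", "file format family"),
    ("SVG", "instance of", "XML dialect"),
    ("XML dialect", "subclass of", "dialect of a file format"),
    ("dialect of a file format", "metasubclass of", "file format family"),
}


def objects(subject: str, predicate: str) -> Set[str]:
    """All objects o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def superclasses(cls: str) -> Set[str]:
    """cls together with everything reachable through 'subclass of'."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in objects(stack.pop(), "subclass of"):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classes_of(item: str) -> Set[str]:
    """Every class the item is an instance of, directly or via superclasses."""
    result: Set[str] = set()
    for cls in objects(item, "instance of"):
        result |= superclasses(cls)
    return result


def satisfies_metasubclass(item: str) -> bool:
    """Check that item is a subclass of some instance of each target metaclass."""
    parents = superclasses(item) - {item}
    for meta in classes_of(item):
        for target in objects(meta, "metasubclass of"):
            if not any(target in classes_of(p) for p in parents):
                return False
    return True


if __name__ == "__main__":
    print(satisfies_metasubclass("SVG"))
```

Here SVG passes because it is a subclass of XML, and XML is an instance of « file format family ». An item declared as an « XML dialect » without any « subclass of » statement would fail the check, which is the kind of gap such a constraint could flag.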

This makes sense from a theoretical point of view, but the practical point of view should start from how file formats and file format families are currently modeled in Wikidata. It's unclear to me how your proposal relates to the current data model. I asked about relations between file formats in my question above; the current relationship is based on (P144). Do you suggest replacing this usage? Could you also have a look at the examples I gave in my question: OpenDocument Text file format family (Q29167477), ZIP (Q136218), MARCXML (Q28770433), Extensible Markup Language (Q2115), and MARC standards (Q722609)? How do they fit in your proposal? -- JakobVoss (talk) 19:46, 4 November 2018 (UTC)

Just one thing for now about based on (P144): please read its description. It’s suitable for artistic works that inspired other artistic works, translations into other languages … Really, really different from a file format relationship. Moreover, there are relationships between file formats that are suitable for based on (P144) but are far from the relationship between XML and SVG, say Interchange File Format (Q1569639) and its incompatible variant (according to the enwp article), found thanks to Google: RIFF. So yes, I think current practice is not good enough and a lot of those statements should be replaced, if not by subclass of, then by another property. That’s all for now, but I’ll look at your examples. author TomT0m / talk page 20:49, 4 November 2018 (UTC)

How to express serializations?

For instance, JSON-LD (Q6108942), N-Triples (Q44044), and Turtle (Q114409) are serializations of Resource Description Framework (Q54872), and MARCXML (Q28770433) is a serialization of MARC standards (Q722609). I've seen notation (P913) and has use (P366), but this is not documented at Wikidata:WikiProject_Informatics/File_formats. Is there a consensus yet? -- JakobVoss (talk) 19:56, 4 November 2018 (UTC)