Wikidata:WikiProject Informatics/File formats

Subpages

Goals

Long term goals

  • For Wikidata to become the most comprehensive resource for information on file formats
  • For Wikipedia to extensively use data from Wikidata on articles relating to file formats and software

Short term goals

  • Define and reach agreement on an ontology for file formats
  • Advertise for and encourage new contributors to join the project, particularly from digital preservation organisations
  • Commence detailed definition of common file formats (PDF, JPEG, etc) to encourage development of an ontology, and to raise awareness of this project

Automatic lists

Useful queries

  • Return a list of all items for which there is a LocFDD identifier:
SELECT ?format ?formatLabel ?fdd
WHERE {
  ?format wdt:P3266 ?fdd .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}

  • Return the names of all file formats for which there is a PRONOM file format identifier in Wikidata:
SELECT ?format ?formatLabel ?puid
WHERE {
  ?format wdt:P2748 ?puid .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}

  • Find items sharing the same LocFDD identifier
SELECT * WHERE {
  {
    SELECT ?id (COUNT(?obj) AS ?count) (GROUP_CONCAT(?obj; SEPARATOR = " , ") AS ?items) WHERE { ?obj wdt:P3266 ?id. }
    GROUP BY ?id
  }
  FILTER(?count > 1)
}

  • Find items sharing the same PRONOM file format identifier
SELECT * WHERE {
  {
    SELECT ?id (COUNT(?obj) AS ?count) (GROUP_CONCAT(?obj; SEPARATOR = " , ") AS ?items) WHERE { ?obj wdt:P2748 ?id. }
    GROUP BY ?id
  }
  FILTER(?count > 1)
}

  • Return a list of software applications ranked in descending order by the number of writable file formats that have been listed in Wikidata:
#defaultView:BubbleChart
SELECT ?app ?appLabel (COUNT(?format) AS ?count)
WHERE {
  ?app (p:P31/ps:P31/wdt:P279*) wd:Q7397 .
  ?app wdt:P1073 ?format .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
GROUP BY ?app ?appLabel
ORDER BY DESC(?count)

  • Return a list of software applications ranked in descending order by the number of readable file formats that have been listed in Wikidata:
#defaultView:BubbleChart
SELECT ?app ?appLabel (COUNT(?format) AS ?count)
WHERE {
  ?app (p:P31/ps:P31/wdt:P279*) wd:Q7397 .
  ?app wdt:P1072 ?format .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
GROUP BY ?app ?appLabel
ORDER BY DESC(?count)

  • Return a list of items that have PUIDs, LoCFDD ids, and File Formats Wiki ids:
SELECT DISTINCT ?format ?formatLabel ?puid ?fdd ?solve
WHERE {
  ?format wdt:P2748 ?puid .
  ?format wdt:P3266 ?fdd .
  ?format wdt:P3381 ?solve .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}

Properties & structure of items

Ontology for an item which is an instance of (P31) file format family (Q26085352)

A file format family (Q26085352) is a group of file formats which are closely associated with each other, for example:

  • File formats are incremental versions of earlier file formats
  • File formats are variations of a base or common file format
Property | Expected values | Expected qualifier properties
instance of (P31) | file format family (Q26085352) | none
has part (P527) | one or more items, each an instance of (P31) file format (Q235557) | none
based on (P144) | one or more of the following: | none
developer (P178) | one or more of the following: | none
PRONOM file format identifier (P2748) | valid PRONOM database identifier where the PRONOM database entry is for a file format family (supertype/group of related file formats) | none
LocFDD ID (P3266) | valid Library of Congress Format Description Document identifier where the LoCFDD ID is for a file format family (supertype/group of related file formats) | none
File Format Wiki page ID (P3381) | wiki page identifier from the Just Solve the File Format Problem wiki | none
topic's main category (P910) | one item which is an instance of (P31) Wikimedia category (Q4167836) | none
Commons category (P373) | valid category name on Wikimedia Commons | none
Stack Exchange tag (P1482) | valid URL for the tag associated with the file format family on Stack Overflow | none
official website (P856) | valid URL for the official website of the developer/maintainer of the file format | none
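
As a rough consistency check against the table above, the following sketch lists file format families that have no has part (P527) statement pointing to a file format. It uses only identifiers already named on this page. It matches direct instance of (P31) statements only, so members typed with a subclass of file format (Q235557) will be reported as missing; treat the results as candidates for review rather than confirmed errors.

```sparql
# Families without any has part (P527) member that is a file format
SELECT ?family ?familyLabel
WHERE {
  ?family wdt:P31 wd:Q26085352 .
  FILTER NOT EXISTS {
    ?family wdt:P527 ?member .
    ?member wdt:P31 wd:Q235557 .
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
```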

Ontology for an item which is an instance of (P31) file format (Q235557)

A file format (Q235557) should generally be defined by a document (standard, specification or otherwise)

Property | Expected values | Expected qualifier properties
instance of (P31) | file format (Q235557) | none
part of (P361) | where applicable, one item which is an instance of (P31) file format family (Q26085352) | none
based on (P144) | where applicable, one or more of the following: | none
replaces (P1365) | where applicable, one or more items, each an instance of (P31) file format (Q235557) | none
replaced by (P1366) | where applicable, one or more items, each an instance of (P31) file format (Q235557) | none
described by source (P1343) | one or more of the following: | none
developer (P178) | one or more of the following: | none
media type (P1163) | where applicable, one or more Internet media types | none
Uniform Type Identifier (P3641) | where applicable, one or more Uniform Type Identifiers (see Apple developer documentation for an example source) | none
file extension (P1195) | one or more file extensions | none
programming language (P277) | where the file format contains computer code, one or more items, each an instance of (P31) computer language (Q629206) or of one of its subclasses (subclass of (P279)) | none
endianness (P3374) | one item which is an instance of (P31) endianness (Q339338) or of one of its subclasses (subclass of (P279)) | none
file format identification pattern (P4152) | one or more file format identification patterns |
PRONOM file format identifier (P2748) | valid PRONOM database identifier where the PRONOM database entry is for a file format family (supertype/group of related file formats) | none
LocFDD ID (P3266) | valid Library of Congress Format Description Document identifier where the LoCFDD ID is for a file format family (supertype/group of related file formats) | none
File Format Wiki page ID (P3381) | wiki page identifier from the Just Solve the File Format Problem wiki | none
topic's main category (P910) | one item which is an instance of (P31) Wikimedia category (Q4167836) | none
Commons category (P373) | valid category name on Wikimedia Commons | none
Stack Exchange tag (P1482) | valid URL for the tag associated with the file format on Stack Overflow | none
official website (P856) | valid URL for the official website of the developer/maintainer of the file format | none
URL (P2699) | valid URL of a resource related to the file format (for example, a schema which can be used to validate the correct formatting of a file) |
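
The part of (P361) row above mirrors the has part (P527) row in the family ontology, so the two can drift apart. A possible query for spotting such gaps is sketched below: it returns file formats that claim to be part of a family whose item does not list them back. It assumes direct instance of (P31) statements on both items and uses only identifiers from this page.

```sparql
# Formats that are part of (P361) a family which has no matching has part (P527)
SELECT ?format ?formatLabel ?family ?familyLabel
WHERE {
  ?format wdt:P31 wd:Q235557 ;
          wdt:P361 ?family .
  ?family wdt:P31 wd:Q26085352 .
  FILTER NOT EXISTS { ?family wdt:P527 ?format . }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
```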

Wikipedia Infoboxes

The template Infobox: File format could be rewritten in Lua to pull all of its values from Wikidata. Here is a first attempt at mapping the template parameters to Wikidata properties:

Infobox file format parameter | Wikidata property
name | label
icon | image (P18)
iconcaption | qualifier media legend (P2096) of the icon
iconsize | we could recreate this in the Lua template
screenshot | image (P18)
screenshot size | qualifier media legend (P2096) of the screenshot
noextcode | this is intended to avoid the use of <code> formatting of the extension property
extension | file extension (P1195)
nomimecode | this is intended to avoid the use of <code> formatting of the mime property
mime | media type (P1163)
type code | needs to be created
uniform_type | Uniform Type Identifier (P3641)
conforms_to | needs to be created
magic | file format identification pattern (P4152), needs to handle the qualifiers, especially encoding (P3294)
developer | developer (P178)
released | publication date (P577)
latest_release_version | software version identifier (P348)
latest_release_date | would likely need to be handled by a qualifier
genre | genre (P136) has a note that suggests main subject (P921) might be more appropriate for this use
container_for | has part (P527)
contained_by | This would be modeled by the containing item's has part (P527)
extended_from | based on (P144)
extended_to | This would be modeled by the containing item's based on (P144)
standard | described at URL (P973) or ISO standard (P503)
free | I'm not sure about this one
url | official website (P856)
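
To get a feel for how much of this mapping is already populated, a query along the following lines could pull a few of the mapped values per file format. It is only a sketch: the output variable names (?extension, ?mime, ?uniform_type, ?url) are chosen here to echo the infobox parameters, the LIMIT is arbitrary, and multi-valued properties are simply concatenated or sampled rather than ranked.

```sparql
# A few infobox parameters, one row per file format
SELECT ?format ?formatLabel
       (GROUP_CONCAT(DISTINCT ?ext; SEPARATOR = ", ") AS ?extension)
       (GROUP_CONCAT(DISTINCT ?mediaType; SEPARATOR = ", ") AS ?mime)
       (SAMPLE(?uti) AS ?uniform_type)
       (SAMPLE(?site) AS ?url)
WHERE {
  ?format wdt:P31 wd:Q235557 .
  OPTIONAL { ?format wdt:P1195 ?ext . }        # extension
  OPTIONAL { ?format wdt:P1163 ?mediaType . }  # mime
  OPTIONAL { ?format wdt:P3641 ?uti . }        # uniform_type
  OPTIONAL { ?format wdt:P856 ?site . }        # url
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
GROUP BY ?format ?formatLabel
LIMIT 50
```

A Lua module would read the same statements through the item's claims rather than through the query service; the query is only meant for checking coverage.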

Let me know what you think of this. Feedback welcome. YULdigitalpreservation (talk) 19:19, 8 November 2016 (UTC)

@YULdigitalpreservation: Merge "latest release date", "latest release version" and so on. First, the "latest" part should be handled by a preferred rank, so these parameters are unnecessary in Wikidata. Second, I think it's better to have a single item per release and to handle all of these with a single property, which could be has edition (P747). For example (the example is for software, but the same applies to a file format):
noextcode and nomimecode could be modeled using concept of no-value in Wikibase (Q19798647)? --Azertus (talk) 20:13, 5 December 2016 (UTC)
@YULdigitalpreservation: For "icon", shouldn't we use "icon image" (P154) instead?

I also think that "based on" is ambiguous and that its use for file formats should be made more precise. I guess it should be treated as an "is a restriction of" relationship (instances of format B, based on format A, are also valid instances of format A). Is that how this property has been used so far? For the different types of relationships between formats, couldn't the GDFR Format Model be a source of inspiration? --Dipsode87 (talk) 11:28, 20 January 2017 (UTC)

Properties for specification or standard

This section describes where information on the specification or standard of a file format can be recorded. The aim is to feed the |standard parameter of Template:Infobox file format (Q10986167); see above.

Note that this information is different from the official website (P856).

  • Easy cases, when a dedicated property exists:
  1. ISO (ISO standard (P503))
  2. RFC (RfC ID (P892))
  • Other cases:
  1. use described at URL (P973) if there is a link (URI) to the standard
  2. use described by source (P1343) if there is a Wikidata item for the standard
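
The four options above can be inspected together with a query such as the one below, which lists file formats that have at least one of the candidate statements. This is a sketch only: it matches direct instance of (P31) file format (Q235557) statements, the LIMIT is arbitrary, and an item with several values will appear on several rows.

```sparql
# File formats with at least one statement that could feed |standard
SELECT ?format ?formatLabel ?iso ?rfc ?describedAt ?source ?sourceLabel
WHERE {
  ?format wdt:P31 wd:Q235557 .
  OPTIONAL { ?format wdt:P503 ?iso . }          # ISO standard
  OPTIONAL { ?format wdt:P892 ?rfc . }          # RfC ID
  OPTIONAL { ?format wdt:P973 ?describedAt . }  # described at URL
  OPTIONAL { ?format wdt:P1343 ?source . }      # described by source
  FILTER(BOUND(?iso) || BOUND(?rfc) || BOUND(?describedAt) || BOUND(?source))
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
LIMIT 100
```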

Title | ID | Data type | Description | Examples | Inverse
ISO standard | P503 | External identifier | ISO standard: number of the ISO standard which normalizes the object | JPEG 2000 <ISO standard> 15444 | -
RfC ID | P892 | External identifier | Request for Comments: identifier for an item in Request for Comments, a publication of IETF and the Internet Society (without "RFC" prefix) | Opus <RfC ID> 6716 | -
RfC ID with qualifier publication date | P577 | Point in time | date of publication: date or point in time when a work was first published or released | Opus <RfC ID> 6716 <publication date> September 2012 | -
described at URL with qualifier genre | P136 | Item | genre: creative work's genre or an artist's field of work (P101). Use main subject (P921) to relate creative works to their topic | XML Schema <described at URL> http://www.w3.org/TR/xmlschema-0/ <genre> W3C Recommendation | -
described at URL with qualifier standards body | P1462 | Item | standards organization: organisation that published or maintains the standard | JPEG File Interchange Format, version 1.02 <described at URL> https://www.itu.int/rec/T-REC-T.851-200509-I/en <standards body> International Telecommunication Union | -

Feel free to add or modify the above table. Toto256 (talk) 10:25, 25 February 2017 (UTC)
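
The qualifier rows in the table above cannot be reached through the wdt: shortcut, because qualifiers live on the statement node. As an illustration, and assuming the described at URL (P973) plus standards body (P1462) pattern from the last row, a query could look like this:

```sparql
# described at URL (P973) statements qualified with a standards body (P1462)
SELECT ?format ?formatLabel ?url ?body ?bodyLabel
WHERE {
  ?format p:P973 ?statement .
  ?statement ps:P973 ?url ;
             pq:P1462 ?body .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
```

Swapping pq:P1462 for pq:P136 would give the genre-qualified variant from the fourth row.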

Wikipedia Navboxes