Fun with File Formats (loc.gov)
215 points by todsacerdoti on Dec 14, 2021 | 25 comments


In addition to this resource and UK's equivalent (PRONOM/DROID, also mentioned in the linked post), I've found ArchiveTeam's wiki to be very useful for obscure file format details: http://fileformats.archiveteam.org/

The `shared-mime-info` database from freedesktop-dot-org is probably more worthy of contribution than these government-backed databases, at least in terms of number-of-end-users. New type definitions in their database will improve the entire Linux/BSD ecosystem (both desktop and server!) because it's consumed not only by fd.o's own `update-mime-database` utility but by many language-specific type-identification libraries too https://gitlab.freedesktop.org/xdg/shared-mime-info/-/blob/m...

…including (shameless plug) the new Ractor-based Ruby type library I've been working on in the wake of the `mimemagic` drama earlier this year: https://github.com/okeeblow/DistorteD/tree/NEW%E2%80%85SENSA...
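For reference, a new definition is just a small XML file dropped into the database. A minimal sketch (the `application/x-example` type, the glob, and the magic string here are all made up for illustration):

    <?xml version="1.0" encoding="UTF-8"?>
    <mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
      <mime-type type="application/x-example">
        <comment>Example data file</comment>
        <glob pattern="*.example"/>
        <magic priority="50">
          <match type="string" offset="0" value="EXMPL"/>
        </magic>
      </mime-type>
    </mime-info>

Drop that into ~/.local/share/mime/packages/ (or the system-wide equivalent), run `update-mime-database ~/.local/share/mime`, and every consumer of the database picks it up.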


> I've found ArchiveTeam's wiki to be very useful for obscure file format details: http://fileformats.archiveteam.org/

I took a serious look at trying to contribute to the Archive Team's "just fix the file format problem" effort a while ago. I ran into the problem that they require contributions be public domain/CC0. This rules out material derived from the vast corpus of FOSS, even the most permissively licensed stuff (MIT, BSD, etc.). Considering how rare public domain/CC0 is, essentially everything would need to be done from scratch.


You should reach out to them and let them know. I don't know what the alternative would be - are there better CC licenses out there? Dual licensing CC+MIT/BSD?

Pragmatically (though not legally speaking; IANAL) you could probably get by with summarizing and linking (especially to the Internet Archive).


There are two problems that reasonably explain the decision to only accept public-domain info: licensing and provenance.

"Licensing" is hard. The "Open Specifications Promise" [1], which covers a bunch of Microsoft-designed file formats, is merely a covenant not to sue.

"Provenance" is tricky. For example, much of the knowledge of the Apple iWork formats were derived by reverse-engineering the source programs and extracting protobuf definitions. Many open source projects have freely copied from each other, making detailed analysis tricky [2].

[1] https://en.wikipedia.org/wiki/Microsoft_Open_Specification_P...

[2] https://github.com/jazzband/tablib/issues/114


> There are two problems that reasonably explain the decision

Boy do I hate these kinds of after-the-fact, speculative "explanations": anachronistic rationalizations that retroactively (and non-authoritatively) invent a plausible "why" as a mental exercise to justify whatever already exists.

And if you know the personality and values of the progenitor of the whole thing, then you know that this better-safe-than-sorry deference to hair-splittingly nuanced IP hurdles (that in practice don't really exist) is not the MO.

Besides that, requiring public domain/CC0 doesn't solve either of those problems.


> "Provenance" is tricky. For example, much of the knowledge of the Apple iWork formats were derived by reverse-engineering the source programs and extracting protobuf definitions. Many open source projects have freely copied from each other, making detailed analysis tricky

I wonder if the Google LLC v. Oracle America, Inc. ruling helps this in any way. Defining a specification (either with public docs, RE, or empirically) of a closed format for the sake of reading the data or long-term archival strikes me as emphatically fair use. But I could also easily see it ending up as patent-troll bait.


There is also Apache Tika (https://tika.apache.org/), a file-format detection and content-extraction library.


Tika also stores its type definitions in the freedesktop format internally: https://gitbox.apache.org/repos/asf?p=tika.git;a=blob;f=tika...

However, its data is mostly focused on text/document types, for obvious reasons :)


Also the magic number database for guessing the format of a file:

https://www.darwinsys.com/file/
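If you want to poke at it from Python, a minimal sketch assuming the third-party python-magic bindings to libmagic (not stdlib):

    import magic  # third-party "python-magic" bindings to libmagic

    # human-readable description, same as `file mystery.bin`
    print(magic.from_file("mystery.bin"))

    # MIME type, same as `file --mime-type mystery.bin`
    print(magic.from_file("mystery.bin", mime=True))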


The magic database has its own problems, stemming from the rule language itself.


What kind of problems?


Primarily that it's very hard to express complex conditions.


Yeah, detection of file types is fun. I worked on several implementations of very precise type detection (for use in email & web proxies), and there are so many low-level details that affect that detection. The biggest fun is detecting the type of container-based file formats, like OLE-based office files (Office 95-2003) and zip-based formats (MS Office > 2003, OpenOffice/LibreOffice, jar/war/ear, ...), where you need to actually look inside the file to determine the type. And to add complexity, the implementation has to be very fast if it runs inside a web proxy where the type is detected for each request (at least twice, for the request and response stages, or more if you process an archive).

Extraction of content is a separate kind of fun ;-)
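For the zip-based ones, the trick is basically to peek at well-known member names. A rough sketch in Python (stdlib only, and the distinctions are much coarser than what a real detector needs):

    import zipfile

    def sniff_zip_container(path):
        # The outer type is just "zip"; the real type comes from the members inside.
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
            if "mimetype" in names:
                # ODF stores its own MIME type in a dedicated member
                return zf.read("mimetype").decode("ascii", "replace").strip()
            if "[Content_Types].xml" in names:
                # OOXML (docx/xlsx/pptx); telling those apart needs a further look inside
                return "application/vnd.openxmlformats-officedocument"
            if "META-INF/MANIFEST.MF" in names:
                return "application/java-archive"  # jar/war/ear
        return "application/zip"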


how does `file` stack up?


`file` was at least 10 times slower (this was ~10 years ago).

Second, it didn't work with compound objects, especially things like Word embedded in Excel. Similar for XML-based types, zip-based ones, …

Plus, its rules aren't very flexible. I used a Lisp-like language to describe detection patterns, which made it quite flexible in describing complex detection logic.


I wonder if there is some kind of way for software makers to provide all of this data to LOC? There are a lot of proprietary file formats in wide use. LOC has really done amazing work.


The software makers just need to write and release specifications in the public domain. Lotus did this many years ago with the 1-2-3 file formats: https://web.archive.org/web/20140713220544/http://www.schnar...

    The information contained in this document has been released into the
    public domain and is not considered to be confidential or proprietary
    although still the copyright and property of Lotus Development Corporation.
Microsoft specifications are now covered by a "covenant not to sue", not really public domain. Apple doesn't even bother writing specs for their office suite


I just came across the specification for a new image format. https://twitter.com/phoboslab/status/1472916981179367425

It would be great if all new formats were at least documented as well as this one.


I guess by not providing exact specs, companies want to push customers towards their own solutions. I also hate when you have the same software on multiple platforms but its proprietary file format isn't compatible across all of them.


Anyone else feel let down by all the existing container formats out there, and feel this itch to like, write their own? I know, I know, that's like the worst possible idea, xkcd #927 "now there are 15 standards", but like, what if?

What I'm looking for is basically something that ticks all the boxes of LOC's 7 Sustainability Factors: able to store arbitrary binary data, along with metadata, in a single file, and easy to pack/unpack, particularly in Python. I'm pretty well covered for image/audio/video formats, but none of those lend themselves to n-dim arrays.

Ogg, Matroska, and HDF5 are all container formats which can ostensibly do this, possibly even well. Ogg and Matroska are supposed to be able to support arbitrary data streams and metadata, but my search for tooling that actually does the operations I want has yet to bear fruit. HDF5 is insanely complex, and I have some concerns about performance and data corruption risk.

Tar is tempting but it does some really goofy things, like ending an archive with 1kB of zeros.

What actually looks most promising as an existing "format" is the RIFF family of chunked encoding. Dead simple, and Python's stdlib even has a reader. The only real downside is the 2^32 ~= 4 gigabyte size limit, though this can be overcome by using a "stream of chunks" model.
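The chunk primitive really is dead simple; here's a sketch of the read/write core (hand-rolled rather than using the stdlib reader, and ignoring the outer RIFF/LIST framing):

    import struct

    def write_chunk(fh, fourcc: bytes, payload: bytes):
        # chunk = 4-byte ID, little-endian uint32 size, payload, pad byte if odd
        fh.write(fourcc[:4].ljust(4, b" "))
        fh.write(struct.pack("<I", len(payload)))
        fh.write(payload)
        if len(payload) % 2:
            fh.write(b"\x00")

    def read_chunks(fh):
        while True:
            header = fh.read(8)
            if len(header) < 8:
                return
            fourcc, size = struct.unpack("<4sI", header)
            payload = fh.read(size)
            if size % 2:
                fh.read(1)  # skip the pad byte
            yield fourcc, payload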

Alternatively, a stream based on Msgpack, with some slight constraints, is very appealing.
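Something like the following, assuming the msgpack-python package (the record layout here is just an example):

    import msgpack  # third-party msgpack-python package

    # append records as individually packed maps...
    with open("records.bin", "wb") as f:
        packer = msgpack.Packer()
        for record in ({"name": "a", "data": b"\x00\x01"}, {"name": "b", "data": b"\x02"}):
            f.write(packer.pack(record))

    # ...and stream them back lazily
    with open("records.bin", "rb") as f:
        for record in msgpack.Unpacker(f, raw=False):
            print(record["name"], len(record["data"]))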

Sqlite3, also very tempting.

If I did write such a "format", it would definitely be built on the primitives of either RIFF, Msgpack, or Sqlite-as-application-file, so it'd be less of a "file format" and more of "existing format with some opinions about schema".
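The SQLite-as-application-file variant is probably the least work; a sketch with nothing but the stdlib (the table layout and the ".myformat" extension are made up):

    import json, sqlite3

    con = sqlite3.connect("dataset.myformat")
    con.execute("CREATE TABLE IF NOT EXISTS blobs (name TEXT PRIMARY KEY, meta TEXT, data BLOB)")
    con.execute(
        "INSERT OR REPLACE INTO blobs VALUES (?, ?, ?)",
        ("array0", json.dumps({"dtype": "float32", "shape": [4, 4]}), b"\x00" * 64),
    )
    con.commit()

    meta, data = con.execute(
        "SELECT meta, data FROM blobs WHERE name = ?", ("array0",)
    ).fetchone()
    print(json.loads(meta), len(data))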

Protip for anyone also thinking about making their own file format: make sure the API/format supports some sort of arbitrary metadata model! Apache Feather and Parquet are pretty recent, but they don't really support writing custom metadata; you kind of have to hack around it. IMHO this is a huge oversight, especially considering the internal metadata is literally just JSON-encoded fields.
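With pyarrow, the hack looks something like smuggling your own JSON into the schema's key/value metadata (a sketch; the key name is arbitrary):

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3]})
    custom = {b"user_meta": json.dumps({"units": "mm", "sensor": "A1"}).encode()}
    # preserve whatever metadata is already there, then add our own keys
    table = table.replace_schema_metadata({**(table.schema.metadata or {}), **custom})
    pq.write_table(table, "data.parquet")

    meta = pq.read_table("data.parquet").schema.metadata
    print(json.loads(meta[b"user_meta"]))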

Edit: actually just reread the follow-up to "moving away from hdf5" and it mentions the ASDF format. This looks really promising: a YAML metadata header with binary blocks. Normally it's not a good idea to start with UTF-8 text and later go binary, but if you know when to stop reading text, it's not the worst. Also, it's a name collision with the asdf version manager (and anything else named after rolling the left-hand home keys).

https://cyrille.rossant.net/should-you-use-hdf5/

https://towardsdatascience.com/saving-metadata-with-datafram...

https://en.m.wikipedia.org/wiki/Resource_Interchange_File_Fo...

https://msgpack.org/index.html

https://www.sqlite.org/appfileformat.html

https://asdf-standard.readthedocs.io/en/latest/


ZIP?

- Has readers/writers/APIs in almost every language under the sun

- Its data tree is a file tree, leaving it very simple from an abstraction standpoint

- No inherent arbitrary metadata mechanism, but plenty of ideas (in the wild, even) for storing arbitrary metadata in JSON or XML or YAML "files" side-by-side with the data files (see the sketch after this list)

- Used in many examples in the wild (DOCX, XLSX, ODT, JAR, etc.)
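A sketch of that side-by-side layout with nothing but the stdlib (the member names and the ".container" extension are invented):

    import json, zipfile

    with zipfile.ZipFile("bundle.container", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data/block0.bin", b"\x00" * 1024)
        zf.writestr("meta/block0.json", json.dumps({"dtype": "uint8", "shape": [32, 32]}))

    with zipfile.ZipFile("bundle.container") as zf:
        meta = json.loads(zf.read("meta/block0.json"))
        data = zf.read("data/block0.bin")
        print(meta, len(data))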


I've considered it. If I went that route, I'd probably use zstandard instead of PKWARE .zip (even though Python's stdlib has the zipfile interface built in). Zstd is waaay faster, the API is more modern and actually architected from the ground up, it supports parallel block processing, and it's got RFC 8878. It's quite nice. I'd still have to implement the object model on top of the block storage.

With low compression levels, you can actually tune things to get faster write times than raw bytes, depending on CPU speed vs IO speed. So that's cool.
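Basic usage with the zstandard bindings, for anyone curious (the level is the knob for that speed/size tradeoff):

    import zstandard as zstd  # third-party "zstandard" bindings

    cctx = zstd.ZstdCompressor(level=3)  # low levels favor write speed
    compressed = cctx.compress(b"raw block data" * 1000)

    dctx = zstd.ZstdDecompressor()
    original = dctx.decompress(compressed)
    print(len(compressed), len(original))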

http://facebook.github.io/zstd/


Okay, I've been playing around with ASDF a bit and I kinda really like it! It's quite performant: about 40% slower than reading/writing TIFFs, but it supports arbitrary metadata and array sizes. It supports in-band compression with zlib (ok), bzip2 (slow AF), and lz4 (quite fast, only about 50% slower than no compression, so in line with lz4 compressing the raw data).
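A minimal sketch of the write/read path with the asdf Python package (the tree contents are invented; I believe the compression knob is the all_array_compression keyword, but double-check against the docs):

    import asdf
    import numpy as np

    tree = {
        "meta": {"instrument": "demo", "exposure_s": 1.5},
        "image": np.zeros((512, 512), dtype="float32"),
    }
    asdf.AsdfFile(tree).write_to("frame.asdf", all_array_compression="lz4")

    with asdf.open("frame.asdf") as af:
        print(af.tree["meta"], af.tree["image"].shape)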

There are a couple of squicky things. For one, it looks like the extension model may allow RCE by deserializing arbitrary classes. It uses a URI/registration-based system, so by default it's limited in what types it can try to load, but it's plugin-based, so a library you import could theoretically expose you. That seems relatively low risk, though.

But overall this format is really cool. Supports linking across multiple ASDF files, which I can see getting messy, but this is an incredibly powerful technique and they do it in a sane way.



I mentioned that. The problem with Feather at the moment is readily writing metadata; I currently have to use a hacky workaround to write it.



