There are plenty of existing compression file formats with very nice tools and performance. You don't just go out and create a new one for no reason, right? I am a graphics programmer at heart and do all kinds of stuff where I load assets, like, a lot, and want to bundle them into a resource file of some kind. In the past we have had all kinds of weird formats and even used zip files when we could; why not?
So far everything has worked, but one small thing has been bugging me and I finally set out to do something about it: the storage formats have not been very cooperative with multi-threading, which these days virtually every CPU supports in one way or another. So I decided to craft a format that addresses the issue once and for all (not really, but this is my first attempt, so allow me some room for experimentation and making the errors you have to make to get a better understanding of the problem for the Next Ultimate Version).
The basic idea is that large files are broken into smaller blocks, which are compressed individually, making compression and decompression highly parallel. The next improvement is that all small files are grouped together into blocks of a specified size so that the compression ratio improves a little bit. In classic file formats this is called "a solid archive", which means everything is compressed as one big block. The difference here is that we break the data into blocks which can be individually addressed, making random access more practical. The decoder can, for example, cache blocks so that already-decompressed data stays available for a while. Then throw in an eviction policy like LRU (Least Recently Used) to avoid filling memory over time with junk.
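To make the caching idea concrete, here is a minimal sketch of a decoder-side block cache with LRU eviction; the class, its names and the assumption that each block is inserted only once are mine for illustration, not the actual snitch decoder.

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// caches decompressed blocks by block index; the least recently used block
// is evicted when the cache is full
class BlockCache
{
    using Block = std::vector<std::uint8_t>;
    using Entry = std::pair<std::uint32_t, Block>;

    std::list<Entry> m_lru; // front = most recently used
    std::unordered_map<std::uint32_t, std::list<Entry>::iterator> m_index;
    std::size_t m_capacity; // maximum number of cached decompressed blocks

public:
    explicit BlockCache(std::size_t capacity) : m_capacity(capacity) {}

    // returns the cached block, or nullptr if the caller has to decompress it
    const Block* find(std::uint32_t blockIndex)
    {
        auto it = m_index.find(blockIndex);
        if (it == m_index.end())
            return nullptr;
        m_lru.splice(m_lru.begin(), m_lru, it->second); // mark as recently used
        return &m_lru.front().second;
    }

    void insert(std::uint32_t blockIndex, Block block)
    {
        m_lru.emplace_front(blockIndex, std::move(block));
        m_index[blockIndex] = m_lru.begin();
        if (m_lru.size() > m_capacity)
        {
            m_index.erase(m_lru.back().first); // evict the least recently used block
            m_lru.pop_back();
        }
    }
};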
These ideas can still be refined more but that's where we are at currently.
I used the MANGO multimedia library and its compression API, so the compression algorithm can easily be changed.. so easily that I made it a command line option in the tool when compressing a folder. It's quite fun to just pit different compression algorithms with different compression levels against each other with different data sets. I spent a lot of time just fooling around with it while developing. :)
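To give an idea of how that could be wired up, here is a hypothetical sketch of mapping the command line method name to a codec; the dispatch below is illustrative only, not MANGO's actual compression API, and the lambdas are stand-ins for the real codec calls.

#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// each codec from the compression library is wrapped behind a common signature
using CompressFunc = std::function<Buffer(const Buffer& source, int level)>;

CompressFunc selectCompressor(const std::string& name)
{
    static const std::map<std::string, CompressFunc> codecs =
    {
        { "zstd",  [](const Buffer& src, int level) { return src; /* call the real zstd here  */ } },
        { "lzma2", [](const Buffer& src, int level) { return src; /* call the real lzma2 here */ } },
        { "miniz", [](const Buffer& src, int level) { return src; /* call the real miniz here */ } },
    };

    auto it = codecs.find(name);
    if (it == codecs.end())
        throw std::runtime_error("unknown compression method: " + name);
    return it->second;
}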
Let's say I have 9981 files, totaling 4927.9 MB. If I compress the data with zip (the command line utility) on an i7 4770 (4-core) processor, the result is a 942 MB zip file in 7 minutes and 52 seconds. In fact, the process takes nearly identical time on an i9 7900X, which is a 10-core processor. That was the problem I set out to solve.
Things get more interesting when using 7zip with the "-m0=lzma2 -mx9 -ms=on" command line parameters, which allow 7z to use multiple threads to compress the files: 568 MB in 3 minutes 7 seconds (using a total of 48 minutes of CPU time, holy balls!). Keep in mind that 7z only does this trick with its custom LZMA2 encoder, which is designed to be multi-threaded; the other compression methods are single-threaded.
Snitch with different parameters:
method:    lzma2-10    bzip2-10    zstd-5      miniz-6     zstd-1      lzfse-5
size:      667 MB      830 MB      827 MB      944 MB      918 MB      943 MB
speed:     56 MB/s     260 MB/s    562 MB/s    624 MB/s    4038 MB/s   979 MB/s
real:      87.8 s      18.9 s      8.8 s       7.9 s       1.2 s       5.0 s
user:      1682.3 s    367.2 s     169.8 s     154.7 s     18.5 s      90.1 s
sys:       18.7 s      4.9 s       2.3 s       1.7 s       3.7 s       3.6 s
The compression speeds are pretty neat, even LZMA2 at the highest compression setting. The file is 20% larger than 7z's, but that is because we break the data into blocks, so it is not a totally solid archive. It's a trade-off to squeeze more juice out of the CPU and make the resulting archive random-accessible; that is a feature we absolutely want, since this is for storing assets for rendering and other real-time uses. Increasing the block size also increases the compression ratio, and I have tuned the parameters to my own taste with highly sophisticated This-Feels-About-Right metrics.
The compressor scales super nicely with the number of CPU cores available. I tested on a 32-core Xeon and it was approximately 29 times faster at compressing compared to using only a single thread. The next test was on a 64-core Xeon server and the speed-up was "only" 40x; it got stuck there because the poor server couldn't write out the results fast enough, so I disabled the writes just for testing's sake and it got up to 48x, but got stuck again: we couldn't read data in fast enough. Reading from /dev/zero got the full 64x speedup on the 64-core server. Bottlenecked by I/O, finally! :)
Here's what the output from the tool looks like when it is working:
Compressing 6839 files (236.4 MB)
Compressed: 236.4 MB --> 77.1 MB (32.6%) in 0.5 seconds (miniz-1, 479 MB/s)
Index: 644 KB
I wrote this output for debugging purposes originally: '.' means a compression task, '+' means a write to the storage device, and 's' (not seen above) means the block was stored as-is because compression was ineffective (so access to the block when decoding is cheaper). The output doubles as a progress bar, so it's a keeper.
At first the memory consumption blew up in my face when compressing: I just compressed as fast as I could and flushed the blocks to the storage device through a serialization queue. When the compressor was much faster than the device, the compressed blocks kept piling up and memory usage went through the roof. This was on a 10-core machine and I was compressing something like 200 GB of files as a stress test.
The solution was simple and effective: keep the thread pool slot reserved until the write has completed; there is zero point in using CPU until the compressed data has been flushed out. After that small surgical modification the compressor debug output started to look like the output above. Before the change it was: ……………..+...…++………………………….++++++++++++++++++++++++++++++++++, which was horrible, as I just described.
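Here is a simplified, self-contained sketch of that idea; it is not the actual snitch code (the real tool routes writes through the serialization queue and records block offsets in the index), just an illustration of the back-pressure.

#include <cstdint>
#include <cstdio>
#include <mutex>
#include <vector>

std::mutex g_writeMutex; // stands in for the serialization queue

// runs as a thread pool task; the slot is not released until the write completes
void compressAndWriteBlock(const std::vector<std::uint8_t>& block, std::FILE* output)
{
    // parallel part: placeholder for the real compressor call
    std::vector<std::uint8_t> compressed = block;

    // serialized part: the worker blocks here until its block has left memory,
    // which is exactly the back-pressure that keeps memory usage flat
    std::lock_guard<std::mutex> lock(g_writeMutex);
    std::fwrite(compressed.data(), 1, compressed.size(), output);
}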
There are still features I want in there and will implement in the coming months. I have a checksum field for the files but I don't calculate it yet (I just store 0 for now). The reason was that I couldn't decide which checksum algorithm to use, but Intel made the decision easier for me: SSE4.2 only does the CRC32C polynomial, while ARM supports both CRC32 and CRC32C in hardware. Guess which one I will be using?
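For reference, a minimal CRC32C sketch using the SSE4.2 intrinsics; this is not yet what snitch ships (the checksum is still stored as 0), and a real implementation would also want the ARMv8 __crc32c* path and a table-based software fallback.

#include <nmmintrin.h> // SSE4.2 intrinsics: _mm_crc32_u64, _mm_crc32_u8
#include <cstddef>
#include <cstdint>
#include <cstring>

// compile with -msse4.2; computes the CRC32C (Castagnoli) polynomial in hardware
std::uint32_t crc32c(const std::uint8_t* data, std::size_t size, std::uint32_t seed = 0)
{
    std::uint32_t crc = ~seed; // conventional pre-inversion

    std::uint64_t crc64 = crc;
    while (size >= 8)
    {
        std::uint64_t v;
        std::memcpy(&v, data, 8);
        crc64 = _mm_crc32_u64(crc64, v); // 8 bytes at a time
        data += 8;
        size -= 8;
    }

    crc = static_cast<std::uint32_t>(crc64);
    while (size--)
    {
        crc = _mm_crc32_u8(crc, *data++); // remaining tail bytes
    }

    return ~crc; // post-inversion
}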
The other feature I really want in there is encryption. I will not try to be the world's leading security expert or do anything super cool; I will most likely just use AES, a salt and a password. I don't want it for a false sense of security, just to add the smallest possible deterrent against people poking into resource files when I don't want someone to look there. The problem is now on the application side: how does the executable supply the password so that it stays secure.. it would be easy if I could personally type it in for everyone running my cool program, but no, I have to store it.. so there goes that security. F**K.
Some use cases (C++14)
// open a random file from a .snitch archive
// open a permanent container mapping (= parse the container only once)
Path path("data.snitch/"); // hypothetical archive name, just for illustration
File file(path, "foo/bar.jpg");
ConstMemory memory = file; // view the file as memory (memory mapping)
ConstMemory part = memory.slice(0, 1024); // just want to see the first 1024 bytes..
That's pretty much all there is to it for using .snitch archives from C++ source code. The .zip and .rar archive support works the same way, as the filesystem abstracts archives as folders. The Path is an abstraction for any folder, and paths are recursive, so you can have a parent path with child paths stacked as many levels deep as you want. In fact, when a File is created without a parent path, one is created internally anyway: that is how the system is designed. If there is no parent path, the raw filesystem is accessed. When a path contains an archive (zip, rar, snitch), that archive becomes the current path and uses the previous current path as its parent.
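As a rough illustration of the parent/child stacking (the folder and archive names here are made up, and the exact Path constructor taking a parent is my assumption, so check the actual headers):

Path assets("data/");                    // raw filesystem folder
Path bundle(assets, "textures.snitch/"); // .snitch archive opened as a folder, with "data/" as parent
File file(bundle, "foo/bar.jpg");        // file inside the archive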
If you have a .snitch file with, say, a JPEG file stored in there (not compressed) and you access that specific file, the file is mapped directly from its offset in the archive file: a zero-cost abstraction in the case of stored, non-compressed, non-encrypted files. Of course, if the files are encrypted then they will be decrypted into a buffer and that buffer is mapped to the caller.
It just works. There are downsides: if you have a zip, inside a zip, inside a rar, and they are all compressed, you are an idiot; that invokes decompression multiple times and wastes CPU and memory, like, a lot. But it works. :)
Where to get the goods
Snitch on github
MANGO on github
Don't use this code in serious work or production, as it is just at the proof-of-concept stage at this time. The checksum MUST be implemented before any serious use can be considered, because data integrity is super-duper important.
UPDATE: January 21st, 2019
A quick and dirty test on the ODROID-XU4, a 4-core ARM board. I compressed the build tree of MANGO (25.5 MB) with "zip -9" and it resulted in a 16.0 MB compressed file in 3.8 seconds. "snitch zstd 1" does the same job in 0.6 seconds and results in a 15.1 MB file; that is nearly 6.5 times the speed, with only 4 cores! The ARM version is definitely on solid ground. :)
UPDATE 2: March 19, 2020
The specifications for the next-generation Xbox Series X and Sony PlayStation 5 were announced this week. The Xbox Series X has a hardware LZ77 decompression engine which is quoted to decompress at 6 GB/s. The PS5 has a custom format engineered by RAD Game Tools which claims an 8-9 GB/s decompression rate, which in software would require 9 Zen 2 CPU cores to achieve similar performance.
The snitch file format was ~2 years ahead of the curve in anticipating a multicore-capable encoder/decoder to fill the I/O requirements of the coming years. Let's face it, folks, the hardware we have at our disposal is very capable and fast, but only if it is put to use: the cores won't find work to do by themselves. The current compression engine in MANGO is synchronous, but we are already engineering solutions that are asynchronous and play well with in-hardware decoding engines. It's quite simple: each compression job is given a lambda, which is called after the hardware engine signals completion. If the hardware engine is capable of signalling partial progress, then support for that is added as well, so that we can call the user back when enough data has been decompressed - the idea is that post-processing / consuming the data can start before the whole file is decompressed. The actual details depend on the XBSX and PS5 implementations and the APIs they provide, and on whether I ever get access to the information needed to realise this in practice. All the same, the plan is to move forward and support hardware capabilities similar to these next-generation consoles, as this technology is very good and should migrate to desktop / laptop / portable computing as well. The way I see it, it is just a matter of time.
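To make the idea a bit more concrete, here is a speculative sketch of what an asynchronous decompression call with a completion lambda could look like; decompressAsync, its parameters and the callback shape are assumptions for illustration, not an existing MANGO API.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// the callback receives the bytes decompressed so far, so consuming the data can
// begin before the whole file is done (partial progress, if the engine supports it)
using ProgressCallback = std::function<void(const std::uint8_t* data, std::size_t bytesReady)>;

std::future<Buffer> decompressAsync(Buffer compressed, ProgressCallback onProgress)
{
    // in this sketch the work is simply handed to a thread; a hardware engine would
    // instead signal completion (or partial progress) through its own mechanism
    return std::async(std::launch::async, [compressed = std::move(compressed), onProgress]
    {
        Buffer output = compressed; // placeholder for the real decompression
        if (onProgress)
            onProgress(output.data(), output.size());
        return output;
    });
}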