What's this bullshit? A picture of bananas!? They aren't even RIPE!!! You didn't sign up for this crap when you clicked on the article, did you? Stay calm; everything shall be explained shortly.
Background story
We must go back in time about ten years, to the beginning of the story. It was a fine autumn day at the Advanced Micro Devices office in Espoo, Finland. Somewhere well above my pay grade it was decided that there would be a decent OpenCL driver for the mobile Adreno GPU. Where there is a driver, there are also a lot of test cases. One of the tests was an OpenCL JPEG decoder, which I proceeded to develop in record time. It is unfortunate that there is no Guinness Book of Records category for software development speed.
The initial version did the Huffman decompression on the CPU and the MCU processing (iDCT, color conversion) on the GPU. The downside at the time was that OpenCL 1.0 did not support writing into GPU surfaces, so the results had to be written back to system memory. The overhead was so significant that doing everything on the CPU was faster if you only looked at how long the process took from start to finish for a single image. Of course, when a lot of images were batched together the GPU processing was a net win, since the throughput was much higher.
This was my first encounter with the internal details of the world famous lossy file format.
"Optimizing"
Fast-forward to 2012. I had worked on GPUs for a decade, but it was time to move on to new adventures. I had a bit of free time between jobs, and things took a turn for the worse: I had become obsessed with the format. It had to be possible to decode the files faster on the CPU as well.
When you want to optimize something like this you have to make a choice first: should I build on top of existing code like IJG (Independent JPEG Group), or write my own from scratch? IJG was an option at first, but once I started wanting my own data layout and pipeline, it became obvious that their API restricted me to doing things in pre-ordained ways. I could have pulled it off by modifying IJG, but I would have spent most of my energy engineering around the way it worked rather than focusing on the problem I was trying to solve. So in the end I wrote my own parser, which is actually easy to do since the format is a sequence of tagged segments, much like RIFF, and therefore simple to handle.
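To give an idea of how simple the container is, here is a minimal sketch (my own illustration, not the MANGO parser) of walking the top-level marker structure: every segment starts with 0xFF followed by a marker code, and most segments carry a 16-bit big-endian length that includes the two length bytes. SOI and EOI stand alone, and the entropy-coded data after SOS needs separate handling.

    #include <cstdint>
    #include <cstdio>

    // JPEG stores all multi-byte fields big-endian.
    static uint16_t readBE16(const uint8_t* p)
    {
        return uint16_t((p[0] << 8) | p[1]);
    }

    // Walk the top-level marker structure and print each segment.
    void walkMarkers(const uint8_t* p, const uint8_t* end)
    {
        while (p + 2 <= end)
        {
            if (p[0] != 0xFF)
                break;                          // not a marker; this sketch just bails out

            uint8_t marker = p[1];
            p += 2;

            if (marker == 0xD8 || marker == 0xD9)
                continue;                       // SOI / EOI: standalone, no payload

            if (p + 2 > end)
                break;

            uint16_t length = readBE16(p);      // includes the two length bytes
            std::printf("marker 0xFF%02X, %u bytes\n", unsigned(marker), unsigned(length));

            if (marker == 0xDA)
                break;                          // SOS: entropy-coded scan data follows
            p += length;
        }
    }

    int main()
    {
        // A fabricated minimal header: SOI, a DRI segment, then a truncated SOS.
        const uint8_t sample[] =
        {
            0xFF, 0xD8,                         // SOI
            0xFF, 0xDD, 0x00, 0x04, 0x00, 0x40, // DRI: restart interval of 64 MCUs
            0xFF, 0xDA, 0x00, 0x02              // SOS (truncated for the sketch)
        };
        walkMarkers(sample, sample + sizeof(sample));
        return 0;
    }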
So I coded and coded, then coded some more. Squeezed one cycle from here, another from there. Oh, the Huffman code extension can be branchless, nice, 2% extra speed from there. Oh, the Huffman codes can be MSB-aligned, another 5% from there. And so on and on. These micro-optimizations kept adding up, but there were no 4-8x performance increases in sight, even with the 4-core i7 processor I was using for development at the time.
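As an example of the kind of micro-optimization involved: the EXTEND step from the JPEG specification turns the raw bits of a coefficient into a signed value, and the textbook version branches. It can be written branch-free along these lines (a sketch of the general technique, not MANGO's actual code):

    #include <cstdio>

    // EXTEND from the JPEG spec: v is the raw t-bit value (t >= 1). If the top
    // bit of v is clear, the coefficient is negative and 1 - 2^t must be added.
    static int extendBranchy(int v, int t)
    {
        return (v < (1 << (t - 1))) ? v - (1 << t) + 1 : v;
    }

    // Branch-free variant: build an all-ones mask when the value is negative,
    // then apply the correction with an AND instead of a conditional jump.
    static int extendBranchless(int v, int t)
    {
        int mask = (v >> (t - 1)) - 1;          // 0 if positive, -1 if negative
        return v - (mask & ((1 << t) - 1));     // subtract 2^t - 1 only when negative
    }

    int main()
    {
        // quick sanity check over all (value, bit-count) pairs up to 10 bits
        for (int t = 1; t <= 10; ++t)
            for (int v = 0; v < (1 << t); ++v)
                if (extendBranchy(v, t) != extendBranchless(v, t))
                    std::printf("mismatch at v=%d t=%d\n", v, t);
        return 0;
    }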
Why is that? The reason is fairly simple and obvious: the decoder is bottlenecked by the Huffman decoder, which is a serial operation. There is no obvious mechanism to decode the symbols concurrently; you have to completely decode the current symbol before you know where the next one begins. Think about it like this: the data is coming through a straw (the Huffman decoder), and all those CPU cores can only do the rest of the transformation on the data while the decoder is working. Even if we made the MCU processing infinitely fast, we would still be limited by how fast the Huffman decoder runs. This is where the story should have ended, but it didn't.
The Restart Interval
JPEG has a feature meant for error correction and recovery: the restart interval, with its RST markers. A file that uses the feature defines the restart interval, which is how many MCUs there are between consecutive RST markers in the stream. If there is an error, the decoder can seek to the next RST marker and knows exactly which pixels between the markers are corrupted. The way it works is that the Huffman compression is restarted at every RST marker, so that decoding can continue from a fresh, clean, known state.
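The interval itself is declared in the DRI (Define Restart Interval, 0xFF 0xDD) marker segment, which simply holds a 16-bit MCU count. A small sketch of reading it (again my own illustration, not MANGO code):

    #include <cstdint>
    #include <cstdio>

    // JPEG stores all multi-byte fields big-endian.
    static uint16_t readBE16(const uint8_t* p)
    {
        return uint16_t((p[0] << 8) | p[1]);
    }

    // Parse a DRI segment; p points just past the 0xFF 0xDD marker bytes.
    // Returns the restart interval in MCUs (0 means restart markers are disabled).
    static uint16_t parseDRI(const uint8_t* p)
    {
        uint16_t length = readBE16(p + 0);   // segment length, always 4 for DRI
        uint16_t interval = readBE16(p + 2); // number of MCUs between RST markers
        std::printf("DRI: length %u, restart interval %u MCUs\n", length, interval);
        return interval;
    }

    int main()
    {
        const uint8_t dri[] = { 0x00, 0x04, 0x00, 0x40 };  // length 4, interval 64
        parseDRI(dri);
        return 0;
    }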
WAIT A MINUTE? WHAT DID I JUST WRITE THERE? Did you see it!?? Holy ****, does this mean what I think it means? Yes, it does!!! The Huffman compressed intervals are in complete isolation from each other!
Scanning for a specific byte pattern is super fast, and at worst it just prefetches the data into physical RAM (I prefer to work with memory-mapped files, so the stream looks like flat memory to me). When I find a new interval, I dispatch a new decoding job, which processes the decompressed data and writes it to the target buffer immediately. The decoder instantly became more than four times faster with a very small, roughly 20-line change to the source code.
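Here is a sketch of how the scan can be sliced (my own outline, not the actual ~20-line change). RST markers are the byte pairs 0xFF 0xD0 through 0xFF 0xD7, and inside entropy-coded data a literal 0xFF is always followed by a stuffed 0x00, so any 0xFF 0xDn pair really is a marker:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Interval
    {
        const uint8_t* begin;
        const uint8_t* end;
    };

    // Split the entropy-coded data of a scan into independently decodable
    // intervals. 'p'..'end' spans the compressed data between SOS and EOI.
    std::vector<Interval> splitAtRestartMarkers(const uint8_t* p, const uint8_t* end)
    {
        std::vector<Interval> intervals;
        const uint8_t* start = p;

        while (p + 1 < end)
        {
            if (p[0] == 0xFF && p[1] >= 0xD0 && p[1] <= 0xD7)
            {
                intervals.push_back({ start, p });  // one self-contained interval
                p += 2;                             // skip the RST marker itself
                start = p;
            }
            else
            {
                ++p;
            }
        }

        intervals.push_back({ start, end });        // tail after the last marker

        // Each interval can now be dispatched to its own worker thread: every one
        // starts from a clean Huffman state and reset DC predictors.
        return intervals;
    }

    int main()
    {
        // Synthetic entropy-coded data with one RST0 marker in the middle.
        const uint8_t scan[] = { 0x12, 0x34, 0xFF, 0x00, 0x56, 0xFF, 0xD0, 0x78, 0x9A };
        auto parts = splitAtRestartMarkers(scan, scan + sizeof(scan));
        std::printf("%zu intervals\n", parts.size());   // prints "2 intervals"
        return 0;
    }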
Now the problem was that most files do not have RSTs, so the "optimization" is exotic and rarely activated. On the other hand, I can do my small part in ensuring that any file I encode will have the restart interval markers. This is actually very desirable, since it allows parallel encoding of the bit-streams: the more CPU cores you have, the faster the images will be encoded. Of course, this still burns CPU time, but the work is distributed over a shorter span of time; in other words, the latency is much smaller. I measured the CPU budget and it is roughly in the same ballpark, so energy consumption is not really affected - you just get your bananas rendered faster.
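Conceptually the parallel encoder looks something like the sketch below (my own outline, not MANGO's API; encodeInterval is a hypothetical stand-in for the real per-interval Huffman encoder): each restart interval is encoded on its own thread into a private buffer, and the buffers are then stitched together with RSTn markers.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <future>
    #include <vector>

    // Placeholder: a real implementation would Huffman-encode mcuCount MCUs
    // starting at firstMCU and return the byte-stuffed bit stream.
    static std::vector<uint8_t> encodeInterval(int firstMCU, int mcuCount)
    {
        (void) firstMCU;
        (void) mcuCount;
        return {};
    }

    std::vector<uint8_t> encodeScanParallel(int totalMCU, int restartInterval)
    {
        std::vector<std::future<std::vector<uint8_t>>> jobs;
        for (int mcu = 0; mcu < totalMCU; mcu += restartInterval)
        {
            int count = std::min(restartInterval, totalMCU - mcu);
            jobs.push_back(std::async(std::launch::async, encodeInterval, mcu, count));
        }

        std::vector<uint8_t> scan;
        for (size_t i = 0; i < jobs.size(); ++i)
        {
            std::vector<uint8_t> bits = jobs[i].get();
            scan.insert(scan.end(), bits.begin(), bits.end());
            if (i + 1 < jobs.size())
            {
                scan.push_back(0xFF);
                scan.push_back(uint8_t(0xD0 + (i & 7)));   // RST0..RST7 cycle
            }
        }
        return scan;
    }

    int main()
    {
        std::vector<uint8_t> scan = encodeScanParallel(1000, 64);   // 1000 MCUs, restart every 64
        std::printf("%zu bytes of scan data\n", scan.size());
        return 0;
    }

Because every interval starts from a reset Huffman state and DC predictor, the workers never need to communicate; the only serial part left is concatenating the buffers.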
About the bananas: I took the photograph myself sometime around 2008. It has no other significance than the fact that it was my reference image for all optimization-related timings for the longest time whenever I wanted a quick ballpark figure. On top of that I had a corpus of roughly 500 MB that I ran through to get a more accurate average figure.
Decoding Results
Let's see how this image fares today with common JPEG encoder/decoder routines out there in the wild.
stb_image is a header-only library with SIMD optimizations.
libjpeg-turbo is a SIMD-optimized version of the IJG implementation; it is 100% API compatible with it and also offers its own API as an option.
MANGO is the library I have been working on; it has its own custom interface.
nvJPEG is a CUDA / GPU accelerated JPEG decoding library.
The first test was done on an Intel Core i7-3770K, gcc 7.3.0, image: 3888x2592 (bananas, the classic go-to image I used years ago), 8x8 MCU, no RSTs.
              load      save
----------------------------------------
stb_image     107 ms    546 ms
libjpeg-turbo  76 ms     62 ms
mango          42 ms     28 ms
nvjpeg         45 ms      x
Next, let's see what happens on an Intel Core i9-7900X, gcc 8.1.0, image: 5184x3456 (lazy cat). This image has RST markers!
              load      save
----------------------------------------
stb_image     250 ms    991 ms
libjpeg-turbo 194 ms    146 ms
mango          28 ms     36 ms
nvjpeg        213 ms      x
Time for some ARM action to spice things up! This image does not have RSTs, so the Huffman decompression is single-threaded and the MCU processing is multi-threaded.
ARMv7, gcc 7.3.0, image: 5000x4000 (playing cards)
              load       save
----------------------------------------
stb_image     7256 ms    8821 ms
libjpeg-turbo  395 ms     688 ms
mango          173 ms     267 ms
nvjpeg           x          x
Without RSTs, nvJPEG's performance is roughly equal to MANGO's. My best guess as to why is that the Huffman decoder is limiting the performance and runs at roughly the same speed in both implementations.
Batch Processing
We can still further improve the decoding rate by decoding multiple images simultaneously. Doing this, we can decode 10601 images, containing 88 GB of raw image data, in 12.6 seconds with MANGO, burning 2 minutes 36 seconds of CPU time on a 10-core Intel i9-7900X powered Linux computer. A quick back-of-the-envelope calculation gives roughly 7 GB/s of decoded image data (88 GB / 12.6 s), which works out to about 0.7 GB/s per CPU core, or 0.35 GB/s per hardware thread since Hyper-Threading is enabled.
I was unable to run the above "10K files" benchmark with nvJPEG, as it does not support all types of JPEG files, so I had to make a smaller batch-processing test set: 73 files, 150 MB in total, taking 1.7 GB when decompressed.
nvJPEG batch decoding:
real 0m2.800s
user 0m2.357s
sys 0m0.433s
stb_image:
real 0m0.578s
user 0m6.857s
sys 0m1.100s
libjpeg-turbo:
real 0m0.415s
user 0m5.014s
sys 0m0.592s
MANGO:
real 0m0.384s
user 0m4.156s
sys 0m1.063s
If the reader is not familiar with the UNIX "time" command: the first number (real) is how much wall-clock time running the command took. The second number (user) is the amount of CPU time consumed; for example, if we run for 2.0 seconds and 8 threads are under full load, we used 16.0 seconds of CPU time in 2.0 seconds of real-world time. Makes sense?
The last number (sys) means how much CPU time was spent on system calls.
From these UNIX times we can determine that MANGO loaded the 73 images in 0.38 seconds and nvJPEG took 2.8 seconds to do the same amount of work.
Conclusions
The other two software decoders were doing pretty nicely in the batch processing test; the results are all in the same general ballpark. This is because the benchmark decodes multiple files simultaneously (20 on this test run on a 10-core machine), which makes the overall throughput up to 20 times larger than it would be if we processed one file at a time serially. Twenty times more work is, in this case, enough to fully occupy all of the CPU cores at maximum load.
If we add up the CPU load for MANGO we get roughly 5.2 seconds of CPU time spent on the workload. libjpeg-turbo spent 5.6 seconds of CPU time, so the difference isn't that great when processing a lot of files simultaneously. The only practical difference is that MANGO's design results in significantly smaller latency, which should give more flexibility in how to utilize the encoder/decoder engine. For batch processing, there is no substantial difference.
Where to get the goods
"coding cave / super secret hideout"