Unreasonable effectiveness of JPEG

28.05.2013 20:41

A colleague at the Institute is working on compressive sensing and its applications in the sensing of the radio frequency spectrum. Since whatever he comes up with is ultimately meant to be implemented on VESNA and the radio hardware I'm working on, I did a bit of reading on this topic as well. And as occasionally happens, it led me to a completely different line of thought.

The basic idea of compressive sensing is similar to lossy compression of images. You make use of the fact that images (and many other real-world signals) have a very sparse representation in some transform domains. For images in particular, the best-known transforms are the discrete cosine and wavelet transforms. For instance, you can keep only a few percent of the largest coefficients, ignore the rest, and the photograph will be practically indistinguishable from the original. That fact is exploited with great success in common image formats like JPEG.

Gradual decrease of JPEG quality from right to left.

Image by Michael Gäbler CC BY 3.0

What caught my attention though is that even after some searching around, I couldn't find any good explanation of why common images have this nice property. Every text on the topic I saw simply seemed to state that if you transform an image so-and-so, you will find out that a lot of coefficients are near zero and you can ignore them.
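That statement is easy to check numerically. Here is a minimal sketch, using a synthetic smooth image as a stand-in for a photograph; the 2% threshold is an arbitrary choice for illustration:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Synthetic smooth "photograph": broad gradients and slow oscillations,
# standing in for a real photo.
x = np.linspace(0, 1, 64)
img = np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)) + x[None, :]

# 2D discrete cosine transform, as used (blockwise) by JPEG.
coeffs = dctn(img, norm='ortho')

# Keep only the 2% largest coefficients by magnitude, zero the rest.
thresh = np.quantile(np.abs(coeffs), 0.98)
sparse = np.where(np.abs(coeffs) >= thresh, coeffs, 0)

# Reconstruct and measure the error -- for smooth images it stays tiny.
recon = idctn(sparse, norm='ortho')
err = np.sqrt(np.mean((img - recon) ** 2)) / np.sqrt(np.mean(img ** 2))
print(f"relative RMS error with 2% of coefficients: {err:.4f}")
```

For an image like this, the relative error after discarding 98% of the coefficients stays well under a percent; the question in this post is why real photographs behave the same way.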

As Dušan says, images that make sense to humans only form a very small subset of all possible images. But it's not clear to me why this subset should be so special in this regard.

This discussion is about photographs, which are ultimately two-dimensional projections of light reflected from objects around the camera. Is there some kind of succinct physical explanation why such projections should, in the common case, have sparse representations in the frequency domain? It must be that this somehow follows from macroscopic properties of the world around us. I don't think it has a physiological background as Wikipedia implies - the mathematical definition of sparseness doesn't seem to be based on the way the human eye or visual processing works.

I say common case because that certainly doesn't hold for all photographs. I can quite simply take a photograph of a sheet of paper on which I printed some random noise, and that won't really compress well.
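The contrast between the two cases can be made concrete. A sketch comparing how much DCT energy the largest 2% of coefficients capture for a smooth synthetic image versus random noise (both images and the 2% figure are made up for illustration):

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)

def energy_in_top(frac, image):
    """Fraction of total energy in the largest DCT coefficients."""
    c = np.abs(dctn(image, norm='ortho')).ravel()
    c.sort()
    k = int(len(c) * frac)
    return float((c[-k:] ** 2).sum() / (c ** 2).sum())

x = np.linspace(0, 1, 64)
smooth = np.outer(x, x)                 # stand-in for a well-behaved photo
noise = rng.standard_normal((64, 64))   # the printed-noise photograph

print(energy_in_top(0.02, smooth))  # very close to 1
print(energy_in_top(0.02, noise))   # far from 1
```

The smooth image packs essentially all of its energy into a handful of coefficients, while for white noise the top 2% of coefficients hold only a small fraction of the energy, so truncation visibly degrades it.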

It's interesting that if you think about compressing video, the most common tricks I know of can be easily connected to physical properties of the world around us. For instance, differential encoding works well because scenes contain distinct objects and in most scenes only a few of them move at a time. This means only a part of the image changes against an unmoving background, so storing just the differences between frames is obviously more efficient than encoding each frame individually. The same goes, for example, for motion compensation, where camera movement to a first approximation just shifts the image in some direction without affecting shapes too much.
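A toy sketch of why frame differencing pays off (the two frames, the flat background, and the 8x8 moving patch are all synthetic, made up for illustration):

```python
import numpy as np

# Two synthetic video frames: a static background with one small object
# that moves a few pixels between frames.
background = np.full((64, 64), 100, dtype=np.int16)

frame1 = background.copy()
frame1[10:18, 10:18] = 200   # object at its first position

frame2 = background.copy()
frame2[10:18, 14:22] = 200   # same object, shifted 4 pixels right

# Differential encoding stores frame2 as frame1 plus a difference; the
# difference is nonzero only where the object was and where it ended up.
diff = frame2 - frame1
changed = np.count_nonzero(diff) / diff.size
print(f"fraction of pixels that changed: {changed:.3f}")  # → 0.016
```

The difference image is almost entirely zeros, which is trivially cheap to encode compared to a full second frame.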

However, I see no such easy connection between the physical world and sparseness in the frequency domain, which is arguably much more basic to compressing images and video than the tricks above (after all, as far as I know most video formats still use some form of frequency domain encoding underneath all the other processing).

Posted by Tomaž | Categories: Ideas


If you look at the big picture in time/space, smoothing takes place and very few discontinuities occur. If you go deeper, the uncertainty of quantum effects takes over. There were a lot of debates about efficient 2D coding when I was young :-)

P.S. Nice to see the S51FF reference. One of the great SLO engineers.

Posted by MMM

The sparseness in the frequency domain is actually what makes the image hold any information at all. If all the frequencies were equally present, your picture would be noise.

Starting from noise, any information you add is actually removing some frequencies from the noise. That would be subtractive synthesis. In the end, a picture with lots of data for your eyes ends up having few frequencies left. This is actually just information theory.

Your eyes are not good at filtering random noise and locating the actual data in it. So, removing the noise is actually helping your eyes do their work.

It also happens that the whole set of tools (theoretical as well as hardware) we are using is designed around the way we perceive the world. So our CCD sensors work just like our eyes, putting the information right where we expect it. And all this frequency stuff is designed that way because that's how our senses actually work: the eyes see colors because they are different wavelengths of light; our ears hear different sounds through something like a Fourier series decomposition (with lots of sensors each tuned to a particular frequency). Our eyes are good at high-frequency/small details in their center area and low-frequency/blurry pictures around it, and our brain is a very good processor for all this frequency domain data and makes sense (and space/time domain interpretations) out of it.

If our senses and brains worked otherwise, the frequency domain stuff would not have been very helpful, but surely another mathematical theory would fit the need. It's just a case of tools built around their use.

I'm not an expert in information theory, but doesn't it say that the signal with the highest information content is the one that has no correlation between samples? Such a signal has all frequencies equally present in it. I know that modern digital radio codes have a flat spectral envelope for exactly that reason.
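A quick numerical check of the "uncorrelated samples have a flat spectrum" part (a sketch; the 256-sample length and 2000 averages are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Average the power spectra of many independent white-noise signals.
# Uncorrelated samples have a flat expected power spectrum, so the
# averaged spectrum should be nearly constant across all bins.
n, trials = 256, 2000
psd = np.zeros(n)
for _ in range(trials):
    x = rng.standard_normal(n)
    psd += np.abs(np.fft.fft(x)) ** 2
psd /= trials

print(psd.max() / psd.min())  # close to 1 for a flat spectrum
```

Any single realization is wildly bumpy; only the average over many realizations flattens out, which is the sense in which white noise "has all frequencies equally present".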

Regarding technology copying our natural senses: of course it does. But from the standpoint of the argument in this blog post, I would say our perception of the world evolved to be like it is exactly because it works well given physical laws. So that doesn't answer the question of why it works so well.

Is it possible to show sparseness of the picture data ab initio (no prior knowledge, just from basic physical laws)?

Posted by Tomaž

In the information theory sense, a picture full of noise would have more data, because each pixel can't be predicted from the ones around it. However, our eyes don't deal well with so much information, and they actually analyse the lowest-frequency part of the picture.

That's the key to lossy compression algorithms: most of the "data" that makes the same image so big when compressed with a lossless system is actually irrelevant to our eyes.

JPEG filters this irrelevant data by keeping what matters to our eyes:
- Some very low-frequency data to get a general idea of what the picture looks like (most of the retina only sees blurry color areas)
- And some medium-frequency data that adds details to the picture, without the need to go into very high frequencies (the central part of the eye can process this, and will scan the picture fast enough for you not to notice anything).

The fact that the uncompressed picture already has the same features is because we spent years tweaking CMOS sensors (and post processing the results) to remove things irrelevant to our eyes. The sensors use RGB because this matches how our eyes work, and the filtering algorithms applied make the picture look even more like what our eyes expect.

I don't think you'll always find near-zero coefficients when computing the DCT of an image; as you say, it's pretty easy to make a noisy picture. The idea is that you can always remove the smallest frequency peaks, as your sensor (your eyes) is only able to see the most important ones.

Think of it like looking at a complex modulated signal with a frequency meter instead of a spectrum analyzer. You can distort the signal pretty badly without noticing anything with such a tool, as long as you keep its strongest frequency peak.

So, the fact that the relevant data in the picture is easy to find is because the CMOS sensor already did the filtering work, and the fact that DCT or wavelet transforms work so well is because our eyes actually work in the frequency domain more than in the spatial one, and they do so by working on small areas of the environment at a time (notice how JPEG slices the image into small squares before attempting the frequency domain transform).

Our brain is very good at processing this data and building up a spatial representation of what's going on, which is rather useful since, on the other hand, our hands and other moving parts are driven in the spatial domain.

This actually has nothing to do with the physical objects you took a picture of in the beginning; you could have used a different sensor and gotten an entirely different capture of the same environment, likely one that wouldn't compress as well with JPEG. So, if you want to understand this more deeply, don't look at the environment; look at the way the eyes and brain build up their representation of it. This is what both the sensors and the algorithms are tweaked for.
