I hate MMX

17.10.2005 22:48

Do you want your code to be fast? Take my advice and don't even try hand-optimizing it for fancy SIMD instruction sets like MMX, SSE and the like. In fact, don't even google for "mmx optimization", or you'll be deceived by pretty numbers and benchmarks and forget what you've read here. You will spend hours writing optimized memcpy() functions, parallelizing loops in your code and putting those damned PREFETCHes everywhere. And in the end (if you're anything like me), you'll rm -rf everything in frustration, because no matter how you look at it, your cool MMX-optimized assembly code will always run slower than plain, dumb, unoptimized C.

Or perhaps I'm doing something wrong here? Perhaps not. (No, I'm not mixing floating point and MMX!) Even AMD's optimization guide says that in most cases you should use scalar instructions instead of vector ones. Instead of parallelization, it recommends some (in my opinion very ugly) hacks that give you some control over, for example, what is stored in the processor cache, how instructions are pipelined and how successful branch prediction will be. Those video software guys must be practicing black magic or something if they really get that 200% boost from MMX.

The problem I see here is that today's CPUs have too much internal logic that thinks it is smarter than you: it reorders instructions, renames registers, controls the cache, ... I agree that's good for the 99.9% of the code that you don't want to bother optimizing, but for the few loops that need to be hand-optimized, it is a nightmare. Instead of simply instructing the CPU to store this and that in its cache, I have to perform some weird sequence of reads from memory (take a look at some of the examples in that optimization guide if you don't believe me) to trick the CPU's internal logic into thinking that perhaps it is a good idea after all to store that in the cache. And if I don't do it just right, the wrong things will end up in the cache and that latest CPU will start emulating a 486. It would be nice if CPUs had a mechanism to turn all this mess off and let me be in control for a while.

I miss the days when I could count the exact number of cycles the CPU needed to execute a function instead of having to measure it with 10% uncertainty.
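To show what "measuring it with uncertainty" looks like in practice, here is a minimal sketch in C. It runs a function several times and reports the fastest and slowest run, so the spread between them is visible. Everything in it is an assumption made for illustration: the names `timing_range`, `time_runs` and `sample_work` are made up, and it uses C11 `timespec_get`, which reads the wall clock and is therefore only a rough stand-in for a proper cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical result type: fastest and slowest observed run, in nanoseconds. */
typedef struct {
    double min_ns;
    double max_ns;
} timing_range;

/* Difference b - a in nanoseconds, computed per field to keep precision. */
static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) * 1e9
         + (double)(b->tv_nsec - a->tv_nsec);
}

/* Call fn() `runs` times and record the shortest and longest duration. */
timing_range time_runs(void (*fn)(void), int runs)
{
    assert(fn != NULL && runs > 0);
    timing_range r = {0.0, 0.0};
    for (int i = 0; i < runs; i++) {
        struct timespec t0, t1;
        timespec_get(&t0, TIME_UTC);   /* wall clock, an assumption of this sketch */
        fn();
        timespec_get(&t1, TIME_UTC);
        double ns = elapsed_ns(&t0, &t1);
        if (i == 0 || ns < r.min_ns) r.min_ns = ns;
        if (i == 0 || ns > r.max_ns) r.max_ns = ns;
    }
    return r;
}

/* Hypothetical workload; the volatile sink keeps the loop from being removed. */
static volatile unsigned sink;

void sample_work(void)
{
    unsigned acc = 0;
    for (unsigned i = 0; i < 100000u; i++)
        acc += i * i;
    sink = acc;
}
```

Calling `time_runs(sample_work, 20)` and comparing `max_ns` with `min_ns` gives a feel for the run-to-run spread on a given machine. The gap depends on caches, frequency scaling and whatever else the system is doing, which is exactly the complaint.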

Posted by Tomaž | Categories: Code

Comments

Funny, I recently wrote an MMX-optimized version of the StrLen procedure and it really is twice as fast and, more importantly, smaller than the highly optimized implementation using 32-bit instructions.

Search for StrLen here: http://fresh.flatassembler.net/fossil/repo/fresh/artifact/dc7c5394faf8132f61d291f0971ecf5b10ba8523

Using MMX is actually a little bit hard for a human brain, mainly because of the really weird opcodes, but sometimes it is worth it.
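
For readers wondering how a string length routine can go faster than one byte at a time without SIMD, one common trick is to test several bytes at once with plain integer arithmetic. The C sketch below illustrates that general idea only; it is not the code from the linked Fresh repository. It uses 64-bit words rather than 32-bit ones, takes an explicit buffer capacity (`cap`) so that it never reads out of bounds, and the names `has_zero_byte` and `swar_strnlen` are made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if at least one of the 8 bytes in w is zero.
 * Subtracting 1 from each byte borrows into its high bit only when the
 * byte was 0 (or already had the high bit set, which "& ~w" filters out). */
static int has_zero_byte(uint64_t w)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    return ((w - ones) & ~w & highs) != 0;
}

/* Length of the string in buf, scanning at most cap bytes.
 * Returns cap if no terminator is found within the buffer. */
size_t swar_strnlen(const char *buf, size_t cap)
{
    assert(buf != NULL);
    size_t i = 0;

    /* Check 8 bytes per iteration while a whole word still fits in buf. */
    while (cap - i >= 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);   /* alignment-safe load */
        if (has_zero_byte(w))
            break;                       /* terminator is in this word */
        i += 8;
    }

    /* Finish byte by byte: locates the terminator or handles the tail. */
    while (i < cap && buf[i] != '\0')
        i++;
    return i;
}
```

For a buffer declared as `char a[32] = "hello";`, the call `swar_strnlen(a, sizeof a)` returns 5. Whether this beats the C library's `strlen` or an MMX version on a particular CPU is something only a measurement can tell.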
