Yesterday I had a problem. I had a chunk of binary data saved in a file that I had extracted from another file. I needed to find where this chunk of data had come from. I wanted to know if it appeared in the original file more than once, and if so, where. I also wanted to see if it appeared in a number of other files.
There is no command in Linux (that I know of) that does this, so I set out to write my own. I used the obvious, naive implementation: I memory mapped the file containing the chunk I was looking for (the needle) and the file containing all the data I was looking through (the haystack), then walked through the haystack byte by byte and ran memcmp at each offset to see if the needle was there.

    #include <stdio.h>
    #include <string.h>

    /* Slide a needle-sized window across the haystack, comparing at every
       offset. Note <= so the needle can end at the last haystack byte. */
    int compare(char *needle_pa, int needle_size,
                char *haystack_pa, int haystack_size)
    {
        int i;
        int found = 0;

        for (i = 0; i <= haystack_size - needle_size; i++) {
            if (memcmp(needle_pa, haystack_pa + i, needle_size) == 0) {
                printf("Found needle at byte offset 0x%x\n", i);
                found++;
            }
        }
        return found;
    }

This was fast enough for my purposes, but intellectual curiosity led me to try to optimize this function. There are a number of ways to speed this up, but the one that seemed most fun was to compare more bits at once. This led me to explore the internals of memcmp.
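In case you're curious about the setup around that function, here is roughly how each file gets mapped. It's a minimal sketch with bare-bones error handling, and map_file is a helper name I made up, not part of the original program:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a whole file read-only; returns the mapping and stores its size. */
    static char *map_file(const char *path, size_t *size_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); exit(1); }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); exit(1); }

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); exit(1); }

        close(fd);  /* the mapping stays valid after the fd is closed */
        *size_out = st.st_size;
        return p;
    }

Once both files are mapped this way, compare() runs directly over the mapped bytes with no read() calls or buffering to worry about.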
glibc’s memcmp
Glibc comes with a memcmp implementation, of course. You can view the source here. It's quite simple. If the passed buffer is not aligned on a word boundary, iterate byte by byte, comparing via subtraction, until you reach a word boundary. Then loop through word by word, comparing with the equality operator (==). Once you've consumed all the full words, compare the final few bytes by subtraction again.

This implementation is pretty good. It's fast on most platforms, as it maximizes aligned comparisons. It's very portable, as it does not take advantage of any processor-specific features. However, the fact that it does not take advantage of processor features means that it could be faster.
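To make that strategy concrete, here is a minimal sketch of the word-at-a-time idea. It is not glibc's actual code (glibc also handles buffers whose two pointers have different alignments, which this version ignores), and the function name is mine:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    int memcmp_wordwise(const void *a, const void *b, size_t n)
    {
        const unsigned char *pa = a;
        const unsigned char *pb = b;

        /* Bytes until pa hits a word boundary, compared by subtraction. */
        while (n > 0 && ((uintptr_t)pa % sizeof(uintptr_t)) != 0) {
            if (*pa != *pb)
                return *pa - *pb;
            pa++; pb++; n--;
        }

        /* Full words compared with ==; memcpy keeps the loads legal C. */
        while (n >= sizeof(uintptr_t)) {
            uintptr_t wa, wb;
            memcpy(&wa, pa, sizeof wa);
            memcpy(&wb, pb, sizeof wb);
            if (wa != wb)
                break;  /* drop down to find the differing byte */
            pa += sizeof wa;
            pb += sizeof wb;
            n -= sizeof wa;
        }

        /* The trailing bytes (or the mismatched word), by subtraction. */
        while (n > 0) {
            if (*pa != *pb)
                return *pa - *pb;
            pa++; pb++; n--;
        }
        return 0;
    }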
SIMD (Single Instruction Multiple Data)
Many platforms have SIMD instructions: vector instructions that let you manipulate large chunks of data with a single instruction. A common operation in SIMD instruction sets is string comparison, which is essentially what memcmp does. On x86, for instance, there are instructions that compare strings 16 bytes at a time. That's 4x the throughput of the glibc implementation's word-sized comparisons.

Taking advantage of SIMD instructions is a bit tricky. They're not supported on all platforms, even within the same processor family. Intel, for instance, has the SSE, SSE2, and SSE3 instruction sets, each of which adds new instructions not found in the previous iteration. Generally there's no reason to go through the pain of using these instructions unless you absolutely know you need the speed boost. Today, we're just doing it for fun.
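To give a flavor of what "16 bytes at a time" looks like, here is a sketch of a single 16-byte equality test written with SSE2 intrinsics. It assumes a compiler and CPU with SSE2 support, and the function name is mine:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* _mm_cmpeq_epi8 sets each result byte to 0xFF where the inputs match,
       and _mm_movemask_epi8 collapses those bytes into a 16-bit mask. */
    static int blocks16_equal(const void *a, const void *b)
    {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i eq = _mm_cmpeq_epi8(va, vb);
        return _mm_movemask_epi8(eq) == 0xFFFF;  /* all 16 bytes equal */
    }

One vector compare replaces four word compares (on a 32-bit machine), which is where the potential speedup comes from.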
GCC built-in functions
GCC comes with a number of SIMD-optimized replacements for common libc functions, and memcmp is one of the functions GCC optimizes. However, turning on this optimization is a bit tricky. Again, since every platform supports different instructions, you have to be very specific about which features are available in order to get the maximum optimization. GCC will always create the most compatible binary it can unless specifically told otherwise, which is the right thing to do.

To compile with GCC's optimized memcmp, we run the following:

    gcc -Wall findneedle.c -O3 -march=opteron -o findneedle

This tells GCC to optimize as much as it can (-O3) and that it can expect to run on a processor with the same features as the Opteron (x86-64, SSE1-3). I'm running this on an Intel Core Duo, which supports the same feature set. This gives us a version of the findneedle program that uses GCC's SIMD-optimized memcmp instead of the standard implementation.
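As an aside, if you're unsure what a given -march value actually enables, recent versions of GCC can report it; the grep here is just to trim the output down to the SSE-related flags:

    gcc -Q --help=target -march=opteron | grep sse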
The Results
My test was simple: I timed how long it took to find a 16K needle in a 328MB haystack.

At first I was doing this in a virtual machine, and the most interesting result of the whole exercise is that the program slowed down significantly in the VMware image: it went from an average of ~4 seconds to an average of ~10 seconds. Suspecting that the virtual machine might be emulating the SIMD instructions, I compiled on my Mac itself. There I saw the expected speedup: the average search time went from ~4.4 seconds to ~4.1 seconds. That's a 7% speedup!
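For what it's worth, the measurement needs nothing fancier than the shell's time builtin; the invocation and file names below are just illustrative, not the program's actual interface:

    time ./findneedle needle.bin haystack.bin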
Clearly, these results show that optimizing your program to take advantage of processor features does not always help very much. However, if you're compiling your own programs (I'm looking at you, Gentoo fanatics), it does help to make sure your processor options are set correctly. Just switching a few flags can improve core functionality, and that makes the whole experience that much better.
Also, assume your virtual machine is a 486. It’s way faster that way.
Update: There was a good debate about optimizing the algorithm itself over on Hacker News. jws was nice enough to do a rundown of a number of different algorithms. Not surprisingly, the one demonstrated above is indeed the most naive.