xavier roche's homework

random thoughts and sometimes pieces of code

MD5 Is Your Friend

MD5 is Your Friend, Really ?

Yes, MD5 is your friend. Believe me.

Why is MD5 Your Friend ?

My first encounter with MD5 was around fifteen years ago, when I implemented the first hashtable in httrack. The code was not really glorious, but at least I quickly discovered something: if you design your own hashing function, you’re gonna have a bad time.

Here’s what I first implemented (please do not scream):

unsigned long int hash_for_dummies(const unsigned char* s) {
  unsigned long sum;
  unsigned int i;
  for(i = 0, sum = 0 ; *s != '\0' ; i++) {
    sum += 1;
    sum += (unsigned int) *s;
    sum *= (unsigned int) *(s++);
    sum += i;
  return sum;

Well, on several (I mean, on one of two) examples, it was kind of a cool hashing function actually. Of course, once being used inside my hashtable, things did not go so well, and on several sites with special URL patterns, the table exploded (in term of complexity).

Oh yes, I forgot to mention: the hashing function was meant to be used in a hashtable, and the keys being hashed were URLs. The funny thing with URLs is that you may have any pattern:

  • totally randomly distributed URLs (with nice random-looking /session/?5209d5f396ca0d46)
  • not so random URLs (plain text URLs, such as /I-made-a-huge-mistake)
  • really not very random URLs (/page/1, /page/2, … /page/999999)

When hashing with a specific function, you may have very good results for one of specific pattern, or for several families, but time to time you discover that the hashing function betrayed you in specific cases.

So, MD5 was the Solution ?

Yes, yes, but you ruined the plot! Okay, that was expected, but yet, after discovering my poor hashing abilities, I decided to use the thermo-nuclear option: MD5, the cryptographic hashing function, which has been studied during long years, and which is considered an excellent solution for hashing anything.

And it was good. Hardly no collisions. Perfect distribution of keys. No nightmares about specific URL patterns that would ruin everything. MD5 was perfect.

And Costly. Switch to FNV-1 or Murmur!

Ah, yes, the cost. MD5 has a bad reputation of being over-costly. If you look at the internal loop, you will discover an intimidating block of 64 MD5STEP macros such as:

#define F1(x, y, z) (z ^ (x & (y ^ z)))
#define MD5STEP(f, w, x, y, z, data, s) \
( w += f(x, y, z) + data,  w = w<<s | w>>(32-s),  w += x )
  MD5STEP(F1, a, b, c, d, in[0] + 0xd76aa478, 7);
  MD5STEP(F1, d, a, b, c, in[1] + 0xe8c7b756, 12);
// ... during 64 lines

If you look closer, the internal loop function is actually processing 64 bytes at once (with final zero-padding) ; which mean that for each byte, a dozen simple (add, xor, shift…) operations is executed. That’s a lot, and not so much at the same time. Most operations can be executed in one cycle or less (pipelined) by a modern processor – and it has to be compared to more expensive operations (such as mul) used in other hashes.

So yes, MD5 is probably slower. Well, actually it is:

MD5 FNV-1 Murmur 3
6.0s 4.40s 1.54s

(Tests executed on a 3801493504 bytes file, on a Xeon® CPU E5-1620 0 @ 3.60GHz)

Basically 50% slower than some FNV-1 variant, and four times slower than Murmur3.

That’s Horrible!

No, actually not. Remember one thing: we’re not hashing the universe, we’re hashing small keys (No, keys are not supposed to be huge binary blobs). We have plenty of work to do after that (like, uh, storing the value in the hashtable, and probably many other interesting things with them), and the keys hashing is only a small subset of the code.

If you do more extensive tests, such as, inserting and reading to a hashtable, things are less clear:

FNV-1 MD5 Murmur
1m13s 1m54 1m13s

(Tests executed on 10000000 keys progressively, with keys of length between 80 and 100 bytes, with insert/overwrite/delete patterns.)

Yep, basically MD5 is only increasing marginally (well, yes, I call 30% “marginally” – I know everybody would love to earn marginally more) the overall hashtable cost.

But Still, It Is Slower

Yes. And I don’t care. Remember the URLs pattern I was talking about previously ? Well, with MD5, I can sleep quietly. I don’t have to worry whether the hash function has been studied correctly. MD5 has. And even if on several tests Murmur has proven to be a reliable hashing function (with the same number of collisions than MD5, basically), I’m still reluctant in exchanging a little CPU cost with a bit of algorithmic uncertainty.

TL;DR: I stopped worrying and learned to love MD5.

Note: during my various tests, I used 64-bit hashing functions variants, with shift+mask to isolate the number of bits needed (ie. the hashtable bucket size was a power of two). The hashes were always processed by computing 128-bit hashing data, XOR-folded into a 64 bit version.