I know what you’re going to say. “Heck, who the hell writes a hashtable nowadays? Take the STL or any random library and use it!”
Well, yes, a hashtable is probably the thing every developer writes at least once in their lifetime. Generally much more than once.
The first hashtable for httrack was written in 1999, at the beginning of the project, when it became obvious that the linear algorithms in use could not possibly scale beyond a thousand pages or so, even on reasonably powerful machines.
This first version was rather basic:
- a fixed-size bucket array (this was bad), sized with a lame prime number
- using linked lists (this was not great)
- using MD5 hashing (this was not so bad)
We can quickly discuss these three points:
- the fixed-size decision was probably a poor design choice, but on the other hand most projects have more or less the same number of pages – or, should I say, nobody is going to crawl a million pages (well, this is actually false)
- the linked-list decision was not so great either, because it meant allocating a memory chunk for every single item put in the table
- the MD5 hash function choice was not so bad – I’ll discuss this point in a later entry – especially considering my poor hashing skills at the time (it was much better than picking a weak hash function, as many people still do nowadays)
Still, httrack suffers from not-so-great performance when you attempt to capture a very large site, so I have been addressing this long-standing issue over the last several weeks.
Now, what were my options?
- use the stl
- too bad, httrack was developed in pure, plain C (NO, I don’t want to convert the code)
- remember the various quicksort bugs/design choices? (Solaris: old version in O(N²), Linux: memory corruption with a buggy comparator, etc.)
- use an external library
- humm, increasing the number of external dependencies annoys me, because I have to find cross-platform solutions (yes, Windows, but also Android – did you know Android does not ship OpenSSL or iconv? well, now you know)
- take a small-is-beautiful piece of C code from the outside and use it, hoping that it will be somewhat supported, deliver proper performance, have no bugs, etc.
- humm, so I need to fix the code myself in case of problems?
- write the hashtable by myself
- after all, this is a piece of cake, isn’t it?
Well, actually yes and no. The idea was to code something faster than the current version, so I did some lookup on the INTERNETS.
First, I needed the following properties:
- lookup in O(1) (better: guaranteed constant time)
- insert/replace items in O(1) (better: guaranteed constant time)
- delete items in O(1) (better: guaranteed constant time)
- limit memory allocations
- enumerate all items efficiently
Oh, and I also needed:
- should not crash too often
In the wonderful world of hashtables, people are still discovering new algorithms. A pretty nice one I decided to pick is cuckoo hashing. It is kind of recent (2001) for such a basic algorithm.
Basically, like many other open-addressing strategies, it uses a fixed-size bucket array (you have to make sure the array is not filled with too many elements – keeping it at most half-full is a convenient choice), and looks up a position directly using a hash function. But, unlike linear probing and its friends, rather than recomputing a new position each time you collide, a secondary hash function gives you an alternate place. And if you still have a collision, you forcibly expel the colliding item and put yours in its place (like a cuckoo would do, wouldn’t it?). Of course, you now have to take care of the expelled item, by placing it at its alternate position (i.e. the position produced by the alternate hash function), and you may repeat the same operation until the last expelled item finds an empty place.
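To make the eviction dance concrete, here is a minimal sketch in C – toy hash functions, integer keys, made-up names; this is an illustration, not httrack’s actual code:

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 16          /* keep the table at most half-full */
#define MAX_LOOP   16          /* give up after this many evictions */

typedef struct { uint32_t key; int used; } slot_t;
static slot_t table[TABLE_SIZE];

/* Two cheap, different hash functions (illustrative only). */
static uint32_t hash1(uint32_t k) { return (k * 2654435761u) % TABLE_SIZE; }
static uint32_t hash2(uint32_t k) { return ((k ^ (k >> 16)) * 0x45d9f3bu) % TABLE_SIZE; }

/* Returns 1 on success, 0 if the eviction chain runs too long. */
static int cuckoo_insert(uint32_t key) {
    for (int i = 0; i < MAX_LOOP; i++) {
        uint32_t h1 = hash1(key), h2 = hash2(key);
        if (!table[h1].used) { table[h1].key = key; table[h1].used = 1; return 1; }
        if (!table[h2].used) { table[h2].key = key; table[h2].used = 1; return 1; }
        /* both slots taken: expel the occupant of the first slot, take its place */
        uint32_t victim = table[h1].key;
        table[h1].key = key;
        key = victim;          /* now find a home for the expelled key */
    }
    return 0;                  /* chain too long: a real table would rehash */
}

/* At most two probes, whatever the table contents. */
static int cuckoo_lookup(uint32_t key) {
    return (table[hash1(key)].used && table[hash1(key)].key == key)
        || (table[hash2(key)].used && table[hash2(key)].key == key);
}
```

Note that lookup probes exactly two slots no matter what, which is where the guaranteed constant time comes from; insertion is the only operation that can wander.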
What’s so great about this strategy ?
- lookup is guaranteed constant time (yay)
- deleting an item or replacing its value is guaranteed constant time (yay)
- enumerating is simple (enumerate the bucket, skipping empty elements)
- only one big chunk of memory (oh, and one for the keys – okay, let’s say two memory blocks, with some string pool cleaning)
- in theory, inserting can lead to an N-item loop (N/2 items, actually), and even to an infinite loop – but statistically you may rest well, my friend (insert references to the mathematical papers explaining why here), because the average is O(1) (and, if the table is never more than half-full, empirical tests show that the average is something like “1.4 Θ(1)”)
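Enumeration really is that simple – a sketch over a hypothetical half-full bucket array (made-up layout and contents, for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8
typedef struct { uint32_t key; int used; } slot_t;

/* A sample half-full bucket array. */
static const slot_t table[TABLE_SIZE] = {
    { 42, 1 }, { 0, 0 }, { 7, 1 }, { 0, 0 },
    { 0, 0 }, { 13, 1 }, { 0, 0 }, { 99, 1 },
};

/* Enumeration is a plain scan: visit used slots, skip the holes. */
static size_t count_items(void) {
    size_t n = 0;
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i].used)
            n++;   /* a real enumerator would hand table[i].key to a callback */
    return n;
}
```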
There’s only one little issue: the “in theory… statistically you may rest well” part is a bit of bullshit, because statistically it may happen. Darn. And it does happen, actually, from time to time.
If you look at the algorithm literature, you will find this gem (taken from Wikipedia): “[…] the procedure enters an infinite loop. In the latter case, the hash table is rebuilt in-place using new hash functions”
Oh great. So if I enter an infinite loop, I have to design a new hash function? You are kidding me, right?
Okay, it means I can slightly change some kind of starting point in the hash function – but what if it still loops? Well, err, dunno, this is not possible, right?
Fortunately, our friends at Microsoft (well, the ones at the research division, not the ones responsible for the Windows 8 debacle) ended up with a nice and elegant solution through the use of a stashing area. The idea looks a bit odd at first glance: colliding items are just put… outside the table, in a special stashing area (i.e. a simple array). The beauty of this solution is that you only need a few (like 4 or so) stash slots, and this number does not depend on the number of items in the hashtable (i.e. it can be static, yay). I will not discuss the boring statistical reasons behind this magic, but, well, it is proven, so let’s rest.
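A minimal sketch of the stash idea (toy hash functions and made-up names – not Microsoft’s published code, nor httrack’s): insertion parks the last orphan in a tiny fixed array when the eviction chain runs too long, and lookup pays at most a few extra probes:

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 16
#define STASH_SIZE 4           /* independent of the table size */
#define MAX_LOOP   16

typedef struct { uint32_t key; int used; } slot_t;
static slot_t table[TABLE_SIZE];
static slot_t stash[STASH_SIZE];

static uint32_t hash1(uint32_t k) { return (k * 2654435761u) % TABLE_SIZE; }
static uint32_t hash2(uint32_t k) { return ((k ^ (k >> 16)) * 0x45d9f3bu) % TABLE_SIZE; }

static int insert_with_stash(uint32_t key) {
    for (int i = 0; i < MAX_LOOP; i++) {
        uint32_t h1 = hash1(key), h2 = hash2(key);
        if (!table[h1].used) { table[h1].key = key; table[h1].used = 1; return 1; }
        if (!table[h2].used) { table[h2].key = key; table[h2].used = 1; return 1; }
        uint32_t victim = table[h1].key;   /* expel the occupant, take its place */
        table[h1].key = key;
        key = victim;
    }
    /* the chain looped: park the orphan in the stash instead of rehashing */
    for (int i = 0; i < STASH_SIZE; i++)
        if (!stash[i].used) { stash[i].key = key; stash[i].used = 1; return 1; }
    return 0;                  /* stash full too: time for a real rehash */
}

static int lookup_with_stash(uint32_t key) {
    if (table[hash1(key)].used && table[hash1(key)].key == key) return 1;
    if (table[hash2(key)].used && table[hash2(key)].key == key) return 1;
    for (int i = 0; i < STASH_SIZE; i++)   /* at most STASH_SIZE extra probes */
        if (stash[i].used && stash[i].key == key) return 1;
    return 0;
}
```

Because the stash has a constant size, lookup is still guaranteed constant time: two slots plus at most four stash entries.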
TL;DR: so, I made a hashtable.