
> If you get a collision in a Cuckoo hash, you're pretty much guaranteed a cache miss.

Don't people usually use buckets with cuckoo hashing, so that each location houses 2 or more values (one right after the other)? I guess I'm not seeing how that's fundamentally different from linear probing...



Why would they do that?

Cuckoo hashing is just a collision-resolution technique used in open-addressing hashing schemes. Having the two possible locations of a key always next to each other (which is what you suggest by using buckets, if I understand correctly) would cause the cuckoo algorithm for kicking out conflicting items to stop working.

If both possible locations for a key are always in the same bucket, then two keys with a colliding hash would also share the alternative hash (i.e. it would be the other slot in the same bucket). And so they would only be able to kick each other out: a loop of length 2, always. If three keys hash into the same bucket, you have to rehash the entire table.

You can use larger buckets, but the problem remains: once a large enough number of keys hashes into the same bucket, you have to rehash the table. That would most probably mean low typical load factors.

Of course, what you will be getting in return is the locality of memory references. But is that a reasonable trade-off? I think there are better ways to achieve locality of memory references...


Forgive me, but I think you've misunderstood something here. The most important thing about cuckoo hashing is that your bucket-choice functions (i.e. hash functions) are independent. If two keys hash to the same first-choice bucket, their alternate hashes will (almost certainly) point to DIFFERENT buckets. For instance, take the SHA-1 of the key: your first bucket choice is defined by the first x bits, your second bucket choice by the next x bits. And inside a bucket, you can have several slots.
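That bit-splitting scheme can be sketched in a few lines of Python (a toy illustration only; the function name and the 8-byte split points are my own choices, not from the comment):

```python
import hashlib

def bucket_choices(key: bytes, num_buckets: int) -> tuple[int, int]:
    """Derive two (near-)independent bucket indices from one SHA-1 digest
    by carving it into two disjoint byte ranges, as described above."""
    digest = hashlib.sha1(key).digest()
    first = int.from_bytes(digest[:8], "big") % num_buckets
    second = int.from_bytes(digest[8:16], "big") % num_buckets
    return first, second
```

Two keys that collide on `first` will almost certainly disagree on `second`, because the two indices come from disjoint bits of the digest.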

It works like this. A typical sizing is 4 slots per bucket, with 2 bucket choices. On your first choice you arrive at a bucket, which is just a linear array of length 4: you can put your element anywhere in it. Later, when you go to insert into your first choice and find all 4 slots taken, you can take ANY of those elements and try to put it in its alternate bucket. If that alternate bucket is full, you evict from it, too. Thanks to the random nature of your hash functions, this works out very well: 2 buckets of 4 slots gets you near 95% occupancy without many evictions on insert.

When you do a lookup, you check your first bucket, scanning all 4 elements until you find the key. If you don't find it, you go to your alternate bucket and check all 4 elements there. But that's it: only two random accesses (the 4 slots easily fit in a cache line). Checking all 4 elements is not so bad because the cost is dominated by loading the line in the first place.
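Putting the insert and lookup procedures together, a minimal (2,4)-cuckoo table might look like this sketch (Python, using the built-in `hash` split into two index ranges as a stand-in for two independent hash functions; the class name and eviction cap are my own, not from any particular paper):

```python
import random

SLOTS = 4            # slots per bucket
MAX_EVICTIONS = 500  # give up and signal "rehash needed" after this many kicks

class CuckooTable:
    """Toy (2,4)-cuckoo hash set: two bucket choices, four slots each."""

    def __init__(self, num_buckets):
        self.n = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _choices(self, key):
        h = hash(key)
        # Two disjoint index ranges of one hash, per the bit-splitting idea.
        return h % self.n, (h // self.n) % self.n

    def insert(self, key):
        b1, b2 = self._choices(key)
        for b in (b1, b2):
            if len(self.buckets[b]) < SLOTS:
                self.buckets[b].append(key)
                return True
        # Both buckets full: evict a random victim and push it to its alternate.
        b = random.choice((b1, b2))
        for _ in range(MAX_EVICTIONS):
            slot = random.randrange(SLOTS)
            key, self.buckets[b][slot] = self.buckets[b][slot], key
            c1, c2 = self._choices(key)
            b = c2 if b == c1 else c1       # the victim's other bucket
            if len(self.buckets[b]) < SLOTS:
                self.buckets[b].append(key)
                return True
        return False  # insertion failed: table needs growing/rehashing

    def lookup(self, key):
        b1, b2 = self._choices(key)
        return key in self.buckets[b1] or key in self.buckets[b2]
```

A real implementation would grow and rehash when `insert` returns `False`, rather than just reporting failure.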

I really recommend a read of some of the cuckoo papers, it's cool stuff.


I am aware that using buckets the way you have described works fine for the cuckoo hashing collision-resolution technique. But it would not ensure the same kind of locality of memory references as you get with linear probing.

I do not know what exactly the commenter meant by suggesting buckets. I assumed (as the conversation is predominantly about locality of memory references) that they suggested using buckets in order to avoid two (or more) cache misses. And I just pointed out that it would be a bad trade-off, because then the cuckoo algorithm would only operate within a bucket and the expected load factors would be low.


> And I just pointed out that it would be a bad trade-off, because then the cuckoo algorithm would only operate within a bucket and the expected load factors would be low.

Like 'cmurphycode', I think you might be misinterpreting how the buckets would be used. The buckets are in addition to the multiple hashes (which I think some authors call 'ways'). A good bucket size would be one sufficient to fill 1 or 2 cachelines, since those are already fetched from memory as a unit.

The idea is that you fetch the full buckets from 2 (or more) independent hashes, and then sift through them linearly without further memory accesses. The terminology for cuckoo hashes is poor; I think "buckets", "bins", and "slots" are all synonyms depending on whose description you are using. By "using buckets", the idea is to have (at least) two regions of consecutive memory in which the item will be if it exists.

I think this paper offers a reasonable terminology: http://www.cs.princeton.edu/~xl/Publications_files/cuckoo-eu.... They use a "num_hashes, num_buckets" pair to describe variations. The classic version is a "2,1-cuckoo": you create two hashes from the key, and look in one bucket at each location. Li et al. suggest that a "2,4-cuckoo" allows much easier insertions at high density.


> The idea is that you fetch the full buckets from 2 (or more) independent hashes, and then sift through them linearly without further memory accesses

Maybe. How do you know what the commenter (erichocean) meant?

If they say that using "buckets" can provide the same locality of memory references as linear probing, their interpretation of what "buckets" means must deal with that problem in some way.


> How do you know what the commenter (erichocean) meant?

Because "bucket" is a well-defined term-of-the-art with regard to cuckoo hashes, because Erich has demonstrated deep knowledge of similar algorithms in the past, because I've been thinking about improving the memory locality of cuckoo hashes for years, and because I would have used approximately the same phrasing as Erich did to describe the memory locality advantages of "buckets".

> If they say that using "buckets" can provide the same locality of memory references as linear probing, their interpretation of what "buckets" means must deal with that problem in some way.

Not only do buckets "deal" with locality: because of the granular nature of cachelines and the parallel abilities of memory controllers, searching two small compact regions turns out to be about as fast as searching one region of twice the size. Vectorized linear probing and SIMD cuckoo hashes are so similar that your question very likely indicates that you are not using the same terminology as others.

Here's a paper that offers a good comparison:

  Rethinking SIMD Vectorization for In-Memory Databases
  Polychroniou, Raghavan, and Ross (2015)
http://www.cs.columbia.edu/~orestis/sigmod15.pdf


It's the same problem though. As the other comment said, you have 2 uncorrelated memory locations. If you check the first one and your element isn't there, then checking the second one will probably incur a cache miss.

The difference between linear probing and buckets is a big one: buckets are linked lists. That means hitting a bucket doesn't pull the rest of the bucket's contents into the CPU cache, so iterating through the bucket may incur a cache miss for every element. Linear probing in open addressing, on the other hand, does away with buckets and instead searches the neighbors of the collision site. Because they are nearby in memory, the search is much quicker.
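The linear-probing side of that comparison can be sketched as a toy open-addressing set in Python (function names are illustrative):

```python
def probe_insert(table, key):
    """Insert into a flat array, scanning forward from hash(key) % len(table).
    Neighboring slots are contiguous in memory, so the scan stays in cache."""
    n = len(table)
    i = hash(key) % n
    for _ in range(n):
        if table[i] is None or table[i] == key:
            table[i] = key
            return True
        i = (i + 1) % n
    return False  # table is full

def probe_lookup(table, key):
    """Scan forward from the home slot; an empty slot ends the probe chain."""
    n = len(table)
    i = hash(key) % n
    for _ in range(n):
        if table[i] is None:
            return False
        if table[i] == key:
            return True
        i = (i + 1) % n
    return False
```

Unlike walking a linked-list bucket, each probe step here touches the next array slot, which is usually already in the cache line fetched for the first one.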


You can have preallocated buckets in an array. What's the difference, you ask? When you have more elements per bucket, the variance becomes smaller as a fraction of the bucket's size. Imagine a hash table with buckets of 1000 elements: you'd probably feel comfortable with a 90% load even without any strategy to handle bucket overflow except growing and rehashing. Obviously, the downside is that we have to do more work to search within buckets.
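That variance argument is easy to check with a quick Monte-Carlo sketch (the function and its parameters are my own, purely for illustration):

```python
import random

def overflow_rate(num_buckets, bucket_size, load, trials):
    """Estimate how often at least one fixed-size bucket overflows when
    keys are hashed uniformly into the table at the given load factor."""
    keys = int(load * num_buckets * bucket_size)
    overflows = 0
    for _ in range(trials):
        counts = [0] * num_buckets
        for _ in range(keys):
            counts[random.randrange(num_buckets)] += 1
        if max(counts) > bucket_size:
            overflows += 1
    return overflows / trials
```

At a 90% load, 4-slot buckets overflow in nearly every trial, while 1000-slot buckets essentially never do: a bucket's occupancy deviates from its mean only on the order of the square root of its size.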



