Hacker News | dryman's comments

Awesome. This will be a super useful reference once I want to migrate the library to C++.


All these concerns are critical, and this is why I spent 6 months building the OPIC malloc prototype. I surveyed many state-of-the-art malloc implementations, as well as some POC papers. The model I implemented is similar to scalloc and supermalloc. I ran a simple benchmark, and the performance is identical to jemalloc's. That seems good enough for me for now. OPIC could potentially use an even faster concurrency model by dropping the atomic implementation and moving to urcu (userspace RCU). However, this isn't strictly necessary for a serialization-focused malloc. The details of how I implemented this malloc will be discussed in a later post.

References: https://github.com/cksystemsgroup/scalloc https://github.com/kuszmaul/SuperMalloc


I don't have a good answer to this question yet. For now I only create the heap in swap, write it to disk, and then use it as a read-only mmap. To ensure the written file is valid, one can write it to a temporary file first; once the file is confirmed written, mv it to the desired name and location. This works for immutable data, but it's a big blocker for making OPIC work on a mutable backing store.

This problem is generally hard. See [Ensuring data reaches disk](https://lwn.net/Articles/457667/)


You should have looked more closely at LMDB, which already solves this problem. Also you could look at LMDB's API for using fixed-address mmap, which lets you store pointer-based structures without any deserialization step at all.


The problem is that C++ brings in many extra pointers, for example the vtable pointer used by virtual functions. Any pointer not converted to an offset can be invalid in the next process that deserializes the object.


You don't have to use virtual functions though. You can just use plain old data structures.


If I use a strict subset of C++, it probably will. However, figuring out which subset of the various C++ standards and implementations avoids extra pointers is hard. I might need to re-implement some useful utilities like unique_ptr and shared_ptr as well. Some fundamental data structures in C++ include internal pointers too; for instance, the short string optimization introduces extra pointers. With my poor C++ knowledge, I don't even know what other pointers I'm missing.


Author here. I did think about creating a programming language specialized for serialization. Fortunately, plain C seems to be sufficient for building a POC. Another advantage of using C is that it is easier to embed into other languages. OPIC is more library-focused, so integrating into other languages matters. Jai seems to be an application-focused (gaming) language, where language abstractions that let developers code faster matter more.


Thanks for pointing out the thread-unsafe alternative. I'll update the benchmark with it and double-check that I configured it with cityhash. (Will update in a day or two.)

Speaking of cityhash, have you tried farmhash as well? I'm not sure what I did wrong, but farmhash's performance wasn't as good as cityhash's in my hash table. Did you experience the same problem?


Thanks! Appreciate the responsiveness. :)

We haven't - I had some other issues with Farm. I tried for a while to see if I could get FarmHash included as the standard hash for TensorFlow instead of the murmur-like homebrewed one it uses, but the cross-platform build issues became too complicated to justify it. We ended up sticking with the murmur-like one for "fast but unsafe", and providing SIP via the highwayhash package as the "strong hash".


In my experiments, linear probing in robin hood doesn't give good results either. I hope the author has tried robin hood with quadratic probing. But it's a good reference. I didn't dive deep into the cuckoo route; this paper gives me more confidence to try it out.


If you're curious about blending some of the "don't increase by a power of two" memory efficiency gains, I implemented this in a cuckoo hash table for TensorFlow, also using Lemire's trick: https://github.com/tensorflow/tensorflow/blob/master/tensorf...

I never back-ported it to libcuckoo, because it adds a fair bit of complexity (and requires an extra bit), but if there's demand for non-2^k tables, I'd love to know more, either externally or internally. (dga@, for various values of cs.cmu.edu and google)


Lemire's fast range works perfectly with cuckoo, as it doesn't require probing (?). The fast mod and scale trick was invented to address this issue.

It's actually pretty funny. I didn't know about Lemire's trick until I started writing this article, but my first try at fixed-point arithmetic was exactly the same as his work. Then I figured out that the lower bits get omitted with fast range (I named it scaling). Finally I came up with this fast mod and scale idea.

I don't know about other people's need for non-power-of-2 tables. My target is building large-scale hash tables and other data structures, where memory and data compactness are quite critical.


Thanks for raising this. I spent way, way more time on OPIC (almost a year) than on the robin hood hash map (2 weeks to complete the POC). Glad to see people noticing this project :)

The other comment pointed out MDBM, which I didn't know about. Their performance numbers may show why OPIC robin hood is quite optimal. https://yahooeng.tumblr.com/post/104861108931/mdbm-high-spee...

MDBM gives users raw access to the mmap'ed data and pointers, and its benchmarks show it running 10x faster than rocksdb and leveldb. The design of OPIC has even less overhead than MDBM (which may not be a good thing), and it also works on a mmap'ed file (or anonymous swap). There's no lock, transaction, or WAL in OPIC; it just gives you the raw performance a hash table can offer.


Thanks for pointing that out. I used std::string mainly because I suck at C++ :(. In C I can easily make the key length a variable passed to the constructor, but I don't know how to do the same in C++.


1) I don't see why a linear-probing robin hood table is a sorted array. Can you explain more?

Robin hood hashing doesn't limit which probing scheme you use. I ended up with quadratic probing, which gives me both good cache locality and good probe distributions. The probing schemes I tried were omitted from this post because they would add too much noise, but I can give you a quick summary here:

1. Linear probing: the probe distribution has a high median and high variance. Performance is not that great either, because of the high probe counts.

2. Quadratic probing: the probe distribution is not the best, but the median sticks to 2 probes. Since the first two probes are very close together in quadratic probing, its overall performance wins.

3. Hashing the key with the probe number added to the seed. This gives a very good hash distribution, but hashing on each probe is very slow. You also cannot do deletion with this scheme.

4. rotate(key, probe). This is somewhat like rehashing, but way faster. The probe distribution is also very good, but items go too far away, so we lose cache locality.

5. Combinations of the schemes above. Still, quadratic probing gives me the best performance.

I also tried using gcc/clang vector extensions to speed up probing, but it actually slows things down by 30%! I guess I'd have to hand-tune the SSE intrinsics and measure carefully with IACA to get optimal performance.

Deletion itself is quite complicated and deserves its own post.


I was actually working on a blog post about this a few months ago, but never published it because I wasn't happy with it. Here it is, anyway: https://goo.gl/W1KZ2t

What I found in testing was that linear probing beat everything when it had a sufficiently good hash function and a sufficiently low load factor. The load factor was pretty important. Even when factoring in cache misses, page faults, and the cost of allocations, a linearly probed table with a low load factor always beat more space-efficient designs.

>I also tried using gcc/clang vector extensions to speed up probing, but it actually slows things down by 30%! I guess I'd have to hand-tune the SSE intrinsics and measure carefully with IACA to get optimal performance.

They're only advantageous when the hash is perfect and the data is in cache. Otherwise the performance should be the same as regular code. To clarify what I meant by SSE with double hashing: I really meant searching an entire cacheline before incrementing by the second hash value. SSE can be used for this in some cases, but regular code works just as well.

>Deletion itself is quite complicated and deserves its own post.

That's why I prefer the linear variant; deletion is easy!


It's a sorted array because you're sorting by 2 keys (A, B). A is the hash, normally a u32 or u64. B is the probe count, also normally the same size because system ints are easy.

Strictly speaking, the Robin Hood "sort" isn't purely lexicographically ordered. A larger A value may have replaced a smaller A with a much larger B.

But this relation is nonetheless just a weird way to build a comparison function. One that arguably doesn't work in a strict sense, but one that _nearly_ works.


So if A is the hashed-to index into your table, then B is a function of A. When you are doing your linear probe and you see a value along with its probe count, one of three cases holds:

1. It has a higher probe count: its hash is further away than yours => less than yours.

2. It has a lower probe count than yours: its hash is closer than yours => greater than yours.

3. It is equal in probe count to yours: it's in the same bucket => equal to yours.


When I first see "sorted array", my first impression is that it's sorted by the original key. If we go by the hashed value, then every hash table is a "sorted array" under that definition.

Also, there's a final mod in linear probing: h(k, i) = (k + i) mod N. With this mod you may have different key/probe combinations that mess up the order.

