It's somewhat similar in that both approaches turn writes into sequential I/O, writing to sorted arrays (for stratified b-trees) or to SSTables (for BigTable and Cassandra), and then merging them together with more sequential I/O. This is the not-so-secret sauce that keeps inserts fast compared to vanilla b-trees.

However, stratified b-trees bring some algorithmic rigor to the merging of those arrays, which provides guarantees on search time.
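
To make the shared write pattern concrete, here is a minimal sketch (in Python, with invented names; not BigTable's, Cassandra's, or the paper's actual code) of buffering writes in memory, flushing them as sorted runs with one sequential write, and merging runs later:

    import bisect

    class LogStructuredStore:
        def __init__(self, flush_threshold=4):
            self.memtable = {}            # in-memory buffer of recent writes
            self.runs = []                # on-disk sorted arrays, newest first
            self.flush_threshold = flush_threshold

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.flush_threshold:
                # one sequential write of a sorted run; no in-place updates
                self.runs.insert(0, sorted(self.memtable.items()))
                self.memtable = {}

        def compact(self):
            # merge all runs sequentially into a single sorted array
            merged = {}
            for run in reversed(self.runs):   # oldest first, so newest wins
                merged.update(run)
            self.runs = [sorted(merged.items())]

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for run in self.runs:             # probe newest run first
                i = bisect.bisect_left(run, (key,))
                if i < len(run) and run[i][0] == key:
                    return run[i][1]
            return None

Note that get() has to probe every un-merged run in the worst case, which is exactly where the search-time guarantees (and the Bloom filters discussed below) come in.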

The authors do indeed provide a comparison with Cassandra that should answer your question: http://www.acunu.com/2011/03/cassandra-under-heavy-write-loa...

Summary: 1) Single-key read performance in Cassandra depends somewhat on when the last compaction was done (Bloom filters help here, but even a couple of false positives per query costs random I/O; the fewer SSTables there are, the less random I/O), so write spikes lead to poor read performance, and 2) Bloom filters don't work for range queries, so those tank in Cassandra in general. Stratified b-trees don't need Bloom filters and perform better, and more consistently, in both cases.
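
For the single-key case, the read path looks roughly like this (hypothetical names, not Cassandra's actual classes); every Bloom filter false positive costs one wasted random read, so read cost grows with the number of un-compacted SSTables:

    def read_key(key, sstables):
        # sstables ordered newest first; bloom_filter and random_read are
        # hypothetical stand-ins for the per-SSTable structures
        for sstable in sstables:
            if sstable.bloom_filter.might_contain(key):  # in-memory check
                value = sstable.random_read(key)         # one random disk I/O
                if value is not None:
                    return value                         # true positive
                # otherwise the filter gave a false positive; the I/O was wasted
        return None

With n SSTables and false-positive rate p, even a miss costs about n*p random reads on average, and range queries can't use the filters at all.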



Thanks for the links. The statistics on latency distribution are quite impressive, to say the least.

Why do you say that stratified b-trees don't need Bloom filters? Yes, the improved merging discipline reduces the number of arrays to read, but presumably there is often >1 array, which is sufficient to make Bloom filters desirable. Even if you only have two arrays, doubling the number of random I/Os per key lookup is easily enough of a penalty to make Bloom filters worthwhile. The paper itself seems to indicate that Bloom filters are used:

"In the simplest construction, we embed into each array a B-tree and maintain a Bloom filter on the set of keys in each array. A lookup for version v involves querying the Bloom filter for each array tagged with version v, then performing a B-tree walk on those arrays matching the filter."


The paper doesn't say, so I'm making some assumptions here, but if you had a Bloom filter per array, then as these doubling arrays get really big, all the Bloom filter would tell you is that the target entry is probably somewhere in that giant array. A false positive in the Bloom filter would still cause a pretty significant amount of work.

Stratified b-trees use forward pointers to narrow the search in the next array down the tree. As with regular b-trees, the smaller root arrays will likely be cached in memory, so the number of random I/Os will be small.
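
Here is a rough sketch of how I read the forward-pointer idea (the layout and names are my assumptions, not the paper's): each entry in a smaller array also records where its key region starts in the next, larger array, so each level only needs a local binary search rather than a full descent from that array's root.

    import bisect

    def cascading_lookup(key, levels):
        # levels: sorted arrays from smallest to largest; entries are
        # (key, value, fwd_ptr) triples, fwd_ptr indexing into the next level
        lo = 0
        for level in levels:
            i = bisect.bisect_left(level, (key,), lo, len(level))
            if i < len(level) and level[i][0] == key:
                return level[i][1]
            # follow the forward pointer of the nearest preceding entry
            # into the next (larger) array
            lo = level[i - 1][2] if i > 0 else 0
        return None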



