Materialized views are great, as the article showed. I use them to get query response times down to milliseconds, since they vastly reduce the amount of data ClickHouse must scan.
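For illustration, the pattern looks roughly like this (table and column names are invented here, not taken from the article):

    -- Raw sensor data.
    CREATE TABLE readings (
        sensor_id UInt32,
        ts        DateTime,
        temp      Float32
    ) ENGINE = MergeTree
    ORDER BY (sensor_id, ts);

    -- The view keeps one pre-aggregated row per sensor per day, so
    -- queries read the small rollup instead of the raw rows.
    CREATE MATERIALIZED VIEW readings_daily
    ENGINE = SummingMergeTree
    ORDER BY (sensor_id, day)
    AS SELECT
        sensor_id,
        toDate(ts) AS day,
        count()    AS samples,
        sum(temp)  AS temp_sum
    FROM readings
    GROUP BY sensor_id, day;

Averages then come from sum(temp_sum) / sum(samples) over the rollup, which is orders of magnitude less data to read.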
That said, there are a lot of other tools: column storage, vectorized query execution, efficient compression including per-column codecs, and skip indexes, to name a few. If you only have a few billion rows, it's still possible to get sub-second query results using brute-force scans.
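Those features all compose in a single table definition; an illustrative DDL sketch:

    CREATE TABLE events (
        ts   DateTime CODEC(DoubleDelta, LZ4),  -- delta-encodes timestamps
        temp Float32  CODEC(Gorilla),           -- codec designed for floats
        tag  String   CODEC(ZSTD(3)),
        INDEX temp_idx temp TYPE minmax GRANULARITY 4  -- skip index
    ) ENGINE = MergeTree
    ORDER BY ts;

The minmax skip index lets ClickHouse discard whole ranges of granules that can't match a WHERE clause on temp, before doing any decompression.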
Disclaimer: I work for Altinity, who wrote this article.
p.s. Loading the view is low-cost compared to loading the source data. On the NUC it takes 40-60 minutes, so worst case it's something like 1h / 17.5h = 5.71% of the load time. Also, you can still query the source data directly; that's fast for individual sensors, as the examples showed.
Yeah, I was confused; I couldn't tell what was precomputed stats (column min/max/count), what was view calculations, and what was actual performance -- even legacy SQL vendors do all of those. That's apples to oranges, more of a statement against the other DB than for ClickHouse. Likewise, the comparison I'd like to see is against _other columnar stores_.
I know some folks running one of the larger clickhouse instances out there... but this article made me trust the community less, not more.
Because it front-loaded all the operations so that they happen outside of the benchmark. Depending on what you want to do, that can make sense, but the original intention of the benchmark was a brute-force query benchmark.
Have you seen the sticker on the NUC? 116 billion rows per second at 233.61 GB/s. If you spend even a single second thinking about how absurd that number is, you'll see that the two benchmarks measure completely different things. Even with a quad-channel Xeon CPU you won't see significantly more than 100 GB/s of memory bandwidth. That 116 billion figure didn't actually happen; it's a synthetic number. The result of the query was computed during insertion of the temperature records, before the benchmark had even started, and then they just calculated the theoretical number of queries you'd have to run for an equivalent result and slapped that fictional number on their NUC.
That's a sticker from a ClickHouse community event, not related to the benchmark. We tend to stick them on anything flat. My ancient Dell XPS-13 has one. It's definitely not that fast.
That said, the sticker is from a real performance test. I assume it was a cluster but don't have details. ClickHouse query performance is outstanding--it's not hard to scan billions of rows per second on relatively modest hosts. These are brute force queries on source data, no optimization using materialized views or indexes.
For instance, I have an AWS m5d.2xlarge with 8 vCPUs, 32 GB of RAM, and EBS gp2 storage rated at 100 IOPS. I can compute the average passenger count on the benchmark NYC taxi cab dataset [1] in 0.551 seconds using direct I/O. The throughput is 2.37B rows/sec.
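The query itself is nothing exotic; roughly the following, assuming the usual taxi benchmark schema (the exact table name may differ):

    -- Setting the threshold to 1 byte effectively forces direct I/O,
    -- so the scan bypasses the page cache.
    SET min_bytes_to_use_direct_io = 1;

    SELECT avg(passenger_count)
    FROM trips;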
ClickHouse is so fast on raw scans that many production users don't even use materialized views. I mostly use them to get responses down to a few milliseconds for demos.
I'd argue ClickHouse isn't even that fast (compared to comparable technology like Snowflake, Redshift, or BigQuery); the real issue is that the ScyllaDB example is completely misleading. Scylla is probably one of the fastest OLTP datastores, yet they're benchmarking an analytics query, which is pretty easy to crack for any columnar datastore.
The actual point here is that you can execute millions of (different!) individual queries per second on ScyllaDB, which beats any columnar datastore hands down. ClickHouse "cheated" here by translating the (unfortunate) benchmark setup into a single query that's extremely heavily optimized under the hood.
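To sketch the difference (table and column names invented): the spirit of the benchmark is millions of separate point lookups, one per sensor, while the ClickHouse version answers every sensor in one scan.

    -- What the benchmark nominally measures, issued millions of times
    -- with different sensor_id values:
    SELECT max(temperature) FROM readings WHERE sensor_id = 12345;

    -- What effectively ran instead: a single pass over everything.
    SELECT sensor_id, max(temperature)
    FROM readings
    GROUP BY sensor_id;

Equivalent output, but completely different workloads.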
Actually, while ClickHouse doesn't have all the features of Redshift, BigQuery, etc., it's usually much faster than they are. It can be slower than GPU-powered systems on some workloads, when all the data fits in GPU memory, but that's not the use case it targets.
ScyllaDB is amazing when it comes to OLTP performance, but not for analytical workloads.
I think they took pretty mediocre analytical-workload results and presented them as something outstanding.
The assumption that GPU workloads are restricted to tiny, single-GPU-memory datasets is increasingly wrong for these assessments.
GPU compute stacks are increasingly geared toward multi-GPU/multi-node setups and streaming, especially given the crazy bandwidth they're now built for (2 TB/s for a DGX-2 node?). Likewise, per-GPU and per-node GPU memory are going up nicely each year (16-24 GB per GPU, 100-512 GB per node, with TBs of RAM attached on the same node). If you saturate that, the network is more likely to become the bottleneck than your DB :)
Though in practice I mostly do single-GPU streaming, because I like not having to think about multi-node, and single GPUs are pretty cheap now :)
ClickHouse is also crazy fast without materialized views. I've only done some PoCs against it, but after loading a largish data set of raw invoice CSVs, I was very impressed with the performance compared to our standard RDBMS.
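A load like that boils down to a table plus a streamed insert. A minimal sketch, with a schema invented for illustration (the FROM INFILE form assumes a reasonably recent clickhouse-client):

    CREATE TABLE invoices (
        invoice_id String,
        issued     Date,
        customer   String,
        amount     Decimal(18, 2)
    ) ENGINE = MergeTree
    ORDER BY (customer, issued);

    -- Run from clickhouse-client; streams the CSV straight into the table.
    INSERT INTO invoices FROM INFILE 'invoices.csv' FORMAT CSVWithNames;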
That sounds like a non-canonical use of ClickHouse. Wouldn't a good RDBMS be a better fit for invoice data? That's just a surface impression, of course; I'm really interested in what this invoice data looks like and what queries you're trying to run on it.