Materialized views are great, as the article showed. I use them to get query response times down to milliseconds, since they vastly reduce the amount of data ClickHouse must scan.
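For illustration, the pattern looks roughly like this (table and column names are invented here, not taken from the article):

    -- Raw sensor data.
    CREATE TABLE readings (
        sensor_id UInt32,
        ts        DateTime,
        temp      Float32
    ) ENGINE = MergeTree
    ORDER BY (sensor_id, ts);

    -- The view keeps one pre-aggregated row per sensor per day, so
    -- queries read the small rollup instead of the raw rows.
    CREATE MATERIALIZED VIEW readings_daily
    ENGINE = SummingMergeTree
    ORDER BY (sensor_id, day)
    AS SELECT
        sensor_id,
        toDate(ts) AS day,
        count()    AS samples,
        sum(temp)  AS temp_sum
    FROM readings
    GROUP BY sensor_id, day;

Averages then come from sum(temp_sum) / sum(samples) over the rollup, which is orders of magnitude less data to read.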
That said, there are a lot of other tools: column storage, vectorized query execution, efficient compression including per-column codecs, and skip indexes, to name a few. If you only have a few billion rows, it's still possible to get sub-second query results using brute-force scans.
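Those features all compose in a single table definition; an illustrative DDL sketch:

    CREATE TABLE events (
        ts   DateTime CODEC(DoubleDelta, LZ4),  -- delta-encodes timestamps
        temp Float32  CODEC(Gorilla),           -- codec designed for floats
        tag  String   CODEC(ZSTD(3)),
        INDEX temp_idx temp TYPE minmax GRANULARITY 4  -- skip index
    ) ENGINE = MergeTree
    ORDER BY ts;

The minmax skip index lets ClickHouse discard whole ranges of granules that can't match a WHERE clause on temp, before doing any decompression.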
Disclaimer: I work for Altinity, who wrote this article.
p.s. Loading the view is low-cost compared to loading the source data. On the NUC it takes 40-60 minutes, so worst case it's something like 1h / 17.5h = 5.71% of the load time. Also, you can still query the source data directly; that's fast for individual sensors, as the examples showed.
Yeah, I was confused; I couldn't tell what was precomputed stats (column min/max/count), what was view calculations, and what was actual performance -- even legacy SQL vendors do all of those. That's apples to oranges, more of a statement against the other DB than for ClickHouse. Likewise, the comparison I'd like to see is against _other columnar stores_.
I know some folks running one of the larger clickhouse instances out there... but this article made me trust the community less, not more.
Because it front-loaded all the operations so that they happen outside of the benchmark. Depending on what you want to do, that can make sense, but the original intention of the benchmark was a brute-force query benchmark.
Have you seen the sticker on the NUC? 116 billion rows per second at 233.61 GB/s. If you spend even a single second thinking about how absurd that number is, you'll see that the two benchmarks measure completely different things. Even with a quad-channel Xeon CPU you won't see significantly more than 100 GB/s of memory bandwidth. That 116 billion figure didn't actually happen; it's a synthetic number. The result of the query was computed during insertion of the temperature records, before the benchmark had even started, and then they just calculated the theoretical number of queries you'd have to run for an equivalent result and slapped that fictional number on their NUC.
That's a sticker from a ClickHouse community event, not related to the benchmark. We tend to stick them on anything flat. My ancient Dell XPS-13 has one. It's definitely not that fast.
That said, the sticker is from a real performance test. I assume it was a cluster but don't have details. ClickHouse query performance is outstanding--it's not hard to scan billions of rows per second on relatively modest hosts. These are brute force queries on source data, no optimization using materialized views or indexes.
For instance, I have an AWS m5d.2xlarge with 8 vCPUs, 32 GB of RAM, and EBS gp2 storage rated at 100 IOPS. I can compute the average passenger count on the benchmark NYC taxi cab dataset [1] in 0.551 seconds using direct I/O. The throughput is 2.37B rows/sec.
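The query itself is nothing exotic; roughly the following, assuming the usual taxi benchmark schema (the exact table name may differ):

    -- Setting the threshold to 1 byte effectively forces direct I/O,
    -- so the scan bypasses the page cache.
    SET min_bytes_to_use_direct_io = 1;

    SELECT avg(passenger_count)
    FROM trips;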
ClickHouse is so fast on raw scans that many production users don't even use materialized views. I mostly use them to get responses down to a few milliseconds for demos.
I'd argue ClickHouse isn't even that fast (compared to comparable technology like Snowflake, Redshift, or BigQuery); the real issue is that the ScyllaDB example is completely misleading. Scylla is probably one of the fastest OLTP datastores, yet they're benchmarking an analytics query, which is pretty easy to crack for any columnar datastore.
The actual point here is that you can execute millions of (different!) individual queries per second on ScyllaDB, which beats any columnar datastore hands down. ClickHouse "cheated" here by translating the (unfortunate) benchmark setup into a single query that's extremely heavily optimized under the hood.
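To sketch the difference (table and column names invented): the spirit of the benchmark is millions of separate point lookups, one per sensor, while the ClickHouse version answers every sensor in one scan.

    -- What the benchmark nominally measures, issued millions of times
    -- with different sensor_id values:
    SELECT max(temperature) FROM readings WHERE sensor_id = 12345;

    -- What effectively ran instead: a single pass over everything.
    SELECT sensor_id, max(temperature)
    FROM readings
    GROUP BY sensor_id;

Equivalent output, but completely different workloads.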
Actually, while ClickHouse doesn't have all the features of Redshift, BigQuery, etc., it's usually much faster than they are. It can be slower than GPU-powered systems on some workloads, when all the data fits in GPU memory, but that's not the use case it targets.
ScyllaDB is amazing when it comes to OLTP performance, but not for analytical workloads.
I think they took pretty mediocre analytical-workload results and presented them as something outstanding.
The assumption that GPU workloads are restricted to tiny, single-GPU-memory datasets is increasingly wrong for these assessments.
GPU compute stacks are increasingly geared toward multi-GPU/multi-node setups and streaming, especially given the crazy bandwidth they're now built for (2 TB/s for a DGX-2 node?). Likewise, per-GPU and per-node GPU memory are going up nicely each year (16-24 GB per GPU, 100-512 GB per node, with TBs of RAM attached on the same node). If you saturate that, the network is more likely to become the bottleneck than your DB :)
Though in practice I mostly do single-GPU streaming, because I like not having to think about multi-node, and single GPUs are pretty cheap now :)
ClickHouse is also crazy fast without materialized views. I've only done some PoCs against it, but after loading a largish data set of raw invoice CSVs, I was very impressed with the performance compared to our standard RDBMS.
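A load like that boils down to a table plus a streamed insert. A minimal sketch, with a schema invented for illustration (the FROM INFILE form assumes a reasonably recent clickhouse-client):

    CREATE TABLE invoices (
        invoice_id String,
        issued     Date,
        customer   String,
        amount     Decimal(18, 2)
    ) ENGINE = MergeTree
    ORDER BY (customer, issued);

    -- Run from clickhouse-client; streams the CSV straight into the table.
    INSERT INTO invoices FROM INFILE 'invoices.csv' FORMAT CSVWithNames;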
That sounds like a non-canonical use of ClickHouse. Wouldn't a good RDBMS be a better fit for invoice data? That's just a surface impression, of course; I'm really interested in what this invoice data looks like and what queries you're trying to run on it.