Bloom filters allow you to prune the number of files you even have to look at, which matters a lot when there is a cost associated with scanning the files.
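A minimal sketch of the idea (a toy bloom filter, not any particular library's implementation; file names and the "container-abc" key are made up): each file carries a small bit array, and a negative answer means the file definitely doesn't contain the value, so it can be skipped without being read.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means "definitely not present": the file can be skipped.
        # True means "maybe present": the file must still be scanned.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# One filter per parquet file, consulted before opening the file itself.
per_file = {"part-0001.parquet": BloomFilter(),
            "part-0002.parquet": BloomFilter()}
per_file["part-0001.parquet"].add("container-abc")

candidates = [f for f, bf in per_file.items()
              if bf.might_contain("container-abc")]
```

The pruning is probabilistic in one direction only: false positives cost you a wasted scan, but a false negative can never happen, so no matching file is ever skipped.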
Partitioning the data can be advantageous both for pruning (on its sort keys) and for parallel query fan-out (you independently scan and apply the predicate to the high-cardinality column in each partition concurrently).
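The fan-out case can be sketched like this (partition paths, row contents, and the scan function are all hypothetical stand-ins for reading a parquet file and filtering a column chunk):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partition layout: one file per region, rows inlined for the
# sketch. A real scan would open the parquet file at each path instead.
partitions = {
    "region=us/part.parquet": [{"container_id": "abc", "v": 1}],
    "region=eu/part.parquet": [{"container_id": "xyz", "v": 2}],
}

def scan(path, predicate):
    # Apply the predicate to one partition's rows independently.
    return [row for row in partitions[path] if predicate(row)]

# Fan the same predicate out across all partitions concurrently,
# then merge the per-partition results.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(scan, path, lambda r: r["container_id"] == "abc")
               for path in partitions]
    rows = [row for f in futures for row in f.result()]
```

Note that this fan-out still touches every partition; it trades latency for total work, which is exactly the trade-off the article's setup wants to avoid.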
In the use case that underpins the article, they want to minimize unnecessary access to parquet file data because it lives on a high-latency storage system and the compute infrastructure is not meant to be scaled up to match the number of partitions. So they just want an index to help find the data in the high-cardinality column.
Partitioning also prunes the number of files to be looked at. Only the directory structure of the files (or the prefixes of the objects in S3) needs to be checked.
Partitioning only prunes the files you need to look at *if* the predicate includes the column you're partitioning on.
For example, let's imagine you're partitioning on "time" and "region" and the high-cardinality column is "container_id". Now imagine you run a query that filters on a particular container_id but spans all time and all regions. You'd have to scan through the "container_id" chunks of all your parquet files. Indices on your high-cardinality data allow you to know which column chunks have data that matches your predicate (and bloom filters will tell you that probabilistically). In the example, without such indices you'd have to scan through all data unless you also have predicates on "time" and "region".
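The contrast can be made concrete with a sketch (hive-style paths and the index contents are illustrative; an exact per-file set stands in for a bloom filter):

```python
# Illustrative object keys with hive-style partition segments.
keys = [
    "time=2024-01-01/region=us/part-0.parquet",
    "time=2024-01-01/region=eu/part-0.parquet",
    "time=2024-01-02/region=us/part-0.parquet",
]

def partition_prune(keys, **preds):
    """Keep keys whose partition path segments satisfy every predicate."""
    out = []
    for key in keys:
        segs = dict(s.split("=") for s in key.split("/")[:-1])
        # A predicate can only prune on partition columns; a predicate on
        # any other column (e.g. container_id) never appears in the path,
        # so it cannot eliminate anything here.
        if all(segs.get(col, val) == val for col, val in preds.items()):
            out.append(key)
    return out

# Predicate on a partition column: the path alone prunes two of three files.
by_time = partition_prune(keys, time="2024-01-02")

# Predicate only on container_id: nothing is pruned, all files survive.
by_container = partition_prune(keys, container_id="abc")

# A per-file index on the high-cardinality column (an exact set here,
# standing in for a bloom filter) cuts the survivors down to one file.
container_index = {keys[0]: {"abc", "def"},
                   keys[1]: {"xyz"},
                   keys[2]: {"def"}}
to_read = [k for k in by_container if "abc" in container_index[k]]
```

With a real bloom filter instead of the exact set, `to_read` could occasionally include a false-positive file, but it would never drop a file that actually holds the value.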