Just to make this discussion less abstract: is a groupby operation monotonic?

bubblyworld · on Sept 4, 2024

I think it's logically monotonic in the sense that as you add input data to your query, the output simply accumulates (you never have to retract any prior conclusions).

Consistency between two accumulators is easy, they just sum their states.

slt2021 · on Sept 5, 2024

you are right. but even in case where you have to retract prior conclusions, it is possible to to fold accumulators to maintain global consistency.

for example as you said: global_sum() = accum1.sum() + accum2.sum()

global_min():

  accum3 = [accum1.min(), accum2.min()]

  global_min = accum3.min()

any non-monotonic computation can be split into monotonic, where they process the bulk of the data, and folding the results to product the final result

bubblyworld · on Sept 5, 2024

Min is also logically monotonic in the same sense though, right? As a min accumulator I can throw away all past information once I've computed my result and never have consistency issues in light of new data.

A better example would be reachability analysis in a DAG - here you really do have monotonicity failure when merging results from multiple workers.

mehulashah · on Sept 4, 2024

Yes. Based on the definition.

mehulashah · on Sept 4, 2024

So I contradicted myself above. Aggregation is probably not monotonic -- the output of the aggregation is not a subset if the input is a subset.

slt2021 · on Sept 5, 2024

it is possible to split aggregation into parts that use commutative operators.

for example global AVG can be expressed as SUM()/COUNT()

although AVG is not commutative, SUM() and COUNT() are commutative and can be run in scatter/gather mode and offloaded to individual nodes