Can you tell us more about your scaling issues with Temporal? I haven't yet used...

ub-volta-toss · on May 16, 2024

I'm not sure how Uber does it, but it might be because they're using Cadence instead.

The Temporal team has acknowledged that Cassandra-backed Temporal hits scaling limits pretty fast.

The limitations aren't a clean "X actions/sec", they're sneakier. Because you can run X/sec for days and then the memory on the history service will spike, or any tiny slowdown in the DB will cause looping degradation. There are nasty feedback loops hidden in Temporal that turn small problems into very very large problems.

I think the core problem with Temporal is the way its sharded. This affects history service and its caches. If anything tells them to reload or restart, or that any of the nodes are unreachable, you get a retry storm on the DB.

In addition to these issues, Temporal can create feedback loops within itself. I've seen cases where it would not return to health, even with 0 workers requesting work for 10s of minutes.

We could have kept using and scaling Temporal, but it required 10-30x the resources of building something else. And it was scary to administer. You really need an entire team. You can't have somebody who isn't a dedicated engineer take on-call for it.

claytonjy · on May 17, 2024

What did you move to instead?