Hacker News

Temporal is really neat, but I think it's marketed at too many use cases.

After a year of high-scale Temporal work, I found it was only good for low-scale work.

The onboarding and learning curve were insanely difficult and complex. Ultimately, it doesn't scale as well as you'd think. The Temporal team invented their own database to get around this limitation.



Would love to hear more about the scale issues you saw. How many workflows or actions were too many? Which components started breaking down, and what were their failure modes?


See above. It's not so straightforward. You need enough headroom on each component that a negative feedback loop can start, eat resources, and still have enough time and resources to calm itself down before it hits some limit or degrades further.


Can you tell us more about your scaling issues with Temporal?

I haven't used it in production yet, but I would have expected a system that evolved out of Uber's Cadence [0] (and which I believe Uber uses extensively) to scale very well.

[0]: https://stackoverflow.com/a/61281435/1579058


I'm not sure how Uber does it, but it might be because they're using Cadence instead.

The Temporal team has acknowledged that Cassandra-backed Temporal hits scaling limits pretty fast.

The limitations aren't a clean "X actions/sec"; they're sneakier. You can run X/sec for days, and then memory on the history service will spike, or any tiny slowdown in the DB will cause looping degradation. There are nasty feedback loops hidden in Temporal that turn small problems into very, very large problems.

I think the core problem with Temporal is the way it's sharded. This affects the history service and its caches. If anything tells them to reload or restart, or reports that any of the nodes are unreachable, you get a retry storm on the DB.
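A rough sketch of why one unreachable node can turn into a burst of DB traffic. Temporal does split history into a fixed number of shards owned by history hosts, but the ownership scheme below (rendezvous hashing) and the host names are hypothetical stand-ins, not Temporal's actual membership logic. The point it illustrates: every shard whose owner changes has to reload its state from the database at roughly the same moment.

```python
import hashlib
from typing import Dict, List


def owner(shard_id: int, hosts: List[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: the host with the largest
    hash(shard, host) owns the shard. A hypothetical stand-in for Temporal's scheme."""
    def weight(host: str) -> int:
        digest = hashlib.sha256(f"{shard_id}:{host}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(hosts, key=weight)


def assign(num_shards: int, hosts: List[str]) -> Dict[int, str]:
    """Map every shard id to its owning host."""
    return {s: owner(s, hosts) for s in range(num_shards)}


def shards_to_reload(num_shards: int, before: List[str], after: List[str]) -> List[int]:
    """Shards whose owner changes between two membership views; in this model each
    one reloads its state (and rebuilds its caches) from the DB at about the same time."""
    old = assign(num_shards, before)
    new = assign(num_shards, after)
    return [s for s in range(num_shards) if old[s] != new[s]]


if __name__ == "__main__":
    hosts = ["history-1", "history-2", "history-3", "history-4"]  # hypothetical names
    moved = shards_to_reload(512, hosts, hosts[:3])  # history-4 becomes unreachable
    print(f"{len(moved)} of 512 shards reload at once")
```

With this scheme only the departed host's shards move, which is on average `num_shards / len(hosts)` of them. If those loads are slow enough to time out and retry, that single membership change becomes the retry storm on the DB.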

In addition to these issues, Temporal can create feedback loops within itself. I've seen cases where it would not return to health even after tens of minutes with 0 workers requesting work.
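The "never returns to health" behaviour can be reproduced in a toy model. This is a hypothetical sketch, not Temporal's internals: a FIFO server with a fixed per-tick capacity, clients that re-submit any request still queued after `timeout` ticks (with no retry budget), and original requests that stay queued and still consume capacity when they are finally served. A short capacity dip, standing in for a brief DB slowdown, is enough to push it into a state where every request it serves is one the client has already given up on.

```python
from collections import deque
from typing import Callable, List, Tuple


def simulate(
    ticks: int,
    arrivals: Callable[[int], int],
    capacity: Callable[[int], int],
    timeout: int,
) -> Tuple[List[int], List[int]]:
    """Toy FIFO server whose clients retry any request still queued after `timeout` ticks.

    Returns (useful completions per tick, queue length at the end of each tick).
    """
    queue = deque()  # each entry is [enqueued_at, retried_by_client]
    goodput: List[int] = []
    queue_len: List[int] = []
    for t in range(ticks):
        # New work from callers (e.g. workers polling for tasks).
        for _ in range(arrivals(t)):
            queue.append([t, False])

        # Serve up to capacity(t) requests, oldest first. A request the client
        # already retried still costs capacity but is wasted work.
        useful = 0
        for _ in range(min(capacity(t), len(queue))):
            _, retried = queue.popleft()
            if not retried:
                useful += 1

        # Clients give up on requests queued for `timeout` ticks and re-submit them.
        # The original copy stays queued: this is the amplification.
        retries = 0
        for entry in queue:
            if not entry[1] and t - entry[0] >= timeout:
                entry[1] = True
                retries += 1
        for _ in range(retries):
            queue.append([t, False])

        goodput.append(useful)
        queue_len.append(len(queue))
    return goodput, queue_len


def steady_then_idle(t: int) -> int:
    """8 new requests per tick, then callers stop entirely at t = 40."""
    return 8 if t < 40 else 0


if __name__ == "__main__":
    healthy, _ = simulate(80, steady_then_idle, lambda t: 10, timeout=3)
    dipped, q = simulate(80, steady_then_idle, lambda t: 0 if 10 <= t < 15 else 10, timeout=3)
    print(sum(healthy), sum(dipped))  # 320 80
    print(q[-1] > q[40])  # True: the backlog keeps growing with zero new arrivals
```

With no dip, goodput matches arrivals and the queue stays empty. With a five-tick dip, goodput drops to zero for good and the queue keeps growing after arrivals stop, because the retries alone exceed capacity. Giving the server more headroom (capacity 40 instead of 10 outside the dip) lets the same dip drain within two ticks, which is the headroom point made earlier in the thread.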

We could have kept using and scaling Temporal, but it required 10-30x the resources of building something else. And it was scary to administer. You really need an entire team. You can't have somebody who isn't a dedicated engineer take on-call for it.


What did you move to instead?


Invented their own database? They use Cassandra, IIRC.


Nope. They hit the scale limit with Cassandra and now have an in-house storage layer.



