It can sometimes be very hard to cold start in such circumstances, because the demand from all the services trying to access their secrets at once can create a “thundering herd” that knocks the recently restored service over again.
My guess is that they’ve had to shut most services down, restore the secret store, and then bring up internal services in a careful sequence to avoid a thundering herd. At $JOB even understanding the correct sequence to cold start our organization would probably take more than 24hrs.
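Even just adding jittered backoff to the clients helps a lot there. A rough sketch of the idea (`fetch_secret` stands in for whatever vault/secret-store client is actually in use; nothing here is specific to any particular product):

```python
import random
import time

def fetch_with_backoff(fetch_secret, name, max_attempts=8, base=1.0, cap=120.0):
    """Fetch a secret with capped exponential backoff and full jitter,
    so thousands of restarting services don't hammer the secret store
    in lockstep right after a cold start."""
    for attempt in range(max_attempts):
        try:
            return fetch_secret(name)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter spreads the retries out, so the recovering store sees a trickle of requests instead of a synchronized wall of them.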
Back in college (the ’90s), we had a Solaris shop. One time, after a long power outage, we discovered that we couldn’t bring the system up because of a circular dependency.
It’s been decades, but it was something along the lines of a NIS server depending on a file server and vice versa. We managed to get the services running manually and then do a clean reboot after the rest of the systems came up.
Just one more reason I don't take the "engineering" part of "software engineering" seriously. Hardly surprising given the admiration of tinkering and hacking that runs through the industry.
Modern software spans the globe, often communicates with third-party systems, operates 24/7/365, and is often at the whims of deadlines. It's some of the most technically challenging work in existence.
It's not really surprising that you end up in these situations.
That being said: rate limits and error handling, people. Jesus. =)
The 19th century is full of “real engineering” that is frankly clown-shoes stuff. Stories about how early truss bridges were designed make modern software engineering seem quite orderly by comparison.
I've always been amused at this comparison. You're not wrong. We wouldn't tolerate or excuse a group of people attempting engineering based on 19th century practices. Why should we excuse software engineers (if, indeed, their goal is to be "engineers" in any meaningful sense) for not building on the last two centuries of engineering knowledge?
Because software engineering is not the same as mechanical engineering. We’ve tried to apply mechanical engineering principles to software projects; it didn’t work.
No idea about their infra, but for anything complex enough and built in Silicon Valley move-fast style (and they are likely in that bucket), restarting after such a failure can be a huge problem. You can have thundering herds that you may not have levers to control, the infrastructure needed to build and deploy fixes may itself be down, you can have cyclical dependencies that crept in at some point in the past and make it impossible to start up some systems, etc., etc.
I really hate overengineered cloud infrastructure. It's the 2020s equivalent of 1990s to early 2000s enterprise Java except it's "use every conceivable cloud service in as baroque a deployment as possible" instead of "use every single design pattern in every part of the program."
I refer to software built this way as "write once run once" as in it's so complex it could barely be redeployed and might not even be possible to restart. "Write once run once" is a play on the old "write once run anywhere" tagline for Java.
But hey it makes Amazon rich. Amazon figured out how to monetize the tendency of sophomore engineers to add as much complexity as possible. I bet the authors of the original 1994 design patterns book are kicking themselves for not finding a way to charge for every implementation of the factory pattern.
What makes you believe Java apps after the early 2000s are any better? Spring apps = Factory pattern apps on steroids. But why pick on Java specifically? Any backend landscape at scale looks like that, mostly because kids want to pad their resumes with cool shit, or so they think. E.g. everything has to be fluent/reactive/async for some reason now - not a problem you could blame on "managers".
That reminds me of the first company I worked for. Rumor had it that a handful of their engineers got their hands on the Design Patterns book, read it, and decided on a single design pattern that they'd use for everything.
Totally. You'll have stale caches that need to be regenerated all at the same time (vs. incrementally during steady operation), backed-up queues, users themselves doing things like rapid refreshes, etc.
Ironically recovering after downtime is usually harder if you have downtime relatively rarely, because these problems don't get exposed.
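A common mitigation for the cache side of this is a per-key lock plus TTL jitter, so a cold cache refills gradually instead of everything regenerating (and expiring again) in lockstep. A sketch, in-process only and with `load_from_db` as a placeholder for the real loader:

```python
import random
import threading
import time

_locks = {}   # per-key locks; single-process, illustrative only
_cache = {}   # key -> (value, expires_at)

def get_with_stampede_protection(key, load_from_db, ttl=300):
    """Return a cached value, letting only one caller per key regenerate it,
    and jittering the TTL so entries don't all expire (and refill) together."""
    entry = _cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]
    lock = _locks.setdefault(key, threading.Lock())
    with lock:
        # Re-check after acquiring the lock: another thread may have refilled it.
        entry = _cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]
        value = load_from_db(key)
        jittered_ttl = ttl * random.uniform(0.8, 1.2)
        _cache[key] = (value, time.time() + jittered_ttl)
        return value
```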
> You can have thundering herds that you may not have levers to control
To me, this seems almost unthinkable. You mean to tell me that you can't just coordinate setting "replicas: 0" for all of your services (or the equivalent) and then restart them one at a time? If that's really the case, then perhaps it's time to re-evaluate how that situation ever came to be - where's the technical leadership that kept the systems scalable yet manageable? Was that even a concern at any point in time?
Sadly, my personal experience indicates that many enterprises don't give it any thought and just expect things to work, oftentimes not even knowing about all of their services in enough detail to be able to answer questions about their architecture, nor having a clear way of managing their service instances.
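To be concrete, the mechanics of a staged restart don't have to be fancy. A rough sketch, assuming Kubernetes and a hand-maintained startup order (the deployment names here are invented for illustration):

```python
import subprocess

# Hand-maintained startup order: the most fundamental dependencies first.
# These deployment names are made up for illustration.
STARTUP_ORDER = ["secrets-proxy", "auth", "catalog", "matchmaking", "web-frontend"]

def scale(deployment, replicas):
    subprocess.run(
        ["kubectl", "scale", f"deployment/{deployment}", f"--replicas={replicas}"],
        check=True,
    )

def wait_ready(deployment):
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}", "--timeout=10m"],
        check=True,
    )

def staged_cold_start(desired_replicas):
    # Assumes everything has already been scaled to 0; bring services up
    # one tier at a time so each layer is healthy before its dependents start.
    for deployment in STARTUP_ORDER:
        scale(deployment, desired_replicas[deployment])
        wait_ready(deployment)
```

You'd run something like staged_cold_start({"secrets-proxy": 3, "auth": 10, ...}) once the underlying data stores are healthy. The hard part isn't the script, it's knowing the order.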
> You can have cyclical dependencies that crept in at some point in the past and make it impossible to start up some systems
This does seem like an application problem. If an application cannot reach external services, it should either:
- return an error for all service calls that require this external integration
- queue up these requests internally to be retried later, if the necessary resources for this exist (RabbitMQ, or just a lot of RAM)
Having an application fail fast when one of its dependencies is down makes sense, of course, but it's also a really dangerous approach, since just one cyclic dependency sneaking past could break everything. Perhaps it's better to avoid such situations entirely and not make your app crash if it cannot reach an external service, merely try again later?
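Something as simple as wrapping the dependency in a client that queues work and retries in the background already avoids the fail-fast trap. A sketch of that idea (`send` is a stand-in for the real external call, not any specific library):

```python
import queue
import threading
import time

class DegradedModeClient:
    """Wrap an external dependency so the app can start even when it's down:
    calls are queued and retried in the background instead of crashing the
    process at startup. `send` is a placeholder for the real outbound call."""

    def __init__(self, send, retry_interval=5.0):
        self._send = send
        self._retry_interval = retry_interval
        self._pending = queue.Queue(maxsize=10_000)  # bounded, so memory can't blow up
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, payload):
        try:
            self._pending.put_nowait(payload)
            return True
        except queue.Full:
            return False  # shed load instead of dying

    def _drain(self):
        while True:
            payload = self._pending.get()
            while True:
                try:
                    self._send(payload)
                    break
                except Exception:
                    time.sleep(self._retry_interval)
```

The bounded queue matters: "retry later" without a limit just turns an availability problem into an out-of-memory problem.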
Roblox has been around for 16 years. I would be shocked to learn that 16 years ago they planned the architecture of their system so meticulously as to account for the dependencies and problems you describe over the long term. Odds are a novice programmer or a few of them hacked something together, and as the platform grew they just kept piling onto those early hacks.
It might take a huge investment to correct it now. But perhaps an outage like this will prove to them that it's necessary.
(N.B.: I did not vote up or down on your comment.)
> I would be shocked to learn that 16 years ago they planned the architecture of their system so meticulously as to account for dependencies and problems you describe for the long term future.
If their architecture hasn't changed much in that long of a time, then that is impressive, both in a positive and negative sense.
As for the positive aspects, it indeed makes you consider the thought that must have been put into the system design and how it must have been good enough to survive this long.
However, there are also the negative aspects - if there are single points of failure, or even scalability issues which may lead to a lot of downtime, then it's almost certain that rewrites in some capacity will need to be carried out. A system design that worked 16 years ago will probably run into some roadblocks today, much like C10k was a problem back in the day but has largely been solved in many situations.
Either way, long term downtime doesn't reflect positively on the current state of the overall system.
(No worries about the upvotes/downvotes, i merely mentioned that fact in case it was my tone that was inadequate, or perhaps there were technical inaccuracies or just inactionable advice given on my part - being told exactly why i'm wrong is helpful not just to me but to others as well.)
Clarification for the downvote(s), since the person didn't provide a reason: in my current workplace, i actually brought this very issue to the attention of all the stakeholders and over the past few months have been working on improving the applications that we develop: updating the frameworks and libraries to recent, maintained and more secure versions, using containers and Ansible to improve configuration management and change management, as well as to get rid of years of technical debt.
A part of this was needing to understand how they truly process their configuration, when they choose to fail to start and how they attempt to address external systems - in my mind, it's imperative for the long term success of any project to have a clear understanding of all of this, instead of solving it on an "ad hoc" basis for each separate feature change request.
To give you an example:
- now almost everything runs in containers with Docker or Docker Compose (depending on the use case), with extremely clear information about what belongs in which environment and how many replicas exist for each service. A part of testing all of this was actually taking down certain parts of the system to identify any and all dependencies that exist and creating action plans for them.
- for example, if i want to build a new application version, i can't do that if GitLab CI is down, or if the GitLab CI server cannot reach the Nexus registry, which itself runs in containers - so if the server it's on goes down, we need a fallback: in this case, temporarily using the public Docker image from Docker Hub, before restoring Nexus either through Ansible or manually, and then proceeding with the rest of the servers.
- it's gotten to the point where we could wipe any or all of the servers within that project's infrastructure and, as long as we have the sources for the projects within GitLab or within a backed up copy of it, we could restore everything to a working state with a few blank VMs and either GitLab CI or by launching local containers with the very same commands that the CI server executes (given the proper permissions).
Of course, make no mistake - it's been an uphill battle every step of the way and frankly i've delivered way too many versions of software and configuration changes late into the evening, because for some reason we don't have an entire team of DevOps specialists for this - just me and other developers to onboard. Getting the applications not to fail-fast whenever external services are unavailable has also been a battle, especially given the old framework versions and inconsistent, CV driven development over the years, yet in my eyes it's a battle worth fighting.
Not only that, but the results of this are extremely clear - no more depressive thoughts after connecting to the server for some environment through SSH and having to wonder whether it uses sysvinit, systemd, random scripts or something else for environment management. No more wondering about which ports are configured and how the httpd/Nginx instances map to Tomcat ports, since all of that is now formalized in a Compose stack. No more worrying about resource limits and old badly written services with rogue GC eating up all of the resources and making the entire server grind to a halt. No more configuration and data that's strewn across the POSIX file system, everything's now under /app for each stack. There hasn't been an outage for these services in two months, their resource usage is consistent, they have health checks and automatic restarts (if ever needed). Furthermore, suddenly it becomes extremely easy to add something like Nexus, or Zabbix, or Matomo or Skywalking to the infrastructure because they're just containers.
Therefore, i'd posit that the things i mentioned in the original post are not only feasible, but also necessary - both to help recover from failures and to make the way everything runs clearer and more intuitive for people. And it pains me every time when i join a new project with the goal of consulting some other enterprise and generating value with some software, but instead see something like a neglected codebase with not even a README and the expectation that developers pull ideas on how to run local environments out of thin air.
If you see environments like that, you're not set up for success, but rather failure. If you see environments like that, consider either fixing them or leaving.
Why do you doubt they were built in Silicon Valley move-fast style? Because it's old? It was released in 2004, the same year as Facebook (both of them had different names originally).
I don’t know much about Roblox other than it’s some multiplayer game platform.
Is it down totally with nothing working, or just extremely overloaded but with some requests still getting through?
If it’s totally down still after 24hr it must be pretty serious. I’ve seen my fair share of production fires but usually you can bring some of the service back up quickly and then fight with the thundering herds and load and eventually stabilise the service.
I hope they put out a public post-mortem afterwards.
Yesterday, about an hour or two after the incident began, my kids and I were able to get into a Roblox instance through some unusual workarounds. (Logging into an account wouldn't work and some other website functionality was screwy, but if you were still logged in, you could do some things to get into some Roblox instances.)
A couple hours later, when trying to play Roblox again, our tricks no longer worked and we eventually gave up. But at least at the start, some could still play.
It points to it being a serious infrastructure issue really. You don't have a hugely popular platform be dead for two days unless there is some fundamental issue with the infrastructure.
Likely some stupid misconfiguration in some overengineered "cloud" architecture is the root cause, and they're mired in trying to debug and resolve the problem, fighting the complexity of the system the whole way.
We had a situation like that once because of circular dependencies: services that needed another service in order to start. “How is that possible?” you ask. Code reviews: “take that code out and get the data from an existing service.”
We hard-coded data into a base service, pushed it to production, restarted services that depended on it, then services that depended on those services, etc.
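For the curious, that trick looks roughly like this (a hypothetical sketch; the config-service URL and the region data are made up): prefer the live upstream, but never refuse to start because of it.

```python
import json
import urllib.request

# Baked-in fallback data, shipped with the service, so the process can start
# even when the upstream that normally provides this data is unreachable
# (e.g. because it depends on *this* service to come up).
BOOTSTRAP_REGIONS = [
    {"id": "us-east", "endpoint": "10.0.0.10"},
    {"id": "eu-west", "endpoint": "10.0.1.10"},
]

def load_regions(upstream_url="http://config-service.internal/regions"):
    """Prefer the live config service, but fall back to the baked-in data
    so a circular startup dependency can be broken."""
    try:
        with urllib.request.urlopen(upstream_url, timeout=2) as resp:
            return json.load(resp)
    except Exception:
        return BOOTSTRAP_REGIONS
```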