An insider told me that their internal secret store became overloaded, which caused every other service to suddenly lose access to its credentials.
Roblox is pretty open about being all in on HashiCorp stuff, so that would imply major issues with their Vault implementation. Going to be a very interesting postmortem...
It can sometimes be very hard to cold start in such circumstances, because the demand from all the services trying to access their secrets at once can create a “thundering herd” that knocks the recently restored service over again.
My guess is that they’ve had to shut most services down, restore the secret store, and then bring up internal services in a careful sequence to avoid a thundering herd. At $JOB even understanding the correct sequence to cold start our organization would probably take more than 24hrs.
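For what it's worth, the usual client-side mitigation for that thundering herd is capped exponential backoff with jitter, so thousands of restarting services don't all hit the secret store in lockstep. A rough, generic Python sketch (nothing Roblox- or Vault-specific; the connect callable stands in for whatever call fetches your credentials):

    import random
    import time

    def connect_with_backoff(connect, max_attempts=8, base=1.0, cap=60.0):
        # Capped exponential backoff with "full jitter": each retry sleeps a
        # random amount up to min(cap, base * 2^attempt), so a fleet of
        # restarting clients spreads its retries out instead of retrying in sync.
        for attempt in range(max_attempts):
            try:
                return connect()
            except ConnectionError:
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        raise RuntimeError("dependency still unreachable after backoff")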
Back in college (90’s), we had a Solaris shop. One time after a long power outage we discovered that we couldn’t bring up the system because of a circular dependency.
It’s been decades, but it was something along the lines of a NIS server depending on a file server and vice versa. We managed to manually get the services running and then do a clean reboot after the rest of the systems came up.
Just one more reason I don't take the "engineering" part of "software engineering" seriously. Hardly surprising given the admiration of tinkering and hacking that runs through the industry.
modern software spans the globe, often communicating with 3rd party systems, operates 24/7/365, and is often at the whims of deadlines. it's some of the most technically challenging endeavors in existence.
it's not really surprising how you'd end up in these situations.
that being said: rate limits and error handling, people. jesus. =)
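In case "rate limits" sounds abstract: the minimal version is a token bucket, which allows short bursts, refills steadily, and sheds or delays whatever exceeds it. A toy Python sketch, not tied to any particular stack:

    import time

    class TokenBucket:
        # Allows bursts up to `capacity` requests, refilling at `rate` tokens/second.
        def __init__(self, rate, capacity):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller should reject or delay this request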
The 19th century is full of “real engineering” that is frankly clown shoes stuff. Stories about how early truss bridges were designed make modern software engineering seem quite orderly by comparison.
I've always been amused at this comparison. You're not wrong. We wouldn't tolerate or excuse a group of people attempting engineering based on 19th century practices. Why should we excuse software engineers (if, indeed, their goal is to be "engineers" in any meaningful sense) for not building on the last two centuries of engineering knowledge?
Because software engineering is not the same as mechanical engineering. We’ve tried to apply mechanical engineering principles to software projects; it didn’t work.
No idea about their infra, but for a complex enough system built in Silicon Valley move-fast style (and they are likely in that bucket), restarting after such a failure can be a huge problem. You can have thundering herds that you may not have levers to control, the infrastructure needed to build and deploy fixes may itself be down, you can have cyclical dependencies that snuck in in the past and make it impossible to start up some systems, etc, etc
I really hate overengineered cloud infrastructure. It's the 2020s equivalent of 1990s to early 2000s enterprise Java except it's "use every conceivable cloud service in as baroque a deployment as possible" instead of "use every single design pattern in every part of the program."
I refer to software built this way as "write once run once" as in it's so complex it could barely be redeployed and might not even be possible to restart. "Write once run once" is a play on the old "write once run anywhere" tagline for Java.
But hey it makes Amazon rich. Amazon figured out how to monetize the tendency of sophomore engineers to add as much complexity as possible. I bet the authors of the original 1994 design patterns book are kicking themselves for not finding a way to charge for every implementation of the factory pattern.
What makes you believe Java apps after the early 2000s are any better? Spring apps = Factory pattern apps on steroids. But why single out Java? Any backend landscape at scale looks like that, mostly because kids want to pad their resumes with cool shit, or so they think. E.g. everything has to be fluent/reactive/async for some reason now - not a problem you could blame on "managers".
That reminds me of the first company I worked for. Rumor had it that a handful of their engineers got their hands on the Design Patterns book, read it, and decided on a single design pattern that they'd use for everything.
Totally. You'll have stale caches that need to be regenerated all at the same time (vs incrementally during steady operation), backed up queues, users themselves doing things like rapid refreshes, etc.
Ironically, recovering after downtime is usually harder if you have downtime relatively rarely, because these problems don't get exposed.
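One cheap trick against the "every cache entry needs regenerating at the same time" spike is to jitter the TTL when entries are populated, so expiry (and therefore regeneration) is spread out instead of landing all at once. A tiny illustrative sketch, numbers made up:

    import random
    import time

    def jittered_expiry(base_ttl_seconds, jitter_fraction=0.2):
        # Randomize each entry's TTL by +/- 20% so entries written together
        # don't all expire, and need regenerating, at the same moment.
        jitter = random.uniform(-jitter_fraction, jitter_fraction) * base_ttl_seconds
        return time.time() + base_ttl_seconds + jitter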
> You can have thundering herds that you may not have levers to control
To me, this seems almost unthinkable. You mean to tell me that you can't just coordinate setting “replicas: 0” for all of your services (or the equivalent) and then restart them one at a time? If that's really the case, then perhaps it's time to re-evaluate how that situation ever came to be - where's the technical leadership who kept systems scalable, yet manageable? Was that even a concern at any point in time?
Sadly, my personal experience indicates that many enterprises don't give it any thought and just expect things to work, oftentimes not even knowing about all of their services in enough detail to be able to answer questions about their architecture, nor having a clear way of managing their service instances.
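To make the "replicas: 0" idea concrete: on Kubernetes this can be as unsophisticated as a script that shells out to kubectl, scales everything to zero, then brings deployments back in dependency order with a settling pause between them. A hypothetical sketch; the deployment names and ordering are invented, and a real runbook would wait on kubectl rollout status instead of sleeping:

    import subprocess
    import time

    # Invented example order: each deployment's dependencies appear before it.
    STARTUP_ORDER = ["secret-store", "auth", "matchmaking", "web-frontend"]

    def scale(deployment, replicas):
        # Assumes the current kube context already points at the right cluster.
        subprocess.run(
            ["kubectl", "scale", f"deployment/{deployment}", f"--replicas={replicas}"],
            check=True,
        )

    def cold_start(replicas_per_service=3, settle_seconds=60):
        # Take everything down first (in reverse order), then bring services back
        # one at a time, giving each a settling period before adding more load.
        for name in reversed(STARTUP_ORDER):
            scale(name, 0)
        for name in STARTUP_ORDER:
            scale(name, replicas_per_service)
            time.sleep(settle_seconds)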
> You can have cyclical dependencies that snuck in in the past and make it impossible to start up some systems
This does seem like an application problem. If an application cannot reach external services, it should either:
- return an error for all service calls that require this external integration
- queue up these requests internally to be retried later, if there are the necessary resources for this (RabbitMQ, or just a lot of RAM)
Having an application fail fast when one of its dependencies is down makes sense, of course, but it's also a really dangerous approach, since just one cyclic dependency sneaking past could break everything. Perhaps it's better to avoid such situations entirely and not make your app crash if it cannot reach an external service, but merely have it try again later? (Rough sketch of the second option below.)
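As a rough illustration of the second option above (queue up and retry later), assuming a bounded in-memory buffer is acceptable; everything here is hypothetical:

    import queue
    import threading
    import time

    class OutboundBuffer:
        # Instead of crashing when an external dependency is unreachable,
        # buffer calls in memory and retry them in the background. The buffer
        # is bounded, so a long outage degrades to errors rather than OOM.
        def __init__(self, send, max_pending=10000, retry_interval=5.0):
            self.send = send  # callable that performs the external call
            self.pending = queue.Queue(maxsize=max_pending)
            self.retry_interval = retry_interval
            threading.Thread(target=self._drain, daemon=True).start()

        def submit(self, request):
            try:
                self.pending.put_nowait(request)
                return True
            except queue.Full:
                return False  # caller returns an error for this one call

        def _drain(self):
            while True:
                request = self.pending.get()
                try:
                    self.send(request)
                except ConnectionError:
                    time.sleep(self.retry_interval)
                    self.submit(request)  # re-queue; dropped only if the buffer refilled meanwhile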
Roblox has been around for 16 years. I would be shocked to learn that 16 years ago they planned the architecture of their system so meticulously as to account for dependencies and problems you describe for the long term future. Odds are some novice programmer or a few of them hacked something together, and as the platform grew they just kept building on top of those early hacks.
It might take a huge investment to correct it now. But perhaps an outage like this will prove to them that it's necessary.
(N.B.: I did not vote up or down on your comment.)
> I would be shocked to learn that 16 years ago they planned the architecture of their system so meticulously as to account for dependencies and problems you describe for the long term future.
If their architecture hasn't changed much in that long of a time, then that is impressive, both in a positive and negative sense.
In regards to the positive aspects, that indeed makes you consider the thought that must have been put in the system design and how it must have been good enough to survive this long.
However, there are also the negative aspects - if there are single points of failure, or even scalability issues which may lead to a lot of downtime, then it's almost certain that rewrites in some capacity will need to be carried out. A system design that worked 16 years ago will probably run into some roadblocks today, much like C10k was a problem back in the day but has largely been solved in many situations.
Either way, long term downtime doesn't reflect positively on the current state of the overall system.
(No worries about the upvotes/downvotes, i merely mentioned that fact in case it was my tone that was inadequate, or perhaps there were technical inaccuracies or just inactionable advice given on my part - being told exactly why i'm wrong is helpful not just to me but to others as well.)
Clarification for the downvote(s) since the person didn't provide any: in my current workplace, i actually brought this very issue to the attention of all the stakeholders and over the past few months have been working on improving the applications that we develop: updating the frameworks and libraries to recent, maintained and more secure versions, using containers and Ansible to improve configuration management and change management, as well as to get rid of years of technical debt.
A part of this was needing to understand how they truly process their configuration, when they choose to fail to start and how they attempt to address external systems - in my mind, it's imperative for the long term success of any project to have a clear understanding of all of this, instead of solving it on an "ad hoc" basis for each separate feature change request.
To give you an example:
- now almost everything runs in containers with Docker or Docker Compose (depending on the use case), with extremely clear information about what belongs in which environment and how many replicas exist for each service. A part of testing all of this was actually taking down certain parts of the system to identify any and all dependencies that exist and creating action plans for them.
- for example, if i want to build a new application version i won't be able to do that if the GitLab CI is down, or if the GitLab CI server cannot reach the Nexus registry which runs within containers, so if the server that it's on goes down, we'll need a fallback - in this case, temporarily using the public Docker image from Docker Hub, before restoring the functionality of Nexus through either Ansible or manually, and then proceeding with the rest of the servers.
- it's gotten to the point where we could wipe any or all of the servers within that project's infrastructure and, as long as we have the sources for the projects within GitLab or within a backed up copy of it, we could restore everything to a working state with a few blank VMs and either GitLab CI or by launching local containers with the very same commands that the CI server executes (given the proper permissions).
Of course, make no mistake - it's been an uphill battle every step of the way and frankly i've delivered way too many versions of software and configuration changes late into the evening, because for some reason we don't have an entire team of DevOps specialists for this - just me and other developers to onboard. Getting the applications not to fail-fast whenever external services are unavailable has also been a battle, especially given the old framework versions and inconsistent, CV driven development over the years, yet in my eyes it's a battle worth fighting.
Not only that, but the results of this are extremely clear - no more depressive thoughts after connecting to the server for some environment through SSH and having to wonder whether it uses sysvinit, systemd, random scripts or something else for environment management. No more wondering about which ports are configured and how the httpd/Nginx instances map to Tomcat ports, since all of that is now formalized in a Compose stack. No more worrying about resource limits and old badly written services with rogue GC eating up all of the resources and making the entire server grind to a halt. No more configuration and data that's strewn across the POSIX file system, everything's now under /app for each stack. There hasn't been an outage for these services in two months, their resource usage is consistent, they have health checks and automatic restarts (if ever needed). Furthermore, suddenly it becomes extremely easy to add something like Nexus, or Zabbix, or Matomo or Skywalking to the infrastructure because they're just containers.
Therefore, i'd posit that the things i mentioned in the original post are not only feasible, but also necessary - to both help in recovering from failures, but also to make the way everything runs more clear and intuitive for people. And it pains me every time when i join a new project with the goal of consulting some other enterprise and generating value with some software, but instead see something like a neglected codebase with not even a README and the expectation for the developers to pull out ideas on how to run local environments out of thin air.
If you see environments like that, you're not set up for success, but rather failure. If you see environments like that, consider either fixing them or leaving.
Why do you doubt they were built in Silicon Valley move-fast style? Because it's old? It was released in 2004, the same year as Facebook (both of them had different names originally).
I don’t know much about Roblox other than it’s some multiplayer game platform.
Is it down totally with nothing working, or just extremely overloaded but with some requests still getting through?
If it’s still totally down after 24 hours, it must be pretty serious. I’ve seen my fair share of production fires, but usually you can bring some of the service back up quickly and then fight with the thundering herds and load and eventually stabilise the service.
I hope they put out a public post-mortem afterwards.
Yesterday, about an hour or two after the incident began, my kids and I were able to get into a Roblox instance with some unusual processes. (Logging into an account wouldn't work and some other website functionality was screwy, but if you were still logged in, you could do some things to get into some Roblox instances.)
A couple hours later, when trying to play Roblox again, our tricks no longer worked and we eventually gave up. But at least at the start, some could still play.
It points to it being a serious infrastructure issue really. You don't have a hugely popular platform be dead for two days unless there is some fundamental issue with the infrastructure.
Likely some stupid misconfiguration in some overengineered "cloud" architecture is the root cause, and they're mired in trying to debug and resolve the problem, fighting the complexity of the system the whole way.
We had a situation like that once because of circular dependencies; services that needed another service in order to start. “How is that possible?” you ask. Code reviews: “take that code out and get the data from an existing service.”
We hard-coded data into a base service, pushed it to production, restarted services that depended on it, then services that depended on those services, etc.
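For anyone picturing what "hard-coded data into a base service" looks like: roughly, a baked-in last-known-good fallback so the service can come up even when its upstream (which may in turn depend on it) is unreachable. A hypothetical sketch, not the parent's actual code:

    # Hard-coded last-known-good values, used only to break the bootstrap loop.
    SEED_CONFIG = {
        "regions": ["us-east", "eu-west"],
        "feature_flags": {"new_matchmaking": False},
    }

    def load_bootstrap_config(fetch_from_upstream):
        try:
            return fetch_from_upstream()
        except ConnectionError:
            # Upstream is down (possibly because it depends on us): start from
            # the baked-in seed data and refresh once the cycle is broken.
            return dict(SEED_CONFIG)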
I interviewed with their ops team and turned it down. It just had a bad smell. I think they hire 'SREs' who can write good leet code. The problem with those folks for ops is that they over-engineer complex solutions and then completely forget the simple but very important stuff. I wish them luck in getting it back together.
Completely unfounded speculation—what if they were hit by a ransomware attack?
I wonder only because, well, after a full 24 hours... a purposefully malicious attack starts to seem more plausible to me than a simple freak accident.
Roblox powers much of the internet? I find that unlikely. If you don't have kids, this is probably the first you've heard of it. I guess there's the "my kid is complaining to me that it's down and now I know about it" angle, but it's hard to draw any conclusions from your kid's purchasing habits alone.
Right, but do you really need an outage for everyone to find that out? It's hard to notice that fastly powers a given site, but it's trivial to notice the game your kid's playing.
There's probably a lesson in here about managing incidents: very few updates or acknowledgments that there's even a major problem.
Will be interesting to find out what caused an incident big enough to kill their entire platform for so long (if they release details, or they come out through the grapevine).
The status page acknowledges there is an incident and that it's pretty bad — everything is marked as down, and it's in red! What more could you want? (/s)
Compare to Azure: most incidents never see the status page. Hell, getting support to acknowledge that an incident exists even once your internal investigation has reached certainty on "oh, yeah, it was them" is hard. There was an AAD outage earlier this year (?; IDK — I've lost track of the passage of time in the pandemic…) and the status page was down, and even once you managed to get the status page (IIRC you could hit the IP if you magically knew it, which the Twitterverse did) … most services were still green, even if completely offline as far as one could tell by issuing queries to the service…
And I'm comparing a kid's game with a "major" cloud PaaS…
> Compare to Azure: most incidents never see the status page.
That sounds like a conscious decision on their part. Everyone always talks about disclosure being the best policy, but at the same time there are plenty who believe that not informing anyone about an outage or even a breach is the correct thing to do, since then they'll probably get into less trouble themselves, or at least will create the illusion of not having as many outages as "those other guys".
After all, informing everyone about an outage that will noticeably affect only some probably has a larger impact on the company's reputation than having it be dragged through the mud in smaller communities for its dishonesty. Then again, with many of the larger services, it's not like you have much of a choice about using or not using them - you just get a corporate policy passed down to you and that's that.
Thus, sweeping problems under the rug and pretending that they don't exist is a dishonest, yet valid way of handling outages, breaches and so on. Personally, i'd avoid any service that does that, though it's not like that's being done just because of incompetence.
> That sounds like a conscious decision on their part.
It is; I've been told by their support that they don't want to cause alarm.
I think it's a bad way to run a PaaS, though. If I'm looking at your status page, it is because I suspect an outage and am trying to confirm it. I'm very willing to give some leeway to fix problems (an SLA — and Azure could do better here too — exists to establish what that allowable leeway is); I just need to know "is it me, or not?" and it's nicer to just get the answer when it's not me. As it is, I have to jump through a support hoop to get at "I think you are having an outage", and even then it's typically multiple cycles before support seems to query engineering (and — that's another problem: support doesn't just know that there is an outstanding issue…) and gets to the bottom of it.
It needs to be easy for a customer, experiencing an issue with a service, to drive resolution of that problem. I can forgive small service outages, but it's this lack of any ability to get resolution or closure or some "yeah, we had a failure, here's what we're doing to prevent it going forward" that is the real problem.
Sadly, there is only so much choice I have in the matter of which cloud provider we're using…
I'll honestly extend that to all major cloud providers - AWS has small hiccups now and then that were never recorded, and Google Workspace describes "a problem to a subset of users" when it's clearly worldwide.
Seems to me that a routing / networking issue would have been resolved by now, and an application bug would have been rolled back.
If I were to speculate, I would say that it must have something to do with databases / storage. Something must have gone wrong and broke some database, and it’s difficult to restore.
I agree that the DB is the prime suspect here. They said the problem was identified over 24h ago; can't imagine it being something other than a data issue that takes that long to resolve.
How is it an “unfair” advantage? It’s just an advantage. It’s not like getting a good nights sleep is something that requires an exorbitant amount of money.
Kind of tongue in cheek; parents know the struggle of getting kids to sleep. It takes dedication and effort to not just wait until they decide on their own to go to sleep.
This dedication and effort requires time and patience, and with more kids raised without multiple adults (time is zero sum), that time is missing, and those kids ‘naturally’ decide on much less sleep than they need (because as we all know, the distractions available are amazingly fun).
Probably. Roblox's original value proposition of 'kids making games for kids' has devolved into 'adults making dopamine slot-machines for kids' which is pretty sad :(
The various Open Simulator grids (known collectively as the hypergrid) are interoperable like that. As long as the one you created your avatar on (which can easily be a self-hosted one) is up, there are many worlds to teleport to.
Wow... seems like a good time to sell Roblox stock anyway... because most of Roblox's popular games appear to have been cloned into Minecraft... earnings are coming up real soon though... I guess we'll find out in less than a week.
edit: I edited this comment a bit to make it clearer because HN would not let me reply to the child comment.
I would think the opposite. This is a short term thing that presumably won't affect the business in any meaningful way. They will (probably) recover and be fine and nobody will be too worried that a kid's game has some, limited, downtime from time to time. If this outage causes a drop in share price, it's probably a good time to pick up a bit at a discount.
I'm slightly embarrassed I didn't think of that, since I'm Catholic and have given up reddit for Lent in the past...
I guess I misinterpreted your previous comment as an entire community giving up the same thing together, rather than each individual choosing a specific thing.