Roblox has been down for over 24 hours (roblox.com)
238 points by intunderflow on Oct 29, 2021 | hide | past | favorite | 117 comments


An insider told me that their internal secret store became overloaded, which caused every other service to suddenly fail to have access to any credentials.


Roblox is pretty open about being all-in on HashiCorp stuff, so that would imply major issues with their Vault implementation. Going to be a very interesting postmortem...


Some interesting info on their infrastructure here

https://www.hashicorp.com/case-studies/roblox


“4 SREs…” My heart goes out to each of those four. This must have been the worst 48 hours of their careers.


I guess this explains why their recruiters are looking for SREs lol


It's been over 72 hours now.


It's still going now. This is especially bad for Roblox due to all of the Halloween events going on across the platform. Yikes!


imho, it's not Vault as such, but Consul, which delivers too many unexpected troubles under even slight pressure


Nobody’s forcing you to use Consul as storage though - Vault works pretty well with the CockroachDB/TiDB/FoundationDB storage engines


Nomad makes things interesting...


Seems surprising that it would cause 24+ hours of downtime. My guess is they're recovering data from backup.


It can sometimes be very hard to cold start in such circumstances, because the demand from all the services trying to access their secrets at once can create a “thundering herd” that knocks the recently restored service over again.

My guess is that they’ve had to shut most services down, restore the secret store, and then bring up internal services in a careful sequence to avoid a thundering herd. At $JOB even understanding the correct sequence to cold start our organization would probably take more than 24hrs.
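For the curious, the usual client-side mitigation for a thundering herd is retrying with jittered backoff, so clients spread out instead of hammering the freshly restored service in synchronized waves. A minimal sketch in Python (the "full jitter" variant; the function names and parameters here are illustrative, not anything Roblox actually runs):

```python
import random
import time

def backoff_with_full_jitter(attempt, base=0.5, cap=60.0):
    """Delay before retry number `attempt` (0-indexed): a random point in
    [0, min(cap, base * 2**attempt)], so retries from thousands of clients
    spread out rather than arriving in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def fetch_with_retries(fetch, max_attempts=8):
    """Hypothetical client retrying, e.g., a secrets fetch after a restore."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            time.sleep(backoff_with_full_jitter(attempt))
    raise RuntimeError("dependency still unreachable")
```

The cap matters: without it, late retries disappear for minutes at a time; without the jitter, every client that failed together retries together, recreating the herd.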


At $JOB, I've often pointed out that no one actually knows whether, given a sufficiently bad issue, we could even bring it all back up on a reasonable timeline.

In power grids it's called a Black Start; I think that's a concept we could nick.
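A black-start runbook is essentially a topological sort of the service dependency graph: start each service only after everything it depends on is up, and discover any cycles before the outage rather than during it. A sketch using Python's stdlib `graphlib`; the dependency map is entirely made up for illustration, not anyone's actual architecture:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical map: service -> services that must be up before it starts.
deps = {
    "consul": set(),
    "vault":  {"consul"},
    "db":     {"vault"},
    "auth":   {"vault"},
    "games":  {"auth", "db"},
}

# static_order() raises CycleError if a circular dependency sneaked in,
# which is exactly what you want to learn from a drill, not an outage.
boot_order = list(TopologicalSorter(deps).static_order())
# One valid order: consul first, then vault, then db/auth, games last.
```

Running this periodically against a machine-readable dependency inventory is a cheap way to keep a cold-start sequence from rotting.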


Back in college (90’s), we had a Solaris shop. One time after a long power outage we discovered that we couldn’t bring up the system because of a circular dependency.

It’s been decades, something along the lines of a NIS server depending on a file server and vice-versa. We managed to manually get the services running and then do a clean reboot after the rest of the systems came up.


Just one more reason I don't take the "engineering" part of "software engineering" seriously. Hardly surprising given the admiration of tinkering and hacking that runs through the industry.


modern software spans the globe, often communicates with 3rd-party systems, operates 24/7/365, and is often at the whims of deadlines. it's some of the most technically challenging endeavors in existence.

it's not really surprising that you'd end up in these situations.

that being said: rate limits and error handling, people. jesus. =)


real engineering is full of the same examples. i’m thinking of civil infrastructure but there is no shortage in any domain


The 19th century is full of “real engineering” that is frankly clown-shoes stuff. Stories about how early truss bridges were designed make modern software engineering seem quite orderly by comparison.


I've always been amused at this comparison. You're not wrong. We wouldn't tolerate or excuse a group of people attempting engineering based on 19th century practices. Why should we excuse software engineers (if, indeed, their goal is to be "engineers" in any meaningful sense) for not building on the last two centuries of engineering knowledge?


Because software engineering is not the same as mechanical engineering. We’ve tried to apply mechanical engineering principles to software projects; it didn’t work.


Not surprising to me at all.

No idea about their infra, but for anything complex enough, built in Silicon Valley move-fast style (and they are likely in that bucket), restarting after such a failure can be a huge problem. You can have thundering herds that you may not have levers to control, the infrastructure needed to build and deploy fixes may itself be down, you can have cyclical dependencies that sneaked in in the past and make it impossible to start some systems, etc., etc.


I really hate overengineered cloud infrastructure. It's the 2020s equivalent of 1990s to early 2000s enterprise Java except it's "use every conceivable cloud service in as baroque a deployment as possible" instead of "use every single design pattern in every part of the program."

I refer to software built this way as "write once run once" as in it's so complex it could barely be redeployed and might not even be possible to restart. "Write once run once" is a play on the old "write once run anywhere" tagline for Java.

But hey it makes Amazon rich. Amazon figured out how to monetize the tendency of sophomore engineers to add as much complexity as possible. I bet the authors of the original 1994 design patterns book are kicking themselves for not finding a way to charge for every implementation of the factory pattern.


What makes you believe Java apps after the early 2000s are any better? Spring apps = Factory-pattern apps on steroids. But why pick on Java? Any backend landscape at scale looks like that, mostly because kids want to pad their resumes with cool shit, or so they think. E.g. everything has to be fluent/reactive/async for some reason now; not a problem you could blame on "managers".


That reminds me of the first company I worked for. Rumor had it that a handful of their engineers got their hand on the Design Patterns book, read it, and decided on a single design pattern that they'd use for everything.

So many visitors...



Oh. That hits close to home!



Spot on!


Totally. You'll have stale caches that need to be regenerated all at once (vs. incrementally during steady operation), backed-up queues, users themselves doing things like rapid refreshes, etc.

Ironically recovering after downtime is usually harder if you have downtime relatively rarely, because these problems don't get exposed.
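One way to keep the "regenerate everything at once" problem from knocking over the backing store is to pace the warm-up. A rough sketch, with invented names and a deliberately crude rate limiter:

```python
import time

def warm_cache(keys, regenerate, max_per_second=100):
    """Rebuild cold cache entries at a bounded rate, so the backing store
    sees a steady trickle instead of the entire keyspace at once."""
    interval = 1.0 / max_per_second
    cache = {}
    for key in keys:
        cache[key] = regenerate(key)
        time.sleep(interval)  # crude pacing; a token bucket would be smoother
    return cache
```

In practice you'd also prioritize the hot keys first so most traffic hits a warm entry early in the recovery, but the core idea is just bounding the regeneration rate.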


> You can have thundering herds, that you may not have levers to control

To me, this seems almost unthinkable. You mean to tell me that you can't just coordinate setting "replicas: 0" for all of your services (or the equivalent) and then restart them one at a time? If that's really the case, then perhaps it's time to re-evaluate how that situation ever came to be - where's the technical leadership that kept systems scalable, yet manageable? Was that even a concern at any point in time?

Sadly, my personal experience indicates that many enterprises don't give it any thought and just expect things to work, oftentimes not even knowing about all of their services in enough detail to be able to answer questions about their architecture, nor having a clear way of managing their service instances.

> You can have cyclical dependencies sneaking in the past, that makes it impossible to startup some systems

This does seem like an application problem. If an application cannot reach external services, it should either:

  - return an error for all services calls that require this external integration
  - queue up these requests internally to be retried later, if there are the necessary resources for this (RabbitMQ, or just a lot of RAM)
Having an application fail fast when one of its dependencies is down makes sense, of course, but it's also a really dangerous approach, since just one cyclic dependency sneaking past could break everything. Perhaps it's better to avoid such situations entirely and not make your app crash when it cannot reach an external service, but merely try again later?
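The second option above - queueing requests internally and retrying them later - can be sketched roughly like this. Everything here is illustrative (the class and its behavior are assumptions, not a reference to any real library):

```python
from collections import deque

class RetryLaterProxy:
    """Wraps a call to an external service: on failure, park the request in
    a bounded in-memory queue instead of crashing, and drain the queue once
    the dependency is reachable again."""

    def __init__(self, call, max_pending=10_000):
        self.call = call
        self.pending = deque(maxlen=max_pending)  # oldest drop first when full

    def submit(self, request):
        try:
            return self.call(request)
        except ConnectionError:
            self.pending.append(request)  # degrade, don't die
            return None

    def drain(self):
        """Retry queued requests; run periodically or on a 'service up' signal.
        Returns how many requests are still pending."""
        while self.pending:
            try:
                self.call(self.pending[0])
            except ConnectionError:
                break  # still down; keep the rest queued
            self.pending.popleft()
        return len(self.pending)
```

The bounded queue is the important caveat: buffering retries without a cap just trades a fast crash for a slow out-of-memory one.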


Roblox has been around for 16 years. I would be shocked to learn that 16 years ago they planned the architecture of their system so meticulously as to account for the dependencies and problems you describe over the long term. Odds are a novice programmer or a few of them hacked something together, and as the platform grew they just kept adding to that initial hack job.

It might take a huge investment to correct it now. But perhaps an outage like this will prove to them that it's necessary.

(N.B.: I did not vote up or down on your comment.)


> I would be shocked to learn that 16 years ago they planned the architecture of their system so meticulously as to account for dependencies and problems you describe for the long term future.

If their architecture hasn't changed much in that long of a time, then that is impressive, both in a positive and negative sense.

In regards to the positive aspects, that indeed makes you consider the thought that must have been put in the system design and how it must have been good enough to survive this long.

However, there are also the negative aspects - if there are single points of failure, or even scalability issues which may lead to a lot of downtime, then it's almost certain that rewrites in some capacity will need to be carried out. A system design that worked 16 years ago will probably run into some roadblocks today, much like C10k was a problem back in the day but has largely been solved in many situations.

Either way, long term downtime doesn't reflect positively on the current state of the overall system.

(No worries about the upvotes/downvotes, i merely mentioned that fact in case it was my tone that was inadequate, or perhaps there were technical inaccuracies or just inactionable advice given on my part - being told exactly why i'm wrong is helpful not just to me but to others as well.)


Clarification for the downvote(s) since the person didn't provide any: in my current workplace, i actually brought this very issue to the attention of all the stakeholders and over the past few months have been working on improving the applications that we develop: updating the frameworks and libraries to recent, maintained and more secure versions, using containers and Ansible to improve configuration management and change management, as well as to get rid of years of technical debt.

A part of this was needing to understand how they truly process their configuration, when they choose to fail to start and how they attempt to address external systems - in my mind, it's imperative for the long term success of any project to have a clear understanding of all of this, instead of solving it on an "ad hoc" basis for each separate feature change request.

To give you an example:

  - now almost everything runs in containers with Docker or Docker Compose (depending on the use case), with extremely clear information about what belongs in which environment and how many replicas exist for each service. A part of testing all of this was actually taking down certain parts of the system to identify any and all dependencies that exist and creating action plans for them.
  - for example, if i want to build a new application version i won't be able to do that if the GitLab CI is down, or if the GitLab CI server cannot reach the Nexus registry which runs within containers, so if the server that it's on goes down, we'll need a fallback - in this case, temporarily using the public Docker image from Docker Hub, before restoring the functionality of Nexus through either Ansible or manually, and then proceeding with the rest of the servers.
  - it's gotten to the point where we could wipe any or all of the servers within that project's infrastructure and, as long as we have the sources for the projects within GitLab or within a backed up copy of it, we could restore everything to a working state with a few blank VMs and either GitLab CI or by launching local containers with the very same commands that the CI server executes (given the proper permissions).
Of course, make no mistake - it's been an uphill battle every step of the way and frankly i've delivered way too many versions of software and configuration changes late into the evening, because for some reason we don't have an entire team of DevOps specialists for this - just me and other developers to onboard. Getting the applications not to fail-fast whenever external services are unavailable has also been a battle, especially given the old framework versions and inconsistent, CV driven development over the years, yet in my eyes it's a battle worth fighting.

Not only that, but the results of this are extremely clear - no more depressive thoughts after connecting to the server for some environment through SSH and having to wonder whether it uses sysvinit, systemd, random scripts or something else for environment management. No more wondering about which ports are configured and how the httpd/Nginx instances map to Tomcat ports, since all of that is now formalized in a Compose stack. No more worrying about resource limits and old badly written services with rogue GC eating up all of the resources and making the entire server grind to a halt. No more configuration and data that's strewn across the POSIX file system, everything's now under /app for each stack. There hasn't been an outage for these services in two months, their resource usage is consistent, they have health checks and automatic restarts (if ever needed). Furthermore, suddenly it becomes extremely easy to add something like Nexus, or Zabbix, or Matomo or Skywalking to the infrastructure because they're just containers.

Therefore, i'd posit that the things i mentioned in the original post are not only feasible, but also necessary - to both help in recovering from failures, but also to make the way everything runs more clear and intuitive for people. And it pains me every time when i join a new project with the goal of consulting some other enterprise and generating value with some software, but instead see something like a neglected codebase with not even a README and the expectation for the developers to pull out ideas on how to run local environments out of thin air.

If you see environments like that, you're not set up for success, but rather failure. If you see environments like that, consider either fixing them or leaving.


I doubt they are "built in Silicon Valley move fast style". I played Roblox when I was a child. In fact, it is what got me into programming.

(And they just rejected my intern application, which is very sad.)


Why do you doubt they were built in Silicon Valley move fast style? Because it's old? It was released in 2004, the same year as Facebook (both of them had different names originally)


I don’t know much about Roblox other than it’s some multiplayer game platform.

Is it down totally with nothing working, or just extremely overloaded but with some requests still getting through?

If it’s totally down still after 24hr it must be pretty serious. I’ve seen my fair share of production fires but usually you can bring some of the service back up quickly and then fight with the thundering herds and load and eventually stabilise the service.

I hope they put out a public post-mortem afterwards.


Completely dead with nothing working; the main site serves a maintenance page: https://roblox.com


Yesterday, about an hour or two after the incident began, my kids and I were able to get into a Roblox instance through some unusual steps. (Logging into an account wouldn't work and some other website functionality was screwy, but if you were still logged in, you could do some things to get into some Roblox instances.)

A couple hours later, when trying to play Roblox again, our tricks no longer worked and we eventually gave up. But at least at the start, some could still play.


There are screenshots floating around on Reddit showing limited functionality, presumably from sessions that were still logged in when it started:

https://www.reddit.com/r/roblox/comments/qi3du1/roblox_witho...


Roblox is 15 years old

It's more like they aren't used to people noticing their incompetence

But they had plenty of time to improve


It points to it being a serious infrastructure issue really. You don't have a hugely popular platform be dead for two days unless there is some fundamental issue with the infrastructure.

Likely some stupid misconfiguration in some overengineered "cloud" architecture is the root cause, and they're mired in trying to debug and resolve the problem, fighting the complexity of the system the whole way.


We had a situation like that once because of circular dependencies: services that needed another service in order to start. “How is that possible?” you ask. Code reviews: “take that code out and get the data from an existing service.”

We hard-coded data into a base service, pushed it to production, restarted services that depended on it, then services that depended on those services, etc.
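That hard-coding trick - baking seed data into a base service so the cycle can be broken at boot - can be sketched as a fallback path. The names and values below are invented for illustration:

```python
# Baked-in seed values: just enough for dependents to boot when the
# upstream service that normally provides them is itself still down.
SEED_CONFIG = {"region": "us-east-1", "feature_flags": {}}

def load_config(fetch_remote):
    """Prefer live config, but fall back to the compiled-in seed so this
    service can start even while its (circular) dependency is down."""
    try:
        return fetch_remote()
    except ConnectionError:
        return dict(SEED_CONFIG)  # copy, so callers can't mutate the seed
```

Once everything is up, the services reload from the live source and the seed data is never consulted again; its only job is to make a cold start possible.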


I find it surprising when companies recover in less than 72h personally


$20 says dependency cycle


Wonder how they would handle lost purchases and whatnot. All those kids having to prove they lost the money they spent on their Roblox gift cards.


Not if they managed to drop all the secrets.


Amusing: John Carmack mentioned that the metaverse will be something like Roblox https://arstechnica.com/gaming/2021/10/john-carmack-sounds-a... .

FB (Meta) taking drastic action to shut down the competition


I think this is a pretty strong statement to make without any citation behind it.


Not GP but I would interpret that as sarcasm


I think the poster was joking


Recent and related:

Roblox Service Disruption - https://news.ycombinator.com/item?id=29034909 - Oct 2021 (40 comments)


I interviewed with their ops team and turned it down. Just had a bad smell. I think they hire 'SREs' who can write good leet code. The problem with those folks for ops is that they over-engineer complex solutions, and then completely forget the simple but very important stuff. I wish them luck in getting it back together.


Leet-code-style questions for an SRE are so dumb. SRE is all about solving unexpected, unique problems. The opposite of leet code!


Completely unfounded speculation—what if they were hit by a ransomware attack?

I wonder only because, well, after a full 24 hours... a purposefully malicious attack starts to seem more plausible to me than a simple freak accident.


Interestingly... the stock is up 3% over the course of this outage so far.


I'm guessing because many investors found out about Roblox from their kids when it went down.


I could imagine something like this:

Dad! Roblox is down!!

Dad: what's Roblox?

Looks at credit card over the years: fuuuuuu!

Goes to buy shares :)


Yea, this is definitely it, wow


Hilarious and possibly true.


That does actually make a lot of sense.


I wouldn't have thought "any publicity is good publicity" would hold even in this case.


Sounds like when Fastly went down - the stock went up. Everyone suddenly realised how much of the internet they powered?



roblox powers much of the internet? I find that unlikely. If you don't have kids, this is probably the first you heard of it. I guess there's the "my kid is complaining to me that it's down and now I know about it" angle, but it's hard to draw any conclusions from your kid's purchasing habits alone.


Roblox has been around for a long time.


Roblox has the most under-18 users of anyone out there.


Right, but do you really need an outage for everyone to find that out? It's hard to notice that fastly powers a given site, but it's trivial to notice the game your kid's playing.


My parents didn’t notice any of the games we were playing.


yes, my parents wouldn't understand playstation or xbox


Probably because of Facebook turning themselves into a Roblox, validating them.


Metaverse overload


How is that interesting? Markets are irrational.


sometimes things are interesting because they are counterintuitive, however irrational


There's probably a lesson in here about managing incidents: very few updates or even acknowledgments that there's a major problem.

Will be interesting to find out what caused an incident big enough to kill their entire platform for so long (if they release details or they come out through the grapevine)


The status page acknowledges there is an incident, that it's pretty bad — everything is marked as down, and it's in red! What more could you want? (/s)

Compare to Azure: most incidents never see the status page. Hell, getting support to acknowledge that an incident exists even once your internal investigation has reached certainty on "oh, yeah, it was them" is hard. There was an AAD outage earlier this year (?; IDK — I've lost track of the passage of time in the pandemic…) and the status page was down, and even once you managed to get the status page (IIRC you could hit the IP if you knew it magically which the Twitterverse did) … most services were still green, even if completely offline as far as one could tell by issuing queries to the service…

And I'm comparing a kid's game with a "major" cloud PaaS…

I'm definitely suffering from Stockholm syndrome.


> Compare to Azure: most incidents never see the status page.

That sounds like a conscious decision on their part. Everyone always talks about disclosure being the best policy, but at the same time there's plenty who believe that not informing anyone about an outage or even a breach is the correct thing to do, since then they'll probably get into less trouble themselves, or at least will create the illusion of not having as many outages as "those other guys".

After all, informing everyone about an outage that will noticeably affect only some probably has a larger impact on the company's reputation than having it be dragged through mud in smaller communities for its dishonesty. Then again, with many of the larger services, it's not like you have much of a choice of using or not using it - you just get a corporate policy passed down upon you and that's that.

Thus, sweeping problems under the rug and pretending that they don't exist is a dishonest, yet valid way of handling outages, breaches and so on. Personally, i'd avoid any service that does that, though it's not like that's being done just because of incompetence.


> That sounds like a conscious decision on their part.

It is; I've been told by their support that they don't want to cause alarm.

I think it's a bad way to run a PaaS, though. If I'm looking at your status page, it is because I suspect an outage and am trying to confirm it. I'm very willing to give some leeway to fix problems (an SLA — and Azure could do better here too — exists to establish what that allowable leeway is); I just need to know "is it me, or not?" and it's nicer to just get the answer when it's not me. As it is, I have to jump through a support hoop to get at "I think you are having an outage", and even then, it's typically multiple cycles before support seems to query engineering (and that's another problem: support doesn't just know that there is an outstanding issue…) and gets to the bottom of it.

It needs to be easy for a customer, experiencing an issue with a service, to drive resolution of that problem. I can forgive small service outages, but it's this lack of any ability to get resolution or closure or some "yeah, we had a failure, here's what we're doing to prevent it going forward" that is the real problem.

Sadly, there is only so much choice I have in the matter of which cloud provider we're using…


> Compare to Azure

I'll honestly extend that to all major cloud providers - AWS has small hiccups now and then that are never recorded, and Google Workspace describes "a problem affecting a subset of users" when it's clearly worldwide.


A kids' game, yes, but worth $50B.


Seems to me that a routing/networking issue would have been resolved by now, and an application bug would have been rolled back.

If I were to speculate, I would say that it must have something to do with databases / storage. Something must have gone wrong and broke some database, and it’s difficult to restore.


They had an incident in the past where they returned nil for all get operations to their database; that caused some chaos at the time, so it's not unprecedented: https://devforum.roblox.com/t/update-datastores-incident-dat...


I agree that the DB is the prime suspect here. They said the problem was identified over 24h ago; I can't imagine it being anything other than a data issue that takes this long to resolve.


I'm curious where the traffic that isn't going to Roblox is going now? Steam player counts don't seem to be jumping much. Maybe just mobile games?


The kids are getting the sleep they need?

https://www.healthline.com/health-news/children-lack-of-slee...

Parent advice: Get your kids to sleep, it gives them a huge unfair advantage in school :)

https://news.mit.edu/2019/better-sleep-better-grades-1001


How is it an “unfair” advantage? It’s just an advantage. It’s not like getting a good nights sleep is something that requires an exorbitant amount of money.


Kind of tongue in cheek; parents know the struggle of getting kids to sleep, it takes dedication and effort to not just wait until they decide themselves to go to sleep.

This dedication and effort requires time and patience, and with more kids raised without multiple adults (time is zero sum), that time is missing, and those kids ‘naturally’ decide on much less sleep than they need (because as we all know, the distractions available are amazingly fun).


Probably. Roblox's original value proposition of 'kids making games for kids' has devolved into 'adults making dopamine slot-machines for kids' which is pretty sad :(


For free!


Steam is usually targeted at 13+, and since most Roblox users are under 13, I doubt most are on Steam. Probably mobile games (as you said), Minecraft, YouTube, etc.


Can confirm. They are still picture-in-picture on YouTube, but playing Minecraft instead of Roblox.


Youtube kids videos.


My kids play roblox and don't have any access to steam. Which is probably very common.

Also, a lot of roblox playing is with tablets and phones. Those kids can't move to steam even if parents allowed it.


Minecraft


Can someone please do the same for Minecraft? Maybe my kids will finally read a book


Maybe your kids are already reading books from The Uncensored Library server in Minecraft :)

Edit: son to kids.


Worse than the FB outage for parents… help!


Guilded integration?


That’s weird


Well everything can be taken 'down', from blockchains to even real life by governments.

For the metaverse to succeed, it should never be 'down'. Good luck with that if it is on a single platform or a single point of failure.

Perhaps an interoperable metaverse / hyperverse then?


I think you’re right. Even Fb/Meta said interoperability is a key feature of the “metaverse”.


Sure. Interoperability between Facebook, Instagram, Messenger, and WhatsApp.


With everyone plugging into the Meta hub, of course.


The various Open Simulator grids (known collectively as the hypergrid) are interoperable like that. As long as the one you created your avatar on (which can easily be a self-hosted one) is up, there are many worlds to teleport to.


Perhaps some sort of MetaChain…of MetaBlocks…is needed?


Wow... seems like a good time to sell Roblox stock anyway... because most of Roblox's popular games appear to have been cloned into Minecraft... earnings are coming up real soon though... I guess we'll find out in less than a week.

edit: I edited this comment a bit to make it clearer because HN would not let me reply to the child comment.


I would think the opposite. This is a short term thing that presumably won't affect the business in any meaningful way. They will (probably) recover and be fine and nobody will be too worried that a kid's game has some, limited, downtime from time to time. If this outage causes a drop in share price, it's probably a good time to pick up a bit at a discount.


>nobody will be too worried that a kid's game has some, limited, downtime from time to time

except for the parents of kids addicted to the game


We need to keep kids more offline.


Let's make June the month of no Internet for kids.


A whole month? Kid me would stage a coup on the spot.


that's okay, kids don't have rights. coup all you want. just do it in your room so the grown-ups can talk


But then they can hate their parents for the rest of their lives.


Every society that survives eventually realizes it is good to have occasional fasts to reexamine priorities and improve mental health.

There are communities in the U.S. that take an entire month off from these types of digital activities :)


Interesting, are these well-known communities? Of course the Amish never use computers to begin with.


Well, the communities belong to the oldest continuous organization of mankind :)

E.g. https://encourageandteach.wordpress.com/2015/02/20/the-frida... https://www.catholicapostolatecenter.org/uploads/9/2/4/6/924...


I'm slightly embarrassed I didn't think of that, since I'm Catholic and have given up reddit for Lent in the past...

I guess I misinterpreted your previous comment as an entire community giving up the same thing together, rather than each individual choosing a specific thing.


I could have been more clear: there are a subset of community members who band together and encourage each other to fast (digitally)…

I’m also a Catholic (took RCIA @ Harvard), and incorporating regular fasting has been awesome :)



