
The lack of transparency makes me want to consider other cloud providers. All providers will have outages -- that's a reality I can live with -- but I will prioritize the ones who are the most forthcoming in their status updates and explanations of failures.


Google kept their status page up to date as the outage was progressing, and now (the day after the outage), they've provided an apology and a preliminary explanation of what happened.

If that's not sufficient, what more are you looking for, and what other large cloud providers consistently meet that standard?


AWS's post outrage summaries are pretty much the gold standard.

e.g.

https://aws.amazon.com/message/2329B7/

https://aws.amazon.com/message/41926/

To be fair to Google, they haven't had enough time to perform a detailed autopsy, and some GCP incident summaries have shown meat on the bones e.g. https://status.cloud.google.com/incident/compute/16007. And balancing the scales, the AWS status page is notorious for showing green when things are ... not so verdant.

I have seen full <public cloud> internal outage tickets, and the volume of detail is unsurprisingly vast. Boiling it down into summaries - both internal and external - without whitewashing or emotion, while capturing an honest and coherent narration of all the relevant events and all the useful forward learnings, is an epic task even for a skilled technical writer and/or principal engineer. You don't get to rest just because services are up; some folks at Google will have a sleep deficit this week.


Google also posts detailed postmortems for their more significant outages.

Some examples:

https://status.cloud.google.com/incident/cloud-networking/18...

https://status.cloud.google.com/incident/cloud-pubsub/19001

https://status.cloud.google.com/incident/cloud-networking/18...

https://status.cloud.google.com/incident/cloud-networking/18...

https://status.cloud.google.com/incident/compute/18012

Given that this was a multi-region outage that lasted several hours and impacted a substantial number of services, I'd expect a detailed postmortem to follow.


I hope one of the things Google learns from this post-mortem is that the next-day summary should clearly state that a full post-mortem is coming in the next few days, or however long it takes.

Half the people in this thread are overlooking that fact and going into outrage mode.


I love reading the AWS post-mortems since they're always very detailed in describing the roles of the impacted systems, the intention of the action that caused the outage, the actual action triggered, all the nuances involved in the bug or irregularities from expected behavior, impact to systems, complications, and resolution. It paints a very complex and thorough picture of how their massive outages are a collection of generally simple failures or oversights that had to all line up just right for catastrophic failure.

Every time I read a Google post-mortem, they seem to hand-wave everything away as "a configuration error", "bug", or "bad deploy", and their resolution always has the generic "implement changes to things" that says absolutely nothing. Honestly, when the causes of these massive disruptions are dismissed so simply, it portrays their systems as frail amateur work.


“post outrage summaries” hehe


The status page wasn't up to date, though I don't have any way of backing up that claim. It certainly isn't tied to any automated failure reporting -- the status page seems to require manual updates. When minutes turn into hours with no updates on the status page, it doesn't leave me at ease.
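This is why many ops teams run their own external probes rather than trusting the provider's dashboard. A minimal sketch of such a probe (the endpoint URL and polling interval below are hypothetical placeholders, not anything Google or AWS publishes):

```python
# Minimal external health probe: poll an endpoint on a schedule and log
# state transitions independently of the provider's status page.
import time
import urllib.error
import urllib.request


def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, etc. all count as DOWN.
        return False


def monitor(url: str, interval: float = 60.0) -> None:
    """Poll forever, printing a line only when the UP/DOWN state changes."""
    last_up = None
    while True:
        up = check(url)
        if up != last_up:  # log transitions, not every poll
            state = "UP" if up else "DOWN"
            print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {url} is {state}")
            last_up = up
        time.sleep(interval)
```

It's crude, but a probe like this gives you your own timeline of the outage, which is exactly the evidence you'd want when the official status page lags by an hour.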


We began noticing erratic behavior with some of our GCE instances at around 11:50 Pacific on the day of the outage, and Google posted a notice on their status page that GCE was having an outage about 30 minutes later. They also updated the status page every hour (or when they had new information), which is what they said they would do in their status updates.

While it sucks that multiple regions malfunctioned simultaneously for several hours, I can't really fault them for their communication about the issue.


I wasn't able to load a google doc from drive, or my calendar (both g suite), around 11:45 PDT.


There was an hour's worth of tweets before the status page changed. For ops, quite a shitty hour of uncertainty.


Gmail was down for me for at least an hour and all the lights were green on their status page


I don't experience this as anything like a lack of transparency.

The incident was less than 2 days ago, is resolved, and we have a preliminary report from the "VP, 24x7", which is easily digestible by the average GCP customer with more details undoubtedly to come.


This isn't the post-mortem though. It says they're still working on that.


As the other comments here state, wait until they post the full post-mortem before drawing conclusions. This reads like a statement to Gmail/YouTube customers, not a post-mortem for GCP customers.



