What can prevent this from happening in the future? Why was the config change re...

jacquesm · on June 4, 2019

Testing changes on smaller parts of the network before pushing them to critical parts would be a first guess.

Config changes tend to be nasty in that their implications are often hard to foresee until they have been made, and if the effects preclude you from making another config change then you've just cut off the branch that you were sitting on.

Google is best-in-class when it comes to this stuff; the thing you should take away from this is that if they can mess up, everybody can. And that pretty much correlates with my experience to date. This stuff is hard, maybe needlessly so, but that does not change the fact that it is hard and that accidents can and will happen. So you plan for things to go wrong when you design your systems. Failure is not only an option, it is the default.

rst · on June 4, 2019

Which seems to have been what they were trying to do; according to this update, the config change which caused the problem was intended to apply to a smaller portion of the network than it actually hit. But automated enforcement of procedures like this can also be tricky; how's a machine supposed to know that "this change was already tried on a smaller part of the network"?
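
One way a machine could "know" is to key the rollout history on a fingerprint of the exact change. The sketch below is only an illustration under assumptions: the stage names, the `RolloutGate` class and its methods are all made up here, and nothing in it reflects how Google actually does this. Note that it would not have caught a change that hits a larger scope than intended; it only refuses to widen a change that has no record of passing the narrower stages.

```python
import hashlib

# Hypothetical rollout stages, ordered from smallest blast radius to largest.
STAGES = ["canary", "single_region", "global"]


class RolloutGate:
    """Tracks which stages each exact config payload has already passed."""

    def __init__(self):
        # Maps change fingerprint -> set of stage names that succeeded.
        self._passed = {}

    @staticmethod
    def fingerprint(config_text):
        # Any edit to the config yields a new fingerprint, so an edited
        # change has to start again from the first stage.
        return hashlib.sha256(config_text.encode("utf-8")).hexdigest()

    def record_success(self, config_text, stage):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        key = self.fingerprint(config_text)
        self._passed.setdefault(key, set()).add(stage)

    def may_apply(self, config_text, stage):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        # Every stage narrower than the requested one must have succeeded.
        required = STAGES[: STAGES.index(stage)]
        passed = self._passed.get(self.fingerprint(config_text), set())
        return all(s in passed for s in required)
```

With this, `may_apply(change, "global")` stays false until both `"canary"` and `"single_region"` have been recorded for that exact text. The hard part the sketch skips is making sure the stage a change claims to target is the scope it really touches.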

geofft · on June 4, 2019

I think that's likely to be answered in the post-mortem, which is still to come.

lallysingh · on June 4, 2019

Prioritize configuration traffic so they can fix the problem more quickly.
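
As a toy illustration of the idea, here is a strict-priority queue in which control (configuration) messages are always sent before ordinary data, even if they arrive later. The class and constant names are hypothetical, and real networks do this with QoS mechanisms in the forwarding path rather than a Python heap; strict priority can also starve data traffic, which this sketch ignores.

```python
import heapq
import itertools

# Lower number = higher priority. Both classes are illustrative only.
CONTROL = 0  # configuration / management traffic
DATA = 1     # everything else


class PriorityLink:
    """Strict-priority queue: control messages always leave before data."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps FIFO order within a traffic class.
        self._seq = itertools.count()

    def enqueue(self, traffic_class, payload):
        heapq.heappush(self._heap, (traffic_class, next(self._seq), payload))

    def dequeue(self):
        # Returns None when nothing is waiting.
        if not self._heap:
            return None
        _, _, payload = heapq.heappop(self._heap)
        return payload
```

If two data messages are queued and a rollback command arrives afterwards, the rollback is still the first thing `dequeue()` returns.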

mlthoughts2018 · on June 4, 2019

Obviously to prevent things like this, Google needs more binary search tree whiteboard trivia problems in the interview process.