Testing changes on smaller parts of the network before pushing them to critical parts would be a first guess.
Config changes tend to be nasty in that their implications are often hard to foresee until they have been made, and if the effects preclude you from making another config change, then you've just cut off the branch you were sitting on.
Google is best-in-class when it comes to this stuff; the thing you should take away from this is that if they can mess up, everybody can. That pretty much matches my experience to date. This stuff is hard, maybe needlessly so, but that does not change the fact that it is hard and that accidents can and will happen. So you plan for things to go wrong when you design your systems. Failure is not only an option, it is the default.
Which seems to have been what they were trying to do; according to this update, the config change which caused the problem was intended to apply to a smaller portion of the network than it actually hit. But automated enforcement of procedures like this can also be tricky; how's a machine supposed to know that "this change was already tried on a smaller part of the network"?
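One way a machine could know, as a rough sketch: identify each change by a hash of its content and keep a ledger of the stages where that exact content has already been applied and verified. Everything here is hypothetical (the stage names, `RolloutLedger`, `may_apply`); it illustrates the idea and says nothing about how Google's tooling actually works.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical rollout stages, ordered from smallest to largest blast radius.
STAGES = ["canary", "single-region", "global"]


def fingerprint(config_text: str) -> str:
    """Identify a change by its content, so any edit counts as a new change."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


@dataclass
class RolloutLedger:
    # Maps a change fingerprint to the set of stages where it was verified.
    verified: dict = field(default_factory=dict)

    def record_success(self, config_text: str, stage: str) -> None:
        """Note that this exact change ran cleanly at the given stage."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.verified.setdefault(fingerprint(config_text), set()).add(stage)

    def may_apply(self, config_text: str, stage: str) -> bool:
        """Allow a stage only if every smaller stage was already verified."""
        smaller_stages = STAGES[: STAGES.index(stage)]
        done = self.verified.get(fingerprint(config_text), set())
        return all(s in done for s in smaller_stages)


ledger = RolloutLedger()
change = "route-policy: prefer-backbone"  # placeholder config text

can_go_global_first = ledger.may_apply(change, "global")            # False
ledger.record_success(change, "canary")
can_go_regional = ledger.may_apply(change, "single-region")         # True
can_edit_and_skip = ledger.may_apply(change + " v2", "single-region")  # False
```

Note that a gate like this only helps if the scope of a change is what the tooling believes it is. According to the update, the change hit more of the network than intended, and a ledger keyed on the intended stage would not catch that on its own.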