
Cliff notes: google's configuration service broke itself then fixed itself today. Engineers were alerted. Skynet is self-aware.


yeah I'm very curious as to what sort of bug deploys a bad configuration and then magically deploys the fix 30 minutes later...


I don't know the specifics of what happened here, but in my experience with automatic configuration generation you must have a way to validate the config, and that validator can have bugs (like any other software).

Then either the software loading the configuration detects the problem, or the monitoring system detects something's not right, and the last working configuration is automatically applied while the non-working one is discarded.
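A minimal sketch of that load-validate-fallback loop (all names hypothetical, and note the validator itself can be buggy, which is exactly the failure mode being described):

```python
def validate(config):
    # Hypothetical validator: require at least one backend.
    # Like any other software, this check can itself have bugs
    # and wave a bad config through.
    return isinstance(config, dict) and len(config.get("backends", [])) > 0

def apply_config(new_config, last_good):
    """Return the config the service should actually run with."""
    if validate(new_config):
        return new_config
    # Discard the non-working config; keep the last known-good one.
    return last_good

last_good = {"backends": ["10.0.0.1"]}
bad_push = {"backends": []}               # fails validation
good_push = {"backends": ["10.0.0.2"]}

assert apply_config(bad_push, last_good) is last_good
assert apply_config(good_push, last_good) is good_push
```

The automatic rollback only saves you when the problem is caught before or at load time; if the bad config passes validation and only misbehaves in production, humans end up in the loop, which fits the ~25-minute timeline.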

By the looks of it, I would say their monitoring detected the problem but the reliability team needed some minutes to realise it was a configuration problem. A classic case is a network appliance that is misbehaving (e.g. a firewall or switch), but nobody knows it's because of the configuration, so it gets replaced by a fallback appliance that... oh, has the same problem (same configuration).

Altogether 25 minutes seems like a lot, but when you're troubleshooting and you know an important part of your infrastructure is down, time flies!


Probably a race condition that occurred during the first deployment, and the system ran just fine the second time.


the bug was probably neatly tucked inside a conditional that was time and/or state related

break_google() if state == "x" else fix_google()



