
Cliff notes: google's configuration service broke itself then fixed itself today. Engineers were alerted. Skynet is self-aware.


yeah I'm very curious as to what sort of bug deploys a bad configuration and then magically deploys the fix 30 minutes later...


I don't know the specifics of what happened here, but in my experience with automatic configuration generation you must have a way to validate the config, and that validator can have bugs (like any other software).

Then either the software loading the configuration detects the problem, or the monitoring system detects something's not right, and the last working configuration is automatically applied while the non-working one is discarded.
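A minimal sketch of that load-validate-fallback loop (all names hypothetical, and note the validator itself can be buggy, which is exactly the failure mode being described):

```python
def validate(config):
    # Hypothetical validator: require at least one backend.
    # Like any other software, this check can itself have bugs
    # and wave a bad config through.
    return isinstance(config, dict) and len(config.get("backends", [])) > 0

def apply_config(new_config, last_good):
    """Return the config the service should actually run with."""
    if validate(new_config):
        return new_config
    # Discard the non-working config; keep the last known-good one.
    return last_good

last_good = {"backends": ["10.0.0.1"]}
bad_push = {"backends": []}               # fails validation
good_push = {"backends": ["10.0.0.2"]}

assert apply_config(bad_push, last_good) is last_good
assert apply_config(good_push, last_good) is good_push
```

The automatic rollback only saves you when the problem is caught before or at load time; if the bad config passes validation and only misbehaves in production, humans end up in the loop, which fits the ~25-minute timeline.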

By the looks of it, I would say their monitoring detected the problem but the reliability team needed some minutes to realise it was a configuration problem. A classic case is a network appliance that is misbehaving (e.g. a firewall or switch), but nobody knows it's because of the configuration, so it gets replaced by a fallback appliance that... oh, has the same problem (same configuration).

Altogether 25 minutes seems like a lot, but when you're troubleshooting and you know an important part of your infrastructure is down, time flies!


Probably a race condition that occurred during the first deployment, and the system ran just fine the second time.


the bug was probably neatly tucked inside a conditional that was time and/or state related

break_google() if state == "x" else fix_google()



