Who was making configuration changes on a Sunday afternoon?
Not many engineers at Google work Sundays, and most teams outright prohibit production-affecting changes at weekends.
The only type of change normally allowed would be one to mitigate an outage. I therefore suspect that the incident was started by an on-call engineer who, while responding to a minor (perhaps not user-visible) outage, made a config mistake that triggered a real outage.
That seems likely because on-call engineers at weekends are at their most vulnerable - typically there is nobody else around to do thorough code reviews or to bounce ideas off. The person most familiar with a particular subsystem is probably not the person responding, so you end up with engineers trying to do things they aren't super familiar with, under time pressure, and with no support.