if one machine failed and failover kicked in correctly, why was the engineer pag...

jimrandomh · on June 16, 2015

Because it's hard to make an automatic monitoring system that reliably distinguishes between "a failure occurred but everything is fine" and "a failure occurred and now everything is on fire".

InclinedPlane · on June 16, 2015

Depends on how much spare capacity they had. Being one failure away from going down is an emergency situation at many places.

mentat · on June 16, 2015

I wondered this as well. Valuing your engineers' sleep is important.

adamsurak · on June 16, 2015

We have several different kinds of pages. In our cluster we have 3 machines, and if one of them is unavailable because of a broken network, we do not page. In this case the page came as an application error that the application was not able to cope with. When we have an issue that we have seen before and the server can handle it on its own, we do not page.
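As a rough illustration of that policy, here is a minimal sketch. The event names, the `Event` fields, and the rule that anything unrecognized pages by default are all assumptions made for the example, not a description of the actual alerting setup.

```python
from dataclasses import dataclass

CLUSTER_SIZE = 3  # from the comment above: a 3-machine cluster

# Hypothetical catalogue of issues seen before that the server recovers from alone.
KNOWN_SELF_HEALING = {"single_node_network_down"}


@dataclass
class Event:
    kind: str                  # hypothetical event identifier
    handled_by_app: bool       # did the application cope on its own?
    unavailable_machines: int  # machines currently out of the cluster


def should_page(event: Event) -> bool:
    """Return True if a human should be woken up for this event."""
    # An error the application could not cope with always pages.
    if not event.handled_by_app:
        return True
    # A known issue affecting a single machine does not page.
    if event.kind in KNOWN_SELF_HEALING and event.unavailable_machines <= 1:
        return False
    # Assumption: anything not recognized pages, to stay on the safe side.
    return True


if __name__ == "__main__":
    print(should_page(Event("single_node_network_down", True, 1)))
    print(should_page(Event("application_error", False, 1)))
```

The point of the split is that the decision keys on whether the application coped, not on whether a machine went away.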

Qantourisc · on June 16, 2015

Also depends on how many machines you've got running. If it's 2: do you really want to wait it out and risk the other one going to hell too?
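The spare-capacity argument can be put into numbers with a small sketch. The function names and the `min_needed` parameter (how many machines must stay up to serve traffic) are assumptions for illustration only.

```python
def failures_until_outage(healthy: int, min_needed: int = 1) -> int:
    """How many more machines can fail before the service goes down."""
    return max(healthy - min_needed, 0)


def is_emergency(total: int, failed: int, min_needed: int = 1) -> bool:
    """True when the very next failure would take the service down."""
    return failures_until_outage(total - failed, min_needed) == 0


if __name__ == "__main__":
    # 2 machines, 1 failed: nothing left to lose.
    print(is_emergency(total=2, failed=1))
    # 3 machines, 1 failed: one more failure is still survivable.
    print(is_emergency(total=3, failed=1))
```

Under these assumptions a single failure in a 2-machine setup is an emergency, while the same failure in a 3-machine cluster leaves some margin.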