
If one machine failed and failover kicked in correctly, why was the engineer paged?


Because it's hard to make an automatic monitoring system that reliably distinguishes between "a failure occurred but everything is fine" and "a failure occurred and now everything is on fire".


Depends on how much spare capacity they had. Being one failure away from going down is an emergency situation at many places.


I wondered this as well. Valuing your engineers' sleep is important.


We have several different kinds of pages. Our cluster has 3 machines, and if one of them is unavailable because of a broken network, we do not page. In this case the page came from an application error that the application was not able to cope with. When we hit an issue that we have seen before and the server can handle it on its own, we do not page.
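
As a rough illustration of that kind of policy, here is a minimal sketch. The `Alert` fields, the alert kind strings, and the `should_page` function are all hypothetical names made up for this example, not our actual tooling:

```python
from dataclasses import dataclass


@dataclass
class Alert:
    kind: str              # hypothetical labels, e.g. "network_unreachable" or "application_error"
    seen_before: bool      # assumption: the issue matches a known pattern
    self_recovering: bool  # assumption: the server can handle it on its own
    healthy_nodes: int     # machines still available after the failure
    total_nodes: int = 3   # the 3-machine cluster from the comment


def should_page(alert: Alert) -> bool:
    """Return True if a human should be woken up for this alert."""
    # A known issue that the server handles on its own: no page.
    if alert.seen_before and alert.self_recovering:
        return False
    # A single machine lost to a broken network: no page.
    lost = alert.total_nodes - alert.healthy_nodes
    if alert.kind == "network_unreachable" and lost <= 1:
        return False
    # Everything else, including application errors the app cannot cope with: page.
    return True
```

With this sketch, `should_page(Alert("network_unreachable", False, False, healthy_nodes=2))` is False, while an unhandled `"application_error"` pages.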


It also depends on how many machines you have running. If it's 2, do you really want to wait it out and risk the other one going to hell too?
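
One way to express that rule of thumb, as a sketch only: page when the number of further failures you can absorb drops below some margin. The function names and the `min_margin` threshold are assumptions for illustration.

```python
def failures_to_outage(healthy: int, required: int) -> int:
    """How many more machines can fail before the service goes down."""
    return max(healthy - required, 0)


def should_page_on_failure(healthy: int, required: int, min_margin: int = 1) -> bool:
    """Page if fewer than `min_margin` further failures can be absorbed."""
    return failures_to_outage(healthy, required) < min_margin
```

With 2 machines and one down (`healthy=1, required=1`) this pages, because the next failure is an outage. With 3 machines and one down (`healthy=2, required=1`) it does not.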



