Yes thank you. As someone who built these for a living I can definitely say that...

foobiekr · on Nov 19, 2022

Just gotta be careful about how you implement it. The real problem to solve is to detect when the application is not making forward progress. I’ve seen this implemented in awful, stupid ways (like spinning off a thread to stroke the watchdog - borderline pointless).

londons_explore · on Nov 19, 2022

For network connected things (which aren't designed to work offline), I like to implement a network watchdog poker...

I.e. every 30 seconds, contact an update server and, if successful, poke the hardware watchdog.

That means every device in your network will either be rebooting, or correctly talking to your server with the latest version of your software. There are no other stuck/error states to consider.
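A rough sketch of that loop in Python, for illustration only. The URL is a made-up placeholder, `/dev/watchdog` assumes a Linux watchdog driver, and the hardware timeout is assumed to be comfortably longer than the 30-second interval plus the request timeout. The function names are hypothetical.

```python
import time
import urllib.request

UPDATE_URL = "https://updates.example.com/ping"  # hypothetical endpoint
WATCHDOG_DEV = "/dev/watchdog"                    # assumes a Linux watchdog driver
INTERVAL_S = 30


def server_reachable(url=UPDATE_URL, timeout=10):
    """Return True only if the update server answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Any failure (DNS, TCP, TLS, HTTP error) counts as "not reachable".
        return False


def poke_once(check, poke):
    """Poke the watchdog only when the check succeeds; report what happened."""
    if check():
        poke()
        return True
    return False


def run(check=server_reachable, interval=INTERVAL_S):
    """Loop forever; if the checks keep failing, the hardware resets the box."""
    # Opening the device arms the timer; each write resets it.
    with open(WATCHDOG_DEV, "wb", buffering=0) as wd:
        while True:
            poke_once(check, lambda: wd.write(b"\0"))
            time.sleep(interval)
```

Calling `run()` from the device's init system is the intended use. Keeping the check and the poke as plain callables means the decision logic can be tested without real hardware.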

I also implement reboot-with-no-screen-flicker, so that I can display a basic company logo so things look respectable while there is an outage and everything is bootlooping. Some brands of screen let you upload an image to display if there is no valid signal, which is an easy way to implement this.

Also worth doing a load test of your server to verify that it can withstand all your devices rebooting simultaneously and you won't suffer the thundering herd problem.
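One common mitigation on the device side, not something the setup above necessarily uses, is to randomize the delay before the first server contact after boot. A minimal sketch of "full jitter" backoff follows; the `attempt` counter is assumed to be persisted across reboots somehow, and the cap is assumed to fit inside the watchdog timeout so the delay itself does not trigger a reset.

```python
import random


def startup_delay(attempt, base=1.0, cap=20.0, rng=random):
    """Full-jitter backoff: uniform delay in [0, min(cap, base * 2**attempt)].

    attempt: number of consecutive failed boots (hypothetical, persisted elsewhere).
    cap:     upper bound in seconds; assumed shorter than the watchdog timeout.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

With these defaults a device on its fifth failed boot waits somewhere between 0 and 20 seconds, so a fleet that lost the server at the same moment spreads its reconnects out instead of arriving in one burst.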

ilyt · on Nov 19, 2022

That's generally a pretty hard problem: finding a point where it triggers when the app is unresponsive, but not when, say, it is running something CPU-intensive and just lags for a few seconds.

kqr · on Nov 19, 2022

Isn't that in its most general form exactly the halting problem?

foobiekr · on Nov 19, 2022

The halting problem is for the general case. If you design the application to facilitate it you can demonstrate forward progress in a pretty robust way.

Is it perfect? No. But you can do much, much better than “is the kernel up.”
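As a sketch of what "designing the application to facilitate it" could look like (one possible shape, with made-up names): each task registers the longest gap it is allowed between completed units of real work and calls `mark()` when it finishes one. The watchdog is only stroked while no task has stalled. Per-task gaps also let a CPU-heavy job have a generous limit without loosening the check for everything else.

```python
import threading
import time


class ProgressMonitor:
    """Tracks when each task last completed a unit of real work."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._max_gap = {}  # task name -> max seconds allowed between marks
        self._last = {}     # task name -> time of the last mark

    def register(self, name, max_gap_s):
        with self._lock:
            self._max_gap[name] = max_gap_s
            self._last[name] = self._clock()

    def mark(self, name):
        """Called by the task itself after it finishes real work."""
        with self._lock:
            self._last[name] = self._clock()

    def stalled(self):
        """Return the sorted names of tasks that exceeded their allowed gap."""
        with self._lock:
            now = self._clock()
            return sorted(n for n, gap in self._max_gap.items()
                          if now - self._last[n] > gap)

    def healthy(self):
        return not self.stalled()
```

The watchdog loop would then stroke the hardware only when `healthy()` returns True, and could log `stalled()` before the reset to make the post-mortem easier.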

throwaway9870 · on Nov 19, 2022

I think you are conflating hardware watchdogs and software timers. The hardware watchdog should be there to catch cases where the system is no longer running your software (or the kernel or the watchdog daemon). Timeouts for software catch things like slow procedures, etc.

foobiekr · on Nov 19, 2022

Not at all. In most embedded devices, the point of the software on the device is some specific function which is something more than “the kernel is running” or “the kernel is scheduling the zero-io watchdog process.” You want to actually pick up the case where the kernel is up but, for example, your process is all but dead because your storage driver has wedged.

The goal is to prove forward progress, and the best way to do that is to come as close as possible to proving that your userplane SW is actually working and not dead or, worse, half dead.
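To illustrate the wedged-storage case, here is a minimal sketch of an end-to-end probe with a deadline. The names are hypothetical, and the read-back may be served from the page cache, so this mostly exercises the write and fsync path. A probe stuck in the kernel cannot be cancelled from Python, so it runs in a daemon thread and is simply treated as failed if it does not return in time.

```python
import os
import threading


def storage_probe(path, payload=b"ok"):
    """Write, fsync and read back a small file; raise if anything is off."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    with open(path, "rb") as f:
        if f.read() != payload:
            raise OSError("read-back mismatch")


def probe_ok(probe, timeout_s):
    """Run probe in a daemon thread; True only if it finishes cleanly in time."""
    result = {"ok": False}

    def target():
        try:
            probe()
            result["ok"] = True
        except Exception:
            pass  # any exception counts as a failed probe

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]
```

Something like `probe_ok(lambda: storage_probe("/data/.wd_probe"), 5.0)` (the path is a placeholder) would then be one more condition that has to hold before the watchdog gets stroked.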