Using a hardware watchdog is an alternative and it covers the additional case wh...

dtgriscom · on Nov 19, 2022

I have a network-controllable outlet switch with a timeout feature. You can configure each outlet to ping a specific IP address; if the address doesn't respond after X seconds then the outlet will be power-cycled.

mananaysiempre · on Nov 19, 2022

Did you get that as fabulously expensive datacenter gear, or were you able to squeeze it out of a consumer “smart outlet”? I was looking for a thing like that some weeks ago (for reasons that are now moot), and the easiest option seemed to be to buy a handful of the vendor-app-locked whitelabel crap and see which can be reflashed with open source firmware. Which is not as easy as I’d wish.

yencabulator · on Nov 19, 2022

Shelly relays are known to be easy to control via Wifi. Off the shelf, you'd need to run the actual ping+decide logic elsewhere, though you could re-flash the firmware with custom logic too...

https://shelly-api-docs.shelly.cloud/gen2/General/RPCProtoco...

https://www.amazon.com/stores/page/4E39C18F-DCA3-4726-A8A7-6...

Plus, they're also easy to reflash with other firmwares (e.g. Tasmota, ESPHome).
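
A rough sketch of what the "ping+decide logic" running elsewhere could look like. Everything here is an assumption for illustration: `host_is_up` shells out to the system `ping` with Linux-style flags, and `power_cycle` is a hypothetical callback you would wire to whatever the relay exposes (for instance its HTTP API from the docs linked above). The clock and sleep functions are injectable only so the loop can be exercised without waiting.

```python
import subprocess
import time
from typing import Callable, Optional


def host_is_up(address: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo via the system `ping` (Linux flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(
    probe: Callable[[], bool],
    power_cycle: Callable[[], None],
    max_silence_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    max_checks: Optional[int] = None,
) -> int:
    """Call `power_cycle` whenever `probe` has failed for longer than
    `max_silence_s`. Returns the number of power cycles performed, which is
    only meaningful when `max_checks` bounds the loop."""
    last_ok = clock()
    cycles = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        now = clock()
        if probe():
            last_ok = now
        elif now - last_ok > max_silence_s:
            power_cycle()
            cycles += 1
            last_ok = clock()  # give the device a full window to boot again
        sleep(interval_s)
    return cycles
```

In real use this might be started as `monitor(lambda: host_is_up("192.0.2.10"), my_power_cycle, max_silence_s=60, interval_s=10)`, where the address and `my_power_cycle` are placeholders.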

oakwhiz · on Nov 19, 2022

I want one of these in "plug" form factor that uses a normally-closed relay rather than normally-open. The reason being that the internal power supply and microcontroller could fail by shutting down, preventing things from working at all. Nothing can be done about the microcontroller failing while holding the circuit open, but power supply failure has been a recurring, if infrequent, thing for me lately and it would be nice to cover that case well.

Edit: For small do-it-yourself computer clusters you probably want it to fail normally-open; this would be for non-redundant stuff.

metadat · on Nov 19, 2022

What is the name of this product model? It sounds amazingly useful.

jerrysievert · on Nov 19, 2022

Yes thank you. As someone who built these for a living I can definitely say that this is a really good solution.

foobiekr · on Nov 19, 2022

Just gotta be careful about how you implement it. The real problem to solve is to detect when the application is not making forward progress. I’ve seen this implemented in awful, stupid ways (like spinning off a thread to stroke the watchdog - borderline pointless).

londons_explore · on Nov 19, 2022

For network connected things (which aren't designed to work offline), I like to implement a network watchdog poker...

I.e., every 30 seconds, contact an update server and, if successful, poke the hardware watchdog.

That means every device in your network will either be rebooting, or correctly talking to your server with the latest version of your software. There are no other stuck/error states to consider.

I also implement reboot-with-no-screen-flicker, so that I can display a basic company logo so things look respectable while there is an outage and everything is bootlooping. Some brands of screen let you upload an image to display if there is no valid signal, which is an easy way to implement this.

Also worth doing a load test of your server to verify that it can withstand all your devices rebooting simultaneously and you won't suffer the thundering herd problem.
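
A minimal sketch of such a network watchdog poker, under stated assumptions: the server check is a plain HTTP GET via `urllib`, and `poke` is a hypothetical callback that services the hardware watchdog however your platform does it. Neither is taken from the setup described above.

```python
import time
import urllib.request
from typing import Callable, Optional


def server_reachable(url: str, timeout_s: float = 10.0) -> bool:
    """Return True if an HTTP GET to `url` completes with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Any failure at all (DNS, timeout, HTTP error) means: do not poke.
        return False


def poke_loop(
    check: Callable[[], bool],
    poke: Callable[[], None],
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    max_rounds: Optional[int] = None,
) -> int:
    """Poke the watchdog only after a successful check. Returns the number
    of pokes, which is only meaningful when `max_rounds` bounds the loop."""
    pokes = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        if check():
            poke()
            pokes += 1
        sleep(interval_s)
    return pokes
```

The hardware watchdog timeout would need to be comfortably longer than `interval_s` plus the HTTP timeout, otherwise one slow response reboots the device.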

ilyt · on Nov 19, 2022

That's generally a pretty hard problem: finding a threshold where it triggers when the app is unresponsive, but not when, say, it is running something CPU-intensive and just lags for a few seconds.

kqr · on Nov 19, 2022

Isn't that in its most general form exactly the halting problem?

foobiekr · on Nov 19, 2022

The halting problem is for the general case. If you design the application to facilitate it you can demonstrate forward progress in a pretty robust way.

Is it perfect? No. But you can do much, much better than “is the kernel up.”

throwaway9870 · on Nov 19, 2022

I think you are conflating hardware watchdogs and software timers. The hardware watchdog should be to catch cases where the system is no longer running your software (or the kernel or the watchdog daemon). Timeouts for software catch things like slow procedures, etc.

foobiekr · on Nov 19, 2022

Not at all. In most embedded devices, the point of the software on the device is some specific function which is something more than “the kernel is running” or “the kernel is scheduling the zero-io watchdog process.” You want to actually pick up the case where the kernel is up but, for example, your process is all but dead because your storage driver has wedged.

The goal is to prove forward progress and the best way to do that is to come as close to proving that your userplane SW is actually working and not dead, or worse, half dead.
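
One way to sketch that idea, as an illustration rather than anyone's actual implementation: the watchdog is fed only when every stage of the application has reported a completed unit of real work since the last feed. The class name `ProgressGate`, the stage names, and the `feed` callback are all hypothetical.

```python
import threading
from typing import Callable, Iterable


class ProgressGate:
    """Feed the watchdog only if every registered stage reported progress."""

    def __init__(self, stages: Iterable[str], feed: Callable[[], None]) -> None:
        self._lock = threading.Lock()
        self._pending = set(stages)            # stages that have not reported yet
        self._all = frozenset(self._pending)   # full set, used to re-arm
        self._feed = feed

    def report(self, stage: str) -> None:
        """Called by application code after it finishes a real unit of work."""
        with self._lock:
            if stage not in self._all:
                raise KeyError(stage)
            self._pending.discard(stage)

    def maybe_feed(self) -> bool:
        """Called periodically. Feeds and re-arms only if all stages reported."""
        with self._lock:
            if self._pending:
                return False
            self._pending = set(self._all)
        self._feed()
        return True
```

With stages such as `"storage"` and `"network"`, a wedged storage driver means `report("storage")` stops arriving, `maybe_feed()` keeps returning `False`, and the hardware watchdog eventually fires even though the process calling `maybe_feed()` is still being scheduled.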

throwaway9870 · on Nov 19, 2022

Exactly. Every Intel machine has a hardware watchdog built into the ICH chipset. Make sure it is turned on.

oakwhiz · on Nov 19, 2022

Make sure you turn on the watchdog poker before you turn on the watchdog. Otherwise it can be quite annoying to turn it back off.

metadat · on Nov 19, 2022

I tried it on my Supermicro server and recall it causing problems, so I had to turn it off.

noodlesUK · on Nov 19, 2022

Could you elaborate a little about how these are implemented?

zh3 · on Nov 19, 2022

A true hardware watchdog is separate electronics. For example, it's really easy to make an electronic circuit that operates a relay once every 5 minutes, say, unless a 'restart timer' button is pressed. Connect the relay contacts across the PC reset button, and run a program on the PC that 'presses' (electronically) the 'restart timer' button once a minute, say. Then if the PC fails to boot, it gets reset once every 5 minutes until it does (fairly obviously, another set of relay contacts can be used to trigger an alerting device, e.g. wailing sirens).

We use this sort of approach with diskless systems in particular. If there's a power cut, the first boot attempt after power restoration might not work (because the network isn't back up yet). So the diskless systems just sit there, continually attempting a network boot until successful (at which point the software on the PC hits the 'restart timer' button periodically).

This is closely related to the concept of a "deadman's handle", for example train drivers who must keep a lever pressed down during operation - if it's released, the train stops automatically.

jesse_cureton · on Nov 19, 2022

Preface: my knowledge here is on ARM, particularly baremetal, but also embedded Linux. No idea about Windows or x86.

Generally there’s a hardware watchdog implemented as a counter/timer in the processor. It can have a predefined or configurable period. It counts down, and if it times out then it initiates a hardware reset of the processor.

You can ensure your software/OS is always at least executing code by having a task (in-kernel on Linux, or an RTOS task, or just in your main event loop on baremetal) that resets that timer. Then, if your code stops resetting that timer, it expires and resets the processor.
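
To make the mechanism concrete, here is a small software simulation of such a countdown timer. It is not driver code for any real processor: the class name, the tick granularity, and the `on_reset` callback are assumptions for illustration.

```python
from typing import Callable


class CountdownWatchdog:
    """Simulated watchdog: counts down each tick, resets the system at zero."""

    def __init__(self, period_ticks: int, on_reset: Callable[[], None]) -> None:
        if period_ticks <= 0:
            raise ValueError("period_ticks must be positive")
        self.period = period_ticks
        self.remaining = period_ticks
        self.on_reset = on_reset

    def kick(self) -> None:
        """What the main loop or RTOS task does: reload the counter."""
        self.remaining = self.period

    def tick(self) -> None:
        """One step of the hardware clock driving the counter."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.on_reset()                # stands in for the hardware reset
            self.remaining = self.period   # counter starts over after reset
```

As long as `kick()` is called more often than every `period_ticks` ticks, `on_reset` never runs; once the kicking code stops executing, the reset follows within one period.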

shabble · on Nov 19, 2022

A more specialised variant that's also quite common is the "window watchdog" peripheral, which is similar to the timer version, but will also trigger a reset if the keep-alive signal arrives too early, as well as too late.

It can be useful where you've got a mainloop doing some very predictably timed activities, and allows detection of faults which cause your watchdog servicing to occur too frequently.

I think it's quite common in DSP and things like motor control, where you often have hard realtime requirements and things happening too soon is just as bad as too late.
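
A simulated sketch of the window behaviour, with made-up names and tick units: a kick before the window opens is treated as a fault, just like a missing kick. Real peripherals differ in how the window is configured, so read this as the general idea only.

```python
from typing import Callable


class WindowWatchdog:
    """Simulated window watchdog: kicks are valid only inside
    [window_open_ticks, timeout_ticks) ticks after the previous valid kick."""

    def __init__(
        self,
        window_open_ticks: int,
        timeout_ticks: int,
        on_reset: Callable[[str], None],
    ) -> None:
        if not 0 <= window_open_ticks < timeout_ticks:
            raise ValueError("need 0 <= window_open_ticks < timeout_ticks")
        self.window_open = window_open_ticks
        self.timeout = timeout_ticks
        self.on_reset = on_reset
        self.elapsed = 0

    def tick(self) -> None:
        self.elapsed += 1
        if self.elapsed >= self.timeout:
            self._reset("late")      # no kick arrived before the deadline

    def kick(self) -> None:
        if self.elapsed < self.window_open:
            self._reset("early")     # servicing happened too soon
        else:
            self.elapsed = 0         # valid kick inside the window

    def _reset(self, reason: str) -> None:
        self.on_reset(reason)
        self.elapsed = 0
```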

noodlesUK · on Nov 19, 2022

Is there some way of accessing this from user space on Linux?

jesse_cureton · on Nov 19, 2022

Yes! There’s an ioctl interface for managing the watchdog, and a character device at /dev/watchdog. The kernel docs[1] are a decent jumping off point to learn more.

Upon reading these I did realize on Linux it’s implemented as a kernel device, but it’s usually a userspace task that has to notify the kernel watchdog interface to actually kick the timer. This makes sense, since userspace being functional is probably what you really care about.

[1] https://www.kernel.org/doc/html/latest/watchdog/watchdog-api...
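
A minimal userspace sketch based on the behaviour those kernel docs describe: opening the device arms the watchdog, any write counts as a keepalive, and writing the character `V` just before closing asks the driver to disarm (where the driver supports this "magic close" and the kernel was not built with the no-way-out option). The `healthy` and `should_stop` callbacks and the 10-second interval are assumptions; running it against the real device needs root.

```python
import time
from typing import Callable


def run_feeder(
    healthy: Callable[[], bool],
    should_stop: Callable[[], bool],
    path: str = "/dev/watchdog",
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Feed the kernel watchdog while the application reports itself healthy."""
    # Unbuffered binary mode so each write reaches the driver immediately.
    with open(path, "wb", buffering=0) as wd:
        while not should_stop():
            if healthy():
                wd.write(b"\0")   # any write is treated as a keepalive
            # If unhealthy, skip the write and let the timer run down.
            sleep(interval_s)
        # Clean shutdown requested: ask the driver to disarm before close.
        wd.write(b"V")
```

The interval has to be shorter than the driver's timeout. The same keepalive is also available through the ioctl interface mentioned above.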

touisteur · on Nov 19, 2022

One of the perennial problems of such hardware watchdogs was the inability to discriminate whether a sudden reboot was due to a reset button, hw security (e.g. temperature), an ECC problem, a (micro) loss of power, or the HW watchdog.

On most IPMI-capable BIOS/firmware there's now (been for 10 years but I'm old) an option to log 'system' events (IPMI failures like fan speeds if you've set thresholds, but also reboot reasons). It's called the System Event Log. Very useful.

And on IPMI-plugged watchdogs, you can also see the state of the HW watchdog (is it running, how many seconds are left). Very useful too.

AnssiH · on Nov 19, 2022

In addition to those already mentioned, one way is to enable it in systemd:

  # /etc/systemd/system.conf.d/foobar.conf
  [Manager]
  RuntimeWatchdogSec=60

When used in this manner, if systemd fails to ping the watchdog for 60 seconds, the system resets.

https://www.freedesktop.org/software/systemd/man/systemd-sys...

Somewhat related, nowadays by default systemd enables a 10-minute watchdog just before a regular reboot (i.e. after everything has been shut down) to ensure the reboot happens even if there is a hang for some kernel/HW reason.

amluto · on Nov 19, 2022

https://www.kernel.org/doc/html/latest/watchdog/watchdog-api...

Many x86 systems have a built in hardware watchdog.

ilyt · on Nov 19, 2022

There is a caveat here: it won't catch your app crashing before the watchdog is activated. Some CPUs have a fuse that can enable the watchdog before any code starts running, but the ARMs I played with (STM32) don't appear to have that option.

shabble · on Nov 19, 2022

At least some STM32s do, see page 89 of the STM32F4xx reference manual[1], the option bits 5:7 at 0x1fffc000 let you activate the hardware watchdog immediately following reset if you wish.

[1] https://www.st.com/resource/en/reference_manual/rm0090-stm32...

dtgriscom · on Nov 19, 2022

Embedded micros (Systems On a Chip) often include dedicated watchdog hardware. This is a timer which is reset ("fed") by writing values to a specific register. Crucially, it often isn't just one value; you have to alternate between two values. That way, you can write one value at one point in your event loop and the other value at another point, making it less likely that something will break but keep feeding the watchdog.

If the watchdog hasn't been fed for X milliseconds, then it resets the system.
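
A simulated sketch of the alternating-value scheme. The key values `0x55` and `0xAA`, the class name, and the choice to silently ignore an out-of-sequence write are all assumptions for illustration; real parts define their own values and some treat a wrong write as an immediate fault.

```python
from typing import Callable


class TwoKeyWatchdog:
    """Simulated watchdog that is fed only by writing KEY_A then KEY_B."""

    KEY_A = 0x55  # hypothetical first key
    KEY_B = 0xAA  # hypothetical second key

    def __init__(self, timeout_ticks: int, on_reset: Callable[[], None]) -> None:
        self.timeout = timeout_ticks
        self.on_reset = on_reset
        self.elapsed = 0
        self._expect = self.KEY_A

    def write(self, value: int) -> None:
        """Register write from application code."""
        if value != self._expect:
            return                      # out of sequence: does not feed
        if value == self.KEY_B:
            self.elapsed = 0            # pair completed, timer restarts
            self._expect = self.KEY_A
        else:
            self._expect = self.KEY_B

    def tick(self) -> None:
        self.elapsed += 1
        if self.elapsed >= self.timeout:
            self.on_reset()
            self.elapsed = 0
            self._expect = self.KEY_A
```

Writing `KEY_A` at the top of the event loop and `KEY_B` at the bottom means a loop stuck in one half keeps rewriting the same key, which never completes a pair, so the timer still expires.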

cedws · on Nov 19, 2022

Great point. If the hardware is available I'd use both.