
Using a hardware watchdog is an alternative and it covers the additional case where the system hangs without panicking.


I have a network-controllable outlet switch with a timeout feature. You can configure each outlet to ping a specific IP address; if the address doesn't respond after X seconds then the outlet will be power-cycled.
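For anyone wanting to reproduce that behaviour with their own relay, here is a minimal sketch of the ping-then-power-cycle decision. It is not the outlet's actual firmware: `PingWatchdog`, `ping` and `power_cycle` are hypothetical names, and the two callables stand in for whatever probes the host and toggles the relay in your setup.

```python
import time
from typing import Callable


class PingWatchdog:
    """Power-cycle an outlet when a host stops answering pings.

    `ping` and `power_cycle` are hypothetical callables supplied by the
    caller; a real outlet's firmware will differ.
    """

    def __init__(self, ping: Callable[[], bool],
                 power_cycle: Callable[[], None],
                 timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ping = ping
        self._power_cycle = power_cycle
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_ok = clock()

    def poll(self) -> bool:
        """Run one check. Returns True if the outlet was power-cycled."""
        now = self._clock()
        if self._ping():
            self._last_ok = now
            return False
        if now - self._last_ok >= self._timeout_s:
            self._power_cycle()
            # Restart the timer so the host gets a full timeout to boot.
            self._last_ok = now
            return True
        return False
```

Restarting the timer after a power cycle is a design choice in this sketch: without it, a slow-booting host would be cycled again on the very next poll.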


Did you get that as fabulously expensive datacenter gear, or were you able to squeeze it out of a consumer “smart outlet”? I was looking for something like that a few weeks ago (for reasons that are now moot), and the easiest option seemed to be to buy a handful of the vendor-app-locked whitelabel crap and see which can be reflashed with open source firmware. Which is not as easy as I’d wish.


Shelly relays are known to be easy to control via Wifi. Off the shelf, you'd need to run the actual ping+decide logic elsewhere, though you could re-flash the firmware with custom logic too...

https://shelly-api-docs.shelly.cloud/gen2/General/RPCProtoco...

https://www.amazon.com/stores/page/4E39C18F-DCA3-4726-A8A7-6...

Plus, they're also easy to reflash with other firmwares (e.g. Tasmota, ESPHome).


I want one of these in "plug" form factor that uses a normally-closed relay rather than normally-open. The reason is that the internal power supply or microcontroller could fail by shutting down, which would cut power to the load entirely. Nothing can be done about the microcontroller failing while holding the circuit open, but power supply failure has been a recurring, if infrequent, problem for me lately, and it would be nice to cover that case well.

Edit: For small do-it-yourself computer clusters you probably want it to fail normally-open; this would be for non-redundant stuff.


What is the name of this product model? It sounds amazingly useful.


Yes thank you. As someone who built these for a living I can definitely say that this is a really good solution.


Just gotta be careful about how you implement it. The real problem to solve is detecting when the application isn't making forward progress. I’ve seen this implemented in awful, stupid ways (like spinning off a thread to stroke the watchdog, which is borderline pointless).


For network connected things (which aren't designed to work offline), I like to implement a network watchdog poker...

I.e. every 30 seconds, contact an update server and, if successful, poke the hardware watchdog.

That means every device in your network will either be rebooting, or correctly talking to your server with the latest version of your software. There are no other stuck/error states to consider.
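A rough sketch of that poker, assuming a Linux device that exposes `/dev/watchdog` and an update server that answers a plain HTTP GET. The URL, the function names and the 30-second interval are placeholders for illustration, not anything from a real deployment.

```python
import time
import urllib.request
from typing import Callable, Optional

# Placeholder values for illustration only.
UPDATE_URL = "https://updates.example.com/health"
WATCHDOG_DEV = "/dev/watchdog"


def server_ok(url: str = UPDATE_URL, timeout_s: float = 10.0) -> bool:
    """True only if the update server answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        # Any failure (DNS, TLS, timeout, HTTP error) means "do not poke".
        return False


def poke_loop(check: Callable[[], bool], poke: Callable[[], None],
              interval_s: float = 30.0, rounds: Optional[int] = None,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Poke the watchdog only after a successful check.

    Returns the number of pokes. `rounds=None` loops forever; a finite
    value exists only so the loop can be exercised in tests.
    """
    pokes = 0
    done = 0
    while rounds is None or done < rounds:
        if check():
            poke()
            pokes += 1
        sleep(interval_s)
        done += 1
    return pokes


def run_forever() -> None:
    """Wire the loop to the real device. Opening the device arms the timer."""
    with open(WATCHDOG_DEV, "wb", buffering=0) as dev:
        poke_loop(server_ok, lambda: dev.write(b"\0"))
```

The hardware timeout has to be comfortably longer than the interval plus the HTTP timeout, otherwise one slow response reboots a healthy device.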

I also implement reboot-with-no-screen-flicker, so that I can display a basic company logo so things look respectable while there is an outage and everything is bootlooping. Some brands of screen let you upload an image to display if there is no valid signal, which is an easy way to implement this.

Also worth doing a load test of your server to verify that it can withstand all your devices rebooting simultaneously and you won't suffer the thundering herd problem.


That's generally a pretty hard problem: finding a threshold that triggers when the app is unresponsive but not when, say, it is running something CPU-intensive and just lags for a few seconds.


Isn't that in its most general form exactly the halting problem?


The halting problem is for the general case. If you design the application to facilitate it you can demonstrate forward progress in a pretty robust way.

Is it perfect? No. But you can do much, much better than “is the kernel up.”


I think you are conflating hardware watchdogs and software timers. The hardware watchdog should be to catch cases where the system is no longer running your software (or the kernel or the watchdog daemon). Timeouts for software catch things like slow procedures, etc.


Not at all. In most embedded devices, the point of the software on the device is some specific function which is something more than “the kernel is running” or “the kernel is scheduling the zero-io watchdog process.” You want to actually pick up the case where the kernel is up but, for example, your process is all but dead because your storage driver has wedged.

The goal is to prove forward progress and the best way to do that is to come as close to proving that your userplane SW is actually working and not dead, or worse, half dead.
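One way to approximate "prove forward progress" is to have each critical task report after it finishes a real unit of work, and to feed the watchdog only while every task has reported recently. This is a sketch under that assumption; `ProgressMonitor`, the task names and the silence limits are all made up for illustration.

```python
import time
from typing import Callable, Dict, List


class ProgressMonitor:
    """Feed the watchdog only while every registered task is making progress."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._limits: Dict[str, float] = {}
        self._last_seen: Dict[str, float] = {}

    def register(self, name: str, max_silence_s: float) -> None:
        """Declare a task and how long it may stay silent."""
        self._limits[name] = max_silence_s
        self._last_seen[name] = self._clock()

    def report(self, name: str) -> None:
        """Called by a task after it completes a real unit of work."""
        self._last_seen[name] = self._clock()

    def stalled(self) -> List[str]:
        """Names of tasks that have been silent for longer than allowed."""
        now = self._clock()
        return sorted(name for name, limit in self._limits.items()
                      if now - self._last_seen[name] > limit)

    def maybe_feed(self, feed: Callable[[], None]) -> bool:
        """Call `feed` only if no task is stalled. Returns True if fed."""
        if self.stalled():
            return False
        feed()
        return True
```

The important part is where `report` is called: after a write actually hit storage or a request actually completed, not from a timer thread that keeps running while the real work is wedged.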


Exactly. Every Intel machine has a hardware watchdog built into the ICH chipset. Make sure it is turned on.


Make sure you turn on the watchdog poker before you turn on the watchdog. Otherwise it can be quite annoying to turn it back off.


I tried it on my Supermicro server and recall it causing problems, so I had to turn it off.


Could you elaborate a little about how these are implemented?


A true hardware watchdog is separate electronics. For example, it's really easy to make an electronic circuit that operates a relay once every 5 minutes, say, unless a 'restart timer' button is pressed. Connect the relay contacts across the PC reset button, and run a program on the PC that 'presses' (electronically) the 'restart timer' button once a minute, say. Then if the PC fails to boot, it gets reset once every 5 minutes until it does (fairly obviously, another set of relay contacts can be used to trigger an alerting device, e.g. wailing sirens).

We use this sort of approach with diskless systems in particular. If there's a power cut, the first boot attempt after power restoration might not work (because the network isn't back up yet). So the diskless systems just sit there, continually attempting a network boot until successful (at which point the software on the PC hits the 'restart timer' button periodically).

This is closely related to the concept of a "deadman's handle", for example train drivers who must keep a lever pressed down during operation - if it's released, the train stops automatically.


Preface: my knowledge here is on ARM, particularly baremetal, but also embedded Linux. No idea about Windows or x86.

Generally there’s a hardware watchdog implemented as a counter/timer in the processor. It can have a predefined or configurable period. It counts down, and if it times out then it initiates a hardware reset of the processor.

You can ensure your software/OS is always at least executing code by having a task (in-kernel on Linux, or an RTOS task, or just in your main event loop on baremetal) that resets that timer. Then, if your code stops resetting that timer, it expires and resets the processor.


A more specialised variant that's also quite common is the "window watchdog" peripheral, which is similar to the timer version, but will also trigger a reset if the keep-alive signal arrives too early, as well as too late.

It can be useful where you've got a mainloop doing some very predictably timed activities, and allows detection of faults which cause your watchdog servicing to occur too frequently.

I think it's quite common in DSP and things like motor control, where you often have hard realtime requirements and things happening too soon is just as bad as too late.
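To make the early/late distinction concrete, here is a toy software model of a window watchdog. It is not any specific peripheral: the class name and the millisecond parameters are invented, and real hardware counts timer ticks rather than taking timestamps.

```python
class WindowWatchdog:
    """Toy model: a kick is valid only inside [window_open_ms, timeout_ms)
    measured from the previous valid kick (or from start-up)."""

    def __init__(self, window_open_ms: int, timeout_ms: int):
        if not 0 <= window_open_ms < timeout_ms:
            raise ValueError("need 0 <= window_open_ms < timeout_ms")
        self._window_open_ms = window_open_ms
        self._timeout_ms = timeout_ms
        self._last_kick_ms = 0
        self.reset_triggered = False

    def kick(self, now_ms: int) -> bool:
        """Service the watchdog. Returns False if this kick caused a reset
        or a reset has already happened."""
        if self.reset_triggered:
            return False
        elapsed = now_ms - self._last_kick_ms
        if elapsed < self._window_open_ms or elapsed >= self._timeout_ms:
            # Too early or too late: both count as a fault.
            self.reset_triggered = True
            return False
        self._last_kick_ms = now_ms
        return True
```

With a 20 ms opening and a 50 ms timeout, a kick 30 ms after the last one is accepted, while a kick after 10 ms or after 60 ms trips the reset.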


Is there some way of accessing this from user space on Linux?


Yes! There’s an ioctl interface for managing the watchdog, and a character device at /dev/watchdog. The kernel docs[1] are a decent jumping off point to learn more.

Upon reading these I did realize on Linux it’s implemented as a kernel device, but it’s usually a userspace task that has to notify the kernel watchdog interface to actually kick the timer. This makes sense, since userspace being functional is probably what you really care about.

[1] https://www.kernel.org/doc/html/latest/watchdog/watchdog-api...
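A minimal sketch of the write-based side of that interface from Python, going by the kernel docs: opening the device arms the timer, any write counts as a keepalive, and writing `V` just before closing asks the driver to disarm ("magic close"). Whether the disarm is honoured depends on the driver and on the `nowayout` setting, so treat that part as an assumption. The function names are made up.

```python
import time


def pet(dev) -> None:
    """Any write to the watchdog device counts as a keepalive."""
    dev.write(b"\0")
    dev.flush()


def disarm_and_close(dev) -> None:
    """Magic close: write 'V' immediately before closing.

    Drivers without magic-close support, or built with nowayout,
    keep the timer running regardless.
    """
    dev.write(b"V")
    dev.flush()
    dev.close()


def pet_for(path: str, interval_s: float, rounds: int) -> None:
    """Arm the watchdog, pet it `rounds` times, then try to disarm it."""
    dev = open(path, "wb", buffering=0)  # opening the device starts the timer
    for _ in range(rounds):
        pet(dev)
        time.sleep(interval_s)
    disarm_and_close(dev)
```

If the process dies inside the loop, the device is closed without the magic character and the timer keeps running, which is the behaviour you want from a watchdog.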


One of the perennial problems of such hardware watchdogs was the inability to discriminate whether a sudden reboot was due to a reset button, a hardware protection trip (e.g. temperature), an ECC problem, a (micro) loss of power, or the HW watchdog.

On most IPMI-capable BIOS/firmware there's now (it's been there for 10 years, but I'm old) an option to log 'system' events (IPMI failures like fan speeds if you've set thresholds, but also reboot reasons). It's called the System Event Log. Very useful.

And on IPMI-plugged watchdogs, you can also see the state of the HW watchdog (is it running, how many seconds are left). Very useful too.


In addition to those already mentioned, one way is to enable it in systemd:

  # /etc/systemd/system.conf.d/foobar.conf
  [Manager]
  RuntimeWatchdogSec=60

When used in this manner, if systemd fails to ping the watchdog for 60 seconds, the system resets.

https://www.freedesktop.org/software/systemd/man/systemd-sys...

Somewhat related, nowadays by default systemd enables a 10-minute watchdog just before a regular reboot (i.e. after everything has been shut down) to ensure the reboot happens even if there is a hang for some kernel/HW reason.


https://www.kernel.org/doc/html/latest/watchdog/watchdog-api...

Many x86 systems have a built in hardware watchdog.


There is a caveat here: it won't help if your app crashes before the watchdog is activated. Some CPUs have a fuse that can enable the watchdog before any code starts running, but the ARMs I played with (STM32) don't appear to have that option.


At least some STM32s do, see page 89 of the STM32F4xx reference manual[1], the option bits 5:7 at 0x1fffc000 let you activate the hardware watchdog immediately following reset if you wish.

[1] https://www.st.com/resource/en/reference_manual/rm0090-stm32...


Embedded micros (Systems On a Chip) often include dedicated watchdog hardware. This is a timer which is reset ("fed") by writing values to a specific register. Crucially, it often isn't just one value; you have to alternate between two values. That way, you can write one value at one point in your event loop and the other value at another point, making it less likely that something will break but keep feeding the watchdog.

If the watchdog hasn't been fed for X milliseconds, then it resets the system.
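A toy model of the alternating-value scheme, to show why a loop stuck writing one value never feeds the dog. The key values, the class name and the "wrong write restarts the sequence" rule are illustrative assumptions; real parts define their own keys and may reset immediately on a bad write.

```python
class SequencedWatchdog:
    """Toy model: the timer restarts only after KEY_A then KEY_B, in order."""

    KEY_A = 0x55  # illustrative values, not taken from any specific chip
    KEY_B = 0xAA

    def __init__(self, timeout_ms: int):
        self._timeout_ms = timeout_ms
        self._expected = self.KEY_A
        self._last_feed_ms = 0

    def write(self, value: int, now_ms: int) -> bool:
        """Write to the feed register. Returns True when a complete
        A-then-B sequence restarts the timer."""
        if value != self._expected:
            # Out-of-order write: this model just restarts the sequence.
            self._expected = self.KEY_A
            return False
        if value == self.KEY_A:
            self._expected = self.KEY_B
            return False
        self._expected = self.KEY_A
        self._last_feed_ms = now_ms
        return True

    def expired(self, now_ms: int) -> bool:
        """True once the timeout has elapsed since the last complete feed."""
        return now_ms - self._last_feed_ms >= self._timeout_ms
```

Putting the `KEY_A` write at one point of the event loop and the `KEY_B` write at another means both points have to keep executing for the timer to be restarted.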


Great point. If the hardware is available I'd use both.



