Using the Watchdog Timer in Linux

The Software Watchdog

First, build the Linux kernel with watchdog support (the full guide is located here). Enable the following options:

Device Drivers  --->
  [*] Watchdog Timer Support  --->
    -*-   WatchDog Timer Driver Core
    <*>   Software watchdog
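
To double-check that a kernel was really built with these options, its configuration can be inspected. The following is only a sketch: it assumes the three menu entries above map to the Kconfig symbols CONFIG_WATCHDOG, CONFIG_WATCHDOG_CORE and CONFIG_SOFT_WATCHDOG, and that the running kernel exposes its configuration as /proc/config.gz (which needs CONFIG_IKCONFIG_PROC). The function names are made up for this example.

```python
import gzip

# Kconfig symbols assumed to correspond to the three menu entries above.
REQUIRED = ("CONFIG_WATCHDOG", "CONFIG_WATCHDOG_CORE", "CONFIG_SOFT_WATCHDOG")


def parse_kernel_config(text):
    """Return {symbol: value} for every 'CONFIG_X=value' line."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options appear as comments ('# CONFIG_X is not set').
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def missing_watchdog_options(text):
    """List required symbols that are neither built in (y) nor modules (m)."""
    options = parse_kernel_config(text)
    return [name for name in REQUIRED if options.get(name) not in ("y", "m")]


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's configuration as text."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

Calling missing_watchdog_options(read_running_config()) should return an empty list on a kernel that has all three options enabled.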

After a reboot with the new kernel there should be a /dev/watchdog file:

[root@alarm ~]# ls -l /dev/watchdog
crw------- 1 root root 10, 130 Jan  1 01:00 /dev/watchdog
[root@alarm ~]#

Next, install a watchdog daemon:

[root@alarm ~]# pacman -Sy watchdog
:: Synchronizing package databases...
core is up to date
extra                    450.8 KiB  7.72K/s 00:58 [######################] 100%
community                483.5 KiB  6.38K/s 01:16 [######################] 100%
alarm is up to date
aur is up to date
resolving dependencies...
looking for inter-conflicts...

Targets (1): watchdog-5.12-2

Total Download Size:    0.04 MiB
Total Installed Size:   0.18 MiB

Proceed with installation? [Y/n] y
:: Retrieving packages from extra...
watchdog-5.12-2-arm       41.9 KiB  76.1K/s 00:01 [######################] 100%
(1/1) checking package integrity                   [######################] 100%
(1/1) loading package files                        [######################] 100%
(1/1) checking for file conflicts                  [######################] 100%
(1/1) checking available disk space                [######################] 100%
(1/1) installing watchdog                          [######################] 100%
[root@alarm ~]#

List the files that get installed by the watchdog package:

[root@alarm ~]# pacman -Ql watchdog
watchdog /etc/
watchdog /etc/conf.d/
watchdog /etc/conf.d/watchdog
watchdog /etc/conf.d/wd_keepalive
watchdog /etc/rc.d/
watchdog /etc/rc.d/watchdog
watchdog /etc/rc.d/wd_keepalive
watchdog /etc/watchdog.conf
watchdog /usr/
watchdog /usr/lib/
watchdog /usr/lib/systemd/
watchdog /usr/lib/systemd/system/
watchdog /usr/lib/systemd/system/watchdog.service
watchdog /usr/sbin/
watchdog /usr/sbin/watchdog
watchdog /usr/sbin/wd_identify
watchdog /usr/sbin/wd_keepalive
watchdog /usr/share/
watchdog /usr/share/man/
watchdog /usr/share/man/man5/
watchdog /usr/share/man/man5/watchdog.conf.5.gz
watchdog /usr/share/man/man8/
watchdog /usr/share/man/man8/watchdog.8.gz
watchdog /usr/share/man/man8/wd_identify.8.gz
watchdog /usr/share/man/man8/wd_keepalive.8.gz
[root@alarm ~]#

This looks interesting: /usr/lib/systemd/system/watchdog.service is a systemd service file.

Starting and stopping the watchdog:

The watchdog starts automatically as soon as /dev/watchdog is opened. To stop the watchdog, you will need to:

  • Write the character V into /dev/watchdog. Requiring this character prevents the watchdog from being stopped accidentally.
  • Close the file /dev/watchdog. This only works if your kernel is not compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled. When this option is enabled, the watchdog cannot be stopped at all.

After the watchdog has been enabled, its timer has to be reset at least once every 60 seconds; otherwise your system gets rebooted. The watchdog daemon resets the timer as long as none of its tests fails.
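
The open, reset and stop sequence described above can be sketched in a few lines of Python. This is an illustration only, not how the watchdog daemon itself is implemented. The function name and its parameters are invented for this example, and running it against the real device as root really does arm the watchdog.

```python
import os
import time


def pet_watchdog(device="/dev/watchdog", beats=3, pause=10.0, sleep=time.sleep):
    """Arm the watchdog, reset its timer a few times, then stop it again."""
    # Opening the device is what starts the watchdog.
    fd = os.open(device, os.O_WRONLY)
    try:
        for _ in range(beats):
            os.write(fd, b"\0")  # a write resets the countdown
            sleep(pause)         # must stay well below the timeout
        # Write the character V last, so that the close below stops the
        # watchdog (unless CONFIG_WATCHDOG_NOWAYOUT is enabled).
        os.write(fd, b"V")
    finally:
        # If an error skipped the V above, the watchdog keeps running
        # after this close and will eventually reboot the system.
        os.close(fd)
```

The sleep parameter exists only so the function can be tried against an ordinary file without waiting.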

The watchdog daemon supports the following tests to check the system status:

  • Is the process table full?
  • Is there enough free memory?
  • Are some files accessible?
  • Have some files changed within a given interval?
  • Is the average work load too high?
  • Has a file table overflow occurred?
  • Is a process still running? The process is specified by a pid file.
  • Do some IP addresses answer to ping?
  • Do network interfaces receive traffic?
  • Is the temperature too high? (Temperature data not always available.)
  • Execute a user defined command to do arbitrary tests.
  • Execute one or more test/repair commands found in /etc/watchdog.d. These commands are called with the argument test or repair.
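
As an illustration of the last item, here is a minimal sketch of a test/repair command. It assumes the usual convention that exit status 0 means "healthy" and any other status means "failed"; check the watchdog man page for the exact arguments your version passes in repair mode. The heartbeat file path and the function name are hypothetical.

```python
import os

# Hypothetical file that some application is expected to keep in place.
HEARTBEAT_FILE = "/run/myapp/heartbeat"


def main(argv, path=HEARTBEAT_FILE):
    """Return the exit status for 'test' or 'repair' mode."""
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "test":
        # Healthy (0) only if the heartbeat file exists.
        return 0 if os.path.exists(path) else 1
    if mode == "repair":
        # This sketch attempts no repair and simply reports failure.
        return 1
    return 2  # called without a known mode
```

Saved as an executable script in /etc/watchdog.d, it would additionally need a shebang line, import sys, and end with sys.exit(main(sys.argv)).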

The configuration file should be self-explanatory:

[root@alarm ~]# cat /etc/watchdog.conf
#ping                   = 172.31.14.1
#ping                   = 172.26.1.255
#interface              = eth0
#file                   = /var/log/messages
#change                 = 1407

# Uncomment to enable test. Setting one of these values to '0' disables it.
# These values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher than 25)
#max-load-1             = 24
#max-load-5             = 18
#max-load-15            = 12

# Note that this is the number of pages!
# To get the real size, check how large the pagesize is on your machine.
#min-memory             = 1

#repair-binary          = /usr/sbin/repair
#repair-timeout         =
#test-binary            =
#test-timeout           =

#watchdog-device        = /dev/watchdog

# Defaults compiled into the binary
#temperature-device     =
#max-temperature        = 120

# Defaults compiled into the binary
#admin                  = root
#interval               = 1
#logtick                = 1
#log-dir                = /var/log/watchdog

# This greatly decreases the chance that watchdog won't be scheduled before
# your machine is really loaded
realtime                = yes
priority                = 1

# Check if syslogd is still running by enabling the following line
#pidfile                = /var/run/syslogd.pid

[root@alarm ~]#
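
The comment above min-memory points out that the value is a number of pages, not bytes. A small sketch of the conversion follows. The helper name is made up, and the page size is read from the running system unless given explicitly.

```python
import os


def pages_for(size_bytes, page_size=None):
    """Number of whole pages needed to hold size_bytes (rounded up)."""
    if page_size is None:
        # Page size of the running system (4096 bytes on many machines).
        page_size = os.sysconf("SC_PAGE_SIZE")
    return -(-size_bytes // page_size)  # ceiling division
```

For example, with 4096-byte pages a 10 MiB threshold is pages_for(10 * 1024 * 1024, 4096), i.e. 2560 pages.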

Now we will start the watchdog daemon. At this point it should still be disabled:

[root@alarm ~]# systemctl status watchdog.service
watchdog.service - Watchdog Daemon
          Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled)
          Active: inactive (dead)

[root@alarm ~]#

For testing purposes, I’ve added the following to my /etc/watchdog.conf:

ping = google.de

So when my WiFi connection is lost, my system should reboot.

Start the watchdog daemon:

[root@alarm ~]# systemctl start watchdog.service
Job for watchdog.service failed. See 'systemctl status watchdog.service' and 'journalctl -xn' for details.
[root@alarm ~]# systemctl status watchdog.service
watchdog.service - Watchdog Daemon
          Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled)
          Active: failed (Result: exit-code) since Thu 1970-01-01 01:13:01 CET; 53s ago
        Process: 208 ExecStart=/usr/sbin/watchdog (code=exited, status=1/FAILURE)

[root@alarm ~]# watchdog -v
watchdog: unknown host google.de
[root@alarm ~]# ping google.de
PING google.de (173.194.65.94) 56(84) bytes of data.
64 bytes from ee-in-f94.1e100.net (173.194.65.94): icmp_seq=1 ttl=47 time=66.7 ms
64 bytes from ee-in-f94.1e100.net (173.194.65.94): icmp_seq=2 ttl=47 time=64.6 ms

--- google.de ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 64.604/65.667/66.730/1.063 ms
[root@alarm ~]#

OK, the watchdog daemon fails to start even though a normal ping resolves the host name without problems. The ping option of watchdog only supports numeric IPv4 addresses, so I will have to use the IP address instead:

ping = 173.194.65.94
interface = wlan0
interval = 20
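
Since a host name in a ping line keeps the daemon from starting, it can be worth checking the configuration beforehand. The following sketch uses Python's standard ipaddress module. The function names are invented for this example, and the parsing is a simplification of the watchdog.conf syntax, not the daemon's own parser.

```python
import ipaddress


def is_numeric_ipv4(value):
    """True if value is a dotted-quad IPv4 address such as 173.194.65.94."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ipaddress.AddressValueError:
        return False


def check_ping_targets(conf_text):
    """Return all 'ping = ...' targets that are not numeric IPv4 addresses."""
    bad = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "ping" and not is_numeric_ipv4(value):
            bad.append(value)
    return bad
```

Running check_ping_targets(open("/etc/watchdog.conf").read()) on the first attempt above would report google.de.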

In general it is safer to ping your router: packets to a remote host can get lost or delayed, Google's IP address may change, and your IP may get blocked if you send ping requests to Google 24/7.

And it works:

[root@alarm ~]# watchdog -v
[root@alarm ~]# tail -n 10 /var/log/messages.log
Feb  2 18:35:15 alarm watchdog[203]: got answer from target 173.194.65.94
Feb  2 18:35:25 alarm watchdog[203]: still alive after 6 interval(s)
Feb  2 18:35:25 alarm watchdog[203]: device wlan0 received 7176 bytes
Feb  2 18:35:25 alarm watchdog[203]: got answer from target 173.194.65.94
Feb  2 18:35:35 alarm watchdog[203]: still alive after 7 interval(s)
Feb  2 18:35:35 alarm watchdog[203]: device wlan0 received 7316 bytes
Feb  2 18:35:36 alarm watchdog[203]: got answer from target 173.194.65.94
Feb  2 18:35:46 alarm watchdog[203]: still alive after 8 interval(s)
Feb  2 18:35:46 alarm watchdog[203]: device wlan0 received 7726 bytes
Feb  2 18:35:46 alarm watchdog[203]: got answer from target 173.194.65.94
[root@alarm ~]# killall watchdog
[root@alarm ~]# killall watchdog
[root@alarm ~]# killall watchdog
watchdog: no process found
[root@alarm ~]# systemctl start watchdog.service
[root@alarm ~]# systemctl status watchdog.service
watchdog.service - Watchdog Daemon
          Loaded: loaded (/usr/lib/systemd/system/watchdog.service; disabled)
          Active: active (running) since Sat 2013-02-02 18:37:09 CET; 2min 13s ago
        Process: 259 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
        Main PID: 261 (watchdog)
          CGroup: name=systemd:/system/watchdog.service
                   └───261 /usr/sbin/watchdog

[root@alarm ~]# tail -n 10 /var/log/messages.log
Feb  2 18:37:02 alarm watchdog[203]: stopping daemon (5.12)
Feb  2 18:37:08 alarm systemd[1]: Starting Watchdog Daemon...
Feb  2 18:37:09 alarm watchdog[261]: starting daemon (5.12):
Feb  2 18:37:09 alarm watchdog[261]: int=20s realtime=yes sync=no soft=no mla=0 mem=0
Feb  2 18:37:09 alarm watchdog[261]: ping: 173.194.65.94
Feb  2 18:37:09 alarm watchdog[261]: file: no file to check
Feb  2 18:37:09 alarm watchdog[261]: pidfile: no server process to check
Feb  2 18:37:09 alarm watchdog[261]: interface: wlan0
Feb  2 18:37:09 alarm watchdog[261]: test=none(0) repair=none(0) alive=none heartbeat=none temp=none to=root no_act=no
Feb  2 18:37:09 alarm systemd[1]: Started Watchdog Daemon.
[root@alarm ~]#

Now disconnect the WiFi and voilà: after at most 60 seconds the system reboots:

[root@alarm ~]#
[  990.540000] usb 1-1: USB disconnect, device number 2
[  990.880000] usb 1-1: ath9k_htc: USB layer deinitialized
[ 1036.300000] Restarting system.
HTLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLFC
PowerPrep start initialize power...
Battery Voltage = 3.32V
Chargeable battery detected but
...

Once everything is working correctly, we can enable the watchdog at boot:

[root@alarm ~]# systemctl enable watchdog.service

The Hardware Watchdog

The software watchdog module offers, of course, no protection against a kernel fault, but hardware watchdog support is coming for the iMX233-OLinuXino.

Have a look at chapter 23 of the iMX233 Reference Manual (17.5 MB):

23.7 Watchdog Reset Function

The watchdog reset is a CPU-configurable device. It is programmed by software to generate a chip-wide reset after HW_RTC_WATCHDOG milliseconds. The watchdog generates this reset if software does not rewrite this register before this time elapses.

The watchdog timer decrements the register value once for every tick of the 1-kHz clock supplied from the RTC analog section (see Figure 23-1). The reset generated by the watchdog timer has no effect on the values retained in the master registers of the real-time clock seconds counter, alarm, or persistent registers (analog persistent storage).

The watchdog timer is initially disabled and set to count 4,294,967,295 milliseconds before generating a watchdog reset.

The watchdog timer does not run when the chip is in its powered-down state. Therefore, there is no master/shadow register pairing for the watchdog timer, and it must be reprogrammed after cycling power or resetting the block.
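
To get a feeling for the numbers in the quoted section: the counter ticks at 1 kHz, so one count is one millisecond. The short sketch below works out how long the initial value of 4,294,967,295 lasts and which count corresponds to a timeout given in seconds. The helper names are made up for this example.

```python
MAX_COUNT_MS = 4_294_967_295  # initial value quoted from the manual (2**32 - 1)


def ms_to_days(milliseconds):
    """Convert a millisecond count into days."""
    return milliseconds / 1000 / 60 / 60 / 24


def seconds_to_ticks(seconds, tick_hz=1000):
    """Register value for a timeout in seconds at the given tick rate."""
    return int(seconds * tick_hz)
```

ms_to_days(MAX_COUNT_MS) comes out at roughly 49.7 days, and a timeout of, say, 19 seconds corresponds to a register value of 19000.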

I’ve seen a kernel option (<*>   Freescale STMP3XXX & i.MX23/28 watchdog) on newer kernels and also some log messages:

[    1.300000] stmp3xxx-rtc 8005c000.rtc: rtc core: registered 8005c000.rtc as rtc0
 Watchdog: stmp3xxx-rtc 8005c000.rtc: setting system clock to 1970-01-01 00:00:05 UTC (5)

Now I have 3 watchdog devices:

[root@olinuxino ~]# ls /dev/watchdog*
/dev/watchdog  /dev/watchdog0  /dev/watchdog1
[root@olinuxino ~]# cat /var/log/messages.log |grep watchdog
Dec 31 18:00:19 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
Dec 31 18:28:00 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
Dec 31 18:00:21 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
Dec 31 18:00:22 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
Dec 31 18:00:21 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
Dec 31 18:00:22 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
Dec 31 18:00:21 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
Dec 31 18:00:22 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
Dec 31 18:00:20 olinuxino kernel: [    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
[root@olinuxino ~]# dmesg |grep watchdog
[    1.470000] stmp3xxx_rtc_wdt stmp3xxx_rtc_wdt: initialized watchdog with heartbeat 19s
[root@olinuxino ~]# grep watchdog /var/log/syslog.log
[root@olinuxino ~]# cat /proc/version
Linux version 3.9.0-dirty (chris@thinkpad) (gcc version 4.7.3 20130312 (release) [ARM/embedded-4_7-branch revision 196615] (GNU Tools for ARM Embedded Processors) ) #1 Mon May 6 12:29:05 CEST 2013
[root@olinuxino ~]#

But which is the hardware watchdog?

[root@olinuxino ~]# echo "watchdog-device = /dev/watchdog" > test.conf
[root@olinuxino ~]# wd_identify --config-file ./test.conf
STMP3XXX RTC Watchdog
[root@olinuxino ~]# echo "watchdog-device = /dev/watchdog0" > test.conf
[root@olinuxino ~]# wd_identify --config-file ./test.conf
STMP3XXX RTC Watchdog
[root@olinuxino ~]# echo "watchdog-device = /dev/watchdog1" > test.conf
[root@olinuxino ~]# wd_identify --config-file ./test.conf
Software Watchdog
[root@olinuxino ~]#
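
On newer kernels built with CONFIG_WATCHDOG_SYSFS the same information can be read from sysfs without opening the device nodes at all. This is an assumption about newer systems and is probably not available on the 3.9 kernel used here. A minimal sketch, with a made-up function name:

```python
import glob
import os


def watchdog_identities(sysfs_root="/sys/class/watchdog"):
    """Map each watchdogN entry to the text in its 'identity' attribute."""
    result = {}
    for entry in sorted(glob.glob(os.path.join(sysfs_root, "watchdog*"))):
        identity_file = os.path.join(entry, "identity")
        try:
            with open(identity_file) as handle:
                result[os.path.basename(entry)] = handle.read().strip()
        except OSError:
            # Attribute missing (no CONFIG_WATCHDOG_SYSFS) or not readable.
            continue
    return result
```

Unlike opening /dev/watchdog, reading these attributes does not start the watchdog timer.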

So by default the hardware watchdog timer gets assigned to /dev/watchdog, which makes sense. I haven’t yet tested whether the hardware watchdog timer actually works on the OLinuXino, but I think it does.