Resource Watchdog

The Resource Watchdog monitors system health information, such as CPU usage, process status, and ethernet interface states and addresses, and reports it in the syslog of the Connect. The Watchdog is configured through its own configuration file.

You can redirect the syslog of the Connect to either the RFS or a client. For more information, see syslog.

Enabling or disabling the Watchdog

The Resource Watchdog is disabled by default. When enabled, it starts automatically at Connect boot. To enable it, or to disable it again, go in the FPUI to Menu System > System Configuration > Watchdog Config > Enable watchdog and select the appropriate setting (ENABLE or DISABLE).

The Connect must reboot before the Enable or Disable setting takes effect.

You can also modify the [nethsm_watchdog] section in the Connect Configuration File by setting enable_watchdog to yes or no. Then, push it to the Connect as usual using cfg-pushnethsm.

See the nShield Connect User Guide section Configuring the nShield Connect to use the client for more information.

The Watchdog automatically applies the new setting after a cycle (usually up to a minute).

Understanding the default settings

The default Watchdog Configuration File is as follows:

Network Devices:
- eth0
- eth1
- nshield0
Monitored Processes:
- hardserver
- netui
- cosmod
System Information Report:
  enabled: False
  interval: 1h
CPU Usage Monitoring:
  threshold: 180%
  interval: 60s
  frequency: 1h

Network Devices

The Watchdog monitors when the ethernet interfaces listed in its configuration file change state. By default, it monitors eth0, eth1 and nshield0 (only available on nShield 5c). You can add or remove ethernet interfaces from the configuration file.

A typical report looks like this:

Mar 13 10:47:50 nethsm nfwatchdog: Interface eth0 is now up.
Mar 13 10:47:50 nethsm nfwatchdog: Interface eth1 is now down.

The Watchdog also monitors address changes on these interfaces. A typical report looks like this:

Mar 13 10:47:50 nethsm nfwatchdog: The addresses of interface eth0 have been set to ["00:60:e0:87:a1:59", "172.23.135.129"].
Mar 13 13:58:25 nethsm nfwatchdog: The addresses of interface eth0 have changed from ["00:60:e0:87:a1:59", "172.23.135.8"] to ["00:60:e0:87:a1:59", "172.23.135.129"].
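If you collect the syslog on a client, interface reports like the ones above can be turned into structured events. The following is a minimal sketch that assumes the message wording matches the samples shown in this section exactly; the function name and the returned dictionary layout are hypothetical and are not part of the Watchdog.

```python
import json
import re

# Patterns modelled on the sample reports above (assumed wording).
STATE_RE = re.compile(r"nfwatchdog: Interface (\S+) is now (up|down)\.")
CHANGED_RE = re.compile(
    r"nfwatchdog: The addresses of interface (\S+) have changed "
    r"from (\[.*?\]) to (\[.*\])\.$"
)
SET_RE = re.compile(
    r"nfwatchdog: The addresses of interface (\S+) have been set to (\[.*\])\.$"
)


def parse_interface_event(line):
    """Return a dict describing an interface report, or None if no match."""
    line = line.strip()
    match = STATE_RE.search(line)
    if match:
        return {"interface": match.group(1), "event": "state",
                "state": match.group(2)}
    match = CHANGED_RE.search(line)
    if match:
        # The address lists are printed as JSON arrays in the samples.
        return {"interface": match.group(1), "event": "addresses",
                "old": json.loads(match.group(2)),
                "new": json.loads(match.group(3))}
    match = SET_RE.search(line)
    if match:
        return {"interface": match.group(1), "event": "addresses",
                "old": None, "new": json.loads(match.group(2))}
    return None
```

Lines that are not interface reports return None, so the function can be applied to every syslog line in turn.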

Monitored Processes

The Watchdog monitors when the processes listed in its configuration file start and stop. By default, it monitors hardserver, netui and cosmod. You can add or remove processes from the configuration file. Provide only the process name; do not include the path.

A typical report looks like this:

Mar 13 10:47:50 nethsm nfwatchdog: Process netui is not running.
Mar 13 10:47:50 nethsm nfwatchdog: Process hardserver is running: {"arguments": ["../sbin/hardserver", "-Llogfile"], "children": [{"arguments": ["/opt/nfast/python3/bin/python3", "-uIBm", "nshield.entrypoint", "--", "/opt/nfast/bin/hsc_servicehosts"], "children": [], "name": "python3", "pid": 607, "status": "running", "uptime": "0:00:01"}], "name": "hardserver", "pid": 598, "status": "sleeping", "uptime": "0:00:01"}

If the process is running, the Watchdog reports, in JSON format, the process name, arguments, PID, uptime, status and children, if any.
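Because the process details are JSON, they can be read back with a standard JSON parser. The sketch below assumes the report prefix is exactly "is running: " as in the sample above; the function names are hypothetical. It extracts the JSON payload and walks the children tree.

```python
import json

# Sample line copied from the report above.
SAMPLE = 'Mar 13 10:47:50 nethsm nfwatchdog: Process hardserver is running: {"arguments": ["../sbin/hardserver", "-Llogfile"], "children": [{"arguments": ["/opt/nfast/python3/bin/python3", "-uIBm", "nshield.entrypoint", "--", "/opt/nfast/bin/hsc_servicehosts"], "children": [], "name": "python3", "pid": 607, "status": "running", "uptime": "0:00:01"}], "name": "hardserver", "pid": 598, "status": "sleeping", "uptime": "0:00:01"}'


def parse_process_report(line):
    """Return the process details as a dict, or None for other lines."""
    _, marker, payload = line.partition("is running: ")
    if not marker:
        return None
    return json.loads(payload)


def walk(process, depth=0):
    """Yield (depth, name, pid, status) for a process and its descendants."""
    yield depth, process["name"], process["pid"], process["status"]
    for child in process.get("children", []):
        yield from walk(child, depth + 1)


if __name__ == "__main__":
    report = parse_process_report(SAMPLE)
    for depth, name, pid, status in walk(report):
        print("  " * depth + f"{name} (pid {pid}): {status}")
```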

System Information Report

The Watchdog can report System Information. By default, the report is disabled. To enable it, set System Information Report > enabled to True.

By default, the System Information Report is printed every hour (if enabled). To change the interval, set System Information Report > interval to the desired interval. This field can be expressed in seconds, minutes, hours, or days. Examples: 8h, 2d8h5m20s, 2m4s. A unit is required.
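To check an interval string before pushing a configuration, the format described above can be approximated as a sequence of number-and-unit pairs. This is a minimal sketch, not the Watchdog's own parser: it assumes whole numbers, the units d, h, m and s, and no spaces, and the function name is hypothetical.

```python
import re

# Seconds per unit, for the units named in this section.
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([dhms])")


def parse_interval(text):
    """Convert a string such as '2d8h5m20s' to a number of seconds."""
    text = text.strip()
    tokens = _TOKEN.findall(text)
    # Reject empty input, missing units and any stray characters.
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"not a valid interval (a unit is required): {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in tokens)
```

With this sketch, `parse_interval("8h")` gives 28800 and `parse_interval("2m4s")` gives 124, while a bare number such as `"60"` raises a ValueError because it has no unit.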

A typical report looks like this:

Mar 13 14:05:32 nethsm nfwatchdog: System Report for the Connect is {"system_uptime": "3:20:13.930539", "virtual_memory": {"total": 8253513728, "free": 7962390528, "available": 7996977152, "used": 173985792}, "load_avg": [0.09, 0.04, 0.01]}

It contains the following information in JSON format:

  • system uptime
  • virtual memory in bytes (total, free, available and used, as defined in psutil)
  • system load over the last 1, 5 and 15 minutes, as defined in psutil. The “load” represents the processes which are in a runnable state, either using the CPU or waiting to use the CPU (for example, waiting for disk I/O).
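The report can be read back on a client for simple checks, for example to see how much memory is still available. The sketch below assumes the prefix "System Report for the Connect is " shown in the sample above; the function names are hypothetical.

```python
import json

# Sample line copied from the report above.
SAMPLE = 'Mar 13 14:05:32 nethsm nfwatchdog: System Report for the Connect is {"system_uptime": "3:20:13.930539", "virtual_memory": {"total": 8253513728, "free": 7962390528, "available": 7996977152, "used": 173985792}, "load_avg": [0.09, 0.04, 0.01]}'


def parse_system_report(line):
    """Return the System Report as a dict, or None for other lines."""
    _, marker, payload = line.partition("System Report for the Connect is ")
    if not marker:
        return None
    return json.loads(payload)


def available_memory_fraction(report):
    """Fraction of total virtual memory reported as available."""
    memory = report["virtual_memory"]
    return memory["available"] / memory["total"]


if __name__ == "__main__":
    report = parse_system_report(SAMPLE)
    one, five, fifteen = report["load_avg"]
    print(f"load (1/5/15 min): {one} {five} {fifteen}")
    print(f"memory available: {available_memory_fraction(report):.1%}")
```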

| Field    | Value                                                        | Comment |
|----------|--------------------------------------------------------------|---------|
| enabled  | True or False, default: False.                               | Whether the System Information Report is enabled. |
| interval | An interval in seconds, minutes, hours or days, default: 1h. | The interval at which the System Information Report is printed, if enabled. A unit (s, m, h or d) is required. |

CPU Usage Monitoring

The Watchdog monitors both the global CPU usage on the Connect and the CPU usage per process. Should the total CPU usage or the CPU usage of any process (as computed with psutil’s cpu_percent function) be above the threshold (default: 180%) for a sustained interval (default: 60s), then the Watchdog will report it. This check is performed every hour by default (CPU Usage Monitoring > frequency).

A typical report looks like this:

Mar 13 10:48:44 nethsm nfwatchdog: Global CPU usage for the Connect is 92.1%, higher than the threshold of 90%.
Mar 13 10:48:44 nethsm nfwatchdog: CPU usage for the the following processes is higher than the threshold of 90%.
Mar 13 10:48:44 nethsm nfwatchdog: Process hardserver, 90.5%: {"arguments": ["../sbin/hardserver", "-Llogfile"], "children": [{"arguments": ["../sbin/hardserver", "--spawn-svc"], "children": [], "name": "hardserver", "pid": 620, "status": "sleeping", "uptime": "0:00:54"}, {"arguments": ["/opt/nfast/bin/ncssh", "-id", "/opt/nfast/services/client/ncoreapi/ssh/id_ecdsa", "-known-hosts", "/opt/nfast/services/module/5CA8-5A32-80F2/ncoreapi/known_hosts", "-hosts", "/opt/nfast/services/module/hosts.txt", "-hostname", "nshield-5CA8-5A32-80F2.local", "-port", "2203", "-user", "ncoreapi", "ncoreapi", "operational"], "children": [], "name": "ncssh", "pid": 626, "status": "sleeping", "uptime": "0:00:54"}], "name": "hardserver", "pid": 598, "status": "sleeping", "uptime": "0:00:55"}

For each process, the Watchdog reports the process name, the CPU usage as reported by psutil’s cpu_percent, and in JSON format: the process name, arguments, PID, children, status and uptime.

The threshold must be a number between 0% and 100% x (number of CPUs on the Connect). The % sign is optional. The default is 180%.
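The threshold rule above can be checked before pushing a configuration. The following sketch is an approximation, not the Watchdog's own validation: the function name is hypothetical, and the number of CPUs is passed in as a parameter because it depends on the Connect model.

```python
def parse_threshold(value, cpu_count):
    """Return the threshold as a float, in percent.

    Accepts a number or a string with an optional trailing '%', and
    requires 0 <= threshold <= 100 * cpu_count.
    """
    text = str(value).strip()
    if text.endswith("%"):
        text = text[:-1].strip()
    try:
        number = float(text)
    except ValueError:
        raise ValueError(f"threshold must be a number, not {value!r}") from None
    upper = 100.0 * cpu_count
    # The comparison is also False for NaN, so NaN is rejected here.
    if not 0.0 <= number <= upper:
        raise ValueError(
            f"threshold must be between 0% and {upper:g}%, not {value!r}"
        )
    return number
```

For example, with `cpu_count=2` the default value `"180%"` is accepted and returned as 180.0, while `"250%"` is rejected as out of range.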

The interval is the time during which the Watchdog samples the CPU usage. This interval is blocking, that is to say that the Watchdog is not able to perform any other monitoring while the sampling is happening. The default is 60s. This field can be expressed in seconds, minutes, hours, or days. Examples: 8h, 2d8h5m20s, 2m4s. A unit is required.

The frequency sets how often the Watchdog performs the sampling. By default, the Watchdog performs the CPU usage sampling for 60s every hour (frequency = 1h). This field can be expressed in seconds, minutes, hours, or days. Examples: 8h, 2d8h5m20s, 2m4s. A unit is required.
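Because the sampling interval is blocking, the ratio of interval to frequency gives a rough idea of how much of the time the Watchdog cannot perform other monitoring. The sketch below assumes that the frequency is measured from the start of one sampling run to the start of the next, which this section does not state; the function name is hypothetical.

```python
def blocked_fraction(interval_s, frequency_s):
    """Approximate fraction of time spent in blocking CPU sampling.

    Both arguments are in seconds. Assumes one sampling run of
    interval_s seconds starts every frequency_s seconds.
    """
    if interval_s <= 0 or frequency_s <= 0:
        raise ValueError("interval and frequency must be positive")
    # Sampling cannot occupy more than all of the available time.
    return min(1.0, interval_s / frequency_s)
```

With the defaults (60s every 1h), `blocked_fraction(60, 3600)` is about 0.017, so under this assumption sampling occupies under 2% of the time. Setting the interval close to the frequency leaves little time for the other checks.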

| Field     | Value                                                                  | Comment |
|-----------|------------------------------------------------------------------------|---------|
| threshold | A percentage between 0% and 100% x number of CPUs, default: 180%.      | Threshold above which global CPU and process CPU usage is reported. The % sign is optional. |
| interval  | An interval in seconds, minutes, hours or days, default: 60s.          | The interval during which the Watchdog computes the CPU usage (both total and per process). A unit (s, m, h or d) is required. |
| frequency | A frequency in seconds, minutes, hours or days, default: 1h.           | The frequency at which the Watchdog monitors the CPU usage. A unit (s, m, h or d) is required. |

Other features

Zombie processes

In addition to the configurable features above, the Watchdog also monitors zombie processes, that is, processes that have exited and are waiting for their parent.

A typical report looks like this:

Aug  7 03:47:46 nethsm nfwatchdog: Information: Process netui exited, awaiting parent.
Aug  7 03:47:51 nethsm nfwatchdog: Information: 3 processes (netui, cosmod, hardserver) exited, awaiting parent.

Report filtering

Should any Watchdog event occur several times in a row, the reports will be filtered out, so as not to flood the log. This will be reported by the Watchdog in the following way:

May  9 09:35:47 nethsm nfwatchdog: Global CPU usage for the Connect is 200.0%, higher than the threshold of 180.0%.

Reading or modifying the Watchdog Configuration File

  1. Create a Watchdog Configuration file called nfwatchdog.yml.new on your client with your changes.

  2. Prepare to push the new configuration file. Perform one of the following:

    • On the Connect UI:

      1. On Menu System > System Configuration > Watchdog Config > Config push mode, set push to on.

      2. On Menu System > System Configuration > Watchdog Config > Client address, set the address of your client.

    • Update the Connect Configuration File directly:

      1. Open the file and locate the [nethsm_watchdog] section.

      2. Set push to on.

      3. Set remote_ip to the IP of your client.

      4. (Optionally) Set remote_keyhash as well. For more information, see config_op.

  3. Push the Connect Configuration File as usual, for example using cfg-pushnethsm.

  4. Use the following command to send your new Watchdog Configuration File to the Connect:

    sudo nfcp ./nfwatchdog.yml.new <connect_ip>:cfg-nfwatchdog

    The Connect will pick up the new configuration and the Watchdog will apply it after a cycle (usually up to 1 minute). The syslog should show:

    Feb 10 13:22:36 nethsm ../sbin/config-update: found new pushed watchdog config, attempting to install
    Feb 10 13:22:36 nethsm ../sbin/config-update: successfully installed new watchdog config

    The Connect will then push the current configuration back to /opt/nfast/kmdata/hsm-<esn>/config on your client, alongside the Connect Configuration File, if you have enabled [config_op] push.
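To confirm that a pushed Watchdog configuration was installed, you can scan the redirected syslog for the two config-update messages shown above. This is a minimal sketch: the function name and the status strings are hypothetical, and it assumes the message wording matches the sample lines exactly.

```python
FOUND = "found new pushed watchdog config, attempting to install"
INSTALLED = "successfully installed new watchdog config"


def push_status(lines):
    """Return 'installed', 'pending' or 'not seen' for syslog lines.

    The most recent config-update message decides the result, so a new
    push that has not yet been installed is reported as 'pending'.
    """
    status = "not seen"
    for line in lines:
        if "config-update:" not in line:
            continue
        if FOUND in line:
            status = "pending"
        elif INSTALLED in line:
            status = "installed"
    return status
```

Pass it the lines of the syslog file on the RFS or client, for example `push_status(open(path))`, where `path` is wherever your syslog redirection writes the log.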

Configuring via hardserver configuration file

The Watchdog can be configured using the hardserver configuration file, as follows.

  1. Set the Watchdog entries in the hardserver configuration file.

    [nethsm_watchdog]
    # Start of the nethsm_watchdog section
    # Connect Watchdog configuration. This section allows you to enable or disable
    # the Connect Watchdog and set up config file push.
    # Each entry has the following fields:
    #
    # Enable the Connect watchdog. (default=no)
    #  enable_watchdog=ENUM
    #
    # Whether to allow a client to push new watchdog config files to the netHSM.
    # If "on" then this effectively allows a client to remotely configure the
    # nethsm_watchdog. (default=off)
    #  push=ENUM
    #
    # The IP address of the client allowed to push watchdog config files. If not
    # set, or set to 0.0.0.0 or ::, allows ANY IP address to push on a new config
    # file.
    #  remote_ip=ADDR
    #
    # The hash of the key that the authorised client should use to authenticate
    # itself, or 40 zeros to indicate no key authentication required. (Default is
    # 40 zeros).
    #  remote_keyhash=KEYHASH

    enable_watchdog

    Enables the Connect Watchdog. The default is 'no'.

    push

    Whether to allow a client to push new Watchdog config files to the netHSM. If this is "on", it allows a client to remotely configure the nethsm_watchdog. The default is 'off'.

    remote_ip

    The IP address of the client allowed to push Watchdog config files. If this is not set, or set to 0.0.0.0 or ::, it allows any IP address to push a new config file.

    remote_keyhash

    The hash of the key that the authorised client should use to authenticate itself, or 40 zeros to indicate no key authentication required. The default is 40 zeros.

  2. Restart the hardserver to load the updated configuration file.

  3. Reboot the Connect if enabling or disabling the Watchdog.

Troubleshooting

The Watchdog is designed to fall back to its default values in case the Watchdog Configuration File is missing or malformed. If the Watchdog does not behave as you intended, examine the syslog for clues.

Returning the Connect to its factory state reverts the Watchdog Configuration File to its defaults.

| Error message | Likely cause | Solution |
|---------------|--------------|----------|
| Could not open Watchdog Configuration File. Watchdog Configuration File has been restored to its default state. | Watchdog Configuration File is missing or contains a YAML syntax error. | Push a new Watchdog Configuration File. |
| Watchdog Configuration File was unexpectedly empty. Watchdog Configuration File has been restored to its default state. | Watchdog Configuration File is missing or empty. | Push a new Watchdog Configuration File. |
| Could not read missing … field. Using default … instead. | The field is missing or there is a typo. | Update the Watchdog Configuration File and push it again. |
| Ignoring unrecognized category … from Watchdog Configuration File. Valid categories are … | There is a typo in the category or it is out-of-date. | Update the Watchdog Configuration File and push it again. |
| Ignoring unrecognized subcategory … from Watchdog Configuration File. Valid subcategories are … | There is a typo in the subcategory or it is out-of-date. | Update the Watchdog Configuration File and push it again. |
| Could not parse any time information from … in the Watchdog Configuration File. Examples of valid strings: … Could not use … as …. Using default … instead. | The string is not a valid time. | Update the Watchdog Configuration File and push it again. |
| Could not parse 'CPU Usage Monitoring/threshold': must be a number, not … | The threshold provided is not a valid number. | Update the Watchdog Configuration File and push it again. |
| Could not parse 'CPU Usage Monitoring/threshold': must be between 0% and 200%, not … | The threshold provided is outside the valid range. | Update the Watchdog Configuration File and push it again. |
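Several of the errors above come from misspelled categories or fields, which can be caught on the client before pushing. The sketch below checks an already-parsed configuration (a Python dict, as a YAML parser would produce) against the category and field names of the default configuration file. The function name and the problem strings are hypothetical and are not the Watchdog's own messages.

```python
# Categories from the default Watchdog Configuration File.
# None marks a category whose value is a plain list of names.
KNOWN = {
    "Network Devices": None,
    "Monitored Processes": None,
    "System Information Report": {"enabled", "interval"},
    "CPU Usage Monitoring": {"threshold", "interval", "frequency"},
}


def lint_config(config):
    """Return a list of problems found in a parsed configuration dict."""
    problems = []
    for category, value in config.items():
        if category not in KNOWN:
            problems.append(f"unrecognized category: {category}")
            continue
        fields = KNOWN[category]
        if fields is None:
            if not isinstance(value, list):
                problems.append(f"{category}: expected a list")
            continue
        if not isinstance(value, dict):
            problems.append(f"{category}: expected a mapping")
            continue
        for key in value:
            if key not in fields:
                problems.append(f"unrecognized subcategory: {category}/{key}")
        for key in sorted(fields - set(value)):
            problems.append(f"missing field: {category}/{key}")
    return problems
```

An empty list means that no naming problems were found. It does not validate the values themselves, such as intervals or the threshold.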

The intervals and frequencies configured for the Watchdog are not strictly enforced. For example, if the Watchdog is busy computing CPU usage when the System Information Report is due, it produces the report as soon as the CPU sampling has finished.