Alerts

The alerts are listed here by their KeySafe 5 alert type.

The "nShield Monitor Alert" row, where available, displays the legacy alert name from nShield Monitor.

Other data returned

Data marked with "*" are returned only as part of the summary and not as a specific alert record field. For more information about what data is returned, see Labels.

HSM PSU Failure

This alert has a resolution.

Valid parameters

for

OpenMetrics used

nshield_error_conditions(source="psu_failed")

Basic Prometheus query

nshield_error_conditions{source="psu_failed", nshield_error_conditions="failed"} == 1

Other data returned

esn

Summary returned

HSM PSU has failed

nShield Monitor alert

snmp psuAlert / NShieldPowerSupplyFailure

HSM Fan Failure

Valid parameters

for

OpenMetrics used

nshield_error_conditions(source="fanX")

Basic Prometheus query

nshield_error_conditions{source="fanX", nshield_error_conditions="failed"} == 1

Other data returned

esn, source*

Summary returned

HSM Fan {source} has failed.

HSM Chassis Battery

Valid parameters

for

OpenMetrics used

nshield_error_conditions(source="chassis_battery_low")

Basic Prometheus query

nshield_error_conditions{source="chassis_battery_low", nshield_error_conditions="failed"} == 1

Other data returned

esn

Summary returned

HSM chassis battery has failed

HSM Fan Speed

This alert has a resolution.

The alert is triggered if the fan speed drops below the minimum limit or exceeds the maximum limit.

Valid parameters

for, over

OpenMetrics used

nshield_fan_speed_rpm, nshield_fan_speed_limit_rpm

Basic Prometheus query

nshield_fan_speed_rpm > ignoring(limit) nshield_fan_speed_limit_rpm{limit="maximum"}

nshield_fan_speed_rpm < ignoring(limit) nshield_fan_speed_limit_rpm{limit="minimum"}

(Two queries, one for each trigger.)

Other data returned

esn, fan_id

Summary returned

HSM Fan {fan_id} maximum limit exceeded.

HSM Fan {fan_id} is below its minimum limit.

nShield Monitor alert

snmp fanfailureAlert / NShieldXCFanSpeedZero
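The two fan speed triggers can be illustrated with a minimal Python sketch. The function name and the sample RPM values are hypothetical; only the comparison logic and the summary strings come from the entries above.

```python
from typing import Optional


def fan_speed_summary(fan_id: str, rpm: float, min_rpm: float, max_rpm: float) -> Optional[str]:
    """Return the summary for an out-of-range fan speed, or None if in range.

    Mirrors the two queries: speed above the "maximum" limit,
    or speed below the "minimum" limit.
    """
    if rpm > max_rpm:
        return f"HSM Fan {fan_id} maximum limit exceeded."
    if rpm < min_rpm:
        return f"HSM Fan {fan_id} is below its minimum limit."
    return None


# Hypothetical readings for illustration only.
print(fan_speed_summary("1", rpm=1500, min_rpm=2000, max_rpm=10000))
```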

HSM Memory Usage Percentage

This is the sum of the kernel and user memory expressed as a percentage of the total amount of available memory, or (kernel + user) / total * 100.

Valid parameters

min < x or max > y, for, over

OpenMetrics used

nshield_module_mem_bytes,

nshield_module_mem_alloc_kernel_bytes,

nshield_module_mem_alloc_user_bytes

Basic Prometheus query

((nshield_module_mem_alloc_kernel_bytes + nshield_module_mem_alloc_user_bytes) / nshield_module_mem_bytes) * 100 > max

Other data returned

esn

Summary returned

HSM Memory Usage exceeds {max}%.

HSM memory usage is below {min}%.

nShield Monitor alert

snmp memoryUsageHighAlert / memoryUsageOkAlert
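As a worked example of the memory calculation, the following Python sketch applies (kernel + user) / total * 100 and then checks the result against optional min and max thresholds. The function names and byte counts are hypothetical and chosen only to illustrate the arithmetic.

```python
from typing import Optional


def memory_usage_percent(kernel_bytes: float, user_bytes: float, total_bytes: float) -> float:
    """Return (kernel + user) / total * 100."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (kernel_bytes + user_bytes) / total_bytes * 100


def memory_alert_fires(usage_percent: float,
                       min_percent: Optional[float] = None,
                       max_percent: Optional[float] = None) -> bool:
    """True if usage is below min or above max (whichever thresholds are set)."""
    below = min_percent is not None and usage_percent < min_percent
    above = max_percent is not None and usage_percent > max_percent
    return below or above


# Hypothetical values: 256 + 512 bytes allocated out of 1024 bytes.
usage = memory_usage_percent(256, 512, 1024)
print(usage, memory_alert_fires(usage, max_percent=70))
```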

HSM Temperature Percentage

The alert uses the maximum value across all sensors. Each sensor value is calculated as the current value divided by the maximum limit for that sensor: maximum( 100 * current(sensor) / limit maximum(sensor) ) > max.

The maximum value is a percentage between 0 and 150.

Valid parameters

max > y, for, over

OpenMetrics used

nshield_temperature_celsius,

nshield_temperature_limit_celsius

Basic Prometheus query

(nshield_temperature_celsius / ignoring(limit) nshield_temperature_limit_celsius{limit="maximum"}) * 100 > max

Other data returned

esn, sensor*

Summary returned

HSM Sensor: "{sensorName}" temperature exceeds {max}% of maximum temperature.

nShield Monitor alert

NShieldTemperaturePeak
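The per-sensor calculation described for this alert can be sketched in Python as follows. The sensor names and readings are hypothetical, and skipping sensors with no usable limit is an assumption made for this illustration rather than documented behaviour.

```python
from typing import Mapping, Optional, Tuple


def hottest_sensor_percent(current_c: Mapping[str, float],
                           limit_max_c: Mapping[str, float]) -> Optional[Tuple[str, float]]:
    """Return (sensor, percent) for the sensor with the highest
    100 * current / maximum limit, or None if no sensor has a usable limit."""
    worst = None
    for sensor, current in current_c.items():
        limit = limit_max_c.get(sensor)
        if limit is None or limit <= 0:
            continue  # assumption: ignore sensors without a positive limit
        percent = 100 * current / limit
        if worst is None or percent > worst[1]:
            worst = (sensor, percent)
    return worst


# Hypothetical sensors and readings in degrees Celsius.
readings = {"cpu": 60.0, "board": 45.0}
limits = {"cpu": 80.0, "board": 90.0}
print(hottest_sensor_percent(readings, limits))
```

The alert would then compare the returned percentage against the configured max threshold.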

HSM Queue Percentage

This alert type maps to two nShield Monitor alerts:

DeviceNShieldUtilizationOverloads

A two-layer event, warning and critical, with a single resolution.

In nShield Monitor, load was calculated as (cmdCount - replyCount) * 100 / nshield.maxQ.

load = jobs in progress / maximum queue length.

The minimum and maximum values are percentages, from 0 to 150.

DeviceNShieldUtilizationPeakEvent

This alert has a duration.

Valid parameters

min < x or max > y, for, over

OpenMetrics used

nshield_queue_in_progress,

nshield_queue_length_limit

Basic Prometheus query

Other data returned

esn, vcm

Summary returned

Amount of Active HSM Jobs exceed {max}% of queue length.

Amount of Active HSM Jobs is below {min}% of queue length.

nShield Monitor alerts

DeviceNShieldUtilizationOverloads

DeviceNShieldUtilizationPeakEvent
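A minimal Python sketch of the load calculation described above, expressed as a percentage of the maximum queue length. The function names and sample values are hypothetical; the sketch assumes the two OpenMetrics values listed for this alert supply the jobs in progress and the queue length limit.

```python
from typing import Optional


def queue_load_percent(jobs_in_progress: float, queue_length_limit: float) -> float:
    """Return jobs in progress as a percentage of the maximum queue length."""
    if queue_length_limit <= 0:
        raise ValueError("queue_length_limit must be positive")
    return jobs_in_progress * 100 / queue_length_limit


def queue_alert_fires(load_percent: float,
                      min_percent: Optional[float] = None,
                      max_percent: Optional[float] = None) -> bool:
    """True if load is below min or above max (whichever thresholds are set)."""
    below = min_percent is not None and load_percent < min_percent
    above = max_percent is not None and load_percent > max_percent
    return below or above


# Hypothetical values: 30 jobs in progress with a queue limit of 40.
load = queue_load_percent(30, 40)
print(load, queue_alert_fires(load, max_percent=70))
```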

HSM Objects Count

This alert has a duration. It is a two-layer event, warning and critical, with a single resolution.

Valid parameters

min < x or max > y, for, over

OpenMetrics used

nshield_objects_stored_total,

nshield_objects_destroyed_total

Basic Prometheus query

nshield_objects_stored_total - nshield_objects_destroyed_total > max

Other data returned

esn, vcm

Summary returned

HSM Objects Count exceeds {max}.

HSM Objects Count is below {min}.

nShield Monitor alert

DeviceNShieldHigHObjectCount

Host Hardserver

Valid parameters

for

OpenMetrics used

nshield_hardserver_liveness

Basic Prometheus query

nshield_hardserver_liveness == 0

Other data returned

esn, vcm

Summary returned

HSM is not responsive

nShield Monitor alert

ClientHostHardserverFailure

HSM Liveness

KeySafe 5 is not communicating with the HSM.

Valid parameters

for

Default alert (unlicensed)

Y (for 5 minutes)

OpenMetrics used

nshield_hsm_liveness

Basic Prometheus query

nshield_hsm_liveness == 0

Other data returned

esn, vcm

Summary returned

HSM is not responsive

nShield Monitor alert

DeviceConnStatus

Host Liveness

KeySafe 5 is not communicating with the host.

Valid parameters

for

Default alert (unlicensed)

Y (for 5 minutes)

OpenMetrics used

nshield_host_liveness

Basic Prometheus query

nshield_host_liveness == 0

Other data returned

host

Summary returned

Host is not responsive

nShield Monitor alert

KeySafe 5 is not communicating with the host

Licence Expiry

Valid parameters

min < x

Default alert (unlicensed)

Y (min < 0)

Y (min < 30 days)

OpenMetrics used

keysafe5_licence_expiry

Basic Prometheus query

keysafe5_licence_expiry - current_time < min

Other data returned

licence

Summary returned

Licence {licence} will expire in less than {min} days.

Licence {licence} will expire in less than 1 day.

Licence {licence} has expired.
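The expiry comparison can be sketched in Python as shown below. This is an illustration under stated assumptions: that the expiry value and the current time are Unix timestamps in seconds, that min is expressed in days, and that the summary variant is chosen from the threshold (0, 1, or more days). None of these details are confirmed by the entries above, and all function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def days_until_expiry(expiry_ts: float, now_ts: float) -> float:
    """Days remaining until expiry (negative once expired)."""
    return (expiry_ts - now_ts) / SECONDS_PER_DAY


def licence_alert_fires(expiry_ts: float, now_ts: float, min_days: float) -> bool:
    """True when the remaining time is less than min_days."""
    return days_until_expiry(expiry_ts, now_ts) < min_days


def licence_summary(licence: str, min_days: int) -> str:
    """Assumed mapping from threshold to summary text."""
    if min_days <= 0:
        return f"Licence {licence} has expired."
    if min_days == 1:
        return f"Licence {licence} will expire in less than 1 day."
    return f"Licence {licence} will expire in less than {min_days} days."


# Hypothetical timestamps: the licence expires 10 days from "now".
now = 1_700_000_000
expiry = now + 10 * SECONDS_PER_DAY
if licence_alert_fires(expiry, now, min_days=30):
    print(licence_summary("example-licence", 30))
```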

HSM Client Licences Remaining

The actual number of licences remaining.

Valid parameters

min < y, for, over

OpenMetrics used

nshield_current_crypto_clients

nshield_current_crypto_clients_limit

Basic Prometheus query

nshield_current_crypto_clients_limit - nshield_current_crypto_clients < min

Other data returned

esn

Summary returned

Fewer than {min} client licences remaining.

No HSM client licences remaining.
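As a small worked example, the remaining client licences are the limit minus the current number of clients. The Python sketch below also picks a summary; treating zero remaining as the "No HSM client licences remaining." case is an assumption, and the function names are hypothetical.

```python
from typing import Optional


def client_licences_remaining(limit: int, current: int) -> int:
    """Return the limit minus the number of current crypto clients."""
    return limit - current


def client_licence_summary(remaining: int, min_remaining: int) -> Optional[str]:
    """Return the summary text, or None if the alert would not fire."""
    if remaining <= 0:
        return "No HSM client licences remaining."  # assumed zero-remaining case
    if remaining < min_remaining:
        return f"Fewer than {min_remaining} client licences remaining."
    return None


# Hypothetical values: a limit of 10 clients with 8 currently connected.
remaining = client_licences_remaining(10, 8)
print(remaining, client_licence_summary(remaining, min_remaining=3))
```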

Certificate Expiry

Valid parameters

min < x

Default alert (unlicensed)

Y (min < 0)

Y (min < 30 days)

OpenMetrics used

keysafe5_certificate_expiry

Basic Prometheus query

keysafe5_certificate_expiry{type="agent"} < min

keysafe5_certificate_expiry{type="central"} < min

keysafe5_certificate_expiry{type="ca", agent=""} < min

keysafe5_certificate_expiry{type="ca", agent=~".+"} < min

Other data returned

type, agent (see note 1)

Summary returned

<cert> will expire in less than {min} days.

<cert> will expire in less than 1 day.

<cert> has expired.

  1. The type and agent values determine which of the following variations is used for the <cert> portion of the summary:

    • The central platform has type "central" or type "ca" (System certificate, System CA certificate)

    • Agents have type "agent" or type "ca" together with an agent id (Agent <agent id> certificate, Agent <agent id> CA certificate)
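The mapping from type and agent values to the <cert> text in note 1 can be sketched in Python. The function name and the sample agent id are hypothetical. The empty-agent check mirrors the agent="" and agent=~".+" matchers in the queries above.

```python
def certificate_label(cert_type: str, agent: str = "") -> str:
    """Return the <cert> portion of the summary for a type/agent pair."""
    if cert_type == "central":
        return "System certificate"
    if cert_type == "agent":
        return f"Agent {agent} certificate"
    if cert_type == "ca":
        # A CA certificate with an agent id belongs to that agent;
        # without one it belongs to the central platform.
        if agent:
            return f"Agent {agent} CA certificate"
        return "System CA certificate"
    raise ValueError(f"unknown certificate type: {cert_type!r}")


# Hypothetical agent id for illustration.
print(f"{certificate_label('ca', 'agent-01')} has expired.")
```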