Alerts
The alerts are listed here by their KeySafe 5 alert type.
The "nShield Monitor Alert" row, where available, displays the legacy alert name from nShield Monitor.
Data marked with "*" are returned only as part of the summary and not as a specific alert record field. For more information about what data is returned, see Labels.
HSM PSU Failure
This alert has a resolution.
| Valid parameters | for |
|---|---|
| OpenMetrics used | nshield_error_conditions(source="psu_failed") |
| Basic Prometheus query | nshield_error_conditions{source="psu_failed", nshield_error_conditions="failed"} == 1 |
| Other data returned | esn |
| Summary returned | HSM PSU has failed |
| nShield Monitor alert | snmp psuAlert / NShieldPowerSupplyFailure |
HSM Fan Failure
| Valid parameters | for |
|---|---|
| OpenMetrics used | nshield_error_conditions(source="fanX") |
| Basic Prometheus query | nshield_error_conditions{source="fanX", nshield_error_conditions="failed"} == 1 |
| Other data returned | esn, source* |
| Summary returned | HSM Fan {source} has failed. |
HSM Chassis Battery
| Valid parameters | for |
|---|---|
| OpenMetrics used | nshield_error_conditions(source="chassis_battery_low") |
| Basic Prometheus query | nshield_error_conditions{source="chassis_battery_low", nshield_error_conditions="failed"} == 1 |
| Other data returned | esn |
| Summary returned | HSM chassis battery has failed |
HSM Fan Speed
This alert has a resolution.
The alert is triggered if the fan speed drops below the minimum limit or exceeds the maximum limit.
| Valid parameters | for, over |
|---|---|
| OpenMetrics used | nshield_fan_speed_rpm, nshield_fan_speed_limit_rpm |
| Basic Prometheus query | nshield_fan_speed_rpm > ignoring(limit) nshield_fan_speed_limit_rpm{limit="maximum"}, nshield_fan_speed_rpm < ignoring(limit) nshield_fan_speed_limit_rpm{limit="minimum"} (two queries, one for each trigger) |
| Other data returned | esn, fan_id |
| Summary returned | HSM Fan {fan_id} maximum limit exceeded. HSM Fan {fan_id} is below its minimum limit. |
| nShield Monitor alert | snmp fanfailureAlert / NShieldXCFanSpeedZero |
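The two fan speed triggers can be illustrated outside Prometheus with a minimal Python sketch. The function name `fan_speed_summaries` and its parameters are hypothetical and are not part of KeySafe 5. The sketch only mirrors the two comparisons and the two summaries listed for this alert, and it does not model the `for` or `over` parameters.

```python
from typing import List


def fan_speed_summaries(fan_id: str, rpm: float, min_rpm: float, max_rpm: float) -> List[str]:
    """Return the summaries that would apply to a single fan reading.

    Hypothetical helper: rpm stands in for nshield_fan_speed_rpm, and
    min_rpm / max_rpm stand in for nshield_fan_speed_limit_rpm with
    limit="minimum" and limit="maximum".
    """
    summaries = []
    # Mirrors the first query: speed above the maximum limit.
    if rpm > max_rpm:
        summaries.append(f"HSM Fan {fan_id} maximum limit exceeded.")
    # Mirrors the second query: speed below the minimum limit.
    if rpm < min_rpm:
        summaries.append(f"HSM Fan {fan_id} is below its minimum limit.")
    return summaries
```

A reading inside both limits returns an empty list, which corresponds to neither query matching.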
HSM Memory Usage Percentage
This is the sum of the kernel and user memory expressed as a percentage of the total amount of available memory, or (kernel + user) / total * 100.
| Valid parameters | min < x or max > y, for, over |
|---|---|
| OpenMetrics used | nshield_module_mem_bytes, nshield_module_mem_alloc_kernel_bytes, nshield_module_mem_alloc_user_bytes |
| Basic Prometheus query | ((nshield_module_mem_alloc_kernel_bytes + nshield_module_mem_alloc_user_bytes) / nshield_module_mem_bytes) * 100 > max |
| Other data returned | esn |
| Summary returned | HSM Memory Usage exceeds {max}%. HSM memory usage is below {min}%. |
| nShield Monitor alert | snmp memoryUsageHighAlert / memoryUsageOkAlert |
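As a worked illustration of the (kernel + user) / total * 100 calculation, the following Python sketch computes the percentage and picks a summary. The function names and parameters are hypothetical, and the order in which the two thresholds are checked is an assumption made for the example.

```python
from typing import Optional


def memory_usage_percent(kernel_bytes: float, user_bytes: float, total_bytes: float) -> float:
    """Compute (kernel + user) / total * 100 from the three memory metrics."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (kernel_bytes + user_bytes) / total_bytes * 100


def memory_summary(percent: float,
                   min_percent: Optional[float] = None,
                   max_percent: Optional[float] = None) -> Optional[str]:
    """Return the matching summary text, or None if no threshold is crossed.

    Assumption: the max threshold is checked before the min threshold.
    """
    if max_percent is not None and percent > max_percent:
        return f"HSM Memory Usage exceeds {max_percent}%."
    if min_percent is not None and percent < min_percent:
        return f"HSM memory usage is below {min_percent}%."
    return None
```

For example, 256 bytes of kernel memory plus 256 bytes of user memory out of 1024 bytes total gives 50%, which triggers nothing with min = 10 and max = 80.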
HSM Temperature Percentage
This alert uses the maximum value across all sensors. Each sensor value is calculated as the current value divided by that sensor's maximum limit, expressed as a percentage: maximum( 100 * current(sensor) / limit maximum(sensor) ) > max.
The maximum value is a percentage between 0 and 150.
| Valid parameters | max > y, for, over |
|---|---|
| OpenMetrics used | nshield_temperature_celsius, nshield_temperature_limit_celsius |
| Basic Prometheus query | (nshield_temperature_celsius / ignoring(limit) nshield_temperature_limit_celsius{limit="maximum"}) * 100 > max |
| Other data returned | esn, sensor* |
| Summary returned | HSM Sensor: "{sensorName}" temperature exceeds {max}% of maximum temperature. |
| nShield Monitor alert | NShieldTemperaturePeak |
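The per-sensor percentage and the selection of the highest one can be sketched in Python as follows. The function name `hottest_sensor_percent`, the input shape, and the sensor names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def hottest_sensor_percent(readings: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Return (sensor name, percentage) for the sensor closest to its limit.

    readings maps a sensor name to (current_celsius, maximum_limit_celsius).
    Each percentage is 100 * current / maximum limit; the largest one is returned.
    """
    percents = {
        name: 100 * current / limit
        for name, (current, limit) in readings.items()
    }
    hottest = max(percents, key=percents.get)
    return hottest, percents[hottest]
```

With one sensor at 60 of 80 degrees and another at 45 of 90 degrees, the result is the first sensor at 75%, so an alert configured with max = 70 would trigger on that sensor.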
HSM Queue Percentage
This alert type maps to two nShield Monitor alerts:
DeviceNShieldUtilizationOverloads
A two-layer event, warning and critical, with a single resolution.
In nShield Monitor, the load was calculated as load = (cmdCount - replyCount) * 100 / nshield.maxQ.
Here, load = jobs in progress / maximum queue length.
The minimum and maximum values are percentages, from 0 to 150.
DeviceNShieldUtilizationPeakEvent
This alert has a duration.
| Valid parameters | min < x or max > y, for, over |
|---|---|
| OpenMetrics used | nshield_queue_in_progress, nshield_queue_length_limit |
| Basic Prometheus query | |
| Other data returned | esn, vcm |
| Summary returned | Amount of Active HSM Jobs exceed {max}% of queue length. Amount of Active HSM Jobs is below {min}% of queue length. |
| nShield Monitor alerts | DeviceNShieldUtilizationOverloads, DeviceNShieldUtilizationPeakEvent |
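The two load calculations described for this alert can be compared with a small Python sketch. The function names are hypothetical. Scaling the jobs-in-progress ratio by 100 is an assumption based on the minimum and maximum values being percentages; it is not a query taken from KeySafe 5.

```python
def queue_load_percent(in_progress: float, length_limit: float) -> float:
    """Jobs in progress as a percentage of the maximum queue length.

    in_progress stands in for nshield_queue_in_progress and length_limit
    for nshield_queue_length_limit (assumed mapping).
    """
    if length_limit <= 0:
        raise ValueError("length_limit must be positive")
    return in_progress * 100 / length_limit


def legacy_load_percent(cmd_count: int, reply_count: int, max_q: int) -> float:
    """Legacy nShield Monitor form: (cmdCount - replyCount) * 100 / maxQ."""
    if max_q <= 0:
        raise ValueError("max_q must be positive")
    return (cmd_count - reply_count) * 100 / max_q
```

If 1030 commands have been submitted and 1000 replies received, 30 jobs are in progress, so both forms give the same 25% for a queue limit of 120.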
HSM Objects Count
This alert has a duration. It is a two-layer event, warning and critical, with a single resolution.
| Valid parameters | min < x or max > y, for, over |
|---|---|
| OpenMetrics used | nshield_objects_stored_total, nshield_objects_destroyed_total |
| Basic Prometheus query | nshield_objects_stored_total - nshield_objects_destroyed_total > max |
| Other data returned | esn, vcm |
| Summary returned | HSM Objects Count exceeds {max}. HSM Objects Count is below {min}. |
| nShield Monitor alert | DeviceNShieldHigHObjectCount |
Host Hardserver
| Valid parameters | for |
|---|---|
| OpenMetrics used | nshield_hardserver_liveness |
| Basic Prometheus query | nshield_hardserver_liveness == 0 |
| Other data returned | esn, vcm |
| Summary returned | HSM is not responsive |
| nShield Monitor alert | ClientHostHardserverFailure |
HSM Liveness
KeySafe 5 is not communicating with the HSM.
| Valid parameters | for |
|---|---|
| Default alert (unlicensed) | Y (for 5 minutes) |
| OpenMetrics used | nshield_hsm_liveness |
| Basic Prometheus query | nshield_hsm_liveness == 0 |
| Other data returned | esn, vcm |
| Summary returned | HSM is not responsive |
| nShield Monitor alert | DeviceConnStatus |
Host Liveness
KeySafe 5 is not communicating with the host.
| Valid parameters | for |
|---|---|
| Default alert (unlicensed) | Y (for 5 minutes) |
| OpenMetrics used | nshield_host_liveness |
| Basic Prometheus query | nshield_host_liveness == 0 |
| Other data returned | host |
| Summary returned | Host is not responsive |
| nShield Monitor alert | KeySafe5 is not communicating with the host |
Licence Expiry
| Valid parameters | min < x |
|---|---|
| Default alert (unlicensed) | Y (min < 0), Y (min < 30 days) |
| OpenMetrics used | keysafe5_licence_expiry |
| Basic Prometheus query | keysafe5_licence_expiry - current_time < min |
| Other data returned | licence |
| Summary returned | Licence {licence} will expire in less than {min} days. Licence {licence} will expire in less than 1 day. Licence {licence} has expired. |
HSM Client Licences Remaining
The actual number of licences remaining.
| Valid parameters | min < y, for, over |
|---|---|
| OpenMetrics used | nshield_current_crypto_clients, nshield_current_crypto_clients_limit |
| Basic Prometheus query | nshield_current_crypto_clients_limit - nshield_current_crypto_clients < min |
| Other data returned | esn |
| Summary returned | Fewer than {min} client licences remaining. No HSM client licences remaining. |
Certificate Expiry
| Valid parameters | min < x |
|---|---|
| Default alert (unlicensed) | Y (min < 0), Y (min < 30 days) |
| OpenMetrics used | keysafe5_certificate_expiry |
| Basic Prometheus query | keysafe5_certificate_expiry{type="agent"} < min, keysafe5_certificate_expiry{type="central"} < min, keysafe5_certificate_expiry{type="ca", agent=""} < min, keysafe5_certificate_expiry{type="ca", agent=~".+"} < min |
| Other data returned | type, agent (see the note below) |
| Summary returned | <cert> will expire in less than {min} days. <cert> will expire in less than 1 day. <cert> has expired. |
The type and agent values determine which of the following variations is used for the <cert> portion of the summary:
- Central platform should have type "central" or type "ca" (System certificate, System CA certificate)
- Agents will have type "agent" or type "ca" but will have an agent id (Agent <agent id> certificate, Agent <agent id> CA certificate)
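The mapping from the type and agent values to the <cert> text can be sketched in Python as follows. The function name `cert_label` is hypothetical. The sketch assumes that an empty agent value means a central platform certificate, which matches the agent="" and agent=~".+" selectors in the queries above.

```python
def cert_label(cert_type: str, agent: str = "") -> str:
    """Build the <cert> portion of the summary from the type and agent labels.

    Assumption: an empty agent string identifies the central platform.
    """
    if cert_type == "central":
        return "System certificate"
    if cert_type == "agent":
        return f"Agent {agent} certificate"
    if cert_type == "ca":
        # A CA certificate belongs to an agent only when an agent id is present.
        if agent:
            return f"Agent {agent} CA certificate"
        return "System CA certificate"
    raise ValueError(f"unknown certificate type: {cert_type!r}")
```

For example, type "ca" with no agent id yields "System CA certificate", while the same type with agent id "a1" yields "Agent a1 CA certificate".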