KeySafe 5 v1.5.0 Metrics

Introduction

The following tables list the metrics that KeySafe 5 exposes, the source of each metric, and whether it was available in nShield Monitor or SNMP.

In the tables, the term "resource" refers to whether a statistic belongs to an HSM or a host.

If the statistic belongs to an nShield 5c 10G HSM, the term "resource" also differentiates between "platform" and "tenancy". All other HSMs report both.

The HSM metrics can apply to the actual HSM ("module") or the HSM "chassis".

OpenMetrics

nshield_hsm

The labels on the HSM.

Type Details

Information

Has the label "label".

nshield_error_conditions

Error conditions reported by the 5c 10G chassis. In the 5c 10G, all conditions come from envmon.

In non-5c 10G network-attached HSMs, there is only a single error condition available, PSU failure, which comes from stattree.

| Type | Details | Labels | stattree node | stattree ID | EnvMon |
| --- | --- | --- | --- | --- | --- |
| stateset | failed, okay | source="psu_failed" | HostEnvStats | PSUFailure | cosmo_alerts_psu_failed |

nshield_uptime_seconds

The length of elapsed time since the HSM was last reset. This does not include the HSM chassis.

Resource Type Unit Details stattree node stattree ID SNMP MIB

Platform

counter

seconds

ModuleEnvStats

Uptime

moduleStatsTable > uptime

nshield_hsm_liveness

A boolean HSM liveness report where 1 indicates "live".

Resource Type

Platform

gauge

nshield_commands

The total number of commands sent for processing from a client to the server or from the server to an HSM. The number of commands currently being processed is the CmdCount minus the ReplyCount.

For an HSM, this is the number of commands received from any client.

Resource Type Details stattree node stattree ID

Tenancy

counter

ModuleJobStats

CmdCount

nshield_replies

The total number of replies sent from a server to a client or from an HSM to a server.

For an HSM, this is the number of replies sent to any client.

Resource Type Details stattree node stattree ID

Tenancy

counter

ModuleJobStats

ReplyCount
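The relationship described under nshield_commands (commands in progress = CmdCount minus ReplyCount) can be sketched as follows. This is a minimal illustration rather than KeySafe 5 code: the function name and sample values are hypothetical, and it assumes both counters were sampled at roughly the same moment.

```python
def commands_in_progress(commands_total: int, replies_total: int) -> int:
    """Estimate the number of in-flight commands from the two counters.

    commands_total: a sample of nshield_commands (stattree CmdCount)
    replies_total:  a sample of nshield_replies (stattree ReplyCount)
    """
    # Samples taken at slightly different instants could make the
    # difference briefly negative, so this sketch clamps at zero.
    return max(commands_total - replies_total, 0)


# Hypothetical sample values, for illustration only.
print(commands_in_progress(1_000_250, 1_000_238))
```

Because both metrics are counters, only the difference between them is meaningful here; their absolute values simply grow over time.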

nshield_objects_stored

The number of times the object store has had a new object put into it.

Resource Type Details stattree node stattree ID

Tenancy

counter

ModuleObjStats

ObjectsCreated

nshield_objects_destroyed

The number of items that have been deleted from the HSM’s object store and had their corresponding memory released.

Resource Type Details stattree node stattree ID

Tenancy

counter

ModuleObjStats

ObjectsDestroyed

nshield_queue_length_limit

The maximum and minimum queue length. This is a static value obtained from the HSM resource.

Resource Type Details

Tenancy

gauge

Has label "limit".

nshield_current_clients

The number of client connections currently made to the Connect hardserver.

Resource Type stattree node stattree ID

Tenancy

gauge

ServerGlobals

ClientCount

nshield_current_clients_limit

The number of licensed client connections available.

Resource Type stattree node stattree ID

Tenancy

counter

ServerGlobals

MaxClients

nshield_current_crypto_clients_limit

The number of licensed crypto client connections available.

Resource Type

Tenancy

gauge

nshield_current_crypto_clients

The number of licensable clients that are currently connected, including both active and parked sessions.

This is only relevant when reported from a hardserver with remote clients that have an image version of 13.5 or later.

Resource Type stattree node stattree ID

Tenancy

gauge

ServerGlobals

CryptoClientCount
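As a hedged illustration of how the two crypto client metrics relate, the sketch below computes the fraction of licensed crypto client connections in use from one sample of nshield_current_crypto_clients and one sample of nshield_current_crypto_clients_limit. The function name and sample values are hypothetical; how the samples are obtained depends on your monitoring stack.

```python
def crypto_client_utilisation(current_clients: float, client_limit: float) -> float:
    """Return the fraction of licensed crypto client connections in use."""
    if client_limit <= 0:
        # No licensed connections reported, so utilisation is undefined.
        raise ValueError("client_limit must be greater than zero")
    return current_clients / client_limit


# Hypothetical samples: 6 connected clients against a limit of 8.
print(f"{crypto_client_utilisation(6, 8):.0%}")
```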

nshield_audit_db_free_bytes

Resource Type Unit stattree node stattree ID

Tenancy

gauge

bytes

ServerGlobals

AuditDBFreeSpaceMB

nshield_audit_db_used_bytes

Resource Type Unit stattree node stattree ID

Tenancy

gauge

bytes

ServerGlobals

AuditDBUsedSpaceMB

nshield_queue_in_progress

All jobs currently in progress on the HSM, including jobs from the SEE machine.

Resource Type stattree node stattree ID

Tenancy

gauge

ModuleServerStats

JobsOutstanding

System load

nshield_cpu_load_per_hsm

The current processing load on the HSM. This is represented as a number from 0 to 100.

HSMs typically contain several different types of processing resources, such as the main CPU and RSA acceleration. This means that reporting on HSM load can be imprecise.

Normally, HSMs report 100% CPU load when all RSA processing capacity is occupied. When performing non-RSA tasks, the main CPU and other resources, such as the random number generator, can become saturated without this metric reaching 100%.

On the nShield Connect, this metric comes from stattree.

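| Resource | Type | Details | Labels | stattree node | stattree ID |
| --- | --- | --- | --- | --- | --- |
| Platform | gauge | A value between 0 and 1. Has label "source". | source="module" | ModuleJobStats | CPULoadPercent |
| Platform | gauge | A value between 0 and 1. Has label "source". | source="chassis" | HostEnvStats | CPULoadPercent |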

nshield_cpu_load_average_per_hsm

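| Resource | Type | Details | Labels | EnvMon |
| --- | --- | --- | --- | --- |
| Platform | gauge | A value between 0 and 1. Has label "source". | source="1min" | cpu_load_average_1min |
| Platform | gauge | A value between 0 and 1. Has label "source". | source="5min" | cpu_load_average_5mins |
| Platform | gauge | A value between 0 and 1. Has label "source". | source="15min" | cpu_load_average_15mins |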

nshield_cpu_throttled

nShield 5 HSMs only.

Indicates whether the main processor is being throttled to avoid overheating. A throttled processor confirms that a module is getting too hot. Processor throttling will impact cryptographic performance.

Resource Type Details

Platform

stateset

"throttled" or "okay"

Temperatures

nshield_temperature_celsius

The current temperature of different parts of the HSM.

Resource Type Units Details

Platform

gauge

celsius

Has label "sensor".

See the following sections for more details about the labels.

nshield_temperature_celsius: Module

The current temperature of the main circuit board. First-generation HSMs do not have a temperature sensor and so do not return temperature statistics.

| Labels | stattree node | stattree ID |
| --- | --- | --- |
| sensor="module_cpu_temp" | ModuleEnvStats | CurrentCPUTemp1 |
| sensor="module_msp_temp" | ModuleEnvStats | TempSP |
| sensor="module_crypto_co_proc_temp" | ModuleEnvStats | CurrentCPUTemp2 |

nshield_temperature_celsius: Connect (chassis)

The ambient sensors for the chassis.

| Labels | stattree node | stattree ID |
| --- | --- | --- |
| sensor="chassis_left" | HostEnvStats | CurrentTempC |
| sensor="chassis_right" | HostEnvStats | CurrentTemp2C |

nshield_temperature_celsius: 5c 10G (chassis processor)

KeySafe 5 takes the four CPU temperature readings and provides the maximum temperature as the reported value.

| Labels | EnvMon |
| --- | --- |
| sensor="chassis_processor" | temp_cpu_core0, temp_cpu_core1, temp_cpu_core2, temp_cpu_core3 |
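The reduction described above, four per-core readings collapsed into one reported value by taking the maximum, can be sketched as follows. The function name and the sample readings are hypothetical and only illustrate the stated behaviour; this is not KeySafe 5 source code.

```python
def chassis_processor_temperature(core_temps_celsius):
    """Return the highest of the per-core temperature readings."""
    readings = list(core_temps_celsius)
    if not readings:
        raise ValueError("at least one core temperature reading is required")
    return max(readings)


# Hypothetical EnvMon readings keyed by the EnvMon names listed above.
envmon_sample = {
    "temp_cpu_core0": 51.0,
    "temp_cpu_core1": 53.5,
    "temp_cpu_core2": 49.0,
    "temp_cpu_core3": 52.0,
}
print(chassis_processor_temperature(envmon_sample.values()))
```

Reporting the maximum means the single sensor="chassis_processor" value reflects the hottest core, which is the reading that matters for overheating.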

nshield_temperature_celsius: 5c 10G (inlet and outlet)

The inlet and outlet sensors are similar to the ambient sensors in non-5c 10G network-attached HSMs. However, they are in different positions, so they do not provide the same information.

| Labels | EnvMon |
| --- | --- |
| sensor="chassis_inlet_left" | temp_inlet_left |
| sensor="chassis_inlet_right" | temp_inlet_right |
| sensor="chassis_outlet_left" | temp_outlet_left |
| sensor="chassis_outlet_right" | temp_outlet_right |

nshield_temperature_limit_celsius

The minimum and maximum acceptable temperature values for each sensor.

For the nShield 5c 10G, these values are provided by the platform. For non-5c 10G network-attached HSMs, these are hardcoded by KeySafe 5.

| Resource | Type | Units | Details | Labels |
| --- | --- | --- | --- | --- |
| Platform | gauge | celsius | Has labels "sensor" and "limit". For sensor, see nshield_temperature_celsius. | limit="maximum" |
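One way to use the limit metric is to compare each nshield_temperature_celsius sample with the nshield_temperature_limit_celsius values that carry the same "sensor" label. The sketch below is an assumption-laden illustration: the function name and sample values are hypothetical, and the minimum is treated as optional because only limit="maximum" is listed above.

```python
from typing import Optional


def temperature_in_range(
    reading_celsius: float,
    maximum_celsius: float,
    minimum_celsius: Optional[float] = None,
) -> bool:
    """Return True when a reading is within the limits for its sensor."""
    if reading_celsius > maximum_celsius:
        return False
    if minimum_celsius is not None and reading_celsius < minimum_celsius:
        return False
    return True


# Hypothetical values for one sensor, for illustration only.
print(temperature_in_range(reading_celsius=47.5, maximum_celsius=70.0))
```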

nshield_max_temperature_celsius

The maximum temperature recorded by the HSM’s temperature sensor. This is stored in non-volatile memory and is cleared when the unit is initialized.

The HostEnvStats are for non-5c 10G network-attached HSMs.

Resource Type Units Details

Platform

gauge

celsius

Has label "sensor".

See the following table for more details about the labels:

| Labels | stattree node | stattree ID |
| --- | --- | --- |
| sensor="module_cpu_temp" | ModuleEnvStats | MaxTempC |
| sensor="chassis_left" | HostEnvStats | MaxTempC |
| sensor="chassis_right" | HostEnvStats | MaxTemp2C |

nshield_min_temperature_celsius

The minimum temperature recorded by the HSM’s temperature sensor. This is stored in non-volatile memory and is cleared when the unit is initialized.

The HostEnvStats are for non-5c 10G network-attached HSMs.

Resource Type Units Details

Platform

gauge

celsius

Has label "sensor".

See the following table for more details about the labels:

| Labels | stattree node | stattree ID |
| --- | --- | --- |
| sensor="module_cpu_temp" | ModuleEnvStats | MinTempC |
| sensor="chassis_left" | HostEnvStats | MinTempC |
| sensor="chassis_right" | HostEnvStats | MinTemp2C |

Electrical

nshield_platform_voltage_volts

stattree mapping

Resource Type Units Details

Platform

gauge

volts

Has label "voltage_sensor".

See the following table for more details about the labels:

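| Labels | stattree node | stattree ID |
| --- | --- | --- |
| voltage_sensor="cpu_core" | ModuleEnvStats | CPUVoltage1 |
| voltage_sensor="t1022_ifc_io" | ModuleEnvStats | CPUVoltage2 |
| voltage_sensor="t1022_serdes" | ModuleEnvStats | CPUVoltage3 |
| voltage_sensor="t1022_serdes_io" | ModuleEnvStats | CPUVoltage4 |
| voltage_sensor="fpga_serdes_core" | ModuleEnvStats | CPUVoltage5 |
| voltage_sensor="fpga_serdes_io" | ModuleEnvStats | CPUVoltage6 |
| voltage_sensor="msp_avcc" | ModuleEnvStats | CPUVoltage7 |
| voltage_sensor="ddr4_access" | ModuleEnvStats | CPUVoltage8 |
| voltage_sensor="ddr4_io" | ModuleEnvStats | CPUVoltage9 |
| voltage_sensor="pci_bus" | ModuleEnvStats | CPUVoltage10 |
| voltage_sensor="module_battery" | ModuleEnvStats | CPUVoltage11 |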

EnvMon mapping

Resource Type Units Details

Platform

gauge

volts

Has label "voltage_sensor".

See the following table for more details about the labels:

| Labels | EnvMon |
| --- | --- |
| voltage_sensor="chassis_battery" | tamper_battery_voltage |
| voltage_sensor="12V" | power_supply_12V_voltage |
| voltage_sensor="3V" | |
| voltage_sensor="5V" | |
| voltage_sensor="5VStandby" | power_supply_5VSB_voltage |

nshield_platform_current_amperes

Resource Type Units Details

Platform

gauge

amperes

Has label "current_sensor".

See the following table for more details about the labels:

| Labels | EnvMon |
| --- | --- |
| current_sensor="12V" | power_supply_12V_current |
| current_sensor="3V" | |
| current_sensor="5V" | |
| current_sensor="5VStandby" | power_supply_5VSB_current |

Fans

nshield_fan_speed_rpm

The fan speed for each fan in the HSM.

KeySafe 5 assumes fan speeds greater than 120,000rpm are errors, and instead reports a speed of zero.

Resource Type Units Details

Platform

gauge

rpm

Has label "fan_id".

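| Labels | stattree node | stattree ID | EnvMon |
| --- | --- | --- | --- |
| fan_id="chassis1" | HostEnvStats | CurrentFanRPM | fan1_rpm |
| fan_id="chassis2" | HostEnvStats | CurrentFan2RPM | fan2_rpm |
| fan_id="chassis3" | HostEnvStats | CurrentFan3RPM | fan3_rpm |
| fan_id="chassis4" | HostEnvStats | CurrentFan4RPM | fan4_rpm |
| fan_id="module" | | | |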
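The error handling described for nshield_fan_speed_rpm (readings above 120,000 rpm are treated as errors and reported as zero) can be sketched as follows. The constant and function names are hypothetical; only the threshold and the zero substitution come from the description above.

```python
# Threshold taken from the description above; the name is illustrative.
MAX_PLAUSIBLE_FAN_RPM = 120_000


def reported_fan_speed(raw_rpm: float) -> float:
    """Return the fan speed to report, replacing implausible readings with 0."""
    # Only readings strictly greater than the threshold count as errors.
    if raw_rpm > MAX_PLAUSIBLE_FAN_RPM:
        return 0
    return raw_rpm


# Hypothetical raw readings, for illustration only.
for raw in (4_500, 120_000, 250_000):
    print(raw, "->", reported_fan_speed(raw))
```

A consequence worth keeping in mind when alerting: a reported speed of zero can mean either a stopped fan or a discarded erroneous reading.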

nshield_fan_speed_limit_rpm

The fan speed limits for each fan in the HSM.

These are hardcoded for each HSM type.

| Resource | Type | Units | Details | Labels |
| --- | --- | --- | --- | --- |
| Platform | gauge | rpm | Has labels "fan_id" and "limit". For fan_id, see nshield_fan_speed_rpm. | limit = "maximum" and "minimum". |

Memory

nshield_module_mem_bytes

The total amount of RAM, allocated and free, available to the HSM. This is equal to the installed RAM size, minus various fixed overheads.

It is a static value that is calculated by KeySafe 5.

Resource Type Units stattree node stattree ID

Platform

gauge

bytes

ModuleEnvStats

MemTotal

nshield_module_mem_alloc_kernel_bytes

The total amount of RAM allocated for kernel use, or non-SEE use, in a module. This is mainly used for the object store, for example, for keys and logical tokens, and for big-number buffers.

Resource Type Units stattree node stattree ID

Platform

gauge

bytes

ModuleEnvStats

MemAllocKernel

nshield_chassis_mem_alloc_kernel_bytes

The total amount of RAM allocated for kernel use, or non-SEE use, in a module. This is mainly used for the object store, for example, for keys and logical tokens, and for big-number buffers.

Resource Type Units stattree node stattree ID

Platform

gauge

bytes

HostEnvStats

MemAllocKernel

nshield_module_mem_alloc_user_bytes

The total amount of RAM allocated for user-mode processes in the module. This will be zero for non-SEE use.

This value includes the size of the SEE Machine image and the total heap space available to it. The module’s kernel does not know, and therefore cannot report, how much of the user-mode’s heap is currently free and how much is in use.

Resource Type Units stattree node stattree ID

Platform

gauge

bytes

ModuleEnvStats

MemAllocUser

nshield_chassis_mem_alloc_user_bytes

The total amount of RAM allocated for user-mode processes in the module. This will be zero for non-SEE use.

This value includes the size of the SEE Machine image and the total heap space available to it. The module’s kernel does not know, and therefore cannot report, how much of the user-mode’s heap is currently free and how much is in use.

Resource Type Units stattree node stattree ID

Platform

gauge

bytes

HostEnvStats

MemAllocUser

nshield_chassis_virtual_mem_bytes

The total memory in the system.

Resource Type Units EnvMon

Platform

gauge

bytes

memory_virtual_total

nshield_chassis_virtual_mem_free_bytes

The amount of physical RAM left unused by the system.

Resource Type Units EnvMon

Platform

gauge

bytes

memory_virtual_free

nshield_chassis_virtual_mem_available_bytes

An estimate of the amount of memory available for starting new applications without swapping.

Resource Type Units EnvMon

Platform

gauge

bytes

memory_virtual_available

Storage

nshield_module_nvram_free_bytes

The total amount of free space in the NVRAM of the HSM.

This is only available on XC and nShield 5 HSM variants.

Resource Type Units stattree node stattree ID

Platform

gauge

bytes

ModuleEnvStats

NVMFreeSpace

nshield_module_nvram_erase_per_endurance

The wear level of the HSM’s NVRAM, expressed as a percentage of the "erase count:endurance" ratio.

This is only available on XC and nShield 5 HSM variants.

Resource Type Details stattree node stattree ID

Platform

gauge

A value between 0 and 1

ModuleEnvStats

NVMWearLevel

nshield_module_worn_blocks_per_nvram

The percentage of worn blocks in the NVRAM of the HSM.

This is only available on XC and nShield 5 HSM variants.

Resource Type Details stattree node stattree ID

Platform

gauge

A value between 0 and 1

ModuleEnvStats

NVMWornBlocks

nshield_chassis_manufacturer_disk_percentage

The percentage used of the storage reserved for manufacturing data.

Resource Type Details EnvMon

Platform

gauge

A value between 0 and 1

disk_usage_longterm

nshield_chassis_system_disk_percentage

The percentage used of the storage reserved for internal software components.

Resource Type Details EnvMon

Platform

gauge

A value between 0 and 1

disk_usage_persistent

nshield_chassis_user_disk_percentage

The percentage used of the storage available for user configuration and logs.

Resource Type Details EnvMon

Platform

gauge

A value between 0 and 1

disk_usage_user

Internal software statistics

nshield_pci_irqs

The number of interrupts from the host. This is approximately equal to the total of HostReadCount and HostWriteCount.

This is only applicable to PCI HSMs.

Resource Type Details stattree node stattree ID

Platform

counter

ModulePCIStats

HostIRQs

nshield_pci_unhandled_irqs

The number of unidentified interrupts from the host. If this reports a nonzero value, it is likely that there is a problem with a driver or the PCI bus.

This is only applicable to PCI HSMs.

Resource Type Details stattree node stattree ID

Platform

counter

ModulePCIStats

HostUnhandledIRQs

nshield_pci_read_reconnect

The number of deferred reads that have now completed. This should be the same as HostReadDeferred, or one less than it if there is a currently deferred read.

This is only applicable to PCI HSMs.

Resource Type Details stattree node stattree ID

Platform

counter

ModulePCIStats

HostReadReconnect
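The expected relationship between the two PCI counters (HostReadReconnect equals HostReadDeferred, or is one less while a read is still deferred) can be expressed as a small consistency check. The function name and sample values are hypothetical, and the sketch assumes you can obtain both stattree values at the same time.

```python
def read_reconnect_is_consistent(read_deferred: int, read_reconnect: int) -> bool:
    """Check the documented relationship between the two PCI counters.

    read_reconnect should equal read_deferred, or be exactly one less
    while a deferred read is still outstanding.
    """
    return read_deferred - read_reconnect in (0, 1)


# Hypothetical counter samples, for illustration only.
print(read_reconnect_is_consistent(read_deferred=1200, read_reconnect=1199))
```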

nshield_AIS31_preliminary_alarms

The number of times the AIS31 random number test has failed. Because this is a statistical test, a small number of failures is expected. If it fails too often, it will trigger an SOS-HRAO alarm and the module will fail.

Resource Type Details stattree node stattree ID

Platform

counter

ModuleEnvStats

AIS31PrelimAlarms

nshield_correctable_memory_errors

The number of correctable memory errors that have been corrected by the error checking and correction (ECC) mechanisms. Typically, this count should be 0, although a small number of errors are to be expected occasionally. If this count increases rapidly, by multiple thousands per second, there has been a malfunction.

Resource Type Details stattree node stattree ID

Platform

counter

ModuleEnvStats

MceCount
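Because nshield_correctable_memory_errors is a counter, the "multiple thousands per second" condition is a statement about its rate of increase. The sketch below derives that rate from two samples. The function names are hypothetical, the 2000 per second threshold is an assumed reading of "multiple thousands", and the handling of a counter that goes backwards is a choice made for this example.

```python
# Assumed threshold; the description only says "multiple thousands per second".
RAPID_INCREASE_PER_SECOND = 2000.0


def error_rate_per_second(prev_count, prev_time, curr_count, curr_time):
    """Return the average counter increase per second between two samples."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_count < prev_count:
        # The counter went backwards (for example, it was reset), so
        # count only what has accumulated since it restarted from zero.
        increase = curr_count
    else:
        increase = curr_count - prev_count
    return increase / elapsed


def looks_like_malfunction(rate_per_second):
    """Return True when the rate reaches the assumed alert threshold."""
    return rate_per_second >= RAPID_INCREASE_PER_SECOND


# Hypothetical samples taken 10 seconds apart.
rate = error_rate_per_second(100, 0.0, 50_100, 10.0)
print(rate, looks_like_malfunction(rate))
```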

nshield_spi_communication_attempts

The number of times the main processor on an XC module has had to repeat an attempt to communicate with the security processor due to a communication failure. Loss of communication between the main processor and the security processor causes the module to enter an alarm state and fail. This sometimes triggers an SOS-HV alarm.

Resource Type Details stattree node stattree ID

Platform

counter

ModuleEnvStats

SpiRetries

System metrics

keysafe5_certificate_expiry

The length of time until the current KeySafe 5 certificate expires.

| Type | Unit | Labels |
| --- | --- | --- |
| gauge | seconds | agent="agentid", type="agent", subject="subject", issuer="issuer" |
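Since keysafe5_certificate_expiry is a gauge in seconds, a monitoring rule typically compares it with a warning window. The sketch below shows one way to do that. The function name, the 30-day default window, and the sample values are assumptions for illustration, not values defined by KeySafe 5.

```python
from datetime import timedelta


def expires_within(seconds_remaining: float, warning_days: int = 30) -> bool:
    """Return True when the certificate expires within the warning window."""
    remaining = timedelta(seconds=seconds_remaining)
    return remaining <= timedelta(days=warning_days)


# Hypothetical gauge samples: 10 days and 90 days of validity remaining.
print(expires_within(10 * 86_400))
print(expires_within(90 * 86_400))
```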