Multi-tenancy recovery of a failed VCM

In a multi-tenant system, the HSM hosting a VCM can reboot independently of the tenant machine, and network outages can prevent communication between the tenant and their VCM. Either event can cause the enquiry command to report a module as failed.

In most cases, the module can be restored to its previous state by a combination of the following:

Check the network

Ensure that there is network connectivity between the tenant and their VCM. The commands needed to do this depend on your network and operating system, but ideally you should send ICMP echo requests and confirm that replies are received. In the example below, the VCM has the IP address 192.168.0.107.

ping 192.168.0.107
PING 192.168.0.107 (192.168.0.107) 56(84) bytes of data.
64 bytes from 192.168.0.107: icmp_seq=1 ttl=63 time=19.4 ms
64 bytes from 192.168.0.107: icmp_seq=2 ttl=63 time=0.787 ms
64 bytes from 192.168.0.107: icmp_seq=3 ttl=63 time=0.773 ms

If you are unable to send and receive packets to and from the VCM, ensure that you have followed all the steps in network configuration and use standard network debugging tools to fix any network issues before continuing.

Network settings that do not survive a reboot are a common cause of failure. Ensure that all the settings you made while following network configuration persist across reboots.
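
The connectivity check described in this section can be scripted so that it stops on its own and returns a status that other scripts can act on. The following is a minimal sketch, assuming a Linux tenant machine whose ping accepts -c (packet count). The function name vcm_reachable and the default of three packets are illustrative assumptions and are not part of the product.

```shell
#!/bin/sh
# Sketch only: vcm_reachable is a hypothetical helper, not an nShield tool.
# Assumes a Linux-style ping that accepts -c <count>.

# Usage: vcm_reachable <vcm-ip-address> [packet-count]
# Returns 0 when ping reports success, 1 when it does not, 2 on bad usage.
vcm_reachable() {
    host="$1"
    count="${2:-3}"
    if [ -z "$host" ]; then
        echo "usage: vcm_reachable <vcm-ip-address> [packet-count]" >&2
        return 2
    fi
    # Send a bounded number of ICMP echo requests and discard the detail.
    if ping -c "$count" "$host" >/dev/null 2>&1; then
        echo "VCM at $host is reachable"
        return 0
    fi
    echo "VCM at $host did not reply; check the network configuration" >&2
    return 1
}
```

For the example address used above, the call would be vcm_reachable 192.168.0.107. A non-zero return indicates that the network should be fixed before trying any of the later recovery steps.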

Temporarily reduce load, if you can, when the HSM is under high load

If the HSM is running close to its maximum capability, you may experience timeouts in some administrative actions. Testing has shown that on a high-speed HSM this normally only occurs with at least 10 active VCMs all running at their maximum capacity.

As mentioned in the release notes, running a multi-tenant HSM at such extreme load is not recommended. If you do so and experience a failure, temporarily reduce the load on some VCMs while you attempt to recover, then restore the load once the failure is resolved. If you are a tenant, ask the service provider to do this for you.

Try recovering with an ncoreapi retry command

A failed VCM can sometimes be recovered using the retry command. In the example below, the failed VCM is shown as module 1 in the output of the enquiry command.

nopclearfail -m1 -r

Restart the hardserver

If a retry command does not restore the VCM, then try restarting the tenant hardserver. The example below is for a Linux operating system.

sudo /opt/nfast/sbin/init.d-ncipher restart

On a Windows operating system, use the Windows Control Panel to restart the nfast server service, or use the following commands:

net stop "nfast server"
net start "nfast server"

Try clearing the module

If none of the procedures above has recovered the VCM, you may be able to recover it by clearing the VCM with the following sequence of commands. In the example, the failed VCM appears as module 1 in the output of the enquiry command.

nopclearfail -m1 -M

< WAIT for 30 seconds before issuing the next command >

nopclearfail -m1 -c
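
The two-command sequence above, including the 30-second wait, can be wrapped in a small script so that the pause is not skipped by accident. The following is a minimal sketch, assuming a POSIX shell on the tenant machine with nopclearfail on the PATH. The function name clear_vcm and the NOPCLEARFAIL and WAIT_SECONDS variables are hypothetical conveniences introduced here and are not product settings.

```shell
#!/bin/sh
# Sketch only: clear_vcm is a hypothetical wrapper around the two documented
# nopclearfail commands. NOPCLEARFAIL and WAIT_SECONDS are illustrative
# overrides (for example, for a dry run); they are not product settings.

NOPCLEARFAIL="${NOPCLEARFAIL:-nopclearfail}"
WAIT_SECONDS="${WAIT_SECONDS:-30}"

# Usage: clear_vcm <module-number>
# Returns 2 on bad usage, 1 if the first command fails, otherwise the
# exit status of the second command.
clear_vcm() {
    module="$1"
    if [ -z "$module" ]; then
        echo "usage: clear_vcm <module-number>" >&2
        return 2
    fi
    # First command of the documented sequence (-M).
    "$NOPCLEARFAIL" "-m$module" -M || return 1
    # Wait before issuing the next command, as the procedure requires.
    sleep "$WAIT_SECONDS"
    # Second command of the documented sequence (-c).
    "$NOPCLEARFAIL" "-m$module" -c
}
```

For the module shown in the example, the call would be clear_vcm 1. Stopping after a failed first command is a design choice of this sketch, so that the second command is never issued against a module that did not accept the first.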

Contact support

If none of these methods works, contact Entrust Support.