CVM Out of memory

Hi!
So today we received an alert that one of our CVMs was unresponsive.

The first thing I did was to see if I could ping it from another CVM, which I could not. So off to the Prism Element interface to check if the CVM was powered on. It sure was. But when I launched the console, it was totally unresponsive.
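
For reference, a quick way to do that check from any healthy CVM is to list the CVM IPs in the cluster and ping the affected one (the IP below is just the address from our environment):

svmips
ping 10.10.10.113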

I decided to reboot the CVM since it was not responding and had already started to detach from the metadata ring in the cluster.
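
If you want to verify the ring state yourself before pulling the trigger, the same nodetool command used later in this post can be run from any healthy CVM to see whether the node is still part of the metadata ring:

nodetool -h 0 ring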

Below are the steps required to reboot the CVM from the AHV host.

From any operational CVM in the cluster, SSH to the AHV host of the affected CVM:

ssh root@10.10.10.40

First, list all the VMs running on the host:

virsh list --all

You should receive output similar to this:

[root@NX-<SERIAL>-A ~]#  virsh list --all
 Id   Name                                   State
------------------------------------------------------
 1    NTNX-NX-<SERIAL>-A-CVM                 running
 2    d20d3dee-ead0-45ef-bedb-cc81bb02312b   running
 3    0ffeb6ef-ccca-483c-8c54-c390d462e44f   running
 5    3315c434-1a16-414d-89f3-48e9bd7a23a7   running
 6    5d491c9f-dae0-463d-aed7-f2b239ff39f1   running
 7    6f6eac3b-b8d7-4343-8438-2fa891dcaea8   running
 8    87152a33-137d-41e1-a547-126a5e5219ee   running

We can confirm that the CVM is running.

There is a way to see what is happening on the CVM via the CVM serial log. Let's take a look at that log by running the commands below.

cd /tmp/
less NTNX.serial.out.0

Here we saw the reason why the CVM was unresponsive.

ntnx-<serial>-a-cvm login: [5382403.543324] Out of memory: Kill process 17354 (counters_collec) score 1001 or sacrifice child
[5382403.544799] Killed process 17354 (counters_collec), UID 1000, total-vm:273040kB, anon-rss:62056kB, file-rss:0kB, shmem-rss:0kB
[5382403.552031] Out of memory: Kill process 17522 (counters_collec) score 1001 or sacrifice child
[5382403.553321] Killed process 17522 (counters_collec), UID 1000, total-vm:273040kB, anon-rss:62056kB, file-rss:0kB, shmem-rss:0kB
[5382405.556172] Out of memory: Kill process 17345 (insights_collec) score 1001 or sacrifice child
[5382405.557210] Killed process 17345 (insights_collec), UID 1000, total-vm:1093060kB, anon-rss:35716kB, file-rss:0kB, shmem-rss:0kB
(END)

Clean and simple: the CVM had run out of memory. So, let's power it off using virsh (note that virsh destroy is a hard power-off of the domain, it does not delete the VM):

virsh destroy NTNX-NX-<SERIAL>-A-CVM

Now allow some time, and then confirm the CVM is powered off by running the list command again:

virsh list --all

We should now see that the CVM is powered off:

[root@NX-<SERIAL>-A ~]#  virsh list --all
 Id   Name                                   State
------------------------------------------------------
 1    NTNX-NX-<SERIAL>-A-CVM                 shut off
 2    d20d3dee-ead0-45ef-bedb-cc81bb02312b   running
 3    0ffeb6ef-ccca-483c-8c54-c390d462e44f   running
 5    3315c434-1a16-414d-89f3-48e9bd7a23a7   running
 6    5d491c9f-dae0-463d-aed7-f2b239ff39f1   running
 7    6f6eac3b-b8d7-4343-8438-2fa891dcaea8   running
 8    87152a33-137d-41e1-a547-126a5e5219ee   running

And then, at last, let's power the CVM on again.

virsh start NTNX-NX-<SERIAL>-A-CVM

Then once again, allow for some time, and run virsh list --all again to confirm that it's running.

Exit the host, and go back to a functional CVM.

Confirm that you can ping the affected CVM:

ping 10.10.10.113

You should now receive replies to your ICMP packets. Now connect to the CVM that was affected.

Then you can monitor the progress of the cluster services starting up by running these commands:

ssh 10.10.10.113
watch -d genesis status
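
If you want a cluster-wide view of the services instead of watching a single CVM, cluster status from any CVM shows the service state on every node:

cluster status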

Once all the cluster services are up on the CVM, confirm that the resiliency status becomes green in the Prism Element interface and that the CVM is added back to the metadata ring:

nodetool -h 0 ring
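
As an optional final sanity check, you can also run a full NCC health check from any CVM:

ncc health_checks run_all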

Conclusion

It was weird that the CVM ran out of memory, but since our cluster has grown quite a bit lately, the CVM RAM has been quite constrained. So we're looking into increasing the CVM RAM together with support.
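
If you want a quick feel for how constrained the CVM memory is across the cluster, allssh from any CVM runs a command on every CVM, so something like this prints the memory usage on all of them:

allssh "free -m"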

During this, we did not experience any user workload disruption, thanks to the awesome fault tolerance of Nutanix :)

Also, Nutanix support gets 10/10 for their fast response and quick help on the CVM RAM increase ticket we created.

To see how to increase the CVM RAM, take a look at my previous post.

Until next time,
Have a great one :)