CVM Out of memory
Hi!
So today we received an alert that one of our CVMs was unresponsive.
The first thing I did was to see if I could ping it from another CVM, which I could not. So off to the Prism Element interface to check whether the CVM was powered on. It sure was, but when I launched the console it was totally unresponsive.
I decided to reboot the CVM, since it was not responsive and had already started to detach from the metadata ring in the cluster.
Below are the steps required to reboot the CVM from the AHV host.
From any operational CVM in the cluster, SSH to the AHV host of the affected CVM.
ssh root@10.10.10.40
First, show a list of all the VMs running on the host.
virsh list --all
You should get output similar to this:
[root@NX-<SERIAL>-A ~]# virsh list --all
 Id    Name                                   State
------------------------------------------------------
 1     NTNX-NX-<SERIAL>-A-CVM                 running
 2     d20d3dee-ead0-45ef-bedb-cc81bb02312b   running
 3     0ffeb6ef-ccca-483c-8c54-c390d462e44f   running
 5     3315c434-1a16-414d-89f3-48e9bd7a23a7   running
 6     5d491c9f-dae0-463d-aed7-f2b239ff39f1   running
 7     6f6eac3b-b8d7-4343-8438-2fa891dcaea8   running
 8     87152a33-137d-41e1-a547-126a5e5219ee   running
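If you don't want to pick the CVM out of the list by eye, the name can be filtered out of the listing. This is only a minimal sketch: it assumes the CVM domain name ends in -CVM like in the output above, and cvm_domain_name is a helper name I made up for illustration.

```shell
# Hypothetical helper: read `virsh list --all` output on stdin and print
# the name of the first domain whose name ends in "-CVM".
# Assumption: CVM domain names end in "-CVM", as in the listing above.
cvm_domain_name() {
    awk '$2 ~ /-CVM$/ { print $2; exit }'
}

# Usage on the AHV host:
#   virsh list --all | cvm_domain_name
```

The name it prints is the one to pass to the other virsh commands in this post.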
We can confirm that the CVM is running.
The CVM serial log on the host shows what is happening on the CVM. Let's take a look at that log by running the commands below.
cd /tmp/
less NTNX.serial.out.0
Here we saw the reason why the CVM was unresponsive.
ntnx-<serial>-a-cvm login: [5382403.543324] Out of memory: Kill process 17354 (counters_collec) score 1001 or sacrifice child
[5382403.544799] Killed process 17354 (counters_collec), UID 1000, total-vm:273040kB, anon-rss:62056kB, file-rss:0kB, shmem-rss:0kB
[5382403.552031] Out of memory: Kill process 17522 (counters_collec) score 1001 or sacrifice child
[5382403.553321] Killed process 17522 (counters_collec), UID 1000, total-vm:273040kB, anon-rss:62056kB, file-rss:0kB, shmem-rss:0kB
[5382405.556172] Out of memory: Kill process 17345 (insights_collec) score 1001 or sacrifice child
[5382405.557210] Killed process 17345 (insights_collec), UID 1000, total-vm:1093060kB, anon-rss:35716kB, file-rss:0kB, shmem-rss:0kB
(END)
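On a long serial log it can be quicker to summarize the kills than to page through them. Here is a rough sketch that counts the killed processes by name. It assumes the kill lines look like the "Killed process <pid> (<name>)" lines above, and oom_summary is just a name I picked for illustration.

```shell
# Hypothetical helper: count OOM kills per process name in a serial log.
# Assumption: kill lines have the form "Killed process <pid> (<name>), ..."
oom_summary() {
    grep -o 'Killed process [0-9]* ([^)]*)' "$1" \
        | sed 's/.*(\(.*\))/\1/' \
        | sort | uniq -c | sort -rn
}

# Usage on the AHV host:
#   oom_summary /tmp/NTNX.serial.out.0
```

For the excerpt above this would list two kills of counters_collec and one of insights_collec.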
Clean and simple: the CVM has run out of memory. So, let's reboot it using virsh. Note that virsh destroy does not delete the CVM; it is a hard power-off, like pulling the plug:
virsh destroy NTNX-NX-<SERIAL>-A-CVM
Now allow some time, and then confirm the CVM is powered off by running the list command again.
virsh list --all
We should now see that the CVM is powered off.
[root@NX-<SERIAL>-A ~]# virsh list --all
 Id    Name                                   State
------------------------------------------------------
 1     NTNX-NX-<SERIAL>-A-CVM                 shutdown
 2     d20d3dee-ead0-45ef-bedb-cc81bb02312b   running
 3     0ffeb6ef-ccca-483c-8c54-c390d462e44f   running
 5     3315c434-1a16-414d-89f3-48e9bd7a23a7   running
 6     5d491c9f-dae0-463d-aed7-f2b239ff39f1   running
 7     6f6eac3b-b8d7-4343-8438-2fa891dcaea8   running
 8     87152a33-137d-41e1-a547-126a5e5219ee   running
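Instead of re-running the list command by hand, the wait can be scripted with virsh domstate, which prints the state of a single domain. This is a sketch under a few assumptions: wait_for_domstate is a helper name I made up, the 2 second poll interval and 60 second default timeout are arbitrary, and the state string has to match what your host actually prints (libvirt normally reports a powered-off domain as "shut off").

```shell
# Hypothetical helper: poll `virsh domstate` until the domain reaches the
# wanted state or the timeout expires.
# Arguments: domain name, wanted state, timeout in seconds (default 60).
wait_for_domstate() {
    domain=$1; wanted=$2; timeout=${3:-60}
    elapsed=0
    state=unknown
    while [ "$elapsed" -lt "$timeout" ]; do
        # Command substitution strips the trailing blank line virsh prints.
        state=$(virsh domstate "$domain" 2>/dev/null)
        if [ "$state" = "$wanted" ]; then
            return 0
        fi
        sleep 2
        elapsed=$((elapsed + 2))
    done
    echo "timed out waiting for $domain to be '$wanted' (last state: $state)" >&2
    return 1
}

# Usage on the AHV host, with the CVM name from the listing:
#   wait_for_domstate NTNX-NX-<SERIAL>-A-CVM "shut off"
#   wait_for_domstate NTNX-NX-<SERIAL>-A-CVM "running"
```

The same helper works after the power-on step by waiting for "running" instead.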
And then, at last, let's power the CVM on again.
virsh start NTNX-NX-<SERIAL>-A-CVM
Then once again, allow some time, and run virsh list --all again to confirm that it's running.
Exit the host and go back to a functional CVM.
Confirm that you can ping the affected CVM:
ping 10.10.10.113
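The CVM can take a little while to boot, so the first pings may go unanswered. A small retry loop saves you from re-running the command by hand. This is just a sketch: wait_for_ping is a made-up helper name, the 30 attempts and 5 second pause are arbitrary, and the -W timeout flag assumes the Linux (iputils) ping found on the CVM.

```shell
# Hypothetical helper: ping a host once per attempt until it answers.
# Arguments: host, number of attempts (default 30).
wait_for_ping() {
    host=$1; attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -c 1: send one packet, -W 2: wait at most 2 seconds for a reply
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            echo "$host is answering ping"
            return 0
        fi
        i=$((i + 1))
        sleep 5
    done
    echo "$host did not answer after $attempts attempts" >&2
    return 1
}

# Usage from a functional CVM:
#   wait_for_ping 10.10.10.113
```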
You should now receive replies to your ICMP packets. Now connect to the CVM that was affected:
ssh 10.10.10.113
Then you can monitor the progress of the cluster services starting up by running this command:
watch -d genesis status
Once all the cluster services are up on the CVM, confirm that the resiliency status turns green in the Prism Element interface and that the CVM has been added back to the metadata ring:
nodetool -h 0 ring
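With many nodes it is easy to miss one that has not rejoined. The sketch below flags any ring member that is not Up and Normal. It rests on an assumption about the output format: that each node line starts with the CVM IP address, followed by a Status column and a State column. ring_check is a made-up helper name, so check it against your own output before trusting it.

```shell
# Hypothetical helper: read `nodetool -h 0 ring` output on stdin, print
# every node line that is not "Up" / "Normal", and return non-zero if
# any such line was found.
# Assumption: node lines look like "<ip> <Status> <State> ...".
ring_check() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && !($2 == "Up" && $3 == "Normal") {
             print
             bad = 1
         }
         END { exit bad + 0 }'
}

# Usage from any CVM:
#   nodetool -h 0 ring | ring_check && echo "all nodes Up/Normal"
```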
Conclusion
It was weird that the CVM ran out of memory, but since our cluster has grown quite a lot lately, CVM RAM has been rather constrained. So we're looking into increasing the CVM RAM together with support.
During all this we did not experience any user workload disruption, thanks to the AWESOME fault tolerance of Nutanix :)
Support also gets 10/10 for a fast response and quick help with the CVM RAM increase ticket we created.
To see how to increase the CVM RAM, take a look at my previous post.
Until next time
Have a great one :)