
Thursday, 19 June 2014

VMware, Data Protector and virtual machines which won't consolidate

Working on a customer's systems recently, I found a large number of virtual machines showing the following error message:

Configuration Issues
Virtual machine disks consolidation is needed.

But if I tried to right-click in vCenter and select Snapshots -> Consolidate, what I got was "unable to access file <unspecified filename> since it is locked".

This was also causing error messages in the backup log, because HP Data Protector attempts to consolidate disks at the start of a full backup.

The VMware KB articles suggested various things to identify the lock. I ssh'ed in and ran
tail -f vmware.log | grep lock
to identify what the lock could be. As it turned out, it wasn't quite a lock. The file that couldn't be opened was a .vmdk file - no surprises there. So I ran
lsof | grep the-vmdk-file
This showed that two different processes had it open.
ps | grep process-id-from-the-previous-step
showed that the two processes were both /bin/vmx, but it was possible to distinguish them by their child vmx-vthread processes.
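
The lsof and ps steps above can be rolled into one small helper. This is only a sketch: `holders_of` is a name I made up, and it assumes that each line of lsof output starts with the ID of the owning process and contains the path of the open file somewhere on the line. Check the column layout on your own host before trusting it.

```shell
# Hypothetical helper: read lsof output on stdin and print the unique IDs
# (first whitespace-separated field) of every process holding a file
# whose path contains the string given as $1.
# Assumption: the owning process ID is the first column of each line.
holders_of() {
    grep -F -- "$1" | awk '{ print $1 }' | sort -u
}

# On the host this would be used as:
#   lsof | holders_of the-vmdk-file
```

If this prints more than one ID for a single .vmdk file, something other than the virtual machine itself has the disk open, and each ID can then be fed to ps as described above.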

One of them was the process running the virtual machine itself, as expected. The other belonged to the machine, identified by its hostname, that runs their HP Data Protector VEPA agent.

This customer has a virtual machine inside their VMware environment which runs their VMware backups. They don't have to worry about correctly presenting LUNs or having an extra device attached to their SAN fabric. They do source-side deduplicated backups from this virtual machine, so it doesn't generate as much network traffic as it otherwise would.

What had happened was that some backup had failed spectacularly, leaving the snapshots mounted on the VEPA agent virtual machine. The settings for the agent virtual machine proudly said that it had 13 virtual disks, when it should have had only one: its boot disk.
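
A quick way to spot leftover disks like these from the command line is to count the disk entries in the agent machine's .vmx file. The sketch below is not anything Data Protector or VMware provides: `count_vmdks` is a hypothetical name, and it assumes that every attached disk appears in the .vmx file as a line of the form `scsi0:1.fileName = "something.vmdk"`.

```shell
# Hypothetical helper: count the virtual disks listed in a .vmx file.
# Assumption: each attached disk has a line such as
#   scsi0:1.fileName = "something.vmdk"
# Entries that do not point at a .vmdk file (e.g. an empty CD-ROM) are ignored.
count_vmdks() {
    grep -Eic '^[a-z]+[0-9]+:[0-9]+\.fileName *= *".*\.vmdk"' "$1"
}

# Usage (the path is a placeholder):
#   count_vmdks /path/to/agent-vm.vmx
```

For a backup agent that should only have its boot disk, any result greater than 1 is worth investigating.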

Naturally, VMware couldn't consolidate the snapshots because, as far as it was concerned, those snapshots were still in use. Nor could VMware delete the virtual disks from the agent machine, because there were snapshots depending on them.

So the solution was:

  • Remove the snapshots on the agent machine.
  • Remove the extraneous disks from the agent machine.
  • Run the snapshot consolidation from the vCenter GUI.