Thursday, 29 January 2015

Using Data Protector to back up your AWS (Amazon EC2) instances

Your cloud-hosted infrastructure (whether it is on Azure, AWS, Google App Engine, IBM, Rackspace or HP Cloud) is going to consist of:
  • Cattle, which are machines that you have automatically created and which contain no state that can be lost. They might have a replica of some data, but there will be other copies. If these machines fail, you just restart them or create a new instance. Hopefully you have this process automated.
  • Pets, which are machines that you administer and are installed manually. When these fail, you want to restore from a backup.
If you are a completely green-field site, then you won't have any backup infrastructure. But if you already have some in-house servers, then you will probably have existing backup infrastructure that you would want to make use of.

For example, the cheapest storage in the cloud at the moment appears to be Amazon Glacier, which costs USD10 per terabyte per month. But if you already have a tape library (or even a single standalone modern tape drive), you can easily have long-term cold storage at USD0.50 per terabyte or less, and you probably already have some tapes.
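To make the comparison concrete, here is a rough sketch of the arithmetic. It assumes the Glacier figure is a recurring monthly charge and the tape figure is a one-off media cost; both are my assumptions for illustration. It also ignores retrieval fees, data transfer charges and the cost of the tape drive itself. The function names and the 20TB / 12 month example are hypothetical.

```python
def glacier_cost(terabytes, months, usd_per_tb_month=10.0):
    """Cumulative Glacier storage charge, assuming a flat monthly rate per TB."""
    return terabytes * months * usd_per_tb_month


def tape_media_cost(terabytes, usd_per_tb=0.50):
    """Tape media cost, assuming the per-TB figure is paid once when the tape is bought."""
    return terabytes * usd_per_tb


if __name__ == "__main__":
    # Hypothetical example: keep 20TB of cold data for one year.
    tb, months = 20, 12
    print(f"Glacier: USD{glacier_cost(tb, months):,.2f}")
    print(f"Tape:    USD{tape_media_cost(tb):,.2f}")
```

Change the defaults to match whatever your provider and your tape supplier actually charge.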

Likewise, if you already have a Data Protector cell manager license, you might as well keep using it because it will work out cheaper than any dedicated cloud-hosting provider.

Broadly speaking, there are four options: a virtual tape library, an external control device backed by EBS volumes, StoreOnce in the cloud, or the HP cloud device.

Virtual tape library

This option is appropriate if you are very, very constrained by your budget and need to be very conservative in how you do backup changes. If you are currently backing up to a tape library, then this lets you keep the illusion of the same thing but put it into the cloud.
  • Create a Linux instance in an availability zone that you are not otherwise using. 
  • Install mhvtl on to it, and configure a virtual tape library with it. 
  • Mount a very large block image (persistent storage device) on /opt/mhvtl. 
  • You can now use this tape library just as if it were a real tape library.
If the Linux instance fails, start a new instance, install mhvtl again, and change the host controlling the library. Note that media agent licenses are concurrent, so if you use this tape library only when you aren't using your in-house library, there is no additional licensing cost.
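Because the licensing point depends on the two libraries never being in use at the same time, it can be worth checking your schedules for overlaps. The following is a minimal sketch, assuming you can express each backup window as a (start, end) pair of datetimes. The function names are hypothetical and nothing here talks to Data Protector.

```python
from datetime import datetime


def windows_overlap(a, b):
    """Return True if two (start, end) datetime windows overlap."""
    a_start, a_end = a
    b_start, b_end = b
    # Windows that merely touch (one ends exactly when the other starts) do not overlap.
    return a_start < b_end and b_start < a_end


def conflicting_windows(cloud_windows, inhouse_windows):
    """List every (cloud, in-house) pair of windows that would run concurrently."""
    return [
        (c, h)
        for c in cloud_windows
        for h in inhouse_windows
        if windows_overlap(c, h)
    ]


if __name__ == "__main__":
    # Hypothetical schedule: in-house library 20:00-23:00, cloud library 22:00-02:00.
    inhouse = [(datetime(2015, 1, 29, 20, 0), datetime(2015, 1, 29, 23, 0))]
    cloud = [(datetime(2015, 1, 29, 22, 0), datetime(2015, 1, 30, 2, 0))]
    for c, h in conflicting_windows(cloud, inhouse):
        print("Conflict:", c, "overlaps", h)
```

An empty result means the two libraries never need a media agent license at the same moment.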

EBS (Elastic Block Store) volumes

The problem with the virtual tape library solution is that you are somewhat constrained by the size of the block storage that you are using. But with an external control device, you can attach and detach Elastic Block Store (EBS) volumes on demand as required. You can add slots to the external device by adding additional block stores.

  • Create a Linux instance in an availability zone that you are not otherwise using.
  • Write an external control script which takes the DP command arguments and attaches and detaches EBS volumes to the Linux box.
  • Create an External device, using that script.
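As a starting point for the external control script, here is a minimal sketch. The argument convention (an action of "load" or "unload" followed by a slot number) is a hypothetical stand-in, not the real Data Protector external control interface, so check the arguments that DP actually passes before adapting it. The volume IDs, instance ID and device name are placeholders. The only real commands involved are the AWS CLI's `aws ec2 attach-volume` and `aws ec2 detach-volume`, and the sketch only prints them rather than running them.

```python
import sys

# Hypothetical mapping: each "slot" of the external device is one EBS volume.
# Adding a slot means adding another entry here.
SLOT_TO_VOLUME = {
    1: "vol-EXAMPLE1",
    2: "vol-EXAMPLE2",
}
INSTANCE_ID = "i-EXAMPLE"   # placeholder: the Linux instance acting as media agent
DEVICE_NAME = "/dev/sdf"    # placeholder: where the volume should appear


def build_command(action, slot):
    """Translate a (hypothetical) load/unload request into an AWS CLI command."""
    volume = SLOT_TO_VOLUME[slot]  # raises KeyError for an unknown slot
    if action == "load":
        return ["aws", "ec2", "attach-volume",
                "--volume-id", volume,
                "--instance-id", INSTANCE_ID,
                "--device", DEVICE_NAME]
    if action == "unload":
        return ["aws", "ec2", "detach-volume", "--volume-id", volume]
    raise ValueError(f"unknown action: {action!r}")


def main(argv):
    if len(argv) != 3:
        print("usage: ebs_control.py load|unload SLOT")
        return
    # Dry run: print the command. A real script would execute it
    # (for example with subprocess.run) and then mount or unmount the filesystem.
    print(" ".join(build_command(argv[1], int(argv[2]))))


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the command construction separate from the execution makes it easy to test the slot mapping without touching your AWS account.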

StoreOnce low-bandwidth replication

The previous two options don't offer a way of using in-house tape drives.

If you have a way of breaking up your backups into chunks of less than 20TB, then you can use the software StoreOnce component on an EC2 instance. It works on Windows and Linux; just make sure that you have installed a 64-bit image. The only licensing you will need is some extra Advanced Backup to Disk capacity.

An alternative is to buy a virtual storage appliance (VSA) from HP, and then create an Amazon Machine Image (AMI) out of it. This has the advantage that it can cope with larger volumes, and it also has better bandwidth management (e.g. shaping during the day, and full speed at night).

The steps here are:
  • Run up a machine in an availability zone which is different to the one hosting whatever you want to back up. Use a Windows, Linux or VSA image as appropriate. Call it InCloudStorage-server.
  • Create a StoreOnce device (e.g. "InCloudStorage").
  • Create backups writing to "InCloudStorage".
  • Create a StoreOnce device in-house. Call it "CloudStorageCopy".
  • If you are not using a VSA, you will need to create a gateway for CloudStorageCopy on InCloudStorage-server. Remember to check the "server-side deduplication" button.
  • Create a post-backup copy job which replicates the backups (which went to InCloudStorage) to CloudStorageCopy using that server-side deduplicated gateway.
  • Create a post-copy copy job to copy these out to tape.
The beauty of this scheme is that you can seed the CloudStorageCopy with any relevant data. As most of the virtual machines you are backing up will be very similar, you will achieve very good deduplication ratios. 20:1 is probably reasonable to expect, or possibly higher. So instead of having to transfer 100GB of backup images from the cloud to your office each day, you might only be transferring 5GB, which is quite practical.
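To see how this plays out with your own numbers, here is a small sketch of the arithmetic. It assumes decimal units (1GB is 8,000 megabits) and ignores protocol overhead. The function names and the 10Mbit/s link in the example are hypothetical.

```python
def transferred_gb(backup_gb, dedup_ratio):
    """Data actually sent over the wire after deduplication (a ratio of 20 means 20:1)."""
    if dedup_ratio <= 0:
        raise ValueError("dedup_ratio must be positive")
    return backup_gb / dedup_ratio


def transfer_hours(gigabytes, link_mbit_per_s):
    """Hours needed to move the data over a link, ignoring protocol overhead."""
    megabits = gigabytes * 8 * 1000  # decimal units: 1GB = 8000 megabits
    return megabits / link_mbit_per_s / 3600


if __name__ == "__main__":
    daily = transferred_gb(100, 20)  # the 100GB at 20:1 example from the text
    print(f"Daily transfer: {daily:.1f}GB")
    print(f"Time on a 10Mbit/s link: {transfer_hours(daily, 10):.2f} hours")
```

If the deduplication ratio turns out lower than expected, the same functions show quickly whether the nightly replication still fits in your window.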

HP cloud device

I discussed this in an earlier post. If you are using the HP cloud, then this is almost a no-brainer -- you don't even need to provision a server. For the other cloud providers, whether this makes sense depends on the bandwidth you get to the HP cloud (and the cost of that bandwidth!).

Greg Baker is an independent consultant who happens to do a lot of work on HP Data Protector. He is the author of the only published books on HP Data Protector. He works with HP and HP partner companies to solve the hardest big-data problems (especially around backup). See more at IFOST's DataProtector pages.