Disaster Recovery in AWS – Part 1

Disaster Recovery in AWS – How to do it and how much does it cost?

Design, implementation, and operation of a Disaster Recovery Center (DRC) are a pain in the ass. On the one hand, you need it. On the other, you don’t want to have it: first because of the costs, and second because of the fairly complicated technology stack behind it – servers, storage, network, firewall, VPN, replication software, and the entire application stack with its app servers, DB servers, etc. When you think about building – somewhere, somehow – a 1:1 copy of your environment, I bet you feel slight excitement accompanied by intense nausea. Thousands of questions come to mind – how, with what, this, that… Hey! Stop. There are solutions, and today we will talk about one that is, it’s fair to say, quite impressive. Let’s set up our DRC on AWS with CloudEndure Disaster Recovery.

CloudEndure Disaster Recovery into Amazon Web Services (AWS) is a Software-as-a-Service (SaaS) solution. The solution is powered by innovative workload mobility technology, which continuously replicates applications from physical, virtual, or cloud-based infrastructure to a low-cost “staging area” (detailed below) that is automatically provisioned in any target AWS Region of the customer’s choice. During failover or testing, an up-to-date copy of applications can be spun up on demand and be fully functioning in minutes.

You can use CloudEndure to replicate databases, including Microsoft SQL Server, Oracle, and MySQL, as well as enterprise applications such as SAP. CloudEndure Disaster Recovery enables rapid recovery of the application, database, files, OS configuration, and system state — meaning that operations continue smoothly with fully functioning workloads. In addition to a self-service, web-based Console with centralized management for all of a customer’s projects, CloudEndure provides APIs that enable developers to implement large-scale automation and other advanced capabilities. Here’s how it works:

CloudEndure Disaster Recovery utilizes block-level, Continuous Data Replication, which ensures that target machines are spun up in their up-to-date state during a disaster or drill. Organizations can thereby achieve sub-second Recovery Point Objectives (RPOs).

The Continuous Data Replication takes place in a low-cost “staging area” in AWS based on EBS volumes. In the event of a disaster, CloudEndure triggers an automated machine conversion process (p2c/v2c/c2c) and a scalable orchestration engine that can spin up machines in the target AWS Region within minutes. CloudEndure’s Disaster Recovery solution also provides the resilience of a warm standby solution at the low cost of a cold standby solution.

CloudEndure replication is done at the OS level (rather than hypervisor or SAN level). Once installed and activated, the CloudEndure Agent begins initial replication, reading all of the data on the machines at the block level and replicating it to a low-cost “staging area” that is automatically provisioned in a customer’s AWS account, in a target network of their choice. Customers define replication settings, such as subnets, security groups, and replication tags, through the self-service, web-based CloudEndure Console. The initial replication can take anywhere from several minutes to several days, depending on the amount of data to be replicated and the bandwidth available between the source infrastructure and the target AWS Region. No reboot is required nor is there system disruption throughout the initial replication.
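To get a feel for where a given environment lands between "several minutes" and "several days", here is a minimal back-of-the-envelope sketch. It is not a CloudEndure tool: the function name, the 1 Gbit/s link, and the 80% link efficiency are all assumptions made up for illustration, and it ignores compression, throttling, and ongoing changes on the source.

```python
def initial_replication_hours(data_gb: float,
                              bandwidth_mbps: float,
                              efficiency: float = 0.8) -> float:
    """Rough estimate of initial sync time in hours.

    data_gb        -- total data to replicate, in decimal gigabytes
    bandwidth_mbps -- link speed between source and AWS, in megabits/s
    efficiency     -- assumed usable fraction of the link (hypothetical 80%)
    """
    data_megabits = data_gb * 1000 * 8           # GB -> MB -> megabits
    effective_mbps = bandwidth_mbps * efficiency
    seconds = data_megabits / effective_mbps
    return seconds / 3600


# Hypothetical example: 10 TB of data over a 1 Gbit/s link.
hours = initial_replication_hours(data_gb=10_000, bandwidth_mbps=1_000)
print(f"Estimated initial replication: {hours:.1f} h")
```

Halving the bandwidth doubles the estimate, which is why the available link usually matters more than anything else for the initial sync.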

After the initial replication is complete, the source machines are continuously monitored to ensure constant synchronization, up to the last second. Any changes to source machines are asynchronously replicated in real time into the AWS “staging area”. When replicating machines across similar infrastructures, such as between AWS Regions or Availability Zones, the replicated machines can boot natively in the target environment, as there are no significant differences in infrastructure. However, when replicating machines across dissimilar infrastructures, such as from on-premises or other clouds into AWS, machine conversions are required to ensure that the replicated machines can continue to run natively within AWS. This includes adjustments for hypervisor, driver, and other differences. Without proper conversion, such transitions between physical machines, hypervisor variations, or different clouds will result in non-bootable target machines. CloudEndure addresses this using proprietary machine conversion technology, which handles all hypervisor and OS configuration changes, boot process changes, OS activation, and installation of target infrastructure guest agents. CloudEndure claims that the automated machine conversion process takes approximately 30 seconds; for us it was around 3–7 minutes. Either way, the conversion ensures that Windows or Linux machines replicated from physical, virtual, and cloud-based infrastructure will boot natively and run transparently in the customer’s preferred target AWS Region.

Of course, you need to prepare your environment on the AWS side: networking, IAM roles, policies, and connection points. The idea is simple – we have some workload in our on-prem datacenter, with a network configuration that we need to reproduce 1:1 in AWS. Why? Because we don’t want to change anything in the configuration of the guest operating system or our application in case of a DR event. That wouldn’t be good. Just imagine a DR event: everything burns and your workloads start in AWS infrastructure. So we need to take care of networking, but also of access. How will users reach the workloads? We need to set up VPN access and give that endpoint to users in case of a disaster. We will talk about a detailed design in the next part of this series.
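One way to sanity-check the "1:1 network" idea before touching AWS is to compare the on-prem subnet list with the subnets planned for the DR VPC. The sketch below uses only Python's standard `ipaddress` module; the function name and all CIDR ranges are hypothetical placeholders, not values from our environment.

```python
import ipaddress


def unmirrored_subnets(onprem_subnets, vpc_cidr, planned_aws_subnets):
    """Return on-prem subnets that the planned DR VPC does not reproduce 1:1.

    A subnet counts as mirrored only if it lies inside the VPC CIDR and an
    identical subnet is planned on the AWS side.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    planned = {ipaddress.ip_network(s) for s in planned_aws_subnets}
    problems = []
    for cidr in onprem_subnets:
        net = ipaddress.ip_network(cidr)
        if not net.subnet_of(vpc) or net not in planned:
            problems.append(cidr)
    return problems


# Hypothetical addressing plan, for illustration only.
onprem = ["10.10.1.0/24", "10.10.2.0/24", "192.168.5.0/24"]
gaps = unmirrored_subnets(
    onprem_subnets=onprem,
    vpc_cidr="10.10.0.0/16",
    planned_aws_subnets=["10.10.1.0/24", "10.10.2.0/24"],
)
print("Subnets still to be mirrored:", gaps)
```

Keep in mind that AWS reserves the first four addresses and the last address of every subnet, so a machine using one of those addresses on-prem cannot keep it in the VPC.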

But before we start, let’s talk a little bit about the cost of DR in general and the costs in the cloud. We based our model on 100 VMs over 36 months. We compared a cloud DRC with a legacy DRC built from scratch – we have nothing right now, a pure greenfield. Let’s set the scene – our standard VM has 8 GB of RAM, 2 vCPUs, and 100 GB of storage. Storage is kept on an FC disk array. The hypervisor is VMware vSphere.

What we need to set up is:

  1. Servers – the same or a similar amount of capacity as in production. It has to be at least similar to guarantee the user experience when a disaster event happens. So, for 100 VMs we need 800 GB of RAM and 200 vCPUs overall. If we assume oversubscription of 4:1 on RAM and 2:1 on vCPU, it turns out we need 200 GB of physical RAM (pRAM) and 100 CPU cores. Let’s assume that HT will give us 25%. In that case we will need 75 cores on CPUs with HT support. It’s simple math – 2 servers, each with 256 GB of RAM and 2 CPUs with 20 cores. This is a super small setup.
  2. Network – we have two servers with 2 x 10G Ethernet interfaces + 2 x 16Gb FC interfaces, so we need our own Ethernet and Fibre Channel switches, or we buy ports from the datacenter provider.
  3. Storage – 100 standard VMs = 10 TB of I/O-intensive storage. We need a small disk array, like 3PAR 7200 or Dell EMC Unity.
  4. Software – if we have three servers, each with 2 CPUs, we need to buy VMware vSphere licences for 6 CPUs. Let’s opt for standard stuff, nothing fancy. Plus, we need some software to conduct replication and DRC activities. Let’s keep it in the family and use VMware SRM. And because we have more than 75 VMs, we have to choose Enterprise. We also need an additional vCenter.
  5. Place – some DC, somewhere.
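The server sizing from point 1 can be re-traced in a few lines of Python. This is only a sketch of the arithmetic with the ratios assumed in the text (4:1 RAM, 2:1 vCPU, 25% gain from HT); the variable names are mine, and the resulting server count covers capacity only, without any spare hosts.

```python
import math

# Workload described in the article
VM_COUNT = 100
RAM_PER_VM_GB = 8
VCPU_PER_VM = 2

# Assumptions taken from the text
RAM_OVERSUBSCRIPTION = 4   # 4:1
VCPU_OVERSUBSCRIPTION = 2  # 2:1
HT_GAIN = 0.25             # HT is assumed to cover 25% of the core demand

# Server model chosen in the text: 256 GB RAM, 2 CPUs x 20 cores
SERVER_RAM_GB = 256
SERVER_CORES = 2 * 20

total_ram_gb = VM_COUNT * RAM_PER_VM_GB                      # 800 GB
total_vcpu = VM_COUNT * VCPU_PER_VM                          # 200 vCPUs
physical_ram_gb = total_ram_gb / RAM_OVERSUBSCRIPTION        # 200 GB pRAM
physical_cores = total_vcpu / VCPU_OVERSUBSCRIPTION          # 100 cores
cores_with_ht = physical_cores * (1 - HT_GAIN)               # 75 cores

# The tighter of the two constraints (RAM vs. cores) decides the host count.
servers_needed = max(
    math.ceil(physical_ram_gb / SERVER_RAM_GB),
    math.ceil(cores_with_ht / SERVER_CORES),
)

print(f"pRAM: {physical_ram_gb:.0f} GB, cores with HT: {cores_with_ht:.0f}, "
      f"servers: {servers_needed}")
```

With these inputs the core count, not the RAM, is the constraint that pushes the design to two hosts.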

Let’s write it down in a chart:

We need to add the cost of setup – about 30% of the price – and the cost of maintenance. This can vary, but let’s say another 20% over 3 years. So, our legacy DRC will cost us 387 000 USD for 36 months, for 100 VMs. That’s 107.50 USD per VM per month.

Now, let’s calculate the costs of a DRC in AWS. We will need a VPC, EBS, and EC2, as well as data transfer and licensing for CloudEndure. Let’s start with the staging environment, set up in eu-central-1 on 3-year standard reserved instances, no upfront. We will use EC2 instances (t3.micro) + magnetic storage, at a monthly cost of around 1042.60 USD. Over 36 months, that amounts to 37 533.60 USD. For the prod infrastructure we will use eu-central-1 and EC2 instances (t3.large) with gp2 storage and 50 public IPs. Let’s assume that outbound network traffic will hit something around 100 GB. This infrastructure will only be used twice per year, for 8 hours – for testing or in case of a DR event. If we work with the on-demand model, a single production test will cost us around 1457.00 USD. This means 8742.00 USD in total (2 per year x 3 years = 6 times). Next, we can’t forget about the VPC, subnets, NAT Gateway, Route 53, and VPN – that’s another 145.60 USD monthly, and 5241.60 USD in total (36 months). Last thing: an ALB during DR testing or a DR event costs 1328.60 USD per month, so one day costs about 45 USD. In our case, that’s 270 USD in total (2 per year x 3 years = 6 times). Once again, let’s sum it up in a chart:
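The line items quoted above can be re-added with a short Python sketch. It uses only the prices stated in this paragraph; CloudEndure licensing and data transfer are mentioned but not priced in the text, so they are left out, and the subtotal below is therefore not the full AWS bill. The dictionary keys are my own labels.

```python
MONTHS = 36
DR_RUNS = 2 * 3  # two tests or DR events per year, over three years

# Prices in USD as quoted in the text
STAGING_PER_MONTH = 1042.60      # t3.micro + magnetic storage, reserved
NETWORKING_PER_MONTH = 145.60    # VPC, subnets, NAT Gateway, Route 53, VPN
PROD_PER_RUN = 1457.00           # t3.large + gp2, on-demand, one 8-hour run
ALB_PER_DAY = 45.00              # rounded daily share of the monthly ALB price

line_items = {
    "staging environment": round(STAGING_PER_MONTH * MONTHS, 2),
    "production runs": round(PROD_PER_RUN * DR_RUNS, 2),
    "networking": round(NETWORKING_PER_MONTH * MONTHS, 2),
    "ALB during runs": round(ALB_PER_DAY * DR_RUNS, 2),
}

# Subtotal of the itemized infrastructure only (no licensing, no setup,
# no maintenance).
infra_subtotal = round(sum(line_items.values()), 2)

for name, cost in line_items.items():
    print(f"{name:<22}{cost:>12.2f} USD")
print(f"{'itemized subtotal':<22}{infra_subtotal:>12.2f} USD")
```

The always-on staging area dominates the itemized infrastructure cost, while the production-sized instances barely register because they only run during tests or an actual DR event.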

Finally, we need to add setup costs, around 10% of the total value (no racking, no software installation, just simple AWS infra preparation and CloudEndure configuration), and maintenance costs – this can vary, but let’s say another 10% over a span of 3 years. So, our AWS-based DRC will cost us 368 864.64 USD for 36 months, for 100 VMs. That is 102.46 USD per VM per month.

Let’s compare those calculations:

You don’t need a PhD to see that the costs are almost identical. But remember, there are other factors you have to consider – risk, time to deliver, day two operations, compliance, governance and knowledge needed to set everything up. We will talk about those in the next part of the series.
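For anyone who wants to replay the comparison with their own totals, the per-VM figures reduce to one division. A minimal sketch, using the two 36-month totals from this article; the function and variable names are mine.

```python
MONTHS = 36
VM_COUNT = 100


def per_vm_per_month(total_usd: float) -> float:
    """Spread a 36-month total evenly over all VMs and months."""
    return total_usd / (MONTHS * VM_COUNT)


# 36-month totals from the article, in USD
LEGACY_TOTAL = 387_000.00
AWS_TOTAL = 368_864.64

legacy_unit = per_vm_per_month(LEGACY_TOTAL)
aws_unit = per_vm_per_month(AWS_TOTAL)
difference_pct = (LEGACY_TOTAL - AWS_TOTAL) / LEGACY_TOTAL * 100

print(f"Legacy DRC: {legacy_unit:.2f} USD per VM per month")
print(f"AWS DRC:    {aws_unit:.2f} USD per VM per month")
print(f"AWS is cheaper by {difference_pct:.1f}%")
```

A gap of under 5% is well within the error margin of the setup and maintenance percentages assumed earlier, which is why the non-cost factors matter so much.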