No fear of the outage
We know that on any given day part of our infrastructure may simply go down. Somehow we got used to that feeling and adapted to live with it. Just take a look at the global leaders’ outages posted under the following links:
And the most recent one from OVH:
One conclusion comes to mind. They tend to fail from time to time. Sometimes it’s the whole region, another time just a particular service, like the AWS Kinesis described below:
Nothing will change in this area, even though the aforementioned cloud providers are multiplying their activities to predict any potential outages, and constantly raising the quality of the services to the very top. Certainly, all those precautions won’t prevent the mistakes from happening. Providers are going to make them as well as us, their customers. Yet, though we know it all too well, we keep acting like fools and complain, putting all the blame for unsuccessful or partially inefficient businesses on providers and service outages. My first assumption is that we tend to forget that our environments evolve from day one, and what we considered as adequate and sufficient at the start is adequate no longer.
Face some facts
Cloud providers give you equipment. Some of them even share the responsibility with their customers when it comes to certain parts of the infrastructure. What does it mean for you? The answer is simple. By picking your favourite vendor, you are presented with a variety of services, from simple backup tools to full disaster recovery templates just waiting to be implemented, that help you protect yourself from failures. The difference between those providers comes down to the number of out-of-the-box services. Of course, if you have enough time and resources you can code your own tools, but let’s leave that option out for now. For me, reinventing the wheel always felt like a waste of time, but you are free to have your own opinion.
Where should I start with my provider?
Based on my experience, I’ll use AWS as an example, but I’m sure you’ll find similar counterparts provided by your selected “business cloud holders”.
Once you accept that anything can go wrong with your environment, try to answer a few questions:
- Does my business require zero downtime?
- Does my business allow a disaster recovery scenario in which the data is being restored after the outage?
- Does my business allow a disaster recovery scenario in which resources are being deployed after the outage? (tighter RTO)
- Does my business allow only a disaster recovery scenario where already deployed resources are being started and scaled after the outage? (even tighter RTO)
- Does my business require me to have already implemented and working emergency resources, so I can scale them up when experiencing an outage?
- Have I ever walked through the entire disaster recovery plan?
- Do I have an up-to-date, documented recovery plan for each of my cloud services?
- Do I have an up-to-date, documented recovery plan for my entire region?
- Do I have an up-to-date, documented recovery plan for my entire Availability Zone or Data Center if necessary?
- Do I constantly improve operational excellence through regular chaos engineering outage simulations?
A Region is a physical location around the world where AWS clusters data centers. AWS calls each group of logical data centers an Availability Zone. Each AWS Region consists of multiple, isolated, and physically separate AZs within a geographic area.
Availability Zones are distinct locations within an AWS Region that are engineered to be isolated from failures in other Availability Zones. They provide inexpensive, low-latency network connectivity to other Availability Zones in the same AWS Region. Each region is completely independent.
By answering and rating these questions you will get a clear picture of where on the operational excellence map you are right now, and whether you will sleep well when an outage occurs.
To make it clear, the idea behind this article is not to judge the cloud providers but rather to advise you on what your team should take into consideration.
Dream big, but start with small steps
Every company that uses AWS services, from small startups to big enterprises, must consider a region outage as a potential problem. However, preparing for such failures requires us to be prepared for particular service outages first. Start with simple steps. Prepare a list of AWS services you’ve either implemented or are about to implement. Your workloads’ data will require a backup strategy that either runs periodically or is part of a continuous job (for example a pilot light scenario, which I’ll cover in the next section). How often you run your backup will determine your achievable recovery point. If some parts of your cloud environment don’t have such a strategy, list those blind spots and add the necessary actions to your roadmap. Some AWS services may not have an out-of-the-box protection or backup feature, so your team might have to take on some development overhead to prepare a customised solution.
If any of the following services is part of your infrastructure:
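The link between backup frequency and recovery point can be made concrete with a little arithmetic. The sketch below is a simplification: it assumes periodic backups that complete instantly, and the function names are made up for illustration.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the outage hits just before the next backup would run."""
    return backup_interval


def data_lost_at(outage: datetime, last_backup: datetime) -> timedelta:
    """Everything written after the last successful backup is lost."""
    if outage < last_backup:
        raise ValueError("outage cannot precede the last backup")
    return outage - last_backup


def meets_recovery_point(backup_interval: timedelta, target: timedelta) -> bool:
    """A schedule meets the target only if its worst case fits inside it."""
    return worst_case_data_loss(backup_interval) <= target
```

For example, a daily backup cannot satisfy a four-hour recovery point target, while an hourly one can.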
- Amazon EC2 instance
- Amazon EBS
- Amazon RDS
- Amazon DynamoDB tables
- Amazon EFS
- AWS Storage Gateway volumes
- Amazon FSx for Windows File Server and Amazon FSx for Lustre
then definitely check the AWS Backup service (https://aws.amazon.com/backup/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc). It enables centralized and automated data protection by taking care of the backups, even across regions. By having a backup in another region, you set yourself on the right path toward realizing the big dream of being prepared for a regional outage.
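As a rough illustration, a backup plan with a cross-region copy might be described as below. This is a sketch, not a recommended configuration: the plan name, vault names, account ID, schedule and retention period are all invented, and the dictionary only mirrors the shape that the AWS Backup `create_backup_plan` call accepts.

```python
def cross_region_backup_plan(
    plan_name: str,
    source_vault: str,
    destination_vault_arn: str,
    retention_days: int = 35,
) -> dict:
    """Build a daily backup plan whose recovery points are copied to another region."""
    return {
        "BackupPlanName": plan_name,
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": source_vault,
                # AWS cron format: every day at 03:00 UTC (arbitrary choice).
                "ScheduleExpression": "cron(0 3 * * ? *)",
                "Lifecycle": {"DeleteAfterDays": retention_days},
                "CopyActions": [
                    {
                        # The vault in the backup region receives a copy.
                        "DestinationBackupVaultArn": destination_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retention_days},
                    }
                ],
            }
        ],
    }


plan = cross_region_backup_plan(
    plan_name="example-plan",
    source_vault="primary-vault",
    # Placeholder ARN: the account ID and vault name are invented.
    destination_vault_arn="arn:aws:backup:eu-central-1:111122223333:backup-vault:dr-vault",
)
# With boto3 installed and credentials configured, the plan could be created with:
#   boto3.client("backup").create_backup_plan(BackupPlan=plan)
```

A plan on its own protects nothing until resources are assigned to it, which in boto3 is done with a separate `create_backup_selection` call.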
Let’s see how your company can deal with that.
Low cost “Pilot light” scenario – a good first step
Each financial department meticulously protects its company from additional costs, especially those which don’t generate income. With the pilot light approach, data is replicated from one region to another and a copy of your core workload infrastructure is provisioned. The only services that are always on in the “backup” region are databases and object storage. Other services, like the EC2 instances, are preconfigured with the application code, mandatory configuration and dependencies, but they stay turned off and are launched only when a failover or test is invoked.
If you compare this approach to the backup and restore one, you’ll see that it has preprovisioned infrastructure in place, waiting to be launched and scaled out to meet traffic needs. Of course, spinning up the whole environment takes time, but unlike restoration from backups, where unexpected issues might come up, the risk of recovery failure is limited. You still save money by minimizing active resources, which should keep your CFO satisfied.
Higher costs = “Warm standby” scenario – a shorter recovery time
For a great number of companies, a “pilot light” scenario has literally no value for their businesses. If you’re a big enterprise or a mature, recognizable startup with millions of users, you want your recovery time after downtime to be as short as possible. This is the point where a “warm standby” scenario comes into play. It relies on a fully functional, scaled-down copy of your production environment running in another AWS region. Contrary to the pilot light, the workload is always on in a warm standby scenario, which gives a solid base for performing tests and strengthens the team’s confidence in its ability to recover from failure.
For those who can’t spot the difference between the pilot light and warm standby scenarios, here’s how I see them. The first one cannot process any requests without first launching instances and other non-core parts of your infrastructure, whereas the second one can handle traffic right away, but at a reduced capacity level (it requires scaling up). Of course, you shouldn’t be surprised that you will be charged extra for that “ready to go” backup infrastructure, which for many is still a mental block. Nonetheless, if you happen to be in the group of mature companies, you really should consider or even start the next, more complex scenario – “multi-site active/active”.
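The distinction can also be written down as a tiny model. It is only a sketch: the class, its fields and the 25% capacity figure for warm standby are assumptions made for illustration, not AWS terminology or recommended sizing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyRegion:
    name: str
    data_replicated: bool    # databases and object storage kept in sync
    running_capacity: float  # share of production compute already running (0.0 to 1.0)

    def can_serve_traffic_now(self) -> bool:
        """Requests can be handled only if some compute is already running."""
        return self.data_replicated and self.running_capacity > 0

    def steps_to_full_capacity(self) -> list[str]:
        """List what still has to happen before the region can take full load."""
        steps = []
        if not self.data_replicated:
            steps.append("restore data from backup")
        if self.running_capacity == 0:
            steps.append("launch preconfigured instances")
        if self.running_capacity < 1:
            steps.append("scale out to production size")
        return steps


pilot_light = StandbyRegion("pilot light", data_replicated=True, running_capacity=0.0)
# 0.25 is an invented figure; the right size depends on the workload.
warm_standby = StandbyRegion("warm standby", data_replicated=True, running_capacity=0.25)
```

In this model the pilot light region has one more recovery step than warm standby (launching instances), which is exactly where the extra recovery time comes from.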
Highest costs = Highest protection – “Multi-site active/active” scenario
Essentially, it’s an extension of the warm standby case. In this particular scenario, users are able to access workloads in all of the regions they are deployed in. Although it sounds amazing, it’s the most complex and most expensive scenario. But it has a huge benefit, as the recovery time is reduced to near zero for most disasters.
NOTE: I used the word “most” deliberately, because recovery from issues like data corruption generally relies on backups, which obviously introduce a delayed recovery point.
It’s really worth mentioning that in the active/active model there is no such thing as failover, simply because your workload is being served from more than one region. Does it mean that there shouldn’t be any regular disaster recovery tests planned? Not at all. To avoid situations where all work goes to waste, your team has to focus on how the workload reacts to the loss of a whole AWS region.
You have to ask yourself: is traffic properly routed away from the failed AWS region to another available (active) one that is meant to handle the rerouted traffic? By selecting the active/active scenario, you commit to maintaining a near-zero recovery time after a region outage and proper traffic rerouting. This is where most of your trials, improvements and time will have to be spent.
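One way to reason about such a test is to simulate the loss of a region and check where each client ends up. The sketch below is a toy model rather than a real routing setup: the region names, client names and latency numbers are invented, and the routing rule (lowest latency among healthy regions) is an assumption.

```python
def route_request(latency_ms: dict[str, float], healthy: set[str]) -> str:
    """Send the request to the healthy region with the lowest latency."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region left to serve traffic")
    return min(candidates, key=candidates.get)


def simulate_region_loss(
    clients: dict[str, dict[str, float]], regions: set[str], failed: str
) -> dict[str, str]:
    """Return where each client lands once `failed` is taken out of rotation."""
    healthy = regions - {failed}
    return {name: route_request(latencies, healthy) for name, latencies in clients.items()}


regions = {"eu-west-1", "us-east-1"}
clients = {
    "berlin": {"eu-west-1": 30.0, "us-east-1": 110.0},
    "new-york": {"eu-west-1": 90.0, "us-east-1": 15.0},
}
after_failure = simulate_region_loss(clients, regions, failed="eu-west-1")
```

A real test would then ask the harder question this model leaves out: whether the surviving region has the capacity to absorb everything that was rerouted to it.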
Even if you have a well-designed active/active infrastructure, you will still be exposed to the biggest risk there is: human error. We all make mistakes and nothing can change that, so instead of treating them as negligible, put tests designed to eliminate errors at the top of your priority list.
Don’t forget about human errors
Essentially, the vast majority of our customers are more concerned about human errors than about service or region outages. By human errors I mean, for example, unwanted deletion or modification of an S3 object, or “accidental” dropping of an RDS table. In the first case, you should protect yourself by enabling object versioning, which protects your data in S3 from the consequences of delete or modify actions. It works simply by retaining the original version of the object before the action. It’s worth adding that if you are using S3 replication (and you definitely should) to back the data up to your backup region, then by default, which is not that obvious, when an object is deleted in the source bucket, Amazon S3 adds a delete marker only in the source bucket. In other words, if you delete an S3 object from your bucket by accident, you’ll still have the backed-up object.
What should you look at more closely when using AWS?
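The behaviour described above can be illustrated with a small in-memory model. It is a toy, not the S3 API: the class and function names are invented, and replication is reduced to "copy the latest real version, never the delete marker".

```python
from typing import Optional

DELETE_MARKER = object()  # sentinel standing in for an S3 delete marker


class VersionedBucket:
    """Toy model of a versioned bucket: nothing is overwritten, versions pile up."""

    def __init__(self) -> None:
        self.versions: dict[str, list[object]] = {}

    def put(self, key: str, body: bytes) -> None:
        self.versions.setdefault(key, []).append(body)

    def delete(self, key: str) -> None:
        # A plain delete only adds a marker on top; older versions remain.
        self.versions.setdefault(key, []).append(DELETE_MARKER)

    def get(self, key: str) -> Optional[bytes]:
        history = self.versions.get(key, [])
        if not history or history[-1] is DELETE_MARKER:
            return None  # the object looks deleted to a normal read
        return history[-1]

    def previous_versions(self, key: str) -> list[object]:
        return [v for v in self.versions.get(key, []) if v is not DELETE_MARKER]


def replicate(source: VersionedBucket, destination: VersionedBucket, key: str) -> None:
    """Copy the latest real version; delete markers are not replicated here."""
    latest = source.get(key)
    if latest is not None:
        destination.put(key, latest)


source, backup = VersionedBucket(), VersionedBucket()
source.put("report.csv", b"v1")
replicate(source, backup, "report.csv")
source.delete("report.csv")  # the "accident"
replicate(source, backup, "report.csv")
```

After the accidental delete, the source bucket still holds the old version behind the marker and the backup bucket still serves the object. On a real bucket, versioning is switched on with a call such as boto3's `put_bucket_versioning` with `VersioningConfiguration={"Status": "Enabled"}`.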
We walked through the cases of building disaster recovery scenarios. Now, I’ll share some points that may serve as good topics for internal brainstorming:
- Follow the IaC (Infrastructure as Code) approach to keep changes auditable and to avoid configuration inconsistencies in multi-region deployments. Besides the well-known Terraform and CloudFormation, there’s a new player in town – CDK – that allows defining IaC with familiar programming languages. It’s a pretty cool tool, but some might regard it as a bleeding-edge solution.
- Amazon Route 53 supports geoproximity, failover and latency-based policies for routing your customers’ requests in multi-region deployments.
- Databases and their global features (DynamoDB global tables, Aurora global database): the selected scenario, active/passive (a/p) or active/active (a/a), has a direct impact on your read/write design:
  - with a/p – writes occur only in the primary region, and so do reads
  - with a/a:
    - reads – the common scenario is a local read, which means that data is served from the region closest to the customer
    - write global – writes are routed to a single region and, in case of failure, another region is promoted to primary – a good example is Aurora global database
    - write local – similarly to reads, writes are routed to the closest region – a good example is DynamoDB global tables
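The read/write patterns from the last point can be summarised as a routing decision. The sketch below is only a model of that decision: the enum, the function names and the "promote the first surviving region" rule are assumptions for illustration and not how any particular database chooses a new primary.

```python
from enum import Enum


class Scenario(Enum):
    ACTIVE_PASSIVE = "a/p"              # reads and writes both go to the primary
    WRITE_GLOBAL = "a/a, write global"  # local reads, writes to a single primary
    WRITE_LOCAL = "a/a, write local"    # local reads and local writes


def target_region(operation: str, scenario: Scenario, closest: str, primary: str) -> str:
    """Pick the region that should handle a 'read' or 'write' operation."""
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation}")
    if scenario is Scenario.ACTIVE_PASSIVE:
        return primary
    if operation == "read" or scenario is Scenario.WRITE_LOCAL:
        return closest
    return primary  # WRITE_GLOBAL: every write goes to the single primary


def promote(regions: list[str], failed_primary: str) -> str:
    """On primary failure, promote the first surviving region (illustrative rule)."""
    survivors = [region for region in regions if region != failed_primary]
    if not survivors:
        raise RuntimeError("no region left to promote")
    return survivors[0]
```

With `closest="eu-west-1"` and `primary="us-east-1"`, a write lands in `us-east-1` under write global and in `eu-west-1` under write local, which is the design choice that shapes conflict handling and write latency.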
Good intentions are not enough – it’s all about well-defined internal habits
Unfortunately, being aware of the danger and having an already provisioned backup region in any of the aforementioned scenarios won’t make you sleep better. The evolution of your cloud environment, through your teams’ internal development or changes in dependencies among application components, introduces new points which might not recover properly when an outage happens. Therefore, to build operational excellence you have to define and implement internal habits of periodic tests. Right now, I’m deliberately not using chaos engineering terminology, just to avoid opening a Pandora’s box for skeptics. All I’m saying is that when you work with a regularly changing cloud environment, you simply can’t expect successful failovers without well-defined procedures, internal culture and, last but not least, regular tests. Remember, disasters usually happen at an inconvenient time, with no warning. Invest your time; you won’t ever regret it.