For most people, chaos is a state of complete confusion and disorder, and dictionaries all around the world define it that way. So, does Chaos Engineering stand for a set of unexpected engineering tasks carried out without any order? There is only one answer to that question: no.
In our world, the technical world, where different businesses work and cooperate closely with each other, we could never do a thing like that. For me, Chaos Engineering means deliberately and proactively triggering points of failure or undesirable actions so that we can observe and analyze how a service behaves. It leads to a more stable product, not through root cause analysis made after incidents, but through experiments run in controlled conditions with a reduced blast radius.
Why is this important from the business side of things?
It’s not hard to imagine that behind our everyday routines there are many important systems. Let’s use a Seats Service as a first example. It decides which airplane seats are free, and which fares and passengers are assigned to which seats.
So, what may happen if this service freezes for a while? Will the passengers manage to buy their tickets at all? Will they get assigned to the wrong seats on the plane? Will the service assign air fares to only some percentage of the free seats?
All of the above could happen, but does not have to. If you do not perform any chaos tests, you won’t be able to determine the real impact of system or service inaccessibility on your business for quite a while, not to mention the real costs it entails. Incidents can generate colossal revenue losses for companies, and not only at the moment they happen. This is due to a long or unmeasurable MTTR (Mean Time To Repair).
If you have not determined specific behaviors for unexpected situations, you won’t be able to define procedures or set toolsets for quick service repairs.
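To make MTTR a bit more tangible, here is a minimal sketch in Python. The incident timestamps and the revenue-per-minute figure are made-up assumptions for illustration, and the loss estimate only counts direct downtime, which is a simplification.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average time between the start of an incident and its repair."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def estimated_loss(incidents, revenue_per_minute):
    """Rough direct loss: total downtime in minutes times revenue per minute."""
    downtime_minutes = sum(
        (end - start).total_seconds() / 60 for start, end in incidents
    )
    return downtime_minutes * revenue_per_minute


# Hypothetical incidents as (start, repaired) pairs.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 45)),
    (datetime(2024, 2, 9, 22, 30), datetime(2024, 2, 9, 23, 45)),
]

mttr = mean_time_to_repair(incidents)
loss = estimated_loss(incidents, revenue_per_minute=200)
```

If you cannot fill in the `incidents` list for your own service, that is exactly the "unmeasurable MTTR" problem.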
Another example? A more common one?
How about a Traffic Light Control System? You can find those in every big city in the world. This kind of service is responsible for controlling traffic lights, and therefore it regulates car traffic inside the city. Such a system is inseparably connected with the time every driver and passenger spends getting from point A to point B. Many of them are our customers or suppliers, and some of them are us.
So, what may happen if this service crashes for a while?
It may create many traffic jams. Two cars may collide as both drivers see green lights. But if the service properly identifies wrong behavior, it may also switch all lights off.
This example shows the importance of identifying different behaviors. Without it, we may lose not only money, but also lives.
5 Steps to perform Chaos Engineering in your company the mature way
Look at the big picture. Proactively check where behavior in unexpected situations has not been specified. A typical response you’ll hear: ‘We don’t know what exactly happens when the application gets some malformed packets.’
- Identify failures and build failure scenarios
Convert the response from Step 1 into more detailed scenarios. In this example, create a tool for generating malformed packets and define measurable metrics to check service behavior.
- Chaos Testing
Run the scenario in production, but with a reduced blast radius. Why in production? Only then will you have the same conditions as during a real incident. You must be careful, though. The test should be transparent to end users and should not generate unintended and unwanted losses.
- Gather results
Write a report, or get one from a toolset, and make plans for improvement. Or write down notes that will help fix a faulty application. Or design a procedure for operation teams with clearly specified “what if?” scenarios.
Finish the improvements planned in Step 4 and verify them by running the scenario from Step 2 one more time.
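For the malformed-packet example used in the steps above, a scenario could be sketched roughly like this. Everything here is an assumption made for illustration: the packet format, the `parse_packet` function standing in for the service under test, and the three metrics. A real scenario would target your own protocol and collect metrics from your monitoring.

```python
import random
import struct

# Hypothetical wire format: 2-byte payload length, 2-byte message type, payload.
HEADER = struct.Struct("!HH")


def build_packet(payload: bytes, msg_type: int = 1) -> bytes:
    return HEADER.pack(len(payload), msg_type) + payload


def parse_packet(data: bytes):
    """Stand-in for the service code under test."""
    if len(data) < HEADER.size:
        raise ValueError("packet shorter than header")
    length, msg_type = HEADER.unpack_from(data)
    payload = data[HEADER.size:]
    if length != len(payload):
        raise ValueError("declared length does not match payload")
    return msg_type, payload


def malform(packet: bytes, rng: random.Random) -> bytes:
    """Damage a valid packet by truncating it, flipping a byte or appending junk."""
    choice = rng.choice(["truncate", "flip", "extend"])
    if choice == "truncate":
        return packet[: rng.randrange(len(packet))]
    if choice == "flip":
        i = rng.randrange(len(packet))
        return packet[:i] + bytes([packet[i] ^ 0xFF]) + packet[i + 1:]
    junk = bytes(rng.randrange(256) for _ in range(rng.randint(1, 8)))
    return packet + junk


def run_scenario(samples: int, seed: int = 0) -> dict:
    """Send malformed packets and count how the parser reacts."""
    rng = random.Random(seed)
    metrics = {"accepted": 0, "rejected": 0, "crashed": 0}
    for _ in range(samples):
        packet = malform(build_packet(b"seat=12A"), rng)
        try:
            parse_packet(packet)
            metrics["accepted"] += 1  # damaged packet was not detected
        except ValueError:
            metrics["rejected"] += 1  # specified behavior
        except Exception:
            metrics["crashed"] += 1   # unspecified behavior
    return metrics
```

Running `run_scenario(200)` before and after a fix gives you the comparison Step 5 asks for. In this sketch, any "accepted" packets are a finding in themselves: the hypothetical format has no checksum, so a flipped payload byte goes unnoticed.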
Fine, but which toolset should I use for this? What’s the right way to perform these steps?
- Netflix Simian Army
- AWS Lambda Injection
- Traffic Control Toolset on EC2 (Old method)
All of these toolsets have some advantages and some disadvantages. Most often the downside is that the solution is expensive, has a big barrier to entry for common situations, or comes with some other limitations. I will tell you about all of them.
Chaos Engineering was started by Netflix Engineers. Knowing this, let’s review the set of principles drawn up by Netflix:
You can see a connection with my 5 steps to perform Chaos Engineering in your company the mature way, right?
Now, the basic toolset created by Netflix is called Chaos Monkey. So, let’s start with it.
What exactly is Chaos Monkey?
Chaos Monkey is a piece of software that was created in 2011 by Netflix, and later became part of a larger suite of programs called Simian Army, a collection of software tools designed to test the AWS infrastructure. The software is open source, allowing other cloud services users to adapt it for their own use.
More tools have been added to test different security and configuration issues.
The software simulates instance failures of services running in Auto Scaling Groups (ASG) by terminating one or more virtual machines.
The main and basic rule of Chaos Monkey is “the best way to elude major failures is to fail constantly…”
Unlike unexpected failures, which by definition occur randomly, often without warning and seemingly at the worst possible times, Chaos Monkey can be controlled. The software is opt-out by default, and it can also be configured for opt-in.
Chaos Monkey allows for simulated disturbances and failures to occur, so they can be analyzed and monitored.
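The idea can be illustrated with a small simulation. This is not Chaos Monkey’s code or configuration. The `AutoScalingGroup` class, the `opted_out` flag and the `probability` parameter are hypothetical names I use only to show the opt-out model and the random choice of a victim. A real tool would call the cloud provider’s API instead of removing items from a list.

```python
import random
from dataclasses import dataclass


@dataclass
class AutoScalingGroup:
    name: str
    instances: list
    opted_out: bool = False  # opt-out model: eligible unless stated otherwise


def pick_victims(groups, probability, rng):
    """Choose at most one instance per eligible group for termination."""
    victims = {}
    for group in groups:
        if group.opted_out or not group.instances:
            continue
        if rng.random() < probability:
            victims[group.name] = rng.choice(group.instances)
    return victims


def terminate(groups, victims):
    """Simulated termination: remove the chosen instance from its group."""
    for group in groups:
        if group.name in victims:
            group.instances.remove(victims[group.name])


groups = [
    AutoScalingGroup("seats-service", ["i-01", "i-02", "i-03"]),
    AutoScalingGroup("billing", ["i-10", "i-11"], opted_out=True),
]
victims = pick_victims(groups, probability=0.5, rng=random.Random(42))
terminate(groups, victims)
```

After a run, the interesting question is not which instance died, but whether the Auto Scaling Group replaced it and whether anyone outside the team noticed.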
Netflix engineers plan to add more monkeys to Simian Army. They are open to community suggestions.
Right, this toolset is adequate for some situations, but it still needs an additional EC2 instance or server for others. From my perspective, this always generates too much cost for a company. Nevertheless, it is a good tool for those who are starting their adventure with Chaos Engineering.
Another Netflix tool is, in fact, a set of several “Monkeys” called Netflix Simian Army.
Fault tolerance is key in cloud computing because 100% uptime is never guaranteed (anything can break at any time). Cloud architecture has to be designed with individual components in mind, so that when one of them fails, it won’t drag the whole system down with it!
Our weakest part should not dictate the performance of the whole infrastructure.
We can use techniques like graceful degradation on dependency failures, as well as node-, rack-, datacenter-/availability-zone-, and even region-level redundancy (servers located in different parts of the world).
Designing a fault-tolerant architecture is only part of the process. It also has to be tested, and this is what the members of Simian Army are for:
- Latency Monkey introduces artificial delays into RESTful client-server communication to simulate service degradation and measure how upstream services respond.
- Conformity Monkey looks for instances not following best practices and terminates them.
- Doctor Monkey checks the health of each instance and looks at other signs, such as CPU usage, to detect unhealthy instances.
- Janitor Monkey cleans unused resources out of the cloud environment.
- Security Monkey is an extension to Conformity Monkey. It checks for security violations and shuts the offending instances off.
- 10–18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects problems in instances serving customers in different countries, languages and character sets.
- Chaos Gorilla is like Chaos Monkey, but simulates unavailability of an entire Amazon availability zone.
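To give a feeling for what a tool like Latency Monkey does, here is a minimal latency-injection sketch. It is my own illustration, not Netflix code. The `inject_latency` decorator, its parameters and the `get_free_seats` function are hypothetical. The `sleep` argument is injectable only so that the sketch can be tried out without actually waiting.

```python
import functools
import random
import time


def inject_latency(delay_seconds, probability, rng=None, sleep=time.sleep):
    """Delay a call by delay_seconds with the given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_seconds)  # the artificial delay
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Record the delays instead of sleeping, so the effect is easy to inspect.
recorded_delays = []


@inject_latency(0.2, probability=1.0, sleep=recorded_delays.append)
def get_free_seats():
    return ["12A", "12B"]
```

With the default `sleep=time.sleep`, every affected call really takes longer, and you can watch how callers, timeouts and thread pools upstream react.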
Okay, at this moment we have a great toolset for many fault-injection scenarios, but we still need an additional EC2 instance or server to perform the task. That doesn’t change the fact that Netflix engineers have done great engineering work. Right now, only Kubernetes and containers are not supported. Having said that, let’s move to the latest Netflix toolset: ChAP.
ChAP – the newest member of Netflix’s toolset
At a high level, the platform interrogates the deployment pipeline for a user-specified service. After that, it launches experiment and control clusters of that service and sends traffic to each. A FIT (Failure Injection Testing) scenario is executed on the experiment cluster and the results are reported to the service owner.
“The best experiments do not disturb the customer experience!”
ChAP takes a small subset of traffic and distributes it evenly between the experiment clusters and the control clusters.
Some failure modes are only visible when the ratio of failures to total requests in a system crosses certain thresholds.
Because of load balancing and request routing, FIT requests are spread evenly throughout production capacity.
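The even split between experiment and control can be sketched as follows. I do not know how ChAP routes requests internally, so treat this purely as an assumption: the `route` function, the hashing of a request ID and the 2% default are made up for illustration.

```python
import hashlib


def route(request_id: str, experiment_fraction: float = 0.02) -> str:
    """Send a small, fixed share of traffic to experiment and control clusters.

    Half of the sampled share goes to "experiment" (faults injected) and half
    to "control" (no faults), so both groups see comparable traffic.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to a number between 0 and 1, roughly uniformly.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < experiment_fraction / 2:
        return "experiment"
    if bucket < experiment_fraction:
        return "control"
    return "production"
```

Hashing makes the decision repeatable: the same request ID always lands in the same group, which keeps the comparison between experiment and control clean.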
Some instances where critical thresholds are reached because of failing requests:
- When a downstream service is latent, thread pools get exhausted
- When a fallback is more computationally expensive than the happy path, CPU usage increases
- When errors lead to exceptions being logged, this may cause lock contention in your logging system and a self-inflicted DoS attack
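The thread pool case shows nicely why the ratio of failing requests matters. Below is a back-of-the-envelope calculation based on Little’s law (busy threads ≈ request rate × mean latency). It is a steady-state approximation under assumed numbers, not a model of any real service, and the function names are mine.

```python
def threads_needed(rate, normal_latency, slow_latency, slow_fraction):
    """Average number of busy threads when a share of requests is latent."""
    mean_latency = (
        (1 - slow_fraction) * normal_latency + slow_fraction * slow_latency
    )
    return rate * mean_latency


def exhaustion_threshold(rate, normal_latency, slow_latency, pool_size):
    """Smallest share of latent requests that fills the pool, or None."""
    base = rate * normal_latency   # demand when nothing is latent
    worst = rate * slow_latency    # demand when everything is latent
    if base >= pool_size:
        return 0.0
    if worst < pool_size:
        return None
    return (pool_size - base) / (worst - base)


# Assumed numbers: 200 requests/s, 50 ms normally, 2 s when the downstream
# service is latent, and a pool of 50 threads.
threshold = exhaustion_threshold(200, 0.05, 2.0, pool_size=50)
at_five_percent = threads_needed(200, 0.05, 2.0, slow_fraction=0.05)
```

With these numbers the pool is fine when 5% of requests are latent (about 29.5 busy threads), but it is exhausted once roughly 10% of them are. Below the threshold the problem is invisible, above it the service falls over.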
Okay, we have a great toolset for practically every imaginable fault scenario or injection, but all of these tools need an additional EC2 instance, a server or something else to perform the task.
And do we really need this? No. This is only one way of performing tests, as we can always opt for serverless…
There is one very important reason: Chaos Lambda is Chaos Monkey within Lambda, but without the need for any additional EC2 instance!
“Tools such as chaos-lambda by Shoreditch Ops look to replicate Netflix’s Chaos Monkey, but execute from inside a Lambda function instead of an EC2 instance – hence bringing you the cost saving and convenience Lambda offers.”
- Problem origins:
  - Modularity (unit of deployment) shifts from “services” to “functions”
  - It’s harder to secure a function than a service, which encapsulates a set of functionalities
  - Intermediary services (e.g. Kinesis, SNS, API Gateway, etc.) have their own failure modes
  - More configurations are available, which leads to more cases of misconfiguration
- Failure modes:
  - Improperly tuned timeouts can cause dependent services to time out as well
  - Missing error handling lets exceptions escape more easily
  - Missing fallbacks when a service is unavailable
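The last two failure modes can be tried out locally with a sketch like the one below. To be clear, this is not how chaos-lambda works. It is only an illustration of injecting a failure into a function and checking whether a fallback exists. The wrappers and handler names are hypothetical, and only the `handler(event, context)` signature follows the usual AWS Lambda convention for Python.

```python
import random


class InjectedFailure(Exception):
    """Raised on purpose to simulate an unavailable dependency."""


def with_chaos(handler, failure_rate, rng):
    """Wrap a handler so that a share of invocations fails artificially."""
    def wrapped(event, context):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated downstream failure")
        return handler(event, context)
    return wrapped


def with_fallback(handler, fallback):
    """Wrap a handler so that any exception is answered by a fallback."""
    def wrapped(event, context):
        try:
            return handler(event, context)
        except Exception:
            return fallback(event, context)
    return wrapped


def get_seats(event, context):
    return {"statusCode": 200, "seats": ["12A", "12B"]}


def cached_seats(event, context):
    # Degraded but useful answer when the real lookup fails.
    return {"statusCode": 200, "seats": [], "stale": True}


rng = random.Random(7)
unprotected = with_chaos(get_seats, failure_rate=1.0, rng=rng)
protected = with_fallback(unprotected, cached_seats)
```

Calling `unprotected({}, None)` lets the exception escape, which is the “missing error handling” case. Calling `protected({}, None)` returns the degraded answer instead.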
Okay, and what about other solutions, for example SaaS tools such as Gremlin? You can read a comparison with serverless (Lambda) in the next part, with code examples. But before that, I will show you my final guidelines about Chaos Engineering and Chaos Testing in your company:
- These are a must if you want to find defects in your software and avoid higher costs of fixing them during incidents
- Do not rush the implementation. You should start with small steps
- If you still think you could develop and release products without Chaos Engineering in your company, go back to the real-life examples above and imagine yourself in those situations.
Do you want to be a customer of such broken software? Thank you for reading my introduction to Chaos Engineering. Stay tuned for more.