September 25, 2021
February 26, 2020

Chaos engineering, the way to decrease revenue loss in your firm

Why is chaos engineering important? Here's why - along with the steps necessary to perform it in your company, and a useful tool set.

Wojciech Koroński

For most people chaos is a state of complete confusion and lack of order, and this definition exists in dictionaries all around the world. So, does Chaos Engineering stand for a set of unexpected engineering tasks carried out without any order? There is only one answer to that question – No.

In our world, the technical world, where different businesses work and cooperate closely with each other, we could never do a thing like that. For me, Chaos Engineering means creating opportunities to proactively trigger and analyze any point of failure or undesirable action in order to learn how a service behaves. Moreover, it leads to a more stable product, not through root cause analysis performed after incidents, but through experiments run in controlled conditions with a reduced blast radius.

Why is this important from the business side of things?

It’s not hard to imagine that behind our everyday routines there are many important systems. Let’s use a Seats Service as a first example. It decides which airplane seats are free, and which fares and passengers are assigned to the occupied and the remaining seats.

So, what may happen if this service freezes for a while?

Will the passengers manage to buy their tickets at all? Will they get assigned to the wrong seats on the plane? Will the service assign air fares to some percent of free seats?

Any of the above could happen, but none of it has to. If you do not perform any chaos tests, you won’t be able to determine the real impact of a system or service outage on your business for quite a while, not to mention the real costs it entails. Incidents generate colossal revenue losses for companies, and not only at the moment they happen: the losses keep growing because of a long or unmeasured MTTR (Mean Time To Repair).
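To make the MTTR argument concrete, here is a minimal sketch of the arithmetic. The incident durations and the revenue-per-minute figure are made-up illustrative numbers, and the function names are my own; a real estimate would also have to account for lost customers and reputation.

```python
from statistics import mean


def mean_time_to_repair(repair_minutes):
    """MTTR: average time from failure to restored service, in minutes."""
    if not repair_minutes:
        raise ValueError("need at least one incident")
    return mean(repair_minutes)


def estimated_revenue_loss(repair_minutes, revenue_per_minute):
    """Rough estimate: total downtime multiplied by revenue per minute."""
    return sum(repair_minutes) * revenue_per_minute


# Hypothetical incidents: minutes needed to restore the service each time.
incidents = [30, 90, 60]
mttr = mean_time_to_repair(incidents)
loss = estimated_revenue_loss(incidents, revenue_per_minute=500)
print(f"MTTR: {mttr} min, estimated loss: {loss}")
```

If you never measure repair times, the first function has no input at all, which is exactly the "unmeasured MTTR" problem.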

This example shows the importance of identifying different behaviors. Without it, we may lose not only money, but also lives.

5 Steps to perform chaos engineering in your company, the mature way

  1. Analysis
    Look at the big picture. Proactively look for unexpected situations in which the expected behavior has never been specified. A typical response you’ll hear: ‘We don’t know exactly what happens when the application receives malformed packets.’
  2. Identify failures and build failure scenarios
    Convert the response from Step 1 into more detailed scenarios. In this example, create a tool that generates malformed packets and define measurable metrics for checking service behavior.
  3. Chaos testing
    Run the scenario in production, but with a reduced blast radius. Why in production? Only then will you have the same conditions as during a real incident. You must be careful: the test should be transparent to end users and should not cause unintended losses.
  4. Gather results
    Write a report, or get one from your tool set, and make plans for improvement. Alternatively, write down notes that will help fix a faulty application, or design a procedure for operations teams with clearly specified “what if?” scenarios.
  5. Improvement
    Implement the plan from Step 4 and verify it by running the scenario from Step 2 one more time.
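The five steps above can be sketched as a tiny experiment harness. This is only an illustration under my own assumptions: `ToyService`, its error rates, and the `tolerance` value are hypothetical stand-ins for a real service and real metrics.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    scenario: str
    baseline: float
    observed: float
    passed: bool


def run_experiment(scenario, measure, inject, rollback, tolerance):
    """Run one failure scenario and compare a metric against its baseline.

    measure  -- returns the current value of the metric (e.g. error rate)
    inject   -- starts the failure for a limited blast radius
    rollback -- always called, so the failure never outlives the test
    """
    baseline = measure()
    inject()
    try:
        observed = measure()
    finally:
        rollback()
    passed = abs(observed - baseline) <= tolerance
    return ExperimentResult(scenario, baseline, observed, passed)


class ToyService:
    """Hypothetical service whose error rate jumps on malformed input."""

    def __init__(self):
        self.malformed_input = False

    def error_rate(self):
        return 0.20 if self.malformed_input else 0.01


service = ToyService()
result = run_experiment(
    scenario="malformed packets",
    measure=service.error_rate,
    inject=lambda: setattr(service, "malformed_input", True),
    rollback=lambda: setattr(service, "malformed_input", False),
    tolerance=0.05,
)
print(result)
```

A failed result feeds Step 4 (the report), and rerunning the same call after the fix is Step 5.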

Fine, but which tool set should I use for this? What’s the right way to perform these steps?

All of these tool sets and tools have advantages and drawbacks: most often a high cost, a high barrier to entry in common situations, or some other limitation. I will tell you about all of them.

Chaos Engineering was started by Netflix Engineers. Knowing this, let’s review the set of principles drawn up by Netflix:

You can see a connection with my 5 steps to perform Chaos Engineering in your company the mature way, right?

Now, the basic toolset created by Netflix is called Chaos Monkey. So, let’s start with it.

What exactly is Chaos Monkey?

Chaos Monkey is a piece of software that was created in 2011 by Netflix, and later became part of a larger suite of programs called the Simian Army, a collection of software tools designed to test AWS infrastructure. The software is open source, allowing users of other cloud services to adapt it for their own use.

More tools have been added to test different security and configuration issues.

The software simulates instance failures of services running in Auto Scaling Groups (ASGs) by shutting down one or more virtual machines.

The main and basic rule of Chaos Monkey is "the best way to elude major failures is to fail constantly".

Unlike unexpected failures, which by definition occur randomly, often without warning and seemingly at the worst possible times, the software is opt-out by default. It can also be configured for opt-in.

Chaos Monkey allows for simulated disturbances and failures to occur, so they can be analyzed and monitored.
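The core idea, picking a random instance in each group unless the group has opted out, can be sketched in a few lines. This is not Chaos Monkey's actual code; the group names, instance IDs, and the `pick_victims` function are hypothetical, and a real tool would terminate the chosen instances through the cloud provider's API.

```python
import random


def pick_victims(groups, opted_out=frozenset(), rng=random):
    """Pick at most one instance to terminate from each eligible group.

    groups    -- mapping of group name to a list of instance ids
    opted_out -- group names that must never be touched
    rng       -- source of randomness (injectable for repeatable runs)
    """
    victims = {}
    for name, instances in groups.items():
        if name in opted_out or not instances:
            continue
        victims[name] = rng.choice(instances)
    return victims


# Hypothetical Auto Scaling Groups and their instances.
groups = {
    "seats-service": ["i-aaa", "i-bbb", "i-ccc"],
    "billing": ["i-ddd"],
}
victims = pick_victims(groups, opted_out={"billing"}, rng=random.Random(42))
print(victims)
```

An opt-in variant would simply invert the check and skip every group that is not on an explicit allow list.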

Netflix engineers plan to add more monkeys to Simian Army. They are open for community suggestions.

Right, this tool set is adequate for some situations, but in others it still needs an additional EC2 instance or server. And from my perspective, this always generates too much cost for a company. Nevertheless, it is a good tool for those who are starting their adventure with Chaos Engineering.

Another Netflix tool is, in fact, a set of several “Monkeys” called the Netflix Simian Army.

Fault tolerance is key in cloud computing because 100% uptime is never guaranteed (anything can break at any time). Cloud architecture has to be designed with individual components in mind, so that when one of them fails, it won't drag the whole system down with it!

Our weakest part should not dictate the performance of the whole infrastructure.

We can use techniques like graceful degradation on dependency failures, as well as redundancy at the node, rack, data center/availability zone, and even regional level (servers located in different parts of the world).
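Graceful degradation on a dependency failure can be as simple as a wrapper that returns a reduced answer instead of an error. The sketch below is my own illustration: `recommended_seats`, `any_free_seats`, and the flight number are hypothetical.

```python
def with_fallback(primary, fallback, exceptions=(Exception,)):
    """Wrap `primary` so that a failure returns a degraded answer instead."""

    def wrapper(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except exceptions:
            return fallback(*args, **kwargs)

    return wrapper


def recommended_seats(flight):
    # Hypothetical dependency that is currently down.
    raise ConnectionError("recommendation service unavailable")


def any_free_seats(flight):
    # Degraded answer: a static list with no personalisation.
    return ["12A", "12B"]


get_seats = with_fallback(recommended_seats, any_free_seats, (ConnectionError,))
print(get_seats("LO123"))
```

A chaos test then verifies that the fallback path really works by breaking the dependency on purpose.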

Designing a fault tolerant architecture is only part of the process:

Okay, at this point we have a great tool set for many fault-injection scenarios, but we still need an additional EC2 instance or server to run it. That doesn’t change the fact that Netflix engineers have done great engineering work. Right now, only Kubernetes (K8s) and containers are not supported. Having said that, let’s move on to the latest Netflix tool set: ChAP.

ChAP - the newest member of Netflix’s tool set

At a high level, the platform checks the deployment pipeline for a user-oriented service. After that, it launches experiment and control clusters of that service and sends traffic to each. A FIT (Failure Injection Testing) scenario is executed on the experiment cluster, and the results are reported to the service owner.

“The best experiments do not disturb the customer experience!”

In ChAP, we direct a small subset of traffic and distribute it evenly between the experiment clusters and the control clusters.
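The traffic split can be illustrated with a stable hash over a request identifier. This is a sketch of the idea only, not how ChAP is implemented; the `assign_cluster` function, the bucket count, and the `req-…` identifiers are my assumptions.

```python
import hashlib
from collections import Counter


def assign_cluster(request_id, sample_percent=1.0):
    """Route a request to 'control', 'experiment', or normal 'production'.

    A stable hash keeps the same request id on the same path. Only
    `sample_percent` of traffic is sampled, and the sample is split
    evenly between the two small clusters.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    sampled = sample_percent * 100  # number of buckets in the sample
    if bucket >= sampled:
        return "production"
    return "control" if bucket < sampled / 2 else "experiment"


# Simulate 100,000 hypothetical requests with a 2% sample.
counts = Counter(
    assign_cluster(f"req-{i}", sample_percent=2.0) for i in range(100_000)
)
print(counts)
```

Because control and experiment receive the same kind and amount of traffic, any difference in their metrics can be attributed to the injected failure.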

Some failure modes are only visible when the ratio of failures to total requests in a system crosses certain thresholds.

Load balancing and request routing for FIT requests are evenly spread throughout our production capacity.

Some instances where critical thresholds are reached because of failing requests:

Okay, we have a great tool set for practically every imaginable fault scenario or injection, but all of these tools need an additional EC2 instance, server, or something else to perform the task.

And do we really need this? No. This is only one way of performing tests, as we can always opt for a serverless approach.


Chaos Lambda

Why Chaos Lambda?

There is one very important reason: Chaos Lambda is Chaos Monkey running within Lambda, without the need for any additional EC2 instance!

“Tools such as chaos-lambda by Shoreditch Ops look to replicate Netflix’s Chaos Monkey, but execute from inside a Lambda function instead of an EC2 instance - hence bringing you the cost saving and convenience Lambda offers.”
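To show what "Chaos Monkey inside a Lambda function" could look like, here is a minimal sketch. It is not the chaos-lambda source code; the function names and the `probability` parameter are my assumptions. The AWS calls themselves (`describe_auto_scaling_groups` paginator and `terminate_instances`) are standard boto3 client operations, and the clients are passed in so the logic can be tested without AWS access.

```python
import random


def terminate_random_instances(autoscaling, ec2, probability=1.0, rng=random):
    """Terminate at most one in-service instance per Auto Scaling Group.

    autoscaling, ec2 -- boto3-style clients supplied by the caller
    probability      -- chance (0..1) that a given group is hit in this run
    """
    terminated = []
    paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
    for page in paginator.paginate():
        for group in page["AutoScalingGroups"]:
            candidates = [
                instance["InstanceId"]
                for instance in group["Instances"]
                if instance["LifecycleState"] == "InService"
            ]
            if candidates and rng.random() < probability:
                terminated.append(rng.choice(candidates))
    if terminated:
        ec2.terminate_instances(InstanceIds=terminated)
    return terminated


def handler(event, context):
    """Lambda entry point, meant to be triggered on a schedule."""
    import boto3  # imported here so the logic above stays testable offline

    return terminate_random_instances(
        boto3.client("autoscaling"), boto3.client("ec2")
    )
```

The function's execution role would need permission to describe Auto Scaling Groups and terminate EC2 instances, and nothing else has to run between invocations.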

* Problem origins:

* Failure modules:

Okay, and what about other solutions, such as the SaaS tool Gremlin? In the next part, you can read a comparison with the serverless (Lambda) approach, with code examples. But before that, I will show you my final guidelines about Chaos Engineering and Chaos Testing in your firm:

Do you want to be a customer of such broken software? Thank you for reading my introduction to chaos engineering.

