Are you afraid of Chaos in your environment? “Taming chaos” is much more than just a fancy expression.
Ask yourself how many times you have experienced failures in your cloud or on-premise environments. How many of them had a crucial impact on the overall state of your application (we all want the time spent on improvements to serve our business applications), and how many went unnoticed or were invisible to other teams? The first category is less worrisome, because those failures are often fixed via small pull requests. So let’s skip them and jump straight to the ones you’re most afraid of.
Most problems are actually the result of unexpected system behaviour that surfaces when changes are injected. Literally: we push our code changes into automated pipelines, thinking that is enough to determine whether the whole system is going to work or not. Unfortunately, we very often lose sight of the scope of our changes and of the big picture. Being part of a particular team or project narrows our perception of the global view. Focused on a new release (and it doesn’t matter whether it’s the infrastructure or the application part), we forget that it’s just one piece of a bigger, more complex ecosystem. The one absolute at this point is that the system as a whole should make sense to you, even if subsections of it don’t have to. In short, the tests run in a CI/CD pipeline are typically there to determine whether a property is true or false, not to generate new knowledge about the system as a whole. The key to success is to find ways to explore new areas, and this is where chaos engineering comes in.
The proliferation of cloud services has taken the way we release application updates to a completely new level. It has also demanded changes in internal cross-team communication. Before what we might call the DevOps approach, we had (and some still have) a terrible conflict over how quickly software can be released into prod. On one side, developers see a release as just the launch of a new feature and approach it along the lines of “let’s launch anything, any time, without being scared of post-mortems”. On the other side, the Ops team wants to be sure that services don’t break during their on-call shift: “we wouldn’t like to change anything once it works”. That, in my opinion, is why the era of SRE eventually arrived. These engineers understand software engineering, and they have the necessary skills in networking, operating systems and the requirements of highly available infrastructure. The main difference from a typical sysadmin is the ability to understand the global view and the indispensable goals, coming from the application, that have to be implemented in the infrastructure. Another, equally important, is that they have few or no doubts about chaos challenges, because they treat them as a mandatory part of building confidence in the system.
Treat experiments as a discipline that improves complex systems. First of all, ask yourself “What if?”: “What if I simulate the failure of an entire datacentre in a cloud region?”, “What if I inject latency between two of my microservices for a percentage of traffic or for a predefined time?” It doesn’t matter what type of question you ask. Treat it as a game in which you challenge yourself and your team to see whether you’re aware of all the possibilities. If you’re not sure, or cannot answer all of the questions (they will differ depending on the application), then it’s time to consider ways of finding out. Exposing unknown weaknesses in your production system is cool, but keep in mind that if you are already certain that a chaos experiment will lead to a significant problem with the system, there’s no sense in running it. You have to be sure that the current state is a steady one and that you’re aware of the upcoming steps. If not, do not play with chaos engineering, because the only thing you’ll generate is problems.
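To make the latency “What if?” a bit more concrete, here is a minimal Python sketch of delaying a chosen percentage of calls between two services. Everything in it is an assumption made for illustration: the function name with_injected_latency, its parameters and the idea of wrapping the call in-process are hypothetical, and a real setup would more likely inject the delay at the proxy or network layer.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_injected_latency(
    call: Callable[[], T],
    delay_seconds: float,
    percentage: float,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, delaying it first for roughly `percentage` percent of invocations.

    Hypothetical helper: `rng` and `sleep` are injectable only so the
    behaviour can be checked without real waiting.
    """
    if not 0.0 <= percentage <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    rng = rng if rng is not None else random.Random()
    # rng.random() is in [0, 1), so 100 always delays and 0 never does.
    if rng.random() * 100.0 < percentage:
        sleep(delay_seconds)
    return call()
```

Wrapping a downstream call as with_injected_latency(lambda: fetch(), 0.3, 10.0), where fetch stands for whatever client call you already have, would slow down about one request in ten by 300 ms.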
Without visibility into your system’s behaviour, you won’t be able to draw conclusions from any planned experiment. Define the level of satisfaction that allows you to move further and start injecting your “army of failures”. A monitoring tool is like the coach of a football team: it watches everything and can notify you in case of a “player’s injury”. It is going to be your eye on the whole process and your summary afterwards.
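One way to pin down such a “level of satisfaction” is to write it as explicit thresholds that your monitoring data either meets or doesn’t. The sketch below assumes two metrics, error rate and p99 latency; the names SteadyState and is_steady and both thresholds are hypothetical, and the right metrics will depend on your application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds describing an acceptable system state."""

    max_error_rate: float      # fraction of failed requests, e.g. 0.01 for 1%
    max_p99_latency_ms: float  # 99th percentile latency in milliseconds


def is_steady(errors: int, requests: int, p99_latency_ms: float,
              limits: SteadyState) -> bool:
    """Return True when the observed metrics stay within the limits."""
    if requests == 0:
        # No traffic means no evidence, so do not claim a steady state.
        return False
    error_rate = errors / requests
    return (error_rate <= limits.max_error_rate
            and p99_latency_ms <= limits.max_p99_latency_ms)
```

The numbers fed into is_steady would come from whatever monitoring tool you already run; the point is only that “steady” becomes a yes-or-no answer before any failure is injected.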
Chaos engineering must not be associated with something you do randomly. Everyone in your company has to be introduced to its principles, otherwise it is meaningless. Acknowledge the fact that every experiment has to be planned. There’s no place for doubts. You’ll understand how the system reacts under different circumstances by running experiments, provided you and your colleagues have prepared rollback procedures. Then you push your system beyond some well-known limits and observe what happens. The important thing is to follow a systematic approach in order to maximize the outcome. Each time you think about an experiment, have a hypothesis in mind about what you believe the outcome will be.
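The planned, systematic approach described above can be sketched as a small experiment runner: check the steady state, inject the failure, check again, and always roll back. This is an illustrative outline under my own assumptions, not a real chaos tool; run_experiment and its three callbacks are hypothetical names.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one planned experiment and report what happened to the hypothesis.

    The hypothesis is that the steady state still holds after `inject`.
    """
    if not steady_state():
        # Never start an experiment on a system that is already unhealthy.
        return "aborted"
    try:
        inject()
        hypothesis_holds = steady_state()
    finally:
        # The prepared rollback runs even if injection or the check raises.
        rollback()
    return "confirmed" if hypothesis_holds else "refuted"
```

A “refuted” result is the interesting one: it is the new knowledge about the system that the experiment was run to find.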
Hypotheses in experiments usually take the form: “the events we are injecting into the system will not cause a change in the steady state”. We might fail different types of services, critical or noncritical. Take, for example, a noncritical service that generates a personalized list of items for the user. If it fails, the system should return a default list. In the end we’re improving both the user’s experience and the system’s resiliency.
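The personalized list example might look roughly like this in Python. The names DEFAULT_ITEMS, items_for_user and fetch_personalized are hypothetical, and the default items are placeholder values; the sketch only shows the fallback behaviour a chaos experiment would try to confirm.

```python
from typing import Callable, List

# Placeholder default list, assumed for illustration only.
DEFAULT_ITEMS = ["bestseller-1", "bestseller-2", "bestseller-3"]


def items_for_user(
    user_id: str,
    fetch_personalized: Callable[[str], List[str]],
) -> List[str]:
    """Return personalized items, or the default list if the service fails."""
    try:
        return fetch_personalized(user_id)
    except Exception:
        # The personalization service is noncritical: degrade, do not fail.
        return list(DEFAULT_ITEMS)
```

An experiment that fails the personalization service should then leave the steady state unchanged, with users simply seeing the default list.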
One year ago, I set up a start-up to help other companies automate their activities, but even more important was helping them find opportunities to gain new knowledge about their environments and understand which areas haven’t been covered yet. It’s a long-term process, considered unnecessary by the majority, but I do believe that “taming chaos” is much more than just a fancy expression. Over the past years we’ve been moving our assets to the cloud, forgetting that the rules remain intact. Clouds shouldn’t dull your senses, and you shouldn’t treat them as out-of-the-box goodies. Challenge your environments, no matter whether they are on-prem or fancy cloud ones, otherwise you’ll end up in a state of anxiety. More Chaos stuff coming up, so stay tuned, Legions!