In the last few years a lot has happened in the field of Chaos Engineering. From the originally radical new approach of intentionally causing errors and failures, a large community has emerged. The principles have been refined and the popularity of Chaos Engineering is growing.
I don't want to reintroduce the topic but rather show why it is important, how we are currently dealing with it, and why Chaos Engineering is just the beginning. Our journey in the world of distributed and dynamic systems has only just begun. There is still a lot to do, and the tasks in our daily work routine are only increasing. Why is the hurdle to Chaos Engineering still so high? What blocks us, and what is missing?
Chaos Engineering first really came into vogue through Netflix’s engineering tools. The motivation was the move from a monolith operated on its own hardware to a microservice-driven cloud infrastructure, and the need to better protect against failures.
The tools used at first were not precise operations tools but simple scripts or commands run from the shell. The ever-increasing complexity and automation at the infrastructure level made it more and more difficult to target individual virtual machines. Where once it was enough to run a few servers, there were now hundreds of virtual machines that were started and stopped fully automatically.
This led to the next evolutionary step in the tools used to perform experiments. The Chaos Monkey was born and shut down virtual machines at random. With its help, failures in production could now be triggered much more frequently and randomly. The improvements brought about by this approach were noticeable to the teams and, more importantly, to the end users. The software and systems were now able to catch and handle errors.
But this is not where the journey ended. The Chaos Engineering team at Netflix quickly realized the potential behind the approach: running Chaos Experiments was to be made available to all teams so that everyone could benefit from it. Out of this came, among other things, the idea of replacing the Monkeys with something that allows more control and can deliver more insights. Letting systems fail at random gave way to a more targeted and planned approach.
ChAP was born: the first Chaos Automation Platform, which provided important insights and visibility into where teams needed to focus their efforts and which services had a high potential for failure.
Those who deal with the topic of Chaos Engineering today can draw on the knowledge of Netflix’s original approach. The community that has emerged has made great progress and built up an enormous amount of knowledge. Nevertheless, I have come to realize that after the initial euphoria, disillusionment quickly sets in during the first experiments. The principles of Chaos Engineering are quickly understood and the first simple experiments are carried out quickly, but it is not as simple as the tutorials suggest. Developers, SREs and Ops have a very busy daily routine; their work is like a rehearsed symphony and must not get out of sync. The beat and structure of the working day are determined by many factors: full backlogs, priority 1 incidents, users that challenge the systems and, last but not least, important new features that have to be rolled out. The systems seem almost fragile, and each of us knows the one corner of the system that is only held together by duct tape.
It is getting even more challenging, because it is clear to many that the old patterns of testing and quality control are reaching their limits. Today's systems, and the demands on their stability and reliability, continue to grow, and the knowledge gained through Chaos Engineering helps us meet these challenges. But is this alone enough?
Many tools and products in the field of Chaos Engineering are nothing more than a hammer. The use of this hammer always follows the same pattern and unfortunately quickly loses sight of the state of the system as a whole.
1. Choose a hammer from the toolbox
2. Use the hammer and hit a part of your complex system
3. Verify if the hammer worked there
Selecting an attack from the kit and observing the place where the attack was executed is no big deal, and everyone will start that way. The first thing I want to understand is whether the Chaos Engineering tool I'm using does what it's supposed to do. But does that alone improve my system and protect me from failures? No! Today's systems are very complex and sometimes already bring along features that protect against failures (as Kubernetes does). There is much more to it than that: we want to learn something, improve our everyday work and protect ourselves against failures.
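To make the difference concrete, here is a minimal Python sketch of an experiment that does not stop at "did the hammer hit?" but also checks whether the system as a whole stayed healthy. Everything in it is an assumption for illustration: the names run_experiment and ExperimentResult, the toy replicas dictionary and the "checkout" service are hypothetical and not taken from any particular Chaos Engineering tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool     # was the system healthy before the attack?
    attack_effective: bool  # did the hammer actually hit its target?
    steady_during: bool     # did the system as a whole stay healthy?

    @property
    def tolerated(self) -> bool:
        # Only a real, effective attack on a healthy system tells us something.
        return self.steady_before and self.attack_effective and self.steady_during


def run_experiment(
    steady_state: Callable[[], bool],
    attack: Callable[[], None],
    attack_effective: Callable[[], bool],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Do not attack a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(False, False, False)
    attack()
    try:
        # Check the attacked spot AND the overall system state.
        return ExperimentResult(True, attack_effective(), steady_state())
    finally:
        # Always undo the attack, even if a check raises.
        rollback()


# Hypothetical toy system: two replicas of a checkout service.
replicas = {"checkout-1": True, "checkout-2": True}

result = run_experiment(
    steady_state=lambda: any(replicas.values()),  # at least one replica serves traffic
    attack=lambda: replicas.update({"checkout-1": False}),
    attack_effective=lambda: not replicas["checkout-1"],
    rollback=lambda: replicas.update({"checkout-1": True}),
)
print(result.tolerated)
```

In a real system the steady-state check would be a business-level signal, such as a success rate or latency measured from the user's point of view, rather than a lookup in a dictionary. The point of the sketch is only the shape: the verification step looks at the whole system, not just at the spot the hammer hit.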
Just as the days of developers, SREs and Ops run like a rehearsed symphony, the functioning of today's complex systems can be compared to an opera with a full orchestra, numerous actors, and a conductor.
To keep up with competitive pressure and changing customer expectations, ever shorter release cycles have become necessary. Being able to adapt to the market faster than the competition is a distinct advantage. The frequent changes that come with this pace make it seem impossible to test everything.
What can we do and what is a possible way out?
I'm convinced that Chaos Engineering is just the beginning and that much more will develop from this original idea in the years to come. With Chaos Engineering alone, we will not be able to cope with the challenges of today's distributed and complex systems.
We have to create a culture of resilience and work together more collaboratively. The knowledge that exists in people's heads must be made accessible to others. Unfortunately, simply talking about it is not enough. We need suitable tools that remind us to deal with the risks on a recurring basis. A tool that does the analytical work for us and supports us in assessing the risks, without getting in our way, would be the right step in my eyes. The final decision must always remain with the team and those responsible. They have the experience with their system and know their use cases and, more importantly, their customers.
The balance between system reliability and development velocity or delivery performance needs to be restored. Often technical speed wins, and we try to slay problems with more technology instead of dealing with the underlying problem.
When I talk to people about resilience, I notice that it means different things to different people. A developer will usually be application and service oriented, whereas an SRE or Ops engineer naturally has a different view of the topic and the system.
Together, we must be able to learn from mistakes more easily and continuously. Mistakes are part of life. Blindly and indiscriminately improving everything now and making even the last small service highly available is not the right decision and leads to even more unnecessary complexity. We need a guide through our complex systems that helps us better assess risk. An overview of the current resilience, the vulnerabilities and the impact they can have is necessary to set the right priorities. Today the impact of failures can be mitigated, and we as developers, SREs and Ops can all learn to get consistently better at this.
This is the task and goal we have set for ourselves at steadybit and I am happy to keep you up to date.