If you are also wondering how to shift Resilience and Chaos Engineering left to developers, you are reading the right article.
Wow, what a ride. We are now in our third year and have learned a lot. We have been heads-down building the product, interviewing many people, and talking to many users. And we have continuously improved it, putting everything we know about Resilience and Chaos Engineering into the product.
However, we had one crucial insight along the way: Chaos Engineering is great, but without additional tooling, it’s primarily valuable for experienced engineers who understand their systems very well. Today, I’m very proud to introduce you to an important milestone for our product. But before we dive deeper into that, let’s take a small step back: Which market trends have influenced our product decisions over the last months?
Who are the developers we see as responsible for service reliability and resilience? The answer is simple: every developer, although their roles and concrete competencies differ. They all have one thing in common: they hate production incidents.
But we need to keep in mind that a junior developer may not be able to create a resilience design and enforce it on their own. They may only be informed about issues during the software delivery process, when a problem related to a questionable resilience design is found in their code. A DevOps engineer is more likely to think about these specific challenges, since they have an overview of the entire workflow. The SRE carries the final responsibility for the reliability of the production system and is also interested in consistent, automated processes that significantly improve the stability of the software.
Part of the answer is also that the topic is very complex, and methods like Chaos and Resilience Engineering are not immediately familiar to everyone. That’s why we built Steadybit: our goal is to support each of these developers in the best possible way. We also believe that people are essential in creating an authentic Culture of Resilience.
We love to automate things so we can concentrate on our daily business and not get distracted by production incidents. However, we learned that the best way to avoid them is to start very early in the process (“Shift Left”).
We’ve also learned: a developer has no time to run chaos experiments. This does not mean that chaos experiments are flawed; it is about limiting the cognitive load on the developer. Patterns such as Retry, Circuit Breaker, and Rolling Update are well known. Still, they are often implemented only in later stages of the software development cycle, or even only when a failure or incident occurs.
As a result, failures and bugs in individual subcomponents often lead to cascading effects that could be avoided. Steadybit provides an answer to that: we make existing resilience issues transparent at an early stage, accompany the developer toward a solution, and, if desired, also fix these issues. The goal is always to create little or no additional cognitive load for the developer and to keep them focused on their actual task: velocity – the implementation of features.
As Ben already mentioned in his post: “Moving from experiments to expectations means that many complicated facets of Chaos Engineering can be avoided, and Resilience Engineering be democratized.”
Expert knowledge is no longer required to operate systems stably and resiliently.
We believe it makes more sense to start with the expectation instead of just breaking things. And of course: the developer is responsible for the applications and takes care of them. I’m all the more pleased to present the first milestone of our new feature, “Resilience Policies,” to you today. We are pursuing three fundamental goals with this feature:
Of course, we use our CLI:
npm install -g steadybit
The connection to the Steadybit platform is made quickly:
➜ steadybit config profile add
? Profile name: steadybit
? API access token: [hidden]
? Base URL of the Steadybit server: https://platform.steadybit.io
If you don’t have an account yet, sign up quickly here: https://www.steadybit.com/get-started/…
As we continuously improve every part of our product, some things have changed since the last post. For example, you can now directly define the policies that should apply to your service.
steadybit service init
This selection then results in a service definition with two resilience policies:
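The generated service definition itself is not reproduced in this post. Just to give you an idea, it could look roughly like this – a minimal sketch in which the exact schema and the second policy name are our assumptions, not Steadybit’s actual format:

# service definition – minimal sketch; schema and second policy name are assumptions
name: fashion
policies:
  - recovery-pod     # validates that killed pods recover within a time budget
  - rollout-ready    # hypothetical placeholder for the second policy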
The basic model is that one policy always consists of 1:n tasks. Let’s have a deeper look at the policy recovery-pod:
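To sketch that 1:n relationship for recovery-pod (the task name recovery-of-single-container appears below; the second task and the surrounding keys are purely illustrative):

# recovery-pod – structural sketch of one policy with its tasks; keys are illustrative
name: recovery-pod
tasks:
  - recovery-of-single-container   # a killed container must be replaced in time
  - recovery-of-multiple-pods      # hypothetical sibling task for illustration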
A task always represents a ready-to-use experiment or a weak-spot validation that needs to be executed. However, the “experiment” behaves more like a test with fixed input and output states. An essential part of this is continuously checking whether the experiment or test was successful. For the task recovery-of-single-container, the affected pod should be replaced within a given time, ending with the Kubernetes Deployment back in a ready state. If this is not the case, the task is declared failed. Steadybit checks that internally with the following experiment (minimalistic example – see here for the complete code):
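The embedded example is not reproduced here. Conceptually, the check kills a single container and then verifies that the Kubernetes Deployment reports ready again within the waiting time. A hypothetical sketch, not Steadybit’s actual experiment schema:

# recovery-of-single-container – conceptual sketch; all keys are illustrative
steps:
  - attack: kill-container      # stop one container of the targeted pod
  - wait: 120s                  # default waiting time of 2 minutes (see below)
  - check: deployment-ready     # fail the task if the Deployment is not ready again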
In the end, it’s all about continuously measuring the MTTR (mean time to recovery) of a single service: How long does it take you to get everything fully operational again?
In the next step, the CLI synchronizes the service definition with the platform. You can think of it like what you usually do with Kubernetes manifests (kubectl apply):
steadybit service apply
The following command executes the defined tasks resulting from the policies and displays a report in the console.
steadybit service verify
Optionally, you can jump directly into the platform UI:
steadybit service open
Here you can see all assigned tasks of the fashion service in the right sidebar:
So, in this case, 40% of the assigned tasks are already completed. But the recovery-pod task in particular seems to be failing. Let’s take a deeper look at that issue:
Some fashion pods are not ready again within 2 minutes (the default waiting time). So we have found a potential issue, because a long recovery time can lead to reliability problems, especially when you deploy a few times a day. Maybe this predefined policy does not suit your needs? Let’s move on to the next chapter.
Developers love customizing things to their own needs. We feel you. That’s why all policies and tasks that Steadybit currently ships are publicly stored in this GitHub repository and can be adapted or customized to your own needs. Feel free to submit a PR so other people can benefit as well.
And of course, since we as developers always want to automate everything to make our daily work easier and reduce avoidable cognitive load, we also offer various GitHub Actions to facilitate interaction with the policies and tasks:
jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: steadybit/service-verify@v1
        with:
          apiAccessToken: ${{ secrets.STEADYBIT_TOKEN }}
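Note that a workflow file also needs a name and a trigger before it runs. Embedded in a complete workflow, it could look like this – the name and the push trigger are our choice for illustration:

name: resilience-verify
on:
  push:
    branches: [main]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: steadybit/service-verify@v1
        with:
          apiAccessToken: ${{ secrets.STEADYBIT_TOKEN }}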
There is so much more to come – unfortunately, it doesn’t all fit in one blog post. I promise you can look forward to many more features that improve your service reliability. But we are also still learning, and that’s why we need you and your feedback! We would be thrilled if you tried it out and got in touch with us. Don’t worry; you don’t need to go through a sales call 😉 Just you and the awesome folks from Steadybit. We are developers, too, and we feel you.
We can solve the problems that hurt by working together in the community. So we are excited to build the best resilience product in the world with you!
Start now: https://www.steadybit.com/get-started/