What is resilience testing with Steadybit?

Resilience testing with Steadybit involves evaluating the robustness of systems by simulating failures and unexpected conditions to ensure they can recover effectively. This method helps organizations identify weaknesses and improve their overall system reliability.

How are rolling updates incorporated into resilience testing?

Rolling updates deploy changes incrementally without taking the entire system offline, which makes them a natural target for resilience testing. By exercising these updates during internal game days and in automated resilience tests, teams can discover regressions in a controlled way and verify that deployments stay free of interruptions.

What are internal game days?

Internal game days are organized events where teams simulate various failure scenarios within their systems. These exercises help identify vulnerabilities and assess how well systems can withstand disruptions, ultimately enhancing overall resilience.

Why is resilience testing important in modern systems?

With the complexity of modern systems, resilience testing is essential to ensure that applications can handle failures gracefully. It builds confidence among developers and stakeholders that systems will remain operational under adverse conditions, reducing downtime and improving user experience.

What conclusions can be drawn from resilience testing?

Conclusions from resilience testing often highlight areas for improvement within a system's architecture and operational procedures. They provide insights into how well a system can recover from failures and inform strategies for enhancing its overall reliability.

How does Steadybit facilitate resilience testing?

Steadybit facilitates resilience testing by providing tools that allow teams to simulate various failure scenarios easily. This enables organizations to proactively identify weaknesses in their systems, test recovery processes, and ultimately enhance their operational resilience.

Continuous Verification with Steadybit: Boost Resilience

10.10.2022 Ben Blackmore - 5 min read

This article will look closely at continuous verification with Steadybit through resilience testing and how it helped us internally.

In articles and books, you find many resources about chaos engineering, specifically its experimental nature. Suppose you have gained knowledge and insights into your system through such experiments: how do you translate them into confidence? We believe that you can only build confidence over time through continuous learning (significant and relevant knowledge requires continuous investment) and continuous verification (what once was true doesn’t have to be true in the future, so we need to correct the verification, our understanding, or both).

Resilience Testing with Steadybit

Within the wider industry, we apply several testing methodologies for continuous verification. For example, unit, integration, end-to-end and performance/load testing are cornerstones of software engineering practices. These methodologies help us gain confidence and improve reliability in many aspects of our systems. However, these tests are typically either not executed within deployed environments (unit and integration tests) or not executed under turbulent conditions (end-to-end and performance/load tests).

Through resilience testing, we introduce turbulent conditions into existing tests (for example, in k6 load tests or integration tests) or via dedicated resilience tests.

Within Steadybit, you can turn a combination of attacks, checks/probes, load tests and arbitrary actions (through ActionKit) into an experiment. There are no restrictions on the number of steps within an experiment or its complexity. As a result, you can use Steadybit’s experiment capability to implement resilience tests.

Resilience Testing Rolling Updates

As part of our recurring internal game days, we discovered that deployments of our product no longer completed without short interruptions for our customers. We used a Kubernetes deployment with sufficient replicas and a rolling update strategy, but something was wrong. In the next iteration, we investigated this regression to identify and fix the cause, which turned out to be an AWS ALB misconfiguration related to sticky sessions.

Let us pause here: How would you write a test to verify your expected rolling update strategy? How do you ensure that rolling updates continue to work? Feel free to let us know via Twitter through @SteadybitHQ, or any other channel you like!

To ensure that the rolling update strategy continues to work in the future, we leveraged Steadybit to write a resilience test.

The image above shows part of Steadybit’s experiment designer. Analogous to a movie editor’s timeline view, you can leverage the experiment designer to execute steps concurrently and sequentially.
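To make the timeline idea concrete outside of Steadybit, here is a minimal Python sketch of one action running while a check is polled concurrently, framed by a check before and a check after. It is not Steadybit's implementation or API; the names run_with_invariant, action and check are hypothetical and chosen only for illustration.

```python
import threading
import time
from typing import Callable, List


def run_with_invariant(
    action: Callable[[], None],
    check: Callable[[], bool],
    interval_s: float = 0.01,
) -> List[bool]:
    """Run `action` while polling `check`; return every check result in order."""
    # Pre-condition: evaluated once before the action starts.
    results: List[bool] = [bool(check())]
    done = threading.Event()

    def poll() -> None:
        # Invariant: keep checking until the action has finished.
        while not done.is_set():
            results.append(bool(check()))
            done.wait(interval_s)

    poller = threading.Thread(target=poll)
    poller.start()
    try:
        action()
    finally:
        # Stop the poller even if the action raised.
        done.set()
        poller.join()
    # Post-condition: evaluated once after the action finished.
    results.append(bool(check()))
    return results


if __name__ == "__main__":
    # Hypothetical stand-ins: a 50 ms "action" and a check that always passes.
    outcome = run_with_invariant(lambda: time.sleep(0.05), lambda: True)
    print("all checks passed:", all(outcome))
```

A resilience test built this way fails as soon as any entry in the returned list is False, which mirrors the idea of checks that must hold before, during and after the system modification.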

The green box represents the system modification we are executing. In this case, it is a simulated rolling update, triggered by the kubectl rollout restart command that Steadybit injects. As mentioned above, we learned during our last game day that the system misbehaved during rolling updates. Therefore, the rolling update is at the core of the experiment.
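For readers who want to try the same modification by hand, the following Python sketch wraps kubectl rollout restart and kubectl rollout status. It is an assumption-laden illustration rather than how Steadybit runs the command: the deployment name "platform", the namespace "prod" and the function names are hypothetical, and kubectl is assumed to be installed and pointed at the right cluster.

```python
import subprocess
from typing import List


def restart_command(deployment: str, namespace: str) -> List[str]:
    """Build the kubectl command that triggers a rolling restart."""
    return [
        "kubectl", "rollout", "restart",
        f"deployment/{deployment}",
        "--namespace", namespace,
    ]


def status_command(deployment: str, namespace: str, timeout: str = "120s") -> List[str]:
    """Build the kubectl command that waits for the rollout to finish."""
    return [
        "kubectl", "rollout", "status",
        f"deployment/{deployment}",
        "--namespace", namespace,
        f"--timeout={timeout}",
    ]


def rolling_restart(deployment: str, namespace: str) -> bool:
    """Trigger a rolling restart and report whether it completed in time."""
    # check=True raises CalledProcessError if the restart itself is rejected.
    subprocess.run(restart_command(deployment, namespace), check=True)
    # `kubectl rollout status` exits non-zero if the rollout fails or times out.
    finished = subprocess.run(status_command(deployment, namespace))
    return finished.returncode == 0


if __name__ == "__main__":
    # Hypothetical names; replace them with a real deployment and namespace.
    print(" ".join(restart_command("platform", "prod")))
```

On its own this only tells you that the rollout finished. Whether customers saw interruptions along the way is what the accompanying checks have to establish.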

The yellow boxes depict invariants, preconditions and postconditions that need to be maintained. They represent checks that are verified for a certain amount of time. These checks verify how we want the system to behave, and they result directly from the observations of the last game day. In other words, we turned observations from a game day into checks for continuous verification. Consequently, they give us fast, repeatable and cheap verification that doubles as documentation. To give you an overview, these are the types of checks we leverage:

No pending rollouts: This check is implemented by our Kubernetes extension. It internally leverages kubectl rollout status.
All pods ready: This check is bundled within Steadybit’s agents and compares the ready pod count to the desired pod count.
No degraded synthetic checks: We leverage Checkly for synthetic checks (pings/HTTP calls). Checkly exposes a Prometheus scraping endpoint through which we turn synthetic check executions into metrics. We leverage a PromQL query (checkly_check_status{tags=~"production,.*"}) with the help of our Prometheus extension to check that there are no failed synthetic checks.
API calls are successful: We also execute HTTP API calls against our system to verify that it is continually available. This HTTP check is a native capability of Steadybit’s agents.
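Two of the checks above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the code inside Steadybit's agents or extensions: the function names are hypothetical, the Prometheus base URL is a placeholder, and the sketch assumes that a sample value of 1 means "passing" for checkly_check_status, which should be confirmed against the Checkly exporter version in use. The response shape follows the Prometheus HTTP API for instant queries.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

# PromQL query from the article; PromQL requires straight double quotes.
QUERY = 'checkly_check_status{tags=~"production,.*"}'


def all_pods_ready(ready: int, desired: int) -> bool:
    """Compare ready and desired pod counts.

    Assumption: a surplus of ready pods (possible while a rollout surges)
    still counts as ready.
    """
    return ready >= desired


def failing_series(
    response: Dict[str, Any], passing_value: float = 1.0
) -> List[Dict[str, str]]:
    """Return the label sets of all series whose sample is not `passing_value`."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    failing = []
    for series in response["data"]["result"]:
        # Instant-vector samples are [timestamp, "value-as-string"] pairs.
        _timestamp, raw_value = series["value"]
        if float(raw_value) != passing_value:
            failing.append(series["metric"])
    return failing


def query_prometheus(base_url: str, promql: str, timeout_s: float = 5.0) -> Dict[str, Any]:
    """Run an instant query against the Prometheus HTTP API (/api/v1/query)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical sample response instead of a live Prometheus call.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"check_name": "login"}, "value": [1665400000, "1"]},
                {"metric": {"check_name": "api"}, "value": [1665400000, "0"]},
            ],
        },
    }
    print(failing_series(sample))
```

Keeping the parsing separate from the HTTP call makes the check easy to exercise with canned responses, so the verification logic itself can be tested without a running Prometheus.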

Conclusions

With all the complexity of modern systems, confident development and operation are vital. Resilience testing and continuous verification build confidence in the systems you maintain and help you evolve them. It is essential to always have an up-to-date picture of the system’s risks and to perform such resilience tests with a high degree of automation. Steadybit can be leveraged to author and execute these resilience tests. Try it yourself; you can get started for free!