
Learning Chaos Engineering

What is Chaos Engineering?

Chaos engineering is the practice of experimenting on a system in order to build confidence in the system’s capability to gracefully handle a variety of conditions in production.

These proactive experiments intentionally put systems under stress so teams can identify weaknesses before they surface in production and negatively impact customers. Experiments often simulate real-world conditions like high traffic, unexpected outages, latency in system dependencies, or resource bottlenecks. By testing the reliability of your systems with different experiment types, software teams can build a predictable understanding of how their systems behave.

While there is a long history of intentional fault injection in software development, the discipline of chaos engineering was popularized in 2010 by the reliability team at Netflix with their internal tool Chaos Monkey, which randomly terminated instances in production.

Chaos engineering has evolved since then with the advent of new experiment types and enterprise chaos engineering tools like Steadybit and Gremlin. Teams can now deploy highly controlled and customized experiments across their systems in a few clicks. It’s never been safer and easier to run reliability tests to prove the resilience of your systems.

How Reliability Experts Describe Chaos Engineering

Learn the basics of chaos engineering from industry leaders who have successfully run programs at major companies.

"What is Chaos Engineering?" with Casey Rosenthal

In this clip from "Experiments in Chaos", Casey Rosenthal shares how he collaborated with resilience leaders to precisely define chaos engineering.

Read More
Switching from Reactive to Proactive Reliability Approaches

Adrian Hornsby shares his perspective on why teams struggle to shift to a more proactive reliability approach.

Read More
Introducing Chaos Engineering Practices Into Your Organization

In this clip, Tom Handal from Roblox shares his story of bringing chaos engineering into the organization at scale.

Read More

How Does Chaos Engineering Work?

The goal of chaos engineering is to learn about your systems by running experiments, either in pre-production or production environments. Like any experiment, there are standard steps you would follow to ensure that you are learning from your efforts:

By following a systematic process and challenging assumptions, teams are better able to generate actionable insights into system performance.

Step 1: Define Expectations

Start by identifying your system’s baseline behavior using metrics like latency, throughput, or error rates. Observability or monitoring tools are critical in providing visibility into these types of metrics, so you can develop specific benchmarks. If you haven’t configured a monitoring tool, that is often a good first step before diving deeper into chaos engineering.

Step 2: Hypothesize Outcomes

Next, create a hypothesis for how you expect your system to behave when a specific fault or stress is introduced. This hypothesis should include a trackable metric, such as maintaining a certain performance standard or preserving specific application functionality (e.g. “If the primary database fails, read replicas will maintain query availability.”)

A successful experiment is one where the hypothesis was correct (e.g. “We expect average CPU utilization to not exceed 50% at any point”). A failed experiment means that the hypothesis wasn’t correct and adjustments need to be made.
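A hypothesis like this can be made concrete as a small check against collected metric samples. The sketch below is a minimal illustration; the `hypothesis_holds` helper and the sample values are hypothetical, not part of any specific tool:

```python
# Sketch: encode a hypothesis as a checkable assertion over metric samples.
# Hypothesis: "We expect CPU utilization to not exceed 50% at any point."
# The sample data below is illustrative, not from a real system.

def hypothesis_holds(cpu_samples, threshold=50.0):
    """Return True if no CPU sample exceeded the threshold percentage."""
    return max(cpu_samples) <= threshold

# Samples collected during the experiment window
samples = [22.5, 31.0, 47.9, 38.2]
print(hypothesis_holds(samples))           # experiment "succeeds"
print(hypothesis_holds(samples + [63.4]))  # experiment "fails"
```

Framing the hypothesis as an executable check also makes it easy to rerun automatically when the experiment is repeated later.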

Step 3: Run an Experiment

Manually run your experiment by introducing your test conditions. Typically, this involves artificially interrupting services, adding latency spikes, stressing CPU or memory, or simulating an outage. We’ll explain how teams actually inject these failures in a section below, but it typically requires using a script or ClickOps to make these changes, or utilizing a chaos engineering platform to orchestrate experiments.
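To illustrate the idea of latency injection without touching real infrastructure, here is a minimal in-process sketch. The `inject_latency` decorator and `fetch_user` function are hypothetical stand-ins for a real fault-injection mechanism, which would typically operate at the network or platform level:

```python
import functools
import time

def inject_latency(delay_seconds):
    """Decorator that adds a fixed artificial delay before a call --
    a toy stand-in for network-level latency injection."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(delay_seconds)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(0.2)  # 200 ms of added latency
def fetch_user(user_id):
    # placeholder for a real downstream call
    return {"id": user_id, "name": "example"}

start = time.perf_counter()
fetch_user(42)
elapsed = time.perf_counter() - start
print(f"call took {elapsed:.3f}s")  # roughly 0.2s slower than normal
```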

Step 4: Observe Behavior

As your experiment runs, use your monitoring tools to compare actual performance against expected outcomes. For example, if you expect the service to degrade somewhat when latency is injected, you can monitor whether the degradation exceeds your expectations. Once you review the logs, you should be able to determine whether your experiment was a success or a failure, or in other words, whether your hypothesis was correct.
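Comparing observed behavior against the baseline can be as simple as computing the degradation and checking it against the tolerance your hypothesis defined. A minimal sketch, with illustrative numbers:

```python
def degradation_pct(baseline_ms, observed_ms):
    """Percentage increase of observed latency over the baseline."""
    return (observed_ms - baseline_ms) / baseline_ms * 100

# Illustrative numbers: 120 ms baseline, 150 ms observed during the experiment
change = degradation_pct(120.0, 150.0)
print(f"latency degraded by {change:.0f}%")  # 25%

tolerated = change <= 30.0  # hypothesis: degradation stays under 30%
print("hypothesis held" if tolerated else "hypothesis failed")
```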

Step 5: Refine and Repeat

Now that you have run your experiment, you can make use of the results. 

If your hypothesis was correct, then you have accurately predicted how your system would behave. It may have even exceeded your expectations. If you predicted a positive reliability result, then you should have more confidence in your systems now. 

If your hypothesis was proven incorrect and your system was unable to perform adequately under the conditions, then you have likely revealed a reliability issue that could be a liability in production if the experiment conditions were to become a reality. Once you have made changes to address the issue, you can then rerun the experiment to validate that it is now successful.

A successful experiment validates that, at the time of running the experiment, the hypothesis was correct. To ensure that you maintain this result over time, teams will pick certain experiments to automate and incorporate in their CI/CD workflows.

Key Principles of Chaos Engineering

These practices are important for implementing chaos engineering and reliability testing in an effective way.


Start Small and Focused

Start by selecting smaller targets to limit the impact of your experiment. For example, instead of running experiments in production, start in a non-production environment to lower the potential negative impact. Instead of testing a full cluster, target a subset. Instead of a critical service, start with a service that has fewer downstream impacts.

Start small, and once you have seen success with your experiments, you can gradually expand to incorporate multiple services.


Communicate Early and Often

Each new experiment offers an opportunity to learn how your sociotechnical systems behave and respond under certain conditions. Those systems include both the applications and the teams supporting them.

It’s critical to proactively communicate ahead of time so everyone has visibility and understands the potential impacts of experiments. With clear communication, you can educate teams internally on why you are running experiments and share key insights from the results.

Read More: Principles of Ethical Chaos Engineering


Enable Teams to Experiment Safely

If you want to scale reliability testing and chaos engineering across an organization, it’s critical that it’s easy to create and run experiments.

Chaos engineering platforms like Steadybit are designed to give teams the ability to discover targets, analyze reliability gaps, and launch experiments with a drag-and-drop timeline editor. Automated development workflows can then use experiments to validate reliability standards without slowing down development teams. Adoption depends on strong enablement.

Getting Started with Chaos Engineering

Starting with chaos engineering can seem complex, but it’s similar to adopting any other type of new practice.

Here’s some guidance on how to begin:

1. Set System Benchmarks and Metric-Based Goals

Before you run experiments, you should start by establishing baseline metrics that represent your system’s steady state, such as latency, error rates, or throughput. You can track these metrics in your Observability tool, and they will help you gauge the impact of experiments on your system health.

Once you have collected metrics on your steady state, you can create goals for your chaos engineering efforts. For example, maybe you want to reduce the number of critical incidents per quarter by 20% on tested services. Alternatively, you could aim to create experiments that replicate your last 3 critical incidents so you can continually validate your fixes moving forward. Every organization is different, so pick goals that make the most sense for your use case.

2. Start Small to Limit Negative Impacts

Begin with a narrow scope to limit the potential impact of experiments. It’s better to increase your scope as you gain confidence than to risk running an experiment with outsized negative impacts. For example, test a single service or a non-production environment first. Chaos engineering tools like Steadybit let you easily adjust the blast radius of your experiments, so you can start experimenting safely and expand targets later.

3. Experiment and Make Adjustments

Use the insights from your initial experiments to strengthen your systems. By running experiments, you validate either that your expectations of your system were correct or wrong. You can either adjust your expectations or make changes to ensure that your systems are resilient enough to meet your initial hypothesis.

With this iterative cycle, you should be able to prove out different system behaviors and validate your SLOs.

4. Scale Up Gradually

Once you’ve gained confidence in smaller experiments, expand the scope to include more critical systems or larger portions of your infrastructure. While you may want to start experimenting on non-critical systems to develop your process, testing the reliability of your business critical systems is where you will gain the most ROI from chaos engineering.

Get Started: Ready to explore chaos engineering? Schedule a demo with Steadybit and transform your systems into resilience powerhouses.

Run Experiments Across SaaS and On-Prem Systems

Learn more about the most common types of reliability tests you can run for different technologies.


Chaos Experiment Library

Browse an open source library of experiment actions.


Chaos Engineering for Kubernetes

Stress test pods & clusters.


Chaos Engineering for AWS

Run experiments on targets across the AWS ecosystem.


Chaos Engineering for Azure

Run experiments on Azure VMs and containers.

Inject Faults with Chaos Experiments

Here are examples of various types of attacks you can run to stress test your systems.

Network
Kubernetes
Cloud Services
Physical & Virtual Hosts
Applications
Observability

System Failures are Inevitable in Complex Systems

Organizations need to shift their mindset to start developing truly fault-tolerant systems.

Accepting that System Failures are Inevitable

In complex systems, failures are bound to occur. Hear how teams can build systems that are designed to be resilient.

Read More
Explaining the Prevention Paradox

When things don't break, SRE teams rarely get the attention that they do during an outage. This "Prevention Paradox" can make it challenging to advocate for reliability efforts.

Read More
Introducing Chaos Engineering as a Daily Practice

Hear how the teams at Roblox started with familiar tests to begin adopting chaos engineering practices.

Read More

Common Challenges When Rolling Out Chaos Engineering

When engineering teams want to start a chaos engineering program, they often get pushback if organization leadership hasn’t already bought in. Here are some of the challenges that teams face when working to roll out chaos engineering across their organization:

Roadmap Competition

While proactive reliability programs can deliver tremendous business value, they are sometimes harder to sell than developing new features or optimizing existing investments, such as Observability tooling or new tools within a cloud provider ecosystem. Chaos engineering is a new practice for many engineers, so teams may be hesitant to embrace this approach and make time for exploratory learning.

Solution: Identify your most unreliable customer-facing services and quantify the business impact of downtime.

Cultural Resistance

Many teams hesitate to embrace chaos engineering due to fears of causing unnecessary disruptions or the perception that it adds extra work. When engineers are stuck in a reactive mode, they are constantly context-switching and responding to new alerts. Adding more “chaos” can seem overwhelming, even though each experiment is meant to bring order to the chaos that already exists in production.

Solution: Run GameDays or reliability workshops to give engineers an opportunity to get familiar with running experiments.

Resource Constraints

Implementing chaos engineering across applications requires time, tooling, and cooperation. Organizations budget for a certain headcount and tool expenses, so increasing spend requires making a thoughtful business case and gaining buy-in from leadership.

Few organizations budget for the revenue loss expected from outages, but outages still inevitably occur. If executives are not already convinced of the value of delivering highly available applications, it may take a major incident to get their attention and budget approval.

Solution: Create a business case that shows the potential value of preventing critical incidents earlier in the development lifecycle.

Technical Complexity

If you aren’t confident in how your systems run currently, it might seem difficult to conduct experiments reliably as well. Running experiments with open source scripts in an unstandardized way, generating ad hoc results, is the easiest way to start chaos engineering, but it is also the hardest approach to scale and get predictable value from.

Deploying experiments across different technologies requires nuance and engineering know-how. This complexity often keeps teams from moving forward, since their initial steps don’t seem to lead down a clear path.

Solution: Adopt a chaos engineering solution that can standardize experiments and deploy across a wide variety of tools & technologies.

Learn About the Role of Chaos Engineering

These are some of the topics you can start to review.


Top Chaos Engineering Tools

Compare approaches to find the right tool for you.


The ROI of Chaos Engineering

Read about how you can build a reliability business case.


DORA Compliance

Learn how chaos engineering helps with DORA compliance.

The Steadybit Academy - Learning from Chaos

Getting Started with Steadybit

Explore tutorials to get started with a tool like Steadybit.

Measuring the Business Value of Chaos Engineering

So far, we’ve defined chaos engineering, explained the key principles, and outlined examples, but how can you make a business case for prioritizing chaos engineering over other initiatives?

While there are lots of ways to calculate an ROI, these are the most common approaches:

Preventing Costly Incidents in Production

If you have applications that are associated with revenue or relied upon by customers, downtime is a tangible cost. You can project your current incident costs by calculating:

number of incidents in the last year × business cost of each hour of downtime × average MTTR (hours) = annual incident costs

If you break it out by incident severity, it could look something like this:

Current Reliability Profile   | # in the Last Year | Business Cost Per Hour | Avg. MTTR (Hours) | Total Incident Costs
Incidents – Low Severity      | 400                | $5,000                 | 1.5               | $3,000,000
Incidents – Medium Severity   | 20                 | $10,000                | 1.5               | $300,000
Incidents – Critical Severity | 2                  | $200,000               | 2.5               | $1,000,000
Total                         |                    |                        |                   | $4,300,000

With this approach, you can then project savings by estimating a reduction in the number of production incidents per year due to proactive reliability testing. In the above example, if implementing chaos engineering was able to help reduce incidents by 20%, then that would result in annual savings of $860,000. You can try our interactive ROI calculator to see how chaos engineering could benefit your organization.
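The calculation above can be sketched in a few lines of Python, reproducing the example table’s figures (the numbers are illustrative, drawn from the table, not from any real organization):

```python
# Sketch of the incident-cost model, using the example table's figures.

severities = [
    # (label, incidents per year, business cost per hour, avg MTTR in hours)
    ("low",      400,   5_000, 1.5),
    ("medium",    20,  10_000, 1.5),
    ("critical",   2, 200_000, 2.5),
]

annual_cost = sum(count * cost_per_hour * mttr
                  for _, count, cost_per_hour, mttr in severities)
print(f"annual incident cost: ${annual_cost:,.0f}")  # $4,300,000

reduction = 0.20  # assume chaos engineering cuts incidents by 20%
print(f"projected annual savings: ${annual_cost * reduction:,.0f}")  # $860,000
```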

Finding and Mitigating Reliability Risks

If you have applications with outstanding reliability issues, they carry a potential risk of downtime. Even if your organization hasn’t experienced significant downtime yet, there is meaningful value in finding and fixing these issues and reducing the overall risk of outages.

If you utilize the incident cost calculation above, you can determine a value for each reliability risk by severity. For example, if by running chaos experiments, you are able to identify a critical severity issue and fix it, you have prevented an issue that could have resulted in $500,000 in damages if it had occurred in production. 

As you run more experiments and scale your testing across applications, you will uncover your existing reliability issues and streamline your ability to remediate risk across your systems. If you run 100 experiments and 10 of them reveal reliability issues that need to be fixed, you can fix all or a subset of those issues and document the potential business liabilities that have now been addressed.

Improving Incident Response and Operational Readiness

If teams rely on incidents occurring in production to test runbooks, run root cause analysis, and remediate issues, they are missing the opportunity to practice and optimize their response in a low-stress simulated scenario.

Running chaos experiments enables organizations to continually sharpen their incident response processes, validate that runbooks are up-to-date, and improve their Mean Time To Resolve (MTTR) issues.

Fidelity Investments recently presented on how they scaled chaos engineering efforts across their applications. As they expanded their “chaos coverage”, they were able to meaningfully decrease their MTTR.

Hear Chaos Engineering Journeys from Idea to Roll Out

In these videos, reliability leaders share their experience rolling out chaos engineering and measuring its impact.

The Impact of Fidelity Investments Rolling Out Chaos Engineering

See the impact to average MTTR as the team at Fidelity rolled out chaos engineering.

How Salesforce is Using Chaos Engineering to Achieve Reliability

Hear from Krishna Palati about how engineers at Salesforce are using Steadybit to run chaos experiments.

Read More
Building System Confidence at ManoMano

Antoine Choimet shares his experience as a chaos engineer at ManoMano.

Read More

How should chaos engineering integrate with Observability tools?

Observability tools like DataDog, Dynatrace, Honeycomb, Grafana, New Relic, and more offer organizations real-time insights and monitoring into their system performance.

For example, DataDog captures data from servers, containers, databases, and third-party services, providing comprehensive visibility for cloud-scale applications. When integrated with a chaos engineering tool like Steadybit, it tracks the impact of chaos experiments in real time, enabling teams to correlate chaos events with changes in system metrics and logs.

Similarly, New Relic delivers robust application performance monitoring with a focus on distributed systems, offering detailed insights into applications, infrastructure, and digital customer experiences.

Users can customize their observability environment with features like custom applications and dashboards, allowing for tailored analyses of chaos experiments and greater control over system health.

To learn from chaos experiments, you need the right data. It’s important to have a tight integration between your chaos engineering tool and monitoring tools so teams can track and act on key metrics like latency and error rates.

For example, Steadybit integrates with monitoring tools to track metrics like:

  • Latency: Measure response times under stress.
  • Error Rates: Identify transaction failures.
  • Resource Utilization: Monitor CPU, memory, and network usage.
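As a minimal illustration of deriving such metrics, here is a sketch that computes average latency and error rate from raw request records; the field names are illustrative, not any specific tool’s schema:

```python
# Sketch: derive latency and error-rate metrics from raw request records.
# The records and field names below are illustrative.

requests = [
    {"latency_ms": 110, "status": 200},
    {"latency_ms": 480, "status": 200},
    {"latency_ms": 95,  "status": 503},
    {"latency_ms": 130, "status": 200},
]

latencies = [r["latency_ms"] for r in requests]
errors = sum(1 for r in requests if r["status"] >= 500)

print(f"avg latency: {sum(latencies) / len(latencies):.1f} ms")
print(f"error rate:  {errors / len(requests):.0%}")
```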

What tools do people use for chaos engineering?

Open Source Chaos Engineering Tools

There are many open source tools for chaos engineering like Chaos Monkey, LitmusChaos, Chaos Mesh, and Pumba. For the most part, these open source tools are technology specific and good for initial experimentation.

For anyone trying to roll out chaos engineering and reliability testing at a multi-team or organization-wide level, commercial tools offer easier deployment, better user experience, and enterprise-grade security features.

Commercial Chaos Engineering Tools

Some companies build their own custom chaos engineering solutions, but then struggle with the maintenance and development of new functionality. Instead of building a custom solution, many teams will opt to buy a commercial platform.

The three leading commercial tools are Steadybit, Gremlin, and Harness Chaos.

If you are interested in evaluating a commercial chaos engineering platform, you’ll find that Steadybit is the easiest to use, customize, and deploy due to its open source extension framework and timeline-based experiment editor.

What are examples of chaos experiments?

There are a wide range of possible chaos experiments you could run to test your systems. In this section, we’ll outline common experiment types and list specific examples for each.

Dependency Failures

With the rise of microservices, systems are more reliant on internal and external dependencies to fulfill requests effectively. Experiments that simulate service failures allow teams to see how their systems respond to outages.

Here are experiment examples: 

  • Simulate increased latency or packet loss to test service response times and throughput
  • Emulate the unavailability of a critical service to observe the system’s resilience and failure modes
  • Introduce connection failures or read/write delays to assess the robustness of data access patterns and caching mechanisms
  • Mimic rate limiting or downtime of third-party APIs to evaluate external dependency management and error handling.
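As a small illustration of what a dependency-failure experiment exercises, here is a sketch of a cache fallback; `DependencyUnavailable`, `fetch_price`, and the cache are hypothetical stand-ins for a real service and its degradation path:

```python
# Sketch: the fallback pattern a dependency-failure experiment would exercise.

class DependencyUnavailable(Exception):
    pass

def fetch_price(product_id, primary, cache):
    """Try the primary service; fall back to a cached value if it is down."""
    try:
        return primary(product_id)
    except DependencyUnavailable:
        return cache.get(product_id)  # stale but available

def broken_primary(product_id):
    # stand-in for a dependency taken down by the experiment
    raise DependencyUnavailable("simulated outage")

cache = {"sku-1": 9.99}
print(fetch_price("sku-1", broken_primary, cache))  # 9.99, from the cache
```

An experiment that takes the primary service down validates that this fallback path actually works, rather than assuming it does.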

Resource Constraints

Each system has resource limitations on things like CPU, memory, disk I/O, and network bandwidth. Even with cloud providers and autoscaling options, it’s useful to run experiments that stress your systems to see how resource constraints impact performance. 

Here are experiment examples: 

  • Simulate memory leaks or high memory demands to test the system’s handling of low memory conditions.
  • Increase disk read/write operations or fill up disk space to observe how the system copes with disk I/O bottlenecks and space limitations.
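For illustration, a crude in-process memory stressor can be sketched in a few lines; real experiments would typically use a purpose-built stress action from a chaos engineering tool (or a utility like stress-ng) for safer, more controllable pressure:

```python
# Sketch: a crude in-process memory stressor, for illustration only.

def allocate_mb(megabytes):
    """Hold roughly `megabytes` of memory to simulate memory pressure."""
    return bytearray(megabytes * 1024 * 1024)

block = allocate_mb(64)  # hold ~64 MB while the experiment runs
print(f"holding {len(block) // (1024 * 1024)} MB")
del block                # release when the experiment ends
```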

Network Disruptions

Various network conditions can affect a system’s operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, experiments can help show how a system responds and adapts to network unreliability.

Here are experiment examples: 

  • Introduce DNS resolution issues to evaluate the system’s reliance on DNS and its ability to use fallback DNS services.
  • Introduce artificial delay in the network to simulate high-latency conditions, affecting the communication between services or components.
  • Simulate the loss of data packets in the network to test how well the system handles data transmission errors and retries.
  • Limit the network bandwidth available to the application, simulating congestion conditions or degraded network services.
  • Force abrupt disconnections or intermittent connectivity to test session persistence and reconnection strategies.
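To illustrate what a packet-loss experiment tests, here is a deterministic sketch of a retry loop; `send`, the drop schedule, and `send_with_retries` are illustrative stand-ins for a real transport:

```python
# Sketch: simulated packet loss with a bounded retry loop, illustrating the
# behavior a packet-loss experiment exercises.

def send(payload, drops):
    """Pretend to transmit; raise if the next packet is scheduled to drop."""
    if drops and drops.pop(0):
        raise TimeoutError("packet lost")
    return "ack"

def send_with_retries(payload, drops, retries=5):
    for attempt in range(1, retries + 1):
        try:
            return send(payload, drops), attempt
        except TimeoutError:
            continue
    raise TimeoutError(f"gave up after {retries} attempts")

# first two transmissions are dropped, the third gets through
result, attempts = send_with_retries("ping", drops=[True, True, False])
print(f"delivered '{result}' after {attempts} attempts")  # 3 attempts
```

A real experiment would introduce loss at the network layer and verify that the system’s actual retry and timeout behavior matches expectations like these.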

Read More: Types of Chaos Experiments (+ How To Run Them According to Pros)

If you want to see even more specific experiment templates, you can view over 80 examples in the Reliability Hub, an open source library sponsored by Steadybit.

What are GameDays?

GameDays are events that Site Reliability Engineering teams run to strengthen observability and SRE skills across software development teams. These events will often feature one or multiple chaos experiments, where teams will need to respond to an incident and troubleshoot the situation. This could involve racing team members to uncover the root cause of an incident.

These workshops are an opportunity for everyone to learn about their systems and share knowledge across teams to improve organizational resilience.

Rolling out chaos engineering at your organization?

Reach out to consult with our team of experts and hear how Steadybit could help.

Tune in to "Experiments in Chaos" to learn more

Start Running Experiments

Want to try chaos engineering today?

Create a free trial with Steadybit to build experiments and start testing your systems.
