What is Chaos Engineering?
Chaos Engineering is the proactive approach to testing and strengthening resilience across applications and services by deliberately introducing controlled disruptions. These disruptions reveal weaknesses and behaviors under real-world conditions like high traffic, unexpected outages, or resource bottlenecks.
While intentional fault injection has a long history in software development, the discipline of chaos engineering was popularized in 2010 by the reliability team at Netflix with an internal tool called Chaos Monkey, which randomly terminated instances in production.
Since then, reliability teams around the world have adopted new practices, technologies, and tools to implement this type of reliability testing in a safe and strategic way.
How does Chaos Engineering work?
The goal of chaos engineering is to learn about your systems by running experiments, either in pre-production or production environments. Like any experiment, there are standard steps you would follow to ensure that you are learning from your efforts:
By following a systematic process and challenging assumptions, teams are better able to generate actionable insights into system performance.
Step 1: Define Expectations
Start by identifying your system’s baseline behavior using metrics like latency, throughput, or error rates. Observability or monitoring tools are critical for providing visibility into these metrics so you can establish specific benchmarks. If you haven’t configured a monitoring tool yet, setting one up is often a good first step before diving deeper into chaos engineering.
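For illustration, here is a minimal sketch of capturing such a baseline, assuming Prometheus is the monitoring tool and that the metric names in the queries exist in your environment (both are assumptions; any observability tool with a query API can play the same role):

```python
# Baseline snapshot sketch. Assumes a Prometheus server at PROM_URL and a
# request-duration histogram named http_request_duration_seconds; both names
# are placeholders for whatever your environment actually exposes.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: a reachable Prometheus instance

def query_prometheus(promql: str) -> list:
    """Run an instant PromQL query and return the result vector."""
    url = f"{PROM_URL}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]

# Steady-state benchmarks: p95 latency and error rate over the last 5 minutes.
p95_latency = query_prometheus(
    "histogram_quantile(0.95, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
)
error_rate = query_prometheus(
    'sum(rate(http_requests_total{status=~"5.."}[5m])) '
    "/ sum(rate(http_requests_total[5m]))"
)
print("p95 latency:", p95_latency)
print("error rate:", error_rate)
```

Recording these values before any fault is introduced gives you the benchmark that later experiment results are compared against.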
Step 2: Hypothesize Outcomes
Next, create a hypothesis for how you expect your system to behave when a specific fault or stress is introduced. The hypothesis should be tied to a trackable metric or a clear piece of application functionality, such as maintaining a defined performance standard (e.g. “If the primary database fails, read replicas will maintain query availability.”).
A successful experiment is one in which the hypothesis was correct (e.g. “We expect average CPU utilization to not exceed 50% at any point”). A failed experiment means the hypothesis wasn’t correct and adjustments need to be made.
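To make the hypothesis concretely trackable, it helps to write it down as a metric plus a bound that can be checked automatically. The sketch below is generic Python, not a Steadybit feature; the metric function is a stub you would wire to your monitoring tool:

```python
# Hedged sketch: a hypothesis expressed as "metric stays within a bound".
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    description: str
    get_metric: Callable[[], float]  # samples the metric from your monitoring tool
    upper_bound: float               # the hypothesis holds while metric <= upper_bound

    def holds(self) -> bool:
        return self.get_metric() <= self.upper_bound

# Example: "average CPU utilization does not exceed 50% while one replica is down."
cpu_hypothesis = Hypothesis(
    description="Avg CPU utilization stays at or below 50% during the fault",
    get_metric=lambda: 0.42,         # stub value; replace with a real metric query
    upper_bound=0.50,
)
print("hypothesis holds:", cpu_hypothesis.holds())
```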
Step 3: Run an Experiment
Run your experiment by manually introducing your test conditions. Typically, this involves artificially interrupting services, injecting latency spikes, stressing CPU or memory, or simulating an outage. We’ll explain how teams inject these failures in a section below; in practice it means making the changes with a script or ClickOps, or using a chaos engineering platform to orchestrate the experiment.
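As a rough illustration of a scripted, manual experiment, the sketch below stops a Docker container to emulate a short outage, polls a health endpoint while the fault is active, and then restores the container. The container name and health URL are made-up placeholders, and a chaos engineering platform would normally handle this orchestration and rollback for you:

```python
# Manual experiment sketch: emulate a service outage by stopping one container.
# Assumes Docker is installed and a container named "payment-service" exists;
# both are illustrative assumptions, not a prescribed setup.
import subprocess
import time
import urllib.request

CONTAINER = "payment-service"                  # hypothetical target service
HEALTH_URL = "http://localhost:8080/health"    # hypothetical dependent endpoint

def is_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

try:
    print("baseline healthy:", is_healthy())
    subprocess.run(["docker", "stop", CONTAINER], check=True)    # inject the fault
    for _ in range(12):                                          # observe for ~1 minute
        print("during fault, dependent service healthy:", is_healthy())
        time.sleep(5)
finally:
    subprocess.run(["docker", "start", CONTAINER], check=True)   # always roll back
```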
Step 4: Observe Behavior
As your experiment runs, use your monitoring tools to compare actual performance against expected outcomes. For example, if you expect the service to degrade somewhat when latency is injected, you can watch whether the degradation stays within or exceeds your expectations. After reviewing the metrics and logs, you should be able to determine whether your experiment was a success or a failure, in other words, whether your hypothesis was correct.
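Conceptually, the comparison is straightforward: sample the relevant metric while the fault is active and check it against the bound from your hypothesis. The sketch below uses a stubbed sampling function; in practice it would query your monitoring tool:

```python
# Observation sketch: track the worst observed value against the agreed budget.
import time

LATENCY_BUDGET_MS = 300.0  # illustrative bound taken from the hypothesis

def sample_p95_latency_ms() -> float:
    """Stub: replace with a query against Datadog, New Relic, Prometheus, etc."""
    return 250.0

samples = []
for _ in range(10):                       # sample periodically during the experiment
    samples.append(sample_p95_latency_ms())
    time.sleep(1)

worst = max(samples)
print(f"worst p95 during fault: {worst:.0f} ms (budget {LATENCY_BUDGET_MS:.0f} ms)")
print("hypothesis held" if worst <= LATENCY_BUDGET_MS else "hypothesis failed")
```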
Step 5: Refine and Repeat
Now that you have run your experiment, you can make use of the results.
If your hypothesis was correct, then you accurately predicted how your system would behave, and it may even have exceeded your expectations. If you predicted a positive reliability result, you should now have more confidence in your systems.
If your hypothesis was proven incorrect and your system was unable to perform adequately under the test conditions, then you have likely revealed a reliability issue that could become a liability in production if those conditions were to occur for real. Once you have made changes to address the issue, rerun the experiment to validate that it now passes.
A successful experiment validates that, at the time it was run, the hypothesis was correct. To make sure this result holds over time, teams pick certain experiments to automate and incorporate into their CI/CD workflows.
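A common pattern for that automation is to wrap the experiment in a script whose exit code gates the pipeline. The sketch below is deliberately generic; run_experiment() stands in for whatever harness or platform call your team actually uses to inject the fault, observe the metrics, and verify the hypothesis:

```python
# CI/CD gate sketch: a non-zero exit code fails the pipeline step.
import sys

def run_experiment() -> bool:
    """Placeholder for the inject-observe-verify steps described above.
    Returns True if the hypothesis held."""
    return True

if __name__ == "__main__":
    if run_experiment():
        print("resilience check passed")
        sys.exit(0)
    print("resilience check failed, blocking the deploy")
    sys.exit(1)
```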
Key Principles of Chaos Engineering
Chaos Engineering can be an intimidating term for stakeholders who are not familiar with the practice. “We have enough chaos already!” is a common response.
To implement chaos engineering and reliability testing in an effective way, it’s important to follow some key principles:
Start Small and Focused, Then Expand
The best way to get started is by selecting a relatively small target and limiting the impact of your experiment. For example, instead of running experiments in production, lower the potential negative impact by running experiments in a non-production environment. Instead of running an experiment that will impact a full cluster, target a subset. Instead of a critical service, start with a service that has fewer downstream impacts.
Start small, and once you have seen success with your experiments, you can gradually expand to incorporate multiple services.
Communicate Early and Often Across Teams
With each new experiment, there is an opportunity to learn how your systems behave and respond under certain conditions. These sociotechnical systems include both the applications and the software teams behind them.
Before running an experiment, it’s critical to communicate ahead of time so everyone has sufficient visibility and understands the potential impacts. Stress your applications, not your people. Proactive communication also lets you educate teams internally on why you are running experiments, the key insights from the results, and what changes are needed to boost reliability.
Experiments are also an opportunity for SRE teams to communicate reliability issues to Dev teams and make sure resilience is prioritized across the board. For example, many Site Reliability Engineering (SRE) teams hold events called GameDays, during which they run specific experiments and collaborate with developers to perform root cause analyses and improve their incident response processes.
Read More: Principles of Ethical Chaos Engineering
Enable Teams to Run Experiments Easily
If an organization wants to scale reliability testing and chaos engineering across applications and teams, it’s critical to make creating and running experiments as simple, fast, and automated as possible. Commercial chaos engineering platforms like Steadybit are designed to let teams quickly scan environments for targets, identify reliability gaps, and launch experiments with a drag-and-drop timeline editor. These experiments can then be incorporated into automated development workflows to validate reliability standards without burdening development teams with time-intensive processes. When there is strong enablement for chaos engineering programs, real adoption across teams is much more likely.
Examples of Chaos Experiments
There are a wide range of possible chaos experiments you could run to test your systems. In this section, we’ll outline common experiment types and list specific examples for each.
Dependency Failures
With the rise of microservices, systems are more reliant on internal and external dependencies to fulfill requests effectively. Experiments that simulate service failures allow teams to see how their systems respond to outages.
Here are experiment examples, with a minimal sketch of the unavailable-service case after the list:
- Simulate increased latency or packet loss to test service response times and throughput.
- Emulate the unavailability of a critical service to observe the system’s resilience and failure modes.
- Introduce connection failures or read/write delays to assess the robustness of data access patterns and caching mechanisms.
- Mimic rate limiting or downtime of third-party APIs to evaluate external dependency management and error handling.
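As a hedged illustration of the unavailable-service case, the sketch below stands up a local stub that always answers 503 and sends a request to it, so you can confirm the calling code surfaces and handles the failure. The port and path are arbitrary placeholders:

```python
# Dependency-failure sketch: a local stub plays the role of a downed service.
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlwaysDown(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(503)          # the "dependency" is always unavailable
        self.end_headers()

    def log_message(self, *args):        # keep the demo output quiet
        pass

server = HTTPServer(("localhost", 8099), AlwaysDown)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Exercise a code path that depends on the hypothetical "payments API" and
# verify the failure is observable and handled rather than silently ignored.
try:
    urllib.request.urlopen("http://localhost:8099/charge", timeout=2)
except urllib.error.HTTPError as err:
    print("dependency returned", err.code, "- fallback or retry logic should kick in")
finally:
    server.shutdown()
```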
Resource Constraints
Each system has resource limitations on things like CPU, memory, disk I/O, and network bandwidth. Even with cloud providers and autoscaling options, it’s useful to run experiments that stress your systems to see how resource constraints impact performance.
Here are experiment examples, with a crude resource-stress sketch after the list:
- Simulate memory leaks or high memory demands to test the system’s handling of low memory conditions.
- Increase disk read/write operations or fill up disk space to observe how the system copes with disk I/O bottlenecks and space limitations.
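For illustration only, a crude way to approximate CPU and memory pressure on a disposable test host is sketched below. Real experiments usually rely on purpose-built stressors (for example stress-ng) or a chaos platform’s resource attacks rather than hand-rolled scripts:

```python
# Resource-constraint sketch: busy-loop every core and hold a chunk of memory.
# Only run this on a disposable machine or test environment.
import multiprocessing
import os
import time

def burn_cpu(seconds: float) -> None:
    """Busy-loop on one core for the given duration."""
    end = time.time() + seconds
    while time.time() < end:
        pass

if __name__ == "__main__":
    duration = 30                                 # seconds of stress
    workers = os.cpu_count() or 1
    procs = [multiprocessing.Process(target=burn_cpu, args=(duration,))
             for _ in range(workers)]
    for p in procs:
        p.start()
    memory_hog = bytearray(512 * 1024 * 1024)     # hold roughly 512 MiB
    for p in procs:
        p.join()
    del memory_hog
```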
Network Disruptions
Various network conditions can affect a system’s operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, experiments can show how a system responds and adapts to network unreliability.
Here are experiment examples, with a latency and packet-loss sketch after the list:
- Introduce DNS resolution issues to evaluate the system’s reliance on DNS and its ability to use fallback DNS services.
- Introduce artificial delay in the network to simulate high-latency conditions, affecting the communication between services or components.
- Simulate the loss of data packets in the network to test how well the system handles data transmission errors and retries.
- Limit the network bandwidth available to the application, simulating congestion conditions or degraded network services.
- Force abrupt disconnections or intermittent connectivity to test session persistence and reconnection strategies.
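As a minimal sketch of latency and packet-loss injection on a single Linux test host, the commands below use tc with the netem qdisc. This assumes root privileges and that eth0 is the interface you want to disturb; a chaos platform would typically apply the same kind of fault with guardrails and automatic rollback:

```python
# Network-disruption sketch: add delay and loss with tc/netem, then clean up.
# Requires root and a Linux host; IFACE is an assumption, adjust as needed.
import subprocess
import time

IFACE = "eth0"  # assumption: the interface carrying the traffic under test

try:
    # 100 ms +/- 20 ms of added delay and 1% packet loss on outgoing traffic
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", "100ms", "20ms", "loss", "1%"],
        check=True,
    )
    time.sleep(60)  # keep the disruption active while you observe the system
finally:
    # Always remove the qdisc so the host returns to normal networking
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)
```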
Read More: Types of Chaos Experiments (+ How To Run Them According to Pros)
If you want to see even more specific experiment templates, you can view over 50 examples in the Reliability Hub, an open source library sponsored by Steadybit.
Conducting Chaos Experiments
Chaos experiments follow a simple, repeatable process: plan, execute, and analyze. With the right tools, teams can safely test their systems at any scale.
To execute experiments effectively, follow a structured approach:
- Plan: Identify critical scenarios to test while limiting the blast radius.
- Execute: Use Steadybit’s tools to simulate failures in a controlled environment.
- Analyze: Measure system performance against defined objectives and implement improvements.
Steadybit supports time-based and recurring experiment scheduling, integrating seamlessly with CI/CD pipelines to ensure consistent testing aligned with deployment workflows.
Common Uses for Chaos Engineering
Teams use Chaos Engineering to prepare for outages, validate disaster recovery plans, and improve reliability. It’s about building systems that can handle the unexpected.
- Building Resiliency: Detect and address system vulnerabilities to ensure uptime.
- Disaster Recovery: Validate failover processes and test RTO/RPO compliance.
- Compliance Validation: Prove system readiness for audits or regulatory requirements.
- Improving Site Reliability: Test infrastructure components like load balancers and API gateways under stress.
Tool for Chaos Engineering: Steadybit
Steadybit leads the way with features tailored for ease of use and maximum impact.
Steadybit simplifies Chaos Engineering with an intuitive platform for designing, running, and analyzing experiments. It’s built for teams that want reliable systems without the complexity. Steadybit includes:
- Intuitive experiment design interface
- Prebuilt and customizable failure scenarios
- Real-time observability and safety controls
- Seamless integration with monitoring tools like Datadog and New Relic
Observability Integrations
Datadog and New Relic are powerful monitoring and observability tools, offering real-time insights into system performance.
Datadog captures data from servers, containers, databases, and third-party services, providing comprehensive visibility for cloud-scale applications. Integrated with Steadybit, it tracks the impact of chaos experiments in real time, enabling teams to correlate chaos events with changes in system metrics and logs.
Similarly, New Relic delivers robust application performance monitoring with a focus on distributed systems, offering detailed insights into applications, infrastructure, and digital customer experiences.
Users can customize their observability environment with features like custom applications and dashboards, allowing for tailored analyses of chaos experiments and greater control over system health. Both tools complement Steadybit’s capabilities, ensuring actionable insights for building resilient systems.
Metrics and Observability
To learn from chaos experiments, you need the right data. Steadybit integrates with monitoring tools so teams can track and act on key metrics, including:
- Latency: Measure response times under stress.
- Error Rates: Identify transaction failures.
- Resource Utilization: Monitor CPU, memory, and network usage.
By integrating with leading observability tools, Steadybit ensures that every experiment provides actionable data.
Challenges in Chaos Engineering and How Steadybit Overcomes Them
While Chaos Engineering delivers tremendous value, teams often face challenges when adopting it. Steadybit is designed to address these obstacles, making it easier to embrace chaos experiments and integrate them into daily workflows.
Cultural Resistance
Many teams hesitate to embrace Chaos Engineering due to fears of causing unnecessary disruptions or the perception that it adds extra work.
Steadybit tackles this head-on with a user-friendly platform that demystifies the process.
By offering intuitive workflows and safety mechanisms, Steadybit helps teams see Chaos Engineering as a manageable and valuable practice rather than a risky or overwhelming initiative.
Resource Constraints
Organizations often worry about the time and effort required to implement Chaos Engineering.
Steadybit reduces this burden by enabling targeted, focused experiments that deliver high-value insights without consuming excessive resources.
The platform’s integration with existing tools and workflows ensures minimal disruption while still providing actionable data.
Technical Complexity
The technical complexity of creating and managing chaos experiments can be daunting, particularly for teams new to the practice. Steadybit simplifies the entire process, from designing experiments to analyzing results.
Predefined scenarios, customizable templates, and real-time observability tools lower the barrier to entry, enabling teams to focus on learning and improving rather than wrestling with implementation.
Getting started with Chaos Engineering
Starting with Chaos Engineering can seem complex, but Steadybit ensures a smooth and straightforward journey.
Here’s how to begin:
1. Define Critical Metrics
Identify and establish baseline metrics that represent your system’s steady state, such as latency, error rates, or throughput. These metrics will help you measure the impact of experiments and evaluate system health.
2. Start Small
Begin with a narrow scope to limit the potential impact. For example, test a single service or a non-production environment first. Steadybit’s platform helps you control the blast radius to ensure safe experimentation.
3. Experiment and Refine
Use the insights from initial experiments to strengthen your systems. Steadybit’s guided workflows and detailed observability features provide clear, actionable feedback, helping teams iterate and improve with confidence.
4. Scale Up Gradually
Once you’ve gained confidence in smaller experiments, expand the scope to include more critical systems or larger portions of your infrastructure. Steadybit makes it easy to scale chaos experiments safely.
Get Started: Ready to explore Chaos Engineering? Schedule a demo with Steadybit and transform your systems into resilience powerhouses.