
Learning Chaos Engineering

What is Chaos Engineering?

Chaos engineering is the practice of experimenting on a system in order to build confidence in the system’s capability to gracefully handle a variety of conditions in production.

These proactive experiments intentionally put systems under stress so teams can identify weaknesses before they surface in production and negatively impact customers. Experiments often simulate real-world conditions like high traffic, unexpected outages, latency in system dependencies, or resource bottlenecks. By testing the reliability of your systems with different experiment types, software teams can build a predictable understanding of how their systems behave.

While there is a long history of intentional fault injection in software development, the discipline of chaos engineering was popularized in 2010 by the reliability team at Netflix with their internal tool Chaos Monkey, which randomly terminated instances in production.

Chaos engineering has evolved since then with the advent of new experiment types and enterprise chaos engineering tools like Steadybit and Gremlin. Teams can now deploy highly controlled and customized experiments across their systems in a few clicks. It’s never been safer and easier to run reliability tests to prove the resilience of your systems.

How Reliability Experts Describe Chaos Engineering

Learn the basics of chaos engineering from industry leaders who have successfully run programs at major companies.

"What is Chaos Engineering?" with Casey Rosenthal

In this clip from "Experiments in Chaos", Casey Rosenthal shares how he collaborated with resilience leaders to precisely define chaos engineering.

Read More
Switching from Reactive to Proactive Reliability Approaches

Adrian Hornsby shares his perspective on why teams struggle to shift to a more proactive reliability approach.

Read More
Introducing Chaos Engineering Practices Into Your Organization

In this clip, Tom Handal from Roblox shares his story of bringing chaos engineering into the organization at scale.

Read More

How Does Chaos Engineering Work?

The goal of chaos engineering is to learn about your systems by running experiments, either in pre-production or production environments. Like any experiment, there are standard steps you would follow to ensure that you are learning from your efforts:

By following a systematic process and challenging assumptions, teams are better able to generate actionable insights into system performance.

Step 1: Define Expectations

Start by identifying your system’s baseline behavior using metrics like latency, throughput, or error rates. Observability or monitoring tools are critical in providing visibility into these types of metrics, so you can develop specific benchmarks. If you haven’t configured a monitoring tool, that is often a good first step before diving deeper into chaos engineering.

Step 2: Hypothesize Outcomes

Next, create a hypothesis for how you expect your system to behave when a specific fault or stress is introduced. This hypothesis should include a trackable metric, such as maintaining a certain performance standard or preserving specific application functionality (e.g. “If the primary database fails, read replicas will maintain query availability.”)

A successful experiment is one where the hypothesis was correct (e.g. “We expect average CPU utilization to not exceed 50% at any point”). A failed experiment means that the hypothesis wasn’t correct and adjustments need to be made.
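A hypothesis like this can be made concrete as a small check against collected metric samples. The sketch below is a minimal illustration; the `hypothesis_holds` helper and the sample values are hypothetical, not part of any specific tool:

```python
# Sketch: encode a hypothesis as a checkable assertion over metric samples.
# Hypothesis: "We expect CPU utilization to not exceed 50% at any point."
# The sample data below is illustrative, not from a real system.

def hypothesis_holds(cpu_samples, threshold=50.0):
    """Return True if no CPU sample exceeded the threshold percentage."""
    return max(cpu_samples) <= threshold

# Samples collected during the experiment window
samples = [22.5, 31.0, 47.9, 38.2]
print(hypothesis_holds(samples))           # experiment "succeeds"
print(hypothesis_holds(samples + [63.4]))  # experiment "fails"
```

Framing the hypothesis as an executable check also makes it easy to rerun automatically when the experiment is repeated later.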

Step 3: Run an Experiment

Manually run your experiment by introducing your test conditions. Typically, this involves artificially interrupting services, adding latency spikes, stressing CPU or memory, or simulating an outage. We’ll explain how teams actually inject these failures in a section below, but it typically requires using a script or ClickOps to make these changes, or utilizing a chaos engineering platform to orchestrate experiments.
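To illustrate the idea of latency injection without touching real infrastructure, here is a minimal in-process sketch. The `inject_latency` decorator and `fetch_user` function are hypothetical stand-ins for a real fault-injection mechanism, which would typically operate at the network or platform level:

```python
import functools
import time

def inject_latency(delay_seconds):
    """Decorator that adds a fixed artificial delay before a call --
    a toy stand-in for network-level latency injection."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(delay_seconds)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(0.2)  # 200 ms of added latency
def fetch_user(user_id):
    # placeholder for a real downstream call
    return {"id": user_id, "name": "example"}

start = time.perf_counter()
fetch_user(42)
elapsed = time.perf_counter() - start
print(f"call took {elapsed:.3f}s")  # roughly 0.2s slower than normal
```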

Step 4: Observe Behavior

As your experiment runs, use your monitoring tools to compare actual performance against expected outcomes. For example, if you expect the service to degrade somewhat when latency is injected, you can monitor whether the degradation exceeds your expectations. Once you review the logs, you should be able to determine whether your experiment was a success or a failure, or in other words, whether your hypothesis was correct.
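Comparing observed behavior against the baseline can be as simple as computing the degradation and checking it against the tolerance your hypothesis defined. A minimal sketch, with illustrative numbers:

```python
def degradation_pct(baseline_ms, observed_ms):
    """Percentage increase of observed latency over the baseline."""
    return (observed_ms - baseline_ms) / baseline_ms * 100

# Illustrative numbers: 120 ms baseline, 150 ms observed during the experiment
change = degradation_pct(120.0, 150.0)
print(f"latency degraded by {change:.0f}%")  # 25%

tolerated = change <= 30.0  # hypothesis: degradation stays under 30%
print("hypothesis held" if tolerated else "hypothesis failed")
```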

Step 5: Refine and Repeat

Now that you have run your experiment, you can make use of the results. 

If your hypothesis was correct, then you have accurately predicted how your system would behave. It may have even exceeded your expectations. If you predicted a positive reliability result, then you should have more confidence in your systems now. 

If your hypothesis was proven incorrect and your system was unable to perform adequately under the conditions, then you have likely revealed a reliability issue that could be a liability in production if the experiment conditions were to become a reality. Once you have made changes to address the issue, you can then rerun the experiment to validate that it is now successful.

A successful experiment validates that, at the time of running the experiment, the hypothesis was correct. To ensure that you maintain this result over time, teams will pick certain experiments to automate and incorporate in their CI/CD workflows.

Key Principles of Chaos Engineering

These practices are important for implementing chaos engineering and reliability testing in an effective way.


Start Small and Focused

Start by selecting smaller targets to limit the impact of your experiment. For example, instead of running experiments in production, start in a non-production environment to lower the potential negative impact. Instead of testing a full cluster, target a subset. Instead of a critical service, start with a service that has fewer downstream impacts.

Start small, and once you have seen success with your experiments, you can gradually expand to incorporate multiple services.


Communicate Early and Often

Each new experiment offers an opportunity to learn how your sociotechnical systems behave and respond under certain conditions. Those systems include both the applications and the teams supporting them.

It’s critical to proactively communicate ahead of time so everyone has visibility and understands the potential impacts of experiments. With clear communication, you can educate teams internally on why you are running experiments and share key insights from the results.

Read More: Principles of Ethical Chaos Engineering


Enable Teams to Experiment Safely

If you want to scale reliability testing and chaos engineering across an organization, it’s critical that it’s easy to create and run experiments.

Chaos engineering platforms like Steadybit are designed to give teams the ability to discover targets, analyze reliability gaps, and launch experiments with a drag-and-drop timeline editor. Automated development workflows can then use experiments to validate reliability standards without slowing down development teams. Adoption depends on strong enablement.

Getting Started with Chaos Engineering

Starting with chaos engineering can seem complex, but it’s similar to adopting any other type of new practice.

Here’s some guidance on how to begin:

1. Set System Benchmarks and Metric-Based Goals

Before you run experiments, you should start by establishing baseline metrics that represent your system’s steady state, such as latency, error rates, or throughput. You can track these metrics in your Observability tool, and they will help you gauge the impact of experiments on your system health.

Once you have collected metrics on your steady state, you can create goals for your chaos engineering efforts. For example, maybe you want to reduce the number of critical incidents per quarter by 20% on tested services. Alternatively, you could aim to create experiments that replicate your last 3 critical incidents so you can continually validate your fixes moving forward. Every organization is different, so pick goals that make the most sense for your use case.

2. Start Small to Limit Negative Impacts

Begin with a narrow scope to limit the potential impact of experiments. It’s better to increase your scope as you gain confidence than to risk running an experiment with outsized negative impacts. For example, test a single service or a non-production environment first. Chaos engineering tools like Steadybit let you easily adjust the blast radius of your experiments, so you can start experimenting safely and expand targets later.

3. Experiment and Make Adjustments

Use the insights from your initial experiments to strengthen your systems. By running experiments, you validate either that your expectations of your system were correct or wrong. You can either adjust your expectations or make changes to ensure that your systems are resilient enough to meet your initial hypothesis.

With this iterative cycle, you should be able to prove out different system behaviors and validate your SLOs.

4. Scale Up Gradually

Once you’ve gained confidence in smaller experiments, expand the scope to include more critical systems or larger portions of your infrastructure. While you may want to start experimenting on non-critical systems to develop your process, testing the reliability of your business critical systems is where you will gain the most ROI from chaos engineering.

Get Started: Ready to explore chaos engineering? Schedule a demo with Steadybit and transform your systems into resilience powerhouses.

Run Experiments Across SaaS and On-Prem Systems

Learn more about the most common types of reliability tests you can run for different technologies.


Chaos Experiment Library

Browse an open source library of experiment actions.


Chaos Engineering for Kubernetes

Stress test pods & clusters.


Chaos Engineering for AWS

Run experiments on targets across the AWS ecosystem.


Chaos Engineering for Azure

Run experiments on Azure VMs and containers.

Inject Faults with Chaos Experiments

Here are examples of various types of attacks you can run to stress test your systems.

Network
Kubernetes
Cloud Services
Physical & Virtual Hosts
Applications
Observability

System Failures are Inevitable in Complex Systems

Organizations need to shift their mindset to start developing truly fault-tolerant systems.

Accepting that System Failures are Inevitable

In complex systems, failures are bound to occur. Hear how teams can build systems that are designed to be resilient.

Read More
Explaining the Prevention Paradox

When things don't break, SRE teams rarely get the attention that they do during an outage. This "Prevention Paradox" can make it challenging to advocate for reliability efforts.

Read More
Introducing Chaos Engineering as a Daily Practice

Hear how the teams at Roblox started with familiar tests to begin adopting chaos engineering practices.

Read More

Common Challenges When Rolling Out Chaos Engineering

When engineering teams want to start a chaos engineering program, they often get pushback if organization leadership hasn’t already bought in. Here are some of the challenges that teams face when working to roll out chaos engineering across their organization:

Roadmap Competition

While proactive reliability programs can deliver tremendous business value, they are sometimes harder to sell than developing new features or optimizing existing investments, such as Observability tooling or new tools within a cloud provider ecosystem. Chaos engineering is a new practice for many engineers, so teams may be hesitant to embrace this approach and make time for exploratory learning.

Solution: Identify your most unreliable customer-facing services and quantify the business impact of downtime.

Cultural Resistance

Many teams hesitate to embrace chaos engineering due to fears of causing unnecessary disruptions or the perception that it adds extra work. When engineers are stuck in a reactive mode, they are constantly context-switching and responding to new alerts. Adding more “chaos” can seem overwhelming, even though each experiment is meant to bring order to the chaos that already exists in production.

Solution: Run GameDays or reliability workshops to give engineers an opportunity to get familiar with running experiments.

Resource Constraints

Implementing chaos engineering across applications requires time, tooling, and cooperation. Organizations budget for a certain headcount and tool expenses, so increasing spend requires making a thoughtful business case and gaining buy-in from leadership.

Few organizations budget for the revenue loss expected from outages, but outages still inevitably occur. If executives are not already convinced of the value of delivering highly available applications, it may take a major incident to get their attention and budget approval.

Solution: Create a business case that shows the potential value of preventing critical incidents earlier in the development lifecycle.

Technical Complexity

If you aren’t confident in how your systems run currently, it might seem difficult to conduct experiments reliably as well. Running experiments with open source scripts in an unstandardized way, generating ad hoc results, is the easiest way to start chaos engineering, but it is also the hardest approach to scale and get predictable value from.

Deploying experiments across different technologies requires nuance and engineering know-how. This complexity often keeps teams from moving forward, since their initial steps don’t seem to lead down a clear path.

Solution: Adopt a chaos engineering solution that can standardize experiments and deploy across a wide variety of tools & technologies.

Learn About the Role of Chaos Engineering

These are some of the topics you can start to review.


Top Chaos Engineering Tools

Compare approaches to find the right tool for you.


The ROI of Chaos Engineering

Read about how you can build a reliability business case.


DORA Compliance

Learn how chaos engineering helps with DORA compliance.

The Steadybit Academy - Learning from Chaos

Getting Started with Steadybit

Explore tutorials to get started with a tool like Steadybit.

Measuring the Business Value of Chaos Engineering

So far, we’ve defined chaos engineering, explained the key principles, and outlined examples, but how can you make a business case for prioritizing chaos engineering over other initiatives?

While there are lots of ways to calculate an ROI, these are the most common approaches:

Preventing Costly Incidents in Production

If you have applications that are associated with revenue or relied upon by customers, downtime is a tangible cost. You can project your current incident costs by calculating:

number of incidents in the last year × business cost of each hour of downtime × average MTTR (hours) = annual incident costs

If you break it out by incident severity, it could look something like this:

Current Reliability Profile   | # in the Last Year | Business Cost Per Hour | Avg. MTTR (Hours) | Total Incident Costs
Incidents – Low Severity      | 400                | $5,000                 | 1.5               | $3,000,000
Incidents – Medium Severity   | 20                 | $10,000                | 1.5               | $300,000
Incidents – Critical Severity | 2                  | $200,000               | 2.5               | $1,000,000
Total                         |                    |                        |                   | $4,300,000

With this approach, you can then project savings by estimating a reduction in the number of production incidents per year due to proactive reliability testing. In the above example, if implementing chaos engineering was able to help reduce incidents by 20%, then that would result in annual savings of $860,000. You can try our interactive ROI calculator to see how chaos engineering could benefit your organization.
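The calculation above can be sketched in a few lines of Python, reproducing the example table’s figures (the numbers are illustrative, drawn from the table, not from any real organization):

```python
# Sketch of the incident-cost model, using the example table's figures.

severities = [
    # (label, incidents per year, business cost per hour, avg MTTR in hours)
    ("low",      400,   5_000, 1.5),
    ("medium",    20,  10_000, 1.5),
    ("critical",   2, 200_000, 2.5),
]

annual_cost = sum(count * cost_per_hour * mttr
                  for _, count, cost_per_hour, mttr in severities)
print(f"annual incident cost: ${annual_cost:,.0f}")  # $4,300,000

reduction = 0.20  # assume chaos engineering cuts incidents by 20%
print(f"projected annual savings: ${annual_cost * reduction:,.0f}")  # $860,000
```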

Finding and Mitigating Reliability Risks

If you have applications with outstanding reliability issues, they carry a potential risk of downtime. Even if your organization hasn’t experienced significant downtime yet, there is meaningful value in finding and fixing these issues and reducing the overall risk of outages.

If you utilize the incident cost calculation above, you can determine a value for each reliability risk by severity. For example, if by running chaos experiments, you are able to identify a critical severity issue and fix it, you have prevented an issue that could have resulted in $500,000 in damages if it had occurred in production. 

As you run more experiments and scale your testing across applications, you will uncover your existing reliability issues and streamline your ability to remediate risk across your systems. If you run 100 experiments and 10 of them reveal reliability issues that need to be fixed, you can fix all or a subset of those issues and document the potential business liabilities that have now been addressed.

Improving Incident Response and Operational Readiness

If teams rely on incidents occurring in production to test runbooks, run root cause analysis, and remediate issues, they are missing the opportunity to practice and optimize their response in a low-stress simulated scenario.

Running chaos experiments enables organizations to continually sharpen their incident response processes, validate that runbooks are up-to-date, and improve their Mean Time To Resolve (MTTR) issues.

Fidelity Investments recently presented on how they scaled chaos engineering efforts across their applications. As they expanded their “chaos coverage”, they were able to meaningfully decrease their MTTR.

Hear Chaos Engineering Journeys from Idea to Roll Out

In these videos, reliability leaders share their experience rolling out chaos engineering and measuring its impact.

The Impact of Fidelity Investments Rolling Out Chaos Engineering

See the impact to average MTTR as the team at Fidelity rolled out chaos engineering.

How Salesforce is Using Chaos Engineering to Achieve Reliability

Hear from Krishna Palati about how engineers at Salesforce are using Steadybit to run chaos experiments.

Read More
Building System Confidence at ManoMano

Antoine Choimet shares his experience as a chaos engineer at ManoMano.

Read More

How should chaos engineering integrate with Observability tools?

Observability tools like DataDog, Dynatrace, Honeycomb, Grafana, New Relic, and more offer organizations real-time insights and monitoring into their system performance.

For example, DataDog captures data from servers, containers, databases, and third-party services, providing comprehensive visibility for cloud-scale applications. When integrated with a chaos engineering tool like Steadybit, it tracks the impact of chaos experiments in real time, enabling teams to correlate chaos events with changes in system metrics and logs.

Similarly, New Relic delivers robust application performance monitoring with a focus on distributed systems, offering detailed insights into applications, infrastructure, and digital customer experiences.

Users can customize their observability environment with features like custom applications and dashboards, allowing for tailored analyses of chaos experiments and greater control over system health.

To learn from chaos experiments, you need the right data. It’s important to have a tight integration between your chaos engineering tool and monitoring tools so teams can track and act on key metrics like latency and error rates.

For example, Steadybit integrates with monitoring tools to track metrics like:

  • Latency: Measure response times under stress.
  • Error Rates: Identify transaction failures.
  • Resource Utilization: Monitor CPU, memory, and network usage.
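As a minimal illustration of deriving such metrics, here is a sketch that computes average latency and error rate from raw request records; the field names are illustrative, not any specific tool’s schema:

```python
# Sketch: derive latency and error-rate metrics from raw request records.
# The records and field names below are illustrative.

requests = [
    {"latency_ms": 110, "status": 200},
    {"latency_ms": 480, "status": 200},
    {"latency_ms": 95,  "status": 503},
    {"latency_ms": 130, "status": 200},
]

latencies = [r["latency_ms"] for r in requests]
errors = sum(1 for r in requests if r["status"] >= 500)

print(f"avg latency: {sum(latencies) / len(latencies):.1f} ms")
print(f"error rate:  {errors / len(requests):.0%}")
```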

What tools do people use for chaos engineering?

Open Source Chaos Engineering Tools

There are many open source tools for chaos engineering like Chaos Monkey, LitmusChaos, Chaos Mesh, and Pumba. For the most part, these open source tools are technology specific and good for initial experimentation.

For anyone trying to roll out chaos engineering and reliability testing at a multi-team or organization-wide level, commercial tools offer easier deployment, better user experience, and enterprise-grade security features.

Commercial Chaos Engineering Tools

Some companies build their own custom chaos engineering solutions, but then struggle with the maintenance and development of new functionality. Instead of building a custom solution, many teams will opt to buy a commercial platform.

The three leading commercial tools are Steadybit, Gremlin, and Harness Chaos.

If you are interested in evaluating a commercial chaos engineering platform, you’ll find that Steadybit is the easiest to use, customize, and deploy due to its open source extension framework and timeline-based experiment editor.

What are examples of chaos experiments?

There are a wide range of possible chaos experiments you could run to test your systems. In this section, we’ll outline common experiment types and list specific examples for each.

Dependency Failures

With the rise of microservices, systems are more reliant on internal and external dependencies to fulfill requests effectively. Experiments that simulate service failures allow teams to see how their systems respond to outages.

Here are experiment examples: 

  • Simulate increased latency or packet loss to test service response times and throughput
  • Emulate the unavailability of a critical service to observe the system’s resilience and failure modes
  • Introduce connection failures or read/write delays to assess the robustness of data access patterns and caching mechanisms
  • Mimic rate limiting or downtime of third-party APIs to evaluate external dependency management and error handling.
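As a small illustration of what a dependency-failure experiment exercises, here is a sketch of a cache fallback; `DependencyUnavailable`, `fetch_price`, and the cache are hypothetical stand-ins for a real service and its degradation path:

```python
# Sketch: the fallback pattern a dependency-failure experiment would exercise.

class DependencyUnavailable(Exception):
    pass

def fetch_price(product_id, primary, cache):
    """Try the primary service; fall back to a cached value if it is down."""
    try:
        return primary(product_id)
    except DependencyUnavailable:
        return cache.get(product_id)  # stale but available

def broken_primary(product_id):
    # stand-in for a dependency taken down by the experiment
    raise DependencyUnavailable("simulated outage")

cache = {"sku-1": 9.99}
print(fetch_price("sku-1", broken_primary, cache))  # 9.99, from the cache
```

An experiment that takes the primary service down validates that this fallback path actually works, rather than assuming it does.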

Resource Constraints

Each system has resource limitations on things like CPU, memory, disk I/O, and network bandwidth. Even with cloud providers and autoscaling options, it’s useful to run experiments that stress your systems to see how resource constraints impact performance. 

Here are experiment examples: 

  • Simulate memory leaks or high memory demands to test the system’s handling of low memory conditions.
  • Increase disk read/write operations or fill up disk space to observe how the system copes with disk I/O bottlenecks and space limitations.
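For illustration, a crude in-process memory stressor can be sketched in a few lines; real experiments would typically use a purpose-built stress action from a chaos engineering tool (or a utility like stress-ng) for safer, more controllable pressure:

```python
# Sketch: a crude in-process memory stressor, for illustration only.

def allocate_mb(megabytes):
    """Hold roughly `megabytes` of memory to simulate memory pressure."""
    return bytearray(megabytes * 1024 * 1024)

block = allocate_mb(64)  # hold ~64 MB while the experiment runs
print(f"holding {len(block) // (1024 * 1024)} MB")
del block                # release when the experiment ends
```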

Network Disruptions

Various network conditions can affect a system’s operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, experiments can help show how a system responds and adapts to network unreliability.

Here are experiment examples: 

  • Introduce DNS resolution issues to evaluate the system’s reliance on DNS and its ability to use fallback DNS services.
  • Introduce artificial delay in the network to simulate high-latency conditions, affecting the communication between services or components.
  • Simulate the loss of data packets in the network to test how well the system handles data transmission errors and retries.
  • Limit the network bandwidth available to the application, simulating congestion conditions or degraded network services.
  • Force abrupt disconnections or intermittent connectivity to test session persistence and reconnection strategies.
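To illustrate what a packet-loss experiment tests, here is a deterministic sketch of a retry loop; `send`, the drop schedule, and `send_with_retries` are illustrative stand-ins for a real transport:

```python
# Sketch: simulated packet loss with a bounded retry loop, illustrating the
# behavior a packet-loss experiment exercises.

def send(payload, drops):
    """Pretend to transmit; raise if the next packet is scheduled to drop."""
    if drops and drops.pop(0):
        raise TimeoutError("packet lost")
    return "ack"

def send_with_retries(payload, drops, retries=5):
    for attempt in range(1, retries + 1):
        try:
            return send(payload, drops), attempt
        except TimeoutError:
            continue
    raise TimeoutError(f"gave up after {retries} attempts")

# first two transmissions are dropped, the third gets through
result, attempts = send_with_retries("ping", drops=[True, True, False])
print(f"delivered '{result}' after {attempts} attempts")  # 3 attempts
```

A real experiment would introduce loss at the network layer and verify that the system’s actual retry and timeout behavior matches expectations like these.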

Read More: Types of Chaos Experiments (+ How To Run Them According to Pros)

If you want to see even more specific experiment templates, you can view over 80 examples in the Reliability Hub, an open source library sponsored by Steadybit.

What are GameDays?

GameDays are events that Site Reliability Engineering teams run to strengthen observability and SRE skills across software development teams. These events will often feature one or multiple chaos experiments, where teams will need to respond to an incident and troubleshoot the situation. This could involve racing team members to uncover the root cause of an incident.

These workshops are an opportunity for everyone to learn about their systems and share knowledge across teams to improve organizational resilience.

Rolling out chaos engineering at your organization?

Reach out to consult with our team of experts and hear how Steadybit could help.

Tune in to "Experiments in Chaos" to learn more

Start Running Experiments

Want to try chaos engineering today?

Create a free trial with Steadybit to build experiments and start testing your systems.
