The Role of Chaos Engineering in the Reliability Pillar of the AWS Well-Architected Framework

11.09.2025 Patrick Londa - 6 minutes

Reliability is one of the six core pillars of the AWS Well-Architected Framework, a collection of best practices for configuring and running applications and services in the AWS ecosystem. By applying this guidance when setting up new projects or adapting existing ones, you can continually improve your reliability posture.

AWS customers agree to a shared responsibility model, with the balance shaped by the types of resources and services used. For example, AWS is responsible for the overall infrastructure that runs AWS Cloud services, including hardware, software, networking, and facilities. If customers use self-managed services like EC2 instances, they are responsible for all of the resiliency configuration and management themselves, including categories like:

  • Networking, Quotas, and Constraints
  • Workload Architecture
  • Observability and Failure Management
  • Continuous Testing of Critical Infrastructure
  • Change Management and Operational Resilience

In this post, we’ll explore the role of chaos engineering within the Reliability pillar of the AWS Well-Architected Framework, and dive into how organizations can utilize chaos experiments to continually optimize their resilience for each of these categories.

Designing a Reliable Network and Service Architecture

Whether you are setting up a new project or adapting an existing one, the two areas of the Reliability Pillar that provide the most configuration guidance are “Networking, Quotas, and Constraints” and “Workload Architecture”.

For example, in the “Networking, Quotas, and Constraints” category, there are specific best practices outlined for configuring your network topology, like:

  • Use highly available network connectivity for your workload public endpoints
  • Provision redundant connectivity between private networks in the cloud and on-premises environments
  • Ensure IP subnet allocation accounts for expansion and availability
  • Prefer hub-and-spoke topologies over many-to-many mesh
  • Enforce non-overlapping private IP address ranges across all connected private address spaces (a quick overlap check is sketched below)
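As a concrete illustration of that last point, here is a minimal Python sketch that checks a set of connected private address spaces for overlapping CIDR ranges. The range names and CIDRs are hypothetical placeholders, not values taken from the framework; substitute your own VPC and on-premises allocations.

```python
# Minimal sketch: detect overlapping CIDR ranges across connected address spaces.
# The names and ranges below are hypothetical placeholders.
from ipaddress import ip_network
from itertools import combinations

connected_ranges = {
    "vpc-prod": ip_network("10.0.0.0/16"),
    "vpc-staging": ip_network("10.1.0.0/16"),
    "on-prem-dc": ip_network("10.0.128.0/17"),  # deliberately overlaps vpc-prod
}

for (name_a, net_a), (name_b, net_b) in combinations(connected_ranges.items(), 2):
    if net_a.overlaps(net_b):
        print(f"Overlap detected: {name_a} ({net_a}) and {name_b} ({net_b})")
```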

The “Workload Architecture” section provides recommendations like moving to distributed services or microservices to gain added flexibility over capabilities like throttling, retries, queue management, and timeouts. 
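To make the retries-and-timeouts point concrete, here is a minimal, illustrative Python sketch of a client-side retry with exponential backoff and a per-request timeout. The endpoint and limits are hypothetical placeholders, and a production client would typically add jitter and a circuit breaker on top.

```python
# Minimal sketch: retry an HTTP call with exponential backoff and a hard timeout.
# The URL, attempt count, and timeout below are hypothetical placeholders.
import time
import urllib.error
import urllib.request


def fetch_with_retries(url: str, attempts: int = 3, timeout_s: float = 2.0) -> bytes:
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(2 ** attempt)  # back off: 1s, 2s, ...


if __name__ == "__main__":
    fetch_with_retries("https://example.internal/health")  # hypothetical endpoint
```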

In short, these two sections outline best practices for how networks and applications should be configured for optimal availability. So how does chaos engineering factor in?

If you want to confirm that you have configured your systems correctly, chaos experiments and reliability tests provide a quick way to validate your setup and see how services respond under adverse conditions.

For example, you could run a simple load test to see whether the hypothesis, “I expect our IP subnet allocation to scale to accommodate X amount of traffic”, holds true. This test would show you whether your network is configured to scale sufficiently in isolation. But what if you also add latency into the mix? Would it still work as expected?

Chaos experiments enable teams to go beyond happy-path testing and truly validate that their infrastructure is configured to handle a variety of scenarios.
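As a rough sketch of what such a check might look like (not something prescribed by the framework), the Python harness below fires concurrent requests at an endpoint and asserts an error-rate and p95-latency hypothesis while your chaos tool of choice, be it Steadybit, AWS FIS, or plain tc netem, injects latency in the background. The URL, request counts, and thresholds are hypothetical placeholders.

```python
# Minimal sketch: test the hypothesis "the endpoint stays under a 1% error rate and
# 500 ms p95 latency at 20 concurrent requests" while latency is being injected.
# The URL, volumes, and thresholds are hypothetical placeholders.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://example.internal/checkout"  # hypothetical endpoint under test
REQUESTS = 200
CONCURRENCY = 20


def timed_request(_):
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(timed_request, range(REQUESTS)))

error_rate = 1 - sum(ok for ok, _ in results) / REQUESTS
p95 = statistics.quantiles([latency for _, latency in results], n=20)[18]
print(f"error rate={error_rate:.1%}, p95={p95 * 1000:.0f} ms")
assert error_rate < 0.01 and p95 < 0.5, "hypothesis rejected under injected latency"
```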

Conversely, if you are running chaos experiments on an existing environment, you can see where your system is failing and review these Well-Architected guidelines to see if there are configuration gaps you can address.

Implementing Observability and Disaster Recovery Processes

Observability is critical for monitoring how your applications are performing. If you don’t know when your services are down, how can you gauge your organization’s level of resilience and prioritize improvements?

Observability tools like Datadog, Dynatrace, and Grafana provide teams with visualization and alerting capabilities so they can more easily monitor how their services are performing. When combined, observability and chaos engineering give teams the ability not only to monitor their systems, but also to provoke them with failure injection and see how they respond and whether alerts are configured correctly.

Just as you can use chaos engineering to validate your infrastructure configuration, you can also use it to validate your observability configuration. For example, you can trigger a CrashLoopBackOff in Kubernetes and verify that the corresponding Grafana alert fires as expected. Without an experiment that can simulate these issues, you have to wait for them to occur naturally in production and troubleshoot them in the moment.
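One hedged way to automate that check: the sketch below creates a deliberately crashing pod and then polls for the expected alert. It assumes kubectl access to a non-production cluster, an existing chaos-tests namespace, and that your Grafana alerts are routed through a Prometheus Alertmanager-compatible API; the pod name, namespace, alert name, and URL are all hypothetical placeholders.

```python
# Minimal sketch: trigger a CrashLoopBackOff and verify that the expected alert fires.
# All names and URLs below are hypothetical placeholders; run only against a test cluster.
import json
import subprocess
import time
import urllib.request

CRASHING_POD = """
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-experiment
  namespace: chaos-tests
spec:
  restartPolicy: Always
  containers:
  - name: crasher
    image: busybox
    command: ["sh", "-c", "exit 1"]
"""

ALERTMANAGER_URL = "http://alertmanager.monitoring:9093/api/v2/alerts"
EXPECTED_ALERT = "KubePodCrashLooping"  # adjust to your own alert rule name

# Create the crashing pod (assumes the chaos-tests namespace already exists).
subprocess.run(["kubectl", "apply", "-f", "-"], input=CRASHING_POD, text=True, check=True)

try:
    deadline = time.time() + 15 * 60  # the alert usually needs several restarts to fire
    while time.time() < deadline:
        with urllib.request.urlopen(ALERTMANAGER_URL, timeout=10) as resp:
            alerts = json.load(resp)
        if any(a["labels"].get("alertname") == EXPECTED_ALERT for a in alerts):
            print("Alert fired as expected")
            break
        time.sleep(30)
    else:
        raise SystemExit("Alert never fired: check your alert rules and routing")
finally:
    # Clean up the experiment pod regardless of the outcome.
    subprocess.run(["kubectl", "delete", "pod", "-n", "chaos-tests", "crashloop-experiment"])
```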

Disaster recovery is another situation where you don’t want to wait around to see what happens in production. With chaos experiments, you can proactively test whether your failover processes work effectively when databases become unavailable or networks go down. 

With validated alerts, you’ll know when services go down, and you can ensure you have tested backup processes in place to fail over to healthy resources and automate healing. By running experiments that prove the efficacy of your failover processes, you also gather documentation that is useful for meeting DORA compliance requirements.
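As one hedged example of what such a test can look like on AWS, the sketch below forces a failover on an Aurora test cluster with boto3 and measures how long it takes for a new writer to be promoted. The cluster identifier and region are hypothetical placeholders, and application-level checks (does your service reconnect on its own?) tell you even more than this control-plane view.

```python
# Minimal sketch: force an Aurora cluster failover and time the writer promotion.
# Run this only against a test cluster; identifier and region are hypothetical.
import time

import boto3

rds = boto3.client("rds", region_name="eu-central-1")
CLUSTER_ID = "orders-db-test"  # hypothetical test cluster


def current_writer(cluster_id: str) -> str:
    """Return the DB instance currently acting as the cluster writer."""
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    return next(m["DBInstanceIdentifier"] for m in cluster["DBClusterMembers"] if m["IsClusterWriter"])


old_writer = current_writer(CLUSTER_ID)
rds.failover_db_cluster(DBClusterIdentifier=CLUSTER_ID)
started = time.time()

# Poll until a different instance has been promoted to writer.
while current_writer(CLUSTER_ID) == old_writer:
    time.sleep(5)

print(f"New writer promoted after {time.time() - started:.0f}s (previous writer: {old_writer})")
```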

Continuous Testing and Change Management

So far, we have explained how chaos engineering can be used to validate infrastructure configuration, observability alerts, and disaster recovery processes. If environments never changed, you could stop there, but most modern systems are in a constant state of change and evolution.

Running continuous testing ensures that as your system changes, you maintain an optimal reliability posture. As we mentioned earlier, functional and performance tests often check for happy-path scenarios where systems are working as expected. 

The “Continuous Testing of Critical Infrastructure” category specifically recommends running chaos experiments to prepare for unexpected situations:

“Create and document repeatable experiments to understand how your system behaves when things don’t work as expected. These tests will prove effectiveness of your overall resilience and provide a feedback loop for your operational procedures before facing real failure scenarios.”

One-off chaos experiments provide teams with valuable feedback about how systems perform, but scheduling and automating experiments to repeat as part of a CI/CD workflow delivers continuous feedback with each deployment.
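For teams on AWS, one hedged example of that automation is a small Python step, using boto3 and AWS FIS, that the pipeline runs after each deployment and that fails the build if the experiment does not complete successfully. The experiment template ID and region are hypothetical placeholders, and the same pattern applies if you trigger experiments through another platform’s API instead.

```python
# Minimal sketch of a CI/CD gate: start a pre-defined AWS FIS experiment and fail
# the pipeline if it does not complete successfully. Template ID is a placeholder.
import sys
import time

import boto3

fis = boto3.client("fis", region_name="eu-central-1")
TEMPLATE_ID = "EXT1a2b3c4d5e6f7"  # hypothetical experiment template

experiment = fis.start_experiment(experimentTemplateId=TEMPLATE_ID)["experiment"]

# Poll until the experiment reaches a terminal state.
while True:
    state = fis.get_experiment(id=experiment["id"])["experiment"]["state"]["status"]
    if state in ("completed", "stopped", "failed"):
        break
    time.sleep(15)

print(f"Experiment {experiment['id']} finished with status: {state}")
sys.exit(0 if state == "completed" else 1)  # a non-zero exit code fails the CI job
```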

When chaos experiments are embedded in your CI/CD processes, you will be alerted early to new issues that could threaten your reliability posture as your systems change. Building these checks into development workflows also reinforces that reliability is a priority for every team member to keep in mind.

The “Change Management and Operational Resilience” section of the Reliability Pillar also encourages building runbooks for incident response and ensuring they are up-to-date. By injecting faults intentionally, you can rehearse the steps of your incident response process and optimize each step so you are better prepared to handle unexpected events. 

Platform engineering and Site Reliability teams can also host GameDays or workshops to collaborate with and train their development teams on observability best practices and on how to analyze logs to find issues fast and reduce Mean Time to Resolution (MTTR).

Scaling Chaos Engineering Initiatives Across AWS Projects

When teams are just getting started with chaos engineering, they will often set up new chaos experiments independently, using open source tools or scripts, to start getting valuable feedback on how their systems respond. We recently wrote a blog post that lists some of the most popular open source chaos engineering tools.

If you want to scale chaos engineering as a practice across your organization, it’s worth looking at a dedicated commercial tool designed for enterprise adoption.

Chaos engineering platforms like Steadybit are a critical foundation for organizations that want to scale their chaos experiments across teams and ensure there are safety guardrails in place.

With open source extensions, Steadybit automates the discovery of potential targets and provides experiment templates so it’s easy to standardize tests across projects. 

Beyond open source tools and AWS FIS, Steadybit is designed to plug into any environment, including on-premises and SaaS deployments, so you can easily extend it to any relevant tools, such as cloud providers, monitoring and load testing tools, and API vendors. This level of flexibility and customization allows teams to truly make Steadybit their own.

If reliability is a priority for your team, schedule a demo with us to see how easy it can be to introduce and scale chaos engineering across your organization. You can also get started yourself with a free trial to start experimenting right away.