Availability zone outages are rare but incredibly disruptive. On October 20th, a DNS issue in Amazon’s US-EAST-1 region resulted in outages across every one of its availability zones. Critical applications stopped working, and customers across industries were left stranded. Once again, engineering teams saw just how vulnerable their services are to their dependencies.
Following the outage, some people took the opportunity to advocate for a multi-cloud or multi-region strategy. Others seemed to throw their hands up and accept that outages like this are just an inevitable result of any modern cloud architecture.
Regardless of your configuration, availability zone (AZ) outages will occur, and some degradation may be unavoidable. By proactively testing for chaotic conditions, you can ensure that your services still deliver a strong customer experience when an AZ goes down.
Reliability can be a true competitive advantage in keeping customers happy and winning new ones.
In this article, we’ll explain how you can use chaos engineering principles to run availability zone outage experiments on your systems. Don’t guess how your systems will respond. Test them.
If you run all of your services from within one availability zone, an AZ outage will disable your entire system. That’s why it’s a reliability best practice to have coverage across multiple availability zones, so that when one AZ goes down, traffic can be rerouted to the others.
During rerouting, the load on the remaining AZs will increase. If only one AZ remains, it must absorb the full scale-up on its own. If multiple AZs remain, they can distribute the increased demand, allowing for a smoother transition. However, this only works if each AZ has been provisioned with enough headroom to absorb its share of the redistributed load when an AZ is lost.
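To make the headroom requirement concrete, here is a quick back-of-the-envelope calculation. The numbers (3 AZs, a 9,000 requests-per-second peak, 5,000 requests per second of capacity per AZ) are illustrative assumptions, not recommendations:

```python
# Rough N+1 capacity check: can the surviving AZs absorb a lost zone's traffic?
# All figures below are illustrative assumptions.

def post_failover_utilization(num_azs: int, peak_load: float, per_az_capacity: float) -> float:
    """Utilization each surviving AZ would reach if one AZ fails."""
    surviving = num_azs - 1
    load_per_surviving_az = peak_load / surviving
    return load_per_surviving_az / per_az_capacity

if __name__ == "__main__":
    # 9,000 req/s split across 2 surviving AZs = 4,500 req/s each, i.e. 90% of a 5,000 req/s AZ.
    utilization = post_failover_utilization(num_azs=3, peak_load=9_000, per_az_capacity=5_000)
    print(f"Post-failover utilization per surviving AZ: {utilization:.0%}")
    # Anything at or above 100% means the failover itself would overload the remaining zones.
```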
Paying for resources in multiple availability zones can be expensive, but outages that impact customers can be just as expensive, if not far more. You can read more about the benefits of using at least three availability zones in this AWS post.
By testing how your systems handle failover, you can better understand where to invest to improve performance and reliability most efficiently.
Proactive reliability tests, or chaos experiments, intentionally inject failures into your systems in a controlled, safe way. To simulate an availability zone outage, you can run what’s called a blackhole attack.
A blackhole attack simulates a complete network outage within a specific AZ or subnet. It effectively drops all incoming and outgoing network traffic for the targeted area, creating a “blackhole” where services become unreachable.
Real-world events that can mimic a blackhole zone attack include:
By simulating this attack, you can see exactly how your system behaves when a significant part of its infrastructure suddenly goes dark.
More specifically, you can check whether your traffic failover process works correctly, how the remaining services handle the increased load, and whether your monitoring tools detect the issue and raise an alert in a timely manner.
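If you want an independent signal alongside the experiment’s built-in checks, a simple probe can record success rate and latency from outside the affected zone while the blackhole is active. Below is a minimal sketch assuming a hypothetical health endpoint and the `requests` library; swap in your own URL and thresholds:

```python
# Minimal availability probe to run alongside a blackhole experiment.
# The endpoint, duration, and interval are placeholders for your own service.
import time
import requests

ENDPOINT = "https://your-service.example.com/health"  # hypothetical health endpoint
DURATION_S = 300   # watch for 5 minutes around the experiment
INTERVAL_S = 5

failures = 0
samples = 0
start = time.time()
while time.time() - start < DURATION_S:
    samples += 1
    t0 = time.time()
    try:
        ok = requests.get(ENDPOINT, timeout=3).status_code == 200
    except requests.RequestException:
        ok = False
    latency_ms = (time.time() - t0) * 1000
    if not ok:
        failures += 1
    print(f"ok={ok} latency={latency_ms:.0f}ms error_rate={failures / samples:.1%}")
    time.sleep(INTERVAL_S)
```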
For this walkthrough, we will use Steadybit, a leading chaos engineering platform with a drag-and-drop editor for designing and running experiments quickly.
Before you begin, ensure you have the following prerequisites in place:
You can use open source scripts to run experiments like this, but that approach makes it more challenging to deploy experiments and review results.
Once you have Steadybit set up, you’re ready to build your first blackhole experiment. Here are the next steps to take:
Once you install the Steadybit agent and related extensions, Steadybit automatically discovers your cloud resources, such as AWS EC2 instances and subnets. You can then review your targets and identify the AZ or subnet(s) you want to isolate.
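If you’d like to cross-check those targets outside the Steadybit UI, a short boto3 script can list the subnets in each AZ. This assumes your AWS credentials are already configured and uses us-east-1 as a placeholder region:

```python
# List subnets grouped by Availability Zone to confirm experiment targets.
# Assumes boto3 is installed and AWS credentials are configured.
from collections import defaultdict
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # adjust the region as needed

subnets_by_az = defaultdict(list)
for subnet in ec2.describe_subnets()["Subnets"]:
    subnets_by_az[subnet["AvailabilityZone"]].append(subnet["SubnetId"])

for az, subnet_ids in sorted(subnets_by_az.items()):
    print(f"{az}: {', '.join(subnet_ids)}")
```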
We generally recommend starting your experiments with non-production environments (e.g. “Dev” or “QA”) to avoid negatively impacting real users.
In the Steadybit UI, navigate to the Experiment Editor to design your experiment. You can either use a pre-built template or build from scratch. If you want to start with a blank canvas, here’s what you would do:
You can see an example of this type of experiment here. The experiment is designed to check that an HTTP request succeeds before starting the blackhole zone attack. A wait step accounts for recovery time, and then the HTTP check resumes to verify that the service has recovered within the expected amount of time:

Experiment Template: Load Balancer Covers an AWS Zone Outage
If you want to run an experiment that only simulates a subnet outage, you would take the same steps, except you would select the “Blackhole Subnet” action instead and specify the subnet(s) to take out during the experiment.
When you are happy with your experiment design, hit “Run Experiment” and watch it execute in real time.
As the experiment executes, Steadybit applies the blackhole by modifying the network ACLs to deny all inbound and outbound traffic for the selected zone(s). Throughout the experiment, any checks you set up let you actively monitor your system’s key performance indicators (KPIs). Your observability dashboards are also useful for tracking error rates, latency, and resource utilization in the remaining active zones.
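Under the hood, a zone or subnet blackhole boils down to swapping in a deny-all network ACL. Steadybit performs and reverts this for you, but if you’re curious what the mechanism looks like, here is a rough boto3 sketch with placeholder IDs (only run something like this by hand in a disposable test environment):

```python
# Conceptual sketch of a subnet blackhole via network ACLs.
# Steadybit applies and reverts this automatically; IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VPC_ID = "vpc-0123456789abcdef0"               # placeholder
TARGET_SUBNET_ID = "subnet-0123456789abcdef0"  # placeholder

# A newly created (non-default) network ACL denies all traffic by default.
blackhole_acl_id = ec2.create_network_acl(VpcId=VPC_ID)["NetworkAcl"]["NetworkAclId"]

# Find the subnet's current ACL association and remember it for rollback.
acls = ec2.describe_network_acls(
    Filters=[{"Name": "association.subnet-id", "Values": [TARGET_SUBNET_ID]}]
)["NetworkAcls"]
assoc = next(
    a for acl in acls for a in acl["Associations"] if a["SubnetId"] == TARGET_SUBNET_ID
)
original_acl_id = assoc["NetworkAclId"]

# Swap in the deny-all ACL; all traffic to and from the subnet is now dropped.
new_assoc = ec2.replace_network_acl_association(
    AssociationId=assoc["NetworkAclAssociationId"], NetworkAclId=blackhole_acl_id
)

# Rollback: re-associate the original ACL and delete the blackhole ACL.
ec2.replace_network_acl_association(
    AssociationId=new_assoc["NewAssociationId"], NetworkAclId=original_acl_id
)
ec2.delete_network_acl(NetworkAclId=blackhole_acl_id)
```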
Once the experiment is finished, Steadybit will automatically revert all changes, restoring network traffic to the targeted zone or subnet. Now you can analyze exactly how your systems handled the outage.
For example, if you found that traffic did not failover, you might need to adjust your load balancer or Kubernetes service configurations. If alerts didn’t fire, you may need to make updates in your observability tool.
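For instance, one quick configuration to verify on AWS is whether cross-zone load balancing is enabled on a Network Load Balancer, since that setting affects how traffic spreads across the remaining zones. Here is a minimal boto3 sketch with a placeholder ARN:

```python
# Check a Network Load Balancer's cross-zone setting after a failed failover test.
# The ARN is a placeholder; adjust the region and ARN to your environment.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")
LB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/abc123"

attributes = elbv2.describe_load_balancer_attributes(LoadBalancerArn=LB_ARN)["Attributes"]
for attr in attributes:
    if attr["Key"] == "load_balancing.cross_zone.enabled":
        print(f"Cross-zone load balancing enabled: {attr['Value']}")
```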
Whatever the results, you now have real data to review and identify where improvements are needed.
After implementing any fixes, run the experiment again to verify that your changes resolved the problems you intended to solve.
Since systems are constantly changing, this iterative cycle of testing, learning, and improving is critical to building truly resilient systems. To accomplish this, some teams build experiments into their CI/CD workflows as a quality gate to maintain reliability standards over time.
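If you take the CI/CD route, the gate itself can be as simple as a script that fails the build when the service does not recover within an agreed window after the experiment. The sketch below is a generic, standalone illustration with placeholder values, not Steadybit’s own pipeline integration:

```python
# Generic CI quality-gate sketch: fail the pipeline if the service did not
# recover within the allowed window after a failover experiment.
# Endpoint and deadline are placeholders for your own setup.
import sys
import time
import requests

ENDPOINT = "https://your-service.example.com/health"  # hypothetical endpoint
RECOVERY_DEADLINE_S = 120

deadline = time.time() + RECOVERY_DEADLINE_S
while time.time() < deadline:
    try:
        if requests.get(ENDPOINT, timeout=3).status_code == 200:
            print("Service recovered within the deadline; gate passed.")
            sys.exit(0)
    except requests.RequestException:
        pass  # keep polling until the deadline
    time.sleep(5)

print("Service did not recover in time; failing the build.")
sys.exit(1)
```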
By identifying and validating the limits of your systems, you can map out key break points and know how your services will react to different conditions.
By proactively testing for failures, you can ensure your services remain highly available and keep delivering for your customers. Failures are inevitable, but being prepared for them is a choice your organization actively makes.
Are you ready to test the resilience of your systems?
Get started with a free trial of Steadybit today to explore how chaos experiments can upgrade the reliability of services across your organization.