What are regions and availability zones?

Regions are geographical areas that contain multiple availability zones (AZs). Each AZ is a distinct location within a region that is engineered to be isolated from failures in other AZs. This architecture allows for high availability and fault tolerance in applications.

Why should I deploy my application in multiple availability zones?

Deploying your application across multiple AZs enhances its resilience and availability. In the event of an outage in one AZ, traffic can be rerouted to another AZ, ensuring continuous operation and minimizing downtime.

What prerequisites should I consider before creating an experiment?

Before starting your experiment, ensure that your application is deployed in multiple AZs. This setup is crucial for validating the performance and reliability of your application under various conditions.

How do I run an experiment to test my application?

To run an experiment, you will need to set up the necessary configurations for your application across the selected AZs. Once configured, you can initiate the experiment and monitor the results to assess the application's behavior across different environments.

What outcomes can I expect from running an experiment with my application?

Running the experiment will yield results that help you understand how well your application performs in a multi-AZ setup. You can expect insights into latency, uptime, and overall resilience of your application during various failure scenarios.

What is the conclusion regarding the use of availability zones?

In conclusion, leveraging availability zones is essential for enhancing the reliability and performance of applications. By utilizing multiple AZs, you can significantly reduce the risk of downtime and ensure a more robust service delivery.

All Blog Posts

How to Survive an AWS Zone Outage

AWS Chaos Engineering Guides

17.12.2021 Dennis Schulte - 4 min read

The use of cloud services, such as AWS, Azure or GCP, helps us to make software available to our customers in a short time. The resources can be used on demand and are often more cost-effective than hosting your own data center. But a cloud has some special characteristics that need to be considered. For example, individual components may be scaled dynamically and a single server may be replaced by a new server or similar.

To put it in the words of the Amazon CTO:

"We need to build systems that embrace failure as a natural occurrence."

AWS offers several services and concepts to help build resilient systems. To that end, today we’re going to look at the concept of regions and zones, and consider what our architecture needs to look like to survive the failure of one.

What are regions and zones?

Regions and Availability Zones (AZ) are powerful concepts to build highly available applications. Each region has multiple availability zones, where each zone is one or more discrete data centers with separate power, network and connectivity. So if an application is distributed across AZs, they are better isolated from each other to be protected from physical disasters such as ice storms, floods, earthquakes and more.

Prerequisites

Application deployed in multiple AZs (If you have no application deployed you can use our demo application and spin up an AWS EKS cluster with a few commands documented here). The EKS setup provides a ready to use cluster with 2 nodes.
A tool to simulate an AZ outage. We use steadybit for that in our example, but you can also use any other tool that is capable of doing that. The AWS Well-architected Labs provides also an example for running an az outage.

Create Experiment

Before we start with our experiment we need to consider a few things. Looking at the Principles of Chaos, we should first determine the steady state of the application to later determine if our experiment was successful. The formulation of the hypothesis helps with this. For our simple example, we could define the following hypothesis:

"When the availability zone us-east-2a” goes down, the product service
still returns a product list with no issues."

In the next step, we want to create and run the experiment with steadybit. To do this, we first assign a meaningful title and also directly enter the formulated hypothesis in the field provided, in order to know later exactly what was actually tested in this experiment.

All hosts running in the availability zone “us-east-2a” are targeted. The metadata is automatically detected by steadybit and made available to the user for selection.

The definition of the blast radius is easy: we attack 100% of the hosts in the entire availability zone us-east-2a.

Various attacks can be selected to simulate a specific failure of an availability zone. The most minimally invasive solution is the blackhole attack. This prevents all data traffic to the selected target host in the AZ “us-east-2a” for the duration of the experiment. Alternatively you could choose to shutdown all hosts in the AZ, but then you have to verify that they all are restarted or replaced after the experiment.

To validate the steady state, we configure a state check that checks whether the hypothesis is consistently met before and during the runtime of the experiment. For this simple example, we select the “HTTP State” check, which continuously calls the configured endpoint and validates the result. Please replace the endpoint with your own cluster endpoint, e.g. “http://{hostname:port}/products”.

Run Experiment

Running the experiment yields a successful result. Kubernetes detects the failure of one node and routes all HTTP requests to the remaining functional node.

Conclusion

In this article, we looked at how availability zones help ensure the availability of services running in AWS. As a cross-check, you could now let both nodes fail and see what happens. Of course, it is recommended to run further experiments to cover as many scenarios as feasible that could affect the availability of the systems. What are your experiences to improve the availability of your AWS services?