Simulate DNS Outages with Steadybit

Simulate DNS Outages with Steadybit

Simulate DNS Outages with Steadybit

Infrastructure Failures
Infrastructure Failures




7 min


Did you know that your applications can fail if your DNS fails? In this blog post, we explain the logic behind DNS and how you can experiment with DNS failures in Steadybit.

Today we want to learn what DNS is, how it works in Kubernetes and how it can cause outages in our system. So let's get started.

How does DNS work?

DNS, or the Domain Name System, translates human readable domain names (for example, ) to machine readable IP addresses (e.g.: Every device that connects to the Internet has an assigned IP address. This machine-readable address is required to find the device in question on the Internet - just as the postal address is used to find a specific house.

In order to understand the individual steps involved in DNS resolution, you need to know the various components that a DNS query has to pass through. For the web browser, the DNS lookup takes place in the background. Apart from the original request, no interaction is required from the user's computer.

Types of DNS services

Authoritative DNS: An authoritative DNS service provides an update mechanism that developers use to manage their public DNS names. An authoritative DNS has ultimate authority over a domain. Its job is also to provide answers to recursive DNS servers with the IP address information.

Recursive DNS: Clients typically do not send their queries directly to authoritative DNS services. Instead, they first query another type of DNS service, also known as a resolver or recursive DNS service. A recursive DNS service is similar to a hotel concierge. It does not add DNS records itself, but acts as an intermediary that can retrieve the DNS information instead of you. If the DNS reference is already in the cache or memory of a recursive DNS, the DNS service can return the source or IP information itself to a DNS query. Otherwise, it forwards the query to one or more authoritative DNS servers to retrieve this information.

DNS in Kubernetes

Kubernetes creates DNS records for services and pods. You can contact services with uniform DNS names instead of IP addresses. Kubernetes DNS schedules a DNS pod and service on the cluster and configures the kubelets so that each container uses the IP of the DNS service to resolve DNS names. For more Kubernetes specific details see here:

What is a DNS outage / downtime?

Now that we know what DNS is, we can find an answer to what a DNS failure is. The users cannot resolve your domain name, so they receive an error message and cannot reach your website or use your application. DNS outages can lead to disgruntled customers, lost sales, and reputation damage.

Why should I test DNS failures?

There are many potential reasons for a failing DNS query . I will list a few easy to understand ones in the next section. However, there are a lot of other much more difficult to detect error sources. Most of them are very involved. If you want to learn more about them, I recommend reading:

What can lead to DNS outages?

DDoS attacks

DDoS, or a denial of service attack, is a type of cyber attack in which multiple devices work together to attack a victim's computer with a large amount of traffic, rendering them unable to respond to further inquiries. To avoid problems that a DDoS attack can cause, you need a load balancer that can split the traffic between your servers, even if it is very heavy. You also need DDoS-protected servers.

Maintenance of the authoritative name server

If you're only using an authoritative name server, anything that happens to it can affect your DNS. If it needs to be updated and rebooted, the server will be unable to respond to DNS queries. Since updates and maintenance are required, you should have a secondary DNS that can answer the queries in the meantime.

A problem in the data center where the authoritative name server is located

Problems can arise with cloud equipment as well, such as prolonged power outages, natural disasters in the area, fires or other problems. If you use a cloud service, you have no control over these problems, but you can use multiple servers in multiple data centers. If one goes down, there are always more available to answer the inquiries.

Bad configuration

DNS configuration errors can cause DNS downtime. It could be a human error, such as incorrect addressing due to incorrect spelling of the IP address or domain name, a script error, incorrect firewall configuration, etc.

  • If it's a typo, you can try querying the domain name and IP address to see which address is responding and which is not.

  • If the problem is with the firewall, you can check whether the ports have been allowed.

DNS propagation delay

When you add or remove DNS records (such as A or AAAA records), the changes don't always happen immediately. You edit the zone file within the primary DNS server and can push it to your secondary DNS servers, but there are many recursive DNS servers that you cannot control. They can keep your old IP address and make it available to clients even if you've published a new one.

What you can do about DNS propagation is to move the zone transfer to your secondary servers and keep lower TTL values for your DNS records.

Technically, while it's not a DNS outage as it only affects those with the older IP address of the domain name in the cache, it was still worth mentioning.

How to Avoid DNS Downtime

The best way to avoid DNS failures is to have a robust DNS network that offers redundancy and can withstand heavy traffic. The more servers you have, the better prepared you are. Additional features can also make DNS management easier and automate the troubleshooting process.

  • Use Secondary DNS services

  • Use DNS load balancing

How can you check whether you are prepared for DNS failures?

A DNS outage can be simulated as an attack in Steadybit. As an example we have a shop running in Kubernetes. The frontend calls up the products via a backend API via REST. To do this, we use the online shopping demo application, which consists of a microservice gateway that periodically requests products from other microservices via HTTP (hotdeals, fashion-bestseller and toys-bestseller). Please refer to Shopping Demo's Github repository for more details.

The attack now prevents the Kubernetes container from sending new DNS queries.

Let's do the test!

Unfortunately, we see a successful run.

So it seems we have basic resilience against DNS failures. We could stop here, but let's dig a little deeper and think about why and what does that mean? It means that we blocked the DNS queries for any of the 2 containers in the attack. But since each container has its own DNS cache, our HTTP queries continue to work.

But that wasn't our goal, we want to see whether our system can really survive DNS outage. So we're going to kill both pods during runtime.

Now if we run the experiment we will see it fail.

Why does it fail?

Because a pod no longer has a DNS cache after restarting, so it tries to reach the DNS server. We are blocking this on one of the two pods. So some of our HTTP requests fail.

Why don't they all fail?

Because the load balancer distributes the requests using the round robin principle.

So we learned that our system is not fully armed against DNS outages like this. In the first experiment we learned that the system can continue to work because of cached dns entries. But a new starting pod will still fail with DNS errors. What should we do now to make our system more robust?

We could build a retry in our gateway, which makes the product queries in our shop, which simply repeats the queries x times in the event of 500 HTTP status errors.

See our resilience4J blog post: Retries with resilience4J and how to check in your real world environment

Want to learn more about it? Just book a demo, we are looking forward to it.