Launching Explorer - The Companion of Your Chaos Engineering Journey

Product Updates

23.09.2023 Manuel Gerding - 6 min read

Launching Explorer - The Companion of Your Chaos Engineering Journey

Improving your system’s reliability can be challenging. Initially, you are looking at a large pile of infrastructure components from a dozen teams. They are all somehow connected, and every piece will fail eventually. While you can use Chaos Engineering to reveal the impact of each failure, you can’t predict when a failure will happen. This makes it hard to know where to start and where to continue to keep getting the most value from Chaos Engineering. Also, once you have identified the first findings with Chaos Engineering, you need to check what other components suffer from similar issues.

That’s why we released a new feature of Steadybit: the Landscape Explorer. The Landscape Explorer guides you on your Chaos Engineering journey to help you navigate through your system and make well-informed guesses when preparing upcoming Chaos Engineering experiments.

In this blog post, we first cover the underlying foundation of Steadybit that made this possible, followed by some hands-on examples of the Landscape Explorer. This blog post concludes with a glimpse of potential future steps.

Steadybit’s Core: Target Discovery

One foundation when performing Chaos Engineering experiments with Steadybit is the automatic discovery of targets.

A target is an infrastructure component that you can attack in an experiment. It is the actual place where the fault or behavior change is injected. Responsible for both discovery and fault injection is an extension, a small piece of software that runs in your infrastructure. An extension not only discovers the actual target but also attributes that are key-value-pair metadata about the target and can also be used to enrich other targets. All these attributes describe each individual target so that you can be sure to attack the right one in your experiment.

Let’s look at an example: The container extension can automatically discover containers, the host on which a container runs, and the used container image. The latter two are attributes added to the target container (container.host and container.image). Another extension, the AWS extension, is capable of discovering AWS EC2 hosts as targets and their related attributes like the AWS account (aws.account), region (aws.region), and zone (aws.zone). However, not only is the host running in the AWS context, but also the container running on that host. That’s why the AWS extension enriches these attributes to the container target. The Kubernetes extension performs another beneficial enrichment for containers running in a Kubernetes cluster. The extension enriches the containers with attributes like k8s.deployment, k8s.namespace, and k8s.cluster-name.

The following example shows a discovered container target and some attributes:

Discovered Container Target

You can use this discovery information now in an experiment.

Let’s say you want to limit the network bandwidth for a container in a specific AWS Availability Zone. By using Steadybit’s query language, you can easily do so by filtering down to the exact target instance (e.g., k8s.cluster-name='dev-demo' AND k8s.namespace='shop' AND k8s.deployment='gateway' AND aws.zone='eu-central-1b').

Consequently, this also allows Steadybit to validate targets before running the experiment to avoid trial-and-error approaches common in code-based Chaos Engineering.

Target Validation

The Landscape Explorer

Based on target discovery, we released a new feature: Landscape Explorer. By default, clicking ‘Landscape’ brings you to a Kubernetes Cluster map that arranges every discovered Kubernetes target and shows the corresponding namespace, deployment, and cluster.

Kubernetes Cluster Map

With this view, you already get a sense of how your cluster looks—much better than a long table provided by other tools.

In addition, you can fully customize the view using one of these capabilities:

Filter your targets to exclude them from visualization,
Group your targets by attribute values to have them next to each other (even into several subgroups),
Size by attribute to highlight multiple values discovered for one target,
Color by attribute values to visualize differences between your target’s attributes.

Let’s look at examples of how to use these features to guide your next Chaos Engineering experiments.

What Deployments Run Across Multiple Availability Zones?

Based on the Kubernetes map, we can easily add another dimension to identify which deployments run across multiple AWS availability zones and which run in just one zone. For that purpose, we can color-code components by aws.zone and check which deployments have only one color assigned and thus are running in one zone.

Multiple Availability Zones

Looking at the product-offering namespace, we already have many deployments running across multiple zones (e.g., product-offering-core), but a few still run in just one zone (e.g., po-work-equipment and po-catalog-core).

Let’s continue drilling down into Explorer to identify where to run our next Chaos Engineering experiments.

What Infrastructure Components Provide Most Endpoints?

The need for microservice availability usually correlates with API endpoints count, load, and business criticality. Let’s focus on API endpoints count here. Therefore, we can use discovered Spring MVC mappings (via spring.mvc-mapping) to size targets by endpoint numbers.

Endpoints Count

This exploration highlights significant infrastructure components needing high availability and Chaos Engineering attention. For instance, within product-offering, product-offering-core has many endpoints (and runs in multiple zones), whereas po-catalog-core has fewer endpoints but also runs in just one zone. However, we should continue drilling down before making conclusions.

Which Infrastructure Components are Business Critical?

As mentioned earlier, business criticality is also crucial for guiding your Chaos Engineering efforts since downtime impacts business-critical components significantly. Luckily, your organization describes component criticality via labels (service-tier) automatically discovered by Steadybit too. Components labeled service-tier 1 are most critical and should run across multiple availability zones; thus we search explicitly for those by applying an additional filter.

Business Critical Components

As visible above, most business-critical components within product-offering match expectations by running in at least two availability zones (e.g., product-offering-core, po-infrastructure). These are perfect candidates for Chaos Engineering experiments verifying failover success (see recipe Verify Smooth Operation During an AWS Zone Outage in our Reliability Hub). You can copy queries identifying targets directly from Explorer’s sidebar.

However, some deployments (e.g., po-catalog-core) run only within one Availability Zone; these reliability issues should be addressed immediately without verifying via Chaos Engineering (as failure is predictable).

Further Exploration Examples

These drill-downs show possibilities for starting your Chaos Engineering journey:

Which components still need migration to new container runtime? Based on discovered attribute container.runtime,
Which team is responsible for which components? Based on an organization-specific label (e.g., label.domain, or label.owner),
Which components aren’t deployed using Helm? Based on discovered attribute kubernetes.io/managed-by,
Which components employ outdated Helm versions vulnerable due security issues? Based on discovered attribute kubernetes.io/managed-by-version,
Who isn’t using official AWS Container images? Based on discovered attribute container.image.repository

Extend Steadybit’s Discovery Data for More Exploration

This blog post utilized attributes discovered by extensions shipped with/made by Steadybit; however extending Steadybit by writing custom extensions discovering additional attributes remains possible! Implementing HTTP APIs facilitates enriching discovered targets easily – adding incident counts per component or technology usage info easing organizational-wide chaos engineering tasks becomes feasible!

Outlook

Current feature set: Landscape Explorer marks merely starting point; based upon customer feedback exploring options adding more data e.g., configured teams/environments uncovered through experiments continues! Integrating Weakspots (Weakspot docs) aiding previous chaos engineering journeys revealing missing readiness probes/improperly distributed K8S deployments planned! Open-source contributions possible – enabling community-driven discovery/extensions today!

Disclaimer: Images depict Steadybit product features digitally altered safeguarding sensitive data.