Improving your system’s reliability can be challenging. Initially, you are looking at a large pile of infrastructure components from a dozen teams. They are all somehow connected, and every piece will fail eventually. While you can use Chaos Engineering to reveal the impact of each failure, you can’t predict when a failure will happen. This makes it hard to know where to start and where to continue to keep getting the most value from Chaos Engineering. Also, once you have identified the first findings with Chaos Engineering, you need to check what other components suffer from similar issues.
That’s why we released a new feature of Steadybit: the Landscape Explorer. The Landscape Explorer guides you on your Chaos Engineering journey to help you navigate through your system and make well-informed guesses when preparing upcoming Chaos Engineering experiments.
In this blog post, we first cover the underlying foundation of Steadybit that made this possible, followed by some hands-on examples of the Landscape Explorer. This blog post concludes with a glimpse of potential future steps.
One foundation when performing Chaos Engineering experiments with Steadybit is the automatic discovery of targets.
A target is an infrastructure component that you can attack in an experiment. It is the actual place where the fault or behavior change is injected. Responsible for both discovery and fault injection is an extension, a small piece of software that runs in your infrastructure. An extension not only discovers the actual target but also attributes that are key-value-pair metadata about the target and can also be used to enrich other targets. All these attributes describe each individual target so that you can be sure to attack the right one in your experiment.
Let’s look at an example: The container extension can automatically discover containers, the host on which a container runs, and the used container image. The latter two are attributes added to the target container (container.host
and container.image
). Another extension, the AWS extension, is capable of discovering AWS EC2 hosts as targets and their related attributes like the AWS account (aws.account
), region (aws.region
), and zone (aws.zone
). However, not only is the host running in the AWS context, but also the container running on that host. That’s why the AWS extension enriches these attributes to the container target. The Kubernetes extension performs another beneficial enrichment for containers running in a Kubernetes cluster. The extension enriches the containers with attributes like k8s.deployment
, k8s.namespace
, and k8s.cluster-name
.
The following example shows a discovered container target and some attributes:
You can use this discovery information now in an experiment.
Let’s say you want to limit the network bandwidth for a container in a specific AWS Availability Zone. By using Steadybit’s query language, you can easily do so by filtering down to the exact target instance (e.g., k8s.cluster-name='dev-demo' AND k8s.namespace='shop' AND k8s.deployment='gateway' AND aws.zone='eu-central-1b'
).
Consequently, this also allows Steadybit to validate targets before running the experiment to avoid trial-and-error approaches common in code-based Chaos Engineering.
Based on target discovery, we released a new feature: Landscape Explorer. By default, clicking ‘Landscape’ brings you to a Kubernetes Cluster map that arranges every discovered Kubernetes target and shows the corresponding namespace, deployment, and cluster.
With this view, you already get a sense of how your cluster looks—much better than a long table provided by other tools.
In addition, you can fully customize the view using one of these capabilities:
Let’s look at examples of how to use these features to guide your next Chaos Engineering experiments.
Based on the Kubernetes map, we can easily add another dimension to identify which deployments run across multiple AWS availability zones and which run in just one zone. For that purpose, we can color-code components by aws.zone
and check which deployments have only one color assigned and thus are running in one zone.
Looking at the product-offering
namespace, we already have many deployments running across multiple zones (e.g., product-offering-core
), but a few still run in just one zone (e.g., po-work-equipment
and po-catalog-core
).
Let’s continue drilling down into Explorer to identify where to run our next Chaos Engineering experiments.
The need for microservice availability usually correlates with API endpoints count, load, and business criticality. Let’s focus on API endpoints count here. Therefore, we can use discovered Spring MVC mappings (via spring.mvc-mapping
) to size targets by endpoint numbers.
This exploration highlights significant infrastructure components needing high availability and Chaos Engineering attention. For instance, within product-offering
, product-offering-core
has many endpoints (and runs in multiple zones), whereas po-catalog-core
has fewer endpoints but also runs in just one zone. However, we should continue drilling down before making conclusions.
As mentioned earlier, business criticality is also crucial for guiding your Chaos Engineering efforts since downtime impacts business-critical components significantly. Luckily, your organization describes component criticality via labels (service-tier
) automatically discovered by Steadybit too. Components labeled service-tier 1 are most critical and should run across multiple availability zones; thus we search explicitly for those by applying an additional filter.
As visible above, most business-critical components within product-offering
match expectations by running in at least two availability zones (e.g., product-offering-core
, po-infrastructure
). These are perfect candidates for Chaos Engineering experiments verifying failover success (see recipe Verify Smooth Operation During an AWS Zone Outage in our Reliability Hub). You can copy queries identifying targets directly from Explorer’s sidebar.
However, some deployments (e.g., po-catalog-core
) run only within one Availability Zone; these reliability issues should be addressed immediately without verifying via Chaos Engineering (as failure is predictable).
These drill-downs show possibilities for starting your Chaos Engineering journey:
container.runtime
,label.domain
, or label.owner
),kubernetes.io/managed-by
,kubernetes.io/managed-by-version
,container.image.repository
This blog post utilized attributes discovered by extensions shipped with/made by Steadybit; however extending Steadybit by writing custom extensions discovering additional attributes remains possible! Implementing HTTP APIs facilitates enriching discovered targets easily – adding incident counts per component or technology usage info easing organizational-wide chaos engineering tasks becomes feasible!
Current feature set: Landscape Explorer marks merely starting point; based upon customer feedback exploring options adding more data e.g., configured teams/environments uncovered through experiments continues! Integrating Weakspots (Weakspot docs) aiding previous chaos engineering journeys revealing missing readiness probes/improperly distributed K8S deployments planned! Open-source contributions possible – enabling community-driven discovery/extensions today!
Disclaimer: Images depict Steadybit product features digitally altered safeguarding sensitive data.