Important Concepts & Definitions
Now that we’ve introduced the context for this course, let’s align on some key concepts and definitions. We’ll cover terms that we utilize in the platform and explain why they are useful for your chaos engineering efforts.
Reliability Concepts
We’ll reference system reliability and chaos engineering often throughout this course. You may already be well versed in these terms, but to make sure we all share a common language, here are some definitions to keep in mind:
Chaos Engineering: The discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
System Reliability: The likelihood over time that applications or services will operate at or above certain performance standards, regardless of varied conditions.
Platform Terms
There are several sections of the Steadybit platform, including the Explorer, Reliability Advice, the Experiment Editor, Teams & Environments, and Reports. We’ll cover each of these in a specific lesson.
A few sections don’t have dedicated lessons in this 101 course, so here are quick summaries.
The Steadybit Dashboard
When you enter Steadybit, you will first land on the Dashboard. From this screen, you can see a high-level summary of recent experiment activities, a live view of discovered targets per environment, and a report on the most used types of attacks across your organization.
The Reliability Hub
The Reliability Hub is a large library of open source components that can be easily utilized in Steadybit. This catalog includes Actions, Targets, Advice, Templates, and Extensions. When Steadybit customers or team members contribute a new attack or integration, it is first reviewed for quality and then published to the Reliability Hub.
Development Kits
Steadybit has a hybrid architecture which allows teams to easily build and incorporate their own custom extensions and actions. We provide language-agnostic development kits for building your own Actions, Targets, Advice, Extensions, Events, and Preflight Actions.
Experiment Terms
Actions
Actions are the building blocks of experiments in Steadybit. In the Editor, you can drag and drop actions onto the canvas to design your experiment. Here’s a quick review of some of the types of actions you could use:
- Attacks: Inject a fault or issue into your system (e.g. blocking access to DNS, blocking traffic, delaying traffic, or stressing resources).
- Checks: Verify whether the system is in a given state at a given time. This is often done by integrating with your observability tool, such as Datadog, Prometheus, Grafana, or AppDynamics, or by using information from target discovery (e.g. the number of Kubernetes pods currently running in your system).
- Load Tests: Load test integrations let you put virtual users on your system while you run chaos experiments in a pre-production environment. Extensions are available for K6, Gatling, JMeter, and LoadRunner.
- Other Actions: There are also other kinds of actions, such as “Kubernetes event logs”, or actions that mute monitors so that an alert triggered by your experiment does not cause a real escalation.
Blast Radius
When you are selecting targets for any given experiment, it’s important to start with a small scope and expand as you build confidence in your systems. You can use the Query UI or Query language to define the type of target you want to attack. For example, you could specify a certain Kubernetes cluster, which happens to have 20 pods.
You can then use the Blast Radius feature to limit your impact within that specified target. You can target a set number (“# / total”) or percentage (“% / total”) of the targets matched by your query, such as the pods in that cluster. When you run the experiment, the affected pods are selected at random based on your targeting query and blast radius parameters. You can learn more here.
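To make the idea concrete, here is a minimal Python sketch of limiting a blast radius by count or by percentage. It is an illustration only, not Steadybit's implementation: the function name `select_targets`, its parameters, the pod names, and the choice to round percentages up are all assumptions made for this example.

```python
import math
import random


def select_targets(targets, count=None, percentage=None, rng=None):
    """Return a random subset of targets, limited by count or percentage.

    Hypothetical helper for illustration; not part of any Steadybit API.
    """
    if (count is None) == (percentage is None):
        raise ValueError("specify exactly one of count or percentage")
    rng = rng or random.Random()
    if percentage is not None:
        if not 0 <= percentage <= 100:
            raise ValueError("percentage must be between 0 and 100")
        # Assumption: fractional results are rounded up.
        count = math.ceil(len(targets) * percentage / 100)
    # Never select more targets than the query matched.
    count = max(0, min(count, len(targets)))
    return rng.sample(targets, count)


# A query that matched 20 pods, as in the example above.
pods = [f"pod-{i}" for i in range(1, 21)]

by_count = select_targets(pods, count=5)             # "# / total": 5 of 20
by_percentage = select_targets(pods, percentage=25)  # "% / total": 25% of 20
```

Both calls pick five pods out of twenty, and repeated runs will generally pick different pods, which mirrors the random selection described above.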
Emergency Stop
Safety is critical to meaningful experimentation on your systems. You want to know that you can run experiments without triggering cascading impacts. If you ever need to stop an experiment or multiple experiments, you can always hit the “Emergency Stop” button in Steadybit.
By clicking “Current Activities” in the left-hand navigation menu, you can see all running experiments. When you click “Emergency Stop”, all experiments will be stopped right away and there will be a temporary delay that prevents new experiments from starting.
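The behavior described above, stopping everything and briefly blocking new starts, can be pictured with a small Python sketch. This is a conceptual model under stated assumptions, not how Steadybit is built: the class `ExperimentController`, its method names, the experiment names, and the 60-second cooldown are hypothetical.

```python
import time


class ExperimentController:
    """Toy model of an emergency stop with a cooldown (illustrative only)."""

    def __init__(self, cooldown_seconds=60.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds  # assumed delay length
        self._clock = clock
        self._running = set()
        self._blocked_until = None

    def start(self, name):
        """Start an experiment unless a cooldown is still active."""
        if self._blocked_until is not None and self._clock() < self._blocked_until:
            return False
        self._running.add(name)
        return True

    def emergency_stop(self):
        """Stop every running experiment and begin the cooldown."""
        stopped = sorted(self._running)
        self._running.clear()
        self._blocked_until = self._clock() + self.cooldown_seconds
        return stopped


controller = ExperimentController(cooldown_seconds=60.0)
controller.start("dns-outage")
controller.start("pod-cpu-stress")

stopped = controller.emergency_stop()          # both experiments are stopped
can_restart = controller.start("dns-outage")   # refused while cooldown is active
```

The clock is passed in as a parameter so the cooldown can be exercised in tests without waiting for real time to pass.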
Lesson Summary
As you move forward with this course, these concepts and terms will be helpful to keep in mind as we discuss implementing Steadybit and running experiments. This is not an exhaustive list of terms. If you come across something you don’t recognize, please reference this list in our documentation.
