Validate Risks with Experiments

Create and run experiments that provide real insights on your systems

Put your systems to the test and improve your operational readiness with a wide variety of experiments. Identify all of your reliability gaps and system limitations.

Simulate network-level outages, latency, and traffic issues
Run actions to stress resources like CPU, disk space, and memory
Change instance states and test database failover processes
Inject application-level faults with delays and method exceptions

See Recommendations

Our Advice feature provides you with a list of recommended experiments

Start with Experiment Templates

Create new experiments fast by selecting from over 80 pre-built templates

Create Custom Actions

Build experiments from scratch and add your own custom faults

Lowering the chaos engineering learning curve

When it’s easy to build experiments, reliability can be an inclusive cross-team effort.

Building Your First Experiment

Learn how to create reliability tests quickly with the Steadybit experiment editor.

Creating Experiments from Scratch

Build new experiments fast with hundreds of no-code actions.

Using Experiment Templates

Use a library of 80+ templates to generate and review ready-to-run experiments.

Running Your First Experiment

Watch experiments run in real-time to see how each step impacts your system.

Scheduling Your Experiment Runs

Run future experiments as one-offs or recurring tests.

Automation & Workflow Options

Create automated workflows with the Steadybit API, CLI, and MCP Server.

Define targets with granular precision and set a safe blast radius

It’s important to start with a small blast radius when running experiments for the first time. For example, you may target only 10% of the pods in a cluster with a given attack. As you build confidence in how your system will respond, you can expand your blast radius and take on more risk.
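As a rough illustration of the 10% example above, the sketch below picks a random subset of targets for a given percentage. It is not Steadybit's implementation; the function name `pick_blast_radius` and the pod names are hypothetical and only show the arithmetic behind a small blast radius.

```python
import math
import random


def pick_blast_radius(targets, percentage, seed=None):
    """Return a random subset covering `percentage` percent of `targets`.

    At least one target is returned when the list is non-empty, so a small
    percentage on a small list still produces a runnable experiment.
    """
    if not 0 < percentage <= 100:
        raise ValueError("percentage must be in (0, 100]")
    targets = list(targets)
    if not targets:
        return []
    count = max(1, math.floor(len(targets) * percentage / 100))
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return rng.sample(targets, count)


# Hypothetical pod names, for illustration only.
pods = [f"checkout-{i}" for i in range(20)]
selected = pick_blast_radius(pods, 10, seed=42)
print(len(selected), "of", len(pods), "pods targeted")
```

Raising `percentage` step by step mirrors how the blast radius is expanded as confidence grows.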

Targeting in Steadybit uses an intuitive query language based on discovered metadata. It’s easy to be specific and safe, so you know exactly what your experiment will impact. Each action has a blast radius you can adjust with a simple toggle control. There is always an emergency stop button close by to hit the brakes and roll back changes.
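To make the idea of metadata-based targeting concrete, here is a minimal sketch that filters discovered targets by attribute values. It does not use Steadybit's query syntax; the attribute keys (`namespace`, `deployment`) and the function `select_by_metadata` are assumptions chosen for illustration.

```python
def select_by_metadata(targets, required):
    """Keep only targets whose metadata matches every required key/value pair."""
    return [
        target
        for target in targets
        if all(target.get(key) == value for key, value in required.items())
    ]


# Hypothetical discovered targets; real discovery data will look different.
discovered = [
    {"name": "checkout-1", "namespace": "shop", "deployment": "checkout"},
    {"name": "checkout-2", "namespace": "shop", "deployment": "checkout"},
    {"name": "search-1", "namespace": "shop", "deployment": "search"},
    {"name": "checkout-1", "namespace": "staging", "deployment": "checkout"},
]

impacted = select_by_metadata(
    discovered, {"namespace": "shop", "deployment": "checkout"}
)
print([target["name"] for target in impacted])
```

Requiring every attribute to match is what keeps the selection specific: adding a condition can only shrink the set of impacted targets.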

Watch experiments run in real-time and validate monitoring alerts

When you start an experiment, you can watch it run in real time as each step is executed and review a summary of your system’s behavior. If your target is a Kubernetes cluster, for example, you’ll see the Kubernetes event log, which shows each change and the results of health checks.

You can also watch to see if your observability tool is raising an alert when expected. Just install the relevant extension and view these real-time events in Steadybit.
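The check "did the alert fire when expected" can be sketched as a comparison of timestamps. The example below assumes you already have the attack start time and the times at which alerts were raised; the function `alert_raised_in_time` and the sample timestamps are hypothetical and not part of any Steadybit extension.

```python
from datetime import datetime, timedelta


def alert_raised_in_time(attack_start, alert_times, max_delay):
    """Return True if any alert fired between attack_start and attack_start + max_delay."""
    deadline = attack_start + max_delay
    return any(attack_start <= fired_at <= deadline for fired_at in alert_times)


# Hypothetical timeline: the attack starts at 12:00:00 and one alert fires 90 seconds later.
attack_start = datetime(2024, 1, 1, 12, 0, 0)
alerts = [datetime(2024, 1, 1, 12, 1, 30)]

within_two_minutes = alert_raised_in_time(attack_start, alerts, timedelta(minutes=2))
within_one_minute = alert_raised_in_time(attack_start, alerts, timedelta(minutes=1))
print(within_two_minutes, within_one_minute)
```

Alerts that fired before the attack started are ignored, so a pre-existing alert does not count as detection.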

Explore Integrations

Use a library of 200+ open source actions & templates

Create reliability tests across your tech stack with a wide range of pre-built actions and templates. View the full library in the Reliability Hub.

Actions

Templates

Explore More Actions

Browse open source actions that you can easily add to experiments.

Browse 150+ Actions

Check Kafka Consumer's Reaction to Record Loss

Deny access to the topic for consumers and delete records during this time.

Dynatrace Should Detect a Crash Looping as Problem

Verify that Dynatrace alerts you on pods not being ready to accept traffic.

Increase Latency Progressively for a Linux Host

Progressively increase the latency of a host to find the point at which communication breaks.

Check if Load Balancer Covers an AWS Zone Outage

Ensure that failover works seamlessly by simulating Zone outages.

Test if Kubernetes Deployment Survives Redis Latency

Verify that your application handles an increased latency in a Redis cache properly.

Stress CPU of Kubernetes Deployment

Stress the CPU of a subset of or entire Kubernetes deployment for an amount of time.

Test Scaling Up of ECS Service Within Given Time

Ensure that you can scale up your ECS service within a reasonable time.

Third-Party Service Becomes Unavailable for Kubernetes

See the effect of an unavailable 3rd-party service on your deployment's success metrics.

Test Graceful Degradation of Kubernetes Deployment While RabbitMQ Suffers High Latency

Test how your application handles high latency.

Ensure Reasonable Recovery Time When Losing a Pod

Kubernetes should bring up a new pod to ensure system stability when a pod is deleted.

Check that Prometheus Detects Unhealthy Deployments

Verify that Prometheus metrics catch unready pods in a Kubernetes deployment.

Test if Windows Host Reboot Is Alerted by Datadog

Test whether Datadog raises an alert when a Windows host is suddenly missing.

Draining a Node Should Reschedule Pods Quickly

When draining a node, Kubernetes should reschedule running pods on other nodes.

Check Faultless Redundancy During Rolling Update

Checks performance impacts of degradation during a rolling update.

New Relic Detects an Incident for CPU Spikes in an ECS Task

Validate your observability alerts for detecting a CPU spike in your AWS ECS cluster.

Graceful Degradation when Microsoft SQL Server Database Can Not Be Reached

Ensure your system indicates an appropriate error message.

Check if Datadog Detects Lost Windows Host Connection

Check that Datadog detects when a Windows host suddenly loses connection.

Check Certificate TLS/SSL Expiry for Linux Hosts

Turn time forward and check whether your TLS/SSL certificates are valid.

Test if Kubernetes Deployment Survives Redis Downtime

Check that your application gracefully handles a Redis cache downtime.

Explore More Templates

Browse the full list of open source experiment templates in the Reliability Hub.

Browse 80+ Templates

Missing an action that would unlock a useful experiment?

Our open source extension framework makes it easy to add custom components to Steadybit. Build your own custom actions using our language-agnostic ActionKit and create any experiment that would be useful for your organization.

Schedule experiments or automate tests with the Steadybit API and CLI

You can run experiments manually, on a schedule, or with automation. Many teams incorporate Steadybit experiments into their CI/CD workflow so they can continually re-run experiments and ensure that new deployments meet a certain reliability standard.

With the Steadybit API and CLI, it’s easy to incorporate experiments into your development lifecycle on your terms.
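One way to picture a reliability gate in a CI/CD pipeline is a small step that turns experiment outcomes into an exit code. The sketch below does not call the Steadybit API or CLI; it assumes the outcomes have already been collected into a dictionary, and the names `reliability_gate` and the experiment labels are hypothetical.

```python
import sys


def reliability_gate(results):
    """Return a process exit code: 0 if every experiment passed, 1 otherwise.

    `results` maps an experiment name to True (passed) or False (failed).
    """
    failed = [name for name, passed in results.items() if not passed]
    for name in failed:
        print(f"experiment failed: {name}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical outcomes gathered earlier in the pipeline.
results = {
    "pod-loss-recovery": True,
    "redis-latency-degradation": False,
}
exit_code = reliability_gate(results)
print("gate exit code:", exit_code)
# In a real pipeline step, this value would be passed to sys.exit().
```

A non-zero exit code is the usual way for a pipeline step to block a deployment, which is why the gate returns a code instead of raising an error.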

Track your progress with experiment and usage reports

As you run experiments across teams, you can track your progress with reports in Steadybit. See what types of attacks are being used most often, count experiment runs, and see how many issues you have found and fixed.
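The figures mentioned above (most-used attack types, run counts, issues found and fixed) amount to a simple tally over run records. The sketch below assumes a hypothetical record shape with `attacks`, `issue_found`, and `issue_fixed` fields; it is not the format of Steadybit's reports.

```python
from collections import Counter


def summarize_runs(runs):
    """Tally attack usage, run count, and issues found/fixed across experiment runs."""
    attack_counts = Counter(attack for run in runs for attack in run["attacks"])
    return {
        "total_runs": len(runs),
        "attack_counts": attack_counts,
        "issues_found": sum(1 for run in runs if run["issue_found"]),
        "issues_fixed": sum(1 for run in runs if run["issue_fixed"]),
    }


# Hypothetical run records, for illustration only.
runs = [
    {"attacks": ["Stress CPU"], "issue_found": True, "issue_fixed": True},
    {"attacks": ["Delete Pod", "Stress CPU"], "issue_found": True, "issue_fixed": False},
    {"attacks": ["Block DNS"], "issue_found": False, "issue_fixed": False},
]

summary = summarize_runs(runs)
print(summary["total_runs"], summary["attack_counts"].most_common(1))
```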

Explore Report Types

Browse Actions & Templates in the Reliability Hub

See what types of actions, targets, and templates are waiting for you and your team in our open source library.

Explore the Hub

Actions

Stress CPU

Stress IO

Stress Memory

Trigger Shutdown Host

Fill Disk

Time Travel

Change CPU Frequency

Delete Pod

Cause Crash Loop

Rollout Restart Deployment

Pause Docker Container

Taint a Node

Drain Node

Stop Container

Blackhole Subnet Attack

Blackhole Zone Attack

Corrupt Outgoing Packages

Drop Outgoing Traffic

Block DNS

Block Traffic

Delay Outgoing Traffic

Change Azure VM State

Change EC2 Instance State

Change GCP VM State

Run AWS FIS Experiment

Trigger DB Instance Stop

Reboot RDS Instances

Trigger DB Cluster Failover

Inject Latency

Inject Exception

Inject Status Code

Inject Controller Exception

Inject Java Method Exception

Java Method Delay

Fill Diskspace

Create Maintenance Window

Check Monitor Status

Create Monitor Downtime

Check Grafana Alert Rule State

Gather Prometheus Metrics

Check SLO State in Splunk

Create Muting Rule in New Relic

Run a K6 Load Test

Run a JMeter Load Test

Run a Gatling Load Test

HTTP Checks

Istio gRPC Abort

Kong Route Terminate Requests

Limit IO Threads

Check Queue Backlog

Run Jenkins Job