Types of Chaos Experiments (+ How To Run Them According to Pros)

10.04.2024 Benjamin Wilms - 18 min read

The primary objective of a Chaos Experiment is to uncover hidden bugs, weaknesses, or non-obvious points of failure in a system that could lead to significant outages, degradation of service, or system failure under unpredictable real-world conditions.

What is a Chaos Experiment?

A Chaos Experiment is a carefully designed, controlled, and monitored process that systematically introduces disturbances or abnormalities into a system’s operation to observe and understand its response to such conditions.

It forms the core part of ‘Chaos Engineering’, which is predicated on the idea that ‘the best way to understand system behavior is by observing it under stress.’ This means intentionally injecting faults into a system in production or simulated environments to test its reliability and resilience.

This practice emerged from the understanding that systems, especially distributed systems, are inherently complex and unpredictable due to their numerous interactions and dependencies.

Components of a Chaos Engineering Experiment

Hypothesis Formation. At the initial stage, a hypothesis is formed about the system’s steady-state behavior and expected resilience against certain types of disturbances. This hypothesis predicts no significant deviation in the system’s steady state as a result of the experiment.

Variable Introduction. This involves injecting specific variables or conditions that simulate real-world disturbances (such as network latency, server failures, or resource depletion). These variables are introduced in a controlled manner to avoid unnecessary risk.

Scope and Safety. The experiment’s scope is clearly defined to limit its impact, often called the “blast radius.” Safety mechanisms, such as automatic rollback or kill switches, are implemented to halt the experiment if unexpected negative effects are observed.

Observation and Data Collection. Throughout the experiment, system performance and behavior are closely monitored using detailed logging, metrics, and observability tools. This data collection is critical for analyzing the system’s response to the introduced variables.

Analysis and Learning. After the experiment, the data is analyzed to determine whether the hypothesis was correct. This analysis extracts insights regarding the system’s vulnerabilities, resilience, and performance under stress.

Iterative Improvement. The findings from each chaos experiment inform adjustments in system design, architecture, or operational practices. These adjustments aim to mitigate identified weaknesses and enhance overall resilience.

💡Note → The ultimate goal is not to break things randomly but to uncover systemic weaknesses to improve the system’s resilience. By introducing chaos, you can enhance the understanding of your systems, leading to higher availability, reliability, and a better user experience.

Types of Chaos Experiments

1. Dependency Failure Experiment

💡Objective: To assess how microservices behave when one or more of their dependencies fail. In a microservices architecture, services are designed to perform small tasks and often rely on other services to fulfill a request. The failure of these external dependencies can lead to cascading failures across the system, resulting in degraded performance or system outages. Understanding how these failures impact the overall system is crucial for building resilient services.

Possible Experiments

Network Latency and Packet Loss. Simulate increased latency or packet loss to understand its impact on service response times and throughput.

Service Downtime. Emulate the unavailability of a critical service to observe the system’s resilience and failure modes.

Database Connectivity Issues. Introduce connection failures or read/write delays to assess the robustness of data access patterns and caching mechanisms.

Third-party API Limiting. Mimic rate limiting or downtime of third-party APIs to evaluate external dependency management and error handling.

How to Run a Dependency Failure Experiment

Map Out Dependencies.

Begin with a comprehensive inventory of all the external services your system interacts with. This includes databases, third-party APIs, cloud services, and internal services if you work in a microservices architecture.

For each dependency, document how your system interacts with it. Note the data exchanged, request frequency, and criticality of each interaction to your system’s operations.

Rank these dependencies based on their importance to your system’s core functionalities. This will help you focus your efforts on the most critical dependencies first.
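The inventory and ranking described above can be kept as a small, version-controlled data structure. The following is a minimal sketch, not a prescribed format: the `Dependency` fields, the 1 to 5 criticality scale, and the example services are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    kind: str              # e.g. "database", "third-party API", "internal service"
    requests_per_min: int  # observed request frequency
    criticality: int       # assumed scale: 1 (nice to have) to 5 (core functionality)


def rank_dependencies(deps):
    """Most critical first; ties are broken by how often the dependency is called."""
    return sorted(deps, key=lambda d: (d.criticality, d.requests_per_min), reverse=True)


# Hypothetical inventory, for illustration only.
INVENTORY = [
    Dependency("orders-db", "database", 1200, 5),
    Dependency("email-api", "third-party API", 30, 2),
    Dependency("auth-service", "internal service", 900, 5),
    Dependency("recommendations", "internal service", 400, 3),
]

if __name__ == "__main__":
    for dep in rank_dependencies(INVENTORY):
        print(f"{dep.name:16} criticality={dep.criticality} rpm={dep.requests_per_min}")
```

The top of the ranked list is where the first dependency failure experiments would typically start.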

Simulate Failures

Use service virtualization or proxy tools like Steadybit to simulate various failures for your dependencies. These can range from network latency, dropped connections, and timeouts to complete unavailability.

For each dependency, configure the types of faults you want to introduce. This could include delays, error rates, or bandwidth restrictions, mimicking real-world issues that could occur.

Start with less severe faults (like increased latency) and gradually move to more severe conditions (like complete downtime), observing the system’s behavior at each stage.
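To make the idea of escalating fault severity concrete, here is a minimal in-process sketch. It is not how a chaos platform injects faults (those usually work at the network or infrastructure level); it only wraps a function call. The names `with_faults`, `STAGES`, and `run_stage` are hypothetical, and the stage values are arbitrary examples.

```python
import random
import time


class DependencyUnavailable(Exception):
    """Raised when the injected fault simulates a failed call."""


def with_faults(call, latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap `call` so each invocation is delayed and fails with probability `error_rate`."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)
        if rng.random() < error_rate:
            raise DependencyUnavailable("injected fault")
        return call(*args, **kwargs)

    return wrapped


# Escalating stages: mild latency first, complete unavailability last.
STAGES = [
    {"latency_s": 0.05, "error_rate": 0.0},
    {"latency_s": 0.2, "error_rate": 0.1},
    {"latency_s": 0.0, "error_rate": 1.0},
]


def run_stage(call, stage, attempts=50, seed=0):
    """Return the fraction of failed calls for one stage (sleeping is skipped here)."""
    faulty = with_faults(call, rng=random.Random(seed), sleep=lambda s: None, **stage)
    failures = 0
    for _ in range(attempts):
        try:
            faulty()
        except DependencyUnavailable:
            failures += 1
    return failures / attempts


if __name__ == "__main__":
    for stage in STAGES:
        print(stage, "failure ratio:", run_stage(lambda: "ok", stage))
```

Observing the caller's behavior after each stage, before moving to the next, keeps the experiment controlled.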

Test Microservices Isolation

Implement Resilience Patterns. Use libraries like Hystrix, resilience4j, or Spring Cloud Circuit Breaker to implement patterns that prevent failures from cascading across services. This includes:
- Bulkheads. Isolate parts of the application into “compartments” to prevent failures in one area from overwhelming others.
- Circuit Breakers. Automatically “cut off” calls to a dependency that is detected as down, allowing it to recover without being overwhelmed by constant requests.

Carefully configure thresholds and timeouts for these patterns. This includes setting the appropriate parameters for circuit breakers to trip and recover and defining bulkheads to isolate services effectively.
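To show what the threshold and timeout parameters control, here is a deliberately simplified circuit breaker. It is a sketch of the pattern only, not the implementation used by Hystrix or resilience4j; the class name, parameter names, and default values are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold    # consecutive failures before tripping
        self.recovery_timeout_s = recovery_timeout_s  # how long to stay open before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout_s:
                raise CircuitOpenError("calls to the dependency are short-circuited")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens the circuit; otherwise trip at the threshold.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A low threshold trips quickly but may react to transient blips, while a long recovery timeout protects the dependency at the cost of slower recovery. A chaos experiment is a practical way to check whether the chosen values behave as intended.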

Monitor Inter-Service Communication

Utilize monitoring solutions like Prometheus, Grafana, or Datadog to monitor how services communicate under normal and failure conditions. Service meshes like Istio or Linkerd can provide detailed insights without changing your application code.

Focus on metrics like request success rates, latency, throughput, and error rates. These metrics will help you understand the impact of dependency failures on your system’s performance and reliability.

💡Recommendation → Monitoring in real-time allows you to quickly identify and respond to unexpected behaviors, minimizing the impact on your system.

Analyze Fallback Mechanisms

Evaluate the effectiveness of implemented fallback mechanisms. This includes static responses, cache usage, default values, or switching to a secondary service if the primary is unavailable.

Assess if the ‘retry logic’ is appropriately configured. This includes evaluating the retry intervals, backoff strategies, and the maximum number of attempts to prevent overwhelming a failing service.

Ensure that fallback mechanisms enable your system to operate in a degraded mode rather than failing outright. This helps maintain a service level even when dependencies are experiencing issues.
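The interplay of retry intervals, backoff, a maximum number of attempts, and a fallback can be sketched in a few lines. This is an illustrative example under assumed defaults (base delay, factor, cap, and four attempts), with hypothetical function names; production code would usually add jitter and only retry on specific exception types.

```python
import time


def backoff_delays(base_s=0.1, factor=2.0, max_attempts=4, cap_s=2.0):
    """Delays between attempts (one fewer than max_attempts), capped at cap_s."""
    return [min(cap_s, base_s * factor ** i) for i in range(max_attempts - 1)]


def call_with_retry(primary, fallback, delays, sleep=time.sleep):
    """Try `primary` once, then once more after each delay; degrade to `fallback` if all fail.

    Returns (result, attempts_made, source).
    """
    attempts = 0
    for delay in [0.0] + list(delays):
        if delay:
            sleep(delay)
        attempts += 1
        try:
            return primary(), attempts, "primary"
        except Exception:
            continue
    return fallback(), attempts, "fallback"


if __name__ == "__main__":
    def always_down():
        raise ConnectionError("dependency unavailable")

    # Falls back to a static response after exhausting all attempts.
    print(call_with_retry(always_down, lambda: "cached response", backoff_delays()))
```

During an experiment, the attempt count and the source of the response indicate whether the system degraded gracefully or kept hammering the failing dependency.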

2. Resource Manipulation Experiment

💡Objective: To understand how a system behaves when subjected to unusual or extreme resource constraints, such as CPU, memory, disk I/O, and network bandwidth. The aim is to identify potential bottlenecks and ensure that the system can handle unexpected spikes in demand without significantly degrading service.

Possible Experiments

CPU Saturation. Increase CPU usage gradually to see how the system prioritizes tasks and whether essential services remain available.
Memory Consumption. Simulate memory leaks or high memory demands to test the system’s handling of low memory conditions.
Disk I/O and Space Exhaustion. Increase disk read/write operations or fill up disk space to observe how the system copes with disk I/O bottlenecks and space limitations.

How to Run a Resource Manipulation Experiment

Define Resource Limits

Start by monitoring your system under normal operating conditions to establish a baseline for CPU, memory, disk I/O, and network bandwidth usage.
Based on historical data and performance metrics, define the normal operating range for each critical resource. This will help you identify when the system is under stress, or resource usage is abnormally high during the experiment.
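One simple way to turn baseline samples into a "normal operating range" is mean plus or minus a multiple of the standard deviation. This is only one possible approach, shown as a sketch; the multiplier `k=3.0` and the sample values are assumptions, and percentile-based ranges may suit skewed metrics better.

```python
from statistics import mean, pstdev


def normal_range(samples, k=3.0):
    """Return (low, high) bounds as mean +/- k standard deviations of the baseline."""
    mu = mean(samples)
    sigma = pstdev(samples)
    return (mu - k * sigma, mu + k * sigma)


def is_abnormal(value, bounds):
    """True when a measurement falls outside the normal operating range."""
    low, high = bounds
    return value < low or value > high


if __name__ == "__main__":
    cpu_baseline = [40, 42, 38, 41, 39]  # hypothetical CPU utilization samples in percent
    bounds = normal_range(cpu_baseline)
    print("normal range:", bounds)
    print("85% abnormal?", is_abnormal(85, bounds))
```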

Check and Verify the Break-Even Point

Understand your system’s maximum capacity before it requires scaling. This involves testing the system under gradually increasing load to identify the point at which performance starts to degrade and additional resources are needed.

If you’re using auto-scaling (either in the cloud or on-premises), clearly define and verify the rules for adding new instances or allocating resources. This includes setting CPU, memory usage thresholds, and other metrics that trigger scaling actions.

Use load testing tools like JMeter, Gatling, or Locust to simulate demand spikes and verify that your auto-scaling rules work as expected. This will ensure that your system can handle real-world traffic patterns.
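Once a load test has produced results, the point at which performance starts to degrade can be read off programmatically. The sketch below assumes the results have been exported as pairs of request rate and p95 latency and that degradation means exceeding a latency objective; the function name, the data, and the 300 ms objective are hypothetical.

```python
def find_degradation_point(results, slo_ms):
    """Return the lowest request rate whose p95 latency exceeds the objective, or None.

    `results` is a list of (requests_per_second, p95_latency_ms) pairs in any order.
    """
    for rps, p95 in sorted(results):
        if p95 > slo_ms:
            return rps
    return None


if __name__ == "__main__":
    # Hypothetical load test results.
    load_test = [(100, 120), (200, 130), (400, 180), (800, 650), (600, 310)]
    print("degrades at:", find_degradation_point(load_test, slo_ms=300), "requests/s")
```

Auto-scaling thresholds would normally be set so that new capacity is available before this point is reached.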

Select Manipulation Tool

While stress and stress-ng are powerful for generating CPU, memory, and I/O load on Linux systems, they might not be easy to use across distributed or containerized environments. Tools like Steadybit offer more user-friendly interfaces for various environments, including microservices and cloud-native applications.

💡Pro Tip → Ensure that the tool you select can accurately simulate the types of resource manipulation you’re interested in, whether it’s exhausting CPU cycles, filling up memory, saturating disk I/O, or hogging network bandwidth.

Apply Changes Gradually

Start by applying small changes to resource consumption and monitor the system’s response.
Monitor system performance carefully to identify the thresholds at which performance degrades or fails. This will help you understand the system’s resilience and where improvements are needed.
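A gradual ramp with an abort condition can be expressed as a small control loop. In this sketch, `apply_load` and `read_error_rate` are placeholders for whatever tool generates the resource pressure and whatever monitoring system reports the error rate; both, along with the 5% abort threshold, are assumptions.

```python
def ramp(levels, apply_load, read_error_rate, abort_above=0.05):
    """Apply load levels in order and stop at the first level that breaches the threshold.

    Returns (level_at_which_the_ramp_stopped_or_None, history_of_(level, error_rate)).
    """
    history = []
    for level in levels:
        apply_load(level)
        error_rate = read_error_rate()
        history.append((level, error_rate))
        if error_rate > abort_above:
            apply_load(0)  # roll back as soon as the threshold is breached
            return level, history
    apply_load(0)  # always clean up after the last step
    return None, history


if __name__ == "__main__":
    # Fake system for demonstration: errors appear once load exceeds 60%.
    state = {"load": 0}

    def apply_load(level):
        state["load"] = level

    def read_error_rate():
        return 0.2 if state["load"] > 60 else 0.0

    print(ramp([20, 40, 60, 80, 100], apply_load, read_error_rate))
```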

Monitor System Performance

Use comprehensive monitoring solutions to track the impact of resource manipulation on system performance. Look for changes in response times, throughput, error rates, and system resource utilization.

💡Pro Tip → Platforms like Steadybit can integrate with monitoring tools to provide a unified view of how resource constraints affect system health, making it easier to correlate actions with outcomes.

Evaluate Resilience

Analyze how effectively your system scales up resources in response to the induced stress. This includes evaluating the timeliness of scaling actions and whether the added resources alleviate the performance issues.
Evaluate the efficiency of your resource allocation algorithms. This involves assessing whether resources are being utilized optimally and whether unnecessary wastage or contention exists.

Test the robustness of your failover and redundancy mechanisms under ‘conditions of resource scarcity’. This can include switching to standby systems, redistributing load among available resources, or degrading service gracefully.

3. Network Disruption Experiment

💡Objective: To simulate various network conditions that can affect a system’s operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, the experiment seeks to understand how a system responds and adapts to network unreliability, ensuring critical applications can withstand and recover from real-world network issues.

Possible Experiments

DNS Failures. Introduce DNS resolution issues to evaluate the system’s reliance on DNS and its ability to use fallback DNS services.

Latency Injection. Introduce artificial delay in the network to simulate high-latency conditions, affecting the communication between services or components.

Packet Loss Simulation. Simulate the loss of data packets in the network to test how well the system handles data transmission errors and retries.

Bandwidth Throttling. Limit the network bandwidth available to the application, simulating congestion conditions or degraded network services.

Connection Drops. Force abrupt disconnections or intermittent connectivity to test session persistence and reconnection strategies.

How to Run a Network Disruption Experiment

Identify Network Paths

Start by mapping out your network’s topology, including routers, switches, gateways, and the connections between different segments. Tools like Nmap or network diagram software can help visualize your network’s structure.

Focus on identifying the critical paths data takes when traveling through your system. These include paths between microservices, external APIs, databases, and the Internet.

Document these paths and prioritize them based on their importance to your system’s operation. This will help you decide where to start with your network disruption experiments.

Choose Disruption Type

Decide on the type of network disruption to simulate. Options include:
- complete network outages,
- latency (delays in data transmission),
- packet loss (data packets being lost during transmission), and
- bandwidth limitations.

Next, choose disruptions based on their likelihood and potential impact on your system.
- For example, simulating latency and packet loss might be particularly relevant if your system is distributed across multiple geographic locations.

Use Network Chaos Tools

Traffic Control (TC). The ‘tc’ command in Linux is a powerful tool for controlling network traffic. It allows you to introduce delays, packet loss, and bandwidth restrictions on your network interfaces.
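As an illustration, the following sketch builds `tc` commands that use the `netem` queueing discipline to add delay, packet loss, or a rate limit. It only constructs and prints the commands; running them requires root privileges on a Linux host, and the interface name `eth0` and the helper function names are assumptions.

```python
import shlex


def netem_add(interface, delay_ms=None, loss_pct=None, rate=None):
    """Build a `tc qdisc add ... netem` command for the given impairments."""
    if delay_ms is None and loss_pct is None and rate is None:
        raise ValueError("choose at least one impairment")
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem"]
    if delay_ms is not None:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_pct is not None:
        cmd += ["loss", f"{loss_pct}%"]
    if rate is not None:
        cmd += ["rate", rate]  # e.g. "1mbit"
    return cmd


def netem_clear(interface):
    """Build the command that removes the netem qdisc again (the rollback step)."""
    return ["tc", "qdisc", "del", "dev", interface, "root", "netem"]


if __name__ == "__main__":
    print(shlex.join(netem_add("eth0", delay_ms=100, loss_pct=5)))
    print(shlex.join(netem_clear("eth0")))
```

Generating the rollback command together with the injection command makes it harder to leave an impairment in place by accident.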

⚠️Note → Simulating DNS failures can be complex but is crucial for understanding how your system reacts to DNS resolution issues. Consider using specialized tools or features for this purpose.

On the flip side, chaos experiment solutions like Steadybit provide user-friendly interfaces for simulating network disruptions. For example, you get safety features like built-in rollback strategies to minimize the risk of long-term impact on your system.

Monitor Connectivity and Throughput

During the experiment, use network monitoring tools and observability platforms to track connectivity and throughput metrics in real time.

Focus on monitoring packet loss rates, latency, bandwidth usage, and error rates to assess the impact of the network disruptions you’re simulating.

Assess Failover and Recovery

Evaluate how well your system’s failover mechanisms respond to network disruptions. For example, the system might switch to a redundant network path, use a different DNS server, or take other predefined recovery actions.

Measure the time it takes for the system to detect and recover from the issue. This includes the time it takes to failover and return to normal operations after the disruption ends.
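These timings can be derived from periodic health check results recorded during the experiment. The sketch below assumes samples of the form (timestamp in seconds, healthy flag) and known start and end times of the injected disruption; the function name and the report fields are hypothetical.

```python
def recovery_report(samples, fault_start, fault_end):
    """Summarize impact and recovery timing from (timestamp_s, healthy) samples."""
    unhealthy = [t for t, ok in samples if not ok and t >= fault_start]
    if not unhealthy:
        return {"impacted": False}
    first_bad, last_bad = min(unhealthy), max(unhealthy)
    recovered_at = next((t for t, ok in sorted(samples) if ok and t > last_bad), None)
    return {
        "impacted": True,
        "time_to_impact_s": first_bad - fault_start,
        "recovered_at_s": recovered_at,
        # Negative when failover restored service before the disruption ended.
        "recovery_after_fault_end_s": None if recovered_at is None else recovered_at - fault_end,
    }


if __name__ == "__main__":
    # Hypothetical health checks every 5 seconds; disruption injected from t=10 to t=25.
    checks = [(0, True), (5, True), (10, True), (15, False),
              (20, False), (25, False), (30, True), (35, True)]
    print(recovery_report(checks, fault_start=10, fault_end=25))
```

The resolution of these numbers is limited by the health check interval, so a shorter interval gives more precise timings.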

💡Recommended → Analyze the overall resilience of your system to network instability. This assessment should include how well services degrade (if at all) and how quickly and effectively they recover once normal conditions are restored.

The Role of Fault Injections in Chaos Testing

Fault injection in chaos testing is a technique that intentionally introduces errors or disruptions into a system to assess its resilience and fault tolerance capabilities.

This approach is grounded in the belief that, by simulating real-world failures, teams can identify potential weaknesses in their systems, improve their understanding of how systems behave under stress, and enhance the overall reliability and robustness of their services.

Practical Application of Fault Injections in Chaos Testing

Suppose you run a web application built on a microservices architecture, where one of the services is a payment processing service.

To ensure the application remains operational even if the payment service becomes unavailable, you can design a fault injection experiment to simulate the service’s failure.

Here’s how it’ll play out:

💡Objective: Target the communication between the web application and the payment processing service.

First, the hypothesis is that if the payment processing service fails, the web application should automatically switch to a fallback payment service, ensuring that payment transactions can still be processed, perhaps with a slight delay.

With a tool like Steadybit, you can simulate the failure of the payment processing service. This could involve shutting down the service’s instances or introducing network rules to block traffic between the web application and the payment service.

Monitor the application’s logs, payment transaction success rates, and user experience metrics during the experiment.
- The expected behavior is that the web application detects the failure and switches to the fallback payment service without significant user impact.

Investigate the cause if the application fails to switch to the fallback service or if user transactions are significantly delayed.
- It might involve issues in service discovery, failure detection, or the fallback mechanism itself. Based on the findings, adjust the application’s codebase or configuration as needed.

Document the experiment, including the setup, execution, observations, and improvements. Share the results with your development and operations teams to enhance the system’s resilience against similar future incidents.

Principles of Chaos Engineering for Experimentation

Define ‘Steady State’ as a Measurable Output

The ‘steady state’ refers to the normal behavior or output of your system under typical conditions. This includes identifying KPIs such as response times, error rates, throughput, and availability metrics.

🔖How to do this: Collect and analyze historical data to understand the system’s behavior under normal conditions. Use this data to set thresholds for acceptable performance, which will serve as a baseline for detecting anomalies when introducing chaos experiments.

Hypothesize Based on Steady State

This principle involves forming hypotheses about what will happen when chaos is introduced to the system. The hypothesis should predict that ‘the steady state will continue despite the chaos introduced, based on the assumption that the system is resilient.’

🔖How to do this: Based on the defined steady state, develop scenarios that could potentially disrupt this state. For each scenario, predict the system’s response and define the desired outcome of the experiment, such as failover to a redundant system, graceful degradation of service, or triggering of alerts and recovery processes. For example, if latency is injected into a service, hypothesize that it should not affect the overall error rate beyond a specific threshold.
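A steady-state hypothesis becomes testable once it is written down as thresholds that observed metrics can be compared against. Here is a minimal sketch; the two metrics, the threshold values, and the names `SteadyState` and `check_hypothesis` are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    max_error_rate: float      # e.g. 0.01 for 1%
    max_p95_latency_ms: float


def check_hypothesis(steady, error_rate, p95_latency_ms):
    """Return a list of violations; an empty list means the hypothesis held."""
    violations = []
    if error_rate > steady.max_error_rate:
        violations.append(
            f"error rate {error_rate:.2%} exceeds {steady.max_error_rate:.2%}"
        )
    if p95_latency_ms > steady.max_p95_latency_ms:
        violations.append(
            f"p95 latency {p95_latency_ms} ms exceeds {steady.max_p95_latency_ms} ms"
        )
    return violations


if __name__ == "__main__":
    steady = SteadyState(max_error_rate=0.01, max_p95_latency_ms=500)
    # Hypothetical measurements taken while latency was being injected.
    print(check_hypothesis(steady, error_rate=0.004, p95_latency_ms=620))
```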

Run Experiments but NOT in Production

Run chaos experiments in a pre-production environment to avoid unintended disruptions to real users and services. This controlled setting allows for identifying and remedying issues without risking production stability.

🔖How to do this: Replicate the production environment as closely as possible to ensure the validity of the experiment results. Conduct the experiments by introducing the planned disturbances and observing the system’s response. Make adjustments and fixes in this environment and re-test before considering moving to production. Once you’re confident in the system’s resilience through thorough testing and remediation in pre-production, begin planning for controlled experiments in the production environment.

Automate Chaos Tests

Automating your chaos experiments ensures consistency, repeatability, and scalability of testing. This continuous experimentation helps catch issues arising from system or environment changes.

🔖How to do this: Integrate chaos engineering tools into your CI/CD pipeline to automatically trigger experiments based on certain conditions, such as after a deployment or during off-peak hours.

Automate Your Chaos Tests with Steadybit

Robust API Functionality. Steadybit’s API offers several functionalities to manage teams, tailor your workspace settings, and orchestrate the lifecycle of chaos experiments. Whether you’re looking to automate experiment creation, scheduling, or analysis, Steadybit’s API endpoints offer the granularity needed to integrate chaos engineering deeply into your development and operational workflows.

Seamless CLI Integration. If you prefer a console over a UI but don’t want to make API calls manually every time, we’ve got you covered. Our open-source CLI provides basic features to create, edit, and run experiments. The CLI also proves very useful when integrating Steadybit into your CD pipeline and allows you to version your experiments as code next to your application.

Other tools for automating chaos tests include Gremlin, Chaos Monkey, and LitmusChaos. Each tool has features tailored to different infrastructures and failure scenarios.

💡Pro Tip → While you can run experiments at any time, it also makes sense to run experiments automatically on build or deploy jobs. You can make experiments a part of your CI pipeline through Steadybit’s API, GitHub Action, or CLI to continuously verify resilience.

Minimize Blast Radius

Limit the impact of chaos experiments to prevent widespread disruption. This principle involves starting with the smallest possible scope and gradually expanding as confidence in the system’s resilience grows.

🔖How to do this: Use feature flags, canary deployments, or service mesh capabilities to isolate the experiment’s impact. Additionally, you can utilize throttling, segmentation, or shadow traffic techniques to control the impact of the experiment. Monitor the experiment closely and have rollback mechanisms in place to quickly revert changes if unexpected issues arise.
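One way to start with the smallest possible scope and expand it gradually is to target a fixed percentage of hosts and raise that percentage between runs. The sketch below is an assumption about how this could be done, not a feature of any particular tool; it shuffles the hosts once with a fixed seed so that a larger scope always includes the hosts already tested at a smaller scope.

```python
import math
import random


def pick_targets(hosts, percent, seed=42):
    """Select roughly `percent` of hosts (at least one), stable across runs."""
    order = sorted(hosts)
    random.Random(seed).shuffle(order)  # deterministic order for a given seed
    count = max(1, math.ceil(len(order) * percent / 100))
    return order[:count]


if __name__ == "__main__":
    hosts = [f"host-{i:02d}" for i in range(10)]  # hypothetical host names
    for percent in (10, 25, 50):
        print(percent, "% ->", pick_targets(hosts, percent))
```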

Observe and Measure

Monitor performance using observability tools to collect data on how the system responds to the introduced disturbances. This data is critical for analyzing the experiment’s impact and making informed decisions on system improvements.

🔖How to do this: Implement monitoring and observability across the system to track performance metrics, log anomalies, and trace transactions. This will provide deep insights into the system’s behavior during and after experiments.

Chaos Engineering Tools for Running Experiments

Open-Source Tools

Steadybit Extension Kit

Steadybit’s extension kits extend the capabilities of the Steadybit platform, enabling custom chaos experiments.

These extension kits include:

ActionKit. The Steadybit ActionKit enables the extension of Steadybit with new action capabilities that you can use within experiments. For example, ActionKit can be used to author open- or closed-source:
- attacks on AWS, Azure, and Google Cloud services that Steadybit cannot natively attack,
- integrations with load testing tools,
- health and state checks, and
- every other runnable action!

You can learn more about ActionKit through its GitHub repository.

DiscoveryKit. The Steadybit DiscoveryKit enables the extension of Steadybit with new discovery capabilities. For example, DiscoveryKit can be used to author open/closed source discoveries for:
- proprietary technology,
- non-natively supported open-source tech,
- hardware components and
- every other “thing” you would want to see and attack with Steadybit.

You can learn more about DiscoveryKit through its GitHub repository.

EventKit. This allows extensions to consume events from the Steadybit platform to integrate with third-party systems. Extensions leveraging EventKit are similar to webhooks but do not face the typical web routing issues as Steadybit agents handle this aspect. You can use EventKit to:
- Forward audit logs to an external system.
- Add markers to monitoring systems’ charts during experiment runs.
- Capture experiment run statistics.
- Report information about experiment runs to Slack, Discord, etc.

You can learn more about EventKit through its GitHub repository.

ExtensionKit. Through kits like ActionKit and DiscoveryKit, Steadybit can be extended with new capabilities. ExtensionKit, on the other hand, contains helpful utilities and best practices for extension authors leveraging the Go programming language.

You can learn more about ExtensionKit through its GitHub repository.

LitmusChaos

LitmusChaos is a CNCF project that provides a framework for running chaos experiments in Kubernetes environments. It includes a variety of predefined chaos experiments and allows for custom experiment creation.

LitmusChaos integrates directly with Kubernetes, using CRDs to manage chaos experiments. It also offers a ChaosHub where users can share and use chaos charts.

🚩Challenges: While powerful, its Kubernetes-centric approach may limit its applicability to non-containerized environments. Users must also be comfortable with Kubernetes concepts to use it effectively.

Chaos Mesh

Chaos Mesh is another CNCF project focused on Kubernetes. It offers a comprehensive toolkit for orchestrating chaos experiments across Kubernetes clusters.

It provides a rich set of fault injection types, including pod failures, network latency, and file system IO. Chaos Mesh uses CRDs to define chaos experiments and has a dashboard for managing and visualizing them.

🚩Challenges: Similar to LitmusChaos, its Kubernetes-specific nature means it’s less suited for non-Kubernetes environments. It requires a good understanding of Kubernetes to deploy and manage experiments.

Chaos Monkey

One of the earliest tools for chaos engineering, Chaos Monkey was developed by Netflix for randomly terminating instances in their production environment to ensure that engineers implement their services to be resilient to instance failures.

It was originally designed to target Amazon Web Services (AWS) to test how remaining systems respond to such failures. It also integrates with Spinnaker for managing application deployments.

🚩Challenges: Chaos Monkey must be combined with other tools for a comprehensive chaos engineering practice. It also requires specific expertise to adapt and manage.

Cloud-Based Platforms

Steadybit

Steadybit provides a platform for executing chaos experiments across various environments, including cloud and on-premises. It stands out for its user-friendly interface and ability to define, execute, and monitor experiments without extensive chaos engineering knowledge.

The platform integrates with major cloud providers and Kubernetes, enabling seamless transitions between different environments and facilitating experiments across a hybrid cloud setup. One key feature is its scenario-based approach, which allows users to simulate complex, real-world scenarios involving multiple types of failures across different system components.

💡Pro Tip → You can use Steadybit’s Reliability Hub to try out some commonly used attacks and see how they impact your system.

FIS via AWS (Amazon Web Services)

AWS provides the Fault Injection Service (FIS, formerly Fault Injection Simulator), a fully managed service designed to inject faults and simulate outages within AWS environments. It supports experiments like API throttling, instance termination, and network latency. There are also targeted chaos experiments on EC2 instances, EKS clusters, and Lambda functions.

As part of the AWS ecosystem, the FIS leverages IAM for security and CloudWatch to monitor the impact of experiments.

Chaos Studio via Azure

Azure’s chaos engineering toolkit includes Chaos Studio, which provides a controlled environment for running experiments on Azure resources. This allows real-time tracking of experiments’ effects on application performance and system health. It also supports various fault injections, including virtual machine reboots, network latency, and disk failures.

Google Cloud

Google Cloud offers a range of tools and services that facilitate chaos engineering, including managed Kubernetes and network services that can simulate real-world network conditions.

The platform integrates with the Google Cloud Operations suite (formerly Stackdriver) for monitoring, logging, and diagnostics, enabling detailed visibility into the impact of chaos experiments.

Observability Platforms

Datadog

Datadog provides comprehensive monitoring and observability for cloud-scale applications. It captures data from servers, containers, databases, and third-party services, offering real-time visibility into system performance.

The solution integrates with several chaos engineering platforms to track the impact of experiments in real-time. This allows you to correlate chaos events with changes in system metrics and logs.

New Relic

New Relic offers observability with real-time application performance monitoring. It provides detailed insights into the health and performance of distributed systems, including applications, infrastructure, and digital customer experiences.

A key part of New Relic’s offering is its programmable platform, New Relic One, which allows users to customize their observability environment. This includes creating custom applications and dashboards tailored to specific needs, such as detailed analyses of chaos experiments.

Platforms and Integrations

Kubernetes Clusters

Chaos engineering in Kubernetes involves introducing failures at various levels, including the pod, node, network, and service levels, to test the resilience of applications and the Kubernetes orchestrator itself.

💡Integration benefits: Tools like Chaos Mesh and LitmusChaos specifically target Kubernetes environments. This is partly because they allow for the definition of chaos experiments as custom resources, enabling scenarios such as pod deletions, network latency, and resource exhaustion directly within the Kubernetes ecosystem.

Cloud Services

Cloud providers like AWS, Azure, and Google Cloud offer native services and features that support chaos engineering, such as managed Kubernetes services (EKS, AKS, GKE), serverless environments, and specific fault injection services (AWS FIS).

Utilizing these services for chaos experiments allows teams to simulate real-world scenarios that could affect their applications in the cloud, including region outages, service disruptions, and throttled network connectivity.

💡Integration benefits: Cloud providers often offer extensive documentation and support for running chaos experiments within their ecosystems, reducing the learning curve and speeding up the experimentation process.

Version Control Integration

Integrating chaos engineering workflows with version control systems like GitHub enables the automation of chaos experiments through CI/CD pipelines. GitHub Actions or similar automation tools can trigger chaos experiments based on specific events, such as a push to a branch, a pull request, or on a scheduled basis.

💡Integration benefits: This integration supports the ‘shift-left’ approach in resilience testing, allowing for early detection of issues before they impact production. It also facilitates tracking and versioning of chaos experiment configurations alongside application code.

🔖Steadybit Nuggets → The ‘shift-left’ approach is a methodology that emphasizes the integration of testing early in the software development life cycle rather than at the end or after the development phase.

Container Orchestration

Platforms like OpenShift build on Kubernetes, while Docker Swarm is a separate orchestrator; both provide features for managing containerized applications across diverse environments. These platforms support the deployment and scaling of containers, which are critical for microservices architectures.

💡Integration benefits: Many chaos engineering tools offer specific functionalities for container environments, allowing you to inject failures into containerized applications directly. This direct integration facilitates more granular control over the experiments and the ability to monitor the impact on containerized services closely.

API Testing

Microservices architectures rely heavily on API communications. Disrupting these communications can help teams understand the impact of network issues, latency, and failures on the overall system.

💡Integration benefits: Focused disruption of API communications allows you to test the resilience of service-to-service interactions, which is critical in a microservices architecture. This helps validate API contracts, ensuring that services can handle failures and maintain functionality even when dependencies are unstable.

⚠️Note → Effective API testing within chaos experiments requires a deep understanding of the expected service interactions and dependencies. This includes not just direct service-to-service communications but also the cascading effects of failures through the system.
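To illustrate the idea, here is a minimal sketch of disrupting an API call from the client side. The decorator `inject_faults` and the function `get_price` are hypothetical names; real chaos tools usually inject such faults at the network or proxy layer rather than in application code.

```python
import functools
import time


def inject_faults(delay_seconds=0.0, fail_every=0):
    """Delay each call and raise on every `fail_every`-th call (0 disables failures)."""
    def decorator(func):
        calls = 0

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal calls
            calls += 1
            time.sleep(delay_seconds)  # simulated network latency
            if fail_every and calls % fail_every == 0:
                raise ConnectionError(f"injected failure on call {calls}")
            return func(*args, **kwargs)

        return wrapper
    return decorator


@inject_faults(delay_seconds=0.01, fail_every=3)
def get_price(item):
    # Stands in for a real service-to-service API call.
    return {"item": item, "price": 10}


outcomes = []
for _ in range(3):
    try:
        get_price("book")
        outcomes.append("ok")
    except ConnectionError:
        outcomes.append("failed")
```

The third call fails by design, which lets you check whether the calling service handles the error or lets it cascade.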

Best Practices for Chaos Experimentation

Start Small. Identify components with isolated impact, lower traffic, or those that are non-critical to business operations as initial targets for your experiments.
- Recommendation: Begin your experiments with less critical systems to minimize the impact on your production environment and users.

Use Templates. Describe the parameters, scope, and actions of each experiment in YAML or JSON. These templates can be version-controlled and shared, ensuring consistency and reproducibility.
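A minimal sketch of what such a template could look like, and how it might be validated before a run, is shown below. The field names (`name`, `hypothesis`, `target`, `action`, `duration_seconds`) are assumptions made for illustration and do not follow the schema of any specific tool.

```python
import json

# Hypothetical experiment template; field names are illustrative only.
TEMPLATE = """
{
  "name": "checkout-latency",
  "hypothesis": "p95 latency stays below 500 ms",
  "target": {"service": "checkout", "environment": "staging"},
  "action": {"type": "network_latency", "delay_ms": 200},
  "duration_seconds": 60
}
"""

REQUIRED_FIELDS = ("name", "hypothesis", "target", "action", "duration_seconds")


def load_experiment(text: str) -> dict:
    """Parse a JSON template and reject it if a required field is missing."""
    experiment = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in experiment]
    if missing:
        raise ValueError(f"template is missing fields: {', '.join(missing)}")
    if experiment["duration_seconds"] <= 0:
        raise ValueError("duration_seconds must be positive")
    return experiment


experiment = load_experiment(TEMPLATE)
```

Validating the template up front means a malformed definition is caught in review or CI rather than halfway through an experiment.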

Automation. Once initial experiments are defined and validated, automate their execution using CI/CD pipelines or chaos engineering platforms. This includes automating the setup, execution, monitoring, and teardown of experiments.
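The setup, monitoring, and teardown steps can be sketched as a small runner. This is an illustration under stated assumptions, not the API of any chaos platform: `inject`, `rollback`, and `read_error_rate` are hypothetical callables you would supply, and a real runner would wait between metric checks.

```python
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_error_rate: Callable[[], float],
    max_error_rate: float,
    checks: int,
) -> str:
    """Inject a fault, poll a metric, and always roll back.

    Returns "completed" if the metric stayed within bounds for every
    check, or "halted" as soon as the threshold is exceeded.
    """
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "halted"
        return "completed"
    finally:
        # Teardown runs on success, on halt, and on unexpected exceptions.
        rollback()


# Usage with stand-in callables and scripted metric samples.
events = []
samples = iter([0.01, 0.02, 0.20, 0.01])
result = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_error_rate=lambda: next(samples),
    max_error_rate=0.05,
    checks=4,
)
```

Placing the rollback in a `finally` block is the key design choice here: the fault is removed even when the experiment is halted early.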

Monitor Metrics. Monitoring specific metrics provides insights into the system’s behavior under test conditions. Focus on:
- Latency. Indicates how quickly the system responds to requests. An increase may suggest issues with resource contention or network problems.
- CPU Usage. High CPU usage can indicate inefficiencies in the code or that the system is under high load. Monitoring CPU usage is essential for understanding the system’s capacity and planning autoscaling.
- Error Rates. The frequency of errors can signal problems with the application logic, dependencies, or underlying infrastructure.
- Server Responsiveness. Whether servers are still responding adequately under stress conditions. This helps assess the system’s availability.
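As a hedged sketch of how these metrics might be compared against a baseline, the helpers below compute a nearest-rank percentile and an error rate from raw samples. The sample values and the 20% tolerance are made up for illustration; in practice the numbers would come from your monitoring system.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Share of responses with a 5xx status code."""
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


def deviates(baseline, observed, tolerance):
    """True if observed exceeds baseline by more than the relative tolerance."""
    return observed > baseline * (1 + tolerance)


# Made-up latency samples in milliseconds.
baseline_ms = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
under_test_ms = [120, 130, 150, 170, 200, 240, 300, 380, 450, 900]

baseline_p95 = percentile(baseline_ms, 95)
under_test_p95 = percentile(under_test_ms, 95)
latency_regressed = deviates(baseline_p95, under_test_p95, tolerance=0.2)
errors = error_rate([200] * 18 + [503] * 2)
```

Comparing against a baseline rather than a fixed number keeps the check meaningful as normal traffic patterns change.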

Top Indicators of Issues. Be mindful of:
- High Error Rates. A significant increase suggests systemic issues that need immediate attention.
- Readiness Failures. Failing Kubernetes readiness probes indicate that applications are not ready to serve traffic, possibly due to startup or runtime issues.
- Litmus Probes. In Kubernetes environments, LitmusChaos probes can help detect specific conditions or states that indicate failures.
- Resilience Scores. While not universally applicable, they can quantify how well the system withstands and recovers from failures. However, their implementation and interpretation should be approached with caution.

Controlled Environment. Start chaos experiments in a staging environment that closely mirrors production. This allows for the identification and mitigation of risks in a controlled setting.

Document Results. Document the objectives, execution details, observations, and conclusions of each experiment. Include metrics, logs, and any remediations applied.

Educate and Collaborate. Involve DevOps, development teams, and other stakeholders in planning, executing, and analyzing chaos experiments. Provide education on the principles and benefits of chaos engineering.

Security and Permissions. Implement strict access controls and permissions for conducting experiments. Conduct security reviews of chaos engineering tools and procedures.

Experimentation Use Cases in Software Development

Simulating Outages

Simulating outages involves intentionally bringing down services or components within a system to test its recovery processes and resilience. This can range from shutting down virtual machines and killing processes to disconnecting network services.

The goal here is to validate and improve the system’s ability to detect failures, reroute traffic, or spin up new instances without significant impact on the end-user experience.

Use cases include:

Automation of Failure Injection. Automated tools can schedule and execute outages in a controlled manner, reducing the risk of human error.

Observability. Enhanced monitoring and logging are crucial to observe the system’s behavior during the outage and for post-mortem analysis.

Fallback Mechanisms. Implementation of fallback mechanisms like circuit breakers, retries with exponential backoff, and service degradation strategies.
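One of the fallback mechanisms named above, retries with exponential backoff, can be sketched as follows. The function name `call_with_retries` and its defaults are assumptions for illustration; production code would usually add random jitter and a cap on the delay.

```python
import time


def call_with_retries(operation, attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call operation, retrying on ConnectionError with exponential backoff.

    Delays double after each failed attempt: base, 2*base, 4*base, ...
    The final failure is re-raised so the caller can fall back.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# A stand-in operation that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated outage")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast and makes the backoff schedule easy to inspect during an outage simulation.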

Load Testing

Load testing involves simulating unexpected spikes in traffic or processing load to understand how systems behave under extreme conditions. This helps identify bottlenecks, resource limitations, and scalability issues within the application or infrastructure.

Use cases include:

Scalability Testing. Verifying if auto-scaling policies are effective in managing increased loads.

Resource Utilization Monitoring. Tracking CPU, memory, disk I/O, and network utilization to identify potential resource contention or leaks.

Throttling and Rate Limiting. Ensuring that systems can gracefully degrade performance by prioritizing critical services when under stress.
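Rate limiting is often implemented with a token bucket. The sketch below is one simple variant, with the clock passed in explicitly so the behavior is deterministic; the class name and parameters are illustrative and not taken from any particular library.

```python
class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=2, rate=1)
# Three requests arrive at t=0, a fourth at t=1 second.
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)]
```

During a load test, the rejected requests are the ones the system sheds to protect critical services, so their share is worth tracking alongside latency.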

Dependency Testing

Dependency testing focuses on the resilience of a system when external services or internal components fail. This could involve simulating API downtimes, database connection failures, or corrupted data inputs. The purpose is to ensure that the system can handle such failures gracefully, without cascading failures to the user level.

Use cases include:

Service Virtualization. Mimicking external service responses, including failures, to test how the internal system reacts.

Contract Testing. Verifying that interactions with external services meet predefined contracts, ensuring that changes in external services don’t break the system.

Timeouts and Retry Mechanisms. Configuring timeouts and retries appropriately to avoid prolonged failures due to dependency issues.
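Service virtualization can be as simple as a stub that replays scripted outcomes, including failures. The sketch below assumes a hypothetical inventory dependency; `StubInventoryService` and `stock_with_fallback` are names invented for this example.

```python
class StubInventoryService:
    """Stands in for an external dependency and replays scripted outcomes."""

    def __init__(self, script):
        self._script = list(script)

    def stock_level(self, item):
        outcome = self._script.pop(0)
        if isinstance(outcome, Exception):
            raise outcome  # scripted failure
        return outcome


def stock_with_fallback(service, item, cached):
    """Return live stock, or the cached value if the dependency fails."""
    try:
        return service.stock_level(item), "live"
    except (TimeoutError, ConnectionError):
        return cached, "cached"


stub = StubInventoryService(
    [7, TimeoutError("simulated timeout"), ConnectionError("simulated outage")]
)
results = [stock_with_fallback(stub, "sku-1", cached=5) for _ in range(3)]
```

The first call returns live data and the next two fall back to the cached value, which is the graceful handling that dependency testing aims to confirm.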

Disaster Recovery

Chaos experimentation can simulate catastrophic events, such as data center failures, to validate the effectiveness of disaster recovery plans. This testing is critical, especially for industries required by law to maintain high availability and data integrity, such as finance, healthcare, and insurance.

Use cases include:

Data Replication and Backup Strategies. Ensuring data is replicated across geographically dispersed locations and can be restored efficiently.

Failover Procedures. Automated failover to secondary systems or data centers with minimal downtime.

Recovery Point and Time Objectives (RPO/RTO). Testing if the system can meet the business requirements for data recovery and system availability after a disaster.
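The RPO and RTO check comes down to simple arithmetic on timestamps. The sketch below uses made-up times and objectives: the data loss window is the gap between the last usable backup and the outage, and downtime is the gap between the outage and restored service.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_backup, outage_start, service_restored, rpo, rto):
    """Compare observed data loss and downtime with the RPO and RTO targets."""
    data_loss_window = outage_start - last_backup
    downtime = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical disaster recovery drill.
report = evaluate_recovery(
    last_backup=datetime(2024, 4, 10, 2, 0),
    outage_start=datetime(2024, 4, 10, 2, 10),
    service_restored=datetime(2024, 4, 10, 2, 55),
    rpo=timedelta(minutes=15),
    rto=timedelta(minutes=30),
)
```

In this drill the 10-minute data loss window meets the 15-minute RPO, but the 45 minutes of downtime miss the 30-minute RTO, so the failover procedure would need work.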

Discover Potential Problems with Steadybit

Thinking of running a chaos experiment within your system? Think Steadybit.

Steadybit offers a library of predefined attack scenarios, such as CPU stress, disk fill, network latency, and packet loss, enabling teams to quickly set up and run chaos experiments.

With this, you can design custom experiments tailored to your specific infrastructure and application architecture, including the ability to target specific services, containers, or infrastructure components.

Steadybit also includes automated safety checks to prevent experiments from causing unintended damage, such as halt conditions that automatically stop the experiment if certain thresholds are exceeded.

💡Read Case Study → Discover how Salesforce achieved unmatched system resilience with Steadybit’s innovative Chaos Engineering solution. Get the case study now and learn why they chose Steadybit.

Why Choose Steadybit?

Add reliability to your system

With our Experiment Editor, your journey toward reliability is faster and easier: everything is at your fingertips, and you have full control over your experiments. It is all designed to help you achieve your goals and roll out Chaos Engineering safely at scale in your organization.

You can add new targets, attacks, and checks by implementing extensions inside Steadybit.

A unique discovery and selection process makes it easy to pick the targets.

Remove friction when collaborating between teams: export and import experiments using JSON or YAML.

Discover, understand, and describe your system

Using Steadybit’s landscape, you can see your software’s dependencies and relationships between components – the perfect start to kick off your Chaos Engineering journey.

All targets are automatically and continuously discovered.

Figure out where to start your Chaos Engineering roll-out.

Easily detect common architecture pitfalls.

Limit and control the Chaos injected in your system

It’s never been easier to successfully and safely scale Chaos Engineering in your organization: with Steadybit you can limit and control all the turbulence injected into your system.

Using the powerful query language, divide your system(s) into different environments based on the same information you use elsewhere.

Explicitly assign environments to specific users and teams to define where they’re allowed to work and to prevent unwanted damage.

Integrate Steadybit with your SAML provider or, when using the on-prem installation, with your OIDC provider.

Open Source extension kits to give you flexibility

We treat extensions as first-class citizens in our product. As a result, they are deeply integrated into our user interface: you can add extensions and even create your own.

Choose the language you prefer when creating new extensions: we use Go to create ours, but you’re free to choose what you like best.

AWS, Datadog, Kong, Kubernetes, Postman, Prometheus: we’ve got you covered!

Use our Kits to develop your own discoveries and actions for your custom targets where Steadybit out-of-the-box support doesn’t fit your needs.