The primary objective of a Chaos Experiment is to uncover hidden bugs, weaknesses, or non-obvious points of failure in a system that could lead to significant outages, degradation of service, or system failure under unpredictable real-world conditions.
A Chaos Experiment is a carefully designed, controlled, and monitored process that systematically introduces disturbances or abnormalities into a system’s operation to observe and understand its response to such conditions.
It forms the core part of ‘Chaos Engineering’, which is predicated on the idea that ‘the best way to understand system behavior is by observing it under stress.’ This means intentionally injecting faults into a system in production or simulated environments to test its reliability and resilience.
This practice emerged from the understanding that systems, especially distributed systems, are inherently complex and unpredictable due to their numerous interactions and dependencies.
💡Note → The ultimate goal is not to break things randomly but to uncover systemic weaknesses to improve the system’s resilience. By introducing chaos, you can enhance the understanding of your systems, leading to higher availability, reliability, and a better user experience.
💡Objective: To assess how microservices behave when one or more of their dependencies fail. In a microservices architecture, services are designed to perform small tasks and often rely on other services to fulfill a request. The failure of these external dependencies can lead to cascading failures across the system, resulting in degraded performance or system outages. Understanding how these failures impact the overall system is crucial for building resilient services.
💡Recommendation → Monitoring in real-time allows you to quickly identify and respond to unexpected behaviors, minimizing the impact on your system.
💡Objective: To understand how a system behaves when subjected to unusual or extreme resource constraints, such as CPU, memory, disk I/O, and network bandwidth. The aim is to identify potential bottlenecks and ensure that the system can handle unexpected spikes in demand without significantly degrading service.
💡Pro Tip → Ensure that the tool you select can accurately simulate the types of resource manipulation you’re interested in, whether it’s exhausting CPU cycles, filling up memory, saturating disk I/O, or hogging network bandwidth.
💡Pro Tip → Platforms like Steadybit can integrate with monitoring tools to provide a unified view of how resource constraints affect system health, making it easier to correlate actions with outcomes.
💡Objective: To simulate various network conditions that can affect a system’s operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, the experiment seeks to understand how a system responds and adapts to network unreliability, ensuring critical applications can withstand and recover from real-world network issues.
⚠️Note → Simulating DNS failures can be complex but is crucial for understanding how your system reacts to DNS resolution issues. Consider using specialized tools or features for this purpose.
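As a rough illustration of the DNS case, the sketch below makes hostname lookups fail inside a single Python process and checks that the caller falls back to a known address. It is a minimal sketch under stated assumptions, not a replacement for a dedicated tool: the hostname, the fallback IP, and both helper functions are hypothetical, and the patch only affects code that resolves names through Python's `socket.getaddrinfo`.

```python
import socket
from contextlib import contextmanager
from unittest import mock


@contextmanager
def dns_failure():
    """Make socket.getaddrinfo fail in this process while the block is active."""
    error = socket.gaierror(socket.EAI_NONAME, "simulated DNS failure")
    with mock.patch("socket.getaddrinfo", side_effect=error):
        yield


def resolve_or_fallback(hostname: str, fallback_ip: str) -> str:
    """Return the first resolved address, or a fallback IP when resolution fails."""
    try:
        return socket.getaddrinfo(hostname, 443)[0][4][0]
    except socket.gaierror:
        return fallback_ip


# Hypothetical hostname and fallback address, used only for illustration.
with dns_failure():
    address = resolve_or_fallback("payments.internal.example", "10.0.0.12")
```

The original resolver is restored as soon as the `with` block exits, which keeps the disturbance scoped to the code under test.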
Chaos engineering platforms like Steadybit provide user-friendly interfaces for simulating network disruptions, along with safety features such as built-in rollback strategies that minimize the risk of long-term impact on your system.
💡Recommended → Analyze the overall resilience of your system to network instability. This assessment should include how well services degrade (if at all) and how quickly and effectively they recover once normal conditions are restored.
START YOUR CHAOS ENGINEERING JOURNEY: We help you proactively chaos experiment your systems. Identify system weaknesses before they cause outages to release with confidence. Experiment with Steadybit — Start a free trial. 👈
💡Recommended → [Case Study] Learn how ManoMano uses Steadybit to be in control of its system’s reliability.
Fault injection in chaos testing is a technique that intentionally introduces errors or disruptions into a system to assess its resilience and fault tolerance capabilities.
This approach is grounded in the belief that, by simulating real-world failures, teams can identify potential weaknesses in their systems, improve their understanding of how systems behave under stress, and enhance the overall reliability and robustness of their services.
Suppose you run a web application built on a microservices architecture, where one of the services handles payment processing.
To ensure the application remains operational even if the payment service becomes unavailable, you can design a fault injection experiment to simulate the service’s failure.
Here’s how it’ll play out:
💡Objective: Target the communication between the web application and the payment processing service.
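The payment scenario above can be sketched in a few lines of Python. This is an illustrative sketch only: `charge`, `checkout`, and the `payment_pending` status are hypothetical names, and the fault is injected in-process with a wrapper rather than at the network level as a chaos tool would do.

```python
import random


class PaymentServiceUnavailable(Exception):
    """Raised when the (simulated) payment service cannot be reached."""


def charge(order_id: str, amount_cents: int) -> dict:
    """Stand-in for the real payment client call (hypothetical)."""
    return {"order_id": order_id, "status": "charged", "amount_cents": amount_cents}


def with_fault_injection(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls fail before reaching the service."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise PaymentServiceUnavailable("injected fault")
        return func(*args, **kwargs)
    return wrapper


def checkout(order_id: str, amount_cents: int, pay) -> dict:
    """Degrade gracefully: mark the payment as pending instead of failing the order."""
    try:
        return pay(order_id, amount_cents)
    except PaymentServiceUnavailable:
        return {"order_id": order_id, "status": "payment_pending", "amount_cents": amount_cents}


# failure_rate=1.0 simulates a full outage of the payment service.
faulty_charge = with_fault_injection(charge, failure_rate=1.0, rng=random.Random(0))
result = checkout("order-42", 1999, faulty_charge)
```

Lowering `failure_rate` turns the full outage into an intermittent one, which is often the more revealing experiment.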
The ‘steady state’ refers to the normal behavior or output of your system under typical conditions. This includes identifying KPIs such as response times, error rates, throughput, and availability metrics.
🔖How to do this: Collect and analyze historical data to understand the system’s behavior under normal conditions. Use this data to set thresholds for acceptable performance, which will serve as a baseline for detecting anomalies when introducing chaos experiments.
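One way to turn historical data into a baseline is sketched below. The sample latencies, the 1.5x latency margin, and the one-percentage-point error budget are assumptions chosen for illustration; in practice the values would come from your monitoring system and your own service-level objectives.

```python
import statistics


def steady_state_baseline(latencies_ms: list[float], errors: int, requests: int) -> dict:
    """Summarize historical samples into baseline KPIs."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]
    return {"p95_latency_ms": p95, "error_rate": errors / requests}


def thresholds(baseline: dict, latency_margin: float = 1.5, error_budget: float = 0.01) -> dict:
    """Derive acceptable limits from the baseline (margins are assumptions)."""
    return {
        "max_p95_latency_ms": baseline["p95_latency_ms"] * latency_margin,
        "max_error_rate": baseline["error_rate"] + error_budget,
    }


# Hypothetical response times (ms) collected under normal conditions.
history_ms = [120, 135, 128, 140, 150, 132, 125, 138, 145, 160,
              131, 129, 142, 137, 133, 127, 139, 148, 136, 134]
baseline = steady_state_baseline(history_ms, errors=2, requests=1000)
limits = thresholds(baseline)
```

The resulting `limits` dictionary is what an experiment would later compare its observations against.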
This principle involves forming hypotheses about what will happen when chaos is introduced to the system. The hypothesis should predict that ‘the steady state will continue despite the chaos introduced, based on the assumption that the system is resilient.’
🔖How to do this: Based on the defined steady state, develop scenarios that could potentially disrupt this state. For each scenario, predict the system’s response and define the desired outcome of the experiment, such as failover to a redundant system, graceful degradation of service, or triggering of alerts and recovery processes. For example, if latency is injected into a service, hypothesize that it should not affect the overall error rate beyond a specific threshold.
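A hypothesis like the latency example can be written down as data and checked mechanically once the experiment has run. The sketch below assumes a 1% error-rate limit and made-up request counts; the `Hypothesis` class is a hypothetical helper, not part of any chaos engineering tool.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A falsifiable statement about the steady state during an experiment."""
    description: str
    max_error_rate: float

    def evaluate(self, errors: int, requests: int) -> dict:
        """Compare the observed error rate with the predicted limit."""
        if requests <= 0:
            raise ValueError("requests must be positive")
        observed = errors / requests
        return {"observed_error_rate": observed, "holds": observed <= self.max_error_rate}


hypothesis = Hypothesis(
    description="Checkout error rate stays at or below 1% with 200 ms of added latency",
    max_error_rate=0.01,
)
# Hypothetical counts observed while the latency was injected.
verdict = hypothesis.evaluate(errors=3, requests=1000)
```

A verdict where `holds` is false is not a failed experiment; it is the finding the experiment was designed to surface.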
Run chaos experiments in a pre-production environment to avoid unintended disruptions to real users and services. This controlled setting allows for identifying and remedying issues without risking production stability.
🔖How to do this: Replicate the production environment as closely as possible to ensure the validity of the experiment results. Conduct the experiments by introducing the planned disturbances and observing the system’s response. Make adjustments and fixes in this environment and re-test before considering moving to production. Once you’re confident in the system’s resilience through thorough testing and remediation in pre-production, begin planning for controlled experiments in the production environment.
Automating your chaos experiments ensures consistency, repeatability, and scalability of testing. This continuous experimentation helps catch issues arising from system or environment changes.
🔖How to do this: Integrate chaos engineering tools into your CI/CD pipeline to automatically trigger experiments based on certain conditions, such as after a deployment or during off-peak hours.
Other tools for automating chaos tests include Gremlin, Chaos Monkey, and LitmusChaos. Each tool has features tailored to different infrastructures and failure scenarios.
💡Pro Tip → While you can run experiments at any time, it also makes sense to run experiments automatically on build or deploy jobs. You can make experiments a part of your CI pipeline through Steadybit’s API, GitHub Action, or CLI to continuously verify resilience automatically.
Limit the impact of chaos experiments to prevent widespread disruption. This principle involves starting with the smallest possible scope and gradually expanding as confidence in the system’s resilience grows.
🔖How to do this: Use feature flags, canary deployments, or service mesh capabilities to isolate the experiment’s impact. Additionally, you can utilize throttling, segmentation, or shadow traffic techniques to control the impact of the experiment. Monitor the experiment closely and have rollback mechanisms in place to quickly revert changes if unexpected issues arise.
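The two ideas in this step, a small scope and a guaranteed rollback, can be sketched as follows. The instance names, the 10% scope, and the `inject`/`rollback` callables are assumptions for illustration; real tooling would replace them with actual attack and revert actions.

```python
import math
from contextlib import contextmanager


def pick_targets(instances: list[str], percent: float) -> list[str]:
    """Select a bounded share of instances, always at least one, in stable order."""
    count = max(1, math.floor(len(instances) * percent / 100))
    return sorted(instances)[:count]


@contextmanager
def experiment(targets, inject, rollback):
    """Apply inject to each target and always roll back, even if an error occurs."""
    applied = []
    try:
        for target in targets:
            inject(target)
            applied.append(target)
        yield applied
    finally:
        # Revert in reverse order so the most recent change is undone first.
        for target in reversed(applied):
            rollback(target)


faulted = set()  # stands in for real infrastructure state
fleet = [f"web-{i:02d}" for i in range(20)]
targets = pick_targets(fleet, percent=10)
with experiment(targets, inject=faulted.add, rollback=faulted.discard) as active:
    during = set(faulted)  # instances affected while the experiment runs
```

Because the rollback lives in a `finally` clause, an exception raised during observation still leaves the fleet in its original state.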
Monitor performance using observability tools to collect data on how the system responds to the introduced disturbances. This data is critical for analyzing the experiment’s impact and making informed decisions on system improvements.
🔖How to do this: Implement monitoring and observability across the system to track performance metrics, log anomalies, and trace transactions. This will provide deep insights into the system’s behavior during and after experiments.
Steadybit’s extension kit extends the capabilities of the Steadybit platform, enabling custom chaos experiments.
These extension kits include:
You can learn more about ActionKit through its GitHub repository.
You can learn more about DiscoveryKit through its GitHub repository.
You can learn more about EventKit through its GitHub repository.
You can learn more about ExtensionKit through its GitHub repository.
LitmusChaos is a CNCF project that provides a framework for running chaos experiments in Kubernetes environments. It includes a variety of predefined chaos experiments and allows for custom experiment creation.
LitmusChaos integrates directly with Kubernetes, using CRDs to manage chaos experiments. It also offers a ChaosHub where users can share and use chaos charts.
🚩Challenges: While powerful, its Kubernetes-centric approach may limit its applicability to non-containerized environments. Users must also be comfortable with Kubernetes concepts to use it effectively.
Chaos Mesh is another CNCF project focused on Kubernetes. It offers a comprehensive toolkit for orchestrating chaos experiments across Kubernetes clusters.
It provides a rich set of fault injection types, including pod failures, network latency, and file system IO. Chaos Mesh uses CRDs to define chaos experiments and has a dashboard for managing and visualizing them.
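To give a sense of what a CRD-defined experiment looks like, the sketch below builds a Chaos Mesh `NetworkChaos` resource as a Python dictionary and serializes it to JSON, which `kubectl apply -f` accepts alongside YAML. The resource name, namespace, and `app` label are hypothetical, and the field layout follows the Chaos Mesh `v1alpha1` API as commonly documented; check it against the version you run.

```python
import json


def network_delay_manifest(name: str, namespace: str, app_label: str,
                           latency: str, duration: str) -> dict:
    """Build a NetworkChaos resource that delays traffic for one matching pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # affect a single pod among those selected
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


# Hypothetical target: one pod labelled app=checkout in the staging namespace.
manifest = network_delay_manifest("checkout-delay", "staging", "checkout", "200ms", "60s")
manifest_json = json.dumps(manifest, indent=2)
```

Keeping the manifest in code makes it easy to store the experiment definition in version control next to the application it targets.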
🚩Challenges: Similar to LitmusChaos, its Kubernetes-specific nature means it’s less suited for non-Kubernetes environments. It requires a good understanding of Kubernetes to deploy and manage experiments.
One of the earliest tools for chaos engineering, Chaos Monkey was developed by Netflix for randomly terminating instances in their production environment to ensure that engineers implement their services to be resilient to instance failures.
It was originally designed to target Amazon Web Services (AWS) to test how remaining systems respond to such failures. It also integrates with Spinnaker for managing application deployments.
🚩Challenges: Chaos Monkey must be combined with other tools to support a comprehensive chaos engineering practice. It also requires specific expertise to adapt and manage.
Steadybit provides a platform for executing chaos experiments across various environments, including cloud and on-premises. It stands out for its user-friendly interface and ability to define, execute, and monitor experiments without extensive chaos engineering knowledge.
The platform integrates with major cloud providers and Kubernetes, enabling seamless transitions between different environments and facilitating experiments across a hybrid cloud setup. One key feature is its scenario-based approach, which allows users to simulate complex, real-world scenarios involving multiple types of failures across different system components.
💡Pro Tip → You can use Steadybit’s Reliability Hub to try out some commonly used attacks and see how they impact your system.
AWS provides the Fault Injection Service (FIS, formerly Fault Injection Simulator), a fully managed service designed to inject faults and simulate outages within AWS environments. It supports experiments like API throttling, instance termination, and network latency. There are also targeted chaos experiments on EC2 instances, EKS clusters, and Lambda functions.
As part of the AWS ecosystem, FIS uses IAM for security and CloudWatch to monitor the impact of experiments.
Azure’s chaos engineering toolkit includes Chaos Studio, which provides a controlled environment for running experiments on Azure resources. This allows real-time tracking of experiments’ effects on application performance and system health. It also supports various fault injections, including virtual machine reboots, network latency, and disk failures.
Google Cloud offers a range of tools and services that facilitate chaos engineering, including managed Kubernetes and network services that can simulate real-world network conditions.
The platform integrates with the Google Cloud Operations suite (formerly Stackdriver) for monitoring, logging, and diagnostics, enabling detailed visibility into the impact of chaos experiments.
Datadog provides comprehensive monitoring and observability for cloud-scale applications. It captures data from servers, containers, databases, and third-party services, offering real-time visibility into system performance.
The solution integrates with several chaos engineering platforms to track the impact of experiments in real-time. This allows you to correlate chaos events with changes in system metrics and logs.
New Relic offers observability with real-time application performance monitoring. It provides detailed insights into the health and performance of distributed systems, including applications, infrastructure, and digital customer experiences.
A key addition to New Relic’s offering is its programmable platform, New Relic One, which lets users customize their observability environment. This includes creating custom applications and dashboards tailored to specific needs, such as detailed analyses of chaos experiments.
Chaos engineering in Kubernetes involves introducing failures at various levels, including the pod, node, network, and service levels, to test the resilience of applications and the Kubernetes orchestrator itself.
💡Integration benefits: Tools like Chaos Mesh and LitmusChaos specifically target Kubernetes environments. This is partly because they allow for the definition of chaos experiments as custom resources, enabling scenarios such as pod deletions, network latency, and resource exhaustion directly within the Kubernetes ecosystem.
Cloud providers like AWS, Azure, and Google Cloud offer native services and features that support chaos engineering, such as managed Kubernetes services (EKS, AKS, GKE), serverless environments, and specific fault injection services (AWS FIS).
Utilizing these services for chaos experiments allows teams to simulate real-world scenarios that could affect their applications in the cloud, including region outages, service disruptions, and throttled network connectivity.
💡Integration benefits: Cloud providers often offer extensive documentation and support for running chaos experiments within their ecosystems, reducing the learning curve and speeding up the experimentation process.
Integrating chaos engineering workflows with version control systems like GitHub enables the automation of chaos experiments through CI/CD pipelines. GitHub Actions or similar automation tools can trigger chaos experiments based on specific events, such as a push to a branch, a pull request, or on a scheduled basis.
💡Integration benefits: This integration supports the ‘shift-left’ approach in resilience testing, allowing for early detection of issues before they impact production. It also facilitates tracking and versioning of chaos experiment configurations alongside application code.
🔖Steadybit Nuggets → The ‘shift-left’ approach is a methodology that emphasizes the integration of testing early in the software development life cycle rather than at the end or after the development phase.
Platforms like OpenShift build on Kubernetes, while Docker Swarm offers an alternative container orchestrator. Both provide features for managing containerized applications across diverse environments and support the deployment and scaling of containers, which is critical for microservices architectures.
💡Integration benefits: Many chaos engineering tools offer specific functionalities for container environments, allowing you to inject failures into containerized applications directly. This direct integration facilitates more granular control over the experiments and the ability to monitor the impact on containerized services closely.
Microservices architectures rely heavily on API communications. Disrupting these communications can help teams understand the impact of network issues, latency, and failures on the overall system.
💡Integration benefits: Focused disruption of API communications allows you to test the resilience of service-to-service interactions, which is critical in a microservices architecture. This helps validate API contracts, ensuring that services can handle failures and maintain functionality even when dependencies are unstable.
⚠️Note → Effective API testing within chaos experiments requires a deep understanding of the expected service interactions and dependencies. This includes not just direct service-to-service communications but also the cascading effects of failures through the system.
Simulating outages involves intentionally bringing down services or components within a system to test its recovery processes and resilience. This can include shutting down virtual machines, killing processes, or disconnecting network services.
The goal here is to validate and improve the system’s ability to detect failures, reroute traffic, or spin up new instances without significant impact on the end-user experience.
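The behavior being validated can be pictured with a toy router that skips instances marked as down. This is a deliberately simplified sketch: the `Router` class and the instance names are hypothetical, and a real system would detect failures through health checks rather than explicit `mark_down` calls.

```python
class Router:
    """Route requests to the first healthy instance, in priority order."""

    def __init__(self, instances: list[str]) -> None:
        self._health = {name: True for name in instances}

    def mark_down(self, name: str) -> None:
        self._health[name] = False

    def mark_up(self, name: str) -> None:
        self._health[name] = True

    def route(self) -> str:
        for name, healthy in self._health.items():
            if healthy:
                return name
        raise RuntimeError("no healthy instances available")


router = Router(["primary", "replica-1", "replica-2"])
before = router.route()
router.mark_down("primary")  # simulated outage of the primary instance
during = router.route()
router.mark_up("primary")    # recovery once the outage ends
after = router.route()
```

An outage experiment checks exactly this sequence in the real system: traffic moves away during the failure and returns once the component recovers.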
Use cases include:
Load testing involves simulating unexpected spikes in traffic or processing load to understand how systems behave under extreme conditions. This helps identify bottlenecks, resource limitations, and scalability issues within the application or infrastructure.
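A traffic spike can be approximated in-process with a thread pool, as in the sketch below. `handle_request` is a hypothetical stand-in that sleeps for 10 ms in place of real work, and the request count and concurrency are arbitrary; a real load test would drive the deployed service over the network with a dedicated tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id: int) -> int:
    """Hypothetical request handler; sleeps briefly to stand in for real work."""
    time.sleep(0.01)
    return request_id


def timed_call(request_id: int) -> float:
    """Return how long a single request took, in seconds."""
    start = time.perf_counter()
    handle_request(request_id)
    return time.perf_counter() - start


def run_spike(total_requests: int, concurrency: int) -> dict:
    """Send total_requests through a pool of concurrency worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(timed_call, range(total_requests)))
    return {
        "requests": len(durations),
        "max_latency_s": max(durations),
        "mean_latency_s": sum(durations) / len(durations),
    }


report = run_spike(total_requests=50, concurrency=10)
```

Running the same spike at increasing concurrency levels and comparing the reports is a simple way to spot where latency starts to climb.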
Use cases include:
Dependency testing focuses on the resilience of a system when external services or internal components fail. This could involve simulating API downtimes, database connection failures, or corrupted data inputs. The purpose is to ensure that the system can handle such failures gracefully, without cascading failures to the user level.
Use cases include:
Chaos experimentation can simulate catastrophic events, such as data center failures, to validate the effectiveness of disaster recovery plans. This testing is critical, especially for industries required by law to maintain high availability and data integrity, such as finance, healthcare, and insurance.
Use cases include:
Thinking of running a chaos experiment within your system? Think Steadybit.
Steadybit offers a library of predefined attack scenarios, such as CPU stress, disk fill, network latency, and packet loss, enabling teams to quickly set up and run chaos experiments.
With this, you can design custom experiments tailored to your specific infrastructure and application architecture, including the ability to target specific services, containers, or infrastructure components.
Steadybit also includes automated safety checks to prevent experiments from causing unintended damage, such as halt conditions that automatically stop the experiment if certain thresholds are exceeded.
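The general idea behind a halt condition can be sketched independently of any product. The code below is an assumption-laden illustration, not Steadybit's implementation: the error-rate samples, the 5% limit, and the `stop_attack` callback are all hypothetical.

```python
from typing import Callable, Iterable


def run_with_halt_condition(samples: Iterable[float], max_error_rate: float,
                            stop_attack: Callable[[], None]) -> dict:
    """Watch error-rate samples and stop the attack once one exceeds the limit."""
    checked = 0
    for error_rate in samples:
        checked += 1
        if error_rate > max_error_rate:
            stop_attack()  # abort early: the threshold was exceeded
            return {"halted": True, "samples_checked": checked}
    stop_attack()  # normal end of the experiment
    return {"halted": False, "samples_checked": checked}


stopped = []
outcome = run_with_halt_condition(
    samples=[0.001, 0.004, 0.08, 0.002],  # hypothetical error rates over time
    max_error_rate=0.05,
    stop_attack=lambda: stopped.append(True),
)
```

In this run the third sample breaches the limit, so the attack is stopped before the fourth sample is ever read.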
💡Read Case Study → Discover how Salesforce achieved unmatched system resilience with Steadybit’s innovative Chaos Engineering solution. Get the case study now and learn why they chose Steadybit.
With our Experiment Editor, your journey toward reliability is faster and easier: everything is at your fingertips, and you have full control over your experiments. It is all designed to help you achieve your goals and roll out Chaos Engineering safely at scale in your organization.
Using Steadybit’s landscape, you can see your software’s dependencies and relationships between components – the perfect start to kick off your Chaos Engineering journey.
It’s never been easier to successfully and safely scale Chaos Engineering in your organization: with Steadybit you can limit and control all the turbulence injected into your system.
We treat extensions as first-class citizens in our product. As a result, they are deeply integrated into our user interface: you can add extensions and even create your own.