There are many chaos engineering tools that could be right for your team depending on your tech stack, your use cases, and your organization size. The most common way to categorize them is open source versus commercial.
Open source tools often focus on one area, such as platform-specific fault injection, network and infrastructure stressors, or application-level failure simulation. Commercial chaos engineering tools typically aim to support all of these layers and add platform features like permissions management, safety controls, and a central UI to manage experiments.
In this article, we’ll walk you through some of the most popular chaos engineering tools and provide context on where they are most helpful.
If you are new to chaos engineering, open source tools and scripts are a sensible place to start. They are free, accessible, and let you run some initial experiments on your systems with little setup. For example, Chaos Monkey, the open source tool from Netflix, became popular for introducing chaos into systems by randomly terminating server instances. If you are not doing any chaos engineering today, running your first experiment with an open source tool can be a good way to get your feet wet.
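To make the idea of random instance termination concrete, here is a minimal sketch in Python. It is not Chaos Monkey's actual implementation: the `Instance` record, the `opted_out` flag, and the `terminate` callback are hypothetical names used for illustration, and the default behavior is a dry run that only reports what would be terminated.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for a running server; not a real cloud API type."""
    instance_id: str
    group: str
    opted_out: bool = False  # assumption: teams can exclude instances from experiments


def pick_victim(instances, rng):
    """Return one eligible instance chosen uniformly at random, or None if none qualify."""
    eligible = [inst for inst in instances if not inst.opted_out]
    if not eligible:
        return None
    return rng.choice(eligible)


def run_experiment(instances, rng, terminate=None):
    """Pick a victim and terminate it.

    With no `terminate` callback this is a dry run that only prints the choice.
    """
    victim = pick_victim(instances, rng)
    if victim is None:
        return None
    if terminate is None:
        print(f"[dry run] would terminate {victim.instance_id} in group {victim.group}")
    else:
        terminate(victim)
    return victim


if __name__ == "__main__":
    fleet = [
        Instance("i-web-1", "web"),
        Instance("i-web-2", "web"),
        Instance("i-db-1", "database", opted_out=True),
    ]
    run_experiment(fleet, random.Random())
```

Keeping the selection logic separate from the termination callback means the same code can run as a dry run first, which is a reasonable safety habit before pointing any experiment at real infrastructure.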
Chaos Mesh is a CNCF open source project focused on injecting faults into Kubernetes environments. It’s powerful but assumes your team is already deep in the K8s ecosystem.
Best for: SREs running complex Kubernetes clusters who want flexibility and control.
Litmus comes with a hub of predefined experiments and integrates cleanly with CI/CD pipelines. It’s well suited to teams that want to bake chaos into their daily workflows.
Best for: DevOps teams that want to bring chaos testing into GitOps or pipeline-driven environments.
Toxiproxy is a low-level open source tool that simulates network conditions between services. It’s often used in development and testing environments to verify how an application behaves under degraded network conditions such as added latency or dropped connections.
Best for: Developers building fault-tolerant apps who need precise control over traffic behavior.
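As a rough illustration of what that traffic control looks like, the sketch below builds two requests for the HTTP API exposed by a Toxiproxy server: one to create a proxy in front of a dependency and one to add a latency toxic to it. The endpoint paths and field names reflect that API as we understand it, so check them against the version you run. The address `localhost:8474`, the proxy name `redis_proxy`, and the helper function names are assumptions for this example. The code only constructs the requests and does not send them.

```python
import json
import urllib.request

# Assumption: a Toxiproxy server is listening on its default API address.
TOXIPROXY_API = "http://localhost:8474"


def build_request(path, payload):
    """Build (but do not send) a JSON POST request for the Toxiproxy HTTP API."""
    return urllib.request.Request(
        url=f"{TOXIPROXY_API}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy_request(name, listen, upstream):
    """Request that creates a proxy forwarding `listen` traffic to `upstream`."""
    payload = {"name": name, "listen": listen, "upstream": upstream, "enabled": True}
    return build_request("/proxies", payload)


def add_latency_request(proxy_name, latency_ms, jitter_ms=0):
    """Request that adds a latency toxic to responses flowing back through the proxy."""
    payload = {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,  # apply the toxic to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return build_request(f"/proxies/{proxy_name}/toxics", payload)


if __name__ == "__main__":
    # Hypothetical setup: the application connects to port 26379 instead of Redis on 6379.
    for req in (
        create_proxy_request("redis_proxy", "127.0.0.1:26379", "127.0.0.1:6379"),
        add_latency_request("redis_proxy", latency_ms=1000, jitter_ms=100),
    ):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

With a server running, each request could be sent with `urllib.request.urlopen(req)`. The application under test is then pointed at the proxy's listen address rather than the real dependency.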
ChaosBlade is Alibaba’s open source chaos tool. It focuses on host-level and container-level fault injection. It’s quick to get running and simple to use.
Best for: Engineers who want to try chaos testing without setting up a full platform.
Open source tools do have natural limitations once you want to mature your practice or bring chaos engineering to multiple teams across your organization. As your ambitions grow, sticking with open source means dedicating additional time and resources to adapting, building, and integrating a solution that fits your tech stack. At a certain point, you may find that you have built an internal tool that requires maintenance and end user support, and that still has significant gaps compared to commercial tools, which benefit from a wider pool of user feedback and focused product development.
If you want to spend your time using your chaos engineering tool to run experiments rather than endlessly maintaining and modifying it, commercial tools are a good path forward.
Founded in 2016, Gremlin offered the first commercial chaos engineering tool on the market. It helped shape the category and brought awareness to chaos engineering in production. It offers turn-key experiments with limited customization on a closed-source platform.
Best for: Teams in highly structured enterprise environments with generic cloud infrastructures.
Founded in 2019, Steadybit is a top competitor to Gremlin. Unlike Gremlin, Steadybit is built with an open source extension framework that lets teams add custom attacks and integrations. Its Reliability Hub is an open source library that features hundreds of experiment templates and actions to help teams get started immediately.
Best for: Engineering teams who want to scale resilience testing without deployment friction.
If you only want to run chaos experiments on AWS resources, AWS FIS could be a good option. It allows you to run a limited number of targeted failure scenarios. It integrates with IAM and CloudWatch, but has limited support outside the AWS ecosystem.
Best for: Teams operating fully inside AWS who want cloud-native testing.
Harness offers a chaos module based on Litmus Chaos that integrates with the broader Harness ecosystem. It’s aimed at enterprise DevOps teams that already use Harness for deployment.
Best for: Enterprise teams already using Harness who want a bundled solution.
When evaluating chaos engineering tools, here are some questions to keep top of mind:
Every tool comes with some trade-off. If you choose open source, you accept more internal development time, more maintenance, and slower adoption while you build a solution capable of scaling across an organization. Commercial tools have license or usage costs that you will need to justify internally with a business case. If you choose not to adopt a tool at all, you are relying on the hope that no incident occurs that costs your business millions.
If you’d like to hear more about how to make a business case for a chaos engineering tool and bring an ROI estimate to your management, reach out and our team will be happy to share some resources.
Start with where your team is today. The best tool is the one that fits your environment and actually gets used.