In our most recent webinar, Tailor Chaos Engineering to Scale Your Reliability Journey, our Product Manager Manuel Gerding discussed how chaos engineering can improve a system’s reliability. The session covered ways to make chaos engineering easier to run and demonstrated Steadybit’s approach to the practice.
You can check out the full webinar recording to learn more about tailoring chaos engineering to scale your reliability journey. The recording also includes insights from Salesforce and OpenText that complement Gerding’s points.
According to Gerding, one of the simplest ways to get started with chaos engineering is to use the EC2 view in the AWS console. By randomly selecting an instance and stopping it, an engineer can already begin to observe, learn, and improve the system’s reliability. This experiment helps verify whether a replacement EC2 instance was launched, whether the application was scheduled onto the new instance, and whether any faults followed.
To make these reliability checks repeatable, Gerding suggests adding code to your CI/CD pipeline that stops an EC2 instance using the AWS SDK. This lets you test your system’s reliability under various scenarios, such as during a rolling update, under resource stress, or when an availability zone becomes unavailable.
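As a rough illustration of such a pipeline step, the sketch below picks one running instance at random and stops it. It is not the code shown in the webinar. It assumes an object that behaves like a boto3 EC2 client, and the opt-in tag `chaos-target=true` is a hypothetical convention added here so that only instances marked for experiments can be selected. Pagination of the instance listing is left out for brevity.

```python
import random


def pick_and_stop_instance(ec2, tag_key="chaos-target", tag_value="true", rng=random):
    """Stop one randomly chosen running instance that carries the opt-in tag.

    `ec2` is expected to behave like a boto3 EC2 client. The tag key and
    value are hypothetical defaults, not an AWS or Steadybit convention.
    Returns the stopped instance ID, or None if no instance matched.
    """
    # Only consider instances that are running and explicitly opted in.
    response = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if not instance_ids:
        return None

    victim = rng.choice(instance_ids)
    ec2.stop_instances(InstanceIds=[victim])
    return victim
```

In a pipeline job with boto3 installed and AWS credentials configured, this could be called as `pick_and_stop_instance(boto3.client("ec2"))`, followed by whatever health checks the pipeline already runs.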
While writing code to simulate potential faults is a rewarding endeavor, Gerding cautions that scaling chaos engineering to an organization-wide level is not as simple. Several considerations need addressing: error handling, automatic rollback, and access control that ensures only the right people can run experiments. It’s also crucial to avoid introducing chaos into environments that aren’t meant to be affected, and to integrate chaos engineering smoothly into your existing workflows.
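To make these concerns concrete, here is a minimal sketch of what a home-grown experiment runner has to take care of. All names in it (`ALLOWED_ENVIRONMENTS`, `ExperimentBlocked`, `run_experiment`) are hypothetical and chosen for illustration. It shows an environment allowlist that refuses to touch anything else, and a rollback that runs even when the fault injection or the health check fails.

```python
# Hypothetical allowlist: environments where experiments may run.
ALLOWED_ENVIRONMENTS = {"dev", "staging"}


class ExperimentBlocked(Exception):
    """Raised when an experiment targets an environment outside the allowlist."""


def run_experiment(environment, inject_fault, rollback, check_steady_state):
    """Run one fault injection with a guard and a guaranteed rollback.

    `inject_fault`, `rollback`, and `check_steady_state` are caller-supplied
    callables. `rollback` should be safe to call even if the injection only
    partly succeeded. Returns True if the system stayed healthy.
    """
    # Guard: never inject faults into an environment that is not allowed.
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ExperimentBlocked(f"chaos is not permitted in '{environment}'")

    try:
        inject_fault()
        return bool(check_steady_state())
    finally:
        # Runs whether the check passed, failed, or raised an error.
        rollback()
```

Even this small runner leaves open who is allowed to call it and how results are reported, which is where the organizational effort tends to grow.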
To address these complexities, Steadybit takes an agent-based approach to chaos engineering. A centralized Steadybit platform serves as the control center for creating, configuring, and running experiments. The platform is offered as a SaaS solution and can also be deployed on-premises on your own infrastructure.
Alongside the Steadybit platform, a continuously running Steadybit agent is deployed in your infrastructure. The agent is responsible for discovering all the targets, such as business applications and AWS services, and making them available for chaos engineering on the Steadybit platform. This agent-based approach has been instrumental in making chaos engineering easy to use, safe, and well-integrated.
When you run an experiment, Steadybit shows a timeline view of the exact experiment you designed, alongside widgets with real-time information about your system. Together with results and events from Kubernetes or your observability tools, these give you a comprehensive understanding of how your system behaves.
To ensure safe chaos engineering, Steadybit incorporates role-based access control. Each team, composed of a set of users, can be assigned to specific environments. This allows you to control the set of targets available for chaos engineering for a specific team, enhancing the safety and precision of your efforts.
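The relationship between users, teams, environments, and targets can be pictured with a small model. The sketch below is purely illustrative and is not Steadybit’s data model or API. The `Environment` and `Team` classes and the `targets_for` and `can_attack` functions are hypothetical names that only mirror the idea described above: a user may run experiments against a target only if one of their teams is assigned an environment containing it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Environment:
    """A named set of targets, for example everything in a staging cluster."""
    name: str
    targets: frozenset


@dataclass
class Team:
    """A set of users plus the environments that team may experiment in."""
    name: str
    users: set = field(default_factory=set)
    environments: list = field(default_factory=list)


def targets_for(user, teams):
    """Return every target the user can reach through any of their teams."""
    allowed = set()
    for team in teams:
        if user in team.users:
            for environment in team.environments:
                allowed |= environment.targets
    return allowed


def can_attack(user, target, teams):
    """True if the user may run experiments against the given target."""
    return target in targets_for(user, teams)
```

With a `staging` environment assigned to a checkout team, a member of that team could target the staging services but would be refused for anything that only exists in a production environment.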
Chaos engineering may be a complex practice, but with the right tools and approach, it can be a powerful way to improve system reliability. As Manuel Gerding from Steadybit explained, a strategic approach to chaos engineering lets you scale these practices across your organization while maintaining control and safety.