Introduction
Alerting systems play a pivotal role in monitoring modern applications, but creating effective alert rules is a complex, ongoing process: thresholds must be fine-tuned to avoid alert fatigue while still triggering in time. At Steadybit, we run golden signal alerts built with Prometheus, Grafana, and Alertmanager for both our development and production environments. These environments have different goals and traffic patterns, so alert behavior has to be tailored to each.
Rather than relying on post-incident analysis to refine alert rules, Steadybit offers a proactive approach by introducing chaos engineering into alert testing. In this blog, we’ll explore the new Steadybit Grafana extension, which allows you to test alert robustness through chaos engineering experiments, giving you the confidence to deploy alerts that perform when needed the most.
The Problem
Creating effective alerts is a challenging task, and getting it right on the first try is almost impossible. The balance between alert fatigue and thresholds that are too low to trigger is delicate and constantly evolving. At Steadybit, we created golden signal alerts using Prometheus, Grafana, and Alertmanager. We implemented these alerts for both our development and production platforms. However, these environments have different expectations, traffic, and goals, meaning alert behaviors must be tailored accordingly.
Instead of waiting for an incident and a subsequent post-mortem to refine our alerts, we can proactively test them using chaos engineering with our new extension.
We are excited to announce the new Grafana extension for Steadybit, providing a powerful way to test your alert rules through chaos engineering experiments. As always with Steadybit, it’s easy to use. You can select an alert rule and test its robustness with a simple drag-and-drop interface.
Let’s explore the extension first and then demonstrate how to test a golden signal alert through a chaos engineering experiment!
Auto-Discovery and Enrichment
Our extension automatically discovers your Grafana alert rules and enriches them with attributes out of the box. You can quickly and effortlessly integrate your alert rules into your chaos experiments.
Customizable Alert State Checks
Once you select an alert rule, you can choose the expected state of the alert and specify whether this state should appear throughout the entire step or just once. This flexibility allows you to precisely control and monitor the behavior of your alerts during chaos experiments.
Comprehensive State Checking
In addition to targeting specific alert rules, you can add parallel checks for all expected states to observe the behavior of other alert rules not directly targeted by the experiment. Observing outside your blast radius is crucial because chaos engineering often reveals unforeseen side effects, helping you understand the broader impact of changes.
Visualizing Steadybit Events in Grafana
One of the key features of this integration is the ability to see Steadybit events directly within Grafana dashboards through Grafana annotations. You can visualize the exact moments when experiments occur, correlating them with metrics and alerts on your dashboards. This enhanced visibility helps you better understand the impact of your experiments in real time.
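Under the hood, region annotations like these can be created through Grafana's standard annotations HTTP API (`POST /api/annotations`), where a body containing both `time` and `timeEnd` produces a region spanning the experiment window. As a minimal sketch, assuming a local Grafana instance and a placeholder token (the `build_annotation` helper is invented for this example):

```python
import json
from urllib import request

GRAFANA_URL = "http://localhost:3000"  # assumption: local Grafana instance
API_TOKEN = "<service-account-token>"  # placeholder; never commit real tokens

def build_annotation(text, tags, start_ms, end_ms=None):
    """Assemble the JSON body for Grafana's POST /api/annotations endpoint.

    Supplying both `time` and `timeEnd` (epoch milliseconds) creates a
    region annotation, which is how an experiment's start and finish
    can be marked on a dashboard.
    """
    body = {"text": text, "tags": tags, "time": start_ms}
    if end_ms is not None:
        body["timeEnd"] = end_ms
    return body

def post_annotation(body):
    """Send the annotation to Grafana (requires a running instance)."""
    req = request.Request(
        f"{GRAFANA_URL}/api/annotations",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )
    with request.urlopen(req) as resp:  # network call
        return json.load(resp)
```

The Steadybit extension creates these annotations for you automatically; the sketch only illustrates the mechanism behind the markers you see on the dashboard.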
Practical Example: Latency on GET Methods
Let’s assemble everything and create a dedicated experiment to observe how our golden signal alerts behave.
This experiment introduces latency into our platform. We then perform an HTTP GET request to fetch all experiments from our development platform and monitor the behavior of our alerting system in real time.
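A simple way to observe the injected latency from the client side during such an experiment is a timed probe around the GET request. A minimal sketch (the endpoint URL in the usage comment is a placeholder, not our actual API):

```python
import time
import urllib.request

def timed_get(url):
    """Perform a GET request, drain the body, and return elapsed seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        resp.read()  # read the full body so the timing includes transfer
    return time.monotonic() - start

# Example usage while latency is being injected:
# seconds = timed_get("https://platform.example.com/api/experiments")
```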
We have implemented the following alert rule with an evaluation period of 5 minutes:
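A comparable latency rule, expressed as Prometheus-style alerting rules, could look like the sketch below. This is an illustration, not our exact production rule: the metric name and the 1-second warning threshold are assumptions, while the 1.5-second critical threshold and 5-minute period come from the experiment described here.

```yaml
groups:
  - name: golden-signals
    rules:
      # Warning: p95 latency above 1 s (assumed threshold) for 5 minutes
      - alert: HighRequestLatencyWarning
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
      # Critical: p95 latency above 1.5 s for 5 minutes
      - alert: HighRequestLatencyCritical
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1.5
        for: 5m
        labels:
          severity: critical
```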
Additionally, we will check the behavior of our other alert rules related to the golden signals to see if there are any unexpected side effects.
Results
Here, we observe that the response times mostly fall within the warning range, with only a few responses exceeding 1.5 seconds.
This check of the alert rules shows the expected behavior. Since not enough response times exceeded 1.5 seconds, the critical alert did not trigger and stayed in a pending state. Our warning alert, however, did trigger and sent notifications to our development alert channel.
Can we correlate this with the experiment? Absolutely! The purple area marks the experiment’s start and end, including its individual steps, via Grafana annotations. The experiment design screen is shown alongside to illustrate how its steps line up with the annotation’s start and end markers.
The Solution
The synergy between chaos engineering and observability is too significant to ignore. By proactively using chaos engineering to test alert behaviors, you can accelerate improvements in your observability tools far more effectively than waiting for incidents and conducting post-mortems.
Chaos engineering can uncover unexpected situations that reveal where your observability tools need enhancement to provide deeper insights into underlying issues. These new insights enable you to refine your chaos experiments and push the limits of your system’s resilience, leading to a more robust and reliable infrastructure.
Conclusion
By integrating chaos engineering directly into your alerting system through Steadybit’s new Grafana extension, you can ensure that your alerts are both robust and responsive. Instead of reacting to failures after the fact, you can proactively fine-tune your observability tools, reinforcing your system’s resilience before an incident occurs. Ready to explore this innovative approach? Dive into our reliability hub and find everything you need to enhance your alerting capabilities today.