Proactively Testing Alert Rules with Chaos Engineering: Integrating Grafana and Steadybit

Grafana Integrations Product updates
23.09.2024 Antoine Choimet - 5 minute read

Introduction

Alerting systems play a pivotal role in monitoring modern applications, but creating effective alert rules can be a complex process. Fine-tuning these alerts to avoid fatigue while ensuring timely triggers is an ongoing challenge. At Steadybit, we have developed golden signal alerts using Prometheus, Grafana, and Alertmanager, tailored for both development and production environments. However, alert behaviors must adapt to different goals and traffic patterns across these environments.

Rather than relying on post-incident analysis to refine alert rules, Steadybit offers a proactive approach by introducing chaos engineering into alert testing. In this post, we’ll explore the new Steadybit Grafana extension, which lets you test the robustness of your alerts through chaos engineering experiments, giving you the confidence to deploy alerts that fire when they are needed most.

The Problem

Creating effective alerts is challenging, and getting them right on the first try is almost impossible. The balance is delicate and constantly evolving: rules that are too sensitive cause alert fatigue, while rules that are too lenient fail to fire when they should. At Steadybit, we created golden signal alerts using Prometheus, Grafana, and Alertmanager, and we run them on both our development and production platforms. These environments differ in expectations, traffic, and goals, so alert behavior must be tailored to each.

Instead of waiting for an incident and a subsequent post-mortem to refine our alerts, we can proactively test them using chaos engineering with our new extension.

We are excited to announce the new Grafana extension for Steadybit, providing a powerful way to test your alert rules through chaos engineering experiments. As always with Steadybit, it’s easy to use. You can select an alert rule and test its robustness with a simple drag-and-drop interface.

Let’s explore the extension first and then demonstrate how to test a golden signal alert through a chaos engineering experiment!

Auto-Discovery and Enrichment

Our extension automatically discovers your Grafana alert rules and enriches them with attributes out of the box, so you can integrate them into your chaos experiments quickly.

Customizable Alert State Checks

Once you select an alert rule, you can choose the expected state of the alert and specify whether this state should appear throughout the entire step or just once. This flexibility allows you to precisely control and monitor the behavior of your alerts during chaos experiments.
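To make the two check modes concrete, here is a minimal Python sketch of how such an evaluation could work. It is not the extension's actual implementation: the names `CheckMode` and `check_alert_state` are hypothetical, and the state strings are assumed to follow Grafana's alert states (Normal, Pending, Alerting).

```python
from enum import Enum
from typing import Iterable


class CheckMode(Enum):
    # Hypothetical names for the two modes described above.
    ALL_THE_TIME = "all_the_time"    # state must hold for every sample in the step
    AT_LEAST_ONCE = "at_least_once"  # state must show up in at least one sample


def check_alert_state(samples: Iterable[str], expected: str, mode: CheckMode) -> bool:
    """Return True if the observed states satisfy the expectation."""
    observed = list(samples)
    if not observed:
        # No samples means nothing was verified, so treat it as a failure.
        return False
    if mode is CheckMode.ALL_THE_TIME:
        return all(state == expected for state in observed)
    return any(state == expected for state in observed)


# States polled during one experiment step (illustrative values).
step_samples = ["Normal", "Pending", "Alerting", "Alerting"]
print(check_alert_state(step_samples, "Alerting", CheckMode.AT_LEAST_ONCE))  # True
print(check_alert_state(step_samples, "Alerting", CheckMode.ALL_THE_TIME))   # False
```

The "at least once" mode suits a step that injects a fault and expects the alert to fire eventually, while "all the time" suits a step where the alert should stay in one state, for example Normal during a baseline phase.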

Comprehensive State Checking

In addition to targeting specific alert rules, you can add parallel checks for all expected states to observe the behavior of other alert rules not directly targeted by the experiment. Observing outside your blast radius is crucial because chaos engineering often reveals unforeseen side effects, helping you understand the broader impact of changes.
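The idea of watching rules outside the blast radius can be illustrated with a small Python sketch. It assumes you have a snapshot of every rule's state before and during the experiment. The function `unexpected_state_changes` and the rule names are hypothetical and only illustrate the comparison, not how the extension performs it.

```python
from typing import Dict, Optional, Set, Tuple


def unexpected_state_changes(
    before: Dict[str, str],
    during: Dict[str, str],
    targeted: Set[str],
) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each non-targeted rule whose state changed to (old_state, new_state)."""
    changes: Dict[str, Tuple[Optional[str], str]] = {}
    for rule, new_state in during.items():
        if rule in targeted:
            continue  # targeted rules are covered by their own checks
        old_state = before.get(rule)  # None if the rule was not seen before
        if old_state != new_state:
            changes[rule] = (old_state, new_state)
    return changes


# Illustrative snapshots: the experiment targets only "checkout-latency".
before = {"checkout-latency": "Normal", "cart-errors": "Normal", "db-saturation": "Normal"}
during = {"checkout-latency": "Alerting", "cart-errors": "Alerting", "db-saturation": "Normal"}

side_effects = unexpected_state_changes(before, during, targeted={"checkout-latency"})
print(side_effects)  # {'cart-errors': ('Normal', 'Alerting')}
```

In this example the `cart-errors` rule fires even though the experiment did not target it, which is the kind of side effect a parallel check is meant to surface.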

Visualizing Steadybit Events in Grafana

One of the key features of this integration is the ability to see Steadybit events directly within Grafana dashboards through Grafana annotations. You can visualize the exact moments when experiments occur, correlating them with metrics and alerts on your dashboards. This enhanced visibility helps you better understand the impact of your experiments in real time.
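For readers curious about what an annotation looks like, the sketch below builds a payload in the shape accepted by Grafana's annotations HTTP API (`POST /api/annotations`), where `time` and `timeEnd` are epoch milliseconds. The helper names, tags, and text are assumptions made for illustration. The annotations the extension actually writes may look different.

```python
import json
from datetime import datetime, timezone
from typing import List


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(round(moment.timestamp() * 1000))


def build_annotation(start: datetime, end: datetime, text: str, tags: List[str]) -> dict:
    """Build a region annotation body covering the experiment's duration."""
    return {
        "time": to_epoch_ms(start),
        "timeEnd": to_epoch_ms(end),
        "tags": tags,
        "text": text,
    }


# Hypothetical experiment window, tags, and text.
start = datetime(2024, 9, 23, 10, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 9, 23, 10, 5, 0, tzinfo=timezone.utc)
payload = build_annotation(start, end, "Chaos experiment running", ["steadybit", "experiment"])

# This JSON would be sent as the request body with an authenticated POST.
print(json.dumps(payload, indent=2))
```

Because the annotation has both a start and an end time, Grafana renders it as a shaded region on the dashboard, which makes it easy to line up the experiment window with metric spikes and alert state changes.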