Unlocking New Reliability Workflows with the Datadog and Steadybit MCP Servers

16.10.2025 Patrick Londa - 5 min

Leading observability tools like Datadog provide engineering teams with one central plane for reading and visualizing system performance. These aggregated logs and monitoring data are essential for assessing and improving an organization’s reliability posture, and they become even more powerful when they are used in workflows alongside other tools.

With the emergence of MCP (Model Context Protocol) servers, commercial tools can now connect easily with LLM tools like Claude and Gemini to create novel workflows that draw on specific data sets and capabilities. By connecting complementary MCP servers, you can create powerful workflows that would previously have been too cost-prohibitive to attempt with traditional point-to-point integrations.

While observability provides a historical view of your system’s performance, chaos engineering is a proactive approach to testing it. Instead of waiting for certain conditions to occur in production, teams can run chaos experiments that simulate an event or series of events to see how their systems react and respond. With a tool like Steadybit, it’s easy to design these experiments and deploy them safely across designated targets.

Now that Datadog and Steadybit have both launched MCP servers, it’s worth exploring how these two platforms can enable teams to develop innovative combined workflows.

Utilizing Datadog and Steadybit in LLM Workflows

When you connect the Datadog MCP server with Steadybit’s MCP server, you unlock powerful new reliability tactics. AI-powered analysis can consider data and outcomes from both systems, including current health metrics, SLOs, incident alerts, and experiment results.

Datadog brings comprehensive observability data—metrics, logs, traces, and incident history. Steadybit contributes chaos experiment results, system resilience insights, and proactive testing capabilities. Together, they enable teams to create a complete picture of their reliability posture.

To make this more tangible, we’ve created some sample reliability workflows that put both MCP Servers to work with an LLM prompt.

Real-World AI-Powered Reliability Workflows

Here are some practical examples of how SRE teams could leverage both Datadog and Steadybit MCP Servers in combined workflows.

Strategic Experiment Planning
“Review all the critical incidents in Datadog in the last 60 days. Using that info, can you recommend what services we should focus on with our Steadybit experiments to improve our reliability?”

This prompt enables AI to analyze historical incident patterns and suggest targeted chaos experiments. Instead of guessing which services need resilience testing, you receive data-driven recommendations based on actual failure patterns.
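To illustrate the kind of analysis behind such a recommendation, here is a minimal Python sketch that ranks services by critical incidents in the last 60 days. The incident records and their field names (service, severity, created) are hypothetical sample data, not Datadog’s actual incident schema, and in practice the LLM would do this reasoning over data returned by the MCP server.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical incident records; field names are assumptions for illustration,
# not the schema returned by Datadog.
incidents = [
    {"service": "checkout", "severity": "critical", "created": datetime(2025, 9, 20)},
    {"service": "checkout", "severity": "critical", "created": datetime(2025, 10, 2)},
    {"service": "checkout", "severity": "minor", "created": datetime(2025, 10, 5)},
    {"service": "payments", "severity": "critical", "created": datetime(2025, 9, 28)},
    {"service": "search", "severity": "critical", "created": datetime(2025, 6, 1)},
]

def rank_services(incidents, now, window_days=60):
    """Count critical incidents per service within the window, most affected first."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        i["service"]
        for i in incidents
        if i["severity"] == "critical" and i["created"] >= cutoff
    )
    return counts.most_common()

ranking = rank_services(incidents, now=datetime(2025, 10, 16))
print(ranking)
```

Services at the top of this ranking would be natural first candidates for targeted experiments.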

SLO-Based Experiment Design
“Look at all the monitor-based SLOs defined in Datadog, then use each SLO to create a hypothesis that could be used for a chaos experiment, and export them as a list.”

This workflow transforms your existing Service Level Objectives into actionable chaos experiment hypotheses. The AI identifies potential failure modes that could impact each SLO and suggests specific experiments to validate your system’s resilience against these scenarios.
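As a rough sketch of what this transformation looks like, the Python below turns an SLO into a hypothesis statement. The Slo class, its fields, and the example fault are all hypothetical, chosen only to show the shape of a testable hypothesis; an LLM would typically pick a fault suited to each SLO.

```python
from dataclasses import dataclass

@dataclass
class Slo:
    # Hypothetical fields for illustration; not Datadog's SLO schema.
    name: str
    service: str
    target_pct: float

def hypothesis_for(slo: Slo, fault: str) -> str:
    """Phrase an SLO as a pass/fail hypothesis for a chaos experiment."""
    return (
        f"When {fault} affects {slo.service}, "
        f"'{slo.name}' stays at or above {slo.target_pct}%."
    )

slos = [
    Slo("Checkout availability", "checkout", 99.9),
    Slo("Search latency under 300 ms", "search", 99.0),
]

hypotheses = [hypothesis_for(s, "200 ms of added network latency") for s in slos]
for line in hypotheses:
    print(line)
```

Each resulting sentence states a condition and an expected outcome, which is what makes it usable as an experiment hypothesis.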

Reliability Impact Analysis
“Can you create a report, from our first experiment execution in Steadybit to now, that shows the change in how many incidents are occurring in Datadog per month?”

Track the effectiveness of your chaos engineering program by correlating experiment executions with incident reduction. This reduces the reporting burden of tracking and measuring the impact of your chaos engineering efforts on system reliability.
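The core of such a report is a month-by-month incident count starting from the first experiment. Here is a minimal Python sketch of that calculation, using made-up dates; the function name and inputs are assumptions for illustration, and real data would come from the two MCP servers.

```python
from collections import Counter
from datetime import date

def incidents_per_month(incident_dates, first_experiment, today):
    """Return (YYYY-MM, count) pairs from the first experiment's month through today."""
    counts = Counter(
        (d.year, d.month)
        for d in incident_dates
        if first_experiment <= d <= today
    )
    report = []
    year, month = first_experiment.year, first_experiment.month
    # Walk every month in the range so months with zero incidents still appear.
    while (year, month) <= (today.year, today.month):
        report.append((f"{year}-{month:02d}", counts[(year, month)]))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return report

# Hypothetical sample data.
incident_dates = [
    date(2025, 6, 3), date(2025, 7, 10), date(2025, 7, 22),
    date(2025, 7, 30), date(2025, 9, 5), date(2025, 9, 18),
]
report = incidents_per_month(
    incident_dates, first_experiment=date(2025, 7, 1), today=date(2025, 10, 16)
)
for month, count in report:
    print(month, count)
```

A downward trend here is a correlation rather than proof of cause, so it is worth reading alongside other changes made during the same period.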

Incident-Driven Experiment Creation
“Use the last critical incident in Datadog to create a new experiment for Steadybit to run, written in JSON, that replicates the system conditions that caused the issue.”

Transform post-incident learnings into proactive resilience testing. By analyzing the conditions and details of the most recent critical incident, an LLM can generate a specific experiment in JSON that replicates the conditions that led to the failure. With this new experiment generated, you can run it regularly as a regression test to validate that a fix has successfully been implemented and remains effective over time.
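To give a feel for the output, here is a small Python sketch that turns an incident summary into an experiment definition and serializes it as JSON. Every field name below is illustrative and is not Steadybit’s actual experiment format; the incident details are invented sample data.

```python
import json

# Hypothetical summary of a past incident, as an LLM might extract it.
incident = {
    "title": "Checkout timeouts during database failover",
    "service": "checkout",
    "trigger": "database connection loss",
    "duration_seconds": 120,
}

def experiment_from_incident(incident):
    """Build an illustrative experiment definition (not Steadybit's schema)."""
    return {
        "name": f"Regression: {incident['title']}",
        "hypothesis": (
            f"{incident['service']} recovers without customer-facing errors "
            f"when {incident['trigger']} recurs"
        ),
        "target": incident["service"],
        "steps": [
            {
                "fault": incident["trigger"],
                "duration_seconds": incident["duration_seconds"],
            }
        ],
    }

experiment_json = json.dumps(experiment_from_incident(incident), indent=2)
print(experiment_json)
```

A human reviewer would still check the generated definition against the incident postmortem before running it on real targets.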

Lowering the Chaos Engineering Learning Curve

With a natural language interface, it’s easier for engineers of all skill levels to make specific queries and get meaningful results. For chaos engineering beginners, this method enables interactive learning as they are able to ask questions of the LLM directly. While it’s not a substitute for team knowledge sharing and training, it can help lower the learning curve.

Veteran engineers with plenty of chaos engineering experience can now more rapidly ask questions without the limits of query languages or complex dashboards. Ideas that would have required weeks or months of planning can be prototyped much faster. This democratizes access to reliability insights across the entire team.

Uniting chaos engineering insights from Steadybit with comprehensive observability data from Datadog creates a powerful feedback loop. Each experiment shows how your systems respond under a specific stress, and you can then validate resilience improvements against real-world observability data.

Getting Started with Datadog and Steadybit Workflows

While some teams are opting for fully agentic SRE tools, MCP servers offer an approach that keeps a human in the loop, so an expert can learn from each insight and make the decisions that follow. The combination of Datadog’s observability expertise and Steadybit’s chaos engineering platform creates a powerful foundation for building more reliable systems and more resilient teams.

Ready to build your own LLM-powered reliability workflows?

You can get started by scheduling a demo with Steadybit to see what a proactive reliability program can look like for your organization.