In this episode, Benjamin Wilms chats with Casey Rosenthal, “The Chaos Engineering Guy”, about what it takes to develop a proactive approach to reliability.
Benjamin Wilms: Hello, Casey. Thank you very much for joining me on my new podcast. Of course, you should introduce yourself a little bit; you can talk about your journey and your experience in the chaos space. I know most people should already know you as “the chaos guy”, and what I’m trying to achieve with my podcast is to talk to experts from the space.
I was already able to talk to Adrian Hornsby, and we explored why it’s often hard to get started with chaos engineering. In the last episode, I talked to Russ Miles, and we talked more about the cultural aspects, what a company needs to provide to be really successful in the chaos and reliability space. And now I’m happy to welcome you. Maybe you can say something about yourself.
Casey Rosenthal: Awesome. Yes. Thank you. It’s a pleasure to be here, and an honor to be invited. I have maybe three decades in high-availability systems. That makes me sound really old, but I suppose I am at this point. About ten-ish years ago, I was hired at Netflix.
I had the equivalent of an availability czar role there, and one of the things that fell under my purview was a charter for a chaos engineering team. At the time, chaos engineering was a buzzword from a series of blog posts written years earlier at Netflix. Not a lot was being done with it outside of a program called Chaos Monkey that was running.
So it was up to us to figure out what chaos engineering was, how to build a team around it, establish our own success criteria, and all of that. I went around the company, and then around some other industries in Silicon Valley, and asked: so what’s chaos engineering?
And the answer I usually got was, oh, that’s when we break stuff in production.
And as cool as that sounds, it’s a horrible thing to do in practice for no purpose. So instead, we sat down, metaphorically locked ourselves in a room for a month, and looked at resilience engineering in particular. We borrowed a lot from existing research into critical systems, high availability, and reliability outside of software: nuclear power plants, air traffic control, shipping, railroads, military operations, things like that. We found that there was a ton of research outside of software that just hadn’t been recognized or brought into software.
So we borrowed heavily from that material, from Sidney Dekker in particular. Lorin Hochstein was on the team, so he had a lot of input into the development of the definition of chaos engineering, which we put up at principlesofchaos.org. The definition is still up there, translated into a dozen or so languages now. And so that became chaos engineering. Then I went on to build and manage that team at Netflix for three years.
We built some of the most sophisticated chaos engineering tooling that’s out there right now.
Benjamin Wilms: Are you referring to ChAP, the Chaos Automation Platform? Yeah. Nice.
Casey Rosenthal: And yeah, there are a couple of companies that have similar things. I would say Netflix’s implementation is still the gold standard. There are maybe a handful of companies that come close.
But again, Netflix’s implementation is fairly unique. Well, they’re all very bespoke. And from there, to meet hiring goals at Netflix, we started evangelizing it across the industry. Gartner tells me that 80% of the major banks in the US now have dedicated chaos engineering teams, which is good.
So adoption is still increasing. I started conferences for chaos engineering, I wrote the book on it, and that’s how I came to be “the chaos engineering guy”.
Benjamin Wilms: Nice. Nice. And before we zoom in on, let’s say, your key learnings from Netflix, I want to highlight a personal experience from when I was getting started with chaos engineering. I mean, don’t talk about your age, but I guess it was around nine years ago. Around seven years ago, I released the Chaos Monkey for Spring Boot Applications on GitHub, and beside all the buzz, there was one message on Twitter, a message from you, where you told me, “Hey, you’re on a good track.” That was a very special moment for me, because at that point in time I was just an engineer sitting in Germany releasing something on GitHub, and then, wow, there’s something more.
So this was a very special moment for me. But now let’s zoom in on your key learnings from chaos engineering at Netflix. Maybe you can share some “aha” moments.
Casey Rosenthal: Yeah, sure. The empty space between what is currently considered the industry norm and what we now know can be done is huge. If you look at normal industry practices, banks in particular have a high uptake of chaos engineering, so maybe put them aside and look at industries as a whole. Some of this is corroborated in the EU, but I’ll speak from a US-based perspective. For the majority of companies, and everybody’s a software company at this point, the best practices or normal practices around reliability, uptime, and availability are either provably misleading, which is unfortunate, or reactive. If you think about all of the fields in incident management or around reliability, they’re all about waiting for something bad to happen, an incident, and then asking: how do we deal with that?
How do we analyze it? How do we prevent that particular thing from happening again? All of which, by definition, has to take place after the incident. So none of the current best practices or software norms are proactive, and I would say this is true for testing as well. Again, I’m not saying not to do any of these things, but testing relies on an engineer to think of the things that they could imagine going wrong,
and then to codify a check, an assertion, around the validity of that outcome, based on something that they know about the system, which means they have to know it about the system.
Benjamin Wilms: Yeah, and the system is the important part. An engineer is focused on his own space, his own work, but a system is much more than one tiny, let’s say, microservice. It contains a lot more moving parts on the technical side, but also on the organizational side: the people, the processes, everything that is needed to run and operate such a system.
So you need to test it at a large scale.
Casey Rosenthal: Yeah, and even so, I’ll draw a distinction between testing and experimentation, but even that is still a reactive thing, when somebody writes a test about something that they know the system already does. So I think that’s the biggest thing.
It’s also the thing that attracts me to this space, because there are so many things that are intuitive about making something more reliable that turn out to be incorrect. For example, you think, oh, if I just spend more time being really precise about what the system does, then I’m going to make it more stable.
But the interesting outages are things that you couldn’t anticipate, so sitting and putting brain power on it generally doesn’t make a system safer. Or you think, oh, well, if I put a bunch of guardrails on something, that’ll keep it on the road. For systems under competitive pressure, guardrails tend to limit the behavior and experience of the people who operate the system,
and that behavior and experience is what saves those systems from catastrophic outages or failures. So guardrails tend not to improve the reliability of a system over time, particularly for critical outages. And there’s a lot more of that flavor of “oh, it just makes sense”: let me just document everything. Nope. Doesn’t help. Well, let me just measure how quickly stuff is remediated and then I can shorten that time. Nope. Doesn’t work. So a lot of these things that intuitively feel correct are, again, provably wrong. Root cause analysis would be another one:
“Oh, let me just find the one thing that went wrong in this case.” Nope. It will not make your system more reliable. Instead, chaos engineering is part of a shift to an empirical practice, an experimentation practice, basically what the success of the healthcare system in the United States, and in Western culture, is based on: discovering things that you didn’t know about a complex system and then using that to adjust behavior over a longer timeframe. Those small shifts and small learnings lead organizations that embrace them to a very different place than the intuitive quick fixes that, again, most of the industry still resorts to.
Benjamin Wilms: Yeah, so that’s maybe a good opener for the next question. If people want to get started with chaos engineering and you talk to engineers, they’re quite clear on what they’re getting out of it. But then there are also people at higher levels, and they don’t really know what they’re getting out of it.
So why should I break something? What is the outcome?
Casey Rosenthal: Yeah, so I would say first of all, you’re not necessarily breaking something, right? Chaos engineering is a lot more than just failure injection.
Benjamin Wilms: Correct.
Casey Rosenthal: Failure injection could be part of it, but it doesn’t have to be, and all of that is part of a broader thing called continuous verification. That’s the broader umbrella for reliability efforts like this, or the gold standard for reliability efforts.
So I think you asked two questions. One is how to get started, and the other is how do you justify this to upper management or executive-level decision makers who are further away from the engineering? Starting with the second question, how do you justify this: the ROI is in the outcome. I’ll give you an example.
When we started evangelizing this and running conferences, the tech companies were engineer-led, so they were quick to jump on board. The first conference that we ran had a lot of people from FANG, LinkedIn, and other big companies in Silicon Valley, and they kind of got it. But there were some people from banks represented, and they would say things like, “Oh, well, you’re talking about advertising and media. These experiments sound dangerous, and if you go down, it’s no big deal, but we have money on the line. We can’t do the same sort of experimentation that you’re talking about.” And so we would ask them, “Okay, so you don’t have outages anymore?”
And they’d say, “Well, no, of course we still have outages.” Okay, well, if you still have outages, are you just going to keep doing the same thing and expecting a different result? Because we found a different way to work this system, and instead of running into that brick wall, maybe we can help you run around it, or give you a light so you can see the wall before you smack into it. It turns out that was a convincing argument, and a lot of the banks jumped on it. The following year, we had a similar argument from healthcare companies, where they would say, “Hey, media and advertising money, that’s one thing. But we can’t experiment with our systems because lives are on the line, human lives.” And again, same argument: “Do you still have outages?” “Yeah.” Okay. Well, in this case there was a stronger connection, because we could say, look, we built this practice on the empirical principles of Western science, the quintessential example of which is the double-blind clinical study. Healthcare companies already do chaos experimentation.
They just call it something else. They do it with lives on the line in medical trials. That’s part of their process; in fact, it’s a requirement for broad approval of a drug through the FDA here in the United States. So this practice of empirical experimentation already saturates our culture.
It’s all around us. It’s really just kind of a semantic shift to see how it could be applied in one situation versus another. And again, this might be something that feels intuitive but isn’t actually real. I occasionally run into people who push back on the term “chaos engineering”. They say, oh, well, I understand how chaos engineering is important, and it’s not that we’re engineering chaos, it’s engineering around the chaos,
but I could never sell it internally, because there’s no way my boss will go for something called “chaos engineering”. I’ve heard that a lot, and I’ve asked to be put in touch with those people, and I don’t think they actually exist.
I have yet to meet somebody who said, “I won’t do that because it’s called chaos engineering.” I’ve heard from a lot of people who are anxious that it will be blocked because somebody else won’t go for it. But I’ve never actually met the person who says, “No, no, no, that word scares me. I’m not doing it because I’m afraid of the name.”
Benjamin Wilms: I can share a story with you, but not the name of the company, who told me, “Okay, yeah, we know it’s important, we want to do it, but we already aligned internally and there are some decision levels. We shouldn’t call it chaos. We need to find another word, but we want to do it.” It really depends on who you are talking to, and people are still, as you said, afraid of the word “chaos”, or afraid that you will produce a lot of chaos. But it’s the opposite. You’re right.
Casey Rosenthal: Yeah, and again, I’ve gotten that pushback, and from my point of view, I don’t care what you call it, if the goal is to improve your availability,
Benjamin Wilms: stay on the principles,
Casey Rosenthal: But yeah, I still haven’t met that person. I still think it’s a group concept of, oh, it makes me nervous because it’s going to make somebody else nervous.
Benjamin Wilms: Yeah. Yeah, I agree.
Casey Rosenthal: I don’t think there’s an actual person who can’t have it explained to them in three seconds. “No, actually, your system’s already chaotic, right?” Yes. Okay. We’re going to engineer around that. “Oh, okay.” So, almost ten years into this now, I still haven’t met the actual person who said, “No, I don’t like that name. I’m not doing it because of the name.”
So I think the more important thing is to tie it back to ROI. A lot of chaos engineering programs are implemented in the wake of an outage, because an executive mandate comes down that says, “Hey, what we’re doing isn’t good enough.”
Okay, how do you do something better? How do you do something proactive? Well, this is kind of the go-to proactive strategy, so generally that’s how chaos engineering gets kicked off in a way that gets executive adoption. And I think your other question was: how do you get started?
I think tabletop exercises and game days are the best way to put your toe in the water. It can be a long path to a mature practice, where you have continuously running experimentation that’s smart enough to grab the attention of operators when something is unusual, going wrong, or having an aberrant or unwanted systemic effect. Starting is just, “Okay, can we prod our system in as safe a way as we can and see that it does what we expect it to do?”
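As a minimal sketch of that “prod the system safely and see that it does what we expect” loop, a first game-day style experiment could look something like the Python below. The metric reader, the fault-injection hooks, and the 1% error-rate threshold are hypothetical placeholders standing in for your own observability and tooling, not part of any product discussed in this conversation.

```python
import time
import statistics
from typing import Callable


def steady_state_ok(read_error_rate: Callable[[], float],
                    samples: int = 5, threshold: float = 0.01) -> bool:
    """Hypothesis: the mean error rate stays below 1% under normal conditions."""
    readings = [read_error_rate() for _ in range(samples)]
    return statistics.mean(readings) < threshold


def run_game_day(read_error_rate: Callable[[], float],
                 inject_fault: Callable[[], None],
                 revert_fault: Callable[[], None],
                 soak_seconds: int = 60) -> bool:
    """Check steady state, inject a small fault, check again, always revert."""
    if not steady_state_ok(read_error_rate):
        print("Steady state not met; aborting before injecting anything.")
        return False
    inject_fault()                   # e.g. add 100 ms of latency to one downstream call
    try:
        time.sleep(soak_seconds)     # let the fault soak while normal traffic flows
        holds = steady_state_ok(read_error_rate)
        print("Hypothesis holds." if holds else "Hypothesis violated: investigate.")
        return holds
    finally:
        revert_fault()               # roll the fault back even if the check fails
```

The important part is the shape of the loop: a stated expectation, the smallest fault that could challenge it, and an automatic abort and cleanup if the expectation is not met.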
Benjamin Wilms: Correct. And you mentioned it earlier: this continuous verification, where you start with your expectation and you would like to see that the system still works under many different conditions. Can you maybe provide some practical examples of what verification-driven development can look like?
Casey Rosenthal: Sure. Continuous verification is analogous to CI and CD. CI gave us all these advantages because the theory is that bugs get limited the faster you can jam everybody’s code together, right? The code is the common understanding, so anybody writing in isolation is going to get less aligned. So CI just jams it together. Great.
CD gets it in front of customers, or end users, faster. So again, it’s shortening feedback loops. And CV asks: okay, now that it’s running, how do we shorten the feedback loop on systemic behaviors? That could look like a lot of things, but for infrastructure, for example, imagine you have… I don’t know, pick a piece of infrastructure.
Benjamin Wilms: RDS from AWS.
Casey Rosenthal: Okay, so what kinds of properties would you expect from that service?
Benjamin Wilms: Good performance on writes and reads.
Casey Rosenthal: Okay, and typically what would you be putting in and fetching out?
Benjamin Wilms: Let’s say customer data, customer attributes.
Casey Rosenthal: Okay, just for people who aren’t familiar: what shape of data?
Benjamin Wilms: Hmm. Help me please.
Casey Rosenthal: So, JSON, SQL, large objects, something like that?
Benjamin Wilms: Let’s keep it simple: SQL.
Casey Rosenthal: Okay, great. So you have a bunch of SQL queries, and there’s some relational data that you want to put in and get out. Now imagine you have the best observability in the world hooked up to this. Most of the time, it’s performing great.
Benjamin Wilms: Mm-hmm.
Casey Rosenthal: So even with the best observability in the world, it’s still kind of a binary signal.
It’s working until it isn’t, and then you’re back in reactive land. So to be proactive about it, what can we do? We can flood it with certain kinds of traffic, like many simultaneous connections. Since it’s SQL, payload size going in probably isn’t much of an issue, but coming out it is, so you can vary the response size, the data sets that you get back out.
You could have some SQL queries that take longer than others, and then there’s the number of requests per second. Right? So say we’ve got those four things, four variables that represent different properties of how the system can be interacted with.
What you want to figure out is: where does this thing tip over? Where does it stop working the way your end users want it to work?
Benjamin Wilms: And it shouldn’t be a binary yes or no. There needs to be a phase where we can see, okay, it’s shifting in the bad direction, rather than just good or bad.
Casey Rosenthal: Right, so we want to know what that boundary is before the end users hit it. There are two ways we can stress the system.
One is with live traffic, which we can funnel into hot points. The other is synthetic load. It doesn’t really matter to me which of those you pick; obviously the live traffic is a little more authentic, if you can do traffic shaping. But imagine you’ve got certain treatments now, where you say, okay, I’m going to vary these four variables, and what I’m going to look for is, since it’s an AWS service, rate limiting, or errors, or timeouts. Ideally you’re looking to find a statistically significant increase in one of those things before applications start failing.
Presumably your application has some sort of retry logic, so if something times out, it’ll just try again. Great. So you establish: here’s my baseline rate of timeouts, or latency, or…
Benjamin Wilms: Connections?
Casey Rosenthal: Rate limiting, load shedding, yeah. And now I’m going to gradually increase the payload size coming back, right? That’s one of the variables. At some point, one of those observations is going to statistically increase, and you go, okay, now I’m approaching the limits of what this infrastructure can handle.
So then you taper that back down, then increase the number of concurrent connections and find the same point. Note that, decrease it, then increase the requests per second and find the same point. Now you can kind of triangulate, or draw a line around those points, and say, okay, this is the map of where this piece of infrastructure fails.
As long as I’m inside there, in terms of requests per second, SQL sophistication, and payload size coming back, as long as I’m within those boundaries, I’m generally safe. I’ve empirically proved that. Once I approach one of those limits, I’m going to be in trouble. This is a level of detail and operational understanding that almost nobody in the industry has about their infrastructure, so continuous verification establishes those boundaries. And then you can do all sorts of interesting things with that.
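As a hedged illustration of the sweep Casey is describing, the sketch below ramps one of those variables (requests per second) and reports where an error signal rises significantly above its baseline. The `run_load_and_measure` callable is a hypothetical stand-in for your own load generator and observability stack, not a real API.

```python
import statistics
from typing import Callable, List, Optional


def find_tipping_point(run_load_and_measure: Callable[[float], float],
                       baseline_error_rates: List[float],
                       start_rps: float = 50.0,
                       step_rps: float = 50.0,
                       max_rps: float = 5000.0,
                       sigma: float = 3.0) -> Optional[float]:
    """Ramp requests per second until the measured error/timeout rate rises
    more than `sigma` standard deviations above the baseline."""
    mean = statistics.mean(baseline_error_rates)
    stddev = statistics.stdev(baseline_error_rates)
    rps = start_rps
    while rps <= max_rps:
        rate = run_load_and_measure(rps)   # drive load at `rps`, return observed error rate
        if rate > mean + sigma * stddev:
            return rps                     # approximate boundary for this variable
        rps += step_rps
    return None                            # no tipping point found in the tested range
```

Repeating the same ramp for concurrent connections, response payload size, and query complexity, one at a time with the others held at baseline, gives the set of points that can be drawn into the failure map described above.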
You can set SLOs around those boundaries, you can set alerts around them, so that if you approach them, you know you’re headed into troubled water. That’s proactive. And then the continuous part: you do this every day and track it over time, and you’ll be able to detect changes in your infrastructure, or in your complex system interactions, through simple change point detection when something else happens. AWS brings up another AZ, or you get a bunch of noisy neighbors on the underlying fabric at Amazon, or whatever, and you’ll be able to notice that or be alerted to it.
And again, that’s just a signal: “Hey, operator, something’s different in your infrastructure. You should probably pay attention to this.” That’s a level of insight that almost nobody right now gets into their infrastructure.
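For the continuous part, one very simple way to turn those daily measurements into the “something is different in your infrastructure” signal Casey mentions is to compare the most recent boundary values against their history. Real change point detection would use something more robust, but the sketch below, with made-up numbers, shows the idea.

```python
import statistics


def boundary_shifted(history: list, recent_days: int = 3, sigma: float = 3.0) -> bool:
    """Flag a shift when the mean of the last few daily boundary measurements
    (e.g. the RPS tipping point found each day) moves more than `sigma`
    standard deviations away from the older history."""
    if len(history) <= recent_days + 1:
        return False                     # not enough history to compare yet
    older, recent = history[:-recent_days], history[-recent_days:]
    mean, stddev = statistics.mean(older), statistics.stdev(older)
    if stddev == 0:
        return statistics.mean(recent) != mean
    return abs(statistics.mean(recent) - mean) > sigma * stddev


# Hypothetical example: the daily RPS limit drops after an unannounced network change.
daily_rps_limit = [2000, 2050, 1980, 2020, 2010, 1995, 1200, 1150, 1180]
print(boundary_shifted(daily_rps_limit))  # True: time for a human to go investigate
```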
Benjamin Wilms: Yes, I agree. And the important part is that you need to do it continuously, because your system is changing all the time. You are getting new users, you are building new features, you are deploying a bug fix, whatever… your system is continuously moving.
Casey Rosenthal: Yeah, and it could be changing in ways that have nothing to do with what you or your team are doing. One of my hobbies is collecting failure stories, so I’ll give you a failure story that highlights this. There was a large grocery retailer here in the United States.
They did point-in-time testing for a Kafka cluster and determined they had 50% overhead in capacity. So everything was right according to industry norms: they were using this much capacity, they had a bunch of overhead, and they knew that their growth was going to be steady, with a small acceleration.
So they had years before they would run into the ceiling on the capacity they had, and they did that load testing and performance testing to establish the overhead before they set up the project. Then, some months later, the system just completely crashed hard. They looked back and found out that, a month prior to the crash, that ceiling had come way down: somebody in a completely different department saw the overhead and decided to change the network segmentation to use some of that network for something else.
Benjamin Wilms: Ouch.
Casey Rosenthal: Yeah. Completely different department, completely different team.
There’s no way that they could have known about each other’s plans. This is a very large, Fortune 100 company, so there are lots of people, and there’s no way that they could have coordinated that in advance. So from the infrastructure team’s side, because they didn’t have a continuous signal about the performance and capabilities of their infrastructure, there’s no way they could have seen that coming.
The Kafka team isn’t going to sit there and go, oh, well, let me ping the folks over in networking every day to see if they came up with a new plan to change the segmentation. And if you’re on the cloud, if you’re on AWS, AWS could do something that they’re not going to tell you about, that could change the performance of your underlying infrastructure, and that’s a foundational component of your complex system. So you’re missing a lot if you’re not doing something that continuously updates your mental model of how these things are performing.
Benjamin Wilms: And talking about this specific incident, I really love the topic of learning from incidents, so what was the learning from this incident?
Casey Rosenthal: What was it or what should it have been?
Benjamin Wilms: Maybe both.
Casey Rosenthal: Yes, I believe the learning from that incident was: oh, we need to put our network engineers in touch with the people who run infrastructure out of those data centers. Is that the best learning? I don’t think so.
That’s okay, but then they could run into the same problem with the team that runs power, electricity, for the data center, and they wouldn’t know that either. So enumerating that list isn’t going to be particularly helpful. I think the better learning would be: okay, we need to build a system that gives us this kind of continuous signal back, so that if there is a change like that, we can go investigate and figure out why it happened.
Oh, it’s the network team. They were able to identify it after the outage, so it wasn’t that the information was opaque or obscure; it just wasn’t brought to their attention. And the way it should be brought to your attention is that you should be able to empirically see a different signal or behavior in the complex system. So I believe that should be the takeaway: how do we notice when the system is behaving differently, so that we humans can go investigate what the difference is and then decide what to do about it?
Instead, they took a defense-in-depth approach, which was: oh, we had an outage because of that, so we’ll put a process in place to make sure that outage never happens again. Again, that intuitively makes sense. It’s not a great strategy for making your system more reliable.
Benjamin Wilms: Not in the long term, absolutely. Talking a little bit more about incidents in general: how should people react if there was an incident? You can learn a lot from it, but there are still companies where an incident is something bad and no one wants to talk about it.
So let’s continue: what is the best approach to handling an incident?
Casey Rosenthal: Oh, I don’t know if we know what the best approach is yet. We know what bad approaches are. So here’s the kind of controversial statement for companies that still do this: if you’re doing RCA, root cause analysis, the best case scenario is that you’re wasting your time.
Benjamin Wilms: Yep.
Casey Rosenthal: The worst case scenario is that you’re fundamentally undermining the things that actually make your system operate most of the time. I’ll quickly illustrate that with an example. There are so many examples to pull from, but say you have a system that goes down because somebody changed one line in a configuration file.
This is kind of a common counterpoint to RCA. You can assign blame after the fact and say, oh, it’s this line.
It was written by John. It’s John’s fault. That’s the root cause. The root cause is John. Just to show how absurd this is, you could ask, okay, well, what was happening in John’s life at that point? He was going through a messy divorce, so it’s actually the divorce lawyer’s fault, right?
There’s really no logical end to that. It’s arbitrary where we stop telling the story. But here’s the worst part in terms of value to the company, the organization that’s doing this. Say you found out that it’s John’s fault. Do you think you’re going to avoid an incident by shaming John, or slapping him on the wrist, or having everybody stand in a circle, point at John, and say, don’t do that again?
No, you’re not. But that’s the point of RCA: to drill down and find the smallest thing, the smallest person in the hierarchy, to blame for what caused an incident.
Flip it around. You could say, oh, well, it’s John’s boss’s fault for not hiring better, say. Or you could go to the director and say, oh, well, it’s the director’s fault for not setting better context with the manager. Or it’s the VP’s fault for not putting more funds into resilience efforts, or into systems to make it more reliable.
The SVP: again, fund allocation. The CTO: well, they didn’t align the technical organization around the things that are important to them. The CEO: they didn’t express to the rest of the company why it’s so important for the system not to go down, in terms of business continuity. Now those are completely different stories,
any one of which you could choose. Again, RCA points to the first one, which does nothing to prevent an outage. If you go to the CEO, the CEO can move mountains. So if you actually want to prevent or deter additional outages, you want to work on the CEO, which means RCA is going in the wrong direction.
What you want is something that goes up instead and asks: how can we readjust the hierarchy and the entire system to have a better outcome? And the CEO doesn’t care about one line in a configuration file, and they shouldn’t. So at best, that exercise of finding the line in the configuration file and concluding that it was the root cause was a waste of time.
At worst, you shamed somebody, and that had a negative impact on company morale, made people more prone to hiding things instead of sharing them, and set off all sorts of chain reactions that are bad for a company’s culture. That’s the worst case scenario. But again, the best you can possibly get from it is that you wasted your time by pointing out something that’s not going to prevent any future incidents.
Benjamin Wilms: And then there’s zero learning in it, because what you’re getting out of it is… nothing.
Casey Rosenthal: So, a better way to go about incidents, just as an example, because people say, “Well, I do RCA and I feel like we learn stuff from it. So what else is there?” Here’s one counterexample, and there’s an infinite number of other ways you could do it.
One counterexample would be to have an investigator go around and interview the people impacted by the outage. They’re just there to capture the different narratives and notice the differences between them. Then bring everybody who’s impacted back into a room and facilitate a discussion, so that person A will say, well, I did these things because they made sense.
And person B says, oh, you did these things because they made sense to you, but look at how your understanding of the system was different from theirs. Oh, yeah. Okay. I didn’t know that that system had a fallback. I didn’t know that they had a requirement that their network segmentation would be adjusted if there was excess capacity.
Right. Okay. Now you learn something about how the organization operates that you wouldn’t have known before. And then person C chimes in: “Well, actually, that’s not how it works. There’s actually this other element that you didn’t see.” Okay. Now everybody is updating their mental models together, constructively, to form a better overall understanding of how the complex system works.
That’s going to get them further in terms of optimizing the system for what it has to be optimized for, and avoiding bad outcomes, than anything that happens in the RCA in terms of “config files aren’t read this way” or whatever. So that’s an example of a different way to do it. But again, the most valuable outcome would be going up the organizational level and asking, okay, well, why do these people have different understandings of the system?
Why are we spending our time moving the technology in this direction if customers actually want it optimized for these other properties instead: uptime instead of features, or uptime instead of cost, utilization, performance, or whatever? So again, there’s a near-infinite number of ways you could handle a post-incident review or analysis. RCA is to be avoided.
Benjamin Wilms: Yeah, so don’t obsess over the root cause. Instead, improve how your sociotechnical system really works, how everyone can be a part of it, and how to improve your system continuously.
Casey Rosenthal: Yep.
Benjamin Wilms: Let’s take a look at the future, let’s say the next two to three years. Do you see any big trends coming up in the space of resilience and reliability engineering?
Casey Rosenthal: Yeah, I do see adoption of continuous verification increasing, not always under that name. Systems for which high availability is critical, healthcare systems in particular, but also some financial ones, are by necessity exploring this route of: okay, how do we understand our infrastructure before something bad happens? How do we understand what the boundaries are before we end up driving over the edge of the cliff? So that’s good. I think the thing that is consuming most people’s attention right now is: how does AI help or hinder this effort? That’s TBD, right?
It could branch in a couple of different directions. I’m seeing good applications of LLMs, in particular summarizing events and information so that it’s easier for people across an organization, we’re all too busy and have limited attention, to quickly consume it: “Oh, okay. That’s generally what happens to that thing over there,” and again, that helps me update my mental model of the whole system. There are areas of AI that are problem solvers; LLMs are not that. So on the bad side, I would say we’re going to see a lot of funding continue to go into startups that say: we can automatically do RCA, we can automatically find incidents before they happen.
LLMs aren’t that technology, so the AI that is currently sweeping over the planet doesn’t help with that. And at the end of the day, reliability, and resilience, the property of something that can react to a new circumstance and steer it towards a more desirable outcome, is a property of improvisation, which right now only humans can do.
So again, LLMs don’t do that. They can hallucinate suggestions, but they are very poor at having the business context necessary to correctly improvise, and that’s what we need humans for. So you’re going to see a lot of money poured into AI that automatically resolves incidents, and LLMs are a very poor fit for that. But there might be a space for LLMs to very quickly ingest or digest a data set that’s too big for a human to digest, and give them a signal about the current state of the system that helps the human improvise better.
Benjamin Wilms: Yeah, absolutely. More like a summary of all the signals, all the noise out there, to get a nice overview of what’s going on in my system.
Casey Rosenthal: Yeah, so there’s lots of opportunity, I think, where AI could help. But as with probably most of the AI implementations that we’re seeing now, just numerically, given the number of startups coming up, most of those are probably a bad idea in the space of resilience.
Benjamin Wilms: Don’t let the VCs know that… Anyhow…
Casey Rosenthal: I’m happy to talk to VCs about their technical investments; I do consult with VCs on this. But yes, I have not steered any VCs towards the companies that are making those kinds of claims.
Benjamin Wilms: Alright, so, last question. First of all, I really enjoyed our conversation, I learned a lot again, and thank you very much for being my guest. When people want to reach out to you, when people want to talk to you, what is the best way to get in contact with you?
Casey Rosenthal: Oh, I’m on LinkedIn. That’s a pretty easy way to find me. Bluesky is generally where I hang out in public now. I try to stay away from X, and I don’t think I’m on any of the other social things, so that would be the way to do it.
Benjamin Wilms: Cool. Then again, thanks for joining me, and I’m looking forward to seeing you in real life again.
Casey Rosenthal: Right on. Thanks, Benjamin, and I am super glad that you continued with that Chaos project and continue to advance the field. Thank you very much.