Talks & Chats

Episode 1: Tackling the Prevention Paradox with Adrian Hornsby

Experiments in Chaos • Jun 2025

In episode 1 of “Experiments in Chaos”, Benjamin Wilms sits down with Adrian Hornsby, a leading expert in the chaos engineering space, to discuss common mistakes, best practices, and the future of the practice.

Episode Transcript

Benjamin Wilms: Hello, hello, hello. Welcome to my first episode, and my guest for this first episode is Adrian. Adrian, really great to see you here. I’m really excited to chat about chaos engineering in general, about your learnings, what you have done at AWS, and what you’re planning to do next.

There’s a lot we can talk about and maybe you can introduce yourself a little bit.

Adrian Hornsby: Cool. Hi Ben, thanks for having me on your podcast. I hope it’ll go well. I had a little problem joining, but I think we are all good now. I see everything is uploading.

Benjamin Wilms: We are chaos guys. So what should go wrong, man?

Adrian Hornsby: What could go wrong? Yeah, exactly. Cool. So yeah, I’m Adrian Hornsby. I’ve been about 20 years in the industry, on a somewhat unconventional path. I started in research at university, doing some postdoctoral studies, then went to Nokia Research, then to different startups and companies, and eventually ended up at AWS. At AWS I went from the sales organization as a solutions architect, then to marketing, where I was a developer advocate, and then to the engineering org, where I went to build the AWS Fault Injection Service. So there are a lot of different things, but I think the common denominator is that I’ve touched a lot of different parts of software engineering, especially on the backend side.

I’ve even built applications on Android at some point, but mostly I’ve worked on the backend, and all of the time on AWS, actually from the very beginning, which was a luxury, I think. So yeah, I focused a lot on scalability, reliability, and the overall resilience engineering aspects of systems and distributed systems as a whole. Eventually I started to build open source tools for chaos engineering, and that really took off, a lot of people got interested in them, and then I was given the opportunity to build the FIS service for AWS. I just recently left, actually a month ago, so now I’ve started to build my own business, a consulting business, and also some surprise stuff on the side.

We’ll see what, what comes out of it. But yeah, this is me in a nutshell.

Benjamin Wilms: Nice, nice. And in your introduction, you mentioned something quite important, especially in the world of chaos engineering. Most of the time, when people are getting started with chaos engineering, they talk about the attacks they want to do, the technology, but there’s way more involved if you’re really doing chaos engineering at large scale, which also means it’s not only about technology.

Adrian Hornsby: It’s rarely about technology, I would say, actually, it’s rarely about that.

Benjamin Wilms: In your past, in your experience, have you seen situations where companies weren’t able to really be successful with chaos engineering, or maybe got a little bit stuck, or weren’t quite sure how to really get the most out of it?

Adrian Hornsby: Yeah, that’s a great question. I think we could explore that question for almost a whole day, but there are a lot of different aspects to the journey, right? Getting stuck in chaos engineering really depends on where you are in the journey. Some companies actually struggle to start because they overthink it.

They think it’s like landing on the moon and they need to prepare for everything, they need to have everything perfect. They think they need to do this in production, so they get really scared, so they don’t even start. I think there’s this big group of companies that just struggle to start. But you can go even one step before that: some companies don’t even think it’s necessary. There’s something quite interesting from a psychological point of view called the prevention paradox, where it’s difficult for companies to invest in something related to prevention, because if it’s successful, nothing happens, right?

For example, the year 2000 bug, right?

Benjamin Wilms: Oh, yes.

Adrian Hornsby: Companies spent years preparing for it, and then, you know, the year 2000 came, and it mostly went very well because everybody had done a great job and had prepared very well, and then people said it was…

Benjamin Wilms: That’s it?

Adrian Hornsby: What, nothing happened? Like it was all overblown, right?

That’s the prevention paradox: you do the work, nothing happens, and then people say it was not worth it, right? We shouldn’t have done it, it was a bad investment. So that’s the prevention paradox. Often those companies don’t even start, don’t even think about chaos engineering.

Right? So then you have those that think it’s important but are scared to start, and then you have those that are, I call them, the grassroots. You have chaos champions inside the companies. They do those experiments at their team level, or sometimes it goes a little bit further than the team, but then they don’t get leadership support, and it doesn’t scale through the organization or the company as a whole. That’s kind of the spectrum of difficulty.

Benjamin Wilms: What would you say, especially regarding leadership: is it sometimes about the way the value is communicated from the tech people up to leadership, so that leadership isn’t really able to see what’s in it for them?

Adrian Hornsby: Yeah, I think there’s a big part of that. It’s sometimes difficult for engineers to speak the language of leadership. Typically leadership loves metrics, yes, so we need to find ways to get metrics that can support the need for chaos engineering, right?

And the problem is that metrics come with their own problems, so there’s a misalignment here. You really need to make a lot of effort to align the metrics from your chaos efforts with the language of leadership. But sometimes it’s even before that: the leadership just doesn’t see the value of investing in resilience, or they have other priorities.

So you need to find out what priorities your leadership is looking to work on this year or over the next few years and try to align your program with that. Because I think the big problem I see is that people tend to think chaos engineering is only about reducing unknowns and faults, but there are a lot of other benefits to it, which I’m surprised people don’t talk more about. An amazing benefit is practicing runbooks, or operational readiness. So many people go on call unprepared, and then they have an outage and the runbook is out of date.

So using chaos engineering as a way to make sure your runbooks are up to date is such a great angle, and you don’t need to talk about chaos engineering. You don’t need to talk about failure. You don’t need to talk about any of that. You can just align from the operational readiness point of view, right? And I’ve done that in some teams inside AWS as well, and with customers, and it always resonates. Nobody has ever said their runbooks are all perfect and up to date, so…

Benjamin Wilms: No. Never!

Adrian Hornsby: So if you take that angle, you typically get, “Oh yes, that’s awesome. Yeah. Let’s make sure our runbooks are up to date, and then let’s make sure our on-calls are ready, right?”

Because systems have become more and more reliable, available, resilient, there are fewer and fewer outages, right? That leads to what’s called “skill atrophy” for on-calls: we don’t practice our recovery skills that often anymore. Like 15 years ago, there used to be a lot more outages, and now there are fewer. So practicing that skill, making sure that the on-calls are confident and know what they’re doing, again, that’s operational readiness. Nobody’s going to argue with you on that one. So these are the angles, and I can come up with so many angles like this, which people don’t seem to leverage, which is strange.

Benjamin Wilms: Yeah. Let’s zoom in a bit on the system, the definition of a system.

There’s a good article, or a quote, out there from Russ Miles, where he mentioned that chaos engineering can especially help you improve your sociotechnical system. So your system isn’t just technology. It’s also the people, the processes, the organization behind how you’re running and operating your software, which is close to what you mentioned: the people on call need to be trained, they need to be in a position where they can really understand how the system behaves. The sociotechnical system really describes what the potential could be if you’re doing chaos engineering in such a system.

Adrian Hornsby: Yeah, yeah, it’s totally that. I see people focus on technology all the time, on all the tools, right? For me, it’s kind of the Venn diagram of culture, processes, and tools. You can have the tools, but without the right processes and the culture around them, you have basically nothing.

Right? So I think chaos engineering is super interesting to apply to exactly that, what you’re talking about, the overall system. And it’s easy to see how humans are probably the most critical part of systems from a resilience point of view, right? When there’s a failure, right?

It’s the humans that typically go and fix the problem, that try to make the system adapt to the conditions, right? There’s a term in resilience engineering called adaptive capacity. It’s a property of your system: how your system is able to move or adapt to the conditions. And the only way to really understand how flexible or malleable your adaptive capacity is, is actually to go and inject failures and see what you’re really good at. I think that’s super important. We often focus on what’s wrong, but actually understanding what you’re really good at and piggybacking on that is probably the most important thing, but also understanding where there’s room for improvement, right? And very often it’s just a matter of injecting faults, and it doesn’t have to be technical faults. People think about injecting faults into a system, like adding latency, and that’s great, but one of the experiments I’ve loved to do is take the laptop of the lead engineer in the team and see how the team is going to behave, and literally most of the time everybody freaks out, right? Because they have those…

Benjamin Wilms: I was in a very nice session with Russ Miles some years ago, and he also shared that story. I did a GameDay in my old days as a consultant, and one of the first exercises for myself was to identify that leader and, as Russ told me, get this person out of the room to enjoy a coffee, and then run the GameDay without that person, this mastermind, in the team. That can hopefully enable those people to step out of their comfort zone, to be more responsible, to be more engaged with the system, and of course to learn a lot more, because otherwise you’re just trusting a single person, which is, again, a single point of failure in the system.

Adrian Hornsby: And it’s about understanding exactly this hidden complexity, right? People have this knowledge inside their brains, and sometimes the documentation is not up to date, or there have been some changes, or there are some biases or some context inherited from the very early days, but the rest of the team has changed. That’s another thing chaos engineering is really good at: making sense of the complexity of the system. And systems are complex in many ways, right?

Technically, socially, culturally, and all that kind of stuff, and making sense of that. For example, sometimes there’s complexity in the escalation process, sometimes there’s complexity in the way a company talks to their customers when they’re experiencing an outage, and they forget to communicate, right? That’s a problem that can be surfaced as well in those GameDays, by understanding the process.

Benjamin Wilms: Let me share one story. Maybe you will agree or disagree, I don’t know, we will figure it out. As we’re trying to show people what we’re able to do with Steadybit, there are sometimes situations where people tell us, “Hey, we are quite good at fixing broken stuff. We are quite fast at it.” So they are still very deep in this reactive approach. They’re basically being hunted by their own systems. So why do people still want to be in that race? Why are they not switching to a more proactive reliability approach, where they invest earlier? Would you agree that people are in that race on purpose?

Adrian Hornsby: It’s the prevention paradox. It’s exactly that.

Benjamin Wilms: Yeah.

Adrian Hornsby: It’s that they don’t… it’s hard for humans to understand the value of investing in something that didn’t happen. It’s just very difficult, because you’re making trade-offs, right?

Either you’re building a feature, which can give you more customer satisfaction, and it’s tangible, it’s something you can actually show for it, or you’re working on something that doesn’t exist, right? So, prevention. And it’s the same problem with technical debt, right?

It’s so difficult for organizations to invest in paying down technical debt, because typically it doesn’t translate directly into customer benefit or sales or new features. It’s so much better to build new features versus doing stuff that… at least that’s what they think. I’m not saying it is, I’m just saying how they think.

Benjamin Wilms: That’s also the way they are measured, how their success is measured: how fast you can get new features out into production, how fast you can deliver a new product. And yeah, there needs to be a little bit of pain, otherwise they will not invest proactively. But it’s so weird, because let’s imagine you would like to purchase a new car.

Would you like to purchase a car without an airbag? I mean, with the airbag, you’re happy if you never, ever see this thing coming up in your face, but isn’t chaos engineering also a proactive investment, where hopefully everything goes smoothly and you never see a big impact?

Adrian Hornsby: Yeah. But like in the early days, cars didn’t have airbags, seatbelts were not mandatory, like when I was young, it was… I was sitting in the back without anything.

We didn’t actually have to wear seatbelts, even in the front passenger seat, people didn’t have to. So it took time, right? It took regulation. It took having some sort of compliance, I think, and that’s actually where this is going, right?

If you look at the regulations, like DORA, for example, they are now mandating that companies actually verify that their recovery scenarios have been validated, and they are looking to have proof of that exercise. And it shows, right? In the last couple of years, the biggest traction for chaos engineering was in financial services, because, you know, they have to show proof of it, that’s it.

You know, humans are quite funny, right? Sometimes you need to be forced to do something. I mean, it’s the same for us, right? That’s why dieting is such a big industry, or that’s why people typically have to wait for a heart attack or health problems to start making big changes in their life. It is just so difficult to see an investment in something that we can’t touch, right? Yes.

Benjamin Wilms: There needs to be pain. There needs to be strong pain, otherwise people are not willing to invest, or they are not allowed to invest. Even if you have, let’s say, a young, innovative team that wants to do something with chaos engineering, because they have already identified that there are a lot of unknowns in their system and they need to get better, they are then not able to really communicate why they are doing it and what they want to get out of it. That’s a tough challenge.

Adrian Hornsby: That’s why I’m saying you need to flip that on its head, right? Only focusing on the resilience side is typically a lost battle. That’s why I’m saying focus on operational readiness, understanding complexity, preparing runbooks.

I mean, there are tons of things chaos engineering can help you with, and I think the curse of chaos engineering is being called “chaos engineering”, in a way. If it had maybe been called, I don’t know, distributed testing or distributed system testing, maybe it would have been better, because that’s what it is typically, right? You verify, or you kind of work at, the interactions of systems and see how they behave with each other. But it’s too late, right?

Benjamin Wilms: In some conversations, I get the impression that people really want to build a failure-free system.

So, with zero failures in it. I’m more of the standpoint, or the perspective, that failures are part of the system. And as you mentioned, how the system is able to work under those conditions is the key element. This dynamic system needs to identify a failure, it needs to be able to work with that failure, and your organization should know when something goes wrong. From your experience, how many companies are operating more in a hope-based way, using hope as a strategy? Is it still a big portion?

Adrian Hornsby: Yeah, I think there’s a fair number of companies that are in kind of hope-driven development, but I think it’s changing as well, right? Maybe I’m biased because the customers I’ve been talking to already understand resilience to some extent.

So my sample is biased, but you still hear stories and you see it in the news, right? Where people do not expect failure, and when there’s a failure, there’s direct blame. That’s typically the very old way of thinking: the system is unavailable, and if there’s a failure, there has to be somebody to blame, right? And that’s a very old way of thinking about it, because, like you said, your system is in a perpetual mode of failure. There’s no part of the system, especially if you think about the sociotechnical system, where there isn’t always something going wrong, right?

So thinking that there’s not going to be any failure is just wishful thinking; it denies basic reality, right? Because something is going to happen, something will go wrong eventually, and often it’s not what you think is going to happen. So yeah, it would be well known if systems never failed, right? Well, they do, they do.

Benjamin Wilms: Absolutely. And I’m always curious about learning from failure. So is there anything on your mind, any story where something really went wrong while you were doing chaos engineering?

Adrian Hornsby: What, what do you mean?

Benjamin Wilms: So not from the technical point of view, more like: let’s imagine you have done a GameDay or chaos exercise, whatever, and then something didn’t work as planned.

Adrian Hornsby: It’s often like that; it rarely goes as planned, right? Because people have assumptions, and typically those assumptions are completely wrong, right?

It just depends on the degree of precision with which you define your hypothesis. It’s easy to always have a valid hypothesis if you have a very loose hypothesis, but if you have a very precise, specific hypothesis, with time, people, you know, really precise, then typically something goes unplanned or wrong.

That’s why you need to be careful what you’re doing. That’s why you need to first do it in a test environment or a controlled environment, or, if you do it even in a production environment, you need to do it in situations where your blast radius, the number of customers that you’re potentially impacting, is reduced or controlled, or it’s only a certain type of customer.

So for example, I’ve done work with the BMW Group recently, and they’ve done their GameDays in production, but they’ve done it at a time when there was a minimum of customers on the systems, and even if those had been impacted, it was a certain type of customer for whom it wouldn’t have been a big problem.

So it’s a very fine-grained definition of, kind of a shard of, the system where you apply the failure, and then you can learn from that. And as you control it, you can potentially expand it, put it in different situations, but yeah, typically you want it really controlled, very precise.
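
To make that concrete, here is a minimal sketch of what such a tightly scoped experiment definition could look like, with a precise hypothesis and an explicitly limited blast radius. It is only an illustration under assumed names: the `Experiment`, `Hypothesis`, and `BlastRadius` structures, the tag-based target selection, and all the numbers are hypothetical, not the schema of any particular tool.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A precise, falsifiable steady-state hypothesis (as opposed to a loose one)."""
    description: str      # what we expect to stay true during the experiment
    metric: str           # the metric we watch, e.g. a checkout error rate
    threshold: float      # the tolerated upper bound while faults are injected
    duration_s: int       # how long the condition must hold

@dataclass
class BlastRadius:
    """Limits on who and how much of the system can be affected."""
    environment: str      # e.g. "production", but tightly fenced
    target_tag: str       # only instances carrying this tag are attacked
    max_targets_pct: int  # cap on the share of matching instances
    customer_segment: str # e.g. internal test accounts only
    abort_on: str         # condition that triggers an automatic stop

@dataclass
class Experiment:
    name: str
    hypothesis: Hypothesis
    blast_radius: BlastRadius
    faults: list[str] = field(default_factory=list)

# A precise hypothesis plus a small, controlled blast radius.
checkout_gameday = Experiment(
    name="checkout survives a slow payment provider",
    hypothesis=Hypothesis(
        description="Checkout error rate stays below 0.5% while payment calls are slowed",
        metric="checkout_error_rate",
        threshold=0.005,
        duration_s=600,
    ),
    blast_radius=BlastRadius(
        environment="production",
        target_tag="service=payment-gateway,canary=true",
        max_targets_pct=5,
        customer_segment="internal-test-accounts",
        abort_on="checkout_error_rate > 0.02",
    ),
    faults=["inject 300 ms latency on outbound calls to the payment provider"],
)

if __name__ == "__main__":
    print(checkout_gameday)
```

Running this only prints the definition; the point is that every field that widens the blast radius or loosens the hypothesis has to be changed deliberately rather than left implicit.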

Honestly, I’ve never seen an experiment that went exactly as planned. That’s the funny thing: you always learn something from a chaos experiment or a GameDay, because our assumptions about the systems are more often than not completely wrong.

Benjamin Wilms: Yeah, let me transition now into a new topic. We already talked about how complex today’s systems are, how many parts are already in those systems. And now everyone is aware that AI is coming up, and on paper everything will be greater and better and easier to do, but aren’t we in a position where, with these approaches, like AI agents or systems that are designed and created by an AI, we are getting into an even more complex world?

Adrian Hornsby: Yeah, there’s actually a very interesting paper written in 1983 by Lisanne Bainbridge called “Ironies of Automation”, and I think it really applies today. It basically states that the more automated the systems are, the more specialized people need to be to actually recover those systems, because of course you have this separation between the abstractions and how the system works, and if you only work on the abstractions, you don’t have the knowledge of how to recover the system. At the end of the day it’s always people that come and recover or fix your system, so the more automation you have, the more specialized you need to be to understand how it works. It seems obvious, right?

But it’s an interesting paradox, and it’s becoming even more obvious now, where you have companies where you can prompt an app and it builds everything for you. But then, at the end of the day, and I’ve actually worked with and tested many of those, I recently wrote an article about it, it works well for your most lovable product kind of stuff, but as soon as your users grow from a hundred to a thousand, to 10,000 or past a million, you need to make a lot of changes, right? And AI, at least at the moment, is not able to do that in a way that is smooth, that is predictable, that makes sense, right? So…

Benjamin Wilms: We are still in a situation where the AI is more guessing, trying out, figuring out. It can do that very fast, but we are missing the training data. We are missing the experience for an AI to really be able to do it well on the first try.

Adrian Hornsby: Yeah. And also, I mean, if you want to build a CRUD app, it’s fine, it’s super easy, right? But as soon as you do something a little bit atypical, there are so many ways of building things.

I mean, I’ve built systems for 20 years. None of them were similar, nothing alike, right? They were all completely different. There are some principles, right? You have event-driven, you have asynchronous workers, you have these kinds of patterns, but then the implementation details are so specific to the context, or the business priorities, or something like that.

So the AI system basically needs to make an assumption about what you want to do and select defaults, and again, that’s going to be a problem for what we called before your adaptive capacity, because the more defaults you have, the harder it is to change the system and to understand where the limits of that system are.

So you’re going to need even more knowledge now. There’s a concept called a “meta operator”, which again is something I’ve been exploring and playing with, where you use an AI to operate on AI systems, to make these kinds of decisions. But again, you potentially need an agent that makes a decision based on cost, one based on scalability, one on availability, so you have all those agents working against each other, or with each other, and then somebody needs to find the compromise. It’s super interesting, but it’s difficult, because none of these systems are deterministic, right?

The context changes just slightly, and then it might make a completely different decision, right? That’s tough stuff to handle, for regulations and for engineers, because we want to understand why a decision was made and then be able to make sure it’s made the same way multiple times, especially for regulations.

Imagine you have a disaster recovery system and the AI agent decides, “I’m going to trigger the disaster recovery now,” because it’s been seeing a pattern somewhere, and you say, okay, that’s great. And then in another situation it says, “No, I won’t do it.” And then it’s like, okay, why?

That’s where it’s going to get interesting. All these questions are going to start surfacing.

Benjamin Wilms: I was able to read an article of yours on Medium, “Chaos Engineering in the Age of AI”. Maybe you can summarize it a bit: from your perspective, how can chaos engineering be helpful in today’s systems with AI?

Adrian Hornsby: Yeah, yeah. It’s related to one of the properties of chaos engineering, which I discussed at the beginning: it’s very good at surfacing the complexity of systems. Because AI is making a lot of decisions, like those default configurations, default quotas, limits, all this kind of stuff that works well for the vast majority of your use cases.

But as soon as you come out of this, you basically need to make your own decisions or change the system. So to understand those limits, the best way is really to inject faults and understand what the limits are, so that you can get a sense of… I always do this, I like to represent the system, since it’s dynamic, as a 3D object somehow, right?

So you have load, capacity, and then other properties, but it’s kind of a 3D object, and the more experiments you do, the more points you can plot in that object, and that gives you a better sense. It’s like sampling an object in 3D, right?

So it gives you a better idea of what the system looks like, especially the boundary of the system. With more load, with different capacity, with different payload sizes and so on, you can understand how you’re…

Benjamin Wilms: And in this bubble, at any point in time, there’s the actual state of the system, and it’s moving within this…

Adrian Hornsby: Exactly. Yeah, exactly. So you have the state of the system, kind of the output of the system, which moves depending on the different dimensions, right? You apply load, you change load, you increase capacity, you also have the payload, so these are all basically the inputs of the system, and the output gives you that kind of 3D object. The more points you have, the better understanding you have of your system, especially of the boundaries, so then you can make better decisions, right? Or you can reverse-engineer a little bit what the AI has done with your system. Otherwise you would need to go through the code and look at all the defaults that have been set, and at all the libraries that were selected to be used and their defaults.
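
As a rough illustration of that sampling idea, here is a small, self-contained sketch: it sweeps two input dimensions, request rate and payload size, against a toy stand-in for a service and records which points stay within a latency budget, which is essentially plotting points on that boundary. The `service_latency_ms` model and all thresholds are invented for the example, not measurements from a real system.

```python
import itertools

def service_latency_ms(request_rate: float, payload_kb: float, capacity: float = 500.0) -> float:
    """Toy stand-in for a system under test: latency grows as load approaches capacity."""
    utilization = (request_rate * (1 + payload_kb / 64)) / capacity
    if utilization >= 1.0:
        return float("inf")  # saturated: requests effectively never complete
    return 20 + 80 * utilization / (1 - utilization)

def sample_boundary(latency_budget_ms: float = 250.0):
    """Sweep the input space and record which points stay inside the latency budget."""
    points = []
    for rate, payload in itertools.product(range(50, 501, 50), (1, 8, 32, 64, 128)):
        latency = service_latency_ms(rate, payload)
        points.append((rate, payload, latency, latency <= latency_budget_ms))
    return points

if __name__ == "__main__":
    # Each tuple is one sampled point of the "3D object": two inputs, one output, pass/fail.
    for rate, payload, latency, ok in sample_boundary():
        status = "inside boundary" if ok else "outside boundary"
        print(f"rate={rate:>3} req/s payload={payload:>3} KB -> {latency:8.1f} ms  {status}")
```

The more dimensions and sample points you add, the finer the picture of the boundary becomes, at the cost of more experiment runs.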

So for example, all the defaults related to timeouts, retries, the sizes of your queues and thread pools, all this kind of stuff are defaults your AI is going to pick at some point, right? And these are actually the ones that make your system either very rigid or potentially very resilient.

And the more knowledge you have of that, the better your decisions, right? The better you can understand the system. Chaos engineering is great at that, because you inject faults and you can understand right away what the limits of your system are. It gives you this capability.
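
Below is a minimal sketch of that idea: a client with explicit per-attempt timeout, retry, and backoff values, the kind of defaults a framework or an AI-generated scaffold would otherwise pick silently, plus a small latency-injection loop that shows where those defaults stop protecting the caller. The numbers and the simulated dependency are assumptions for illustration only.

```python
import random
import time

# Defaults of the kind a framework or generated scaffold often picks silently.
TIMEOUT_S = 0.2     # per-attempt timeout
MAX_RETRIES = 3     # retries after the first attempt
BACKOFF_S = 0.1     # fixed pause between attempts

def attempt_call(injected_latency_s: float) -> bool:
    """Simulate one attempt: the caller gives up after TIMEOUT_S even if the dependency is slower."""
    dependency_latency = injected_latency_s + random.uniform(0.0, 0.05)
    time.sleep(min(dependency_latency, TIMEOUT_S))  # we never wait past the per-attempt timeout
    return dependency_latency <= TIMEOUT_S          # success only if the answer arrived in time

def call_with_retries(injected_latency_s: float) -> tuple[bool, float]:
    """Run attempts under the timeout/retry budget and report the caller-visible wait."""
    start = time.monotonic()
    for attempt in range(1 + MAX_RETRIES):
        if attempt_call(injected_latency_s):
            return True, time.monotonic() - start
        if attempt < MAX_RETRIES:
            time.sleep(BACKOFF_S)
    return False, time.monotonic() - start

if __name__ == "__main__":
    # Inject increasing latency and watch where the defaults stop protecting you.
    # Worst-case caller-visible wait: (1 + MAX_RETRIES) * TIMEOUT_S + MAX_RETRIES * BACKOFF_S.
    for injected in (0.0, 0.1, 0.3, 0.5):
        ok, total = call_with_retries(injected)
        print(f"injected latency {injected:.1f}s -> success={ok}, total wait {total:.2f}s")
```

With these assumed values, the caller can wait up to about 1.1 seconds in total, and any injected latency above the per-attempt timeout makes every retry fail, which is exactly the kind of hidden limit that fault injection surfaces.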

It’s also why, I don’t know if you know about the concept of deterministic simulators, right? It basically takes time out of the equation and you can accelerate testing. Antithesis is one of those systems that is awesome for that, and you can also use chaos engineering in that simulator to understand how the system fails, right?

It turns a non-deterministic system, like an operating system, into a deterministic simulation, where you can then accelerate time and run a lot of tests and all that kind of stuff, and in my head this is basically where chaos engineering helps a lot.

It helps you understand the system, the limits of your system; it works very well for trying to get a sense of…

Benjamin Wilms: With all the moving parameters.

Adrian Hornsby: Yeah, exactly, and if you can do this in your simulator, in fast forward, it gives you a sense of how your system works. So basically, what I’m looking at now is starting to connect AI-generated systems and put them into a deterministic simulator, so that you can really understand how your system works and then make the right decisions. I’ve had chats on that topic with a couple of CEOs of companies that are building AI systems, because a lot of customers come and build the first MLP or POCs, but then as soon as they get customers, they freak out, they don’t know what to do, and then it’s complete chaos, and basically they lose customers because the system stopped working, and the customers go somewhere else.

Benjamin Wilms: Yeah, they are losing customer trust. Customers are not returning, new systems are coming up overnight, new products are coming up overnight.

Adrian Hornsby: Yeah, exactly that.

So you basically have to have a team that is able to go very fast at the beginning, but then also flip from AI-generated to understood, and then potentially, you know, human-assisted.

Benjamin Wilms: Which is also not something new. It’s like getting from a prototype to something that really works at scale, under load, with customers, and so on.

Adrian Hornsby: But it used to take a lot of time; you used to have weeks or years to do that transition. Now you need to do it really fast. That’s where I see chaos engineering having a big role to play: understanding complexity, allowing those companies to generate a big system, then generate a bunch of faults, understand the boundary of the system, and then you can potentially use AI to make those suggestions. But the AI needs to have context.

Benjamin Wilms: Correct. There needs to be some data that the decision is based on. Otherwise it’s just guessing, failing, guessing, failing. And we’re already talking a little bit about the future.

So how would you see the future of reliability engineering and chaos engineering?

Adrian Hornsby: Yeah, I think things are definitely going to change a little bit. I see chaos engineering potentially going into an agent mode, with autonomous agents that are able to inject faults in the right place, obviously linked with observability.

That’s the right place to do chaos engineering, and then AI is there, right? You inject faults, you understand the system and its complexity, and then eventually make changes, so adaptability. I think that’s going to be a big part of it: making the choices, giving the data to agents, creating adaptability. That’s where chaos engineering especially has a good place. It just means we need to be able to do chaos engineering in a very simple way, almost hands-off, potentially fully autonomous, and potentially also in an accelerated way. That’s why I think those deterministic simulators are interesting.

Not everything can go into those simulators, and eventually real-world systems are a bit different, but then how could we use those principles of accelerated learning, automated load, synthetic users? You can have a phase where you really test the whole system, inject chaos with load, all that kind of stuff, and that gives you a data set of how your system is supposed to work. You put this into AI, it creates this 3D image of what your system is and all the boundaries, and then you can make good suggestions based on the context of where your business is going.

Does what I’m saying make sense?

Benjamin Wilms: Yes, yes, yes. I’m following. Yeah, it makes sense, absolutely. So we all know that we are not getting out of this complexity cycle, and we all need to keep learning and getting more insights about it. Do you already have some recommendations for reading and learning?

So if people want to get started with such complex systems, or also with the human part in those systems, is there something from your reading list you want to share?

Adrian Hornsby: I mean, the classics from Woods [David D. Woods] about resilience engineering are a great start: resilience engineering in practice, human error.

Also the book from, I think it’s Diane Vaughan, about the Challenger [“The Challenger Launch Decision”], where she talks about the normalization of deviance. I think that’s also a super interesting idea; it changed me a lot. And then there’s no substitute for trying things hands-on, because you can read as much as you want, but at the end of the day you just need to try things and play, because that’s where you get the real value.

And again, like…

Benjamin Wilms: Yeah, and that’s the biggest learning for yourself. Really learning from failures, learning from what you’re doing right now, is so valuable.

Adrian Hornsby: Yeah. I know it’s ingrained in all our brains, learning from failure, but you can also learn a lot from success. I think that’s a big myth in the industry.

Everybody focuses on learning from failure, but most of the time the system works, right? And there’s a lot of learning to be done during success. The near misses, the things nobody talks about, where, you know, your engineer reviewed something in the code or did an operation that prevented a big outage.

All of these are so important to learn from, because 99% of the time it works, and understanding that is very important. If you only focus on failure, you focus on the 1% of the time and on those problems, and that’s great, but there’s so much more to learn, so you can piggyback on what works and make it even stronger.

Benjamin Wilms: I would agree, but there’s also a strong need that, for example, if you have implemented a specific fallback strategy, or you would like to see your retry pattern, you need to see it in the data. If the system is just running and you are not getting any insights, you don’t get that confidence; you need to see that what you have created is really working.

And if there’s an edge case, or this fallback should now be called, you need to see that in the data as well. Otherwise you are just guessing.

Adrian Hornsby: Yeah, yeah, for sure. There’s no doubt about that, but I think there’s still a lot to learn from when things work properly, and from understanding why they work. Because otherwise, you’re going to have a failure, you’re going to fix something, and it’s going to affect things that already work, right? I have seen this a lot. You focus on something that didn’t work, you try to address it, and then it touches something else, because nobody saw that that other thing was actually successful. I’ve seen this a lot as well. It’s interesting to think about. It’s just a note: I know it’s part of chaos engineering to always talk about failures, but I think understanding success, and using chaos engineering to also show that your system behaves well under hard conditions, is such a great thing, right? It shouldn’t just be put to the side, because if you can prove that your system works as you expect, it’s great. It’s a great way to sell it.

Benjamin Wilms: Yeah, and you can be really proud that the system is working as expected, because you want to see your expectations fulfilled by the system as well.

Yeah. So thank you very much for this first session of my podcast series. We should do it again. I really enjoyed our conversation. And…

Adrian Hornsby: Likewise.

Benjamin Wilms: What is the best way for people who want to reach out to you?

Adrian Hornsby: I’m quite active on LinkedIn lately; I’ve been publishing almost every day. So just hit me up on LinkedIn, or I’m also on Bluesky at @horn.me, and I also have a website, so you can find me in any of those places. I’m there in the meta universe. Not Meta, the company, but the universe.

Benjamin Wilms: Nice. Okay, thanks again. I really enjoyed the conversation, and see you soon.