Talks & Chats

Episode 2: Embracing Psychological Safety with Russell Miles

Experiments in Chaos • Jun 2025

In episode 2 of “Experiments in Chaos”, Benjamin Wilms sits down with Russell Miles, a leading expert in the resilience engineering space, to discuss the definition of system reliability, the importance of psychological safety, and why embracing failure is at the heart of true learning.

Episode Transcript

Benjamin Wilms: Hello, hello, hello, and this time I’m very happy that finally Russ Miles is with me. I’ve known Russ for quite a while, and some years ago we did a joint, let’s say, yeah, it was called the Chaos Roadshow. So we were really jumping around Germany and the Netherlands to talk about chaos. And it’s a pleasure to have you here. Welcome to the show.

Russell Miles: It’s an absolute pleasure to be invited, my friend. Thank you.

Benjamin Wilms: You’re welcome. And yeah, what is the topic of today’s session? I mean, it’s chaos. Yeah, it’s chaos engineering, but it should be more about what you can achieve with it and why reliability is so important. And it’s not just something you can implement on the code side, it’s even more than that, and I would say reliability really is a culture.

So why is tooling not enough from your point of view?

Russell Miles: It’s a really interesting question. Because in a lot of organizations, tooling is the answer. And as a consultant, from a dim and distant career, I certainly found that when I was brought into an organization, the answer was very much the tooling that I was associated with. And then almost the problem was being fitted to the answer, if that makes sense. So I think in a lot of organizations there’s a tendency to see tooling as the answer already, and we just need to shape our problem to fit the answer that is the tool. And I think that’s where the danger comes in, in a lot of different areas.

But certainly in reliability and resilience, I generally think so much of it comes down to the people involved and how they operate and how they think, and that can make all the difference between whether you have a highly available, highly reliable system or you don’t.

And I think the difference, not just now but perhaps always, and I may always be a little wary of that word “always” in technology, will come down to people. I’m hesitant to use the word “mindset” as well; that’s another overused term in pop psychology. But if just their grasp of the situation could be shifted, I think that makes a lot more difference than the specifics of the tooling being used. In some organizations, to give you an example, I’ve found that you could use some of the best tooling in the marketplace in terms of observability, of turbulence injection, even the types of monitoring and learning that are necessary to get the value out of chaos engineering. But if the senior stakeholders aren’t aware that we are trying to work with at least complicated, if not complex, systems, and if they’re under any misapprehension that we do this chaos exploration and experimentation today and then we’re done, when we really need to keep doing it and learning from the dynamic nature of the system, then you can end up with a lot of false dawns.

You can have this moment of, oh, it’s all good. It’s fine. We’re done. So a lot of my work, I suppose, has been in helping people realize the nature of complex systems. And I have a lot of people to thank for the hard scientific work on what that looks like. Dave Snowden is a classic individual in the industry; I’d like to say he’s been yelling about what complexity is for a long time, though I don’t think I’ve ever heard him actually yell. He seems to be a quietly spoken but forthright individual, and we need some of that sometimes. And then of course there’s Barry O’Reilly with his theory of residuality and how to explore a system from that perspective. These people have made incredible steps forward to help us understand the dynamic, non-linear, complex, entangled nature of systems, particularly systems where people are a big part of what’s going on, and I think it’s probably still a 20-year journey to help people understand this perspective.

So then they’ll see the great work that you are doing. We spoke before the podcast about Adrian Hornsby and the great work he’s doing, John Allspaw, Casey Rosenthal. The list goes on of people working in this area, and they’re doing, I wouldn’t say the harder job, you’re doing the harder job, I would say you’re doing the absolutely necessary job of bringing that awareness into the mainstream.

And again, that’s a good 10-, 15-year journey, I feel, but the tooling alone won’t win the game. It will be, I believe, the way people think about their systems, how they change over time, their complex nature, and how that relates to reliability, where I’m hoping the biggest wins come, but it’s a long game.

Benjamin Wilms: It is, it is, and you mentioned the word system a couple of times. If you talk to engineers and if you talk to other people, whatever role they are in, the definition of a system is quite different. So in my opinion, the system isn’t just a technical part, and you mentioned it also, I guess in a talk or in a publication: it’s called the sociotechnical system. It’s a combination of the technology you need, but also the people, the organization, the processes, everything else, and most of the time that is not part of the story when people are talking about the system.

Russell Miles: You are absolutely right. And in fact it makes me smile slightly, because anyone who’s worked in systems theory, like many industries outside technology, must have spent the last 10, 15, maybe many more years looking at us going,

“Stop focusing on just the technology.” Yes. And also probably screaming at us, “You don’t need a new word for it either,” because we sort of hold up this sociotechnical thing as a new idea. Systems thinking has always been about all the factors in how you approach wherever you demarcate a system, and demarcating a system in itself can be enormously challenging, because everything is somewhat part of another thing, and there can be intangible relationships between multiple factors that, whatever you call your system, you won’t necessarily see. One of the metaphors I love working with here is, I’m a massive fan of the book “The Hidden Life of Trees”. Yeah. And in that book it talks about how trees interact and communicate and collaborate and support one another, and the idea of a tree being a system is a narrow one and a misunderstanding.

It’s like only looking at what you see above the ground. And even when you dive into the ecosystems within a single tree, there are a lot of systems within there. So everything is systems within systems within systems, with lots of crosstalk and overlap and entanglement between them.

So yes, I think in our area in particular, we’re even more myopic. We tend to go, what’s the technology and what can it do? Or, you know, what is the survival rate of a particular system under a set of conditions? And I think the loudest, most forthright champion of this change in perspective that I’ve been enamored with is, again, the work of John Allspaw, where he is going, no, please remember that people are at the heart of all of this, that people are the key factor in your system’s survivability. And this constant reminder, and we do need to be constantly reminded in this industry, that the system has to be with the people.

The people are part of the system. I use the phrase sociotechnical myself a lot, despite my own slight flinch at it, because it’s a reminder of just how easy it is for us to focus on the technical. But in any system, particularly the systems we work on, highly available, highly reliable systems, possibly the most important part to explore is the people and how they respond to what’s going on.

Benjamin Wilms: Also, people on both ends. I mean, everything starts with the customer, with your customer, and on the other side, it’s your job to really fulfill their needs. The technology is just a tool you are using, but in between there are so many needs that are hopefully articulated, and if not, you’re in a bad position anyhow. And let me chime in to that area where, nowadays, people want to get a good understanding of how healthy their system is, so they are building huge dashboards. They are taking a look at SLOs, they are using, let’s say, three, four, five, or however many observability tools, just to know how good their system is. But it’s not just one metric. It’s not one KPI, there’s so much needed.

Russell Miles: You can’t distill systems down like that. I don’t believe you can, and I hope, I believe, that the complexity experts out there would agree with me a little on this. You can’t distill things down to a simple set of metrics

that say this is good, this is okay, this is working, this is healthy, this is reliable. I still come back to a lovely phrase I once heard: if you want to look at a system, a sociotechnical system, and understand if it is better now than it was yesterday, or better now than it was a year ago,

look at the types of stories people tell about it. And that always chimed with me as a really lovely, very qualitative metric, if you like, of how systems are working. And it’s something that I naturally look for in the systems I tend to work on, where the perception of a system from the outside is where one reliability metric, if you like, exists.

How do people perceive it? Do they rely on it? If they do, then it’s perceived as reliable, as you say, by the customer, by the person that is benefiting externally from what you have. And then there is the benefit to the people who run the system, and the benefit to the people who are stakeholders in the system’s existence.

Yeah, and you can ask, is it reliable to them as well? And that becomes another interesting way of examining it, and then there’s this question of system health. Is it healthy? And in fact, health has so many dimensions that it should never be a single number. I don’t even know how you would make it a single number.

Yeah, but we are engineers, so one or zero is our favorite. Exactly, not just one number. No, but literally a binary. Yes, absolutely. And we don’t like stories. I mean, well, we do, we do like stories as humans, but it depends on the context. It really does. And we don’t like to think that that’s what better might look like.

Yeah. I mean, as an example, I was working some years ago now with a group who were explaining that they had a highly reliable, highly available payment system, and we started to unpack the experiences that people had of it. Very subjective, of course, but they are as valid as anybody else’s experiences within it.

And as we did that, we found that over time, as improvements were explored, and everything is a baby step in improving a system like that, with every baby step of improvement the stories would adjust. They wouldn’t all suddenly become positive, by the way. They might become more negative. Yeah. But that’s because we’re experimenting, we’re probing, we’re sensing, we’re responding.

We’re seeing how a complex system reacts to these things, but we were using, as a primary sense-making tool, the question: what are the stories that people are telling about it? And that was quite useful for us, because what we realized is that some stories disappeared, and some of the worst ones disappeared, for some changes that we made.

Yeah. And in fact, it’s a bit like, I don’t know how familiar your audience is with British culture, whatever that may be, but in British culture we have a tendency to be quite pessimistic. I think the weather helps with that. So we tend to tell pessimistic stories. If you read the press in the UK, it’s pretty much: this is what you should hate.

This is what you should be upset about. This is what you should fear, and this is why it’s all going to hell in a handbasket. And then you go out and you experience the world in the UK and you can take the perspective that we’re quite privileged, we’re quite happy. It’s not the reflection of what’s on the page, but those are the stories that we tell about ourselves.

And so when you’re looking at the stories that people tell about how they experience these complex, complicated systems, you can use the stories to help guide the improvements you’re making, the experiments you’re making as you baby-step forward. And I found that to be really useful when you’re working in reliability and resilience, where single-point metrics, or even trends over time, really don’t help.

Can we find a way of engaging people in surfacing the stories that they see, and not treating it as a binary: yes, it’s okay; no, it’s not?

Benjamin Wilms: Yeah. This group of people, they’re telling different stories now; what does that tell us? What’s different now? How do you handle the emotion in those stories? A story is always connected with some emotion.

Russell Miles: Yeah, at that point in time. And that’s a great question. Some people are very emotional and others are more rational, like one or zero. You have to have an understanding of a couple of things. So I believe there’s some science now, I can try and find the paper, but there was a study done that kind of distilled biases down to one bias.

We used to think we had a lot of biases, and there’s an awful lot of pop psychology out there that says we’ve got hundreds of biases, I don’t know, prediction bias and confirmation bias, and there’s a list. And in fact, if you distill it all down, it comes down to one thing. There are several books on this: we are prediction engines.

We predict how the world’s going to be. Our brain really likes to predict what happens next, and in fact, in many respects, it’s already ahead of where we are, trying to see how things might play out. It was extremely useful for evolutionary reasons for us to have a brain that works this way.

Yeah, otherwise we wouldn’t be here. But the downside is that we tend to have a number of predictions that are always going on, and those predictions are heavily informed by cultural stories, yeah, myths, myths we tell ourselves, and I’ve been reading a lot on the science of myths in the last few years, actually.

Mary Midgley has written some amazing work on this, and our myths, of course, are culturally set. So depending on our social groups, our location, and where we are at certain times in our life, which makes a huge difference as well, we implicitly believe in different myths, and we’ll look at the world in different ways.

Okay, so back to your point about emotion. I don’t take the emotion out, but what I do is try to understand and navigate these stories under the lens of understanding these biases. Under, sorry, understanding that single bias. Yeah. Understanding the cultural ramifications of the group of people that are sharing their stories, their experiences of things,

and then also tempering that with the myths, the tribal knowledge, if you like, in the culture they’ve grown up in. So I will judge, judge is perhaps too strong a term, but I will use the stories of one group differently. Perhaps I’ll use a different lens for them than for another group entirely, whereas one group might be incredibly positive and here to tell you how great it all is, another group naturally doesn’t do that. One of my favorite examples of this is the most incredible cultural switch between two talks that I did in the same week. I did a talk in America on the West Coast to around 8,000 people, 8,000 Valley, American, extremely positive people. Yeah. These beautiful individuals were laughing and clapping before I’d even got into the room, and I remember going on stage at the end of their incredible welcome and saying to them, well, it can only go downhill from here. And then hoping that the talk survived.

And then I flew and did a talk in, I think it was Norway, and I remember very distinctly I was doing it in this ancient theater with beautiful colonnades, and I had the other experience of coming into a silent room, highly respectful, and then going through my entire talk.

No reactions from the audience. Literally. No, that’s tough. And yeah, I was like, okay, this is an example of the cultural differences, particularly when you are navigating stories. People will tell them differently. They’ll be trying to imply different things. If someone’s a quiet participant in the world, in some cultures that’s seen as negative; in others it’s seen as extremely positive,

and so there are all these nuances you have to play with. But if you’re going to effect change, or look to effect change, in sociotechnical systems, in systems that have humans and the less mechanical parts of the world in there, then you’re going to need at least a sense of the stories that are being told, because they are often where a lot of the truths are.

Benjamin Wilms: Yeah. Or like a champion onsite, or someone who is able to translate it for you. Because I can remember, I was giving, what was it, I guess a demo session for our product, and it was for a company in Finland, and I was getting zero questions, zero feedback. We closed the call and I thought, damn, I really messed up.

And two days later: “It was amazing. You did perfect. We need to move on, blah, blah, blah, blah,” and I was not able to read between the lines. But that’s, yeah, the culture aspect. Do you agree, it’s more a question, do you agree that even if we are making our systems more and more reliable, this can lead to a big risk?

Russell Miles: There’s a lovely subtext to that question, isn’t there? Let me try and explore it in two ways. It’s a great question. So I could almost argue against the first statement, that we’re making systems more reliable. I’m not sure we are yet. I think we’re trying to, I think people are beginning to understand what that feels like, but I think there’s an enormous amount of hope-based investment. Yeah. But hope is not a strategy. No, hope is not a strategy, and I think there’s still a lot of hope, and there’s also a lot of, I’m going to call it baggage, in thinking around reliability being some property of the system, like it’s done, like there’s a level that you need to hit that you can trade off. Whereas I tend to think of it as a practice, a thing you do with the system. So I do question the idea that we’re making anything much more reliable. It’s one of those forcing functions: the more you can make certain levels of complexity behave more reliably, the more the level of complexity will go up, because we then want more of other things, because we’ll then stretch in a different direction.

So I remember talking once to somebody who said we needed a certain language to manage complexity in the system, and I asked where they thought the complexity was, and they said, well, it’s in the classes, it’s in the methods. And I said, no, it’s in how people can comprehend the system and how they can run it and manage it.

The complexity is there in the interactions between parts of the system and people in the system and everything else. Yeah. So, um, so yeah, one argument is that complexity is on the up and I dunno if we’re getting much more reliable really, although we are getting away with it in a lot of quarters at the moment.

We can see on a regular basis there are headline-grabbing moments where people’s beliefs are challenged, let’s call it that. Reality comes to give you a little kick and say, nope, this is not beautiful. And recently, there’s an airport nearby to me that had a massive problem, an outage; its systems all went down and the airport was out of action for at least a day or something like that.

It was maybe a little longer than a day, and I remember watching the reaction from the local community, which was: how could this happen? It’s an airport. Surely there are redundant systems. There are all sorts of ways it can react and respond. And yes, the answer from the electricity board, the suppliers of energy, was yes, there are; we have another power station, we have another substation.

So one went down, there was a fire, so they switched to the next one. But all of the systems inside that airport had not been really rigorously tested for what would happen if the power went down and they all had to come back up. I’m absolutely convinced that was on somebody’s risk matrix and they were feeling very comfortable.

It was down the bottom somewhere, unlikely to happen, and that word “likely” really upsets me sometimes, because the likelihood of something is a complete judgment based on past experience, and therefore it’s mostly null and void. So unfortunately, someone had decided it was unlikely that every single system would need to be rebooted, restarted, and possibly recovered all at once, with the mean time to recovery, or the time to recovery, being very long in this case, and experienced by the clients, the customers if you like, people trying to fly. And so, even in those circumstances where measures had been taken and all that infrastructure had been built up,

the electricity board is saying this airport never actually lost power; they had the power back in seconds. And yet none of the systems beyond that had ever been, I imagine, I could be wrong, but I suspect no one ever practiced that scenario. And I suspect that when they said, we are investing in reliability, they did a paper-based exercise, which is a very common activity in organizations: desktop, tabletop, paper-based activities to explore things.

Yeah, which is theater and fiction, and not the stories I was mentioning earlier at all. And it doesn’t mean that you’re any better. It means you may have grappled with it a little, you’ve tried, but you are still basing everything on hope, trying to use a “risk-based approach”, which is a phrase, again, I hear a lot.

Benjamin Wilms: Another airport story I want to share, because it fits quite well. A while ago I talked to someone from Dubai, and I guess it was around April last year when this heavy rainfall hit Dubai City and there was way too much water. This guy is working as a leader of the quality engineering department of a very well-known airline. So the rain starts, there’s a heavy peak in customer complaints, and something’s going on. They had really designed a technical system that could handle all the customer requests and all that stuff, but then, I mean, it was a situation where the airport was no longer operating because there was way too much water on the runway.

They couldn’t use the airplanes, they couldn’t get the passengers out of the building because the public transport wasn’t working either. And this was really something they were never, ever prepared for. I mean, even though it’s Dubai, so much rain is very unusual. And so, I mean, failures will happen at any point in time and you cannot be prepared for every failure.

And that was a worst-case scenario. But still they took this as an opportunity to really try to figure out, even under those conditions, what can we do as an airline for our customers to make them happy, even with just a little bit, but something. So what they decided is they recreated the system in a way that even engineers or software developers were going into the airport with an iPad in their hands and were able to help the customers in any way they could, and…

Russell Miles: I mean, that’s, yeah, the complexity of the system. Even if you have everything done right, nature will be your biggest enemy at some point. Yeah. I think it’s just reality, isn’t it? Yeah. You’re absolutely right. Nature’s there; we have different phrases for it, Murphy’s Law and others. These things will happen, and the more unlikely they seem, sometimes the more likely they are to rear their ugly heads. But to your original question of whether I think that an investment in reliability, and even a perceived improvement in it, could actually result in a reduction of reliability?

Yeah. I mean, the word most often used there is complacency. And you can see this; there are some lovely stories in safety science about how humans react to safety measures, and reliability is another safety measure. You can begin to have trust and confidence that your system can be relied upon, and the more you are reliant upon it, particularly as a single, always-there quality in your life, the more susceptible you are to nasty shocks and surprises on an enormous scale. We can see that across the world when the world’s financial systems take knocks and shakes. Nassim Taleb is one of the greatest writers on antifragility and fragility, and his exploration is really from an economics point of view; in some respects he was originally an economist, a hedge fund investor type person, and his perspective on these systems is that shocks and surprises, those events we were just talking about, these one-off events as we see them, are not one-off, and possibly not even unlikely at all.

It’s just that if you were to write them down and say, floods in the middle of a desert where a city is built, you might not put them top of the list. Nope. But his point is, not that you should make them top of the list, but that all of your theories about what is top of the list are probably based upon fragile foundations. And so there is an element of complacency at least, and that’s why I’m very careful around the language of anyone saying we have a reliable system, and I’ll often ask questions like: reliable to whom, and under what conditions? It’s really important, I think, to ask in particular, what are the conditions that you think your system is reliable under?

And if you don’t know, then you probably don’t have reliability, or what you have is a hope and a belief. You possibly have some numbers to tell you how much money you’ve spent on it, but you are not sure how and why. And so that’s where I would say to folks: if you don’t have a basic understanding, a basic touchstone of “we are reliable under these conditions because we’ve seen it, because we have evidence of it”, then I think there’s a danger, you’re right, that we have spent a lot of money on reliability, we’ll have spent a lot of money on practicing scenarios that help us grapple with resiliency and even recovery, but we are still wide open to things that can happen. There will always be a stressor too far. There’ll always be a set of conditions that are completely unexpected, the unknown unknowns.

But this is where residuality is interesting, because it tries to embrace the fact that there are naturally unknown unknowns and to work on the basis of these attractors that say: if you explore the system in a random way, and how it will respond to all sorts of random, turbulent scenarios, then you’ll eventually come up with a system that survives many of these scenarios, and then you can go, okay, maybe that’s getting close to what is called the critical system,

the system that is most likely to survive under unknown unknowns. The really exciting part of Barry’s work is that he’s trying to help you architect the technical system in a way you can explore ahead of time, before chaos engineering, before verification, which are all expensive operations, things that can cost a lot of money.

He says you can do a lot of this in design, in architecture, where it’s cheaper to do these things. Even though I’ve been very disparaging of paper-based exercises, what I love about residuality is that it asks you to ask certain questions and seek out an architecture that truly has a better chance of surviving under unknown unknowns. Yeah, but is this enough evidence if you are doing it so early?

Well, this is perhaps where Barry would disagree with me, and there are many things that Barry and I may disagree on, but I will always love the fact that he’s here and he’s part of our industry. I think it’s amazing.

But he would probably argue it’s where the most evidence, the most scientific approach, can be applied. What he’s done is the hard science, the PhD writing and that sort of thing, to explore these things. I would say it goes beautifully hand in hand with experimental technique, which is where you start to look at how we verify, how we explore, how we practice, how we see the change over time.

And so I think in residuality, or “Residues”, the first book from Barry O’Reilly, I don’t think he’s disparaging, but I think he says, you know, chaos engineering is kind of too late, too expensive, so residuality is a better investment. And I think that’s very much… if you took chaos in production? I don’t think he cared where you were doing the chaos engineering.

I think his whole point was that by then you’ve made a lot of very big decisions that can be quite hard to adjust and almost impossible to reverse. So again, he was saying try to put more investment into the architecture and design to explore these conditions. So, I’m going to paraphrase slightly, but it’s bringing those natural conditions we were talking about, those threats from the physical world, to the system.

Early, in design and architecture. Rather than doing, you know, design by previous experience or design by PowerPoint, it’s design by stressor. It’s looking at the different random scenarios that a particular architecture can survive. So it has enormous value. I think where Barry and I perhaps differ a little is that I think there’s a balance of value between that approach when you are designing and architecting, and when you are then verifying and exploring and experimenting and providing evidence through chaos testing, chaos engineering, chaos experimentation.

I see the two going hand in hand. I don’t see them as you can do more of this and less of this, or any of those sorts of arguments. I think they go beautifully hand in hand. But what I have seen, and I don’t yet have enough hard evidence for this, is that when you practice residuality in exploring the criticality of your system, then when you start to experience turbulent conditions, unexpected conditions, in production or even staging, if you decide to inject them there, surprise people with them there sometimes, in GameDays for example, the systems survive. Well, I’ve seen that, but I don’t yet have enough. I don’t have any double-blind experiments, right, where I can say, look, the same system, but it reacted differently depending on whether someone had used residuality and chaos engineering or not. But certainly, again, from the stories I see, I can see these things very much going hand in hand.

It’s the same mentality behind them both. It’s the same need for science in both. It’s just where the decisions are being made and how we are verifying. So I tend to use residuality in combination with chaos engineering to build real confidence, ongoing confidence, in the reliability of the system as it evolves rapidly.
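As a rough illustration of how that combination can look in practice, here is a minimal toy sketch in Python of random-stressor exploration feeding a list of follow-up experiments. It is not Barry O’Reilly’s actual residuality method; the components, stressors, and mitigations are invented for illustration, and each gap it prints is the kind of finding you would turn into a design change now and a chaos experiment later.

```python
import random

# Toy architecture: each component lists the mitigations ("residues") designed into it.
architecture = {
    "checkout-api": {"retries", "fallback-cache"},
    "payment-gateway": {"circuit-breaker"},
    "order-db": {"replica-failover"},
}

# Random, turbulent scenarios to throw at the design on paper, each mapped to the
# mitigation that would absorb it (purely illustrative pairings).
stressors = {
    "dependency-timeout": "retries",
    "downstream-outage": "circuit-breaker",
    "zone-loss": "replica-failover",
    "cache-stampede": "fallback-cache",
    "certificate-expiry": "automated-renewal",  # nothing in this design covers it
}

def survives(mitigations, stressor):
    """A component survives a stressor if one of its designed-in residues absorbs it."""
    return stressors[stressor] in mitigations

def explore(architecture, rounds=25, seed=7):
    """Hit random components with random stressors and collect the gaps that remain."""
    rng = random.Random(seed)
    gaps = set()
    for _ in range(rounds):
        component = rng.choice(sorted(architecture))
        stressor = rng.choice(sorted(stressors))
        if not survives(architecture[component], stressor):
            gaps.add((component, stressor))
    return gaps

if __name__ == "__main__":
    for component, stressor in sorted(explore(architecture)):
        # Each gap is a candidate design change now and a chaos experiment later.
        print(f"{component} has no residue for '{stressor}'")
```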

Benjamin Wilms: Maybe it’s a good point in time to talk a little bit more about failures. I’m a big fan of talking about failures, because that’s a moment where you can learn a lot, and if you’re doing it right, you shouldn’t make the same failure again. So are there any failures from your experience you want to share right now, ones you still remember because they were a big aha moment, a big learning for you?

Russell Miles: Several, I suppose. I mean, I’m very fortunate that my life is full of failure. And I mean that, I’m saying it slightly flippantly, but I’m very fortunate that my upbringing, not really my schooling I would say, but whatever prepared me, maybe I just realized early on, pragmatically, that I’m not going to avoid it, so I had better get good at it.

I mean, I failed my academic career at first, right? So I took a big detour to eventually succeed more academically. I didn’t go the usual path. So my relationship to failure, in the first case, is possibly wired quite differently than that of many highly successful people. I don’t think of myself as a highly successful person.

In fact, I’m slightly proud of the lack of success at times, and I certainly appreciate it. There’s a lot of learning in those failures about what matters. And so I guess I’m just trying to make sure I set the stage: when I talk about failure, there’s a tendency in Western culture to view failure as a thing to be avoided.

And I don’t. I am very keen to see failure, to explore it, to embrace it, to find what there is to take from it. A good example of one form of failure: I had a critical system failure four years ago now, a critical physical system failure, where I ended up very ill, and that was something to learn from.

I was very lucky; I managed to survive it and learn from it, but that was one failure that could teach an awful lot about what matters in this world. I’ve been on stage and talked about some of those things, and why I love building platforms like I do now, and why I am interested in what I’m interested in,

and it’s always about the people. It’s always about how their lives are improved by what we do. But I suppose you’re also talking about discrete system failures and sort of the entertaining things where technology breaks down, and I can talk about specific cases, if you like, about how certain systems are broken.

I mean, the classic example is usually a certificate change. Let’s be honest: certificates. If I could get just a couple of pounds or a couple of euros for every time certificates didn’t get updated, or didn’t get refreshed or whatever, then I think I’d be a rich person.

So those sorts of failures always make me smile. You mentioned earlier that, you know, we hopefully learn from them. We do. The lessons are frequently not actually acted upon, and so I talk a lot about OODA loops when I’m building platforms. And an OODA loop, if you’re not familiar, I know you are, but if your audience isn’t familiar, is this idea of observe, orient, decide, act. And for me it’s a nice model because it says, well, you need to see something. You then need to look at the important pieces of it, possibly enrich it, and then go ahead and try to facilitate a good decision, perform an action, and then quickly observe the results of that action.

And the faster you run these loops, the better. When a failure occurs, there’s an opportunity to learn. OODA loops are a really good model to understand what’s happening, what is being seen, how people’s attention is being directed. What decisions are they able to take, what action can they take? How safe are the conditions under which they can make these decisions and take these actions?
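To make the loop concrete, here is a minimal sketch of an observe-orient-decide-act cycle in Python. It is only an illustration: the health signal, the SLO threshold, and the roll-back action are invented stand-ins for whatever observability and remediation hooks a real platform would provide.

```python
import random
import time

def observe():
    """Observe: pull a raw signal from the system (stubbed here with a random error rate)."""
    return {"error_rate": random.uniform(0.0, 0.10)}

def orient(signal, slo=0.05):
    """Orient: enrich the raw signal with the context needed to judge it."""
    return {"slo_breached": signal["error_rate"] > slo, **signal}

def decide(context):
    """Decide: pick an action based on the oriented picture."""
    return "roll_back" if context["slo_breached"] else "carry_on"

def act(decision):
    """Act: perform the decision; the next observation shows what it changed."""
    print(f"action taken: {decision}")

def ooda_loop(iterations=3, pause=0.5):
    """Run the loop quickly and repeatedly; the faster it runs, the faster you learn."""
    for _ in range(iterations):
        act(decide(orient(observe())))
        time.sleep(pause)

if __name__ == "__main__":
    ooda_loop()
```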

And what I’ve found, actually, in all of the failures that I’ve seen in systems, and where systems are broken under pressure, is that there is a single most important quality that an organization can invest in, even slightly more so than chaos experimentation and empirical evidence gathering, because you can do those things and still fail, and by fail, in this sense, I mean you’re not learning. So I was very interested some years ago in why organizations using the tools that I built in open source were still not really perceived as any more reliable; they didn’t feel like they were any better.

And it was only when I started talking to people in those organizations that I realized there was a quality that was quite low, and you can measure it. There’s a beautiful set of ways of measuring this, usually done using surveys. You can measure this. Yep, and that quality is something that Amy Edmondson talks about in her books: psychological safety.

If you don’t have a psychologically safe environment, learning is, well, it’s not impossible, because learning is a tenacious thing. People will learn anyway. We’re human; we have evolutionary benefits to being able to learn. But learning what is most impactful is extremely hard when it is psychologically unsafe to learn, which means it’s psychologically unsafe to fail.

It means, you know, there’s no second first try. Exactly. And it’s that feeling, she talks about this brilliantly in Amy’s recent books, about the right sorts of failure. And what she means by that is not selecting good failure; it’s using failure and learning from it and not having the same failure again, not having the second piece of harm occurring, making it as safe to learn from failure as possible.

And I think that’s fabulous, and I think a lot of organizations would benefit from investing in that quality and reading Amy’s work, listening to what she’s saying, to help them, particularly when you’re in the business that we’ve been in, that you are in at the moment, of making it possible to learn from failure: controlled failures, controlled circumstances.

Your work will be frustrated by a lack of psychological safety.

Benjamin Wilms: Yeah. Let me share a story about a company we worked with a while ago. This company was setting it up: we need a chaos engineering tool, no, not a tool, a team. Let’s hire people. We need to find the right tool.

So let’s get started. They did a proof of value, and this team was able to identify three or, I guess, up to five very big issues in the system, already in the design, by running some very simple experiments. So the POV was done, wrap-up of the POV, let’s get to the management. This team was standing in front of the management and then, to the point you mentioned earlier, the management told them: now we know, we will shut down this program.

You are all fired. We don’t need chaos engineering anymore. It’s part of our system now, we know. But we are measured on new features, on what drives our business, just revenue, so we don’t care. And so this was a very bad culture, a very bad mindset, top down, and there was no room to really improve yourself or learn from failures. It was shut down by the management. And that’s tragic, right?

Russell Miles: Yeah, it is. Because of what we’re doing, there’s actually a great book being written at the moment, available free online, by Simon Wardley and Tudor Girba.

It’s called “Rewilding Software”. It’s a lovely phrase, and I’m going to paraphrase some of the most important parts, I think, for what we’re talking about here, which is that software engineering, software development, is a decision-making activity. And we never stop. Yeah. So we are constantly making decisions about how we implement something, how it interacts with other things, what conditions it might be able to survive. And the challenge we face is that being able to interrogate and understand the context within which our code works is extremely difficult, and it’s even worse when you try to consider the context as being the organization, or the organization and the customers.

The challenge of even figuring out what the right question is to make a decision on can be quite hard. People spend enormous amounts of energy doing this. One of the things that I think Simon and Tudor are exploring is that currently, they think, around 50% of every developer’s time is spent on figuring out what to do, on making the decision about what to do. Doing it?

Yeah, doing it is much more straightforward in some respects, but figuring out the change to make is where at least 50% of the time is spent. So their argument is that reading the system, exploring the system, takes 50% of the time. Attached to that perspective, if you buy into it, and I really do, I can see that’s how I develop systems, it’s how I know many coders develop systems; I don’t think anyone could argue with it too much. If that’s what we do, it’s a decision-making process. It’s another OODA loop. Okay, and Tudor and Simon and the Glamorous Toolkit that Tudor’s company has created are all about how we observe and orient to make a decision.

Okay, because the act part, well, that could be relatively small, but deciding on the right action is the hard part of software development. Yeah, and it’s almost like this question: do I have enough evidence to really take this decision and not come back again? Exactly. And then you couple that, right?

Because in some respects they’re working from a very scientific perspective. As a scientist, you’re trying to observe, you’re trying to look at the things that matter so that you can make a decision, or at least have the evidence to make the right decision and perform the right action.

Same thing in reliability. When you’re using chaos engineering to explore things, you are trying to gather evidence about the system. Yep. So you can observe it, and you’re trying to focus on the right piece, and then you are trying to look at, okay, what decisions and actions do we need to take to potentially improve the criticality of our system, its ability to survive more categories of turbulence and conditions.

Okay. Both of those things, all learning and science, are undermined if you don’t have psychological safety. And that’s the tragedy of it: you can do all these wonderful things, you can amplify and promote and help people embrace chaos engineering and reliability and the rewilding of software, but the psychological safety in the organization is always the huge hammer that will be there to go, no, you’re not going to learn from this. And that starts, my theory anyway, with the beliefs and the comprehension, the understanding, of the leadership in the organization. How do they think about the world? How do they embrace learning? How do they embrace failure? Mm-hmm. And in my view, one of the challenges we face is that the people in leadership in organizations have often rarely failed, so they don’t know what it feels like to fail, and they don’t realize the opportunity to learn and the need to make it psychologically safe to do so. And so that’s where we get these cultures, I used to jokingly call them the Harvard Business School cultures, where it’s all about how you make the right decision and get things right.

Yeah, I have to be careful here, because I think Amy Edmondson writes for Harvard Business School, so I’m not disparaging the whole school on this basis. But it is this thing of: we amplify the stories of success to the complete detriment of the stories that matter the most, which is how we fail, how we experience difficulty, how we learn from it, and how we survive it.

I just think that without psychological safety as an investment, and a true investment, not a theatrical investment, some organizations are doing hour-long psychological safety courses and then going, right, we’re all psychologically safe now. You’re certified now! What? Exactly, and I mean, it’s always the sticking plaster over the open wound at that point. But genuinely taking the fact that most people have good intentions.

We used to use a phrase that the organizations that can learn fastest are likely to succeed. What was underpinning that, and I think Amy’s work talks about this very deeply, is a culture of learning from failure, learning from these things, whether they be chaos experiments, whether they be actual incidents.

Yes. Whether they be poor decisions, okay, decisions that in hindsight, and as John would say, hindsight is a heck of a drug, we go, oh, that was a mistake, because that’s usually how you can tell. Embracing that and going, “Right, how do we as a collective get on with that? How do we explore this?”

What is it trying to tell us, in a safe way, without looking for blame? And blame is the killer of all this. To use an American sports analogy, the quarterback of a lack of psychological safety is blame. It’s always there behind it going, well, we may not be safe, but at least we can blame somebody, folks.

Yeah. You know, so this is why lots of people like yourself and Lauren and many of the others in this resiliency community that we have, this dispersed and beautiful community, are so careful around words like blame and single root cause, even in the language we use to describe what we do, because they are amplifiers of psychological unsafety, and they ruin our ability to learn. Yeah, and also, it starts with the word “failure”. It was just the wrong decision at some point in time. So just take another decision and improve and learn from it.

Benjamin Wilms: We can go on for hours, but I’ve got one final question, to wrap this up.

If someone wants to start building a culture of resilience, or reliability, and we shouldn’t open the rabbit hole of “resilience vs…”, no, what’s the first thing they should do, or how can they start?

Russell Miles: So, I think we can lead straight off the last answer, in fact: I would look carefully at whether you have a culture of psychological safety.

It’s table stakes. I’ve been asked this a lot: what are the table stakes for organizations to do chaos engineering? You know, do you need to be able to do these things in production? No, you don’t. In fact, I had an organization, I think I’ve told this story many times on stage, where the organization were extremely proud of the fact that they were running chaos experiments in production.

They phoned me up and they said, you’ll be so proud of us, you’ll be so happy: we’re running chaos experiments in production, and everything’s broken. And I was like, no, no, no, what you’ve got is masochism. You haven’t got chaos engineering… by design, yeah, by design. Yeah, and if you don’t own the system that you’re hurting, then it’s just sadism at that point.

You’re not learning from anything. You’re learning what not to do. So yeah, I never used to say there were many table stakes in chaos engineering. Anyone could begin to explore the resilience of their systems in a safe environment at any point in time, and you’d get some value out of doing that.

Yeah, maybe not a lot, but you’d get something. I now would say that there are many ways you can approach it. You can approach it in architecture and design with residuality. You can approach it in verification and empirical, practical evidence-seeking, before things get to production or in production, depending on where you want to learn the lesson and how real it is.

You could also invest heavily in how you do incident management and how teams work well or badly together when trying to respond to the greatest lessons that nature can send our way. But if you don’t invest, at least in parallel, in maintaining and establishing a culture of psychological safety, then the chances are you don’t learn from these things.

You might think you are. It can look like you are, but you probably aren’t. And it’s not that complicated to begin to invest in it and to watch out for the behaviors that are the small evidence signals, mm-hmm, that you don’t have a psychologically safe environment. The hardest thing for most organizations is that it’s very introspective, for individuals as well as organizations, to look for psychologically unsafe conditions. I’ll give you an example; I’ve been pulled up on it myself. By asking a kind of passive-aggressive question, you make it slightly unsafe for someone to be able to offer a solution.

You’re saying, here’s the right answer, do it, which is not opening up the conversation. So I think there are many, many experts out there with more expertise than I on how to help develop psychological safety. The table stakes for a modern, highly available, highly reliable system begin with psychological safety, because you have to learn. Everything we’ve talked about, everything we’re doing, everything you and Steadybit are doing wonderfully in the industry, everything about all the folks we’ve talked about and the impact they’re trying to make, is muted by a lack of psychological safety, because everything you’re doing amplifies the potential to learn.

And then, well, cold water is poured on that particular fire if there’s not a psychologically safe habitat surrounding it. Yeah. So investing in that is, for me, if not the table stakes… it’s not that you do one before the other, you don’t have to do that.

I’m a big fan of not looking at things as do this, then do this. I’m afraid it’s often do them both. But I would say that any organization that wants to benefit from these modern practices, and those organizations that do, will produce systems that are observably more reliable, absolutely more relied upon, and frankly, probably more competitive in the marketplace.

I work in financial services and in FinTech, and everyone in that world will tell you that reliability is number one, and scalability is part of reliability, and it’s all about the reliability of the customer experience, because financial services are trust-based systems. It’s everything.

Everything’s about trust. That’s all it is. Yes. And so to maintain that trust, you’ve got to invest in reliability. The hardest part is they also have to invest in people, because of the way reliability is achieved and understood these days: it doesn’t come from structures and authority. It comes from lots of creative individuals who feel psychologically safe enough to learn from the things that are going right or wrong.

Benjamin Wilms: Well said, well said. Last question, or more like: when people want to learn more about you, what is the best place to start?

Russell Miles: My goodness. To learn about me. Well, come talk to me. My door is always open. I love speaking to people. I love listening to people.

One of my hobbies, it’s a strange hobby, but I volunteer with the Samaritans in the UK; we are a crisis line, so I usually spend four to ten hours, depending on how keen I am in a given week, listening to people in crisis and helping them navigate difficult circumstances. But I’m not suggesting that’s the best way to get hold of me, because the chances of you getting hold of me at exactly that moment in time are minimal. But yes, reach out to me on LinkedIn. I’m often on LinkedIn; it is one of the better ways to get hold of me. Pop me an email, russ@russmiles.com. It’s an incredibly egotistical email address. And I’m always keen to talk to people anyway, because, not to harp on about it, but when I was very ill some years ago,

I made a decision that I am not here to make huge amounts of money. I’m not really here to make rich people richer or any of those things. I’m here because there are lovely people in the world trying to make things better for everyone. And so if you want to talk to me about how your organization is struggling with psychological safety, I’m here.

I may not be an expert, but I’m happy to hear about it and talk about it and maybe kick it around a little with you. If you are struggling with chaos engineering, if you’re struggling with residuality or reliability, just reach out to me on LinkedIn. It’s entirely fine. We’ll go and have a conversation. I will always make time for a phone call.

It’s usually best as a phone call. Yeah. And if I get so many people asking that I have to change that rule, then I will, but to be honest, it’s usually the other way around. Very few people call or get in touch, but if they want to get in touch with me, LinkedIn’s a great way.

My website’s about to be updated with a whole bunch of courses and things that I do for people, so that may be another place to go and look: again, russmiles.com. I’m not hiding. And yeah, I guess I always like to say this. I’m giving a talk next week in Mainz, just outside Frankfurt.

And I’m very excited, because I’m going to the home of the Gutenberg press, so if you want to think of a platform that changed the world, there’s one. So I’m going to do that, enjoy it, and try to practice my very, very rudimentary, very basic German. The last few slides of my talk are along this line, you know, of how to contact me, and I practice something called Sophie’s Law.

Okay. Sophie is my dog. She’s actually snoring next to me right now. She’s a French bulldog. Okay? Sophie’s Law goes like this: from the moment I got her, well, I used to be an introvert. I still am an introvert; I used to be quite shy. My dog is not. My dog views everybody in the world as a friend she hasn’t met yet, and if she can get eye contact with you, then you need to come and talk to me. You need to be part of this, I’m amazing. I wish I had an ounce of my dog’s confidence, because she’s literally like: I’m amazing. Talk to me. Amazing. I am brilliant. So I try, and I would say this to all of your listeners, I try to practice Sophie’s Law.

Everyone’s a friend I haven’t met yet. Yeah. I would love to talk to you about what’s going on in your world, and I extend that to organizations as well as people. If it’s organizations, it might have a price tag, but if it’s people, it very rarely does. So yeah, I’m here. Reach out. Thank you very much.

Benjamin Wilms: Yeah, it was really a pleasure to talk with you. It was really a pleasure that you joined my podcast session, and I learned a lot, and I hope you also enjoyed the session.

Russell Miles: I really did, and it’s so lovely to see you again, my friend. And please, likewise, you and Steadybit keep doing the amazing things you are, because the industry needs these technologies.

It needs people like yourself who are leading with goodness in their heart, who are trying to make things better for everyone. I think there should be more organizations like this, so it’s always wonderful to see.

Benjamin Wilms: Now, I’m getting emotional. Oh, boy. Thank you.

Russell Miles: Well, I do my best. If I can’t make you cry, what am I doing?

Benjamin Wilms: No, no, not yet. Okay, I need to stop the recording now. Thank you very much. Thank you so much for inviting me on.