October is Mental Health Awareness Month, and October 10 is World Mental Health Day. With this day, the WHO wants to raise awareness of mental health issues around the world and mobilize efforts in support of mental health. The day provides an opportunity for all stakeholders working on mental health issues to talk about their work, and what more needs to be done to make mental health visible. Speaking of mental health in the tech industry, especially in the environment of Chaos Engineering: What mental processes do we go through before tool decisions are made and Chaos Engineering processes can be successfully established? Let's look at Resilience Engineering on a human level: What do we need as a team to function well and build a Culture of Resilience? We think creating a healthy system and ensuring the health of the team requires looking at the people behind the code. That's why we asked Nora, founder and CEO of Jeli.io, who focuses on the human factor in Chaos Engineering: What do companies need to change in the future so that teams can build a culture of resilience and sleep soundly at night?
The following text is based on an audio transcript. If any typos have sneaked in, we apologize at this point!
Right off the start, I'd like to ask you a sensitive question: How healthy do you consider the tech industry to be?
To be totally honest, I don’t think it’s healthy and I think we all know that. We're all very aware of how bad some of these things can be for our mental health and in turn, that ends up being bad for our businesses and ultimately bad for society.
One thing that the pandemic revealed, and that was actually good for the software industry, is how sensitive people are to burnout. We are starting to take it more seriously because we understand how prone people are to it.
At many organizations, especially throughout Silicon Valley, the average tenure of a software engineer is only a couple of years. Software engineers are helping to solve different, interesting problems, but a lot of the time they take that knowledge with them when they leave. We promote this hero culture where “Alex” is rewarded for always being there. When Kafka has an incident or when Console has an incident, we think, “Wow, Alex must know things that no one else does because she is always responding to incidents.” Rather than figuring out how to distill Alex's knowledge throughout the rest of our organization, we keep promoting Alex. We give this person more money, more stock options, and we try to make sure this person never leaves us because we need them so much. What happens is Alex eventually burns out and leaves.
Just as in our technical systems, it's very bad to have human silos that do not have failovers. As a software industry we’re not great at figuring out what Alex is good at, how to ask cognitive questions, or how to distribute siloed expertise. Ultimately, we end up unhealthy because of this hero worship. The thing that businesses don't understand is how bad these islands of knowledge are for business.
In my research, I have often come across the term "psychological safety in teams". Google asked itself in 2016: "What's the essence of high performing teams?" After a lot of research, they came to the conclusion that the most important factor is psychological safety. This is composed of diversity, speaking time, and the ability to build empathy. What are your thoughts on this?
I absolutely agree that psychological safety is paramount on teams; we can all agree on that. I don't think that it is a contentious thing. The contentious part is that not everyone agrees on how to build it or educates themselves on how to build it. It is incredibly important for leaders because ultimately, it helps their businesses if they're fostering a psychologically safe team. The cognitive psychologist Gary Klein has written a lot about Cognitive Interviewing. It's essentially talking to people one-on-one after incidents, to create a safe environment for them to share what they felt happened.
Everyone has emotions after an incident. They might be feeling joy, they might be feeling sadness, they might be feeling nervous, they might be feeling a number of things. By asking them, “Hey Jona, how do you feel about this incident?”, they start to talk about it and we ease our colleague into the situation.
As the person asking questions after the incident, my goal is not to interrogate Jona, it's to make him feel like an expert. By making Jona feel like an expert, he's going to feel a lot more psychologically safe. He has a safe space with me to talk about this incident. I create this safe space for conversation with Jona, with Alex, and with anyone else that was involved in the incident. It is important to give them a safe space to understand that the stuff they're going to tell me will be bubbled up to management. After the interviews, I aggregate all of the data.
Going back to what you said about diversity, it's not just diversity, it’s the inclusion part. You could hire a very diverse team but it doesn't really matter unless, like you said before, you're giving people speaking time, you're giving people certain projects, you're fostering people's growth and understanding different people have different needs for their growth.
I would be interested in your thoughts on the following assertion: "Healthy teams are high performing teams".
I like it. I think there are different ways to define healthy to get it right. It depends on how you're defining a healthy team. What does a healthy team look like in your opinion?
The team is not afraid to make mistakes because making mistakes is part of the company culture. It's like we said earlier: It's perfectly okay to say "We caused an incident in the team. We don't know exactly how this one happened in the system, but let's figure it out together." To me, that feels healthy when people aren't afraid to talk about mistakes and learn from them together.
I really like that. When I was first building my team at Jeli, we did an exercise amongst ourselves because we didn't know each other very well. I asked everyone to reflect on the best team they've ever worked on, the most productive team they've ever worked on, and the worst team they had worked on. Some people's answers overlapped a little, but it was interesting to see how people defined healthy, productive teams. A lot of what I heard was that people know the goal of the team and everyone knows their part in helping to achieve that goal; it's a lot about working together. You might feel a lot of respect from your manager and feel like you're able to do your work, but it doesn't matter as much unless you're executing as a healthy team. As a manager, you are responsible for fostering that capability for your team, for you as human nodes to coordinate with each other and output something beneficial from that.
I think a lot of people miss opportunities to reflect on situations until after something has gone horribly wrong. That actually fosters bad psychological safety, because you're only taking the time to see how one thing went. When this happens every day, it's a miracle that your system is staying up.
When we talk about mental health in the everyday lives of SREs, OPs, and DEVs, how much pressure do you think there is for people who are responsible for the whole site's reliability? And how do you think this pressure is created?
It's timely that we're having this chat after the Facebook incident. I think a lot of that responsibility falls on SREs, OPs and DEVs. It's actually phenomenal. I mean phenomenal in a bad way. It's like at the end of the day, our infrastructure is held up by pieces of technology that you or I understand, but those pieces of technology were a product of us all working together, too. I have noticed a lot of the time after bad incidents, it's the SREs coming online to type the public incident report or apologize. I wonder why it's not the product of a bunch of people in the organization.
Take for instance if you run a PR campaign where a bunch of people hit your website at once and it happens to go down. As an SRE I'm not going to be preparing for PR campaigns to be run all the time, so I'm not going to scale up my infrastructure that much all the time because that's going to be expensive. If the SRE and the PR person had coordinated and understood each other's worlds a little bit better, that might have led to a more stable situation. After incidents, what we really need to reflect on are all of the things that happened organizationally to enable the incident to begin with.
I think there's an enormous amount of pressure that is created because emotionally, after incidents, we need to be able to point to something. As an industry, I think we are getting better at not pointing to a person, but we're still not great at not pointing to a thing. Instead of saying things like “oh, it was this database again,” or “it was this particular Kafka broker again, I knew that was a bad idea”, it would be great for us to understand the pressures of the organization or how management was thinking about these things that enabled that Kafka broker to be that fallible or enabled that database to be so wobbly. How did we even make the decision to use that database in the first place?
By understanding the history of certain things and anthropology of our companies, I think it ends up creating less pressure, but we don't take the time to do that.
Would you agree that SREs and OPs form a much smaller group than the developers, PR, and the other departments in an organization? It's more like everybody is running over SREs and OPs, and they are, let's say, just a handful of people who need to keep operations and systems running. There is no balance: there are companies with 500 developers, but only ten to twenty SREs. How should they handle production?
SREs are thought of by the rest of the company as these magical people that just know things that no one else does. That's a very bad way for everyone to look at it; I think we propagate that as SREs, too.
Some of us like being those magical people, so we don't want to figure out how to talk to other groups. We perpetuate this reputation, but it's actually bad for us and bad for our organizations if we don't find the time to reflect and ask “what do we need from marketing?”, “what do we need from PR?”, and “how can we communicate that to them, so that they can work with us better?”
Companies may have never heard of Resilience Engineering, and may not know that they need more SREs. Or need any at all. So it's a cultural and organizational issue that leads to priorities and resources being shifted differently. More resources are kept free for marketing, for example, instead of thinking about a healthy system.
I don't think leaders and organizations take a lot of time to understand what healthy actually means.
If we did take some time to study all the research that started in the 1930s before software engineering and before all these startups popped up, we could look at how a lot of those research papers studied incidents and healthy organizations–and apply it to us.
We're actually in a really great place as an industry. We're still young enough that we can study a lot and we're not regulated. Both of these things allow us to add in some of these structures and learnings in ways that help our organizations. A lot of other industries are regulated and are not able to do some of these things. What people aren't willing to give up is the time that it's going to take to learn some of this stuff.
What we have also observed is: SREs are just firefighting and hunted by their own systems. But they are not on this long journey and don't have time to ask themselves, how could I prevent the next incident? They are so under pressure, they are not healthy in their daily work.
SREs are so under pressure! They are constantly firefighting and it's not a great mode to be in. When they do take the time to reflect and say, “how could this go better,” it's discussed by a roomful of SREs. You need to bring the product team in the room, you need to bring the business team into the room, heck, sometimes you need to bring the CEO into the room. They need to hear some of these things, too. By taking the time to reflect on an incident and your cost of coordination, you can work better next time.
From your long experience at companies like jet.com, Netflix or Slack, how are teams and people dealing with this pressure?
It depends on what the organization is going through at that particular point in time. At jet.com, we were an e-commerce site, and our biggest time period was the day after Thanksgiving, when e-commerce companies and shops tend to offer a bunch of deals to help people prepare for the first day of Christmas shopping. Everyone signs onto the website at that particular point in time, so that was when pressure increased a lot. We called it purple Friday. We were all working towards purple Friday. At Netflix, on the other hand, the biggest time period of the year was the month of December, the holidays. We always said it was because people secretly didn't like spending quality time with their families so they wanted to watch Netflix together instead. With Slack, I was there right around the time we were IPO-ing, so there was a lot of pressure to make sure the system was stable at such a time of high publicity.
Each of those organizations went through very different periods of pressure, but I have noticed that pressure exists for engineers and organizations everywhere. We serve customers. So we have to think about when our customers are signing onto the application and experiencing incidents. In my opinion, that pressure is created around business goals most of the time. I don't think periods of high usage for your application are necessarily bad things, they're actually really great things. I think the bad thing comes when we expect that nothing bad is going to happen during those time periods. Like you said before, we have to be willing to fail, we have to be willing to fail fast, and we have to make failure not a very big deal.
I think it's a very poor approach when organizations say things like “we have to prevent incidents,” or “we have to get our incident counts down”. My theory is that we shouldn’t say that because it removes psychological safety. Instead, we need to say incidents don't have to be a big deal; they are a part of our everyday work. That's how we get people better at dealing with pressure. The way we get bad at dealing with that pressure is saying things like, “wow, we have this really big business thing that's about to happen; Alex and Jona are my two best engineers, everyone else go sit over there. Alex and Jona, I need you to work 70 hours this week”. Alex and Jona are doing all this work and that work isn't being shared and that’s a bad thing for the business. As an industry, I think we are fundamentally bad at training each other and distilling expertise. We have no formal way of bringing people up through organizations, and a lot of organizations don't celebrate you sharing your expertise with someone.
Relating to your Talk at QCon: In this talk, you pointed out that the introduction of new tools, especially in the Chaos Engineering teams, actually means that a culture shift is taking place. What are the challenges in such a situation?
Sometimes the introduction of new tools can be tough for teams that aren't taking the time to understand how to work with the tool. They're expecting the tool to just take care of cultural problems, and that is when things get tricky. We should embrace the fact that tools are a member of our team. A lot of times people think of these as two separate worlds, so I think that's when the mistakes and challenges come in. If we think of things in a joint cognitive systems approach, we're actually positioning ourselves much better to elevate both our team and the tool at the same time, which ends up being a win-win.
It is difficult to establish processes and a culture of resilience because it's not something you can make happen just by attending a training. It's something that you need to practice every single day. Some ways to practice this culture are to get good at cognitive interviewing, get good at incident reviews, and get good at recognizing places where you could use training.
I think one of the best uses of Chaos Engineering is not to prevent issues from happening in production, it's to train people. It's one of the best ways to show people how their systems work.
One mistake I see with Chaos Engineering is having one person running experiments in the background, when really it would be beneficial for several people to run them. After an incident you can look and see who you relied on. If you have a realization of “wow we relied on Alex again” after an incident, you should run a chaos experiment and not let Alex do it. Alex should be in the room, it should be code she knows, but she should not be driving. This allows us to distill Alex's expertise a bit more.
From your point of view, how do startups in general and your own startup Jeli.io, as compared to established companies, adopt technical and cultural changes?
It has been really interesting for me running my own organization, because I've talked about resilience for years. I've done resilience in organizations as an IC and now I'm getting the opportunity to really practice what I preach by not only creating a tool that helps with these things, but creating an organization that is absolutely reflective of the things that I've studied. We're not perfect, but I think we take time, we don't ignore anything that's happened, we reflect on it and how we deal with it; what felt surprising, and what felt challenging. We had an incident last week and I had so much fun doing the incident review. I think every single person on our team did! That feels like a win. We actually invited our customers that experienced the issue to our incident review. I think that's helping to create a culture of psychological safety and resilience. I have 20 people on my team right now; I know all of them but as the organization grows, I won't know all of them so that will be harder. I do think some things you do in the beginning of an organization percolate outwards as you add more people. I think it's so important for startups to do that in the beginning because it's harder for bigger organizations to add that in later on.
We have learned that new processes and new tools can be a major stress factor. Now, of course, it's paradoxical that Chaos Engineering tools aim to counteract this stress. Hence my question: What does the tech industry have to pay attention to in order to keep people and their mental health in focus when we want to introduce a new tool?
We have to be prepared to receive that new tool. I think we have to do a lot of reflection on why we need this tool and if we can't answer some of those questions, a tool is not going to save us. No matter how good the tool is, if we as a team haven't taken the time to identify our gaps and where things could be escalated, that tool is not going to be successful.
What are your top 3 when it comes to building a culture of resilience, and accordingly a healthy team?
Communicating with each other and sharing information with each other. Don't work in bubbles in your organization.
Listen, understand, and learn from each other.
Let the experts be experts. As a leader you hired these really amazing, brilliant people. So let the people be good at these particular things.
Nora Jones, Founder & CEO of Jeli.io
Nora has spent her entire career in SRE, incident and resiliency spaces. She worked at Netflix, jet.com and at Slack, where she was able to experience intense personal development in different ways. Through a lot of passion and understanding, Nora founded Jeli.io, a San Francisco-based company helping you to analyze your incident data.
Nora says about herself and Jeli.io: "One of the things I've always noticed was how much incidents could give us data about how we're doing as an organization. And not only that, it can give leaders a lot of data about how well they're positioning their system. There's been a lot of research throughout various industries outside of tech that are considered safety-critical industries and that have a lot of focus on incident investigation, and how much it can actually show you about your organization. I like to say that incidents are mirrors. And it's one of the only times we actually get to see our true selves as an organization. And so we made a tool to help you grant that expertise to folks in the software industry, so that they can see those patterns, those situations as well, so that they can constantly improve."
Say hi to Nora @nora_js