“Experiments in Chaos” is a Steadybit podcast that brings together experts in software systems reliability to discuss how to incorporate resilience practices like chaos engineering to foster a culture of reliability.
In this episode, Benjamin Wilms sat down with Tom Handal, a Sr. Principal SRE at Roblox, to discuss how chaos engineering can help improve the reliability and resilience of large-scale systems. Tom shares how he overcame initial hesitation by aligning with existing processes and winning over teams by sharing experiment findings. Lastly, Ben and Tom discuss how AI will likely impact the future of reliability and chaos engineering.
Benjamin Wilms: Hello and welcome to my next session. In past sessions we explored how different teams deal with unpredictability, from building a culture of chaos engineering to uncovering reliability risks before they turn into incidents. Today we are taking this conversation to the next level with someone who lives and breathes this every day: my guest is Tom Handal from Roblox. Welcome to my session.
Tom Handal: Good morning, Ben. Yeah, thanks for having me.
Benjamin Wilms: You are welcome. So maybe you can explain a little bit what you are doing at Roblox, what your role is, and share some high-level details about the system you are operating.
Tom Handal: Sure. Yeah. Just to give a little bit of background on myself: my name’s Tom Handal, and I’m a reliability engineer at Roblox. I began the chaos engineering program here at Roblox about a year and a half ago, and it’s been a fun and interesting project. A little bit about me.
I’ve been in the industry for about 28 years. I started off with a few internships back in university: I interned at the Department of Defense for a few projects, and also worked for the Education Center for Computer Science and Engineering at San Diego State University with Dr. Chris Stewart in the computer science department there. I was also the president of the Association for Computing Machinery chapter at SDSU. Roblox is the ninth company I’ve worked for. My background is in security: security software, kernel development, DRM, things of that nature. I was director of security for a smaller company, where I developed security libraries and related tooling.
But later in my career I got involved with companies such as Facebook and Twitter as a reliability engineer. And now I’m at Roblox living the dream, working on chaos engineering. So it’s been a fun ride so far.
Benjamin Wilms: Absolutely. And it sounds very impressive. And, what personally fascinates you about large scale systems?
Tom Handal: Yeah, the idea of hundreds of thousands of machines with millions of user interactions per second is really mind-boggling to me. Having worked at a handful of companies with massive scale, I’ve spent years around it, and it can be overwhelming sometimes. You learn different design patterns around distributed systems theory, and over time it gets a little bit easier, but it takes many disciplines to make things work.
Compute orchestration, databases, object stores, message queues, telemetry, distributed caches, and now machine learning and artificial intelligence: it’s just getting more and more complicated over time. Reliability in this kind of environment is not about everything working or not working.
It becomes more about how to keep the system running when something is not working, because in such a complex system there is always something going wrong. A minor issue with a single service can cause cascading issues that ripple throughout the system.
The engineering of how to keep the user experience positive while something is going wrong is the real discipline of reliability, in my opinion. And this is not just about the systems themselves, it’s also about the engineers: balancing cost, performance, and risk tolerance is not easy.
So one of the tools we use is chaos engineering, to proactively test that the user experience stays positive in the face of the different failures we might inject into the system.
Benjamin Wilms: And just for the audience, for reference: large scale. For one person this could be 100 nodes, for someone else maybe 500. From your experience, from your background, what is large scale? How big are those systems?
Tom Handal: That’s a great question. Large scale, I guess, is a relative term. Throughout my career, like I said, this is my ninth company, I’ve worked at places where we thought large scale was 20 or 30 machines. That was years ago, but still.
Even with 30 or 40 machines you have to deal with multiple machines and a lot of moving parts, and as you grow into larger and larger companies that scale just gets more and more complex. If you’re lucky, you can kind of grow with it.
You start out at a smaller company where the scale is not too big, and then you move up. That’s how it worked for me; I was lucky that way. Then there are very large-scale systems: Facebook, for instance, was millions and millions of machines in, I don’t remember exactly, but dozens of data centers.
It was very overwhelming. Then I moved to Twitter, which was a little bit smaller but still very high scale, hundreds of thousands of machines, and then Roblox, hundreds of thousands of machines in a few different data centers. So the term large scale is relative, but honestly it doesn’t really matter how many machines or how many services you have.
It’s still complicated, because it depends on how things were designed, how many services you have, how many customers, where they are coming from, whether you are latency sensitive. There are a lot of factors and a lot of variables. So I think something like chaos engineering, as well as a lot of other disciplines, is useful regardless of the scale you’re at.
Benjamin Wilms: Absolutely, and that was also the point of my question, because from my experience, even if you’re running, let’s say, a three-node Kubernetes cluster, you can get into a lot of trouble, while at the same time someone else is sitting very relaxed in front of 100 nodes. So it really needs to be seen in context.
Let’s zoom in on the topic of chaos engineering. Maybe you can explain what it means for you, why you started with that topic, and, from your experience, some core principles you are still following. There are a lot of principles you can follow in chaos engineering, but which are the core principles you follow?
Tom Handal: Right. So yeah, that’s a great question.
To me, chaos engineering is about attempting to cause different types of degradation in production to see how our systems behave in the face of that degradation. This is what I call proactive experimentation: trying to inject failures or issues that we haven’t necessarily seen before, or haven’t experienced in a certain way.
Then there’s reactive experimentation. That’s when we have an incident. Whenever there’s an incident, we perform a postmortem where we do root cause analysis; we determine what caused the issue, and we come up with fixes, which we call action items, to actually fix the issue and make sure it doesn’t reoccur.
The difficulty there is: how do you make sure the issue isn’t going to reoccur? Up until we had the chaos engineering platform, that was difficult, because how do you reproduce that scenario? With a good chaos engineering platform, a lot of the time you can actually reproduce the scenario that occurred in production and test the action items that were put in place, to really ensure the issue isn’t going to happen again.
And I think that’s really important. So that’s the proactive and reactive approach I talk about. As for principles, one of the main things I care about is making sure we put the technology in the hands of the engineers. You can have a chaos engineering team that runs the experiments, which is great.
But like we talked about with scale, regardless of scale, and especially in a large-scale company where you have hundreds or thousands of engineers, it’s almost impossible to scale the number of experiments you really need without giving that capability to the engineers. You have to have guardrails and safety mechanisms in place to make sure they don’t accidentally bring things down, but I think lowering the friction for teams to actually use the chaos engineering product is very important.
Some of the principles around that are transparency and safety: making sure that when you’re running an experiment there’s high transparency and people know about it, and that you have guardrails, proper pre-flight checks before the experiment, and in-flight checks, meaning you make sure you’re not causing an incident during the actual experiment, because we do run experiments in production.
So, transparency and safety, low friction and getting the use of that chaos engineering platform into the culture of the company, having it become part of your daily work as opposed to a side thought or something that somebody has to come and ask you to do, if that makes sense.
Benjamin Wilms: And you mentioned the word “guardrails”, I guess, two times. From my experience there are two ways of doing chaos engineering if you would like to roll it out to the organization and the teams: one way is to implement controls, and the other is guardrails. What is your view on this?
What is your preferred way of doing it to enable engineers to get started with chaos engineering?
Tom Handal: Yeah, that’s a great question. I actually like both, controls and guardrails. There’s a little bit of overlap in the terms, but take controls: for instance, we have many teams, I don’t know the exact number, but let’s say several dozen, and each team has some number of services that they write and deploy, and each one of them is very important. Each team works on some amazing services and deploys them, and a lot of them depend on each other; they call each other to perform certain pieces of work.
That being said, you want to make sure that those controls are in place, so the controls that we put in place are, for instance, we don’t want anyone experimenting on another team’s service, right? So we make sure that that team has only the permissions to run an experiment on their own services or their own infrastructure.
We also put in place controls that, I guess, lean more towards the guardrail side of things: very specific pre-flight checks. We have basic pre-flight checks, things like making sure experiments are not run on a holiday, or while a code freeze is going on because maybe there’s an event happening.
Not running experiments outside of business hours is another big one, because we don’t want to cause an incident and have to page people. We’d rather do it when everybody is at work and ready to react if there needs to be a reaction. Other guardrails we put in place are the in-flight checks.
For the in-flight checks, we have certain ways to detect what we call player drop. As you know, Roblox is a gaming platform, very popular; we have tens of millions of people playing at any given time. And we use...
Benjamin Wilms: My kid as well.
Tom Handal: Yeah, I’m glad they play. So player drop is just what it sounds like, right?
We track how many players are playing at a given time, and if we see a sudden drop that is anomalous, then we know something is happening, something is going wrong with the system, and we can start an investigation from there and try to figure it out.
So if we’re running an experiment, we have an in-flight check that watches the player drop metric, amongst a lot of other metrics, to make sure we’re not seeing a player drop. If we do, it will automatically abort the experiment right away. That’s very important. So hopefully that makes sense regarding the controls.
Controls are more about making sure people are running appropriate experiments on their own infrastructure and services; that’s one of our main controls. The guardrails are more like the pre-flight and in-flight checks. Another form of guardrail is making sure people aren’t putting in latency values or certain experiment configurations that we would consider too dangerous. Another check we’re working on is dependency checking; like I mentioned earlier, a lot of services call each other.
Sometimes we might, but in general we don’t want to experiment on services that are direct dependencies of each other, just to make sure we’re not causing a larger cascading failure than we intend.
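To make the controls and guardrails described above a bit more concrete, here is a minimal sketch in Python. It is hypothetical, not Roblox’s platform: the ownership map, the check names, and the player-count source are all assumptions. It only illustrates the three ideas from this answer: ownership scoping, pre-flight checks, and an in-flight abort on player drop.

```python
from dataclasses import dataclass
from datetime import datetime
import time

@dataclass
class Experiment:
    team: str            # team requesting the experiment
    target_service: str  # service the fault is injected into
    action: str          # e.g. "inject_latency"

# Hypothetical ownership map: a control so teams only target services they own.
SERVICE_OWNERS = {"matchmaking": "game-infra", "avatar-api": "avatar-team"}

def owns_target(exp: Experiment) -> bool:
    return SERVICE_OWNERS.get(exp.target_service) == exp.team

def preflight_ok(now: datetime, holidays: set, code_freeze: bool) -> bool:
    # Basic pre-flight checks: no holidays, no code freeze, weekday business hours only.
    if now.date() in holidays or code_freeze:
        return False
    return now.weekday() < 5 and 9 <= now.hour < 17

def player_drop_detected(current_players: int, baseline: int, threshold: float = 0.05) -> bool:
    # In-flight check: a drop of more than `threshold` below baseline is treated as anomalous.
    return current_players < baseline * (1 - threshold)

def start_injection(exp: Experiment):   # stand-in for the real injection call
    print(f"injecting {exp.action} into {exp.target_service}")

def stop_injection(exp: Experiment):    # stand-in for rolling the fault back
    print(f"stopping {exp.action} on {exp.target_service}")

def run_experiment(exp: Experiment, get_player_count, baseline: int):
    if not owns_target(exp):
        raise PermissionError("teams may only experiment on their own services")
    if not preflight_ok(datetime.now(), holidays=set(), code_freeze=False):
        raise RuntimeError("pre-flight checks failed; experiment not started")
    start_injection(exp)
    try:
        for _ in range(30):                              # bounded experiment window
            if player_drop_detected(get_player_count(), baseline):
                print("player drop detected, aborting experiment")
                break
            time.sleep(10)
    finally:
        stop_injection(exp)                              # always roll the fault back
```

The `finally` block mirrors the safety idea stressed here: whatever happens during monitoring, the injected fault is rolled back.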
Benjamin Wilms: Yeah, that makes sense. And so, if you are running an experiment, what is the outcome you want to achieve with it?
Tom Handal: Right, yeah, that’s a great question. The outcomes are essentially the goals of the actual experimentation, and those goals evolved over time for us. When we first started the program here at Roblox there was, of course, a lot of nervousness; when you use the word chaos, people ask, why would we want to do that? So I purposely started out by integrating with a system that already existed at Roblox called squeeze tests. Squeeze tests were basically a way to stress a particular service in production.
You choose a service and slowly reduce its number of instances, so you are slowly increasing the requests per second on a per-container or per-instance basis. We call that a squeeze test, and it stresses each individual container more and more to see where the breaking point is.
Do we see latency spikes, CPU spikes, memory pressure, any kind of bottleneck, maybe in thread processing or whatever it might be? So I integrated our chaos engineering product with that first, and we ran squeeze tests, but through the chaos engine. I did that on purpose, to use a type of experiment people in the company were already used to, just run through the chaos engine, to build that confidence level.
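As a rough illustration of the squeeze-test mechanic described here, below is a minimal sketch assuming hypothetical `scale_service` and `get_metrics` helpers; it is not Roblox’s tooling, just the step-down loop: remove instances gradually so per-instance load climbs, and stop once a latency or CPU threshold is crossed.

```python
import time

def squeeze_test(service: str, min_instances: int, scale_service, get_metrics,
                 p99_limit_ms: float = 250.0, cpu_limit: float = 0.85):
    """Slowly remove instances so per-instance load rises, and stop at the first
    sign of a bottleneck. scale_service and get_metrics are hypothetical helpers."""
    original = get_metrics(service)["instances"]
    instances = original
    try:
        while instances > min_instances:
            instances -= 1
            scale_service(service, instances)   # squeeze: one fewer instance
            time.sleep(120)                     # let traffic and metrics settle
            m = get_metrics(service)
            rps_per_instance = m["total_rps"] / instances
            print(f"{instances} instances -> {rps_per_instance:.0f} rps/instance, "
                  f"p99={m['p99_ms']:.0f} ms, cpu={m['cpu']:.0%}")
            if m["p99_ms"] > p99_limit_ms or m["cpu"] > cpu_limit:
                print("breaking point reached")
                break
    finally:
        scale_service(service, original)        # always restore the original capacity
```

Running the same loop through a chaos platform rather than a bespoke script is exactly the confidence-building move described above: a familiar test, new engine.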
Once people were used to that, the next goal was to start running experiments on very specific, well-known systems where we could really reduce the blast radius and target things very precisely, so that it was clear that if something went wrong, it probably wouldn’t be a huge deal.
That turned the knob up a little on inducing chaos, but not so much that people would get too nervous. The goal there was, first, to test those systems, because these were valid chaos experiments, but also to build another level of confidence and introduce chaos engineering slowly into the culture.
Now that we’ve introduced it, people are used to it and it’s being used fairly heavily internally, which is fantastic. The goal now is typically hypothesis driven, going back to the proactive and reactive experiments we talked about. We’ll choose a service or a piece of infrastructure and say: our hypothesis is that if we increase latency on this database connection, it may cause latency to propagate down this particular call path for an API, let’s say. Then we’ll actually induce that latency, see what happens, and see if the actual outcome matches our hypothesis.
Typical hypothesis-driven experimentation. It works well; sometimes our hypothesis is correct, it only affected that API path and that was it. Sometimes it hits another API path we weren’t expecting, and we have to investigate why that happened.
Do we need to do more engineering work to separate those paths a little better? Where did they cross? Things like that. That goes back to the proactive experiments I mentioned earlier. The reactive experiments are not as much hypothesis driven; they are more reproduction-of-an-incident driven. We had an incident, we know most likely what caused it, and we come up with action items to fix it. What we do then is try to reproduce that same scenario.
The goal there is to validate those action items and make sure they actually work. Before we had our chaos engineering platform, there were incidents where we put action items in place and thought they would fix that particular type of incident, but we ended up having the same incident again later. Now, being able to reproduce that same scenario with the chaos platform, we can actually validate the action items and ensure it’s not going to happen again. That’s a huge win.
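A hypothesis-driven experiment like the one described above can be captured in a small declarative structure. The sketch below is hypothetical (the field names and thresholds are assumptions, not a real platform API); it just shows the shape: a hypothesis, an injected fault, a bounded expected blast radius, and a pass/fail check against that expectation.

```python
from dataclasses import dataclass, field

@dataclass
class LatencyHypothesisExperiment:
    hypothesis: str                       # what we expect to happen
    target: str                           # where the fault is injected
    added_latency_ms: int                 # fault parameter
    expected_impacted_paths: set = field(default_factory=set)
    max_extra_p99_ms: int = 100           # tolerated impact on the expected paths

    def evaluate(self, observed: dict) -> dict:
        """observed maps API path -> extra p99 latency (ms) measured during injection."""
        unexpected = {p for p, extra in observed.items()
                      if extra > 10 and p not in self.expected_impacted_paths}
        over_budget = {p for p in self.expected_impacted_paths
                       if observed.get(p, 0) > self.max_extra_p99_ms}
        return {"hypothesis_held": not unexpected and not over_budget,
                "unexpected_paths": unexpected,
                "paths_over_budget": over_budget}

# Example with made-up numbers: latency was expected to propagate only to /api/profile.
exp = LatencyHypothesisExperiment(
    hypothesis="200 ms added on the user-db connection only slows /api/profile",
    target="user-db-connection",
    added_latency_ms=200,
    expected_impacted_paths={"/api/profile"},
)
print(exp.evaluate({"/api/profile": 80, "/api/friends": 45}))
# /api/friends slowing down was not in the hypothesis -> investigate, as in the example above
```

A reactive, incident-reproduction experiment fits the same shape: the hypothesis becomes “with the action items in place, replaying the incident’s fault no longer breaches the thresholds.”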
Benjamin Wilms: Absolutely. And getting back to what I would call the expectation-driven way: if you start with your expectation, it’s much easier to run the experiment and then verify and validate whether your expectation was fulfilled, and if not, why not. Otherwise, if you just start with an experiment, and I have seen it many times where people simply select some attack out of a list, let’s say shutdown node.
What is the expectation? They just want to see that the node-shutdown attack has been executed. No, that’s not how you improve your system. You really want to see how your system reacts and, very importantly, what the impact is for your customer. That’s sometimes out of scope for some people, but everyone starts that journey somewhere.
So step by step. And talking about the journey, what is the typical flow? When you start an experiment, are you doing it from scratch, let’s say in a game day, and then at some point integrating it into your CI/CD system? Can you maybe describe this a little bit?
Tom Handal: Yeah, that’s a great question. We started out like any project, really: you have to crawl, walk, and run, so you have to start somewhere. We first started our chaos engineering program by slowly talking to certain teams and asking, hey, would you mind us running some chaos experiments on your service in production?
It started out as, I guess you can call it, a game day. We don’t really use that term exactly, but it’s essentially a game day where we choose a particular service and say, okay, today we’re going to focus on this service, let’s run some experiments.
Let’s inject some latency, inject some CPU stress, whatever, and see what happens. Teams are a little scared at first, nervous, but once they start doing it, it’s really interesting. You see their faces light up when they say, oh, look at that.
The latency is going up on our metric, that’s really cool. Because it’s not usually something they see unless there’s an incident happening, right? Gaining that control of injecting these types of failures gives them, I think, a sense of power maybe, I don’t know.
So it’s cool, and they get a little more interested in doing it. We’ve had quite a few engineers across the company become very interested in chaos engineering once we started introducing it, and now they want to do it on a daily basis.
But earlier we also talked about scale. When you’re dealing with dozens of teams, thousands of services, hundreds of thousands of machines, lots of moving parts, even if you had a few dozen engineers sitting there manually running experiments, you’re not going to make enough of a dent to really test things appropriately.
So that’s where we started implementing automation; automation is one of the most important things, in my opinion. Now, there are different forms of automation. You mentioned integrating into your continuous delivery system, right?
Benjamin Wilms: Mm-hmm.
Tom Handal: So we do have a continuous delivery system internally. It’s a homegrown system, and it deploys services to what we call “cells”. I’ll have to step back a little and explain what a cell is. Most people understand orchestration systems: there are different ones, like Nomad and Kubernetes, and others like Mesos.
With those, you basically create clusters of machines and they run your container orchestration. You have services, and the system uses bin-packing algorithms to deploy some number of containers of each service onto the worker nodes.
Approximately three and a half years ago, Roblox had an incident we call the Halloween outage; it happened during Halloween, which is why we call it that. At the time we had one large, monolithic orchestration system.
I won’t get into crazy detail about exactly what happened, but in a nutshell, that one monolithic orchestration system went down due to being overloaded. It happened before I started with the company, but I know a little bit about it; we were down for approximately two days.
Which is not good, obviously. That’s a lot of lost revenue and very upset players, which we definitely don’t want. That was a fundamental shift. Roblox was a very fast-growing company, and we still are, and we had been going through COVID around that time, so everybody was staying home, which caused a lot of people to jump online and play games, and the number of players was growing dramatically. The company was doing what it could just to keep things up and running.
There was a lot of work behind that, but the Halloween incident really spurred a large reliability push in the company. There were many things we did to increase reliability, and one of them was deciding to break up that monolith of an orchestration system into what we call cells.
Cells are basically smaller clusters of orchestration systems. The idea was to build them around particular fault domains, physical fault domains like power and network, and also logical fault domains: your services are spread out appropriately and evenly across the cells, so that if a cell goes down, you still have plenty of capacity to continue running all the services you need to. That was the cells approach. Having cells is also why we had to create a homegrown continuous delivery system, one that has knowledge of all those cells and all of our services, how much they need to be spread out, and that does the calculation of how many instances are actually needed per service, things like that. So we are working on integrating with that continuous delivery system so that we can have experiments run on a deployment.
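The capacity calculation mentioned above, enough instances spread across cells that losing a cell still leaves sufficient capacity, can be illustrated with a small sketch. This is a simplified, assumption-level illustration, not Roblox’s deployment system; it just computes how many instances per cell are needed so the surviving cells still cover the required total.

```python
import math

def instances_per_cell(required_instances: int, num_cells: int,
                       tolerated_cell_failures: int = 1) -> int:
    """Spread a service across cells so that losing `tolerated_cell_failures`
    cells still leaves at least `required_instances` running."""
    surviving_cells = num_cells - tolerated_cell_failures
    if surviving_cells < 1:
        raise ValueError("cannot tolerate losing every cell")
    return math.ceil(required_instances / surviving_cells)

# Example with hypothetical numbers: a service needs 120 instances for peak load,
# is deployed across 6 cells, and must tolerate the loss of any single cell.
per_cell = instances_per_cell(required_instances=120, num_cells=6)
print(per_cell, "instances per cell,", per_cell * 6, "deployed in total")  # 24 and 144
```

A cell-failure chaos experiment is then a direct check of this arithmetic: take one cell away and verify the remaining capacity really does carry the load.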
If you deploy a canary of your service, it would be great to run experiments on that canary to see if there’s any regression. Regression checking using chaos engineering is great because it uncovers things, since when people write tests, typically unit tests or integration tests, they’re usually testing the positive case.
Does it work? Can I talk to this machine over here? Can I talk to this service? Sure, everything works, thumbs up. What gets overlooked a lot is the negative test cases, and chaos engineering can really help there: maybe that threading change you made, or that extra mutex you added, makes things look great, but when you’re inducing latency it adds a lot more contention than you were expecting, things like that. So chaos engineering definitely helps a lot with those types of experiments, especially the negative test cases.
But yeah, automation. Another thing we automate is rack failures. Rack failures are another big type of chaos experiment we do internally, and that’s been very fruitful; we found many, many issues while doing rack failures. For those of...
Benjamin Wilms: For the audience: here the infrastructure itself is really failing. The hardware is failing, and that’s what you are doing with chaos experiments.
Tom Handal: Correct, yeah, exactly. So basically, we have data centers; I guess most people are familiar with data centers. Inside those data centers you have racks, and a rack holds a certain number of machines. Usually at the top of the rack you have what we call a ToR, a top-of-rack switch, which is basically the switch that provides the network for that rack of machines. What we did is integrate our chaos platform with the network engineering automation to disable the uplinks for that switch. The top-of-rack switch talks to redundant switches above it, what we call pod switches, which feed the network.
What this chaos experiment does is go in and shut off those uplinks, so the rack loses all network, which essentially shuts down the rack. That rack can have dozens of cell workers, it can have stateful services like a database or a caching system; pretty much anything can be running on it. And we’ve found so many issues with shutting down a rack, because we lose a certain number of services, we might be taking down a node of a database or a cache cluster, things of that nature. Running those types of experiments has been very fruitful. We found issues with systems that we thought were redundant across racks not actually being redundant across racks.
Our rack diversity assumptions were not correct. Rack diversity is when you say: I have a database with some number of nodes, and it might have R3 replication, where you replicate three copies on different machines across different racks, so that if you lose a rack, you still have two copies of the data.
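A rack-diversity assumption like the R3 example can be audited with a very small check before or after a rack-failure experiment. The sketch below is hypothetical; `replica_placements`, mapping each shard to the racks hosting its replicas, is assumed to come from whatever inventory or cluster metadata is available.

```python
from collections import defaultdict

def audit_rack_diversity(replica_placements: dict, replication_factor: int = 3) -> dict:
    """replica_placements: shard -> list of rack IDs where its replicas live.
    Flags shards that would lose more than one copy if a single rack went down."""
    violations = {}
    for shard, racks in replica_placements.items():
        per_rack = defaultdict(int)
        for rack in racks:
            per_rack[rack] += 1
        if max(per_rack.values()) > 1 or len(racks) < replication_factor:
            violations[shard] = dict(per_rack)
    return violations

# Example with made-up placements: shard-B has two of its three copies in rack-7,
# so a single rack failure would leave it with only one copy.
placements = {
    "shard-A": ["rack-1", "rack-4", "rack-9"],
    "shard-B": ["rack-7", "rack-7", "rack-2"],
}
print(audit_rack_diversity(placements))   # {'shard-B': {'rack-7': 2, 'rack-2': 1}}
```

The rack-failure experiments described in this episode are effectively the empirical version of this audit: instead of trusting the metadata, the rack actually goes away.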
We found quite a few issues with that, and we also found a lot of issues with what we call consistent hashing. Consistent hashing is when you have a stateless service, but certain traffic is routed to a particular instance because, even though it’s stateless, it does a lot of in-memory caching internally.
So we like to pin certain requests to that same stateless instance to increase performance. Another thing we found with rack failures is that the load balancers redirecting this traffic keep going to that same instance. When we shut off that rack, even though it’s a stateless service and another instance of the same type should take over, we found the load balancer was still trying to hit that particular instance on the downed rack, not realizing it was down, and it would just send 503s back to the caller instead of going to the next container. So we found those kinds of load balancer misconfigurations, timeout misconfigurations, retry misconfigurations, quite a few issues, through rack failures.
But running rack failures is tedious and time consuming, so we’ve built automation around our chaos system to run them, and we currently perform three rack failures per week in our production data centers. It’s been amazing. In the beginning, the first six months or so, we found issue after issue after issue. Now we’ve noticed we can take down pretty much any rack, and it’s actually rare that we have an issue.
So there have been huge improvements in our reliability, for both stateless and stateful systems, through that type of experimentation. It’s pretty cool.
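The load-balancer failure mode described here, a consistent-hash ring still pointing at an instance on a dead rack, comes down to whether unhealthy instances are removed from the ring before routing. Below is a minimal, hypothetical sketch of that idea (not Roblox’s load balancer): requests pin to an instance by hash, but the ring is rebuilt from healthy instances only, so traffic fails over instead of returning 503s.

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRouter:
    def __init__(self, instances: list[str], vnodes: int = 100):
        self.healthy = set(instances)
        self.vnodes = vnodes
        self._rebuild()

    def _rebuild(self):
        # The ring is built only from healthy instances, so a dead rack's
        # instances stop receiving traffic instead of producing 503s.
        self.ring = sorted((_hash(f"{inst}#{v}"), inst)
                           for inst in self.healthy for v in range(self.vnodes))

    def mark_down(self, instance: str):
        self.healthy.discard(instance)
        self._rebuild()

    def route(self, request_key: str) -> str:
        idx = bisect_right(self.ring, (_hash(request_key), chr(0x10FFFF)))
        return self.ring[idx % len(self.ring)][1]

router = ConsistentHashRouter(["inst-a", "inst-b", "inst-c"])
pinned = router.route("player-42")   # the same key keeps pinning to one instance
router.mark_down(pinned)             # simulate that instance's rack losing its uplinks
print(router.route("player-42"))     # traffic now fails over to another healthy instance
```

The misconfiguration found in the rack experiments is the case where `mark_down` effectively never happens: the router keeps an unhealthy instance in its ring and callers get 503s.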
Benjamin Wilms: How are you able to close the loop? If you are running those experiments, how are you working with the other teams to close the loop, to say: hey, this is what we have done, these are the experiments, these are the results, and now we need to improve. How are you doing that?
Tom Handal: Right, yeah, that’s a great question. Going back again to the beginning, because it’s a journey, it really takes a lot of planning and a lot of culture shift to do this properly. When we did our first few rack failures, it was very daunting, and there were a lot of very nervous people.
It took quite a few discussions and meetings to convince people that we should do this.
Benjamin Wilms: You told us you’re doing it in production.
Tom Handal: Correct. We did start the first few rack failure experiments in what we call our site test data center, which is more of a testing environment.
The problem with site test, or testing environments in general, is that you don’t have real production traffic going through them, and no matter how hard you try, you’re never going to get a testing environment to be exactly like your production environment. You can get it very close, but the traffic alone, trying to reproduce the same traffic patterns and the same load, is just extremely difficult.
So you end up having to do production-level testing; that’s where you’re really going to find the issues. Going back to your question of how we work with those teams: when we first started doing the experiments, any type of experiment, not just rack failures, we initially started by working with each team’s on-call.
For a rack failure, as an example, we wrote a tool that we can point at a rack, and it will scan the rack and tell you everything about it. What kind of machines are on there? What are those machines doing?
Are they cell workers? Are they part of a database? Are they part of a caching system? What team does each machine belong to? What services are running on it, and what teams do those services belong to? It gives you a digest of all that information, and then we can easily go out and contact each of those teams.
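A rack-scan digest like the one just described is essentially a join between a machine inventory and a service catalog. The sketch below is hypothetical (the data sources and field names are assumptions); it only shows how such a tool might summarize a rack into affected roles, services, owning teams, and on-calls to notify.

```python
from collections import defaultdict

def rack_digest(rack_id: str, inventory: list, service_catalog: dict) -> dict:
    """inventory: list of {"rack", "host", "role", "services": [...]} records.
    service_catalog: service name -> {"team": ..., "oncall": ...}."""
    digest = {"rack": rack_id, "hosts": [], "roles": defaultdict(int),
              "services": set(), "teams": set(), "oncalls": set()}
    for machine in inventory:
        if machine["rack"] != rack_id:
            continue
        digest["hosts"].append(machine["host"])
        digest["roles"][machine["role"]] += 1
        for svc in machine["services"]:
            digest["services"].add(svc)
            owner = service_catalog.get(svc)
            if owner:
                digest["teams"].add(owner["team"])
                digest["oncalls"].add(owner["oncall"])
    return digest

# Example with made-up inventory data for a single rack:
inventory = [
    {"rack": "r17", "host": "node-101", "role": "cell-worker", "services": ["matchmaking"]},
    {"rack": "r17", "host": "node-102", "role": "db-node", "services": ["user-db"]},
]
catalog = {"matchmaking": {"team": "game-infra", "oncall": "game-infra-oncall"},
           "user-db": {"team": "storage", "oncall": "storage-oncall"}}
print(rack_digest("r17", inventory, catalog))
```

With a digest like this in hand, notifying every affected on-call before the experiment, as described next, becomes a lookup rather than detective work.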
We could find the on-call for each of those teams and contact each one: okay, we’re going to run this experiment, and you’re going to lose this system you’re running on that rack. What we did initially was set up a meeting in a conference room and invite the on-calls for all of those different systems into the room.
So we would have a room filled with on-calls, basically, and we would run the experiment while everybody closely monitored their services and systems, and everybody was nervous. We had a big call with a whole bunch of people watching; it was a big deal.
Benjamin Wilms: Yeah, let me chime in there, because if you’re doing chaos engineering experiments, one of the core principles concerns the scenario where someone raises their hand and says: hey, we shouldn’t do that, because I already know we will fail. Have you considered that scenario?
Tom Handal: Yeah, that’s actually a great point as well. It’s funny, like we talked about with negative test cases, a lot of people don’t necessarily focus on those. Not because they don’t know what they’re doing; it’s just that inherently they’re thinking, I have to get this thing working, so let me get the positive case working so we can deploy it.
When we started talking about doing rack failures, there were a lot of cases where we’d say, okay, we’re going to fail this rack, we’re going to take down this machine of yours, and the response was: wait a second, that’s not going to work, we’re going to run into this problem if you do that. There were cases where people either already knew they wouldn’t be in good shape, or they would think about it and realize that if we did that, something bad might occur.
So yes, there were actually a lot of cases where we fixed issues without even having to run the experiment, just because thinking through what would happen during that type of experiment got the wheels turning, and people identified things that needed to be fixed before we ran it. Again, that’s part of the cultural shift: changing the thought process around failure scenarios that people might not think of if there weren’t a chaos engineering platform in place in the company.
So yeah, that’s a really great point.
Benjamin Wilms: Let’s zoom in a little more on the cultural aspect. We have talked a lot about the technology and what’s possible, and we are all tech people, we all love technology, but more is needed: the people, and some processes. If we talk about a system, the definition of a system is not only infrastructure or technology or services.
There are people, there is an organization, there are rules, and everything else. So the kind of experiments you are running right now, do you see them changing your culture, improving how you work, how you talk about outages and resilience?
Tom Handal: Changes in the culture, yes. We’ve already seen a cultural shift due to these experiments. When I first introduced the idea of doing chaos engineering, like I said, there was a lot of friction at first. It was a bit of: oh, you want to actually induce failures?
It took some convincing. But like I was mentioning, after running these experiments, after introducing rack failures and the latency injection experiments, we changed the culture by making the friction to run an experiment as low as possible.
Putting the tool in the hands of engineers, and making it part of our postmortem process: like I mentioned earlier, reproducing the issue so we can test the action items is part of our postmortem process now. I actually made that part of the process. We have a postmortem template, and the person or team responsible for a particular incident needs to fill out that postmortem document, describing what happened, the root cause, et cetera.
As part of that document, we added a few questions. Number one: could this incident have been detected using chaos engineering? And number two: can chaos engineering reproduce the incident, so that the action items can be tested? Doing things like that to integrate the chaos engineering platform into the culture has made a huge difference.
Now it’s more top of mind for a lot of engineers, which is really great. But it’s similar to security, for those who may have worked a little in the security field. There are certain things in our field where we’re always pushing to make progress; we’re always saying, we need to put out this product, we need to put out this feature, et cetera. Certain things can go by the wayside sometimes, and a big one is security. There was a period of time in the industry, I think, where security was thought of, but it wasn’t really top of mind.
I think we’ve made that shift over time, and companies are a lot more security minded now. We have an excellent security team, our InfoSec team here at Roblox. They’re always proactively involved, always making great suggestions and putting great guardrails around everything.
They do an excellent job, so we’ve made a lot of progress there. But back in the day, security was one of those things people didn’t put at the forefront, because you have to invest money into it, and is it worth the investment? I think we’ve all learned security is definitely worth the investment.
Now chaos engineering is similar: is it worth the investment? Should I put this many engineers on it? Should I spend the time developing it? Trying to convince people that it’s worth it is not easy, because one question I get asked once in a while is: well, have we prevented incidents? And it’s like, yes, I truly believe we have, but how do you prove that? How do you prove something that hasn’t happened? It’s kind of funny, I was playing around with some comics. I actually have them here, because why not; just for fun I generated some comics that I thought illustrate this a little, and I’ll show a couple of them if that’s okay. I’ll try to hold this up to the camera, hopefully it’s visible. Here’s one of them: basically a police officer with some empty cages, his ghost traps, and there’s nothing in them.
And he’s like, oh yeah, my ghost traps are working, there’s nothing in them, so there must not be any ghosts. That’s how chaos engineering success looks: suspiciously like nothing. Here’s another one that I really enjoy.
See, not a single raindrop, the umbrellas really work; but it’s a sunny day and there’s no rain. That’s another way of illustrating it: how do you convince people that chaos engineering is working? Did you prevent an incident?
Well, the incident didn’t happen, so sure, we prevented it.
Benjamin Wilms: It’s like sitting in a car that has an airbag: you are still a very happy customer, even if the airbag never goes off in front of your face.
Tom Handal: Exactly, that’s another great analogy. So how do I deal with that? It’s very difficult, and it’s something I still struggle with: how do we show the impact of the chaos engineering program? The main thing I’ve been doing is being very data driven.
One thing we do track: you asked earlier how we work with teams to fix the issues. If we find an issue during an experiment, we create a ticket, a fix-it or bug ticket, assign it to the team, and say, okay, here are the details of the experiment.
Here’s what happened, please investigate and fix. I label all of those tickets as chaos engineering tickets, meaning they resulted from a chaos engineering experiment, and I link to the experiment so we know what it was.
What kind of experiment was it, who ran it, things like that, just so we have some data around the action item and what needs to be fixed. Then I compile those and give them a priority level based on what I think the severity would be,
or what the engineer thinks the severity would be if it actually happened organically, as opposed to from a chaos experiment. So I keep track of the action items that come out of the experiments and use that as one piece of data: here’s the work we’re doing.
It’s still hard to convince people sometimes: did that prevent an incident? Would that ever actually happen organically? Maybe not, I don’t know. But if it did, we know it probably would have caused an incident, so it’s worth finding and worth fixing, especially with the rack failures.
When we started the rack failure campaign, one of the questions I got was: do we actually have rack failures? Again, I tried to use data to answer that question. I went to the network engineering team and asked, can you tell me how many rack failures we’ve had in the past year?
They gave me a number, and it was actually higher than even I was expecting. So I took that information back: yes, we have quite a few that happen organically, because switches fail, cables fail, power failures happen, things like that, which cause a rack to go down.
So, that was actually a really nice piece of data as well.
Benjamin Wilms: You are mentioning a very important point. Incidents and failures are part of any system, but what is the impact for our customers? Are they able to see it? You described a situation where even your engineers were not aware that there had been rack failures. That shows how confident they are in the system they are running on, how well it is designed: it can handle such an outage and no one is able to see it. That’s a perfect outcome. Let’s take a look at the future. Looking ahead, are there any trends you are seeing in the chaos engineering space that are on your radar?
Tom Handal: Yeah, that’s another great question. There are a few different things. First is the advent of AI; AI is now becoming a huge part of our work in the industry.
Our first move towards AI for chaos engineering: we have an internal communication system where people can ask the team questions, and we would sometimes get a lot of questions like, how do we run this type of experiment, how do I do this, what is chaos engineering, things like that.
So what we ended up doing is developing an AI chatbot; we actually had an intern start the project, and he did a great job on it. We have an internal AI chatbot that we wrote that explains chaos engineering, explains the different types of experiments, what you can do with them, what the point of each type is, and how to set up an experiment; it will actually explain the details. This AI chatbot is aware of all of our services and all of our infrastructure: we feed the model knowledge about our infrastructure and services, and it can check in real time how many services are running in which cell, what type of services they are, and things like that.
So it’s aware of our system in real time and can answer questions. The engineer might ask: I want to run a latency experiment on this service running in this cell, can you generate an experiment for me that does that? And it will actually generate the experiment for the engineer.
That was our first entry into using AI for chaos engineering, and it works pretty well now. People started using it, we saw some activity on it, which was pretty cool, and we still have people who jump on and use it periodically.
Integrating with our other systems has also been a way to reduce friction. We have a central UI system here at Roblox where we try to put a lot of our tooling and automation, so it’s kind of a one-stop shop for everything you need to do as an engineer at Roblox, and we integrated with that directly to reduce the friction of using the chaos platform. But going back to AI, another cool project we have going on at Roblox is a system that uses AI to analyze all of our metrics. We have what we call key metrics: things like player drop, like we talked about earlier, and for a lot of our most critical services we have metrics like success rate, latency, and error rate, a whole bunch of different metrics that the IMOC will pay attention to.
The IMOC, for those of you who may not know, is the incident manager on call. It’s an on-call rotation, but they’re the main person responsible for the frontline detection, I guess you could say, of any kind of issue. So if there’s a player drop, an alert goes off and it goes to the IMOC.
The IMOC’s responsibility is to figure out who they need to call in, and from which team, to root-cause and fix the actual issue. The IMOC will look at the key metrics, which should hopefully point them to the service or piece of infrastructure most likely causing the issue, and they can call in the appropriate people.
The idea is to decrease the mean time to detection as much as possible, because the faster you can detect, the faster you can remediate. So we started a program to use AI to scan those key metrics, I don’t know the exact frequency, but very frequently, something like every second or every couple of seconds. It looks at the key metrics and looks for any kind of anomaly; it knows what normal metrics look like, and it will automatically detect that something is going wrong. Then it will try to root-cause it automatically: the AI looks at all the metrics, looks at this service and that service, and it knows which services depend on each other.
It knows the call paths, basically. Then it creates a report for the IMOC that says: I noticed this problem, here’s the most likely cause, et cetera. That really reduces the mean time to detection and gets the ball rolling on remediation before the IMOC even gets to their laptop to start looking at the problem.
Because by the time the IMOC gets to their laptop, they already have a report from this system telling them what the issue may be, which reduces the time dramatically, and the faster we can remediate, the better. Sorry for the long-winded answer, but where I’m going with this is: now that we have this AI system that knows our system, knows what kind of anomalies and issues we might have, and is getting trained over time, my vision for the future is that we’ll come full circle and eventually have that AI be able to run the chaos engineering experiments, to...
Benjamin Wilms: As a validation step.
Tom Handal: Yeah, like as a validation step, but also reason about the issues that we see, and maybe come up with new types of attacks that we didn’t think of.
But yes, correct, it can definitely start with validation. That would be very valuable: it could look at a particular incident and come up on its own with an experiment that reproduces that scenario, which would be really cool. The next step beyond that is to use the reasoning capabilities to say: well, if that happens with this service, maybe it will also happen with this similar service.
And then it can kind of take over like Skynet and start injecting failures all over the place, and that’s where we really need to worry about guardrails, because you saw what happened...
Benjamin Wilms: For your engineers.
Tom Handal: Yeah, exactly, a lot of guardrails. So yeah, it’s a very exciting vision, I think, and a lot of really cool things can happen as we come full circle with artificial intelligence.
It’s exciting to see what happens with that, and I think the industry as a whole will get there. So it’s pretty neat.
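The metric-scanning idea described above, knowing what normal looks like and flagging deviations on key metrics such as player count, can be illustrated with a simple statistical baseline long before any AI is involved. The sketch below is a hypothetical rolling z-score check, an assumption-level illustration rather than the system discussed in the episode.

```python
from collections import deque
from statistics import mean, stdev

class KeyMetricMonitor:
    """Flags a key-metric sample that deviates sharply from its recent baseline."""
    def __init__(self, window: int = 120, z_threshold: float = 4.0):
        self.samples = deque(maxlen=window)   # recent "normal" samples
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.samples) >= 30:           # need enough history for a baseline
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                anomalous = True              # e.g. a sudden player drop
        if not anomalous:
            self.samples.append(value)        # only learn from normal-looking data
        return anomalous

monitor = KeyMetricMonitor()
for v in [10_050_000 + i * 500 for i in range(60)]:   # steady concurrent-player counts
    monitor.observe(v)
print(monitor.observe(7_800_000))   # True: a sharp drop stands out from the baseline
```

The same kind of check doubles as the in-flight guardrail mentioned earlier in the episode: if the monitor fires while an experiment is injecting a fault, the experiment is aborted.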
Benjamin Wilms: Awesome. This was a fantastic conversation; I learned a lot. When people would like to reach out to you, what is the best way to connect with you?
Tom Handal: Sure, I’m on LinkedIn, of course; you can find my profile there. I love hearing from people, so if anybody has questions or wants to reach out, feel free to contact me on LinkedIn. Also, we’re hiring here at Roblox. There are a lot of career opportunities, and Roblox is an excellent place to work.
I’m very proud of our team, the reliability team, which is the team I work on. I’m very lucky to be working with such dedicated and smart people; that really makes me proud to work at Roblox. And I’m really happy I was able to start and lead this chaos engineering program.
Roblox is very proactive about things like that: if you have an idea, something you would love to work on, they’re very open to it, which is really cool. So yeah, I love networking and communicating, so feel free to reach out.
Benjamin Wilms: Perfect then. Thank you very much for joining me on this episode and yeah, looking forward to hearing more from you.
Tom Handal: Yeah. Thank you so much for having me. It was, it was a lot of fun. Much appreciated.
Benjamin Wilms: Okay, bye bye.
Tom Handal: All right. Bye-Bye.