Talks & Chats

Episode 4: Enabling Reliability in the Cloud with Carlos Rojas

Experiments in Chaos • Jul 2025

In this episode, Benjamin Wilms chats with Carlos Rojas, author of “Reliability Engineering in the Cloud”, about DevOps vs. platform engineering, the roles and responsibilities related to reliability, and the trends that will be making an impact for organizations in the next six months.

You can connect with Carlos Rojas on LinkedIn, and order your copy of his book, “Reliability Engineering in the Cloud”.

Episode Transcript

Benjamin Wilms: Welcome to my new episode. In the past three episodes, we have been exploring what it really takes to build resilient systems.

With Adrian Hornsby, we looked at why it’s often difficult to get started with chaos engineering, even when its value is clear. Russ Miles then helped us unpack the social and cultural foundation that makes chaos and reliability practices truly effective. And most recently, Casey Rosenthal challenged the status quo, pointing out that many industries’ best practices are still reactive and making the case for a more proactive approach through continuous verification. A common thread has run through all of these conversations: resilient systems are sociotechnical, shaped just as much by people and processes as they are by technology.

And today we continue that thread with Carlos. We will explore the world of platform engineering and how infrastructure teams can enable reliability without getting in the way of developer velocity. And here we go. Very nice to have you in my new episode.

Welcome, Carlos. Maybe you can do a short intro about yourself.

Carlos Rojas: Sounds good, Ben. Thanks for having me. A real honor to be here. I appreciate the invitation. I’m a tech executive and a transformational leader with a track record of driving platform engineering, Gen AI, and operational excellence at Fortune 100 companies. Recently, you know, I’m very proud of this, I wrote a book called “Reliability Engineering in the Cloud”, and the whole purpose is to help organizations improve their systems’ resilience and stability, very much aligned with the podcast today.

So, in terms of some of my performance, you know, while I was at AWS for example, I built platforms to drive cloud infrastructure expansion. I owned the platform to orchestrate, build, and validate the creation of new regions and new availability zones for AWS globally.

From a FinTech, financial tech, point of view, I have also been enabling organizations to scale in highly, highly regulated environments with automation versus more and more humans. Right? On a more personal note, I’m an engineer who loves, loves, loves solving problems. I love playing soccer. I race motorcycles at the racetrack, and I love spending time with my two kids.

Benjamin Wilms: Thank you very much for this intro, and now it’s clear why you’re on this episode. I was able to follow your work on LinkedIn and at AWS re:Invent. You are also someone who wants to share what you have learned and achieved with others, and it’s really very nice to see what you are able to do.

Carlos Rojas: Absolutely, and it’s one of the reasons why I actually wrote that book. I was trying to figure out: how do I give back? How do I make sure that other companies, or individuals who are early in their careers, can also benefit from some of the trying moments and some of the success stories throughout my career? So I highly encourage, you know, everyone to give back and keep the elevator door open for the next generation of technologists. The way we progress is basically by helping those next engineers in their careers.

Benjamin Wilms: Nice, and maybe that’s a good opener for asking the first question, because I’m curious about your personal way into platform engineering. I mean, we are all still aware of DevOps and what it means to be successful in that space. But what was your journey into the platform engineering space?

Carlos Rojas: Well, let’s, let’s start by defining what platform engineering or a platform team is, right?

In my opinion, it’s a specialized engineering organization that builds and maintains what’s called the shared infrastructure or the shared tools or services, right, that enable development teams in a company to basically work more efficiently and effectively.

That’s the nature of it, you know: when I get into any platform engineering team, we go after, like, how do we make these development teams more effective, more efficient? Right?

For me, in the early 2000s as an entrepreneur, I built a clinical trial management system to track investigations into new drugs at the National Institutes of Health.

COVID-19 is one example; that was one of the specific protocol investigations that went through that system. This platform was used across the globe by scientists to investigate and develop these new drugs, and it was basically something where collaboration was the genesis of why we needed that platform back in the day.

With my team, I designed, built, and deployed the infrastructure, the applications, the backend APIs, and the integration points with, you know, organizations like the US Congress. And also, because we had some regulatory requirements, we had to publish the results of these investigations in the National Library of Medicine.

So, making sure that all development teams and health professionals could collaborate to improve findings associated with clinical studies. That was my beginning of, you know, how do we create a platform? How do we have an engineering team that enables other dev teams and even scientists to collaborate across the globe? So that’s how we got into platform engineering, to be honest with you.

Benjamin Wilms: Yeah, nice, are there any special key moments you can remember from early days?

Carlos Rojas: So I, I’ll share maybe one that was very interesting for me as well, which is, you know, while I was at AWS, one of the key principles was to build services with scalability, reliability, security, all of the “ilities”, right?

That was something that had to be done from the get-go. It was not an afterthought. It was something where, when you’re building a service, that has to come in as part of the DNA. For many companies, up until that point in my career, I realized that it was like, yes, we have to do that, it’s part of the architecture, and it was best intentions, meaning they tried, they talked about it,

but I also noticed that while I was at AWS, this was a tenet, basically a principle that was in their DNA. It was part of the engineering community, right? So for example, knowing that a service you own is responsible for managing and reporting on the global infrastructure of the cloud, and that millions of users can access that particular service that you own in different regions across the globe, it just forces you to build software that scales, right?

I was very impressed because AWS takes that very seriously. They have principal engineers, for example, who are just dedicated to ensuring the designs are reviewed and approved, and that they meet these expectations as part of every single service, right? So I owned a service called Regional Information Provider.

And in fact, it’s the one service available nowadays where, if you wanna know what’s available in the global infrastructure of AWS, you can just go and access it. There’s a CLI, you know, there are different access points where you can just check what’s available.

So knowing that every minor change needs to be approved and reviewed, and that the team and your PEs and your leaders were focusing on this conversation about is this scalable, is this reliable, is this secure, do we have the right architecture in place? To me, that was a key moment that helped shape my thinking about engineering in general.

It was a great learning experience for me.

Benjamin Wilms: Yeah, this is a very important point because most companies today are still more in a reactive mode. They are building a system sometimes on top of hope, but it’s not part of the DNA. So there’s really a mindset shift needed. What was the secret sauce inside of AWS or other companies where this is and was part of the DNA?

Carlos Rojas: I’ll give you my personal point of view. Traditional DevOps teams were very much process-focused. They thought about manual operations. They were very much into infrastructure maintenance and, and they wanted to treat infrastructure as a service.

But when I think about the evolution to platform engineering, platform teams, they don’t focus on the process. They focus on being a product organization. They treat their platform with a product-focused approach.

Benjamin Wilms: As an internal product in their organization, which will enable other teams to build products for their customers.

Carlos Rojas: Exactly, so they move away from being a process orientation to a product orientation, right? So instead of our enterprise, centralized horizontal team doing it, how do we create a self-service capability?

How do we just go from, oh, I’m gonna maintain your infrastructure so that you can do the software? They move away from that and more into, hey, how do I enable the development teams to continue to do this themselves, maybe as infrastructure as code? That way there’s not that dependency on that centralized platform team.

Right? I think the net-net is that platform teams now are essentially acting as a force multiplier, where their job, their mission, their ultimate goal is: how do I enable multiple application teams to move faster, obviously, while maintaining the quality, security, and operational excellence standards that are critical nowadays.

Benjamin Wilms: From my experience, I will be quite direct: developers, engineers, they don’t care about reliability or security. They are more measured on features and how fast they can get a new feature into production. So nowadays, with platform engineering teams, are they responsible for security and reliability, to be more like the enablers for other engineering teams?

Carlos Rojas: I wanna say that platform teams are responsible for, for two major things. One is the reliability of that platform, but also the second one is enabling application teams to leverage that platform for the business.

There is this shared responsibility model that needs to be in place: one part that describes how the platform team will be responsible for the wellbeing of that platform, for the organization, and for all of the application teams, right? And then one part that says, “Hey, the application teams are the owners of their specific domain.”

Okay? And we have to make sure that this differentiation and clarity is crystal clear. Because it’s not just: you own A, I own B. The platform team owns this, and then the application teams own that. There’s also an intersection point where there is an overlap and collaboration is needed. So I’ll give you some examples to make it more tactical and real.

From a platform code point of view, the platform team owns that particular aspect of the platform. Any code related to the platform is owned by the platform team, but then there are some specific code sections and features and services that are put on top of the platform, and that is basically domain-specific.

In that case, the application teams own that part of the code. And this is very relevant because, for example, when there is an incident, who’s gonna be responsible for that incident? Is it the platform team or is it the application team? And, and just having that clarity of those boundaries where one team plays versus the other is very critical.

I’ll give you another example: platform-specific SLOs. How do you measure the availability of your platform? Is it five nines, four nines, three nines of the platform? Well, that’s the platform team. But then there are gonna be some domain-specific indicators, SLIs. For example, for a financial institution, you might want to know your error rates on a loan or a credit card approval.

That is not part of the platform engineering team. That’s part of the actual application team, because that’s domain-specific. It’s part of the credit card process or the loan process, and it’s part of a specific service or feature that is supporting that effort. Right? So I think the responsibility model, the shared responsibility model, is something that organizations planning to invest in platform engineering need to have very, very clear, ’cause if you don’t have it, there’s an impact for the company. And most companies are thinking, well, we’re gonna have an impact on uptime. Well, it’s not about that. It’s about enabling velocity, trust, business growth, because those are the things that get impacted when you’re not thinking about the technology side.

At the end of the day, if your business doesn’t close on that loan or doesn’t approve that credit card, you’re losing revenue. You’re not growing as a business. So I think it’s also part of the evolution: before, DevOps teams were very much focusing on the technical aspect of it. Whereas now platform engineers are saying, hey, this is the shared responsibility model; if we don’t do this part of reliability, guess what, you’re gonna have an impact on potentially these critical services for the organization.
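To make the platform-SLO versus domain-SLI split Carlos describes a bit more concrete, here is a minimal illustrative sketch in Python. Nothing below comes from the episode itself: the metric names, targets, and numbers are assumptions chosen only to show which team would own which definition.

```python
# Illustrative only: names, targets, and numbers are assumptions,
# not taken from the conversation.

from dataclasses import dataclass


@dataclass
class SLO:
    name: str
    target: float       # e.g. 0.999 ~ "three nines" availability
    window_days: int     # rolling evaluation window


# Owned by the platform team: availability of the shared platform itself.
platform_availability = SLO(name="platform-availability", target=0.999, window_days=30)


def error_rate(failed: int, total: int) -> float:
    """Domain-specific SLI owned by the application team,
    e.g. failed loan or credit-card approvals over total attempts."""
    return failed / total if total else 0.0


# Hypothetical numbers: 12 failed approvals out of 4,000 attempts -> 0.3%.
loan_approval_error_rate = error_rate(failed=12, total=4_000)

# The threshold for this SLI is set by the application team, not the platform team.
loan_slo_breached = loan_approval_error_rate > 0.005
```

The point is only the ownership boundary: the first object belongs to the platform team, while the error-rate SLI and its threshold belong to the application team that owns the loan or credit-card domain.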

Benjamin Wilms: Yeah, maybe thinking about products will help people define the ownership. So if it’s part of the platform product, the responsibility sits with the platform team. If it’s really a part of the external product for a customer, then the responsibility is in the hands of the product engineering teams.

But then again, everything is getting into this world of the socio-technical system. I mean, we can talk about technologies, we can talk about the code, the pipeline, the process, but the alignment between those teams is very important. Otherwise it becomes “it’s not my job, it’s the job of the platform team.”

And then you’re getting endless discussions, and meanwhile, your customer is going to your competitor. So that’s very important.

Maybe, we can talk about your personal vision or maybe also how you think platform engineering will evolve over time.

Carlos Rojas: I’ll tell you how I see it today. When you think about failure testing, for example, chaos engineering, right? The platform teams must enable failure testing or chaos engineering with a tool in the design; they need to help coordinate, execute, and partner with application teams to basically advance those stress-testing capabilities to improve their systems, right? And I see a lot of companies trying to say, “Hmm, how do we introduce failure in the system now, before our customers know about it?”

And this isn’t about creating problems, right? It’s about building confidence through a systematic validation of your system’s resilience and reliability. I think the most mature teams and companies now are using chaos engineering as a feedback-loop mechanism to validate their reliability, and to validate that their investments and their guides and their architectural decisions are right, and those companies are creating that competitive advantage.

Do I wanna wait until my customer finds out that there is a problem in the system?

Or do I wanna actually stress test my system to capture those learnings before my customer knows about it, so I can fix it ahead of time in a more proactive way? So the one thing that I see changing in the industry is moving from a more reactive mode into a more proactive mode, right?

And I think that for those who master chaos engineering, the next logical thing for them will be: how do I invest in an AI-powered reliability engineering approach? And what I mean by that is at least two things to keep in mind. One is, how do I use predictive capabilities?

Benjamin Wilms: Mm-hmm.

Carlos Rojas: And the second one is, how do I create like intelligent self-healing type of automation?

For predictive capabilities, most people know anomaly detection using ML models, and capacity forecasting, because right now you can plan better based on historical data and on seasonal patterns. And then there’s one that is very close to my heart, which is failure prediction. How do I anticipate a hardware failure so that I can preemptively do maintenance?

How do I anticipate that a deployment I’m about to execute is going to fail, and with what level of confidence? And then, based on history, you can kind of predict even what the theme is, or what the root cause is, or which application or service is probably gonna fail before it happens, right?

So that predictive capability is something that I think most companies are gonna be investing in. Once you have the predictive side of the house, you could also think about, Hmm, how do I dynamically optimize my systems based on performance patterns? How do I use AI to figure out the behavior of my application?

Or how do I identify the patterns of my users’ behaviors so that I know exactly how to optimize my systems? Right? I’ll give you an example: QA infrastructure is not used between 1:00 AM and 5:00 AM for most companies, depending upon location. What if you shut those down and enable them back in the morning? You’re saving a lot of money for your company, so that creates a dynamic optimization of your infrastructure, but at the same time, it doesn’t impact your application development teams. This is not applicable for every company, by the way, but in many cases, you can run with a portion of your QA environment versus having it all available all the time.

I’ll give you another example, which is my favorite one, which is automated remediation, and this is something that I have implemented in a FinTech organization. It’s self-healing of systems, okay, that is based on recurring problems that we have learned from in the past.
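Before Carlos gets to the remediation examples, here is a rough sketch of the scheduled QA shutdown he just described. It assumes the QA hosts are EC2 instances tagged Environment=qa and uses boto3; the tag, region, and schedule are illustrative assumptions, not details from the episode, and the same idea applies to whatever your QA environment actually runs on.

```python
# Hypothetical sketch: assumes QA hosts are EC2 instances tagged Environment=qa.
# The tag, region, and schedule are illustrative, not from the episode.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")


def qa_instance_ids() -> list[str]:
    """Find QA instances by tag."""
    ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "tag:Environment", "Values": ["qa"]}]
    ):
        for reservation in page["Reservations"]:
            ids.extend(i["InstanceId"] for i in reservation["Instances"])
    return ids


def stop_qa() -> None:
    """Run from a scheduler (cron, EventBridge, etc.) at ~1:00 AM."""
    ids = qa_instance_ids()
    if ids:
        ec2.stop_instances(InstanceIds=ids)


def start_qa() -> None:
    """Run at ~5:00 AM so developers find the environment back up."""
    ids = qa_instance_ids()
    if ids:
        ec2.start_instances(InstanceIds=ids)
```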

For example, a host needs to be restarted. A deployment needs to be rolled back. I have an issue in production and I need to do a failover from one region to another region.

All of those things are things that you can write code to execute in an automated fashion, versus having your application team wake up in the middle of the night at 2:00 AM saying, go and run this script, and do it manually one by one, and probably make a mistake, because we all have been there. And the idea will be: what if I tie an alert to a runbook? The specific example that I worked on and recently implemented is, there was an alert, you know, a mainframe went down, for example. And most of us know that figuring out and fixing a mainframe might take about four hours before it actually comes back up.

What are you gonna do? Are you gonna show your teams and your applications and your customers that you’re down for four hours? No, absolutely not. Right. So what we ended up doing is we automatically created an incident record. We automatically created a Slack channel for teams to collaborate.

We automatically created a Zoom bridge, to be able to have teams come in and collaborate. We notified application teams, regulators, and our vendors automatically, just based off of that alert that we know is a real alert. It’s not a false positive, it’s a real alert. And then we took it to the next level, which was that we automated the response in real time.

How do we move from a regular uptime-or-downtime mode for my mainframe to a high-availability mode? That means creating an abstraction layer to say: my mainframe is down, I’m gonna go in and read from a copy, a replica, and it’s read-only. My customers are not gonna be able to see the deposits right away, but they will see all of their other transactions up until when that mainframe went down.

So I’m now isolating the problem. I’m giving partial functionality to my teams, and guess what? All of that is done automatically in a self-healing capability, where your alert triggers all of those steps. At the end of the day, you have your app team coming to a bridge, not to try to remediate the problem.

They’re coming in where the system has already applied some sort of self-healing capability. They come in to figure out what the root cause of the issue was and how to go back and mitigate it completely, fix it completely for the future. So I think that we’re gonna get to a point where self-healing, maybe creating SRE-type agents, is going to be ideal.

I know the pattern, I know the solution. Let’s not have humans basically address it right away, or at least let’s assist the humans as much as we can to fix as much as we can while my customer is not perceiving a downtime. Then my engineering teams can come in and help mitigate the rest of the incident or help basically fix the root cause, right?

So there’s a lot of advantages in there, from productivity, efficiency, et cetera, that these platform teams can deliver.
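The alert-to-runbook pattern Carlos walks through might be wired up roughly like the sketch below. Every step is a placeholder standing in for a real integration (the incident tool, Slack, Zoom, notifications, and the read-replica switch), so treat it as an outline of the flow he describes rather than the implementation he actually shipped.

```python
# Hypothetical outline of an alert-driven runbook; each helper is a
# placeholder for a real integration (ITSM, Slack, Zoom, routing layer).

from dataclasses import dataclass


@dataclass
class Alert:
    source: str      # e.g. "mainframe-core"
    severity: str    # e.g. "critical"
    verified: bool   # True only after de-duplication / false-positive checks


def open_incident(alert: Alert) -> str:
    """Create an incident record in the ticketing system; return its id."""
    print(f"[incident] opened for {alert.source}")
    return "INC-0001"


def open_collaboration_channels(incident_id: str) -> None:
    """Create the Slack channel and the Zoom bridge for responders."""
    print(f"[collab] Slack channel and Zoom bridge created for {incident_id}")


def notify_stakeholders(incident_id: str) -> None:
    """Notify application teams, vendors, and regulators as required."""
    print(f"[notify] stakeholders informed about {incident_id}")


def switch_to_read_replica(source: str) -> None:
    """Point reads at a replica so customers keep partial (read-only)
    functionality while the primary system is down."""
    print(f"[failover] {source} traffic now served read-only from replica")


def handle_alert(alert: Alert) -> None:
    """Self-healing entry point: mitigate first, let humans do root cause later."""
    if not (alert.verified and alert.severity == "critical"):
        return  # ignore false positives and low-severity noise
    incident_id = open_incident(alert)
    open_collaboration_channels(incident_id)
    notify_stakeholders(incident_id)
    if alert.source == "mainframe-core":
        switch_to_read_replica(alert.source)


handle_alert(Alert(source="mainframe-core", severity="critical", verified=True))
```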

Benjamin Wilms: The central component is that your customer needs to be able to interact with your system in a way that is still good enough that they are getting something out of it. Maybe, as you mentioned, it’s just reading some messages or whatever, but they cannot purchase a new product. And then you also mentioned self-healing systems. The definition of healing is maybe quite tough, but what I can imagine is that systems are able to handle any kind of outage: they can shift the load into another region, or spin up new instances that can handle a specific load right now because it’s urgently needed. But they are designed in a way that, even if all the stuff is going wrong in the background, our customer is not affected in a bad way.

That’s the end goal. I personally believe that we will never, ever be able to build a system without any failures, so we need to bring up a mindset that failure is part of the system and it’s not something bad. We need to learn how the system reacts under any conditions, also the bad ones.

Carlos Rojas: I’ll tell you something. There is a quote that I reference in every one of my keynote speaking engagements. It’s from Dr. Werner Vogels, the CTO of Amazon.com: everything fails, all the time. It’s not if it’s gonna fail, but when and how. The more important part of it is: how do you prepare in advance for that so that you avoid customer impact, right?

There’s different multi-region, multi-availability zone designs, different architectural designs that will help you have the right resilience to withstand any issue. The evolution is telling us it is possible that you fail, but your customers don’t feel the impact, and the best way to do that is by automating your processes.

By automating your remediation steps. With many companies, based on history and based on my experience when I engage with them, I have seen that they have the same top three to five things that happen all the time. It’s the same top five. And if you address those five things in an automated fashion, you can reduce your incidents by 60 or 70%, because many applications fail from the same symptoms or the same root causes. That’s why, when I was giving you the example, Ben: how many times is a failover going to fix a problem? Many!

Benjamin Wilms: Hmm.

Carlos Rojas: If you use the right blue-green deployment strategy and one of those stacks fails, but you still have both of those stacks, then you just point to the right stack, the one that is not failing.

And it’s the difference between having a two-minute, maybe three-second, delay in an issue, or a 45-minute delay because you have to go back and rebuild everything, right? So it’s a matter of figuring out what my top five things are that I need to fix, because every single one of my thousands and thousands of applications is going through it.

Then adding the automation to it. Then adding the chaos engineering testing to make sure that we cannot break it anymore, because we fixed that.

Benjamin Wilms: Exactly.

Carlos Rojas: Right? And then saying, how do I add an agent that will be the one responsible for auto-healing? And the auto-healing is nothing more than: I get the alert, I know something’s wrong, I do a check and balance, and then I have to execute one of these five options. And you don’t need a human to do that right away. You need the human to come in and say, okay, now that we have mitigated, how do we really, really, really prevent this from happening in the future? And that’s a root cause analysis.

So all of this is a play to create efficiencies for the application teams, but the strategy, the roadmap, the implementation, the testing, the tooling, it’s all created by your platform team, your engineering team, so that everyone in the company can run with it.
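And the “just point to the healthy stack” step from the blue-green example a moment ago can be as small as repointing a load balancer. The sketch below assumes the two stacks sit behind an AWS Application Load Balancer and uses boto3; the ARNs are placeholders, and if your stacks are fronted by DNS or a service mesh instead, the switch would happen there.

```python
# Hypothetical sketch: fail over from the "blue" stack to the "green" stack
# by changing which target group the load balancer listener forwards to.
# The ARNs below are placeholders, not real resources.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/example/..."
GREEN_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/green/..."


def point_traffic_at_green() -> None:
    """Send all traffic to the green stack, e.g. when blue starts failing."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[
            {"Type": "forward", "TargetGroupArn": GREEN_TARGET_GROUP_ARN}
        ],
    )


# In practice this would be triggered by the same kind of verified alert
# as the runbook sketch above, not run by a human at 2:00 AM.
```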

Benjamin Wilms: Well said. Two last questions. Question number one is from your point of view, what should teams stop doing and what should they start doing?

Carlos Rojas: Let’s start with what they need to start doing, okay? Reliability platform engineering is no longer a separate operational concern, like back in the 2000s when it was, oh, you’ve got the Ops guys over there, you know, they’ll make sure that it works at 2:00 AM.

Now it’s a fundamental capability for delivering on the promise of developer productivity and business agility. Right? Organizations that treat reliability platform engineering as a first-class feature rather than an afterthought will turn it into a competitive advantage for the organization rather than an operational burden.

Organizations need to start thinking about platform reliability in the same sense as, “Hey, I’m gonna build these features, but I also have to build this other automation to maintain the reliability of the system.”

Benjamin Wilms: Yes.

Carlos Rojas: It cannot be an afterthought. It cannot be a separate team. It should be part of the same organization.

Okay. In terms of what I think companies or organizations should stop doing: I have seen many organizations trying to scale reliability with humans. I don’t know if you remember, Ben, back in the day it was like, oh, we need to support 10 more applications in your system, and you have thousands of them. I mean, like, well, for 10 new applications, I need five more people to be able to have three shifts, you know, a “follow the sun” type of model, and that’s gonna be, I don’t know, X hundred thousand dollars to be able to maintain that.

So basically, scaling with humans was the way of doing it in the past, and that created these big, big operations organizations, where you were wondering, what do they do behind that closed door? I would say, let’s stop scaling with humans. Let’s start scaling with automation and automation practices.

And I’m not saying let’s cut the humans at all. All I’m saying is let’s have the right engineering mindset to be able to solve problems in an automated fashion.

Benjamin Wilms: And, scale where it’s really needed.

Carlos Rojas: Exactly, exactly. I’ll give you another point of view. Many, many, many companies used to claim victory based on the number of incidents. Oh, we went from a hundred incidents to 60 incidents, or we chased 40% of, you know, the instability in the organization, instead of counting impact, customers impacted, right?

If you are in one of those companies that is only counting incidents: I’m not saying get rid of counting incidents, I’m saying don’t use that as your primary metric. I’d rather have a hundred incidents impacting one customer than one incident impacting a hundred million customers. So it’s all perspective.

I will say focus on the impact that you’re causing your customers, because at the end of the day, we’re doing all of this to create an amazing experience for our customers.

Benjamin Wilms: And also, one of our customers told me, “Hey Benjamin, since we’ve been doing chaos engineering with Steadybit, the number of incidents has stayed at the same level for three years. Meanwhile, our complexity and the system have scaled by 300%. That is a success for us.”

Carlos Rojas: Mm-hmm. Along those same lines, Ben, everything we’ve been talking about doesn’t happen today with the output showing up tomorrow. You don’t see the business output or the stability a day or a month later in your organization when you have thousands of applications and thousands of application teams.

Right? There is this J-curve effect. So it’s important to keep in mind that, depending on the size of your organization, the number of applications you have, and the maturity of your engineering teams, this can in some cases be a journey of six months to two years before you see the actual results, right?

So the longer a company waits to invest in some of these things, the longer it will take, and the more risk they will introduce of impacting their customers. So counting incidents is okay; it shouldn’t be the ultimate metric that defines success. Right. That would be my suggestion for those who want to invest in these types of, you know, platform reliability engineering teams.

Benjamin Wilms: Now we are at the end of our episode, and if people want to get in contact with you, what is the best way to, to reach out to you?

Carlos Rojas: LinkedIn is the easiest way. I think we can share my profile as part of this podcast. I’m happy to engage with any organization to help you craft your strategy, to improve your teams’ and organization’s reliability, stability, resilience, etc.

Some of these strategies that have worked or not, some of these frameworks that I’ve basically been sharing here, are things that I have learned building software, but I have also documented them in my book, so feel free to just take a look at the book as well.

There are lots of study guides in there for those companies that wanna start investing in this. Okay?

Benjamin Wilms: Yeah. We can also link the book in our section, in the blog post as well. And then, thank you very much for joining me for this episode. It was a pleasure. I learned a lot, and I hope to see you soon again.

Carlos Rojas: Sounds like a plan. Ben, thank you so much for having me. Fantastic day. Okay, chao chao.

Benjamin Wilms: Chao chao.