What are resilience expectations in organizations?

Most organizations understand that quality cannot be an afterthought. Resilience expectations refer to the proactive measures and standards that organizations set to ensure their systems and services can withstand disruptions and maintain performance.

How do different systems implement resilience?

Systems and services implement resilience through strategies such as redundancy, failover mechanisms, and regular testing of recovery procedures, all aimed at minimizing downtime and improving reliability.

Why is it important for organizations to focus on resilience?

Focusing on resilience is crucial because it helps organizations prepare for unexpected challenges, ensuring they can continue operations smoothly while safeguarding their resources and reputation.

What role does Steadybit play in building resilience?

Steadybit is on a path to help everyone build and maintain resilient systems by providing tools and methodologies that facilitate proactive testing, monitoring, and improvement of organizational resilience.

Can you explain the concept of quality in the context of resilience?

In the context of resilience, quality refers to the robustness and reliability of systems. Organizations must ensure that their services not only function as intended but also remain stable under stress or during crises.

How can organizations measure their resilience?

Organizations can measure their resilience through metrics such as recovery time objectives (RTO), recovery point objectives (RPO), and system uptime percentages. Regular stress tests complement these metrics by showing how well systems perform under pressure.
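To make these metrics concrete, here is a minimal Python sketch of how uptime and RTO/RPO compliance could be computed from outage records. The `Outage` record, its field names, and both helper functions are hypothetical and chosen for illustration only; they are not part of any Steadybit API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class Outage:
    """Hypothetical record of a single outage (illustrative field names)."""
    started: datetime           # when the service became unavailable
    recovered: datetime         # when the service was available again
    last_good_backup: datetime  # most recent restorable data before the outage


def uptime_percentage(outages: Iterable[Outage], window: timedelta) -> float:
    """Share of the observation window during which the service was up."""
    downtime = sum((o.recovered - o.started for o in outages), timedelta())
    return 100.0 * (1.0 - downtime / window)


def meets_objectives(outage: Outage, rto: timedelta, rpo: timedelta) -> bool:
    """True if recovery time stayed within the RTO and data loss within the RPO."""
    recovery_time = outage.recovered - outage.started
    data_loss_window = outage.started - outage.last_good_backup
    return recovery_time <= rto and data_loss_window <= rpo
```

For example, a single outage of 43.2 minutes in a 30-day window corresponds to roughly 99.9 % uptime.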


Declaring Resilience Expectations


28.02.2022 Ben Blackmore - 5 min read

Is chaos engineering for experts only? No! Learn how we imagine opening up chaos and resilience engineering to wider audiences through declared and reusable expectations!

Chaos engineering is typically centered around hypotheses, experiments, game days and more, with the common objective of improving the resilience of a system. Engineers, SREs and others improve resilience along various dimensions. For example, organizations may enhance how they deal with traffic spikes through better workload forecasts, improved system elasticity, operational handbooks, manual system reviews and game days. Well-behaved and well-performing systems are created through iteration, experience and well-defined expectations.

Today, we want to give you a sneak peek at one of our first steps towards what we consider a declarative approach to resilience expectations.

Resilience Expectations

Most organizations understand that quality cannot be an afterthought. They take actions such as shifting activities left so that quality is addressed earlier and by a larger audience with varying perspectives. Some organizations are starting to apply SRE practices such as error budgets to balance feature development against quality and resilience. On the far end of the spectrum, some companies practice continuous resilience to keep improving.
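As a rough illustration of the error-budget idea mentioned above, the following Python sketch computes how much of a budget is left for a given availability target. The function name and parameters are assumptions made for this example; real SRE tooling typically tracks this over rolling time windows.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means the budget is untouched, 0.0 means it is exhausted,
    and a negative value means the SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests that are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

With a 99.9 % target and one million requests, about 1,000 requests may fail. After 250 failures, roughly three quarters of the budget remain, which can serve as a signal that there is still room for feature work.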

One thing all organizations investing in resilience have in common is expectations: expectations that service-level agreements are upheld, that systems don’t crash during Black Friday, that service authors adhere to internal standards or external best practices, and many more. Developers need to be able to encode these expectations, evolve them with their services and optionally share and enforce them on a per-service or an organizational level.
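To give a feel for what encoding an expectation could look like, here is a minimal Python sketch. It is purely illustrative and not Steadybit's format or API: the `Expectation` class, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Mapping


@dataclass(frozen=True)
class Expectation:
    """A declared, reusable expectation about one observed metric (hypothetical model)."""
    name: str
    metric: str
    threshold: float
    comparison: str  # "max": value must not exceed threshold; "min": must not fall below

    def __post_init__(self) -> None:
        if self.comparison not in ("max", "min"):
            raise ValueError("comparison must be 'max' or 'min'")

    def holds(self, observed: Mapping[str, float]) -> bool:
        # A metric that was not observed counts as a failed expectation.
        if self.metric not in observed:
            return False
        value = observed[self.metric]
        if self.comparison == "max":
            return value <= self.threshold
        return value >= self.threshold


def evaluate(expectations: Iterable[Expectation], observed: Mapping[str, float]) -> Dict[str, bool]:
    """Check every expectation against the observed metrics."""
    return {e.name: e.holds(observed) for e in expectations}


# Expectations shared across the organization (hypothetical values) ...
ORG_EXPECTATIONS = [
    Expectation("availability SLA upheld", "availability_pct", 99.9, "min"),
]
# ... combined with expectations owned by a single service.
CHECKOUT_EXPECTATIONS = ORG_EXPECTATIONS + [
    Expectation("p99 latency during traffic spike", "p99_latency_ms", 500.0, "max"),
]
```

Because the expectations are plain data, they can be versioned next to the service, shared between teams, and evaluated by whatever experiment or monitoring setup produces the observations.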

Moving from experiments to expectations means that many complicated facets of chaos engineering can be avoided and that resilience engineering can be democratized.

Sharing Expectations

There are many kinds of systems and services implementing a variety of business and social needs. It would be imprudent to assume that all businesses’ needs and expectations regarding their resilience are the same. However, it would be equally imprudent to assume that there aren’t similarities across businesses, especially across service types!

Let us take a look at Kubernetes as an example. Organizations are leveraging Kubernetes in a multitude of ways. Some deploy a handful of services; some operate Kubernetes clusters in cars; some use Kubernetes to implement serverless functions; others operate at insane scale. Such organizations have varying approaches to the resilience of their Kubernetes clusters. However, there are similarities in the resilience expectations for deployed workloads, e.g., basic Kubernetes best practices (probes & limits), service discovery, decoupling of services, correct timeout configuration and much more.
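The "probes & limits" expectation is simple enough to sketch in code. The following Python function is an illustrative assumption, not an official Kubernetes or Steadybit check: it takes a Deployment manifest that has already been parsed into a dictionary and reports containers that lack liveness or readiness probes or CPU and memory limits. The field names (`livenessProbe`, `readinessProbe`, `resources.limits`) are those of the Kubernetes Deployment spec; the function name and message format are made up for this example.

```python
from typing import Any, Dict, List


def missing_best_practices(deployment: Dict[str, Any]) -> List[str]:
    """List basic probe and limit settings that containers in a Deployment lack."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    findings: List[str] = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        # Probes let Kubernetes detect unhealthy or not-yet-ready containers.
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                findings.append(f"{name}: missing {probe}")
        # Limits keep one workload from starving its neighbours.
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                findings.append(f"{name}: missing {resource} limit")
    return findings
```

A check like this could be shared as-is between teams, since it only depends on the manifest and not on what the service does. Whether every workload should set a CPU limit is a judgment call, so a shared expectation would likely need to be adjustable per organization.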

Imagine you are just starting with Kubernetes. Wouldn’t it be great to start adopting resilience expectations from experienced Kubernetes users? What if you could get a set of these directly from the Kubernetes authors or even Google?

Knowing what you want (resilience) is one thing; being able to describe it through requirements, experiments, and operational manifests is something else entirely. Why not learn from the experts directly?

Outlook

Steadybit is on a path to help everyone build and maintain resilient systems. There are many steps ahead of us to achieve our mission. If the above sounds relevant, we would love to hear from you to discuss your needs and what we have in mind. You can contact us via our homepage.
