Is chaos engineering for experts only? No! Learn how we imagine opening up chaos and resilience engineering to wider audiences through declared and reusable expectations!
Chaos engineering is typically centered around hypotheses, experiments and game days, all with the common objective of improving the resilience of a system. Engineers, SREs and others improve resilience along various dimensions. For example, organizations may enhance how they deal with traffic spikes through better workload forecasts, improved system elasticity, operational handbooks, manual system reviews and game days. Well-behaved, well-performing systems are created through iteration, experience and well-defined expectations.
Today, we want to give you a sneak peek at one of our first steps towards what we consider a declarative approach to resilience expectations.
Most organizations understand that quality cannot be an afterthought. They take actions such as shifting activities left so that these are addressed earlier and by a larger audience with varying perspectives. Some organizations are starting to apply SRE practices such as error budgets to find trade-offs between feature development, quality and resilience, and to get signals for prioritizing them. At the far end of the spectrum, some companies practice continuous resilience to keep improving.
One thing all organizations investing in resilience have in common is expectations: expectations that service-level agreements are upheld, that systems don’t crash during Black Friday, that service authors adhere to internal standards or external best practices, and many more. Developers need to be able to encode these expectations, evolve them with their services and optionally share and enforce them on a per-service or an organizational level.
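To make the idea of encoding, sharing and enforcing expectations a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Expectation` type, its fields, the expectation names and the merge rule are hypothetical and do not describe any existing product or API. It only shows one way organizational expectations could be combined with per-service ones, where an enforced organizational expectation cannot be overridden.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    """A hypothetical, declarative resilience expectation."""
    name: str
    target: float           # meaning depends on the expectation, e.g. percent or seconds
    enforced: bool = False   # enforced expectations cannot be overridden per service


def effective_expectations(org, service):
    """Combine organization-level and service-level expectations by name."""
    merged = {e.name: e for e in org}
    for e in service:
        base = merged.get(e.name)
        if base is not None and base.enforced:
            continue  # the organization insists on its own value
        merged[e.name] = e
    return merged


# Illustrative values only.
org = [
    Expectation("availability-percent", 99.9, enforced=True),
    Expectation("max-recovery-seconds", 60),
]
service = [
    Expectation("availability-percent", 99.0),   # ignored: org value is enforced
    Expectation("max-recovery-seconds", 30),     # accepted: service tightens the default
]

for expectation in effective_expectations(org, service).values():
    print(expectation)
```

Because the expectations are plain data, they could live next to the service's code and evolve with it, which is the property the paragraph above asks for.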
Moving from experiments to expectations means that many complicated facets of chaos engineering can be avoided, and resilience engineering can be democratized.
There are many kinds of systems and services implementing a variety of business and social needs. It would be imprudent to assume that all businesses’ needs and expectations regarding their resilience are the same. However, it would be equally imprudent to assume that there aren’t similarities across businesses, especially across service types!
Let us take a look at Kubernetes as an example. Organizations are leveraging Kubernetes in a multitude of ways. Some deploy a handful of services; others operate Kubernetes clusters in cars; others use Kubernetes to implement serverless functions; still others operate at insane scale. Such organizations have varying approaches to the resilience of their Kubernetes clusters. However, there are similarities in the resilience expectations for deployed workloads, e.g., basic Kubernetes best practices (probes & limits), service discovery, decoupling of services, correct timeout configuration and much more.
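As an illustration of the "probes & limits" best practices mentioned above, the following Python sketch checks a Deployment manifest against a small set of declared expectations. The expectation names and the `check_workload` helper are hypothetical and invented for this example; only the manifest fields (`readinessProbe`, `livenessProbe`, `resources.limits`) come from the Kubernetes container spec. Treat it as a sketch of the idea, not as how any particular tool implements it.

```python
from typing import Any

# Hypothetical expectation names mapped to checks on a single container spec.
EXPECTATIONS = {
    "has-readiness-probe": lambda c: "readinessProbe" in c,
    "has-liveness-probe": lambda c: "livenessProbe" in c,
    "has-resource-limits": lambda c: bool(c.get("resources", {}).get("limits")),
}


def check_workload(manifest: dict[str, Any]) -> list[str]:
    """Return one message per container and unmet expectation."""
    containers = manifest["spec"]["template"]["spec"]["containers"]
    violations = []
    for container in containers:
        for name, is_met in EXPECTATIONS.items():
            if not is_met(container):
                violations.append(f"{container['name']}: {name}")
    return violations


# A Deployment manifest as it would look after parsing its YAML (example data).
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {
            "name": "api",
            "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
            "resources": {"limits": {"memory": "256Mi"}},
        },
    ]}}},
}

print(check_workload(deployment))  # ['api: has-liveness-probe']
```

A set of checks like this is easy to share: a newcomer could adopt it unchanged from more experienced users and extend it later.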
Imagine you are just starting with Kubernetes. Wouldn’t it be great to start adopting resilience expectations from experienced Kubernetes users? What if you could get a set of these directly from the Kubernetes authors or even Google?
Knowing what you want (resilience) is one thing; being able to describe it through requirements, experiments, and operational manifests is something else entirely. Why not learn from the experts directly?
Steadybit is on a path to help everyone build and maintain resilient systems. There are many steps ahead of us to achieve our mission. If the above sounds relevant, we would love to hear from you to discuss your needs and what we have in mind. You can contact us via our homepage.