A GameDay is a collaborative, exploratory exercise that helps your team find weaknesses in your system and improve its resilience. We explain what you need to keep in mind.
GameDays support teams and organizations in making their work visible. This may sound strange – after all, the code is running, so what is there to make visible? But it is really about every person working on a complex system sharing their knowledge.
A good analogy is the construction of a car: the development team builds a car together. There are experts for the engine and experts for the bodywork. There is the designer and there is the person responsible for the product. Together, they pool their knowledge so that a safe, stable and efficient car can be built. Once the car is far enough along for a first crash test, these experts get together and discuss how the “system” car can be improved.
And this is exactly how GameDays can be understood: in regular cycles, the system experts come together and do a planned crash test to identify weak spots in their product.
In general, GameDays are meant to improve the resilience of your system. Ideally, they are run on a regular basis so the improvements are sustainable. But there are also other triggers, which may influence your preparations:
Make sure you have some people attending who know how the application works and was intended to work – typically developers. They can often fix issues directly and can frequently tell beforehand whether an experiment will fail without executing it. Just as important as knowing how the application works is knowing how the underlying platform works – this knowledge may be present in the dev team, or it might require you to add someone from operations.
Make sure you have participants who are able and permitted to execute the attacks. This depends on your policies and the tools used. The Steadybit platform makes it easy for non-experts to execute attacks, and with its fine-grained access control the permissions can be scoped down to single containers – this way you can run the experiments without system-wide privileges.
A GameDay typically lasts two to four hours, so make sure that everybody has time.
Please don’t give your (non-participating) colleagues a nasty surprise with the chaos you’ll be causing. Let them know beforehand that you are hosting a GameDay, what impact it might have, and how to reach you if something goes seriously wrong.
As a result of the global pandemic, many people nowadays work from home. If you don’t meet in person, there are some caveats that you should be aware of:
This is what a possible agenda could look like, so that all participants know what to expect:
Bring everyone onto the same page so that they know what will happen on the GameDay. If Chaos Engineering is new to anyone, it is useful to give a short introduction to it (for veterans you might want to skip this).
Even if you work on this system on a daily basis, it is often beneficial to take a step back and view your system as a whole. Try to visualize it using a (virtual) whiteboard and diagrams. This brings everyone up to date – maybe someone missed changes while on vacation or was simply too busy to notice. This step has often proved useful in the past, as it strengthens the communication in your team about the system.
With the architecture in mind, brainstorm meaningful experiments. Ask yourself how the application behaves under certain conditions, starting with fundamental experiments like:
For each test case, design an experiment. First, define a steady state describing the application behavior you expect, e.g.: “5,000 requests per minute on the index page are handled 99% successfully within one second.”
Next, hypothesize how the application will behave under given conditions: “When a single container is restarted, the steady state still holds.”
Now you need to define which target to attack and how. Typically you start with a small blast radius (e.g. a single instance) and increase it once the experiment succeeds (e.g. multiple instances or an entire zone).
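As a sketch, the three ingredients can be written down explicitly – the Experiment structure and its values below are purely illustrative and not tied to any particular tool:

```python
# Minimal sketch of an experiment definition; all names and values are illustrative.
from dataclasses import dataclass

@dataclass
class Experiment:
    steady_state: str   # the measurable behavior we expect
    hypothesis: str     # what we believe will happen under attack
    attack: str         # the failure we inject
    blast_radius: str   # how much of the system is affected

index_page_experiment = Experiment(
    steady_state="5,000 requests/min on the index page, 99% successful within 1s",
    hypothesis="When a single container is restarted, the steady state still holds",
    attack="restart a container of the index-page service",
    blast_radius="a single instance (increase to multiple instances or a zone later)",
)
```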
One thing upfront: if you already know that a designed experiment will fail for a specific reason, fix that issue first and come back to the experiment later.
Verify that you have all necessary metrics and monitoring in place so that you notice when things don’t run as planned, and be prepared to stop the experiment. Now you’re ready to start. Before injecting the failures, check whether the steady state holds – you won’t get useful results from exploring an unstable system.
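As an illustration, such a pre-flight check for the example steady state above could look like the following sketch; the URL, sample count and thresholds are assumptions, and in practice you would often read success rate and latency from your monitoring instead of probing the page directly:

```python
# Sketch: probe the index page and verify the steady state before injecting failures.
# URL and thresholds are assumptions derived from the example steady state above.
import time
import requests

def check_steady_state(url="https://shop.example.com/", samples=50,
                       max_latency_s=1.0, min_success_rate=0.99) -> bool:
    successful = 0
    for _ in range(samples):
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=max_latency_s)
            if response.ok and time.monotonic() - start <= max_latency_s:
                successful += 1
        except requests.RequestException:
            pass  # a timeout or connection error counts as a failed request
    return successful / samples >= min_success_rate

if __name__ == "__main__":
    print("steady state holds" if check_steady_state()
          else "system unstable – do not start the experiment")
```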
While running the experiment, observe the steady state and have a look at your monitoring. Do you see the expected behavior or any unforeseen issues up- or downstream? It is good to collect and record all necessary information – this will help you afterwards when you investigate and fix the issues that you discovered.
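One simple way to record these observations, sketched here on top of the hypothetical check_steady_state() probe from the previous example, is to sample the steady state in a loop while the attack is running and append the results to a file you can review afterwards:

```python
# Sketch: sample the steady state while the attack runs and record the results.
# Assumes check_steady_state() from the previous sketch is defined in the same file.
import csv
import time
from datetime import datetime, timezone

def observe(duration_s=300, interval_s=30, outfile="gameday-observations.csv"):
    with open(outfile, "a", newline="") as f:
        writer = csv.writer(f)
        end = time.monotonic() + duration_s
        while time.monotonic() < end:
            holds = check_steady_state()
            writer.writerow([datetime.now(timezone.utc).isoformat(), holds])
            f.flush()  # keep the file current in case the run is aborted
            time.sleep(interval_s)
```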
After running the experiments, it’s a good time to review all of them, starting with the ones showing unexpected results. It’s like an incident analysis – but in a controlled manner and without all the fuss. File new issues in your bug tracker so they get resolved and can be retested next time. If only small changes are needed, e.g. to your monitoring or application configuration, you may want to make them right away – but don’t forget to retest. For successful experiments, especially those that verify crucial behavior, you should discuss whether it is worth automating them.
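If an experiment is worth automating, one possible approach – again only a sketch – is to wrap it in a plain test that a scheduled CI job re-runs regularly; check_steady_state() is the probe sketched above, and restart_single_container() stands in for whatever attack tooling you actually use:

```python
# Sketch: re-run a verified experiment as an automated test, e.g. in a nightly CI job.
# check_steady_state() is the probe from the earlier sketch; restart_single_container()
# is a hypothetical wrapper around your attack tooling.

def test_index_page_survives_container_restart():
    assert check_steady_state(), "not in steady state – aborting before the attack"
    restart_single_container()
    assert check_steady_state(), "steady state violated after restarting one container"
```

Any test runner your CI already executes (e.g. pytest) will pick this up and fail the pipeline as soon as the hypothesis no longer holds.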
As you can see, organizing a GameDay is not much effort and does not require any time-consuming preparations. Even the first GameDay will identify important weak points whose elimination lets you sleep better. Repeating the exercise will sustainably improve the operation of your system.
Jesse Robbins originally coined the term “GameDay” when he was responsible for availability at Amazon.com. Inspired by his training and experience as a firefighter, he set up GameDays to increase reliability by deliberately creating major failures on a regular basis.