A GameDay is a collaborative exercise that helps your team explore your system, find its weaknesses, and improve its resilience. Here is what you need to keep in mind.
Why run GameDays?
GameDays support teams and organizations in making their work visible. That may sound strange, as you would think: my code is running after all, what should I make visible? But it is really about every person working on a complex system sharing their knowledge.
A good analogy is the construction of a car: the development team builds a car together. There are experts for the engine and experts for the bodywork. There is the designer and there is the person responsible for the product. Together, they all put their knowledge into a pot so that a safe, stable, and efficient car can be built. Once the car is complete enough for a first crash test, these experts get together and discuss how the "system" car can be improved.
And this is exactly how GameDays can be understood: in regular cycles, the system's experts come together and do a planned crash test to identify weak spots in their system.
When should you have a GameDay?
In general, GameDays serve to improve the resilience of your system. Ideally, you run them on a regular basis so the benefit is sustainable. But there are also other triggers, which may influence your preparations:
Do you want to reproduce an incident that led to a past outage? Make sure you have the post mortem for the incident at hand and/or invite the people who dealt with it.
Do you have a new system/major feature and want to ensure that monitoring, alerts and the planned resilience patterns are working properly? Get someone who implemented it and someone who will operate it.
Do you want to retest an issue from a past GameDay and want to ensure that the fix is working properly? Make sure you have the protocol from the past GameDay.
Do you want to try out the flow of a GameDay in general but don't know yet which topic could be interesting for the team? Then invite your colleagues to suggest possible post mortems or incidents from the past. This also gives you a direct feel for which issues the team is concerned about.
Who should participate?
Make sure you have some people attending who know how the application works and how it was intended to work - typically developers. They can often fix issues directly and can frequently tell beforehand whether an experiment will fail without executing it. Just as important as knowing how the application works is knowing how the underlying platform works – this knowledge may be present in the dev team, or you might need to add someone from operations.
Make sure you have participants who are able and permitted to execute the attacks. This depends on your policies and the tools used. The Steadybit platform makes it easy for non-experts to execute attacks, and with its fine-grained access control, permissions can be granted down to single containers – this way you can run the experiments without system-wide privileges.
How long does a GameDay take?
A GameDay typically lasts from 2 to 4 hours. So make sure that everybody has time.
Some soft facts you should care about before and during the GameDay:
Please don’t unpleasantly surprise your (non-participating) colleagues with the chaos you’ll be causing. Let them know beforehand that you are hosting a GameDay, what impact it might have, and how to reach you if something goes seriously wrong.
As a result of the global pandemic, many people are now working from home. If you don’t meet in person, there are some caveats you should be aware of:
Prepare for connectivity loss. Provide alternative communication channels. Make sure that if one of the participants gets permanently disconnected, someone can jump right in for them. You don’t want to lose the only person who can stop the experiment.
Be aware of distractions. In the home office, you can't "turn off" every distraction, but remember to at least turn off notifications.
It's all about making your work visible: use Slack or a Miro board to track progress and exchange extra information. Above all, the chats and the digital whiteboard can serve as helpful documentation afterwards.
Agenda for the next GameDay:
This is what a possible agenda could look like, so that all participants know what to expect:
Greeting and Introduction
Bring everyone onto the same page so that they know what will happen on the GameDay. If Chaos Engineering is new to anyone, it is useful to give a short introduction (for veterans you might want to skip this).
Overview of the system architecture
Even if you work on this system on a daily basis, it is often beneficial to take a step back and view your system as a whole. Try to visualize it using a (virtual) whiteboard and diagrams. This brings everyone up to date – maybe someone missed changes during their vacation or was just too busy to notice. This has often proved useful in the past: it strengthens the communication in your team about the system.
Designing Test Cases / Experiments
With the architecture in mind, brainstorm for meaningful experiments. Ask yourselves how the application behaves under certain conditions. Start with fundamental experiments, like:
How big is the impact when a single instance fails?
Are all requests handled successfully when doing a rolling update?
Is the instance restarted when the underlying host is rebooted?
Then you can continue with slightly more advanced experiments, like:
What happens if external service XYZ is not available?
Can we still process requests when the message queue is full?
Is the application still working when the database has a failover?
For each test case, design an experiment. First define a steady state: What application behavior do you expect, e.g.: “5,000 requests per minute on the index page are handled 99% successfully within one second.”
Next, hypothesize how the application will behave under the given conditions: “When a single container is restarted, the steady state still holds.”
Now you need to define which target to attack and how. Typically, you start with a small blast radius (e.g. a single instance) and increase it when successful (e.g. multiple instances or an entire zone).
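A steady state like the one above boils down to a simple check over measurements. Here is a minimal Python sketch, assuming you have already collected samples of (HTTP status, latency) from your monitoring; the sample format and thresholds are illustrative assumptions, not part of any specific tool:

```python
# Sketch: evaluate a steady state such as "requests on the index page are
# handled 99% successfully within one second" against collected samples.
# The sample format and thresholds are illustrative assumptions.

def steady_state_holds(samples, min_success_rate=0.99, max_latency_s=1.0):
    """samples: list of (http_status, latency_seconds) tuples."""
    if not samples:
        return False  # no data means we cannot confirm the steady state
    ok = sum(1 for status, latency in samples
             if status < 500 and latency <= max_latency_s)
    return ok / len(samples) >= min_success_rate

# Hypothetical measurements taken before and during an attack:
before = [(200, 0.12)] * 99 + [(503, 0.05)]       # 99% ok -> holds
during = [(200, 0.30)] * 90 + [(503, 2.00)] * 10  # 90% ok -> violated

print(steady_state_holds(before))  # True
print(steady_state_holds(during))  # False
```

Making the steady state executable like this also pays off later, because the same check can be reused before, during, and after each experiment.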
Executing the Experiments
First: If you know that a designed experiment will fail for a known specific reason, fix the issue first and then come back to the experiment later.
Verify that you have all necessary metrics and monitoring in place, so that you notice when things don’t run as planned and be prepared to stop the experiment. Now you’re ready to start the experiment. Before injecting the failures check if the steady state is given – you won’t get useful results when exploring an unstable system.
While running the experiment, observe the steady state and have a look at your monitoring. Do you see the expected behavior or any unforeseen issues up- or downstream? It is good to collect and record all necessary information – this will help you afterwards when you investigate and fix the issues that you discovered.
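The flow described above – verify the steady state, inject the failure, observe, roll back, and record everything for the review – can be sketched as a small experiment runner. This is a hypothetical illustration; the field names and the attack/rollback callables are made up, not the API of any real tool:

```python
# Sketch of one experiment run: check steady state, inject the failure,
# observe, roll back, and record everything for the later review.
# All names and fields here are illustrative assumptions.

def run_experiment(name, hypothesis, check_steady_state, attack, rollback):
    record = {"name": name, "hypothesis": hypothesis}
    record["steady_before"] = check_steady_state()
    if not record["steady_before"]:
        record["result"] = "aborted: system not in steady state"
        return record  # no useful results on an unstable system
    attack()
    try:
        record["steady_during"] = check_steady_state()
    finally:
        rollback()  # always stop the attack, even if observation fails
    record["steady_after"] = check_steady_state()
    record["result"] = ("hypothesis confirmed"
                        if record["steady_during"] and record["steady_after"]
                        else "hypothesis rejected - file an issue and retest")
    return record

# Dummy attack for illustration: pretend one of three containers is stopped.
state = {"containers": 3}
result = run_experiment(
    name="restart single container",
    hypothesis="steady state holds while one container restarts",
    check_steady_state=lambda: state["containers"] >= 2,
    attack=lambda: state.update(containers=state["containers"] - 1),
    rollback=lambda: state.update(containers=state["containers"] + 1),
)
print(result["result"])  # hypothesis confirmed
```

The returned record is exactly the kind of information you want on the table for the review afterwards: what was attacked, what was expected, and what was observed.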
After running the experiments, it’s a good time to review them all, starting with the ones showing unexpected results. It’s like doing an incident analysis – but in a controlled manner and without all the fuss. File new issues in your bug tracker so they get resolved and can be retested next time. If only small changes are needed, e.g. to your monitoring or application configuration, you may want to make them right away – but don’t forget to retest. For successful experiments, especially those verifying crucial behavior, you should discuss whether it is worth automating them.
As you can see, organizing a GameDay is not much effort and does not require any time-consuming preparations. Even the first GameDay can identify important weak points, whose elimination lets you sleep better. Repeating the exercise will sustainably improve the operation of your system.
Jesse Robbins originally coined the term “GameDay” when he was responsible for availability at Amazon.com. Inspired by his training and experience as a firefighter, he set up GameDays to increase reliability by deliberately creating major failures on a regular basis.