February 3rd, 2020
With my client's team, I got a chance to run the first cloud GameDay. In this post, I share my experiences and tips on how to run your first GameDay successfully with your team and related parties. However, let me first define what a GameDay is.
It is hard to give a universal definition of "GameDay." If you google it, you are likely to find the AWS GameDay event, a one-day exercise similar to a workshop. In these events, participants are introduced to a live playground of an application and tasked with keeping it running through different situations in which the application is going to fail.
You also might have heard about Netflix's Chaos Monkey tool or chaos engineering in general. Or if you're into infosec, classic red vs. blue teaming exercises have a similar method.
What matters is that all these concepts test an application against a situation where normal operations are disturbed. Only imagination and technical expertise limit the testing. The most common disturbances are likely to be application crashes, host or virtual machine blackouts, and networking problems.
The ability to withstand disturbances to normal operations is called resiliency. Partly this is what GameDays are about: testing your hypothesis about your application's resiliency in practice. But it is much more than just that. By running GameDays, your team can discover previously unthought-of operational faults, broken processes, and many action points for your backlog. The more positive outcomes include a better understanding of your application infrastructure, specialized knowledge shared across team members, experience of running incident processes before real incidents happen in production, and a bit more trust that your application behaves as expected.
The concept is old, and some who may consider themselves lucky call it testing. Now, if you're like me and have never run such a GameDay event to test the resilience of your application and the operational processes around it, please continue. Before I jump into the tips and experiences section, I share some thoughts about why the practice is not more widespread.
Dev vs. ops: If your developers are not involved in operations, it's going to be harder to run a GameDay event to its full potential. It's even harder if your infrastructure operations are outsourced.
Environmental hazards: There might be components shared by multiple environments, and you start to wonder whether the GameDay you host could disturb your production. Such a cross-environment bleed is a finding you probably should act upon, and it should not stop you. If you need to, leave such components out of the GameDay's scope.
Cost issues: Having ten people in the same room for multiple hours has a time cost, which might make you wonder if it pays off. If you think of a GameDay as a way to test your application and processes and to educate your team members about the application's inner workings at the same time, the price hopefully starts to sound reasonable.
Lack of time for a technical setup for tools: While there exist multiple tools and frameworks to streamline your GameDay operations, no such tool is required to get started. The only thing required is one production-like environment and access to create some of the test scenarios by hand or with plain good old shell scripts.
The team is busy, no time to run GameDay event before production: It's quite hard to argue against this one, but luckily, you can organize your GameDay after the product launch too. Frankly, the best practice is to have multiple GameDays. After all, your application architecture undergoes constant evolution.
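To illustrate how little tooling a first GameDay needs, here is a minimal Python sketch of a hand-rolled observer: it polls a health check while someone injects a failure by hand and reports the share of successful checks. The names `observe`, `availability`, and `http_probe`, as well as the idea of the application exposing an HTTP health URL, are assumptions made for this example and not part of the setup described in this post.

```python
import time
import urllib.request
from typing import Callable, List, Tuple


def observe(probe: Callable[[], bool], attempts: int,
            interval_s: float = 1.0) -> List[Tuple[int, bool]]:
    """Call `probe` repeatedly and record whether each attempt succeeded."""
    results = []
    for attempt in range(attempts):
        try:
            ok = bool(probe())
        except Exception:
            # A probe that raises (timeout, connection refused, HTTP error)
            # counts as a failed check.
            ok = False
        results.append((attempt, ok))
        if attempt < attempts - 1:
            time.sleep(interval_s)
    return results


def availability(results: List[Tuple[int, bool]]) -> float:
    """Return the fraction of successful checks, or 0.0 for no checks."""
    if not results:
        return 0.0
    return sum(1 for _, ok in results if ok) / len(results)


def http_probe(url: str, timeout_s: float = 2.0) -> Callable[[], bool]:
    """Build a probe for a hypothetical HTTP health endpoint at `url`."""
    def probe() -> bool:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    return probe
```

During a scenario, one person could run something like `observe(http_probe(url), attempts=60)` against the test environment while another person triggers the disturbance, and the team compares the resulting availability with its hypothesis.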
I split the GameDay into three parts, which felt like the right approach. These parts are:
Pre-planning & scoping: With 2-3 people who know the application, architecture, and processes well
GameDay teaches how to debug errors and where to find the data (metrics, logs) to support your error diagnosis, or shows what data still needs to be collected
Invite all relevant parties who would participate in operations and incident management
Plan scenarios before the event.
Start by drawing the application architecture
Scope the target, which parts are under the test, and which parts must not be touched
Discuss known limitations. There is not necessarily a need to break things that everyone already agrees are going to cause problems. These are still findings, which you should log.
To run a three-hour GameDay session, be prepared with about three scenarios.
Run a retrospective after the GameDay.
Have a large enough physical space for all participants
Start simple; there is no need to use fancy frameworks or tools at first. Just turning off some nodes in your application cluster, killing processes, and straining the resources available to the application produces a bunch of findings. After a while, you may want to start looking for the right tools and frameworks to automate and speed up your testing. You should probably check out ChaosIQ's Chaos Toolkit.
Plenty of blog posts and resources can be found on GitHub in the awesome-chaos-engineering repo.
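The "start simple" advice above can be sketched in a few lines of Python before reaching for any framework. The example below is an illustration only: it starts a child process that stands in for an application, kills it abruptly, and offers a crude way to burn CPU on one core. The function names and the stand-in process are assumptions of this sketch; in a real GameDay you would target processes in your own test environment.

```python
import subprocess
import sys
import time


def start_stand_in_service() -> subprocess.Popen:
    """Start a child process that stands in for the application (hypothetical)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_abruptly(proc: subprocess.Popen, timeout_s: float = 5.0) -> int:
    """Kill the process without letting it clean up and return its exit code."""
    # Popen.kill() sends SIGKILL on POSIX and calls TerminateProcess on Windows.
    proc.kill()
    return proc.wait(timeout=timeout_s)


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds` to strain CPU resources."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


if __name__ == "__main__":
    service = start_stand_in_service()
    print("stand-in service running, pid", service.pid)
    print("exit code after kill:", kill_abruptly(service))
    burn_cpu(1.0)
```

The interesting part of the exercise is not the script but what the team observes afterwards: whether the process is restarted, whether an alert fires, and how long it takes anyone to notice.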
Running events similar to GameDay helps you improve the resilience of your application by discovering new issues in the application and related processes. People tend to understand the value of a GameDay when it is framed as an incident management exercise. While organizing a GameDay takes up some of your precious time from other tasks, you can start simply by running the scenarios manually, and then automate the ones you find insightful for future events.