How to set up a chaos engineering game day
Is it fun to spend the day breaking stuff in a war room with your coworkers? Of course, but more than that, it's vital to the security and stability of certain applications.
It's better to exploit your own security vulnerabilities than to wait for an attacker to do it for you. When you understand how a hacker might disrupt or access your system, you can better prevent the intrusion.
Chaos engineering, a type of destructive testing, helps enterprises discover weaknesses in infrastructure or in how they identify and solve problems. The technique is not a simple assessment of systems, but an attempt to breach or break them. Developers and operations teams sometimes organize these activities into day-long events called chaos engineering game days. For some apps, the game day invites include executives whose products a security breach would affect.
Let's learn more about what a chaos engineering game day entails, and how businesses can host one to their advantage.
Types of chaos engineering days
There are three broad types of chaos engineering game days, all of which can improve an enterprise's intrusion response, said James Burns, a developer advocate at LightStep, a distributed tracing platform. Those types of chaos engineering exercises are:
- active failures
- team tabletops
- executive tabletops
Burns has helped set up all three types of security exercises, including in a technical lead role at Stitch Fix, a personal styling retailer that operates without brick-and-mortar stores. Organizations don't need fancy chaos engineering tools to get started. An active failure day might start with a sudo halt command to remove capacity, or with the use of a traffic shaping tool to slow or block network traffic to databases. Watch IT systems performance in the wake of these disruptions.
"Expect to be surprised, and not just the first time," Burns said. "Systems change over time, and chaos [engineering] game days keep your knowledge of your systems fresh."
In Burns' experience, team members and management invariably ask why they need to run these exercises when stuff is already breaking in production. To answer, he points out the value of having focused time to examine failures, when people are alert and up to the task. "Would you rather learn how to better build and operate your system at 2 p.m., or 2 a.m.?" he asked.
Active failures
In an active failure day, the team is present, either in a conference room or remotely. Each team member has an incident response role. There are two primary roles in an active failure:
- Game runner. Burns calls this role the "master of disaster." This person causes a real -- but containable and reversible -- failure in systems running on staging or production servers. The game runner declares the start of the incident, but doesn't disclose the nature of the failure.
- Initial responder. This person attempts to find the root cause of the failure and mitigate its impact.
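One way to picture a containable, reversible failure is a fault that always undoes itself. The following Python sketch is illustrative only and is not taken from Burns' exercises: FakeDatabase, injected_latency and the delay value are hypothetical names, and a real game day would more likely slow traffic at the network level with a traffic shaping tool.

```python
import time
from contextlib import contextmanager


class FakeDatabase:
    """Hypothetical stand-in for a real database client."""

    def query(self, sql):
        return f"rows for: {sql}"


@contextmanager
def injected_latency(target, method_name, delay_seconds):
    """Slow down one method on target, and always restore it on exit."""
    original = getattr(target, method_name)

    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)  # the injected fault
        return original(*args, **kwargs)

    setattr(target, method_name, slowed)
    try:
        yield
    finally:
        # Reversal runs even if the code under test raises.
        setattr(target, method_name, original)


db = FakeDatabase()
with injected_latency(db, "query", delay_seconds=0.05):
    started = time.perf_counter()
    result = db.query("SELECT 1")
    slowed_elapsed = time.perf_counter() - started

# Outside the block, the original behavior is back.
restored_result = db.query("SELECT 1")
```

The point of the try/finally pattern is that the game runner can end the exercise at any moment and be confident the fault is removed, which is what makes the failure containable.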
Learning how to effectively manage resources in a crisis is a core part of an active failure practice. Additional team roles depend on the incident response procedure at a particular company. Some typical roles include:
- First on call. This person receives the alert that something is wrong and needs to be fixed.
- Second on call. As the fallback person to the first on call, this person steps in to provide additional assistance.
- Incident commander. When a group follows the Incident Command System or a similar model, an incident commander will be first on call when the incident occurs. This person often transfers responsibilities to an engineering manager or executive who brings in additional resources and manages external communications.
The rest of the team operates a particular service or set of services, and they are called if there is a serious issue. In the case of serious outages or other problems, it's typically all hands on deck -- as many team members and resources as possible are brought in to investigate or attempt mitigations.
In the postmortem process, the team identifies the gaps in tooling that could have helped during the chaos engineering game day response, and processes that might accelerate a return to normal. "These [changes] can then be prioritized as regular work without the cost of a reputation- or revenue-damaging outage," Burns said.
Manifold, a cloud-native marketplace provider, runs active failure scenarios. Most of Manifold's chaos engineering game days focus on back-end services. Game runners shut down or misconfigure a service without warning. They might break part of the back-end application specifically to observe how the front end handles the failure -- whether it displays an error or stops responding altogether. Team members work with the debugging and diagnostic tools in place to run through the scenario and glean incident response information.
Active failure chaos engineering requires communication with managers about potential risks, and potentially with executives too. Active failures illustrate the importance of defined incident response procedures and roles, though these things don't need to be in place at the outset of the exercise. The activity is about getting the team to be familiar with the uncomfortable, stressful feeling of dealing with failure. And at the end of the exercise, organizations often realize the value of formalized procedures and roles.
Team tabletops
Team tabletops follow a similar process to active failures, with one important distinction: No real failures happen. Tabletop chaos engineering games are simulation activities. Team tabletop days help everyone become prepared for the pressure of a live incident and ensure team members understand the tools involved in incident response, in a made-up scenario. But made-up doesn't mean fantasy.
"Often it's easiest to base the tabletop incident on a real incident in the past," Burns said.
The game runner must know the systems well enough to tell what's happening when, and what information would appear in monitoring and management tools. Have dashboards and chat logs available to reference in the course of the team tabletop, to help understand what was visible to engineers and customers at the time of the incident.
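A game runner who bases the tabletop on a past incident could keep the timeline in a small script and reveal only what responders would have seen at each point. This is a minimal sketch under stated assumptions: TimelineEvent, visible_at and every entry in TIMELINE are invented for illustration, and a real exercise would draw on the dashboards and chat logs from the actual incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEvent:
    minute: int   # minutes since the incident started
    source: str   # where responders would see it (dashboard, chat, pager)
    message: str


# Invented example data; a real tabletop would use the past incident's records.
TIMELINE = [
    TimelineEvent(12, "chat", "Support reports customers cannot log in"),
    TimelineEvent(0, "dashboard", "Database latency starts climbing"),
    TimelineEvent(5, "pager", "Alert: API error rate above threshold"),
]


def visible_at(timeline, minute):
    """Return the events responders could have seen by the given minute."""
    ordered = sorted(timeline, key=lambda event: event.minute)
    return [event for event in ordered if event.minute <= minute]


# Five minutes in, the game runner reveals only the first two events.
for event in visible_at(TIMELINE, 5):
    print(f"t+{event.minute:>2} min [{event.source}] {event.message}")
```

Advancing the minute value as the discussion proceeds lets the game runner withhold later clues, much as they would withhold the nature of the failure in an active exercise.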
Manifold uses made-up events to practice debugging skills among its team members. These chaos engineering exercises can also simulate interactions with vendor support to make sure that on-call engineers know how to escalate and communicate in a real event.
Executive tabletops
In executive tabletop chaos engineering, the emphasis is on cross-functional communication and procedures for response. The game runner presents a simulated business threat in real time, such as a security breach that results in the loss of customer data. The executive team responds with what actions and communications they would make. This type of chaos engineering often reveals gaps in responsibility, lack of clarity about a situation's urgency and other important issues that hinder executive and business resilience to risk, Burns said.
You must prepare a realistic business risk for the company to run an executive tabletop, so consider:
- Can you learn from the example of a similar business in the market that experienced a real-life disruption?
- What keeps your employees, at various levels, awake at night?
- What system issues would most worry your customers?
These answers help the chaos engineering game runner choose a realistic, plausible and relevant risk for the company. The game runner must also figure out how communication channels work, to and from executives. Then, it's a matter of getting executive time -- no easy task.
How to maintain a chaos engineering program
It isn't easy to run a chaos engineering game day. Nonetheless, it should be both fun and instructive.
Manifold has hosted several styles of chaos engineering game days. Examples include 30-minute tabletop events as well as multi-hour active failure events that involve the full engineering team. A recent offsite Manifold event involved dice rolls, character classes and prizes for surviving the chaos incident.
To maintain a chaos engineering program, employees must enjoy the challenge. "Uncontrolled chaos will happen to your system -- save your seriousness for that," said James Bowes, CTO of Manifold. Role-playing game days are a great way to keep it interesting.
With each chaos engineering game day, the organization should build up its resistance to digital failure. "As you proceed, and if you are successful, it should become more difficult to find parts of the system to break," Bowes said. Let the participants know that the goal is to find problems; if they break something, consider that a success. But keep other teams and stakeholders informed.
Not every scenario is right for a chaos engineering game day. Manifold has run out of time before it could introduce an intended failure. Avoid chaos engineering setups that are too involved, and check whether the required resources are in place for your idea. Failed chaos exercises are a learning opportunity to prepare better for the next game day.
Host active-failure game days in the staging environment instead of production to give yourself more prep time in advance of the event. Most of the setup time involves figuring out how to introduce the problem the game runner wants to cause.
In some cases, the game runner failed to break anything, which feels like a ruined event in the moment, but is a win from a software quality perspective. When a system proves resilient to failure, the game day has clarified the team's understanding of how the system works.
A tabletop chaos engineering game day example
Freshworks, a CRM and help desk tools provider, holds tabletop exercises for both engineers and executives. This chaos engineering strategy provides the same result as live exercises, but with less overhead. With these exercises, the team simulates a hypothetical scenario, and uses it to strengthen preparedness of the team, policies, systems, communications and procedures.
"These [exercises] are akin to muscle memory that golfers develop on the driving range when they hit the golf ball," said Prasad Ramakrishnan, CIO at Freshworks.
In Freshworks' example, an incident commander brings all the required parties into the conference room and defines the chaos engineering scenario. She then asks participants how they would respond to the simulated incident, and the group analyzes those responses in real time to determine the time to resolution. The team is judged on the policies and communication framework in place. Freshworks then refines its processes, policies and procedures -- and gets prepared for the next simulation.
A few keys to success:
- identify keywords that describe the urgency of the issue;
- determine failure modes;
- establish communication templates; and
- prepare a playbook for the incident response team.
"These types of drills are done two or three times a year, so the parties involved know how to respond when the real incident happens," Ramakrishnan said.
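Two of the keys to success listed above, urgency keywords and communication templates, lend themselves to a small script that keeps status updates consistent during a drill. The sketch below is an assumption-laden illustration, not Freshworks' actual tooling: the severity labels, the template wording and the render_status helper are all hypothetical.

```python
from string import Template

# Hypothetical urgency keywords; each organization defines its own.
SEVERITY_KEYWORDS = {
    "SEV1": "customer-facing outage",
    "SEV2": "degraded service",
    "SEV3": "internal impact only",
}

# Hypothetical communication template for status updates.
STATUS_TEMPLATE = Template(
    "[$severity] $summary. Impact: $impact. "
    "Next update in $next_update_minutes minutes."
)


def render_status(severity, summary, next_update_minutes):
    """Fill in the status template, rejecting unknown urgency keywords."""
    if severity not in SEVERITY_KEYWORDS:
        raise ValueError(f"unknown severity: {severity}")
    return STATUS_TEMPLATE.substitute(
        severity=severity,
        summary=summary,
        impact=SEVERITY_KEYWORDS[severity],
        next_update_minutes=next_update_minutes,
    )


message = render_status("SEV1", "Checkout API returning errors", 30)
print(message)
```

Rejecting unknown keywords is a deliberate choice in this sketch: it forces participants to agree on a shared vocabulary for urgency before the real incident happens.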