Chaos engineering enters mainstream QA, drills down to apps
Add a bit of chaos to predeployment application testing to gain resilience. Gremlin CEO Kolton Andrus discussed the value of breaking apps to fix them.
Chaos engineering and fault injection sound like destroyers of software quality, but these practices align with other proactive testing techniques, such as stress testing, that improve software resilience. Fault injection -- a core technique in chaos engineering -- is like a flight simulator for application quality assurance, where risky situations are tested in a controlled setting before deployment.
Kolton Andrus, CEO of Gremlin, a provider of failure testing tool suites in San Jose, Calif., is a former software engineer for Netflix and Amazon -- he built the former's failure injection service. Andrus offered insights on the maturity of chaos engineering and what he sees as best practices for fault injection and proactive testing.
Where is chaos engineering in terms of adoption and maturity?
Kolton Andrus: If you look at the early adopter curve, we are somewhere between early adopter and early majority. Chaos engineering is practiced by more than just the Amazons, Netflixes and the Dropboxes of the world now. Many banks, traditional commerce companies and nontech businesses are doing fault injection, for instance.
We're seeing [chaos engineering] start to move into more of the mainstream. There's education to do, as a lot of people still don't really know why they should do chaos engineering and how to do it well.
What is the value of driving fault injection down to the application level?
Andrus: The big difference between chaos engineering and standard testing is when tests happen. We [at Gremlin] think of chaos engineering as thoughtful, planned experiments in which you break specific things before they go into production in order to see what happens. It is done very deliberately and very carefully.
Specificity is critical in application-level fault injection, and [the technique] is applied by creating failures in individual requests, individual network calls or a very specific communication. [An] isolated, targeted approach on the application level [reveals] specific information that can't be gathered at the infrastructure, container or other host-environment level. We want to test such specific activities, because, most of the time, not everything fails; just one specific thing fails.
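To make the idea of failing one specific call concrete, here is a minimal Python sketch. It is an illustration only, not Gremlin's tooling or anything Andrus described in code: `with_fault`, `InjectedFault`, `fetch_profile` and the `test-1` user are hypothetical names. The wrapper fails only the requests a selector picks out, so the application's fallback path can be observed while every other request behaves normally.

```python
class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault(call, should_fail):
    """Wrap `call` so that only requests selected by `should_fail` raise."""
    def wrapped(request):
        if should_fail(request):
            raise InjectedFault(f"injected failure for {request!r}")
        return call(request)
    return wrapped


def fetch_profile(request):
    # Hypothetical stand-in for a real network call to a downstream service.
    return {"user": request["user"], "status": "ok"}


# Fail only the one call under test: profile lookups for user "test-1".
faulty_fetch = with_fault(fetch_profile, lambda r: r["user"] == "test-1")


def handle(request):
    """Application code under test: degrade gracefully if the call fails."""
    try:
        return faulty_fetch(request)
    except InjectedFault:
        # Real code would catch the dependency's own error type here.
        return {"user": request["user"], "status": "degraded"}
```

Calling `handle({"user": "test-1"})` exercises the fallback path, while `handle({"user": "alice"})` goes through untouched, which mirrors the point that usually just one specific thing fails.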
For example, Amazon had a big S3 [Simple Storage Service] outage two years ago that was caused by a very specific [process], which was not proactively tested. Also, there have been many domain name service failures due to DNS attacks, all of which could have been prevented or recovered from faster if tested proactively.
The goal is to break application-level things before the application goes into production. This protects users from experiencing preventable outages or failures.
How do chaos engineers break things without causing harm to software and system environments?
Andrus: First of all, be sure you can shut down the experiment, clean it up and get back to square one. Create an eject button, [as] it's not safe to let an experiment keep going after something has gone wrong.
Always set up a blast radius. This is a cornerstone practice in chaos engineering, as it limits the damage a failed experiment can cause. Don't run a chaos experiment at 100% the first time. It's dangerous. Run the smallest experiment that will provide useful information, such as a test on a single user or single host or server. Even better, begin in the staging environment.
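The two safeguards described so far, an eject button and a blast radius, can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions rather than a real chaos tool: the `Experiment` class and the `inject`, `revert` and `healthy` callbacks are hypothetical names invented for the example.

```python
import threading


class Experiment:
    """Hypothetical experiment runner with a blast radius and an eject button."""

    def __init__(self, targets, max_targets=1):
        # Blast radius: only the first `max_targets` targets are ever touched.
        self.targets = list(targets)[:max_targets]
        # Eject button: setting this event stops the experiment.
        self.eject = threading.Event()
        self.active = []

    def run(self, inject, revert, healthy):
        """Inject faults one target at a time; always clean up afterward.

        Returns True if the experiment finished without ejecting.
        """
        try:
            for target in self.targets:
                if self.eject.is_set():
                    break
                inject(target)
                self.active.append(target)
                if not healthy():
                    # Something went wrong: stop rather than keep going.
                    self.eject.set()
        finally:
            # Get back to square one, whether or not the run succeeded.
            while self.active:
                revert(self.active.pop())
        return not self.eject.is_set()
```

Because the cleanup lives in a `finally` block, the faults are reverted even if `inject` or `healthy` raises, and another thread or an operator can call `experiment.eject.set()` to stop the run early.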
At this lower scale, it's good to focus tests on operational configuration. For example, test how the application works when there's an influx of traffic, how to keep dependencies controlled instead of overwhelmed and if configurations are correct.
If you do that small experiment and something breaks, you win. You've found something that doesn't work right, and you'll fix it and learn what to do if something similar happens in production. Users aren't impacted.
The next step is to run an experiment for, say, 10 users or 10 hosts or in one region. In each step, you mitigate the risk of the experiment by first verifying that a smaller piece of the software and process works.
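The step-by-step widening described here amounts to a simple loop: run the smallest stage, and move to the next size only if it passed. The Python sketch below is illustrative only; `ramp` and `run_stage` are hypothetical names, and the stage sizes are example values rather than recommendations from the interview.

```python
def ramp(stages, run_stage):
    """Run stages in increasing size and stop at the first failure.

    `stages` is a list of blast-radius sizes (for example, host counts).
    `run_stage(size)` is a caller-supplied function that runs one
    experiment at that size and returns True if the system stayed healthy.
    Returns the list of stage sizes that passed.
    """
    passed = []
    for size in stages:
        if not run_stage(size):
            # A failure at this size is a finding: fix it before going wider.
            break
        passed.append(size)
    return passed
```

For instance, `ramp([1, 10, 100], run_stage)` never attempts the 100-host stage if the 10-host stage fails, so each wider experiment rests on a smaller one that already worked.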
Doing experiments within a small blast radius and with an eject function creates a low-impact environment for breaking things in preproduction.