4 practical methods to increase service resilience

Resiliency refers to the ability of your architecture to quickly predict, detect and mitigate potential performance failures. Here are four ways to fortify your software.

Matt Heusser, Excelon Development

Published: 23 Jul 2019

Modern software is vastly more complex compared with apps of old, which needed only a web server and a database to function. Today, architects deal with gnarly distributed architectures and the intricacies created by public and private clouds, server virtualization, containers, clusters and microservices -- just to cite a few.

But systems old and new can fail. Building resilience into those older apps was far simpler than it is for today's applications.

The technical term for a system that fails yet remains functional from a customer or user perspective is a resilient system. When you think about resiliency, think about a forest ecosystem that can survive a wildfire, a drought or a flood. Instead of focusing on mean time between failures, an architecture based on resiliency is designed to improve mean time to repair.
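To see why repair time matters, here is a rough sketch using the common steady-state availability formula, MTBF / (MTBF + MTTR). The hour figures are made up for illustration, and the function name is an assumption, not something from a particular tool.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative numbers only; substitute measurements from your own system.
    scenarios = [
        ("baseline", availability(mtbf_hours=1000, mttr_hours=10)),
        ("twice as long between failures", availability(mtbf_hours=2000, mttr_hours=10)),
        ("ten times faster repair", availability(mtbf_hours=1000, mttr_hours=1)),
    ]
    for label, value in scenarios:
        print(f"{label}: {value:.4%}")
```

With these assumed numbers, cutting repair time from 10 hours to one improves availability more than doubling the time between failures.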

Consider these four techniques to fortify your service resilience.

1. Implement rolling upgrades

With cluster orchestrators like Kubernetes, you might run five, 10 or 100 servers. Instead of upgrading all of them at the same time, upgrade one server at a time and monitor its health. In the book How We Test Software at Microsoft, authors Alan Page, Ken Johnston and Bj Rollison recommend first routing only internal employees to recently upgraded servers and releasing to customers only once an architect deems the release stable. This way, paying customers see the most stable version, while employees are essentially beta testers.

This approach, known as canary deployment, can help identify problems before they reach a paying customer. Make sure you aggregate and log anything that might indicate a performance issue so you can address it before general release. Pay attention to any feedback from automated monitoring and reporting software and from your beta users. This method helps developers identify issues without the need for comprehensive and repetitive testing.
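The one-server-at-a-time idea can be sketched in a few lines. This is a simplified illustration, not how Kubernetes or any specific tool implements it; the upgrade, is_healthy and rollback callables and the server names are hypothetical stand-ins for whatever your platform provides.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(
    servers: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Upgrade servers one at a time and stop at the first unhealthy one.

    Returns the servers that were upgraded and passed the health check.
    """
    upgraded: List[str] = []
    for server in servers:
        upgrade(server)
        if not is_healthy(server):
            # Roll back only the server that failed and halt the rollout,
            # so the remaining servers keep running the old version.
            rollback(server)
            break
        upgraded.append(server)
    return upgraded


if __name__ == "__main__":
    # Simulated fleet in which the health check fails on "web-3".
    versions = {f"web-{i}": "v1" for i in range(1, 6)}

    def upgrade(server: str) -> None:
        versions[server] = "v2"

    def rollback(server: str) -> None:
        versions[server] = "v1"

    def is_healthy(server: str) -> bool:
        return server != "web-3"

    print(rolling_upgrade(list(versions), upgrade, is_healthy, rollback))
    print(versions)
```

Putting the servers that internal employees use at the front of the list gives the employees-first ordering described above.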

2. Retry functionality and make services asynchronous

Most traditional web services are synchronous: You send a request and wait for a reply. However, these wait times can really add up. In the case of HealthCare.gov, the web browser made a request that went to the server, and that server tried to interact with numerous healthcare providers to find plan availability and premium costs. If any of those attempts failed, the entire transaction risked timing out and failing as well.

One way to add more service resilience is to decouple the response from the wait. Open source data stores like Redis, often used as a cache in front of a slower database, support this approach. Rather than waiting for the database to confirm that a response is current before it transmits information, the service quickly sends back cached information that might be out of date. The cache eventually receives the correct information, but serving old data in the meantime cuts out the time the server would otherwise spend waiting for a response from the database.
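Here is a minimal sketch of both ideas in this section: retry a flaky call, and answer immediately with possibly stale data while a background thread fetches the fresh value. It uses a plain in-memory dictionary rather than Redis, and the names StaleCache and fetch_with_retry are invented for this example.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


def fetch_with_retry(fetch: Callable[[], str], attempts: int = 3, delay: float = 0.01) -> str:
    """Call fetch() up to `attempts` times, pausing a little longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: let the caller see the error
            time.sleep(delay * attempt)
    raise AssertionError("unreachable")


class StaleCache:
    """Return the last known value right away and refresh it in the background."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._values: Dict[str, str] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Tuple[Optional[str], threading.Thread]:
        """Return (cached value or None, refresh thread) without waiting on the fetch."""
        with self._lock:
            cached = self._values.get(key)
        worker = threading.Thread(target=self._refresh, args=(key,), daemon=True)
        worker.start()
        return cached, worker

    def _refresh(self, key: str) -> None:
        try:
            value = fetch_with_retry(lambda: self._fetch(key))
        except Exception:
            return  # every retry failed: keep serving the old value
        with self._lock:
            self._values[key] = value


if __name__ == "__main__":
    cache = StaleCache(lambda key: f"{key} @ {time.time():.0f}")
    value, worker = cache.get("plan-prices")
    print("first read:", value)   # nothing cached yet, so the caller gets None
    worker.join()                 # a real service would not block here
    value, worker = cache.get("plan-prices")
    print("second read:", value)
    worker.join()
```

The trade-off is the one described above: callers never wait on the slow dependency, but they must tolerate an empty or out-of-date answer.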

3. Test in production with synthetic transactions

I once worked with a website that drew user complaints because it was slow and often timed out during login. By the time we had received enough calls to investigate, we would test it by hand, and everything would appear fine. Today we would call that an observability problem.

One of the support engineers solved that problem by making a tiny test script that was eventually able to identify exactly when and why failures occurred.

Those types of small test scripts that run in production and provide user experience information are called synthetic transactions. Some frameworks provide them out of the box, but others might require a test tool. I personally prefer Document Object Model-to-database tests -- little snippets that check one particular piece of functionality all the time, logging the results and timing. Teams can even visualize these results on a dashboard.
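A synthetic transaction can be as small as the sketch below: run one check, time it, and log the outcome so a dashboard or alert can pick it up. The names ProbeResult, run_probe and the two-second threshold are assumptions for illustration. The check itself is passed in as a callable; in practice it would exercise something like the login flow against your own site.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    name: str
    ok: bool
    seconds: float
    error: str = ""


def run_probe(name: str, check: Callable[[], None], slow_after: float = 2.0) -> ProbeResult:
    """Run one check, record whether it raised and how long it took, and log the result."""
    start = time.perf_counter()
    try:
        check()
        ok, error = True, ""
    except Exception as exc:
        ok, error = False, repr(exc)
    seconds = time.perf_counter() - start
    # Failures and slow runs are logged at WARNING so they stand out.
    level = logging.INFO if ok and seconds <= slow_after else logging.WARNING
    logging.log(level, "probe=%s ok=%s seconds=%.3f error=%s", name, ok, seconds, error)
    return ProbeResult(name, ok, seconds, error)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

    def fake_login() -> None:
        # Stand-in for a real login request against a test account.
        time.sleep(0.05)

    for _ in range(3):
        run_probe("login", fake_login)
        time.sleep(1)  # a scheduler such as cron would normally handle the interval
```

Because every run is timestamped, the log shows exactly when failures start and stop, which is the information hand testing could not provide.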

But this technique is not for all applications. The addition of synthetic transactions requires time and energy that might be better spent on other projects. Make sure there's a tangible problem to be solved to avoid wasting time on an unnecessary process. Internal systems with high uptime that everyone will notice when they drop, such as directory and login servers, are bad candidates for this technique, since user experience will suffer if the transactions strain the system's performance.

4. Engineer for redundancy, then ...

Service resilience doesn't just mean you need to engineer for redundancy. It means you need to test for it. Netflix's much-lauded Chaos Monkey terminates random cloud-based servers deployed with Spinnaker, an open source continuous delivery tool. The idea behind Chaos Monkey was to induce a specific failure and measure the results. If a small purposeful break creates a real failure, then restore service and add redundancy.

Testing for redundancy does not have to mean using Chaos Monkey. If you don't use Chaos Monkey, then you certainly don't need to write your own. Instead, devise a test plan that includes forced failures. Knowing what happens under failure conditions might be good enough. Or you might find that the redundancy doesn't really fail over, and the fix is easy.
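A forced-failure test does not need to be elaborate. The sketch below simulates it in memory: take one replica down on purpose, check whether a request still succeeds, and restore the replica afterward. ReplicaPool and survives_loss_of are hypothetical names; in a real test plan, force_failure would stop an actual server or process.

```python
from typing import Dict, Iterable


class ReplicaPool:
    """Toy stand-in for a set of redundant servers behind a load balancer."""

    def __init__(self, names: Iterable[str]) -> None:
        self._up: Dict[str, bool] = {name: True for name in names}

    def force_failure(self, name: str) -> None:
        self._up[name] = False

    def restore(self, name: str) -> None:
        self._up[name] = True

    def handle(self, request: str) -> str:
        # Fail over to the first replica that is still up.
        for name, up in self._up.items():
            if up:
                return f"{name} served {request}"
        raise RuntimeError("no healthy replica")


def survives_loss_of(pool: ReplicaPool, name: str) -> bool:
    """Take one replica down, try a request, and always restore the replica."""
    pool.force_failure(name)
    try:
        pool.handle("health-check")
        return True
    except RuntimeError:
        return False
    finally:
        pool.restore(name)


if __name__ == "__main__":
    redundant = ReplicaPool(["db-primary", "db-standby"])
    single = ReplicaPool(["db-only"])
    print("redundant pool survives:", survives_loss_of(redundant, "db-primary"))
    print("single server survives:", survives_loss_of(single, "db-only"))
```

The second case is the interesting one: it is the kind of result that tells you where redundancy is missing or does not actually fail over.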
