Attain magical five nines availability for cloud applications
Every company seeks the magical five nines -- or 99.999% availability. If major providers can't promise it, how can you get it?
If a tree falls in a forest and no one is around to hear it, does it make a sound? Let's apply that to IT. If part or all of your cloud data center fails and no applications experience outages, is it still a disaster?
Arguably, someone needs to restore the cloud to proper operations, but disaster can be in the eye of the beholder. Cloud users can construct applications in such a way that minor or even major outages are not disasters.
Imagine that a mission-critical application needs 99.999% uptime, often called "five nines." Five nines means that during an entire year, the application can be unavailable for a maximum of 5 minutes and 15 seconds total.
Amazon's service-level agreement for its Elastic Compute Cloud (EC2) promises 99.95% uptime. Over a year, that allows AWS users up to 4 hours and 22 minutes of unavailability -- 50 times more downtime than five nines permits.
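The downtime figures above follow from simple arithmetic. The sketch below reproduces them, assuming a 365-day year; the function names are illustrative only.

```python
# Downtime budget implied by an availability percentage over a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years are ignored


def downtime_seconds(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in seconds, for an uptime percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes and seconds (rounded to the second)."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


five_nines = downtime_seconds(99.999)  # the mission-critical target
ec2_sla = downtime_seconds(99.95)      # the EC2 SLA figure cited above

print("Five nines budget:", format_duration(five_nines))
print("99.95% budget:   ", format_duration(ec2_sla))
print("Ratio:           ", round(ec2_sla / five_nines))
```

The ratio depends only on the two percentages (0.05% divided by 0.001%), so it holds regardless of how long the measurement period is.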
The old way: Uptime responsibility falls to the infrastructure
In traditional environments, IT pros can use a variety of techniques to achieve the super high availability a mission-critical app warrants. A core assumption is that much of the responsibility for achieving five nines in traditional IT falls to the infrastructure.
Redundancy is built from the core hardware up. Servers are ordered with dual power supplies, dual NICs for each network, onboard RAID arrays, diagnostics software and more. Each rack has dual switches and power routed through different distribution systems that are backed up by batteries and diesel generators. Virtualization software is configured for high availability with live images running on multiple host machines. Data is kept in highly available storage arrays from vendors such as EMC and NetApp. Databases are clustered, and application servers are clustered and load-balanced.
Then, the entire infrastructure to support the application is duplicated in a totally separate data center with high-speed dedicated networks between them. There's a lot more, but you get the idea. It's a very expensive model to build and operate.
Why cloud's way is different
In the largest and most successful cloud environments, many of the elements above are no longer in place. Servers are a commodity and treated as disposable in order to drive out costs. Redundancy is far less common; a single power supply, commodity disk arrays and bare-bones virtualization are more the norm.
With a focus on cost and commodity infrastructure, it is easy to see why Amazon EC2 can only promise an SLA of 99.95%. The SLA is clearly a deliberate move by Amazon, and it reflects a core philosophy: Infrastructure failures shouldn't be that interesting. The infrastructure assumptions of traditional IT are outdated and can't be relied upon to support application availability requirements.
So how can you achieve the magic five nines in this environment?
High availability means thinking services, not apps
The key principle of attaining high availability in the cloud is to move the responsibility from the infrastructure to the application.
Moving responsibility for availability to the application is a fairly complex undertaking. In existing systems, it can require significant budget and time to achieve. Before moving a mission-critical app to the cloud, make sure you understand this investment and risk.
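One way to picture application-level responsibility is a caller that fails over between redundant copies of a service. The following is a minimal sketch, not a production pattern: the replica functions are hypothetical stand-ins, and the availability formula assumes replica failures are fully independent, which real deployments only approximate.

```python
from typing import Callable, Sequence


def combined_availability(per_replica: float, replicas: int) -> float:
    """Availability when any one of N replicas can serve the request.

    Assumes failures are independent: the system is down only if all are down.
    """
    return 1 - (1 - per_replica) ** replicas


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    last_error = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as error:
            last_error = error  # this replica is down; move on to the next one
    raise RuntimeError("all replicas failed") from last_error


# Hypothetical replicas: the first simulates an outage, the second answers.
def replica_in_zone_a() -> str:
    raise ConnectionError("zone A unavailable")


def replica_in_zone_b() -> str:
    return "response from zone B"


print(call_with_failover([replica_in_zone_a, replica_in_zone_b]))
print(f"{combined_availability(0.9995, 2):.8f}")
```

Under the independence assumption, two replicas at 99.95% each come out above 99.999% in combination. In practice, correlated failures and the failover logic itself reduce that figure, which is part of why the investment is significant.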
At a high level, there are some key concepts to keep in mind. Each of these may require you to adopt significantly new practices in your development and operations teams. There are four fundamental rules for building scalable and highly available cloud-native applications, and the first is the most important: Think services, not applications.
Cloud-native systems are built as collections of loosely coupled services, not as monolithic applications. Netflix is the poster child for cloud-native application architectures and practices. The company speaks in terms of "micro services," where functionality is decomposed into small atomic units that interact with each other through published Web services APIs. An example is the service that fetches the list of "Recently Watched" items when you're looking for something to stream. Other services bring the images, descriptions and data that appear on various screens. Each service has an API that others use to interact with it.
The Netflix application is actually a highly distributed network of hundreds or thousands of services. Thus, development teams work on services, not apps.
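To make the idea concrete, here is a rough sketch of a screen assembled from several small services, each reached only through its own narrow interface. All names and data are hypothetical and are not Netflix's actual APIs; plain functions stand in for what would be separate networked services, and the fallback behavior is an assumption about how a caller might tolerate a failed dependency.

```python
from typing import Callable, Dict, List


# Each function stands in for a separate service behind its own API.
def recently_watched_service(user_id: str) -> List[str]:
    return ["title-101", "title-202"]


def artwork_service(title_id: str) -> str:
    raise TimeoutError("artwork service unavailable")  # simulate an outage


def description_service(title_id: str) -> str:
    return f"Description for {title_id}"


def call_or_default(service: Callable[[str], str], arg: str, default: str) -> str:
    """Call a service and fall back to a default value if it fails."""
    try:
        return service(arg)
    except Exception:
        return default


def build_home_screen(user_id: str) -> List[Dict[str, str]]:
    """Compose the screen from independent services, one row per title."""
    rows = []
    for title_id in recently_watched_service(user_id):
        rows.append({
            "title": title_id,
            "image": call_or_default(artwork_service, title_id, "placeholder.png"),
            "description": call_or_default(description_service, title_id, ""),
        })
    return rows


screen = build_home_screen("user-42")
for row in screen:
    print(row)
```

Because the artwork lookup is a separate service, its simulated outage costs the screen only its images; the list and descriptions still render.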
If you're writing a new application, follow Netflix's lead and think in terms of services. It becomes more difficult if you're starting with an older application. First, you need to take what might be a monolithic system developed by multiple teams over several years and decompose it into services. In addition to the code, the data underlying each service should be decomposed.
What some think of as functions in a traditional application -- account look-up, inventory tracking, crediting payments, etc. -- would be decomposed into one or more discrete services.
Part two will cover the remaining three rules for building or modernizing scalable and highly available applications in the cloud, and how to generate five nines availability from three nines infrastructure.
About the author:
John Treadway is a senior vice president at Cloud Technology Partners and is based in Boston. You can reach him at [email protected] or on Twitter @johntreadway.