AWS outage brings DR strategies back into focus
A lengthy outage for AWS in December brought many enterprise apps and services to a halt, but it also provided a moment of reflection on DR plans and data availability.
A service outage at a major AWS data center, followed by two additional incidents this month, has shone a light on enterprise disaster recovery plans.
AWS reported the first outage Dec. 7 at the company's main East Coast data center, located in northern Virginia. The incident, which lasted about seven hours, took down popular apps and web services, and it occurred less than a week after the hyperscaler's annual re:Invent conference, where AWS unveiled a new disaster recovery (DR) service and encouraged customers to adopt an AWS-centered computing future.
On Dec. 15, AWS reported a second outage, this time affecting services running in the company's West Coast data center in Oregon. On Wednesday, the public cloud provider reported a third outage, again at the Virginia data center.
Although the second and third outages were significantly shorter, their collective impact has made enterprise DR planning a flashpoint.
Cloud analysts and consultants said moving to the cloud doesn't eradicate service failures, even when hyperscalers such as AWS and Microsoft Azure offer DR services and promise better uptime. Instead, enterprises need to develop their own plans for managing possible cloud outages and the ever-increasing threat of cyberattacks.
How an organization develops a plan for such events should depend on its tolerance for downtime, its budget and its willingness to put work into cloud infrastructure, analysts and consultants said.
"It depends on where [an enterprise is] coming from for their existing disaster recovery strategy, where they need to go for different recovery points and times they can tolerate," said Krista Macomber, senior analyst at Evaluator Group. "At minimum, you ideally want to be able to fail over to a different region within that cloud provider."
Experts also warned that companies purchasing disaster recovery services must distinguish between data availability and data protection. The inability to access data can be financially damaging, but the loss of data to user error or malice can be financially devastating.
"There's a big difference between backup recovery for data loss and availability of the service," said Christophe Bertrand, a senior analyst at Enterprise Strategy Group, a division of TechTarget. "You are always responsible for your backup and your data. I'd say you're also responsible for your applications."
Outage cause and effect
AWS confirmed the Dec. 7 outage was caused by a series of escalating networking issues within the company's own internal network in the East Coast data center in Virginia. These failures bled into customer-facing technologies such as the control planes for AWS services and AWS APIs.
AWS reported the initial networking issues began following an automated activity at 7:30 a.m. PST. The company considered the incident resolved and services restored by about 2:30 p.m. PST, although lingering issues persisted in some AWS services until 7:30 p.m. PST. AWS declined a follow-up interview with SearchDisasterRecovery.
Wednesday's outage was similarly due to network connectivity issues, along with failed launches of Amazon Elastic Compute Cloud (EC2) instances, according to AWS. The cloud services provider first noted the outage around 4:30 a.m. PST and said most issues were resolved by 7 a.m. PST. At the time of publication, AWS reported that some services, such as ElastiCache and Redshift, would continue to have issues until a full recovery was achieved.
The AWS outages affected a massive number of enterprises and end users due to the hyperscaler's dominance of the market; AWS holds about one-third of the public cloud market, according to Synergy Research Group.
"People have miscalculated the interdependency of [cloud services]," Bertrand said. "Best practices don't change because you're in the cloud."
Instead, the old on-premises habits of redundancy and rolling services over into other data centers and regions remain important for cloud-native storage and computing.
"On premises has always had the potential of going down," said Ray Lucchesi, president of Silverton Consulting. "If you're going to move your IT activity out to the cloud, that doesn't reduce the need for disaster preparedness and capabilities."
Despite the public impact and visibility, an outage of this scale is fairly uncommon for AWS, according to Marc Staimer, president of Dragon Slayer Consulting.
"Most of these outages never get reported because they're so short," Staimer said. "Candidly, [Amazon does] a good job keeping their data centers up."
Lucchesi noted that a hyperscaler's history of uptime is cold comfort, even for an enterprise that has yet to experience an outage.
"Disaster recovery is a necessary evil if you're going to do data processing in this day and age," he said. "A 20 second outage can be millions [of dollars] for some people."
An ounce of preparation
Macomber said the cross-region replication in the new AWS Elastic Disaster Recovery (EDR) service, which AWS sells and markets as the DR of choice for the cloud, can help customers better manage AWS-specific outages. Those customers should also invest in third-party DR software to protect not only against service outages, but also against other threats, including ransomware and malware, Macomber said.
AWS EDR can replicate across AWS regions, but it ultimately cannot move data outside the AWS network or protect the data itself against cyberattacks, she noted.
"[AWS is] responsible for the availability of the service itself, the data protection is the customer's responsibility," she said.
Third-party DR services, such as Zerto In-Cloud, an AWS-focused disaster recovery orchestration and automation tool for EC2, can help roll data over into new regions in case of outages. Other popular choices, Macomber noted, include DR services from VMware and Cohesity.
"There isn't one silver bullet, optimal solution," she said. "The big thing we advise customers is taking a step back and evaluating the applications [and] the workloads."
Multi-cloud failover is also a possibility, moving data and workloads from AWS to another major hyperscaler such as Microsoft Azure or Google Cloud Platform in case of an outage.
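One common mechanism for that kind of failover is DNS-based: a health check watches the primary endpoint, and traffic shifts to a standby in another cloud when it fails. The sketch below is a rough illustration using boto3 and Amazon Route 53 rather than any vendor's packaged DR product; the hosted zone ID, domain names and endpoints are all hypothetical.

import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0EXAMPLEZONEID"  # hypothetical hosted zone ID

# Health check that probes the AWS-hosted primary endpoint.
health = route53.create_health_check(
    CallerReference="dr-primary-check-001",  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",  # hypothetical
        "Port": 443,
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

def upsert_failover_record(identifier, role, target, health_check_id=None):
    # Route 53 serves the PRIMARY record while its health check passes
    # and falls back to the SECONDARY record when it does not.
    record = {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": identifier,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

upsert_failover_record("primary", "PRIMARY", "lb.aws.example.com",
                       health["HealthCheck"]["Id"])
upsert_failover_record("secondary", "SECONDARY", "standby.othercloud.example.com")

DNS failover of this sort only redirects traffic; the standby still needs current data, which is where replication and backup tooling comes in.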
Multi-cloud strategies aren't without significant challenges, according to IDC analyst Andrew Smith.
He said major hyperscalers and private cloud providers, such as Oracle or IBM, don't typically run services compatible with one another's without additional work by IT admins. Relying on several clouds, however, can help an enterprise avoid single-vendor lock-in and protect data when one cloud suffers a catastrophic event or when user access on one cloud is compromised by ransomware or other cyberthreats.
Common challenges facing multi-cloud setups also include a lack of consistent management tools across clouds, a lack of unified security services and the ever-looming question of storage and migration costs, according to Smith.
"A lot of [multi-cloud strategies are] to mitigate the threat of vendor lock-in or catastrophic outages, but there's a host of challenges that come with multiple cloud providers," Smith said. "We don't see that multi-cloud nirvana happening. … There isn't even parity of service across clouds."
Enterprises should review and codify their service-level agreements with a hyperscaler to ensure compensation is available in case of an outage, especially since many will likely rely on just one cloud provider.
"There's a little bit of collective bargaining that enterprises have to fall back on," Smith said.
Overshadowed by ransomware
Disaster recovery against outages is important, but analysts and consultants warned that ransomware, malware and other malicious actions should take priority over public cloud outages when developing DR strategies.
"Ransomware will continue to be top of mind," Macomber said. "Those hacks continue to evolve. It's almost preparing for the inevitable."
Staimer said a certain level of paranoia about infrastructure safety is important, as hyperscalers typically offer no protections or guarantees once data is accessed, and "one disgruntled employee can set you back years."
"Ultimately you can never contractually remove your own responsibility," he said. "Loss of revenue is much greater than the cost of protecting against it. It's insurance."
Tim McCarthy is a journalist living in the North Shore of Massachusetts. He covers cloud and data storage news.