How an AWS multi-region architecture can strengthen DR
Meet AWS outages head on by learning how to build a multi-region architecture that achieves resiliency in the event of disaster.
For AWS admins who need to build a multi-region strategy to keep their cloud deployments resilient in the event of an outage, various approaches come with tradeoffs between cost, complexity and effectiveness.
A multi-region infrastructure setup is a complex undertaking. Organizations need to plan for various factors, including additional services, maintenance and the staffing required to manage complexity. Moving data back and forth between regions can drive up costs and complexity.
"The two main benefits of a multi-regional strategy are the performance of proximity to end users and resilience," said Tim Banks, principal cloud economist at The Duckbill Group, an AWS cost management consultancy.
The importance of staying active-active
AWS provides a range of backup and recovery options for users. At a minimum, admins need to replicate data across multiple regions, with an appropriate backup plan. This protects operations if a complete AWS region goes down.
The active-active architecture, which is growing in popularity, supports high availability for both data and business processes. It's also the costliest, said Mike Nolan, principal architect at IT consultant SPR.
In an active-active approach, systems are always on in multiple regions. If AWS loses a region, the setup running in another region or regions picks up the slack. This arrangement means that recovery point objectives (RPOs) and recovery time objectives (RTOs), two critical benchmarks of any disaster recovery (DR) plan, are never at risk.
Eventually, the region returns online, and the active-active setup needs to sync up. Data in the region that experienced the problem needs to catch up. To understand the catch-up path, you will need to do some planning and account for the particular software and systems used in the deployment.
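As a rough illustration of one possible catch-up path, the Python sketch below merges the copy held by a recovered region with the copy from a region that stayed healthy, keeping the newest version of each record. The last-writer-wins rule, the record layout and the sample keys are all assumptions made for illustration. The real path depends on the database or replication tool in the deployment, and timestamp-based merging is only safe when clocks across regions can be trusted.

```python
from typing import Dict, Tuple

# Each record maps a key to (value, last_modified_timestamp).
# This layout is hypothetical; real systems track versions in their own way.
Records = Dict[str, Tuple[str, float]]


def catch_up(recovered: Records, healthy: Records) -> Records:
    """Merge two regional copies, keeping the newest version of each key."""
    merged = dict(recovered)
    for key, (value, ts) in healthy.items():
        # Take the healthy region's record if it is new or more recent.
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged


# State frozen in the region that went down.
recovered_region = {"order-1": ("pending", 100.0), "order-2": ("pending", 101.0)}
# State in the region that kept serving traffic during the outage.
healthy_region = {"order-1": ("shipped", 250.0), "order-3": ("pending", 260.0)}

synced = catch_up(recovered_region, healthy_region)
for key in sorted(synced):
    print(key, synced[key])
```

In this sketch, records that changed during the outage overwrite the stale copies, while records that only the recovered region knows about are preserved.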
Active-passive strategies are less resilient than active-active ones, but they are also less complex and cheaper. Common active-passive strategies include backup and restore, pilot light and warm standby. Backup and restore, wherein the backup deployment does not exist until needed, is the least expensive. Pilot light and warm standby are more complex active-passive variants. They keep some application infrastructure primed or running, with data kept up to date, in preparation to spin up an active deployment.
An active-passive strategy typically involves data-only replication via backups. AWS admins can set up infrastructure as code and DevOps pipelines to deploy the restored infrastructure and applications quickly when needed. If a disaster occurs, these automation strategies get the core network and server platforms online. Data is then recovered, and the application and configuration information in the pipelines gets applications up and running. Even with automation, recovery takes longer in an active-passive strategy than in an active-active approach.
Consider a range of thresholds as part of your overall resiliency strategy. While active-active DR delivers the best RPO and RTO, it isn't always the right choice.
"Any sizeable organization will have a hard time justifying a single active-active approach. It adds complexity and increases cost in noncritical systems," Nolan said.
Organizations can create system tiers based on combinations of RTOs and RPOs for different aspects of their overall AWS use. Core network and security make up tier 0, because they are critical to all aspects of getting systems working again. From there, the lower the tier number, the stricter both the RTO and RPO are. Tier 1 can include any customer-facing, revenue-driving system the business needs to operate, for example. Use tiers to quantify the value of various DR strategies for workloads.
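The tiering idea can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the cutoff values in minutes, the workload names, the choice to tier on the stricter of RTO and RPO, and the mapping from tier to DR strategy. None of these figures come from Nolan or AWS, and real thresholds should come from the business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable window of data loss
    foundational: bool = False  # core network and security


def assign_tier(w: Workload) -> int:
    """Return a tier number; lower means more critical.

    The cutoffs below are hypothetical placeholders.
    """
    if w.foundational:
        return 0
    strictest = min(w.rto_minutes, w.rpo_minutes)
    if strictest <= 15:
        return 1
    if strictest <= 240:
        return 2
    return 3


# Hypothetical mapping from tier to a DR strategy discussed in the article.
STRATEGY_BY_TIER = {
    0: "active-active",
    1: "active-active",
    2: "warm standby or pilot light",
    3: "backup and restore",
}

# Hypothetical workloads used only to exercise the function.
workloads = [
    Workload("core-network", 5, 5, foundational=True),
    Workload("checkout-api", 10, 1),
    Workload("reporting", 240, 60),
    Workload("internal-wiki", 1440, 1440),
]

plan = {w.name: STRATEGY_BY_TIER[assign_tier(w)] for w in workloads}
for name, strategy in plan.items():
    print(f"{name}: {strategy}")
```

Keeping the tier rules in one small function makes it easier to review them with business owners and to change the cutoffs without touching the rest of the DR tooling.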
The keys to application resiliency
A lot goes into protecting applications from downtime resulting from an AWS regional outage.
Start with the entry point to an application. "The front door to your application is DNS," Banks said. To keep traffic flowing to available services, the DNS must always point to accessible targets.
Next, set up health checks and monitoring for downtime, errors or other service degradations. Automate responses to these service interruptions and performance problems.
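One way to tie DNS to health checks is Route 53 failover routing, where a primary record is served while its health check passes and a secondary record takes over when it fails. The Python sketch below only builds the change request as a plain dictionary. The domain name, IP addresses and health check ID are hypothetical placeholders, and the helper function is not part of any AWS library.

```python
def failover_change_batch(record_name, primary_ip, secondary_ip,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch for a PRIMARY/SECONDARY failover pair."""

    def record(set_id, role, ip, health_check_id=None):
        rrset = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": set_id,  # distinguishes records with the same name
            "Failover": role,         # "PRIMARY" or "SECONDARY"
            "TTL": ttl,               # a short TTL lets clients pick up a failover sooner
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id is not None:
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover routing between two regions",
        "Changes": [
            record("primary-region", "PRIMARY", primary_ip, primary_health_check_id),
            record("secondary-region", "SECONDARY", secondary_ip),
        ],
    }


# All values below are placeholders for illustration.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10",
    primary_health_check_id="hypothetical-health-check-id",
)
print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])

# With boto3 installed and credentials configured, the batch could be applied with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```

Building the request separately from the API call keeps the routing policy easy to review and to test without touching a live hosted zone.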
When there is a regional outage, you won't be the only AWS customer clamoring for capacity in other places. Make sure that you've pre-provisioned enough resources to handle the failover traffic in other regions, Banks said.
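The arithmetic behind pre-provisioning can be sketched as follows. This Python example assumes traffic is spread evenly across regions and that instances are interchangeable, which real workloads may not satisfy; the instance counts are made up for illustration.

```python
import math


def instances_per_region(peak_total_instances: int, regions: int,
                         tolerated_region_failures: int = 1) -> int:
    """Instances each region needs so the survivors can absorb the full peak load."""
    survivors = regions - tolerated_region_failures
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return math.ceil(peak_total_instances / survivors)


# Hypothetical workload: 120 instances are needed at peak across all regions.
for region_count in (2, 3, 4):
    needed = instances_per_region(120, region_count)
    print(f"{region_count} regions: {needed} instances per region")
```

Under these assumptions, spreading the workload across more regions lowers the headroom each region has to carry, because the load of a failed region is shared among more survivors.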
It is vital to build for failure at the application level, Nolan said. Ensure users are not left hanging if an aspect of a system is in error or unavailable. Consider chaos engineering, a game of unpredictably and intentionally breaking an application environment, as a practice to ensure your resiliency approach fulfills service-level agreements.
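A toy version of this practice, sketched in Python, injects failures into a dependency and checks that callers still get a usable answer. The function names, the failure rate and the fallback content are hypothetical, and real chaos experiments usually run against infrastructure rather than in-process functions.

```python
import random


class RegionUnavailable(Exception):
    """Raised when a simulated regional dependency fails."""


def make_flaky(func, failure_rate, rng):
    """Wrap func so that it fails with the given probability (fault injection)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RegionUnavailable("injected failure")
        return func(*args, **kwargs)
    return wrapper


def with_fallback(primary, fallback):
    """Return a callable that serves the fallback when the primary fails."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except RegionUnavailable:
            return fallback(*args, **kwargs)
    return call


def live_recommendations(user_id):
    return f"personalized list for {user_id}"


def cached_recommendations(user_id):
    return "generic best-seller list"


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky = make_flaky(live_recommendations, failure_rate=0.5, rng=rng)
recommend = with_fallback(flaky, cached_recommendations)

responses = [recommend("user-1") for _ in range(10)]
degraded = sum(r == "generic best-seller list" for r in responses)
print(f"{degraded} of {len(responses)} requests served from the fallback")
```

The point of the exercise is that every request receives a response, degraded or not, rather than an error.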
Approaches to data replication
Admins also need a plan to ensure data consistency, reliability and integrity. Active-active strategies across regions require a synchronous replication approach. By contrast, basic active-passive DR strategies can rely on asynchronous data replication when RTO and RPO requirements are less strict.
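The difference between the two replication modes can be shown with a toy model. The Python sketch below is not how any AWS service implements replication. The class and method names are invented for illustration, and the model ignores the extra write latency that synchronous replication across regions adds.

```python
class ReplicatedStore:
    """Toy two-region key-value store; mode is 'sync' or 'async'."""

    def __init__(self, mode):
        if mode not in ("sync", "async"):
            raise ValueError("mode must be 'sync' or 'async'")
        self.mode = mode
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet shipped to the replica

    def write(self, key, value):
        self.primary[key] = value
        if self.mode == "sync":
            # Acknowledge only after the replica also has the write.
            self.replica[key] = value
        else:
            # Acknowledge immediately and replicate later.
            self.pending.append((key, value))

    def replicate(self):
        """Ship queued writes to the replica (the async background job)."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def fail_over(self):
        """Primary region is lost; return surviving data and the lost keys."""
        lost = [key for key, _ in self.pending]
        self.pending.clear()
        return dict(self.replica), lost


sync_store = ReplicatedStore("sync")
sync_store.write("a", 1)
sync_store.write("b", 2)
print("sync:", sync_store.fail_over())

async_store = ReplicatedStore("async")
async_store.write("a", 1)
async_store.replicate()   # "a" reaches the replica
async_store.write("b", 2)  # "b" is still queued when the region fails
print("async:", async_store.fail_over())
```

In the asynchronous case, any write still in the queue when the primary region fails is lost, which is why replication lag effectively sets the achievable RPO.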
Consider appropriate database technology usage and determine if your availability requirements warrant the cost and complexity of choosing an active-active strategy for your architecture, Nolan said.
Conventional relational database management often requires an enterprise-level license to enable the multi-replication model required for multi-region active-active applications. Licenses for a relational database management system, or RDBMS, can be a significant cost factor in your overall approach to protecting AWS workloads.
NoSQL databases are better geared for these multi-region scenarios without heavy licensing costs. They are often built in as AWS offerings, such as Amazon DynamoDB. However, the transaction model of NoSQL is different from that of RDBMS offerings. NoSQL databases offer what's sometimes referred to as eventual consistency, while RDBMSes are considered always consistent.
"This can have great implications on how your data present to users and your ability to meet user requirements," Nolan said. It also influences your approach to application architecture for AWS workloads.
Optimize network infrastructure
Assess the networking and infrastructure elements required to ensure DR and resiliency in a multi-region strategy. An infrastructure as code (IaC) approach can automate aspects of environment setup and enforce best practices. And yet, organizations commonly underestimate their IaC approach regarding DR, Nolan said.
Consider how frequently layers of your infrastructure architecture change in relation to the systems running within the regions. Avoid bundling static aspects of infrastructure with frequently changing parts. VPC subnets do not tend to change often, and changing them can have implications for network addressing. By contrast, security groups change many times, especially in evolving systems. Don't tie these configurations into single sets of scripts. Ensure automation does not hamper your ability to change the deployment as needed.
At the same time, be careful about hardcoding things that change across regions. Even the most resilient infrastructure will be in trouble, Banks said, if the database client in your code has a hardcoded regional endpoint.
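A minimal Python sketch of the alternative reads the region from the environment and derives the endpoint from it. The `AWS_REGION` variable name and the default region are assumptions about how a deployment passes its region to the application, and the URL pattern shown applies to standard commercial AWS regions.

```python
import os


def dynamodb_endpoint(default_region: str = "us-east-1") -> str:
    """Build the regional DynamoDB endpoint from the environment.

    AWS_REGION is assumed to be set by the deployment pipeline or runtime;
    the default is only a fallback for local development.
    """
    region = os.environ.get("AWS_REGION", default_region)
    return f"https://dynamodb.{region}.amazonaws.com"


# The same code follows the workload when it is deployed to another region.
os.environ["AWS_REGION"] = "eu-west-1"
print(dynamodb_endpoint())
```

Because the region is configuration rather than code, a failover deployment in another region picks up the local endpoint without a code change.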
Also, inventory your networking infrastructure, said Gavin McMurdo, chief technical adviser at iStreamPlanet, a video streaming service. Each AWS offering comes with its own burst or sustained network throughput limits. It can be tempting to ignore these limits, which generally don't matter for something like a storage or database service. The limits could be a big deal in a DR scenario, McMurdo noted, when things are suddenly rolled over to another region.
Investigate how any dedicated network fiber ports are connected. McMurdo found it essential to work with AWS technical advisers to ensure that the fiber terminates in different devices, to avoid a single point of failure. Sometimes fiber ports get moved around by AWS staff as they consolidate and address failures. An upfront conversation about network design can reduce the risk that AWS staff introduce a single point of failure in the process of fixing another problem, he said.
Specific AWS services
Nolan recommended shortlisting several AWS services for implementation in an active-active architecture, broken out here by use:
Application execution
- Amazon CloudFront for regional or global content delivery networks
- EC2 Auto Scaling for traditional IaaS systems and application scenarios
- AWS Lambda for PaaS application scenarios
Data
- S3, RDS, DynamoDB and Amazon DocumentDB for data storage and access solutions, which have snapshotting capabilities
- Elastic Block Store and Elastic File System snapshots for associated disks and shared file systems
Infrastructure
- AWS Backup for a consolidated dashboard to manage backup and recovery aspects of the tools listed above and others
- AWS CloudFormation to manage IaC
- AWS CodePipeline to manage push-button or automatic CI/CD pipelines of both IaC and applications
- Route 53 for multi-region DNS routing
- Security, identity and compliance tools
Know the costs
An active-active multi-regional architecture on AWS will cost significantly more than a single active region, Duckbill's Banks said. In addition to the cost of operating additional compute and storage resources, data transfer isn't free.
Consider the use of capacity reservations or reserved instances. With reserved instances, an organization makes a financial commitment -- paid upfront, partially upfront or monthly. Capacity reservations are essentially an attempt to call dibs on existing capacity in an availability zone, Banks said, and they do not require a fixed financial commitment.
Also, designing, monitoring and maintaining a complex infrastructure requires an organization to invest staff time. These expenses do not show up on the AWS invoice, but they exist.
A good practice in a multi-region setup is to minimize the amount of data that needs to move back and forth between regions.
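A back-of-the-envelope estimate shows why. The Python sketch below multiplies daily replication volume by a per-gigabyte rate. The rate and the volumes are hypothetical placeholders, not AWS prices; actual inter-region transfer rates vary by region pair and should be taken from current AWS pricing.

```python
def monthly_transfer_cost(gb_per_day: float, rate_per_gb: float, days: int = 30) -> float:
    """Estimate the monthly cost of moving data between regions."""
    return round(gb_per_day * days * rate_per_gb, 2)


HYPOTHETICAL_RATE = 0.02  # placeholder dollars per GB, not a quoted AWS price

full_copy = monthly_transfer_cost(500, HYPOTHETICAL_RATE)    # replicate everything daily
changes_only = monthly_transfer_cost(50, HYPOTHETICAL_RATE)  # replicate only changed data

print(f"full copy: ${full_copy:.2f} per month")
print(f"changes only: ${changes_only:.2f} per month")
```

Whatever the real rate is, the cost scales linearly with volume, so replicating only changed data reduces the bill in direct proportion.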