Perform an Amazon RDS failover with minimal downtime
Walk through this example of an RDS failover process using the Amazon Aurora engine to minimize production downtime and maintain performance in your cloud environment.
Public cloud providers offer many benefits, including their managed services. And if you're an AWS customer, there are plenty at your disposal.
For example, you can replace your hosted MySQL, PostgreSQL, Oracle or Microsoft SQL Server database with an Amazon Relational Database Service (RDS) instance. Amazon RDS is a robust service with built-in features, such as high availability through multi-availability-zone (AZ) replication, automated upgrades and snapshots.
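As a minimal, hedged sketch of what provisioning such an instance looks like, the boto3 call below creates a MySQL RDS instance with a synchronous standby in a second AZ. All identifiers, the instance class and the credentials are placeholders, not values from this article.

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Hypothetical example: provision a MySQL RDS instance with a multi-AZ
# standby and automated snapshots. Names and credentials are placeholders.
rds.create_db_instance(
    DBInstanceIdentifier='example-mysql',      # placeholder identifier
    Engine='mysql',
    DBInstanceClass='db.m5.large',
    AllocatedStorage=100,                      # storage in GiB
    MasterUsername='admin',
    MasterUserPassword='change-me-please',     # store real secrets securely
    MultiAZ=True,                              # creates the standby replica
    BackupRetentionPeriod=7                    # enables automated snapshots
)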
While many users rely on Amazon RDS with multi-AZ failover for their production workloads, they rarely check whether the switch to a standby database instance causes production downtime or performance degradation. There is always some downtime -- even if it's only a few seconds -- because failover, although automatic, is not instantaneous; the service must promote the standby and repoint the instance's DNS record. It's important to know how Amazon RDS failover affects your workloads.
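One way to find out whether a failover actually happened, and when, is to query the RDS event stream. The sketch below -- with a placeholder instance identifier -- lists failover events from the last 24 hours so you can line them up against your application's error logs.

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Hypothetical check: list failover events for one instance over the
# last 24 hours (Duration is expressed in minutes).
events = rds.describe_events(
    SourceIdentifier='example-mysql',   # placeholder instance name
    SourceType='db-instance',
    Duration=1440,
    EventCategories=['failover']
)

for event in events['Events']:
    print(event['Date'], event['Message'])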
Explore Amazon Aurora
Let's discuss the Amazon RDS failover process, with the Amazon Aurora engine as an example. Aurora, Amazon's proprietary MySQL- and PostgreSQL-compatible database, offers a cost-effective option, with improved performance compared with other RDS database engines on the platform. This makes Aurora a natural choice for both production and test/dev workloads.
AWS splits the Amazon Aurora storage layer into 10 GB chunks, which are replicated six times across three AZs. Aurora can handle a loss of up to two of those copies for database writes or three copies for reads. AWS continuously scans disks that carry Aurora data and automatically uses its self-healing technology to repair data, as needed.
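To make that quorum arithmetic concrete, here is a purely illustrative sketch -- not an AWS API -- of the four-of-six write quorum and three-of-six read quorum behind those tolerances.

# Illustrative only: Aurora keeps six copies of each 10 GB segment.
# Writes need a quorum of four copies and reads need three, so writes
# survive the loss of two copies and reads the loss of three.
TOTAL_COPIES = 6
WRITE_QUORUM = 4
READ_QUORUM = 3

def can_write(lost_copies: int) -> bool:
    return TOTAL_COPIES - lost_copies >= WRITE_QUORUM

def can_read(lost_copies: int) -> bool:
    return TOTAL_COPIES - lost_copies >= READ_QUORUM

assert can_write(2) and not can_write(3)
assert can_read(3) and not can_read(4)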
On the instance level, Amazon Aurora behaves like other RDS multi-AZ deployments. When you provision the database instance, you select a multi-AZ deployment and pick the AZ where your standby replica will deploy. Aurora supports up to 15 read replicas, each assigned a promotion tier from tier-0 to tier-15. If the primary crashes, Aurora promotes the replica in the highest-priority tier -- the lowest tier number -- to the master role, so set these tiers in advance, as in the sketch below. Replicas share the same underlying storage volume as the primary instance, and you can deploy cross-region replicas if you use the MySQL-compatible edition.
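As a hedged example of how those tiers get set, the boto3 calls below create an Aurora MySQL cluster, a writer instance and a replica in promotion tier 1. Cluster names, instance classes, AZs and credentials are all assumptions to adapt for your environment.

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Hypothetical Aurora MySQL cluster; names and credentials are placeholders.
rds.create_db_cluster(
    DBClusterIdentifier='example-aurora-cluster',
    Engine='aurora-mysql',
    MasterUsername='admin',
    MasterUserPassword='change-me-please'
)

# Primary (writer) instance in the cluster.
rds.create_db_instance(
    DBInstanceIdentifier='example-aurora-writer',
    DBClusterIdentifier='example-aurora-cluster',
    DBInstanceClass='db.r5.large',
    Engine='aurora-mysql',
    PromotionTier=0                  # highest failover priority
)

# Read replica placed in a lower promotion tier and a different AZ.
rds.create_db_instance(
    DBInstanceIdentifier='example-aurora-replica-1',
    DBClusterIdentifier='example-aurora-cluster',
    DBInstanceClass='db.r5.large',
    Engine='aurora-mysql',
    PromotionTier=1,
    AvailabilityZone='us-east-1b'    # placeholder AZ
)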
If your primary Amazon Aurora instance crashes, the database restart takes less than 60 seconds -- and you won't need to replay the redo log. While the instance restarts, a CNAME change in Route 53 points the cluster endpoint at the replica promoted from the selected tier, which typically takes around 30 seconds. If you promote a cross-region replica to master, expect a longer delay. So, test this promotion in your environment, as in the sketch below.
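A straightforward way to run that test is to trigger a failover deliberately. The call below -- with placeholder identifiers, and best run against a test cluster first -- forces the cluster to promote a specific replica so you can measure how long your application loses connectivity.

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Hypothetical failover drill: force the cluster to promote a chosen replica.
rds.failover_db_cluster(
    DBClusterIdentifier='example-aurora-cluster',
    TargetDBInstanceIdentifier='example-aurora-replica-1'
)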
While some say the Amazon RDS failover times mentioned above are intolerably slow, running your own database nodes on Amazon EC2 instances is no faster. In fact, that approach will probably create more headaches, because when you build a database cluster from scratch, you must create and test a proper backup schedule, apply minor version upgrades, expand underlying disk volumes and scale vertically yourself.
More Amazon RDS failover, management options
At AWS re:Invent 2017, the cloud provider unveiled Amazon Aurora Multi-Master, which enables you to create multiple read/write masters and improves Aurora's availability. But the service is still in preview mode and isn't recommended for production use.
To ensure your multi-AZ Amazon Aurora instance runs smoothly, create Amazon CloudWatch alarms tied to the Aurora metrics that track performance and availability, and deploy at least two read replicas, as in the example below.
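As one hedged example, the alarm below watches the AuroraReplicaLag metric for a cluster and notifies an SNS topic when average replica lag stays above one second. The cluster name, topic ARN and threshold are assumptions to tune for your workload.

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# Hypothetical alarm: alert when average replica lag exceeds 1,000 ms
# for three consecutive one-minute periods. Cluster name, SNS topic
# and threshold are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName='example-aurora-replica-lag',
    Namespace='AWS/RDS',
    MetricName='AuroraReplicaLag',
    Dimensions=[{'Name': 'DBClusterIdentifier', 'Value': 'example-aurora-cluster'}],
    Statistic='Average',
    Period=60,
    EvaluationPeriods=3,
    Threshold=1000.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:example-alerts']  # placeholder ARN
)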
Ultimately, while Amazon RDS is a managed AWS offering, you can't just provision your database instance once and then leave it be. However, the upcoming Amazon Aurora Serverless service could reduce management overhead even further by, for example, automating capacity management. In fact, in the next couple of years, cloud-managed services could turn increasingly serverless, which might change the way you provision cloud resources altogether.