Put your IT disaster recovery plan to the test
Operations teams need to make DR planning and preparation a priority every day of the year -- not just at test time. Here's how to keep apps and systems out of harm's way.
One of the central ideas of DevOps is that if you do hard things often, you grow comfortable doing them. Over time, a once-difficult task becomes routine. The same thinking should be applied to an IT disaster recovery plan. Planning for DR should be an everyday practice for all IT teams. When DR is part of normal operations, no massive upheaval is required to accommodate DR readiness, and both IT and the business will have more confidence in the organization's DR capability.
Historically, DR testing was hard, manual and expensive. A disaster recovery plan was tested no more often than legal compliance required, if at all. When DR testing was only an annual event, tests routinely failed because the documented processes had not been updated to reflect a year of application and infrastructure changes. Because the tests kept failing, neither IT nor the business had any faith in their DR readiness. Infrequent testing of an IT disaster recovery plan leads to significant problems at every test. A better approach is to test more often and, over time, have fewer problems.
DR test strategies
DR automation products, such as VMware's Site Recovery Manager, take a large amount of manual work out of DR testing, which reduces the cost for DR readiness tests.
A nice side effect of virtualization is that DR tests can be run for subsets of the infrastructure without any impact on production applications. Rather than needing to fail over hundreds of applications at once for test purposes, each application can be verified by itself. This further reduces the cost for testing DR and assures readiness.
When you reduce the complexity and cost of testing your IT disaster recovery plan, you can make testing routine, and the combination of virtualization and DR automation has done exactly that. Any issue uncovered by a DR test can be addressed immediately, and the DR process can be rerun until all problems are resolved. By making DR awareness part of everyday practice, an IT team can expose potential problems before they become actual problems.
The next progression beyond regular disaster testing is to make DR awareness a standard part of design, implementation and operational processes. Just as security becomes a real problem when it is added after the fact, DR is much easier to deliver when it is built in from the start.
One element is to identify DR requirements as part of the specification for any application that is brought into the organization. Defining the business requirements for recovery time objective (RTO) and recovery point objective (RPO) for each component during the design phase is far more practical than trying to get the same information after the product is deployed and the project team has moved on to new projects.
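As a rough illustration of capturing these targets at design time, the Python sketch below records a recovery time objective and recovery point objective for one component and checks them against observed values. The record layout, the component name and the numbers are all hypothetical, and the check assumes that worst-case data loss is roughly one replication or backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrRequirement:
    """Business DR targets for one application component (hypothetical schema)."""
    component: str
    rto: timedelta  # recovery time objective: maximum tolerable downtime
    rpo: timedelta  # recovery point objective: maximum tolerable data loss


def unmet_targets(req, measured_recovery, replication_interval):
    """Return the names of the targets this component currently misses.

    Assumption: worst-case data loss is roughly one replication interval.
    """
    misses = []
    if measured_recovery > req.rto:
        misses.append("RTO")
    if replication_interval > req.rpo:
        misses.append("RPO")
    return misses


# Illustrative values only; real targets come from the business.
order_db = DrRequirement("order-db", rto=timedelta(hours=4), rpo=timedelta(minutes=15))

# A two-hour recovery meets the RTO, but a daily backup cannot meet a 15-minute RPO.
print(unmet_targets(order_db, timedelta(hours=2), timedelta(hours=24)))
```

Keeping a record like this alongside the application specification means the targets still exist, in a form that can be checked, long after the project team has moved on.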
The design phase is also where dependencies should be identified. By doing so, we know what other infrastructure and application components need to be protected in order to safeguard a new application. These dependencies will also dictate where the new application fits into the existing IT disaster recovery plan and procedures.
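One way to make those dependencies usable is to record them as a simple map and derive from it both the set of components that must be protected and a workable recovery order. The sketch below does this with Python's standard graphlib module (Python 3.9 or later). The component names and the dependency map are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the components in its set.
DEPENDS_ON = {
    "web-portal": {"order-api", "auth-service"},
    "order-api": {"order-db", "auth-service"},
    "auth-service": {"directory"},
    "order-db": set(),
    "directory": set(),
    "reporting": {"order-db"},
}


def protection_scope(app, depends_on):
    """Return the app plus everything it depends on, directly or indirectly."""
    scope, stack = set(), [app]
    while stack:
        name = stack.pop()
        if name not in scope:
            scope.add(name)
            stack.extend(depends_on.get(name, ()))
    return scope


def recovery_order(app, depends_on):
    """Order the scope so each component comes after its dependencies."""
    scope = protection_scope(app, depends_on)
    graph = {name: depends_on.get(name, set()) & scope for name in scope}
    return list(TopologicalSorter(graph).static_order())


# Everything that must be protected for web-portal to be recoverable.
print(sorted(protection_scope("web-portal", DEPENDS_ON)))
# One valid order in which to bring those components back.
print(recovery_order("web-portal", DEPENDS_ON))
```

The exact order can vary between runs where components do not depend on one another, but a dependency always appears before the component that needs it. TopologicalSorter also raises an error if the map contains a circular dependency, which is worth discovering at design time rather than during a failover.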
Another element is planning infrastructure changes with full DR awareness. The obvious part is ensuring DR capabilities are preserved when storage or virtualization is upgraded, but other changes can alter DR readiness, too. More of the products being deployed offer DR capabilities in addition to whatever other purposes they serve. Most hyper-converged platforms include replication, which can offer DR capabilities either as stand-alone services or integrated into existing DR automation. Modern data protection products can often be used for DR, particularly when the data protection is near continuous rather than a daily backup, as with legacy products. Using the DR capabilities of new products, where there were no DR capabilities previously, will deliver additional value from the upgrade.
The people part of an IT disaster recovery plan
The DR readiness of an application is often the responsibility of either the implementation project team or the operations team. DR readiness should be a part of both teams' processes and shared across the application lifecycle until the application is retired. The implementation team needs to ensure DR readiness is part of the implementation, but the operations team must ensure that DR readiness does not degrade over time.
It is easy for operational requirements to affect an application's DR state. VMs are sometimes migrated to non-replicated data stores because the replicated data stores have run out of space. This satisfies the immediate operational requirement, but it leaves the VMs and their applications unprotected. Similarly, a VM might be migrated to a data store that is replicated but is not part of the application's replication consistency group. Consequently, the application might fail at DR time. Including DR requirements in routine documentation will help with this awareness, as will periodic auditing of DR compliance.
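A periodic audit of this kind can be very simple. The Python sketch below compares where each protected VM currently lives against the replication layout and reports the two problems described above. The inventory records, data store names and consistency group names are hypothetical. In practice, this data would come from the virtualization and storage platforms' own inventory, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmPlacement:
    """Where a protected VM currently lives (hypothetical inventory record)."""
    vm: str
    datastore: str
    expected_group: str  # consistency group the application requires


# Hypothetical replication layout: data store -> consistency group,
# or None when the data store is not replicated at all.
DATASTORE_GROUPS = {
    "ds-repl-01": "cg-orders",
    "ds-repl-02": "cg-orders",
    "ds-repl-03": "cg-hr",
    "ds-local-01": None,
}


def audit(placements, datastore_groups):
    """Return (vm, problem) pairs for VMs whose placement breaks DR protection."""
    findings = []
    for p in placements:
        group = datastore_groups.get(p.datastore)
        if group is None:
            findings.append((p.vm, "data store is not replicated"))
        elif group != p.expected_group:
            findings.append((p.vm, f"data store is in {group}, expected {p.expected_group}"))
    return findings


placements = [
    VmPlacement("orders-web", "ds-repl-01", "cg-orders"),   # correctly placed
    VmPlacement("orders-db", "ds-local-01", "cg-orders"),   # moved when space ran out
    VmPlacement("orders-queue", "ds-repl-03", "cg-orders"), # replicated, wrong group
]

for vm, problem in audit(placements, DATASTORE_GROUPS):
    print(f"{vm}: {problem}")
```

Running a check like this on a schedule turns a problem that would otherwise surface only at DR time into a routine operational finding.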
The core of making DR awareness an everyday practice for all of IT is that people need education about the requirements and capabilities for DR. If this knowledge is held inside the backup or DR team, there is no way for the project and operations teams to incorporate DR readiness into their workdays.
Both virtualization and DevOps work best when knowledge is shared and teams can communicate freely. DR knowledge should also be freely shared. Making DR readiness a core part of standard IT processes will help to maintain continuous DR readiness and reduce the business risk of any disastrous event.