Best practices for a strong disaster recovery testing strategy
Testing is a critical part of the disaster recovery planning process. Without proper testing, IT teams might miss crucial updates or make avoidable missteps in a recovery.
Good disaster recovery testing comes from thorough planning and preparation. An untested plan is another crisis waiting to happen, so it is critical to have a disaster recovery testing strategy in place.
Full disaster recovery plan testing is not something many organizations can do frequently. Planning and executing a disaster recovery test requires two valuable resources: time and money. For that reason alone, DR teams must be realistic about how many tests they can execute each year. Most major applications are tested end to end once a year at most, and some only once every three years. The right frequency depends on the DR team's requirements.
This places disaster recovery teams in a dilemma: If they don't test often enough, critical applications or processes could miss out on necessary updates. However, if they spread themselves too thin with extraneous testing, they risk exhausting the time and money they have available. A testing strategy must be almost as thorough as the recovery itself. This will ensure DR teams don't miss out on any required changes and can use even limited resources to the fullest.
To get the most out of a disaster recovery testing strategy, consider incorporating these best practices.
Determine the type of test and plan accordingly
Disaster recovery testing comes in two types: full DR tests and component tests. The difference is that component tests are smaller in scope and cover only a subset of the application. Most component tests are effectively smoke tests that help ensure the smaller parts of the overall application are working before the team commits significant resources to a full-blown DR test.
Before talking about the technical aspects of the test, it's critical to understand what is being tested. Is it a full interactive disaster recovery test with users being asked to log in, perform in a crisis scenario and prove that the application works as expected? Or is it enough to verify that the systems and software are available? Depending on the tools or processes in an organization's DR plan, it might be necessary to perform a full run-through of the plan to test how it will run in a crisis.
Ensure everything is in place early -- and double-check
It might seem trivial, but failing to check key components before running a full test is one of the most common and preventable mistakes organizations make. The point of a DR test is to ensure things work as expected, but when a problem can be found and fixed outside the full test, it's worth checking that everything is in place beforehand. This is one area where component testing can come in handy.
A frequent example is when an IT team discovers that required firewall ports are not open. This is something they might find during the full DR test, but it's still easier to check ahead of time to preserve time and resources. Remediating firewall issues can be a frustrating process, and it's likely not something security and networking staff want to deal with in the middle of running an end-to-end DR test.
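As an illustration of checking ports ahead of time, the following is a minimal Python sketch that attempts a TCP connection to each host and port the recovery depends on and reports the ones it cannot reach. The function names, host names and port numbers are hypothetical placeholders, not part of any particular DR tool. A successful connection from one machine only shows that the path from that machine is open, so the check should run from the network location the recovered systems will actually use.

```python
import socket


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def unreachable(targets, timeout=3.0):
    """Return the (host, port) pairs from targets that could not be reached."""
    return [(host, port) for host, port in targets
            if not check_port(host, port, timeout)]


if __name__ == "__main__":
    # Hypothetical targets; replace with the hosts and ports in the DR plan.
    required = [
        ("dr-db.example.internal", 5432),
        ("dr-app.example.internal", 443),
    ]
    for host, port in unreachable(required):
        print(f"BLOCKED or DOWN: {host}:{port}")
```

An empty result means every listed port accepted a connection. Anything printed can be handed to the security and networking staff before the full test is scheduled.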
Good documentation is evergreen
Good documentation is essential. If a DR test is done by less experienced staff, they might face and resolve several problems along the way. However, if they don't document those issues and the remediations, the loss of that information can significantly slow the next DR test or a real recovery.
There are four types of documentation DR teams must have for a strong testing strategy:
- The current DR plan as written, with discrete steps and a schedule.
- Notes on any issues that came up during testing and how they were fixed. If there was a temporary workaround, outline what it was.
- Detailed documentation of the testing process. This should include what is being tested and by whom.
- Admin sign-off on test completion.
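One way to keep these four items together is a single structured record per test. The Python sketch below is an assumption about how such a record might look, not a prescribed format; the class names, field names and the `missing_items` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Issue:
    """A problem found during the test and how it was handled."""
    description: str
    fix: str = ""
    temporary_workaround: str = ""


@dataclass
class DRTestRecord:
    """Hypothetical record bundling the four kinds of documentation."""
    plan_version: str              # which version of the DR plan was tested
    test_date: date
    scope: str                     # what is being tested
    testers: list                  # who is testing it
    issues: list = field(default_factory=list)
    signed_off_by: str = ""        # admin sign-off on completion

    def missing_items(self):
        """List the documentation items that are still incomplete."""
        missing = []
        if not self.plan_version:
            missing.append("DR plan version")
        if not self.scope or not self.testers:
            missing.append("test scope and testers")
        if any(not i.fix and not i.temporary_workaround for i in self.issues):
            missing.append("fix or workaround for a logged issue")
        if not self.signed_off_by:
            missing.append("admin sign-off")
        return missing
```

A test with no logged issues is still a complete record, so an empty issue list is not reported as missing. Only an issue without a fix or workaround is flagged.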
Don't bypass thorough wrap-up and reporting
It might seem simple, but post-test reporting is where many DR teams fall short. Unfortunately, it is also the task with the most impact and visibility at the management level.
Management is rarely interested in the nuts and bolts of IT, yet summarizing success or failure at a high level is a complex undertaking. This is especially true when a production system is taken down to test a DR scenario. Just like with a real disaster, IT teams should create comprehensive documentation throughout the process to inform management of how the test went and any areas they must address.
To avoid overloading management with technical details during wrap-up, communicate high-level status in a timely way while the test is running. Keep in mind that some DR tests can be quite lengthy in execution, spanning 24 hours or more. Keeping key stakeholders apprised of what is happening keeps them satisfied and demonstrates good communication.
Stuart Burns is a virtualization expert at a Fortune 500 company. He specializes in VMware and system integration with additional expertise in disaster recovery and systems management. Burns received vExpert status in 2015.