
Disaster recovery glossary: 27 terms to know

Creating a disaster recovery plan is challenging, but critical for long-term business success. DR teams and IT can benefit from understanding these 27 terms.

In a time of crisis, no one wants to be unsure of what a business process or IT component is.

Disaster recovery is the strategy businesses use to prepare for and recover from disruption or cessation of operations. IT disaster recovery addresses data, service and infrastructure issues to restore functionality.

With such a broad array of functions and technologies involved, it's critical that employees working on disaster recovery understand the terminology. With the right information, you can either help fix an issue and move the recovery along, or keep other relevant parties informed of what is happening moment by moment.

Some common DR terms might be unfamiliar to team members who normally focus on protocols, automation tools and other technical aspects of IT. The following list defines 27 key disaster recovery terms that IT teams, DR personnel and upper management should be familiar with to plan and execute a successful recovery.

Common disaster recovery terms you should know

  1. Backup. Backups are copies of data, applications and system configurations stored on alternate media to mitigate storage drive failures. There are three major types of backups:
    • Full backup. Backs up all specified data, regardless of whether it has changed since the last full backup. It is a complete copy of all data, making restore processes relatively simple. Administrators often only do full backups weekly or monthly for time and storage space efficiency.
    • Incremental backup. Backs up data that has changed since the last full or incremental backup job, making backup jobs smaller. Restores are more complex and time-consuming because incremental backups build on each other. Administrators might do a weekly full backup with six incremental nightly backups in between.
    • Differential backup. Backs up data that has changed since the last full backup, making it faster than a full backup job, but slower than an incremental one. Differential backups consume more space than incremental backups, but less than full backups. Restore processes are quicker than with incremental backups because users only need the latest full and differential backups. Administrators typically use differential backups instead of incremental backups if fast restores are more important than fast backups.
  2. Business continuity. The process of keeping a business operational during and after a disaster. Business continuity planning identifies potential issues, points of failure, and requirements for service-level agreements and other legal obligations regarding allowable downtime. Rather than focus on full recovery of data and infrastructure, business continuity prioritizes maintaining business functions during a crisis. Business continuity and disaster recovery, or BCDR, are typically grouped together when it comes to DR planning.
  3. Business impact analysis (BIA). Evaluates the potential negative effects of a crisis on critical business operations. Conducted by internal personnel or outside consultants, a BIA involves gathering data and evaluating the resulting report. The team running the BIA typically meets with senior leadership to discuss the results and potentially implement changes to the organization's DR plan.
  4. Crisis communications. In a disaster, crisis communications are critical to dispense information to employees, stakeholders, the public and other relevant parties. Today, many organizations use automated emergency notification systems to deliver messages about an ongoing disruption.
  5. Disaster recovery as a service (DRaaS). The replication of data and services to a third party to address business continuity concerns. After evaluating alternate hot, warm or cold sites and identifying supporting failover services in-house, some organizations choose DRaaS to outsource storage and failover service hosting, which can be more cost-effective.
  6. Disaster recovery plan. Organizations create DR plans to guide personnel through the steps of restoring operations after a crisis. DR plans can address specific crises, such as floods or cyberattacks, but many organizations use an all-hazards approach. The plan document often includes goals and a statement of intent, followed by important passwords and authentication tools, contact information for employees, likely risks to prepare for, and tips for dealing with media. If the organization has ever used the plan, it should also include data from those past incidents.
  7. Disaster recovery site. A physical alternative location for performing business operations and storing data during a crisis. There are three major formats for DR sites:
    • Hot site. A fully functional location capable of taking over all operations without interruption. When considering a hot site, start by balancing the expense of a duplicate site against the need for immediate operational failover. Businesses often use an existing secondary location as the hot site.
    • Warm site. A semifunctional site requiring some manual configuration, staff and equipment before taking over business operations. It's less expensive than a hot site, but cannot take over operations as quickly. A company might use an existing secondary location for this, but will have to transfer equipment and personnel first.
    • Cold site. An alternate location containing only minimal infrastructure and requiring additional equipment and configuration before it can take over operations. Cold sites are less expensive than hot and warm sites, but require significantly more effort to use following a disaster.
  8. Failback. Shifting business processes back to a primary site or services back to a primary platform after resolving a failure. For example, if a primary database fails, the service will switch to a secondary server. Once the primary server is fixed, the service fails back to it, and the secondary server goes back into standby.
  9. Failover. Shifting business processes to an alternate site or services to an alternate platform. For example, an organization might host a primary database used daily with a secondary replica on a different server in standby mode. If the primary server crashes, database services fail over to the secondary server.
  10. Fault tolerance. Devices or services that continue functioning even if a component fails are considered fault tolerant. Fault tolerance might be provided by redundancy or other means to ensure service continuity, even if that service is degraded. Examples include RAID arrays, data replication and redundant hardware.
  11. Hot spare. A connected and available component for failover events. One common hot spare example is an additional storage disk installed inside a server using a RAID array. If the RAID array detects an unavailable disk, it immediately integrates the hot spare. Identify storage services and clusters that might benefit from hot spare devices.
  12. Known good component. A replacement component or device tested and confirmed to operate correctly. Known good parts are often swapped in to help identify failed components during troubleshooting. For example, businesses can keep a spare known good network interface card on hand to help troubleshoot network connectivity problems.
  13. Mean time between failures (MTBF). The average time a repairable device operates between failures. A higher MTBF indicates greater reliability and availability. Calculate the MTBF by dividing the total uptime by the number of failures within a specified period.
  14. Mean time to failure (MTTF). A measure of reliability and expected service life for nonrepairable devices. A higher MTTF indicates a longer life span. IT teams can use this value to select devices and equipment or to justify higher-cost and better-quality components.
  15. Mean time to repair (MTTR). A measure of the average time it takes to restore operations after a failure or disaster incident. MTTR is the basis for recovery time objective values.
  16. RAID. Short for "redundant array of independent disks," RAID is a data protection method that uses multiple storage disks to mitigate the risk of any one disk failing. Data is spread across the disks to make it quicker to retrieve or recoverable if a disk fails. Common examples include RAID 0, or disk striping; RAID 1, or disk mirroring; and RAID 5, or disk striping with parity. Many other RAID configurations exist.
  17. Recovery point objective (RPO). The maximum tolerable data loss from a disaster, expressed as a span of time. Begin determining RPO by evaluating data types in the environment and assessing how much data loss can be tolerated. For example, customer sales information databases might tolerate one hour of loss, while end-user documents might tolerate 24 hours. These values can define a backup schedule.
  18. Recovery time objective (RTO). The maximum time within which a business process must be restored following a disaster to avoid unacceptable consequences. To determine RTO, begin by identifying essential business processes and their related service-level agreements. Define the consequences of those processes being unavailable. For example, a crashed customer database might result in unacceptable sales losses after one business day.
  19. Redundant. Use of multiple components to eliminate single points of failure. Examples include servers with multiple network interface cards, two or more routers supporting network segments, and multiple database servers hosting replicas of the same files. Start by identifying all single points of failure and then listing how to make each one redundant.
  20. Restore (file). Recovering one or more files from backup media to replace original files that users deleted or unintentionally changed. File restores are common requests from end users who want to recover a few resources.
  21. Restore (system). A restore-from-backup process that returns a system to an earlier state. You might conduct a system restore following an operating system update with unintended consequences or after a storage disk failure. The restored data is usually OS configuration information.
  22. Risk. A potential for loss if a threat exploits a vulnerability. The loss could be downtime, lost data, sensitive data exfiltration, etc. Organizations calculate risk by assigning numeric values to each threat and vulnerability, multiplying the threat value by the vulnerability value, and then prioritizing the results. These calculations are part of a risk assessment, which organizations should conduct to create a DR plan.
  23. Service-level agreement (SLA). A contract between a consumer and provider that specifies services, performance levels and responsibilities. SLAs provide accountability for both parties.
  24. Single point of failure. A single link in a chain of services or devices that prevents the entire workload from functioning correctly if it fails. For example, if the business network connects to the internet using one router, that router is a single point of failure; the entire site loses internet connectivity if the router fails. Identifying and addressing single points of failure is crucial in DR planning.
  25. Snapshot. A point-in-time copy of specified information, often an entire disk. These are useful for quick restores, including setting a lab or test system back to a known point. Snapshots are normally short-term fixes rather than long-term disaster recovery tools and are not synonymous with backups.
  26. Threat. A potential natural or human-caused source of harm or damage. Threats might not be intentional. Threat examples include weather events, cybercriminals and system failures.
  27. Vulnerability. A weakness that a threat can exploit or exercise. Weaknesses include software bugs or flaws, poor security practices, and unencrypted protocols or connections.
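
The difference between incremental and differential restores, described under the Backup term, can be sketched in a few lines of Python. This is a minimal illustration, not any backup product's logic: the `Backup` class, the `restore_chain` function and the day numbering are hypothetical, and the sketch assumes a schedule uses either incremental or differential backups after each full backup, not a mix of both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # hypothetical day index; a higher number is more recent
    kind: str  # "full", "incremental" or "differential"


def restore_chain(backups):
    """Return the backups needed to restore the latest state, oldest first."""
    ordered = sorted(backups, key=lambda b: b.day)
    fulls = [b for b in ordered if b.kind == "full"]
    if not fulls:
        raise ValueError("a restore needs at least one full backup")
    full = fulls[-1]  # the most recent full backup is the starting point

    # Only backups taken after the most recent full backup matter.
    later = [b for b in ordered if b.day > full.day]
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        # A differential holds every change since the full backup,
        # so only the latest one is needed.
        return [full, differentials[-1]]
    # Incrementals build on each other, so every one must be applied in order.
    return [full] + [b for b in later if b.kind == "incremental"]


if __name__ == "__main__":
    week_inc = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 7)]
    week_diff = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 7)]
    print(len(restore_chain(week_inc)))   # full backup plus six incrementals
    print(len(restore_chain(week_diff)))  # full backup plus the latest differential
```

With a weekly full backup and six nightly incremental backups, the restore chain has seven entries. The same week with differential backups needs only two, which is why differential restores are quicker.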

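The MTBF and risk calculations described in the glossary are simple arithmetic. The sketch below is illustrative only: the function names, the 1-to-5 rating scale and the sample figures are assumptions made for the example, not part of any standard.

```python
def mtbf(total_uptime_hours, failures):
    """Mean time between failures: total uptime divided by number of failures."""
    if failures <= 0:
        raise ValueError("MTBF needs at least one failure in the period")
    return total_uptime_hours / failures


def rank_risks(ratings):
    """Score each risk as threat x vulnerability and sort highest first.

    `ratings` maps a risk name to a (threat, vulnerability) pair.
    The 1-to-5 scale used in the example is an assumed convention.
    """
    scored = [(name, threat * vuln) for name, (threat, vuln) in ratings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical device: 8,700 hours of uptime and 3 failures in the period.
    print(mtbf(8700, 3))  # 2900.0

    # Hypothetical ratings on an assumed 1-to-5 scale.
    sample = {
        "flood": (2, 4),
        "ransomware": (4, 5),
        "router failure": (3, 3),
    }
    for name, score in rank_risks(sample):
        print(name, score)
```

In this made-up example, ransomware scores 20, router failure 9 and flood 8, so ransomware would be addressed first in the risk assessment.
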
Damon Garn owns Cogspinner Coaction and provides freelance IT writing and editing services. He has written multiple CompTIA study guides, including the Linux+, Cloud Essentials+ and Server+ guides, and contributes extensively to TechTarget Editorial and CompTIA Blogs.
