A step-by-step guide to implementing data deduplication

The best way to select, implement and integrate data deduplication can vary depending on how the deduplication is performed. Here are some general principles that you can follow to select the right deduplication approach and then integrate it into your environment.

Step 1: Assess your backup environment

The deduplication ratio a company achieves will depend heavily on the following factors:

  • Type of data
  • Change rate of the data
  • Amount of redundant data
  • Type of backup performed (full, incremental or differential)
  • Retention length of the archived or backup data

The challenge most companies face is gathering this data quickly and effectively. Agentless data gathering and information classification tools from Aptare Inc., Asigra Inc., Bocada Inc. and Kazeon Systems Inc. can assist with these assessments while requiring minimal or no changes to your servers, such as agent deployments.
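
To get a feel for how the factors above interact before running a formal assessment, a rough back-of-the-envelope estimate can help. The sketch below is illustrative only; the backup size, change rate, retention period and the simplifying assumption that unchanged data deduplicates completely are all assumptions, not measurements from any product.

```python
# Rough deduplication ratio estimate (illustrative assumptions only).
full_backup_tb = 10.0       # size of one full backup
weekly_fulls = 1            # full backups per week
daily_incrementals = 6      # incremental backups per week
daily_change_rate = 0.02    # fraction of the data that changes each day
retention_weeks = 12        # how long backup data is retained

# Logical data written to the backup target over the retention period.
logical_tb = retention_weeks * (
    weekly_fulls * full_backup_tb
    + daily_incrementals * full_backup_tb * daily_change_rate
)

# Unique data actually stored: roughly one full copy plus the changed
# blocks that accumulate over the retention period.
unique_tb = full_backup_tb + retention_weeks * 7 * full_backup_tb * daily_change_rate

ratio = logical_tb / unique_tb
print(f"Logical data written: {logical_tb:.1f} TB")
print(f"Unique data stored:   {unique_tb:.1f} TB")
print(f"Estimated deduplication ratio: {ratio:.1f}:1")
```

Even this crude model shows why longer retention and frequent full backups push the ratio up, while a high change rate pulls it down.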

Step 2: Establish how much you can change your backup environment

Deploying backup software that uses software agents requires installing an agent on each server or virtual machine and rebooting servers after installation. This approach generally results in faster backup times and higher deduplication ratios than using a data deduplication appliance. However, it can take more time and require many changes to a company's backup environment. Using a data deduplication appliance typically requires no changes to servers, though a company will need to tune its backup software depending on whether the appliance is presented as a file server or a virtual tape library (VTL).

Step 3: Purchase a scalable storage architecture

The amount of data that a company initially plans to back up and what it actually ends up backing up are usually two very different numbers. A company usually finds deduplication so effective once it starts using it in its backup process that it quickly scales its use and deployment beyond initial intentions, so confirm that deduplicating hardware appliances can scale in both performance and capacity. Also verify that hardware and software deduplication products provide global deduplication and replication features so they can maximize deduplication's benefits throughout the enterprise, facilitate technology refreshes and/or capacity growth, and efficiently bring in deduplicated data from remote offices.
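
Global deduplication and dedupe-aware replication both come down to sharing one fingerprint index, so a chunk that already exists anywhere in the repository never has to be stored or shipped again. The sketch below is a simplified illustration of that idea, not any vendor's implementation; the SHA-256 fingerprints, in-memory dictionaries and sample chunk names are assumptions made for the example.

```python
# Simplified sketch of deduplication-aware replication from a remote office
# to a central repository: fingerprints are exchanged first, and only chunks
# the central site does not already hold cross the WAN. In-memory dicts stand
# in for each appliance's persistent chunk index.
import hashlib

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

class ChunkStore:
    """Stands in for one site's deduplicated repository."""
    def __init__(self):
        self.chunks = {}                        # fingerprint -> chunk data

    def add(self, chunk: bytes) -> None:
        self.chunks[fingerprint(chunk)] = chunk

    def missing(self, fingerprints) -> set:
        """Fingerprints this site does not yet hold."""
        return {fp for fp in fingerprints if fp not in self.chunks}

def replicate(backup_chunks, target: ChunkStore) -> int:
    """Send a backup to the target site; return how many chunks were shipped."""
    needed = target.missing(fingerprint(c) for c in backup_chunks)
    sent = 0
    for chunk in backup_chunks:
        fp = fingerprint(chunk)
        if fp in needed:                        # ship only previously unseen chunks
            target.add(chunk)
            needed.discard(fp)
            sent += 1
    return sent

# The central site already holds most of what the remote office backs up,
# so only the handful of genuinely new chunks travel over the WAN.
central = ChunkStore()
shared = [b"chunk-%d" % i for i in range(100)]
for c in shared:
    central.add(c)
remote_backup = shared + [b"new-chunk-%d" % i for i in range(5)]
print("Chunks shipped over the WAN:", replicate(remote_backup, central))   # -> 5
```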

Step 4: Check the level of integration between backup software and hardware appliances

The level of integration that a hardware appliance has with backup software (or vice versa) can expedite backups and recoveries. For example, ExaGrid Systems Inc.'s appliances recognize backup streams from CA ARCserve and can deduplicate data from that software more effectively than streams from backup software they don't recognize. Enterprise backup software is also starting to manage disk storage systems more intelligently, placing data on different tiers of disk so it can be backed up and recovered quickly in the short term and then stored more cost-effectively over the long term.

Step 5: Perform the first backup

The first backup using agent-based deduplication software can be a harrowing experience. It can create significant overhead on the server and take much longer than normal to complete because it must deduplicate all of the data. Once the first backup is complete, however, only changed data needs to be backed up and deduplicated going forward. With a hardware appliance, the experience tends to be the opposite: the first backup may go quickly, but backups may slow over time depending on how scalable the appliance is, how much data is changing and how much data growth a company is experiencing.
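
A minimal sketch of chunk-level deduplication shows why that first pass carries the heaviest load: every chunk has to be hashed and stored, while later backups store only chunks that haven't been seen before. The fixed-size chunking, SHA-256 fingerprints and in-memory index below are simplifications for the example, not how any particular product works.

```python
# Why the first deduplicated backup is the slow one: on the first pass every
# chunk is new and must be stored; later passes store only changed chunks.
import hashlib
import os

CHUNK_SIZE = 4096                 # assumed fixed chunk size, for illustration
chunk_store = {}                  # fingerprint -> chunk (the dedupe repository)

def backup(data: bytes) -> int:
    """Deduplicate one backup stream; return how many new chunks were stored."""
    new_chunks = 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in chunk_store:             # store only previously unseen chunks
            chunk_store[fp] = chunk
            new_chunks += 1
    return new_chunks

# First backup: essentially every chunk is new, so all of it is stored.
first = os.urandom(1024 * 1024)               # 1 MB of sample data
print("First backup stored", backup(first), "chunks")             # ~256 chunks

# Next backup: only the changed region produces new chunks.
second = bytearray(first)
second[:CHUNK_SIZE] = os.urandom(CHUNK_SIZE)  # simulate one changed chunk
print("Second backup stored", backup(bytes(second)), "chunks")    # 1 chunk
```

Source-side (agent-based) products run this hashing work on the protected server itself, which is part of why that first pass is so taxing; target-side appliances shift the indexing load onto the appliance instead.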

About the author: Jerome M. Wendt is lead analyst and president of DCIG Inc.
