Solving common data deduplication system problems

Still not getting the deduplication ratios vendors are promising? Learn about the most common problems with deduplication systems and how to fix them.

It's been said that we never really solve any problems in IT -- we just move them. Data deduplication is no exception to that rule. While deduplication systems have helped make data backup and recovery much easier, they also come with a number of challenges. The savvy storage or backup administrator will familiarize themselves with these challenges and do whatever they can to work around them.

Your backup system creates duplicate data in three different ways: repeated full backups of the same file system or application; repeated incremental backups of the same file or application; and backups of files that happen to reside in multiple places (e.g., the same OS/application on multiple machines). Hash-based deduplication systems (e.g., CommVault Systems Inc., EMC Corp., FalconStor Software, Quantum Corp., Symantec Corp.) will identify and eliminate all three types of duplicate data, but their level of granularity is limited to their chunk size, which is typically 8K or larger. Delta-differential-based deduplication systems (e.g., IBM Corp., ExaGrid Systems, Sepaton Inc.) will only identify and eliminate the first two types of duplicate data, but their level of granularity can be as small as a few bytes. These differences typically result in a dedupe ratio draw, but can yield significant differences in certain environments, which is why most experts suggest you test multiple products.
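
To make the hash-based approach concrete, below is a minimal Python sketch of chunk-level dedupe. The fixed 8 KB chunk size, the SHA-256 fingerprints and the ChunkStore class are illustrative assumptions rather than any vendor's actual design; shipping products typically use variable-size chunking and far more sophisticated indexing.

import hashlib
import os

CHUNK_SIZE = 8 * 1024   # illustrative fixed 8 KB chunks; real products often use variable-size chunking

def chunk_fingerprints(data: bytes):
    """Split a byte stream into fixed-size chunks and fingerprint each one."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        yield hashlib.sha256(chunk).hexdigest(), chunk

class ChunkStore:
    """Toy hash-based dedupe store: keeps one copy of each unique chunk."""
    def __init__(self):
        self.chunks = {}        # fingerprint -> chunk bytes
        self.logical_bytes = 0  # bytes the backup application sent us

    def ingest(self, data: bytes):
        self.logical_bytes += len(data)
        for fp, chunk in chunk_fingerprints(data):
            self.chunks.setdefault(fp, chunk)  # store only chunks not seen before

    def dedupe_ratio(self) -> float:
        stored = sum(len(c) for c in self.chunks.values())
        return self.logical_bytes / stored if stored else 0.0

# Two identical "full backups": the second stores nothing new, giving roughly 2:1.
store = ChunkStore()
full_backup = os.urandom(1024 * 1024)   # stand-in for 1 MB of unique file data
store.ingest(full_backup)               # first full backup stores every chunk
store.ingest(full_backup)               # second full backup adds no new chunks
print(f"dedupe ratio: {store.dedupe_ratio():.1f}:1")

Each additional identical full backup pushes the toy ratio higher, which is exactly why repeated fulls are such a rich source of deduplication.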

More on deduplication systems

Learn about source deduplication vs. target deduplication  

Read our tutorial on data deduplication 

Listen to a podcast on global deduplication

Because roughly half of the duplicate data in a typical backup environment comes from repeated full backups, customers using IBM Tivoli Storage Manager (TSM) as their backup product will see lower deduplication ratios than customers using other backup products. This is due to TSM's progressive incremental feature, which lets users skip full backups of TSM-protected file systems after the initial full. However, because TSM users still perform full backups of their databases and applications, and because full backups aren't the only source of duplicate data, TSM users can still benefit from deduplication systems -- their dedupe ratios will simply be smaller.

The second type of duplicate data comes from incremental backups, which contain versions of files or applications that have changed since the most recent full backup. If a file is modified and backed up every day, and the backup system retains backups for 90 days, there will be 90 versions of that file in the backup system. A deduplication system will identify the segments of data that are unique and redundant among those 90 different versions and store only the unique segments. However, there are file types that do not have different versions (video, audio, photos or imaging data, and PDF files); every file is unique unto itself and is not a version of a previous iteration of the same file. An incremental backup that contains these types of files contains completely unique data, so there is nothing to deduplicate them against. Since there is a cost associated with deduplicated storage, customers with significant portions of such files should consider not storing them on a deduplication system, as they will gain no benefit and only increase their cost.
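
Sticking with the same toy fixed-size chunking as above (again, an assumption used only for illustration), the contrast between a file that exists as versions and a file type that doesn't is easy to demonstrate:

import hashlib
import os

CHUNK = 8 * 1024

def fingerprints(data: bytes) -> set:
    """Set of SHA-256 fingerprints of the fixed-size chunks in a byte stream."""
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

# A "document" edited in place between backups: most chunks survive unchanged,
# so successive versions deduplicate well against each other.
doc_v1 = bytes(os.urandom(1024 * 1024))
doc_v2 = bytearray(doc_v1)
doc_v2[0:CHUNK] = os.urandom(CHUNK)        # overwrite one 8 KB region
shared = fingerprints(doc_v1) & fingerprints(bytes(doc_v2))
print(f"document versions share {len(shared)} of {len(fingerprints(doc_v1))} chunks")

# Two different photos are each unique unto themselves: no shared chunks,
# so there is nothing to deduplicate them against.
photo_a, photo_b = os.urandom(1024 * 1024), os.urandom(1024 * 1024)
print(f"photos share {len(fingerprints(photo_a) & fingerprints(photo_b))} chunks")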

Data deduplication systems and encryption: What to watch out for

Data deduplication systems work by finding repeated patterns and eliminating the duplicates; encryption systems work by removing patterns altogether. Do not encrypt backup data before it is sent to the deduplication system, or your deduplication ratio will be 1:1.
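
A few lines of Python make the point. The use of AES-GCM from the third-party cryptography package here is simply an assumption for illustration; the effect is the same for any sound encryption scheme.

import hashlib
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # third-party 'cryptography' package

chunk = os.urandom(8 * 1024)              # the same 8 KB backup chunk, sent twice
key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

# In the clear, the two copies fingerprint identically, so a dedupe system stores only one.
print(hashlib.sha256(chunk).hexdigest() == hashlib.sha256(chunk).hexdigest())   # True

# Encrypted first (with a fresh random nonce each time, as sound practice requires),
# the two copies never match, so the dedupe system sees nothing but unique data.
ct1 = aesgcm.encrypt(os.urandom(12), chunk, None)
ct2 = aesgcm.encrypt(os.urandom(12), chunk, None)
print(hashlib.sha256(ct1).hexdigest() == hashlib.sha256(ct2).hexdigest())        # False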

Compression works a little like encryption in that it also finds and eliminates patterns, just in a very different way. The way most compression systems work scrambles the data with a similar effect to encryption: it can completely prevent your deduplication system from finding any duplicate data.
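
A toy demonstration, using Python's built-in zlib and an invented "customer record" data set as a stand-in for a database backup, shows the effect of compressing before deduplicating:

import hashlib
import zlib

CHUNK = 8 * 1024

def fingerprints(data: bytes) -> set:
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

# Two nightly backups of a compressible "database", identical except for one record at the start.
rows_v1 = b"".join(b"customer-record-%08d;" % i for i in range(100_000))
rows_v2 = b"CHANGED!" + rows_v1[8:]

# Sent in the clear, the two backups still share almost every chunk.
print(len(fingerprints(rows_v1) & fingerprints(rows_v2)), "shared chunks without compression")

# Compressed on the client first, the streams diverge from the change onward, and the
# chunk fingerprints typically share little or nothing for the dedupe system to match.
print(len(fingerprints(zlib.compress(rows_v1)) & fingerprints(zlib.compress(rows_v2))),
      "shared chunks after client-side compression")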

The compression challenge often results in a stalemate between database administrators who want their backups to go faster and backup administrators who want their backups to get deduped. Since databases are often created with very large capacities and very small actual amounts of data, they tend to compress very well. This is why turning on the backup compression feature often results in database backups that go two to four times faster than they do without compression. The only way to get around this particular challenge is to use a backup software product that has integrated source deduplication and client compression, such as CommVault Simpana, IBM TSM or Symantec NetBackup.

Multiplexing and deduplication systems

The next dedupe challenge with backups only applies to companies using virtual tape libraries (VTLs) and backup software that supports multiplexing. Multiplexing several different backups to the same virtual tape drive scrambles the data and can completely confound dedupe. Even products that are able to decipher the different backup streams from a multiplexed image (e.g., FalconStor, Sepaton) tell you not to multiplex backups to their devices because it simply wastes time.
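
Here is a crude Python mock-up of why interleaving defeats chunk matching. The 64 KB block size and the two-byte stream headers are invented for illustration; real multiplexed tape formats are more elaborate, but they scramble chunk alignment in the same way.

import hashlib
import os

CHUNK = 8 * 1024
MUX_BLOCK = 64 * 1024   # assumed multiplexing block size, for illustration only

def fingerprints(data: bytes) -> set:
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

def multiplex(stream_a: bytes, stream_b: bytes) -> bytes:
    """Interleave two backup streams block by block, with a tiny per-block stream header."""
    image = bytearray()
    for i in range(0, max(len(stream_a), len(stream_b)), MUX_BLOCK):
        image += b"A:" + stream_a[i:i + MUX_BLOCK]
        image += b"B:" + stream_b[i:i + MUX_BLOCK]
    return bytes(image)

client_a, client_b = os.urandom(1024 * 1024), os.urandom(1024 * 1024)

# Last night's backups of the two clients arrived as plain, separate streams.
plain_chunks = fingerprints(client_a) | fingerprints(client_b)

# Tonight the same, unchanged data is multiplexed to one virtual tape drive: the headers
# and interleaving shift every chunk boundary, so (typically) nothing matches what's stored.
muxed_chunks = fingerprints(multiplex(client_a, client_b))
print(len(plain_chunks & muxed_chunks), "chunks of the multiplexed image match last night's backups")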

Consider the dedupe tax

The final backup dedupe challenge has to do with the backup window. The way some deduplication systems perform the dedupe task actually slows down the incoming backup. Most people don't notice this because they are moving from tape to disk, and a dedupe system is still faster. However, users who are already using disk staging may notice a reduction in backup performance and an increase in the amount of time it takes to back up their data. Not all products have this particular characteristic, and the ones that do demonstrate it in varying degrees -- only a proof-of-concept test in your environment will let you know for sure.

The restore challenge is much easier to understand; the way most deduplication systems store data results in the most recent backups being written in a fragmented way. Restoring deduplicated backups may therefore take longer than it would have taken if the backup had not been deduplicated. This phenomenon is referred to as the dedupe tax.

When considering the dedupe tax, think about whether or not you're planning to use the dedupe system as the source for tape copies, because it is during large restores and tape copies that the dedupe tax is most prevalent. Suppose, for example, that you plan on using LTO-5 drives that have a native speed of 140 MBps and a native capacity of 1.5 TB. Suppose also that you have examined your full backup tapes and have discovered that you consistently fit 2.25 TB of data on your 1.5 TB tapes, meaning that you're getting a 1.5:1 compression ratio. This means that your 140 MBps tape drive should be running at roughly 210 MBps during copies. Make sure that during your proof of concept you verify that the dedupe system is able to provide the required performance (210 MBps in this example). If it cannot, you may want to consider a different system.
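
As a quick sanity check, the arithmetic from the example above works out as follows (decimal units, a single drive, and nothing more than a back-of-the-envelope sketch):

NATIVE_SPEED_MBPS = 140       # LTO-5 native transfer rate, MB/s
NATIVE_CAPACITY_TB = 1.5      # LTO-5 native capacity
OBSERVED_TB_PER_TAPE = 2.25   # what the full backup tapes actually hold

compression_ratio = OBSERVED_TB_PER_TAPE / NATIVE_CAPACITY_TB      # 1.5:1
required_read_mbps = NATIVE_SPEED_MBPS * compression_ratio         # 210 MB/s per drive
hours_per_tape = OBSERVED_TB_PER_TAPE * 1_000_000 / required_read_mbps / 3600

print(f"compression ratio: {compression_ratio:.1f}:1")
print(f"dedupe system must sustain ~{required_read_mbps:.0f} MB/s per tape drive")
print(f"filling one tape at that rate takes about {hours_per_tape:.1f} hours")

If the dedupe system can't sustain that read rate per drive, the tape drive will shoe-shine and the copy will take far longer than the roughly three hours the arithmetic suggests.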

The final challenge with deduplicated restores is that they are still restores, which is why dedupe is not a panacea. A large system that must be restored still requires a bulk copy of data from the dedupe system to the production system. Only a total architectural change of your backup system from traditional backup to something like continuous data protection (CDP) or near-CDP can address this particular challenge, as those approaches offer restore times measured in seconds, not hours.

Data deduplication systems offer the best hope for making significant enhancements to your current backup and recovery system without making wholesale architectural changes. Just be sure that you are aware of the challenges of dedupe before you sign a purchase order.

About the author:
W. Curtis Preston (a.k.a. "Mr. Backup"), Executive Editor and Independent Backup Expert, has been singularly focused on data backup and recovery for more than 15 years. From starting as a backup admin at a $35 billion credit card company to being one of the most sought-after consultants, writers and speakers in this space, it's hard to find someone more focused on recovering lost data. He is the webmaster of BackupCentral.com, the author of hundreds of articles, and the books "Backup and Recovery" and "Using SANs and NAS."
