Data deduping explained: Deduplication in data backup environments tutorial
Learn everything you wanted to know about data deduping in our tutorial on data deduplication and data backup.
By W. Curtis Preston
Data deduplication is one of the biggest game-changers in data backup and data storage in the past several years, and it is important to have a firm understanding of its basics if you're considering using it in your environment. The purpose of this tutorial is to help you gain that understanding.
When the term deduplication, also referred to as data dedupe or data deduping, is used without any qualifiers (e.g. file-level dedupe), we are typically referring to subfile-level deduplication. This means that individual files are broken down into segments and those segments are examined for commonality. If two segments are deemed to be the same (even if they are in different files), then one of the segments is deleted and replaced with a pointer to the other segment. Segments that are deemed to be new or unique are, of course, stored as well.
Different files -- even if they are in different filesystems or on different servers -- may have segments in common for a number of reasons. In backups, duplicate segments between files might indicate that the same exact file exists in multiple places. Duplicates are also created when performing repeated full backups of the same servers. Finally, duplicate segments are created when performing incremental backups of files. Even if only a few bytes of a file have changed, the entire file is usually backed up by the backup system. If you break that file down into segments, most of the segments between different versions of the same file will be the same, and only the new, unique segments need to be stored.
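To make the segment-and-pointer idea concrete, here is a minimal Python sketch -- not any vendor's implementation -- that breaks files into fixed-size segments, stores each unique segment only once, and keeps a list of pointers per file. The class and names are illustrative, and real products typically use variable-size segmentation:

```python
import hashlib

SEGMENT_SIZE = 4096  # fixed-size segments keep the example simple

class DedupeStore:
    """Toy subfile-level dedupe store: unique segments keyed by hash,
    files kept as lists of segment references (the "pointers")."""
    def __init__(self):
        self.segments = {}   # segment hash -> segment bytes (stored once)
        self.files = {}      # file name -> list of segment hashes

    def write_file(self, name, data):
        refs = []
        for i in range(0, len(data), SEGMENT_SIZE):
            segment = data[i:i + SEGMENT_SIZE]
            key = hashlib.sha1(segment).hexdigest()
            self.segments.setdefault(key, segment)  # store only if never seen before
            refs.append(key)
        self.files[name] = refs

    def read_file(self, name):
        # Reassemble the file by following the pointers back to the stored segments.
        return b"".join(self.segments[key] for key in self.files[name])

store = DedupeStore()
store.write_file("backup_monday", b"A" * 8192 + b"B" * 4096)
store.write_file("backup_tuesday", b"A" * 8192 + b"C" * 4096)  # only a little has changed
print(len(store.segments))   # 3 unique segments stored instead of 6 written
```

Writing the second, mostly unchanged backup adds only the one new segment, which is exactly why repeated full backups of the same servers dedupe so well.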
WHERE YOUR DATA IS DEDUPED: INLINE DEDUPLICATION AND POST-PROCESSING
The two primary approaches (inline deduplication and post-processing deduplication) are roughly analogous to synchronous and asynchronous replication. Inline deduplication is roughly analogous to synchronous replication, as it does not acknowledge a write until a segment has been determined to be unique or redundant; the original, native data is never written to disk. In an inline system, only new, unique segments are written to disk. Post-process deduplication is roughly analogous to asynchronous replication, as it allows the original data to be written to disk and deduplicated at a later time. "Later" can be seconds, minutes, or hours, depending on which system we are talking about and how it has been configured.
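As a rough illustration of the difference -- a hypothetical sketch, not any product's architecture -- an inline system makes the dedupe decision before acknowledging the write, while a post-process system lands the native data first and dedupes it in a later pass:

```python
import hashlib

def inline_write(store, segment):
    """Inline: decide unique vs. redundant before acknowledging the write;
    a duplicate segment is never written to disk in its native form."""
    key = hashlib.sha1(segment).hexdigest()
    store.setdefault(key, segment)        # only new, unique segments are kept
    return key                            # acknowledge only after the dedupe decision

def post_process_write(staging_area, segment):
    """Post-process: land the original data first for a fast acknowledgement."""
    staging_area.append(segment)          # stored in native format for now
    return len(staging_area) - 1

def post_process_dedupe(staging_area, store):
    """Runs later -- seconds, minutes, or hours, depending on configuration."""
    while staging_area:
        segment = staging_area.pop()
        store.setdefault(hashlib.sha1(segment).hexdigest(), segment)
```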
There is not sufficient space here to explain the merits of these two approaches, but the following few statements will give you an overview of their claims. Inline vendors claim to be more efficient and to require less disk. Post-process vendors claim to allow for faster initial writes and faster read performance for more recent data -- mainly because it is left stored in its native format. Both approaches have merits and limitations, and one should not select a product based on its position in this argument alone. One should select a product based on its price/performance numbers, which may or may not be affected by the vendor's choice of inline or post-process.
Editor's Tip: Learn more about inline deduplication vs. post-processing in this article.
HOW IS DUPLICATE DATA IDENTIFIED?
There are three primary approaches to this question as well: hash-based, modified hash-based and delta differential. Hash-based vendors take segments of files and run them through a cryptographic hashing algorithm, such as SHA-1, Tiger, or SHA-256, each of which creates a numeric value (160 bits to 256 bits, depending on the algorithm) that can be compared against the numeric values of every other segment that the dedupe system has ever seen. Two segments that have the same hash are considered to be redundant. A modified hash-based approach typically uses a much smaller hash (e.g., a CRC of only 16 bits) to see if two segments might be the same; such segments are referred to as redundancy candidates. If two segments look like they might be the same, a binary-level comparison verifies that they are indeed the same before one of them is deleted and replaced with a pointer.
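The sketch below contrasts the two hash-driven approaches; it is illustrative only, and it uses a 32-bit CRC to stand in for the small (e.g., 16-bit) hash mentioned above:

```python
import hashlib
import zlib

strong_index = {}   # sha256 digest -> segment (hash-based: the hash alone decides)
weak_index = {}     # small checksum -> list of segments (modified hash-based: candidates only)

def hash_based_is_duplicate(segment):
    """Trust a cryptographic hash: identical digests are treated as identical segments."""
    key = hashlib.sha256(segment).digest()
    duplicate = key in strong_index
    strong_index.setdefault(key, segment)
    return duplicate

def modified_hash_is_duplicate(segment):
    """Use a cheap checksum to nominate redundancy candidates, then confirm
    with a binary-level comparison before calling anything redundant."""
    key = zlib.crc32(segment)             # stands in for the small 16-bit hash
    for candidate in weak_index.get(key, []):
        if candidate == segment:          # byte-for-byte verification
            return True
    weak_index.setdefault(key, []).append(segment)
    return False
```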
Delta differential systems attempt to associate larger segments to each other (e.g., two full backups of the same database) and do a block-level comparison of them against each other. The delta differential approach is only useful in backup systems, as it only works when comparing multiple versions of the same data to each other. This does not happen in primary storage; therefore, all primary storage deduplication systems use either the hash-based or the modified hash-based approach to identifying duplicate data.
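A delta differential pass can be sketched as a block-by-block comparison of two versions of the same large object, keeping only the blocks that changed. Again, this is a simplified illustration rather than any vendor's algorithm:

```python
BLOCK_SIZE = 4096

def delta_blocks(previous_backup, current_backup):
    """Compare two versions of the same backup block by block and
    return only the blocks that differ (keyed by their offset)."""
    changed = {}
    length = max(len(previous_backup), len(current_backup))
    for offset in range(0, length, BLOCK_SIZE):
        old = previous_backup[offset:offset + BLOCK_SIZE]
        new = current_backup[offset:offset + BLOCK_SIZE]
        if old != new:
            changed[offset] = new
    return changed

monday = b"x" * 40960
tuesday = monday[:8192] + b"y" * 4096 + monday[12288:]   # one block modified
print(len(delta_blocks(monday, tuesday)))                # only 1 changed block is kept
```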
Editor's Tip: Learn more about file-level vs. block-level deduplication in this expert response.
TARGET DEDUPLICATION VS. SOURCE DEDUPLICATION AND HYBRID APPROACHES
Where should duplicate data be identified? This question only applies to backup systems, and there are three possible answers: target, source, and hybrid. A target deduplication system is used as a target for regular (non-deduplicated) backups, and is typically presented to the backup server as a NAS share or virtual tape library (VTL). Once the backups arrive at the target, they are deduplicated (either inline or as a post-process) and written to disk. This is referred to as target dedupe, and its main advantage is that it allows you to keep your existing backup software.
If you're willing to change backup software, you can switch to source deduplication, where duplicate data is identified at the server being backed up before it is sent across the network. If a given segment or file has already been backed up, it is not sent across the LAN or WAN again -- it has been deduped at the source. The biggest advantage of this approach is the savings in bandwidth, which makes source dedupe well suited to remote and mobile data.
The hybrid approach requires a little more explanation. It is essentially a target deduplication system, as redundant data is not eliminated until it reaches the target; however, it is not as simple as that. Remember that to deduplicate data, the files must first be broken down into segments. In a hash-based approach, a numeric value -- or hash -- is then calculated on each segment, and that value is looked up in the hash table to see if it has been seen before. Typically, all three of these steps are performed in the same place -- either at the source or the target. In a hybrid system, the first one or two steps can be done on the client being backed up, and the final step can be done on the backup server. The advantage of this approach (over typical target approaches) is that data may be compressed or encrypted at the client. Compressing or encrypting data before it reaches a typical target deduplication system would significantly reduce your dedupe ratio, possibly eliminating it altogether. But the hybrid approach allows for both compression and encryption before data is sent across the network.
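A simplified sketch of that split might look like the following, with illustrative function names: the client segments, hashes and compresses, while the hash lookup still happens at the target, so redundant data is not eliminated until it arrives there. Because the hashes are computed before compression, compressing (or encrypting) the payload does not hurt the dedupe ratio:

```python
import hashlib
import zlib

def client_prepare(data, segment_size=4096):
    """Client side of a hybrid system: segment the data and hash each segment
    *before* compressing it, so compression cannot change what gets matched."""
    for i in range(0, len(data), segment_size):
        segment = data[i:i + segment_size]
        key = hashlib.sha1(segment).hexdigest()   # steps 1 and 2: segment + hash on the client
        yield key, zlib.compress(segment)         # payload could also be encrypted here

def target_ingest(index, stream):
    """Target side: the hash lookup (step 3) happens here; duplicates are dropped."""
    stored = 0
    for key, payload in stream:
        if key not in index:
            index[key] = payload
            stored += 1
    return stored

index = {}
print(target_ingest(index, client_prepare(b"A" * 16384)))   # 1 unique segment stored
print(target_ingest(index, client_prepare(b"A" * 16384)))   # repeat backup stores nothing new
```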
Editor's Tip: Learn more about source deduplication in this expert response.
WHAT IS PRIMARY STORAGE DEDUPLICATION?
Deduplication is also used in primary data storage, where duplicate data is not as common -- but it does exist. Just as in backups, the same exact file may reside in multiple places, or end-users may save multiple versions of the same file as a way of protecting themselves against "fat finger" incidents. One type of data that has a lot of commonality between different files is system images for virtualization systems. The C: (or root) drive for one system is almost exactly the same as the C: (root) drive for another system. A good deduplication system will identify all of those common files and segments and replace them with a single copy.
Whether we're talking backups or primary storage, the amount of disk saved is highly dependent on the type of data being stored and the amount of duplicate segments found within that data. Typical savings in backup range from 5:1 to 20:1 and average around 10:1, with users who do frequent full backups tending towards the higher ratios. Savings in primary storage are usually expressed in reduction percentages, such as "there was a 50% reduction," which sounds a lot better than a 2:1 deduplication ratio. Typical savings in primary storage range from 50% to 60% or more for typical data, and as much as 90% or more for things like virtual desktop images.
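The two ways of expressing savings are related by simple arithmetic -- the reduction percentage is just 1 - 1/ratio -- as this small example shows:

```python
def reduction_percent(ratio):
    """Convert a dedupe ratio (e.g., 2 for 2:1) into a space-reduction percentage."""
    return (1 - 1 / ratio) * 100

for ratio in (2, 10, 20):
    print(f"{ratio}:1 dedupe ratio = {reduction_percent(ratio):.0f}% reduction")
# 2:1 = 50%, 10:1 = 90%, 20:1 = 95%
```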
Editor's Tip: To learn more about primary storage dedupe, listen to SearchStorage's FAQ on primary storage deduplication.
YOUR MILEAGE MAY VARY: DATA DEDUPLICATION RATIOS
There is no "may" about it -- your mileage will vary when data deduping. Your dedupe performance and data deduplication ratio will be significantly different from those of your neighbors. A dedupe system that is appropriate for your neighbor may be entirely inappropriate for you, since different approaches work better for different data types and behavior patterns. Therefore, the most important thing to remember when selecting a deduplication system is to perform a full proof-of-concept test before signing the check.
This has been a high-level overview of the basics of deduplication for both primary storage and backup. For more detailed information, bookmark SearchDataBackup's special section on data reduction and deduplication.