
Compression, deduplication and encryption: What's the difference?

Learn the distinctions and similarities among data compression, dedupe and encryption as these data protection methods gain importance in everyday storage.

As more corporate data lands on spinning disk, storage administrators must implement, configure and manage this escalating capacity -- stretching disk space to the limit while protecting important data against loss or theft.

Compression, deduplication and encryption are common data protection technologies for managing and optimizing disk storage, and it's important to understand the role each one plays in the data center.

Data compression

Compression attempts to reduce the size of a file by removing redundant data within the file. Smaller files consume less disk space, so more files can be stored on the same disk. For example, a 100 KB text file might be compressed to 52 KB by removing extra spaces or replacing long character strings with short representations. An algorithm recreates the original data when the file is read. Picture files are also usually compressed; the JPEG image file format, for example, uses compression to eliminate redundant pixel data.

Almost any file can be compressed, though files with little redundant data may compress only slightly, if at all, so compression ratios are a guideline, not a rule. For example, a 2-to-1 compression ratio ideally allows 400 GB worth of files to fit on a 200 GB disk. It's difficult to determine exactly how much a file will compress until a compression algorithm is actually applied.
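As a rough illustration of both points, the sketch below uses Python's standard-library zlib module: highly repetitive data shrinks dramatically, while data with little redundancy (here, random bytes) barely compresses at all. This is only a minimal sketch; real storage systems use their own compression engines and settings.

```python
import os
import zlib

# Highly redundant data: a short phrase repeated many times.
redundant = b"quarterly sales report " * 5000

# Nonredundant data: random bytes have no patterns to exploit.
random_data = os.urandom(len(redundant))

for label, payload in (("redundant", redundant), ("random", random_data)):
    compressed = zlib.compress(payload)
    ratio = len(payload) / len(compressed)
    print(f"{label}: {len(payload)} -> {len(compressed)} bytes (~{ratio:.1f}-to-1)")

# Decompressing restores the original data exactly -- compression here is lossless.
assert zlib.decompress(zlib.compress(redundant)) == redundant
```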

Data deduplication

A typical data center may store many copies of the same file. File deduplication -- sometimes called data reduction or single-instance storage -- is another space-saving data protection technology intended to eliminate redundant files on a storage system. Saving only one instance of each file can significantly reduce the disk space consumed.

For example, suppose the same 10 MB PowerPoint presentation is stored in 10 folders, one for each sales associate or department. That's 100 MB of disk space consumed to maintain the same 10 MB file. File deduplication ensures that only one complete copy is saved to disk; subsequent copies are saved only as references that point to that single instance, so users still see their own files in place. Similarly, a storage system may retain 200 emails that each carry the same 1 MB attachment. With deduplication, the 200 MB needed to store every copy of the attachment is reduced to just 1 MB for a single instance of the file.
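Conceptually, file-level deduplication is content hashing: if two files hash to the same value, only one copy needs to be stored and the rest become references. The sketch below assumes a hypothetical directory of sales folders and uses SHA-256 as the fingerprint; production systems add collision handling, reference counting and metadata tracking.

```python
import hashlib
from pathlib import Path

def file_fingerprint(path: Path) -> str:
    """Return a SHA-256 digest of the file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

stored = {}        # fingerprint -> path of the single stored instance
references = []    # (duplicate path, fingerprint) pairs saved as pointers only

for path in Path("sales_folders").rglob("*.pptx"):  # hypothetical directory
    fp = file_fingerprint(path)
    if fp in stored:
        references.append((path, fp))   # duplicate: record a reference only
    else:
        stored[fp] = path               # first instance: keep the actual data

print(f"{len(stored)} unique files stored, {len(references)} replaced by references")
```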

Block-level deduplication saves only the unique iterations of each block in a file. If a user updates a file, only the changed blocks are saved; because the changes don't create an entirely new file, block-level deduplication is more space-efficient than file-level dedupe. It does, however, require more processing power and a larger index to track the individual pieces.
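A rough way to picture block-level dedupe: split data into blocks, fingerprint each block and physically keep only blocks that have not been seen before. The sketch below uses fixed 4 KB blocks for simplicity; real products often use variable-length chunking and persistent indexes.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB fixed-size blocks for this sketch

def store_blocks(data: bytes, store: dict) -> list:
    """Split data into blocks and physically keep only blocks not seen before.

    Returns the file's 'recipe': an ordered list of block fingerprints, which
    is all that needs saving for the unchanged parts of an updated file.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = block   # new block: stored once
        recipe.append(fingerprint)       # known block: reference only
    return recipe

store = {}
original = b"A" * 20480 + b"B" * 20480   # two 20 KB halves
revised  = b"A" * 20480 + b"C" * 20480   # only the second half changed

store_blocks(original, store)
store_blocks(revised, store)

# The unchanged first half of the revised file dedupes against the original,
# so only the changed blocks add new physical data.
print(f"unique blocks stored: {len(store)}")
```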

Deduplication granularity varies by product: some platforms eliminate only whole duplicate files, while others remove redundant portions of files, potentially down to the block level. When evaluating a deduplication product, it's important to understand the granularity the vendor's platform offers.

Vendors frequently tout their deduplication ratio. As an example, a 10-to-1 deduplication ratio means that 10 times more data is protected than the physical space needed to store it.
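Put another way, the ratio is simply the logical data protected divided by the physical capacity consumed, as in this toy calculation with made-up numbers:

```python
logical_bytes  = 500 * 10**9   # 500 GB of backup data as users see it
physical_bytes = 50 * 10**9    # 50 GB actually written after deduplication

print(f"deduplication ratio: {logical_bytes / physical_bytes:.0f}-to-1")  # 10-to-1
```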

Encryption

With an increase in government regulations and corporate litigation, data storage managers have to pay close attention to the role of security in enterprise storage. Encryption is a key data protection technology that prevents unauthorized users from accessing information, even if files are hacked and stolen.

Encryption uses a mathematical algorithm with a unique key to encode a file into a form that cannot be read. No one else can access or use the encrypted file until it is decrypted using the identical key. If the encryption key is lost, any data encrypted with that key will be inaccessible.
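As an illustration of symmetric encryption, the sketch below uses Fernet from the third-party cryptography package (an assumption; any well-vetted symmetric cipher behaves the same way conceptually): the same key that encrypts the data is required to decrypt it, and losing the key means losing the data.

```python
from cryptography.fernet import Fernet  # third-party: pip install cryptography

key = Fernet.generate_key()        # the key must be stored and managed securely
cipher = Fernet(key)

plaintext = b"Q3 revenue forecast: confidential"
token = cipher.encrypt(plaintext)  # unreadable without the key

# Only the identical key can recover the original data.
assert Fernet(key).decrypt(token) == plaintext

# A different key cannot decrypt the token; attempting it raises an
# InvalidToken exception, so the data stays inaccessible.
```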

Encryption is a key element of data protection products. When moving data from one location to another, it's best to encrypt it at the original site, while it's in transit and when it's at rest at the destination. It's also important to properly encrypt backups, using a strong algorithm and solid key management.

End-to-end encryption prevents data sent between two locations from being viewed by anyone who intercepts the communication channel. Text messaging applications, such as Apple iMessage, often use end-to-end encryption, but law enforcement agencies have criticized vendors because the technology prevents them from accessing information.

Understanding how they work together

Used together, the three technologies enable increased capacity and protection. A purpose-built backup appliance, for example, will typically incorporate deduplication, compression and encryption, with objectives that include protecting data from breaches and removing redundant data.

Deduplication is often associated with compression, and many storage systems support both data protection technologies because they optimize storage capacity and enable longer retention periods, lower bandwidth consumption and faster recoveries. An organization can use the two together, for example, by deduplicating a storage environment and then compressing the files.

How they diverge

The data protection technologies are similar, but they operate differently.

Deduplication looks for redundant pieces of data, while compression uses an algorithm to reduce the bits required to represent data. Deduplication is effective in organizations that have a lot of redundant data, such as backup systems that have several versions of the same file. Compression is effective in decreasing the size of unique files, such as images, videos and databases. While deduplication typically works at the block level, compression tends to work at the file level.

While deduplication and compression are storage-focused, encryption is more of a security feature.

How these three intersect

If an organization is using all three data protection technologies, it should dedupe and compress data prior to encryption. Encrypted files are tough to dedupe and compress because strong encryption makes data look essentially random, removing the redundancy those technologies rely on; an organization would likely need the key to unlock the files before either could do its job.
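The ordering can be demonstrated directly: once data is encrypted, compression has almost nothing left to work with. A quick sketch of that effect, reusing zlib and Fernet from the earlier examples:

```python
import zlib
from cryptography.fernet import Fernet  # third-party: pip install cryptography

data = b"backup record " * 10_000
cipher = Fernet(Fernet.generate_key())

# Compress first, then encrypt: the redundancy is still visible to zlib.
compress_then_encrypt = cipher.encrypt(zlib.compress(data))

# Encrypt first, then compress: the ciphertext has no usable redundancy,
# so little or nothing is saved.
encrypt_then_compress = zlib.compress(cipher.encrypt(data))

print(f"original:              {len(data)} bytes")
print(f"compress then encrypt: {len(compress_then_encrypt)} bytes")
print(f"encrypt then compress: {len(encrypt_then_compress)} bytes")
```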

Compression and deduplication can have a negative effect on storage performance. Both data protection technologies need substantial compute resources and may increase latency. One way to find out how well a storage system handles compression and deduplication is to test it against a representative workload.

In addition, organizations should be careful about resiliency when deduplicating and compressing. With both data protection methods removing redundant data, a storage environment may be left with one complete copy of content, so proper backup is an important element.

How deduplication and compression work with other technologies

Deduplication and compression often combine with other technologies for improved protection.

Deduplication, along with replication, aids disaster recovery. In this case, an organization would use deduplication first to reduce the amount of duplicate data and then replicate the data off-site.

Erasure coding, which reconstructs data that has been lost or corrupted, can also work with compression and deduplication to conserve storage capacity.

RAID also works with compression and deduplication. RAID spreads redundant data across multiple drives to protect it in case of a drive failure.

Next Steps

Best practices for encrypting data backups

Inline or post-processing deduplication?

See how deduplication and compression technologies work
