How does deduplication in cloud computing work and is it beneficial?

The deduplication process reduces the amount of data in a storage system, but dedupe in the cloud may be more valuable to the cloud provider than the customer.

Deduplication in cloud and other storage platforms is a process by which repeated or duplicate data is removed from a data stream to reduce the amount of physical data stored in an appliance or system.

In primary storage, deduplication helps to reduce the amount of physical space consumed by removing identical blocks of data and using metadata to associate the logical copies of data to the physical ones. In the public cloud, the deduplication capabilities of the storage platform aren't exposed to the user.
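The block-and-metadata mechanism described above can be sketched in a few lines. This is a minimal illustration, not how any particular storage product works: the `DedupStore` class, the fixed 4 KB block size, and the use of SHA-256 fingerprints are all assumptions chosen for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems vary


class DedupStore:
    """Hypothetical in-memory store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (the physical copies)
        self.files = {}   # file name -> list of fingerprints (the metadata)

    def write(self, name, data):
        fingerprints = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            # Store the block only if this fingerprint has not been seen before.
            self.blocks.setdefault(fp, block)
            fingerprints.append(fp)
        self.files[name] = fingerprints

    def read(self, name):
        # Follow the metadata from logical blocks back to physical blocks.
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def logical_bytes(self):
        return sum(len(self.blocks[fp])
                   for fps in self.files.values() for fp in fps)

    def physical_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
payload = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
store.write("copy1", payload)
store.write("copy2", payload)
print(store.logical_bytes(), store.physical_bytes())
```

Two identical three-block files occupy six logical blocks, but only the two unique blocks are physically stored. The gap between those two figures is the capacity a provider billing on logical usage would retain.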

If the provider chooses to implement deduplication, the provider keeps the benefit. Storage space is billed on the logical capacity used rather than the physical capacity, so any savings from the reduction go to the service provider, which can use them to offer a cheaper service or to reduce its own costs.

But for anyone using cloud storage for backup, there's an issue. Copying multiple backup images to the cloud will consume large volumes of storage, much more than if a deduplicating platform, such as a disk system, were used as the storage target.

There are several ways to address this problem. Many backup software platforms dedupe at the source and hold only the deduplicated data on physical storage. The backup software owns and manages the metadata that does the logical-to-physical translation.
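A rough sketch of source-side deduplication follows. It assumes fixed 4 KB blocks and SHA-256 fingerprints, and it uses a plain dictionary, `remote_blocks`, as a stand-in for the cloud target. The function names are hypothetical and do not correspond to any backup product's API. The returned `recipe` plays the role of the metadata that the backup software owns.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def backup_at_source(data, remote_blocks):
    """Send only blocks the target lacks; return (recipe, bytes_sent)."""
    recipe = []
    bytes_sent = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in remote_blocks:
            remote_blocks[fp] = block  # stands in for an upload to the cloud
            bytes_sent += len(block)
        recipe.append(fp)
    return recipe, bytes_sent


def restore(recipe, remote_blocks):
    """Rehydrate a backup by following its recipe back to stored blocks."""
    return b"".join(remote_blocks[fp] for fp in recipe)


remote = {}
first = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
second = first[:-BLOCK_SIZE] + b"\xff" * BLOCK_SIZE  # one block changed

recipe1, sent1 = backup_at_source(first, remote)
recipe2, sent2 = backup_at_source(second, remote)
print(sent1, sent2)
```

The second backup sends only the one changed block. Without the recipes, though, the stored blocks cannot be reassembled into the original backups, which is why ownership of the metadata matters.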

An alternative is to look for a storage gateway that can offer a storage interface and do the deduplication. In this instance, the administrator isn't dependent on the backup software, and data can be more easily imported into other platforms.

The most obvious issue with source-side deduplication is that whichever backup software is used will own the metadata, so, ideally, a deduplicating storage gateway is the better option. With a gateway, the data in the backup environment is portable outside of the backup software, without having to rehydrate the data to move it to another platform.

Beyond cloud storage, deduplication works well on groups of virtual machines, where the base operating system is similar or identical across multiple VMs.

In the backup world, deduplication is used to reduce the volume of physical data stored when doing repeated backups of the same data set, such as a VM. When only a small percentage (say 5% to 10%) of the actual data changes between backups, deduplication keeps the physical space consumed to a minimum. Backup systems can see deduplication rates of 20:1 and higher.
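As a rough illustration of how the change rate drives the ratio, the sketch below models repeated full backups of the same data set where a fixed fraction of the data changes between backups. The function name and the model itself are assumptions made for this example. The model ignores duplicates inside a single backup, compression, and metadata overhead, all of which affect real-world ratios.

```python
def dedupe_ratio(num_backups, change_rate):
    """Logical-to-physical ratio for repeated full backups (simplified model)."""
    if num_backups < 1 or not 0 <= change_rate <= 1:
        raise ValueError("need num_backups >= 1 and 0 <= change_rate <= 1")
    # Measure everything in units of one full backup.
    logical = num_backups
    # The first backup is stored whole; each later one adds only changed data.
    physical = 1 + (num_backups - 1) * change_rate
    return logical / physical


for retained in (7, 30, 365):
    print(retained, round(dedupe_ratio(retained, 0.05), 1))
```

In this model the ratio climbs with the number of retained backups and approaches 1 / change_rate, so at a 5% change rate it tends toward 20:1 as retention grows. Ratios above that would need a lower change rate or duplicate data within each backup.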
