How do I decide when to deduplicate data and where?
On the surface, data deduplication is a positive because it eliminates redundant data. For the best results, use the technology carefully and at the right moment.
Data deduplication has the potential to significantly reduce an organization's storage footprint. Even so, the way in which you deduplicate data can play a major role in its effectiveness.
The first thing you must understand about data deduplication is that the deduplication ratios advertised by vendors -- 25:1, 50:1 and so on -- are usually best-case estimates. There is no way a vendor can guarantee the ratio by which your data footprint can be reduced. That's because the nature of your data is the single most important factor in determining how effectively a vendor's product can deduplicate it.
Deduplication works by removing duplicate data. If no redundancy exists within the data, then no deduplication engine will be able to reduce the data's footprint. Types of data that tend not to benefit from deduplication include compressed media files, such as MPEG and JPEG; compressed archive files, such as ZIP and CAB; and scientific data, which is often close to random.
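To see why the nature of the data matters so much, consider the following minimal sketch. It assumes a simplified engine that splits data into fixed-size chunks and stores each unique chunk once; real products often use variable-size chunking and other techniques. The names `CHUNK_SIZE` and `dedupe_ratio` are hypothetical, chosen only for illustration.

```python
import hashlib
import os

CHUNK_SIZE = 4096  # assumed fixed chunk size; real engines often use variable-size chunks


def dedupe_ratio(data: bytes, chunk_size: int = CHUNK_SIZE) -> float:
    """Return logical size divided by stored size after chunk-level deduplication."""
    if not data:
        return 1.0
    unique_chunks = {}
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Chunks with the same SHA-256 digest are stored only once.
        unique_chunks.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    stored_size = sum(unique_chunks.values())
    return len(data) / stored_size


# Highly redundant data: the same chunk repeated 50 times.
redundant = (b"A" * CHUNK_SIZE) * 50
# Random bytes stand in for compressed media, archives or random scientific data.
random_like = os.urandom(CHUNK_SIZE * 50)

print(dedupe_ratio(redundant))    # 50.0
print(dedupe_ratio(random_like))  # effectively 1.0: no repeated chunks to remove
```

The same engine yields 50:1 on one data set and essentially 1:1 on the other, which is why an advertised ratio says little about your own data.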
The way in which systems perform deduplication can also make a difference. Most engines deduplicate data either inline or post-process.
Inline, post-process, global: Which dedupe, if any, is right for you?
Inline deduplication happens in real time. If, for example, you continuously stream data to the cloud, then inline deduplication could be beneficial because it may be able to shrink the data before it is transmitted, thereby reducing the required bandwidth and transfer time.
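As a rough illustration of the bandwidth saving, the sketch below assumes a sender that hashes each chunk before transmission and sends only a short reference when the chunk has already been sent. The function `inline_dedupe` and the message format are hypothetical, not any particular product's protocol.

```python
import hashlib


def inline_dedupe(chunks, seen=None):
    """Yield ('data', digest, chunk) for new chunks and ('ref', digest, None) for repeats."""
    seen = set() if seen is None else seen
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in seen:
            # Already transmitted: send only a reference to the earlier copy.
            yield ("ref", digest, None)
        else:
            seen.add(digest)
            yield ("data", digest, chunk)


stream = [b"header", b"payload-1", b"header", b"payload-1", b"payload-2"]
messages = list(inline_dedupe(stream))

raw_bytes = sum(len(chunk) for chunk in stream)
sent_bytes = sum(len(chunk) for kind, _, chunk in messages if kind == "data")
print(raw_bytes, sent_bytes)  # 39 24
```

In this toy stream, only three of the five chunks cross the wire. A real engine would also spend some bandwidth on the references themselves, which the sketch ignores.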
Post-process deduplication runs on a scheduled basis. A post-process deduplication engine might, for example, deduplicate data at 11 p.m. each night.
Post-process deduplication can sometimes achieve a higher ratio than inline deduplication, but it does have its disadvantages. For one thing, the storage repository needs to be large enough to hold the data in its full, not-yet-deduplicated form until the deduplication job runs. It might also require some additional storage space to accommodate the overhead associated with the deduplication process. Another disadvantage is that post-process deduplication tends to be resource-intensive, so you probably don't want to schedule the engine to deduplicate data in the middle of the workday.
Some organizations combine inline and post-process deduplication into a process called global deduplication. Imagine several different sources of data are being deduplicated inline and written to a common storage target. Although each stream of data has been deduplicated, there is always the possibility that there will be some cross-stream redundancy. Post-process deduplication can be used to eliminate this redundancy.
Global deduplication offers the best of both worlds. The inline deduplication engine minimizes the amount of data flowing across the wire, while the post-process deduplication engine removes any redundant data that may be present on the storage device.
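The two-stage idea can be sketched as follows, assuming chunk-level hashing and two hypothetical backup streams that share a common block. The functions `inline_pass` and `post_process_pass` are illustrative names, not part of any product.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def inline_pass(chunks):
    """Per-stream stage: drop chunks repeated within a single stream."""
    seen, kept = set(), []
    for chunk in chunks:
        d = digest(chunk)
        if d not in seen:
            seen.add(d)
            kept.append(chunk)
    return kept


def post_process_pass(stored_chunks):
    """Target-side stage: keep one copy of each chunk across all streams."""
    store = {}
    for chunk in stored_chunks:
        store.setdefault(digest(chunk), chunk)
    return store


stream_a = [b"os-image", b"app-a", b"os-image"]
stream_b = [b"os-image", b"app-b", b"app-b"]

# Each stream is deduplicated on its own before reaching the shared target.
target = inline_pass(stream_a) + inline_pass(stream_b)
# The scheduled pass then removes the cross-stream duplicate of b"os-image".
global_store = post_process_pass(target)

print(len(stream_a) + len(stream_b), len(target), len(global_store))  # 6 4 3
```

Six chunks are reduced to four by the inline stage and to three after the post-process stage, because the shared `os-image` chunk was invisible to each per-stream engine.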