erasure coding (EC)
What is erasure coding (EC)?
Erasure coding (EC) is a method of data protection in which data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a set of different locations or storage media.
If a drive fails or data becomes corrupted, the data can be reconstructed from the segments stored on the other drives. In this way, EC can help increase data redundancy without the overhead or limitations that come with different implementations of RAID (redundant array of independent disks).
How does erasure coding work?
Erasure coding works by splitting a unit of data, such as a file or object, into multiple fragments, or data blocks, and then creating additional fragments, or parity blocks, that can be used for data recovery. For each parity fragment, the erasure coding algorithm calculates the parity's value based on the original data fragments. The data and parity fragments are stored across multiple drives to protect against data loss in case a drive fails or data becomes corrupted on one of the drives. If such an event occurs, the parity fragments can be used to rebuild the data unit without experiencing data loss.
For example, a storage system might use a 5+2 encoding configuration to distribute data across multiple physical drives. In this configuration, the EC algorithm breaks each data unit into five data fragments and then adds two parity fragments, which are calculated from the original data. Each fragment is stored on a different physical drive. As a result, the storage system must include at least seven drives.
In a 5+2 configuration, the parity data consumes about 29% of raw capacity (two of every seven fragments). The configuration can tolerate up to two simultaneous disk failures, whether the failed disks contain data fragments or parity fragments. EC is flexible enough to support a wide range of configurations. For example, a 17+3 encoding splits each data unit into 17 data fragments and then adds three parity fragments. Although this configuration requires at least 20 physical drives, it can tolerate up to three simultaneous disk failures while reducing the parity overhead to 15% of raw capacity (three of every 20 fragments).
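The capacity and fault tolerance figures above follow from simple arithmetic on the number of data and parity fragments. The following is a minimal sketch of that arithmetic. The helper name `ec_profile` is hypothetical and isn't part of any storage product's API, and overhead is measured here as the parity share of raw capacity.

```python
from fractions import Fraction


def ec_profile(data_fragments: int, parity_fragments: int) -> dict:
    """Summarize a k+m erasure coding layout (illustrative helper only)."""
    total = data_fragments + parity_fragments
    return {
        # One fragment per drive, so the drive count equals the fragment count.
        "min_drives": total,
        # Any m fragments can be lost, whether they hold data or parity.
        "tolerated_failures": parity_fragments,
        "parity_share_of_raw": Fraction(parity_fragments, total),
        "usable_share_of_raw": Fraction(data_fragments, total),
    }


for k, m in [(5, 2), (17, 3)]:
    profile = ec_profile(k, m)
    print(
        f"{k}+{m}: {profile['min_drives']} drives, "
        f"tolerates {profile['tolerated_failures']} failures, "
        f"parity uses {float(profile['parity_share_of_raw']):.1%} of raw capacity"
    )
```

Measured this way, the 5+2 layout spends 2/7 of raw capacity on parity and the 17+3 layout spends 3/20, which is why wider configurations are more space-efficient even though they need more drives.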
Erasure coding makes it possible to protect data without having to fully replicate it because the data can be reconstructed from parity fragments. For instance, in a simple 2+1 configuration, a data unit is split into two data fragments with one parity fragment added for protection. If an application tries to read the data and both data fragments are available, the operation proceeds as normal, even if the parity fragment is unavailable.
However, if only one of the two data fragments is available, data is read from the surviving data fragment and the parity fragment. Together, these two fragments are used to reconstruct the data that was in the missing fragment, making it possible to continue data operations while the disk is being rebuilt.
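The 2+1 recovery path can be sketched in a few lines of code. The example below assumes the parity fragment is a byte-wise XOR of the two data fragments, which is one simple way to compute a single parity fragment. The text above doesn't specify an algorithm, and systems that need more than one parity fragment typically rely on more general codes, such as Reed-Solomon. The function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    if len(a) != len(b):
        raise ValueError("fragments must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_2_plus_1(data: bytes) -> tuple[bytes, bytes, bytes]:
    """Split data into two fragments and derive one XOR parity fragment."""
    if len(data) % 2:
        # Pad to an even length. This is a sketch only: a real system
        # would also record the original length so padding can be removed.
        data += b"\x00"
    half = len(data) // 2
    d1, d2 = data[:half], data[half:]
    return d1, d2, xor_bytes(d1, d2)


d1, d2, parity = encode_2_plus_1(b"erasure coding!!")

# Simulate losing the drive that holds the second data fragment:
# XOR of the surviving data fragment and the parity restores it.
rebuilt_d2 = xor_bytes(d1, parity)
assert rebuilt_d2 == d2
print((d1 + rebuilt_d2).decode())
```

The same operation works in the other direction: if the first data fragment is lost, XORing the second data fragment with the parity fragment restores it.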
What is RAID?
RAID relies on two primary mechanisms for protecting data: mirroring and striping with parity. Mirroring is one of the most basic forms of data protection. When used alone, it's referred to as RAID 1. In this configuration, multiple copies of the data are stored on two or more drives. If one drive fails, the data can be retrieved from one of the other drives without interruption to service. Mirroring is easy to set up and maintain, but it uses a large amount of storage resources, similar to any form of replication.
Striping with parity, referred to as RAID 5, stripes data across multiple hard disks and adds parity blocks to protect the data. If a drive fails, the missing data can be reconstructed using the data on the other disks. However, RAID 5 can support only one disk failure at a time. For this reason, some vendors offer RAID 6 storage systems, which can handle up to two simultaneous disk failures. Different RAID configurations can also be combined, as in RAID 10, which uses disk mirroring and data striping without parity to protect data.
Erasure coding vs. RAID
Erasure codes, also known as forward error correction codes, were developed more than 50 years ago to help detect and correct errors in data transmissions. The technology has since been adopted for storage to help protect data in the event of a drive failure or data corruption. More recently, EC has been gaining popularity for use with large object-based data sets, particularly those in the cloud. As data sets continue to grow and object storage is more widely adopted, EC is becoming an increasingly viable alternative to RAID.
The following are a few pros and cons of EC and RAID technologies:
- Erasure coding can exceed RAID 6 in terms of the number of failed drives that can be tolerated, increasing the level of fault tolerance. In a 10+6 erasure coding configuration, 16 data and parity segments are spread across 16 drives, making it possible to handle up to six simultaneous drive failures.
- Erasure coding is much more flexible than RAID, whose configurations are fairly rigid. With EC, organizations can implement a storage system to meet their specific data protection requirements. In addition, EC can reduce the amount of time it takes to rebuild a disk that has failed, depending on the configuration and number of disks.
- Erasure coding has the drawback of being a processing-intensive operation. The EC algorithm must run against all data written to storage, and the data and parity segments must be written across all participating disks. If a disk fails, rebuild operations put an even greater strain on central processing unit resources because the data must be reconstructed on the fly.
- RAID is a well-established technology with fixed levels of redundancy and parity, making its execution relatively straightforward.
- RAID configurations, whether mirroring or striping with parity, typically have much less of an effect on performance than erasure coding and can often improve it.
- RAID can have higher storage overhead than erasure coding when it relies on mirroring, as in RAID 1 or RAID 10, because mirroring duplicates all the data rather than storing only parity information.
- The various RAID configurations have been integral to data center operations for many years, but they do come with significant challenges. For example, mirroring is inefficient when it comes to resource utilization, and striping with parity can protect against only two simultaneous disk failures at best.
- Another issue with RAID is related to capacity. As disk drives become larger, it takes much more time to rebuild a drive if it should fail. Not only can a failed disk and its rebuild affect application performance, but the longer rebuild window also increases the risk of losing data. For example, if a drive fails in a RAID 5 configuration, it might take days to rebuild that drive, leaving the storage array in a vulnerable position until the rebuild is complete.
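The link between drive capacity and rebuild time in the last point can be illustrated with a rough estimate. The sketch below assumes the entire drive is rewritten at a constant sustained rate. The function name, drive sizes and rebuild rates are hypothetical values chosen for illustration, not measurements, and real rebuild times also depend on the controller, the workload and the RAID level.

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Idealized rebuild time: the whole drive rewritten at a constant rate."""
    capacity_bytes = capacity_tb * 1e12  # decimal terabytes
    seconds = capacity_bytes / (rebuild_mb_per_s * 1e6)
    return seconds / 3600


# Hypothetical drive sizes and sustained rebuild rates, for illustration only.
for capacity_tb in (4, 12, 20):
    for rate in (50, 100):
        hours = rebuild_hours(capacity_tb, rate)
        print(
            f"{capacity_tb:>2} TB at {rate:>3} MB/s: "
            f"{hours:6.1f} h ({hours / 24:.1f} days)"
        )
```

Under these assumptions, rebuild time grows linearly with capacity, so a drive five times larger leaves the array exposed for five times as long at the same rebuild rate.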
Erasure coding vs. replication
Erasure coding and replication are both methods used to ensure data resilience and availability, but they differ in their approach and the tradeoffs they offer.
The main differences between the two approaches include the following:
- Replication is a simple process that involves keeping exact duplicates of the original data on different storage nodes, whereas erasure coding distributes data fragments and calculated parity fragments across multiple nodes.
- Replication generally requires more storage space compared to erasure coding because it stores full copies of data on each storage device or node. On the other hand, erasure coding typically requires less storage overhead because it generates parity blocks instead of storing full replicas of data, making it more resource-efficient in such scenarios.
- Erasure coding is highly scalable and well suited for distributed storage systems with a large number of nodes, as it can efficiently distribute data and parity blocks across the storage infrastructure. Replication can become less efficient and more resource-intensive as copies increase, especially in large-scale distributed storage systems.
- Erasure coding works best with cold data that has served its initial purpose and is accessed and modified less frequently. Replication is more suitable for hot or highly valuable data, which is accessed and modified regularly.
- When dealing with a chunk of data smaller than the block size, erasure coding might produce more blocks compared to replication because of the additional parity blocks needed, resulting in heightened memory consumption.
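The storage overhead difference described above can be worked through with a short calculation. This is a minimal sketch: the function names are hypothetical, and the three-copy replication and 6+3 erasure coding settings are example configurations chosen for illustration rather than recommendations.

```python
def raw_bytes_replication(logical_bytes: int, copies: int) -> int:
    """Raw capacity consumed when every byte is stored `copies` times."""
    return logical_bytes * copies


def raw_bytes_erasure_coded(logical_bytes: int, k: int, m: int) -> float:
    """Raw capacity consumed by a k+m layout: k data plus m parity fragments."""
    return logical_bytes * (k + m) / k


logical = 10**12  # 1 TB of application data (decimal terabyte)

replicated = raw_bytes_replication(logical, copies=3)
encoded = raw_bytes_erasure_coded(logical, k=6, m=3)

print(f"3-way replication: {replicated / logical:.2f}x raw capacity")
print(f"6+3 erasure coding: {encoded / logical:.2f}x raw capacity")
```

In this example, three-way replication consumes three times the logical data size and survives the loss of two copies, while the 6+3 layout consumes one and a half times the logical size and survives the loss of any three fragments.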
Key use cases for erasure coding
Erasure coding is an essential part of object-based cloud storage and is ideal for environments that require high levels of data security and disaster recovery (DR).
The following are some key use cases of erasure coding:
- Distributed storage systems. Erasure coding is especially useful for distributed storage applications and ensures data durability across multiple nodes, even in the presence of network disruptions.
- Disk arrays. Erasure coding enhances fault tolerance in disk array configurations, mitigating the risk of data loss due to disk failures.
- Data grids. Erasure coding enables efficient data distribution and replication in data grid architectures, facilitating reliable access to large data sets.
- Cloud data stores. Major cloud storage services, such as Amazon Simple Storage Service (S3), Microsoft Azure and Google Cloud, use erasure coding extensively to protect their vast stores of data.
- Object-based storage. Erasure coding has proven especially beneficial for protecting object-based storage systems, as well as distributed systems, making it well suited to cloud storage services. That said, erasure coding has also been making its way into on-premises object storage systems, such as the Dell Elastic Cloud Storage object storage platform.
- Large data sets. Erasure coding can be useful with large quantities of data and any applications or systems that must tolerate failures, such as disk array systems, data grids, distributed storage applications, object stores and archival storage. Most of today's use cases revolve around large data sets for which RAID isn't a practical option. To support EC, the infrastructure must be able to deliver the necessary performance, which is why its predominant use case has been with major cloud services.
- Backups and archives. Erasure coding is often recommended for storage such as backups or archives -- the types of data sets that are fairly static and not write-intensive. However, erasure coding is finding its way into a variety of systems trying to avoid the high costs of replication. For example, many Hadoop Distributed File System setups now use EC to reduce the overhead associated with storing redundant data across data nodes.
What are the benefits of erasure coding?
EC offers several important benefits that should be considered when planning data storage:
- Better resource utilization. Replication techniques, such as RAID 1 mirroring, use a high percentage of storage capacity for data copies. Erasure coding can significantly reduce storage consumption, while still protecting data. This is because the parity or erasure codes are distributed among several nodes to provide redundancy without requiring complete data replication. The exact capacity savings depend on the encoding configuration, but any savings translate to greater storage efficiency and lower storage costs.
- Lower risk of data loss. When a RAID array is made up of high-capacity disks, rebuilding a failed drive can take an extremely long time, which increases the risk of data loss should another drive fail before the first one can be rebuilt. Erasure coding can handle many more simultaneous disk failures, depending on the encoding configuration, which means that there is a lower risk of data loss if a drive goes down.
- Greater flexibility. RAID tends to be limited to fairly fixed configurations. Although vendors can implement proprietary RAID configurations, most RAID implementations are fairly standard. Erasure coding provides far more flexibility. Organizations can choose the data-to-parity ratio that best fits their specific workloads and storage systems.
- Greater durability. Erasure coding enables an organization to configure a storage system that offers a high degree of availability and durability. For example, Amazon S3 is designed for 99.999999999% object durability across multiple Availability Zones. Unlike RAID 6, which can sustain only two simultaneous disk failures, an EC-based system can be configured to handle substantially more.
- Enhanced resiliency. Due to the distributed nature of erasure-coded data, the system can retrieve the original data even in the case of several failures or losses. This is especially useful in archive or cloud storage systems where data integrity is crucial.
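The effect of tolerating more simultaneous failures can be illustrated with a simple probability model. The sketch below assumes that each drive fails independently with the same probability during some exposure window, such as the time needed to complete a rebuild. Both the independence assumption and the 2% failure probability are hypothetical simplifications, and the function name is illustrative only.

```python
from math import comb


def loss_probability(total_drives: int, tolerated: int, p_fail: float) -> float:
    """P(more than `tolerated` of `total_drives` fail), assuming independence."""
    return sum(
        comb(total_drives, failed)
        * p_fail**failed
        * (1 - p_fail) ** (total_drives - failed)
        for failed in range(tolerated + 1, total_drives + 1)
    )


p = 0.02  # hypothetical per-drive failure probability during the window

raid6 = loss_probability(total_drives=8, tolerated=2, p_fail=p)    # 6 data + 2 parity
ec_10_6 = loss_probability(total_drives=16, tolerated=6, p_fail=p)  # 10 data + 6 parity

print(f"8-drive RAID 6 (tolerates 2): {raid6:.2e}")
print(f"10+6 erasure coding (tolerates 6): {ec_10_6:.2e}")
```

Under this model, the 10+6 layout has a much lower probability of losing data than the eight-drive RAID 6 array, even though it uses twice as many drives, because seven drives would have to fail in the same window.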
When planning their storage strategies, organizations must consider several factors, including how to protect against data loss and provide DR. Straightforward replication is one approach, and RAID is another. Erasure coding is yet one more.
Each strategy comes with advantages and disadvantages. However, with the growing amount of data and continued move to object storage, EC is destined to gain momentum. Erasure coding enables organizations to meet their scalability needs and still protect their data without incurring the high costs of full replication. Even so, no technology can flourish without adapting to industry changes, and the EC in service today could look much different five years down the road.