Wikibon CTO: Erasure coding can help reduce data backup costs

Erasure coding can help reduce the data needed for backups and lower the cost of backup and recovery compared with the traditional approach, Wikibon's CTO says.

Erasure coding can help to reduce the amount of data required for backups and substantially lower the cost of backup and recovery compared to what many IT shops pay today, according to one industry analyst.

David Floyer, co-founder and chief technology officer at Wikibon, a community-focused research and analysis firm based in Marlborough, Mass., said the firm did a detailed study on the cost of traditional forms of backup and recovery compared to a fresh approach that would make use of erasure coding.

A single dispersed copy used with a system that incorporates erasure coding would be "more accessible and more reliable than a best-of-breed, three-data-center, array-based synchronization topology," Floyer said. It would also be "somewhere in the order of nine to 25 times less expensive," for the same level of recovery point objective and recovery time objective (RPO and RTO), he said.

"So, by combining erasure coding with other technologies, such as snapshots, deduplication, compression of the data," Floyer said, "vastly reduced cost of backups can be achieved with actually higher availability and higher recoverability."

In this podcast interview with TechTarget Senior Writer Carol Sliwa, Floyer also shared his views on the advantages and disadvantages of erasure coding, the amount of data for which erasure coding makes sense, the decision point on the amount of erasure coding to do, and the long-term potential uses of the technology.

What do you see as the main upside and the main downside of erasure coding?

David Floyer: The main upside is flexibility. You can choose the level of protection very, very easily indeed, and you can dial it up or dial it down. For example, if you have 16 slices [of data] and you want to add four slices [with erasure coding], that's a far more efficient way of doing it than in the traditional RAID format. It'll give you a lot more upside. You can lose essentially four slices and still recover the data -- and you can lose any four. It's more efficient, and it's more flexible.
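Floyer's 16-plus-4 example is the standard k-of-n property of Reed-Solomon-style erasure codes: any k of the n slices are enough to rebuild the data. The sketch below is a toy illustration of that property, not anything from Wikibon; it encodes k data symbols across n slices using polynomial evaluation over a small prime field, whereas production systems use systematic codes over GF(2^8).

```python
# Toy k-of-n erasure coding (Reed-Solomon style) over a prime field.
# Illustration only: real systems use systematic codes over GF(2^8).

PRIME = 257  # field large enough to hold one byte per symbol

def encode(data_symbols, n):
    """Encode k data symbols into n coded slices (n > k).

    The k symbols become the coefficients of a degree k-1 polynomial;
    slice i is the polynomial evaluated at x = i + 1.
    """
    k = len(data_symbols)
    assert n > k
    slices = []
    for i in range(n):
        x = i + 1
        y = sum(c * pow(x, j, PRIME) for j, c in enumerate(data_symbols)) % PRIME
        slices.append((x, y))
    return slices

def decode(surviving_slices, k):
    """Recover the k data symbols from ANY k surviving slices
    by Lagrange-interpolating the polynomial's coefficients."""
    assert len(surviving_slices) >= k
    points = surviving_slices[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        basis = [1]        # coefficients of the i-th Lagrange basis polynomial
        denom = 1
        for j, (xj, _) in enumerate(points):
            if i == j:
                continue
            # multiply basis by (x - xj)
            new = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):
                new[d] -= c * xj
                new[d + 1] += c
            basis = [c % PRIME for c in new]
            denom = denom * (xi - xj) % PRIME
        scale = yi * pow(denom, -1, PRIME) % PRIME
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + c * scale) % PRIME
    return coeffs

if __name__ == "__main__":
    data = [72, 101, 108, 108]          # k = 4 data symbols
    slices = encode(data, n=6)          # spread across n = 6 slices (4 + 2)
    survivors = slices[2:]              # lose ANY 2 of the 6 -- here the first 2
    assert decode(survivors, k=4) == data
    print("recovered:", decode(survivors, k=4))
```

With a 16-plus-4 code the same logic applies: any 16 of the 20 slices reconstruct the data, so any four can be lost.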

The main downside of erasure coding is that the more protection you give, the higher the overhead that you have in decoding it. You have to bring in all of the slices, and you have to process them, and there's a lot of processing. So, in general, the higher the read rate, the greater the overhead, both in elapsed time and in CPU resources to actually do the erasure coding.
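To make the overhead point concrete, here is a rough back-of-the-envelope sketch (the model and numbers are illustrative, not Wikibon's): a degraded read under n-way replication fetches one surviving full copy and does no math, while a degraded read under a k+m erasure code has to fetch k slices and feed all of them through the decoder, so the fan-out and CPU work grow with k.

```python
# Rough degraded-read comparison: replication vs. k+m erasure coding.
# Illustrative model only; real systems overlap fetches and decode work.

def degraded_read_cost(object_mb, scheme):
    kind, k, m = scheme
    if kind == "replication":
        # One surviving full copy is read back as-is; no decode step.
        return {"fetches": 1, "mb_read": object_mb, "decode_inputs": 0}
    # Erasure coding: any k of the k+m slices must be fetched and decoded.
    return {"fetches": k, "mb_read": object_mb, "decode_inputs": k}

for scheme in [("replication", 3, 0), ("erasure", 10, 4), ("erasure", 16, 4)]:
    print(scheme, degraded_read_cost(1024, scheme))
```

The bytes moved are similar in both cases; what grows with k is the number of devices or nodes touched and the amount of data that has to pass through the decode math.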

Can erasure coding eliminate the need for backups?

Floyer: It doesn't eliminate the requirement for backups. What it does is help reduce the amount of data that's required for a backup, and allow new models of backup and of using the data for more than just backup -- using it for backup and archive, for example.

If you think about backup, you've got two major factors: the RPO, the recovery point objective, which is how much data you're going to lose if you have a disaster; and the RTO, the recovery time objective, which is how quickly you can get the backup system up and ready. And you have a degree of reliability around that second copy. There is obviously a chance that you will lose both the primary and the secondary copies. If you look, for example, at how the banks manage this, they usually go with at least a three-system copy. One of the large banks keeps four copies of the data in its email system, so that at any point in time it can fail over to any one of the four copies. So, you can see that people are getting very sophisticated, demanding high availability and very aggressive RTOs and RPOs as well.
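As a quick worked example of those two factors (illustrative numbers, not figures from the interview): RPO is bounded by how often changes are captured and shipped, while RTO is roughly the time needed to stream the data back and bring the system up.

```python
# Back-of-the-envelope RPO/RTO arithmetic with invented numbers.

dataset_tb        = 50      # size of the protected data set
backup_interval_h = 4       # how often changes are shipped off-site
restore_gbps      = 2       # effective restore throughput

rpo_hours = backup_interval_h                            # worst case: lose one interval of changes
rto_hours = (dataset_tb * 8_000) / restore_gbps / 3600   # hours to stream the data back

print(f"RPO <= {rpo_hours} h, RTO ~= {rto_hours:.1f} h")
```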

The reason erasure coding is so important is that if you can spread that data out over multiple locations, or within a location, your level of redundancy is so much higher, and the amount of data that you actually have to transport is very significantly reduced. If you combine that with other techniques, such as snapshotting, deduplication and compression, you can get to environments where you really reduce the amount of data you're sending over the network and get very high levels of availability at much, much lower cost.
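A simple capacity comparison shows where part of that saving comes from (the arithmetic below is illustrative and covers raw capacity only; Wikibon's study also factors in snapshots, deduplication and compression): three full copies in three data centers consume three times the usable capacity, while one dispersed copy cut into k data slices plus m coded slices consumes only (k + m)/k.

```python
# Raw capacity needed per usable terabyte: full replicas vs. dispersed
# erasure coding. Illustrative schemes; pick (k, m) to match the number
# of simultaneous failures you need to survive.

def raw_per_usable(scheme):
    kind, k, m = scheme
    if kind == "replicas":
        return float(k)            # k full copies of every byte
    return (k + m) / k             # k + m slices hold k slices' worth of data

for scheme in [("replicas", 3, 0), ("erasure", 10, 6), ("erasure", 16, 4)]:
    print(scheme, f"{raw_per_usable(scheme):.2f}x raw per usable TB")
```

A 16+4 dispersed layout needs 1.25x raw capacity per usable terabyte, versus 3x for triple replication.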

We did an estimate that, with erasure coding for high-availability backup, you could reduce the overall cost by a factor of somewhere north of 10. [It would be] one-tenth the cost of doing traditional copies and backup, whether taking it out over tape or in real time. So, [there would be] very significant reductions in cost.

But you still need the backups, because you need that point in time. You need the logs to recover from, just in case of a software error. You need constant transfer of data to be able to recover quickly. So, we won't get rid of backup any time soon, but erasure coding will be part of the backup solution and make it significantly cheaper.

What's the minimum threshold of data at which an IT shop should consider erasure coding?

Floyer: If you are below a petabyte of data, the amount of savings that you will get at the moment from erasure coding will be small. You may as well use RAID 6. You'll do that under the covers. But putting in a system for erasure coding is not going to make too much business sense at the moment.

If you look at the major storage cloud providers that are using erasure coding, they are, for example, very large photo sites or very large audio sites, sites where you've got a large amount of data, where it's in a sort of cache when it's used, but then it is in archive mode. You want to make it safe by being able to spread that in different locations. You want to be able to get it back, but instant response is not necessary. If it comes back in a minute, that's fine.

How does an end user go about making the decision of how much erasure coding to do?

Floyer: I think that will come at the application design or application implementation stage. As you deploy the application, you will put in a set of storage appropriate for it. If you move an application to another set of storage as part of a migration, you might consider changing it to take erasure coding into account then. But the overhead of change is so high that that's probably unlikely.

So, the best way of looking at it is: With this new application I'm putting in, does it make sense to use erasure coding from the get-go, for this particular piece of software that I'm implementing? If it does, then you make the decision then. Retrospectively adding it may be appropriate later on, in five years' time, but at the moment, that's not the way to think about it. If you've got a new application, you should let the archive vendor be the one who sorts that out for you, has it set up in the right way and allows you to take advantage of that erasure coding. And it's the ISVs [independent software vendors] in general who will drive this, because it'll reduce the cost of the hardware to run it on, and in a value-based environment, they will be able to extract more of that value for themselves rather than the user spending it on the hardware.

What's your vision on how enterprise IT shops will use erasure coding beyond RAID 5 or RAID 6 in the short term and the long term?

Floyer: In the short term, erasure coding will come in on the disk side as an alternative and give greater flexibility to storage administration guys to decide on the balance between protection and cost. So, that is clearly going to happen. It is happening already, and it'll happen over the next few years.

The long-term vision, I think, is clearer: Erasure coding is going to go and hide itself. As flash becomes more prevalent, a huge amount of what flash does is erasure coding. Flash storage technology is inherently very dirty. It has a lot of errors, and it uses erasure coding in many directions in order to preserve confidence in the data itself. So, it will be subsumed into just part of the technology.

One of the very interesting thought experiments is: Where would you see, for example, an all-flash array? Could all enterprise storage be on an all-flash array? And if you think about it, what you ought to be able to do is separate out IOPS or bandwidth, the access to data, either reading or writing, from the storage itself. And if you are writing data or reading it, then there is going to be some degree of degradation of flash storage. So, what you want then is to be able to vary the degree of protection of that storage according to whether it's being written to a lot or whether it's just staying there without any access at all.

And the obvious answer on how to do that is varying the erasure coding. So, what we could expect to see is all-flash arrays, maybe of more than one type, where you use erasure coding to provide, under the covers, a type of tiering without having to move data. And that's very important. As soon as you start moving data, especially with flash, you are creating wear. So, this way of having wear as one of the parameters could be a very interesting way of bringing the cost of flash down, such that you could use it for pretty well every part of enterprise storage. That's a long-term vision. It won't get there for five years or so. But I think it's a very interesting one, and one that erasure coding will enable.
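One way to picture that tiering-without-moving-data idea is a policy that assigns a wider or narrower code to a flash extent based on how heavily it is written, rather than migrating it between tiers. The sketch below is purely hypothetical: the thresholds and (k, m) pairs are invented, and it assumes that more parity slices mean more write amplification, and therefore more wear, on heavily rewritten data.

```python
# Hypothetical protection-policy sketch: choose the code width per extent
# from its write rate instead of moving the data between tiers.
# Thresholds and (k, m) pairs are invented for illustration.

def pick_code(writes_per_day_per_gb):
    if writes_per_day_per_gb > 10:     # hot, heavily rewritten extents
        return (10, 2)                 # fewer parity slices, less write amplification
    if writes_per_day_per_gb > 1:      # warm extents
        return (12, 4)
    return (16, 8)                     # cold/archival extents: maximum protection

for load in (25, 3, 0.1):
    k, m = pick_code(load)
    print(f"{load:>5} writes/day/GB -> {k}+{m} code, {(k + m) / k:.2f}x overhead")
```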
