How is big data changing data archiving strategies?
Analyst Jon Toigo explains why performing analytics on big data sets means that data is no longer considered cold, reducing the importance of archives.
When we think about archiving strategies, we often talk about the past and what we know about our data. When it comes to the future, we cannot possibly know all of the events, trends and changes that will affect our archive requirements: the data we need to preserve, the platform we use to preserve it and the tools required to make it accessible to future users and systems. This realm of "unknown unknowns" unfortunately has a way of making an archiving strategy look to corporate bean counters more like a thought experiment than part of a real strategic initiative that will contain costs, reduce risk or improve productivity.
When archive programs are approved and funded, they are often run as standalone projects. In many cases, archive projects have their own staff and their own processor, network and storage infrastructure, quite apart from the production data center. Even cloud service providers such as Amazon Web Services and Google offer discrete archival storage services that are separate and distinct from their managed hosting and storage services.
However, with the emergence of interest in big data analytics and the appearance of server-side and software-defined storage infrastructures, the model of the standalone archive strategy is being called into question. Big data analytics, which applies a set of technologies to examining ongoing trends across multiple, otherwise unrelated data sets, treats no data as archival. Instead, all data is active and has value in day-to-day business decision-making and problem resolution. An archive has no real meaning in such a framework.
Moreover, the movement of storage architecture away from a centralized pool or repository and toward discrete, direct-attached, server-side configurations on individual server nodes in a cluster -- whether to support federated processing strategies such as Hadoop and MapReduce or workload virtualization strategies such as VMware Virtual SAN and Microsoft Clustered Storage Spaces -- is challenging traditional notions of storage tiering, in which the tertiary tier holds archival data. Companies embracing these so-called agile philosophies of infrastructure design have no place for a standalone archive practice.

Clearly, the way we conceive of an archive needs to change. We need to stop thinking of an archive as a set of operations and infrastructure separate and distinct from production operations and infrastructure -- a "bolt-on" collection of technologies and services -- and instead look to an archive-in-place strategy. Archive in place, fundamentally, means leaving archival data where it is physically located, but marking the data and perhaps applying special services to it that befit its archival class.
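To make the idea concrete, the sketch below shows one way an archive-in-place pass might mark data without moving it: files that have not changed within a given window are simply recorded in a sidecar catalog with an archival class tag, while the bytes stay on the production file system. The directory paths, age threshold, and catalog format here are illustrative assumptions, not any particular vendor's implementation.

```python
"""Minimal sketch of an archive-in-place pass over a POSIX file tree.

Data is never moved. Files untouched for `age_days` are recorded in a
sidecar catalog with an 'archive' class marker; field names, threshold
and catalog location are assumptions for illustration only.
"""
import json
import os
import time
from pathlib import Path


def mark_archive_in_place(root: str, age_days: int, catalog_path: str) -> int:
    cutoff = time.time() - age_days * 86400
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                stat = path.stat()
            except OSError:
                continue  # skip files that vanish mid-scan
            if stat.st_mtime < cutoff:
                entries.append({
                    "path": str(path),           # data stays where it is
                    "bytes": stat.st_size,
                    "last_modified": stat.st_mtime,
                    "class": "archive",          # only the classification changes
                })
    Path(catalog_path).write_text(json.dumps(entries, indent=2))
    return len(entries)


if __name__ == "__main__":
    # Hypothetical paths and retention window for the sake of the example.
    count = mark_archive_in_place("/data/production", age_days=365,
                                  catalog_path="/data/archive_catalog.json")
    print(f"{count} files marked archival in place")
```

A real implementation would hang additional services off that classification -- retention enforcement, legal hold, reduced replication -- but the essential point is the same: the archive becomes a set of attributes applied to data where it lives, not a separate infrastructure the data is shipped off to.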