Editor's note

Data archiving has never been a sexy concept. In general, it's considered the concern of museums and academic and special libraries: organizations that perceive value in old stuff and are typically funded differently than for-profit businesses.

That's a long way of saying that hardly anyone enters the study of computer technology or computer science with the goal of becoming a data archivist. In just about every aspect of computing, the emphasis is on the here and now -- doing things faster, more efficiently and with greater agility than in the past. Data that's no longer referenced with any frequency tends to fall off the radar. The only concern is that it not be deleted, because that could have devastating consequences. But data archiving has many benefits to offer besides regulatory compliance and historical preservation.

According to a study of more than 3,000 corporate storage infrastructures, as much as 40% of the capacity of every disk drive spinning in a company is occupied by data that hasn't been referenced in the last month, six months or one year. Yet each drive draws 7 W to 21 W of electrical power continuously to keep spinning; in addition, when drives fail, they're replaced and rewritten with the same data from backups or as part of a RAID set rebuild. That means we're wasting both electrical power (to energize drives and bleed off the heat they generate) and staff time (to confirm data and drive integrity and perform periodic maintenance), while building more capacity into our infrastructure to store new data year after year.
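To put rough numbers on that waste, here is a back-of-the-envelope sketch in Python. The 7 W to 21 W range and the 40% stale share come from the figures above; the fleet of 1,000 drives is purely hypothetical, and so is the simplifying assumption that the energy attributable to stale data scales with its share of capacity. Cooling overhead is left out.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; drives spin around the clock


def annual_drive_energy_kwh(drive_count, watts_per_drive):
    """Energy drawn per year by always-on drives, in kilowatt-hours."""
    return drive_count * watts_per_drive * HOURS_PER_YEAR / 1000


def stale_energy_range_kwh(drive_count, stale_fraction=0.40,
                           low_watts=7.0, high_watts=21.0):
    """Low/high estimate of yearly energy attributable to stale data.

    Assumption (illustrative only): energy scales with the share of
    capacity holding stale data. Heat removal is not included.
    """
    low = annual_drive_energy_kwh(drive_count, low_watts) * stale_fraction
    high = annual_drive_energy_kwh(drive_count, high_watts) * stale_fraction
    return low, high


if __name__ == "__main__":
    # Hypothetical fleet size, not a figure from the study.
    low, high = stale_energy_range_kwh(drive_count=1000)
    print(f"{low:,.0f} to {high:,.0f} kWh per year")
```

Under those assumptions, a 1,000-drive fleet spends somewhere between roughly 24,500 and 73,600 kWh a year keeping stale data spinning, before any cooling costs.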

Then there's the issue of productivity. It may seem inconsequential that a search for a string of words takes a few milliseconds longer as you clot up your file systems with more data, but multiply that by the number of searches conducted every day by all the employees, customers, and others in and outside of your firm who have permission to scan your data. You're talking about a lot of wasted time searching through the 40% of your data that's included in search results only because it's physically recorded in and around active data.
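The multiplication is easy to sketch. In the Python snippet below, the extra latency per search and the daily search volume are placeholder inputs, not measurements; plug in your own.

```python
def wasted_hours_per_year(extra_ms_per_search, searches_per_day,
                          days_per_year=365):
    """Cumulative time lost to slower searches, in hours per year."""
    total_ms = extra_ms_per_search * searches_per_day * days_per_year
    return total_ms / (1000 * 60 * 60)  # milliseconds -> hours


if __name__ == "__main__":
    # Hypothetical inputs: 5 ms of added latency, one million searches a day.
    print(f"{wasted_hours_per_year(5, 1_000_000):.0f} hours per year")
```

With those placeholder inputs, a delay nobody would notice on a single query adds up to about 500 hours a year.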

The bottom line is that the business-value case for data archiving -- based on cost containment (archive data to reduce the urgency to buy more storage capacity), risk reduction (archive data to ensure regulatory compliance) and improved productivity (archive data to get it out of the way of searches, report generation, backups and so on) -- is pretty persuasive, whether you believe the data has historical merit or not. As simple as this business case may be to understand and appreciate, archiving itself remains a mystery to many IT folk. There are myriad issues of methodology and technology to parse when you develop a vision for your archive system and a strategy for bringing it to bear on your current infrastructure.

Distinguishing deep archives from active archives is one of the first hurdles in understanding data archiving. There are many definitions and uses of the term archive, often skewed to support one vendor's products or another's.

Here's my definition: An archive is a collection of data -- a fact everyone cites, but one that does nothing to illuminate the strategy. A backup is also a collection of data, but backups aren't archives, at least not in the sense of long-term data preservation. Backups offer short-term protection of data assets against the corruption or deletion of the data itself or the breakage of the primary storage infrastructure. Backups are cycled and updated frequently to account for the latest data assets. Ideally, archival-quality data is periodically excluded from production backups, both because its restoration isn't generally needed in an emergency and because leaving it out makes the backup process faster.

Archives are stores of files or data sets that are rarely re-referenced. These data assets are usually retained in a separate storage system with its own processes for data management and protection, able to deliver speeds and feeds appropriate for efficient data ingestion and rare data access, once written. Capacity is prized over performance, but performance must still be provided that's adequate to the profile of reads and writes common in an archive platform.

Contrary to popular belief, archive systems aren't junk technology kept in service so they can perform the lowly task of storing old data no one cares about. They're part of an ecosystem of storage systems no more or less critical than your nimblest solid-state drive or hard disk drive hybrid array placed behind your most mission-critical transaction processing system. Without the archive system, the transaction system can't function cost effectively or at peak efficiency.

Still, some draw a distinction between deep archive and active archive. The former is what we have traditionally considered an archive to be: a collection of data with historical business value or specific regulatory or legal retention requirements that's seldom if ever accessed. Deep archive systems, because of their limited re-reference rates, are designed with specific attention to the container that will be used to hold data for later use and to the technology to which that container is written -- specifically, to the longevity of the media, including its interoperability with future data access technologies. It does no good to have a 20-year-old archive kept on technology that can no longer interface with contemporary servers or operating systems. That's a big concern of deep archiving.

Active archive, by contrast, refers to storing data that, once written, changes very infrequently. It may be read a lot, but it isn't modified, so it presents a different set of storage requirements than either read-, write- or modify-intensive primary storage systems or deep-archive systems, whose data is rarely read, written or modified. Think video: Once the television episode is recorded, it won't be modified, but it might be replayed (read) many times. The video file is archival data, but it's still active (read). That's active archive in a nutshell, but it's actually another kind of primary storage.
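One way to picture the three categories is as a simple decision rule over how often data is read and how often it's modified. The Python sketch below is illustrative only: the `AccessProfile` type, the `classify_tier` function and the one-per-month thresholds are all hypothetical, and a real policy would be tuned to the workload.

```python
from dataclasses import dataclass


@dataclass
class AccessProfile:
    """Observed activity for a file or data set (hypothetical structure)."""
    reads_per_month: float
    modifications_per_month: float


def classify_tier(profile, read_threshold=1.0, modify_threshold=1.0):
    """Map an access profile to a storage tier; thresholds are illustrative."""
    if profile.modifications_per_month >= modify_threshold:
        return "primary"         # data is still being changed
    if profile.reads_per_month >= read_threshold:
        return "active archive"  # stable but still read, like a TV episode
    return "deep archive"        # seldom if ever accessed


if __name__ == "__main__":
    episode = AccessProfile(reads_per_month=500, modifications_per_month=0)
    print(classify_tier(episode))
```

Modification is checked first because data that is still changing belongs on primary storage regardless of how often it's read.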

Finally, it's worth noting that archives don't just happen as data becomes stale. You need to plan them carefully. Hierarchical storage management (HSM) systems, which migrate data from faster storage to slower storage based on metadata such as the date last accessed and the date last modified, don't create an archive in the strict sense of the term. Still, HSM can help identify candidate data for inclusion in a deep or active archive, and it's also adept at identifying temporary data that can simply be deleted once its usefulness has been depleted.
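The metadata test that HSM-style tools rely on can be sketched in a few lines of Python. This is not how any particular HSM product works; it's a minimal illustration that walks a directory tree and flags files whose last-accessed and last-modified times are both older than a cutoff. The function name and the one-year default are assumptions, and access times can be unreliable on file systems mounted with options such as noatime or relatime.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_archive_candidates(root, min_idle_days=365, now=None):
    """Yield (path, idle_days) for files neither read nor modified recently."""
    now = time.time() if now is None else now
    cutoff_seconds = min_idle_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # A file counts as idle only if both timestamps are old.
            last_touched = max(info.st_atime, info.st_mtime)
            idle_seconds = now - last_touched
            if idle_seconds >= cutoff_seconds:
                yield path, idle_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for path, idle_days in find_archive_candidates("."):
        print(f"{idle_days:8.0f} days idle  {path}")
```

A list like this is only a starting point. Someone still has to decide which candidates belong in a deep archive, which in an active archive and which can simply be deleted.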

1. Implementing your archival storage platform

The best way to design an archive platform depends on a number of factors: How much data will you be storing? How long should data be retained? How frequently will the data be accessed? Below is a selection of expert tips and answers that provide insight on how to approach these questions, and on the best strategies to make your archival storage approach successful.