
Archiving methods: Smart meta tags, archive in place and FLAPE

Jon Toigo outlines several archiving techniques and the technologies that make them run smoothly, and explains why data classification is crucial.

Do you know what FLAPE is? If you don’t, you might not be keeping up with the latest in archiving methods. Find out what Jon Toigo says you must do to keep current when it comes to archiving and how to avoid accidentally filling up a storage junk drawer.

The idea of archiving data -- placing it in a secure, energy-efficient and sensibly organized repository for access at some future date -- sounds straightforward enough. However, even such a simple idea is fraught with issues that need to be well-thought-out in advance. This tip discusses some of the decisions that may determine the success or failure of your archiving project.

A typical issue in most discussions of archiving is whether the organizational scheme and format of the data will stand the test of time. What if the manner in which data is classified needs to change over time -- for example, when laws governing the retention of certain types of data change? And what if the application used to write the data is no longer supported by any operating system or hardware platform, say, in 10 years? Must you keep a copy of your current apps and processors “under glass” for the foreseeable future? These are valid issues that need to be addressed simply and effectively.

It’s helpful to think about the answers to these questions in terms of developments in data archiving methods. In the past, data was classified for inclusion in an archive by using metadata, or data about the data. A data set or file that, according to its metadata, had not been accessed or modified in 30, 60, 90 days or longer was simply migrated off its production storage and into the archive. This approach, however, lacked any sort of granularity. It said nothing about the importance of the data, its relevance to a specific business process, or its association with any regulatory or legal retention mandates. So the process was just as likely to include junk files, Internet cookies and browser dregs, and even old virtual machine disk files for temporary VMs created during a long-past test/dev effort.
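To make that limitation concrete, here is a minimal Python sketch of the age-only approach; the share path and 90-day threshold are assumptions for illustration. Notice that it looks only at dates -- never at content, ownership or business relevance:

```python
import time
from pathlib import Path

def stale_files(root: str, age_days: int = 90):
    """Coarse, age-only selection: anything untouched for `age_days` is an
    archive candidate, regardless of what it is -- the weakness noted above."""
    cutoff = time.time() - age_days * 86400
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        # st_atime = date last accessed, st_mtime = date last modified
        if st.st_atime < cutoff and st.st_mtime < cutoff:
            yield path

if __name__ == "__main__":
    # Hypothetical production share used only for this example
    for f in stale_files("/mnt/production_share", 90):
        print("candidate for archive:", f)
```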

The bottom line is, unless you have a “data hygiene” program in place that clears away the clutter, chances are your archive is destined to become the same sort of junk drawer as your primary storage over time. That makes the archive more challenging to search and use in the future.

Another option for non-granular data selection is to classify data based on the individual who creates it. If Joe works in accounting, all data from Joe’s workstation could be handled as “accounting data” and subjected to the archive policy established for data of that class. However, this practice opens the door to trouble later -- for example, when Joe changes positions and moves from accounting to sales (where different policies apply to data), or when Joe develops a social media addiction and all of his tweets and blog posts about his kids and hobbies start to be stored alongside his legitimate work files in the archive. Again, the result will be an archival junk drawer that will prove difficult to search or use.

Data classes tied to department workflows

The best way to classify data is to combine these concepts for the greatest granularity: create data classes that are tied to departmental workflows rather than to user roles, then include triggers like DATE LAST ACCESSED and DATE LAST MODIFIED in the metadata to identify when to move the relevant files to the archive. Some emerging tiered storage architectures like FLAPE (flash plus tape) let you write data to the archive tier at the same moment it is written to primary storage (flash, disk or a combination of the two). That way, instead of migrating data into the archive at some later date, a file that has reached its archive point can simply be deleted from primary storage.
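Here is a brief Python sketch of how workflow-based classes and metadata triggers might combine. The workflow names, retention periods and thresholds are hypothetical, not a prescription:

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Hypothetical archive policies keyed to departmental workflows, not to users
POLICIES = {
    "accounts_payable":     {"archive_after_days": 90,  "retain_years": 7},
    "clinical_records":     {"archive_after_days": 30,  "retain_years": 25},
    "marketing_collateral": {"archive_after_days": 180, "retain_years": 2},
}

@dataclass
class FileRecord:
    path: Path
    workflow: str         # class assigned when the workflow writes the file
    last_accessed: float  # epoch seconds (DATE LAST ACCESSED)
    last_modified: float  # epoch seconds (DATE LAST MODIFIED)

def ready_to_archive(rec: FileRecord, now: Optional[float] = None) -> bool:
    """True when the file's workflow policy says its archive point has passed."""
    policy = POLICIES.get(rec.workflow)
    if policy is None:
        return False  # unclassified data never migrates silently
    now = now or time.time()
    idle = now - max(rec.last_accessed, rec.last_modified)
    return idle > policy["archive_after_days"] * 86400
```

In a FLAPE-style arrangement, a file that passes this test would not need to be copied anywhere; the archive copy already exists on tape, so the primary copy can simply be deleted.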

As a rule, archivists today try to steer clear of data formats that are potentially “time bound.” File systems seem fairly stable, but the “containers” used to store data bits in a manner that will make them accessible without the original software used to create them -- for example, commercial formats like Adobe PDF or some of the still-experimental “standards-based” XML containers -- remain problematic. One major national archive selected Adobe PDF as its data container a few years ago, then regretted the decision when it had to “un-ingest,” reformat, then re-ingest all of its data each of the 30-odd times Adobe changed the PDF format in the first two years.

Another challenge stems from the earlier assumption about the stability and permanence of the file system. In fact, file systems change all the time, and new file systems that include features like deduplication and compression as part of the storage methodology, or erasure coding as a data protection method, are being introduced for every popular operating system in use today. The flattening of the file system that has accompanied Web development may portend the replacement, within a comparatively short time, of the hierarchical or tree-based model by a new paradigm that saves all data as objects, either self-describing or indexed into a sort of database-like structure.

Object storage for archiving

Newer commercial offerings like Caringo Software’s SWARM or Spectra Logic’s BlackPearl illustrate some of the alternatives maturing into real solutions in the object storage market. That said, there are no dominant models as of this writing, and standards-based efforts are in their infancy. At some point, object storage and archive will intersect in a big way.

The promise of treating all data sets as objects is twofold:
1. Rich metadata can be stored with data sets to more precisely identify and classify the data objects so very granular policies for retention and maintenance can be applied.
2. The entire storage infrastructure and the data on it can be managed holistically and without any need for special software or appliances to provide data protection or preservation services, since the rules for protecting data are baked into the metadata-linked policies for all objects of a given class.

Caringo’s SWARM technology, for example, enables stored data to be labeled via metadata for inclusion in an erasure coding scheme that can spread replicated parts of objects around the storage infrastructure. This allows an object to be reconstructed from available parts on other storage devices if a particular storage device fails. For other data classes, where such protection is not required, objects can be assigned much simpler mirroring policies via their metadata handles. Preservation policies could be assigned just as readily, making data storage a common infrastructure for both archive and primary storage.
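A short Python sketch captures the pattern; the class names, shard counts and functions are hypothetical illustrations, not SWARM’s actual interface. The point is that protection and preservation rules hang off each object’s metadata rather than off a separate backup product:

```python
from dataclasses import dataclass, field

@dataclass
class StoredObject:
    object_id: str
    payload: bytes
    metadata: dict = field(default_factory=dict)  # rich, user-defined tags

# Class-level policies: erasure coding for data that must survive device loss,
# simple mirroring where that is overkill, plus a preservation (retention) term.
PROTECTION_POLICIES = {
    "regulated_archive": {"scheme": "erasure", "data_shards": 10,
                          "parity_shards": 4, "retain_years": 10},
    "scratch":           {"scheme": "mirror", "copies": 2, "retain_years": 1},
}

def protection_plan(obj: StoredObject) -> dict:
    """Resolve protection/preservation rules from the object's class tag."""
    data_class = obj.metadata.get("data_class", "scratch")
    return PROTECTION_POLICIES[data_class]

# The same infrastructure serves both primary and archival roles;
# only the metadata-linked policy differs.
obj = StoredObject("inv-2015-0042", b"...", {"data_class": "regulated_archive"})
print(protection_plan(obj))  # -> erasure coding, 10+4 shards, 10-year retention
```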

One reason for such an “archive in place” strategy is to facilitate big data analytics. Another is to contain the costs of storage services. For archivists, however, the promise is to simplify the method by which data is classified and preserved over time.
