The king of all backup jobs: Backing up the Library of Congress

The Library of Congress is the ultimate backup job, preserving petabytes of data collected from the Internet and created by converting physical materials into digital files.

Consider it the great-granddaddy of backup jobs: Protecting data at the Library of Congress.

Founded in 1800, the Library of Congress is probably most famous for its collection: millions of books, photographs, sound recordings, manuscripts and other materials collected over the last 213 years and representing the bulk of American history.

But like any library in the 21st century, the institution is grappling with the rise of digital data, both collected from outside sources and created by the library's efforts to convert physical documents into digital files.

Jane Mandelbaum, special projects manager for the information technology services directorate at the Library of Congress, said the library now manages "multiple petabytes" of digital content in four data centers, and noted that "the volume of digital content grows in both number of files and in the diversity of formats to be managed and delivered."

Some storage vendors like to name-drop the Library of Congress as a unit of measurement when advertising data capacity or throughput -- using the library like this has earned it a spot on a Wikipedia page covering unusual units of measurement. In reality, though, it is difficult to estimate how big the physical collection would be in a digital format. (The University of California, Berkeley, estimated two very different figures for the size of the library's print collection -- 10 TB and 208 TB -- in separate studies, suggesting how difficult that task really is.)

In terms of sheer size, the Library of Congress represents the ultimate backup job. Consider that the library has about 34.5 million books and printed materials, part of the institution's collection of 151 million items, including movies, manuscripts, sound recordings, photos and other materials. Mandelbaum said the library is working to convert physical items, such as newspapers and sound recordings, into digital files, creating multiple copies to ensure they are properly backed up.

"For digital content designated for long-term storage, we make use of our multiple data centers, so that we can have copies at geographically diverse locations," said Mandelbaum, who also noted that the library relies on commercially available storage products but did not disclose which products.

"We are continuously evaluating and strengthening our storage architecture to provide cost-effective tiered storage, so we can manage requirements for different types of content and usage. We use both disk and tape storage, and we procure equipment and software available in the commercial marketplace. We focus on the functionality of equipment and software that we acquire, so we can select for functionality and scale within each tier. Our goal is to be able to do technical refreshes of equipment as it becomes cost-effective for each piece or type of equipment," said Mandelbaum, who noted in a follow-up interview that "the specific mix of products and configurations changes as we do technical refreshes, and as the technologies change and evolve. We have found that we have to adapt and adjust as industry predictions and technology availability change. We build our environment so we can swap out specific components as needed."

One backup task of particular interest is the library's effort to preserve websites. Library officials have said that the average lifespan of a website is just 44 to 75 days, citing as one example the sites that served as primary research sources on the aftermath of Hurricane Katrina in 2005.

"The library began a pilot project in 2000 to collect and preserve websites. Now the library continues to evaluate, select, provide access to and preserve a wide variety of websites, including newer formats such as blogs. The library has collected approximately 300 [TB] of Web archive data, growing at about 5 [TB] per month," said Mandelbaum.

Preserving the Web means more than Web pages -- microblogging sites like Twitter create far more material that could be added to the collection. The library announced in 2010 that it had reached a deal with the social media giant to back up the public tweets and associated metadata that Twitter users created from 2006 through 2010 (about 21 billion tweets, which add up to about 20 TB).
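
To put the quoted figures in perspective, the arithmetic can be sketched in a few lines of Python. This is a rough illustration only, not anything the library has published: it assumes decimal terabytes (10^12 bytes), assumes the Web archive keeps growing at a constant 5 TB per month, and uses function names invented for this example.

```python
import math

# Approximate figures quoted in the article.
WEB_ARCHIVE_TB = 300             # current size of the Web archive
WEB_GROWTH_TB_PER_MONTH = 5      # reported growth rate
TWEETS_2006_2010 = 21_000_000_000
TWITTER_ARCHIVE_TB = 20

BYTES_PER_TB = 10**12            # assumption: decimal terabytes


def projected_web_archive_tb(months: int) -> float:
    """Projected archive size, assuming constant linear growth."""
    return WEB_ARCHIVE_TB + WEB_GROWTH_TB_PER_MONTH * months


def months_until_tb(target_tb: float) -> int:
    """Whole months until the archive reaches target_tb at the same rate."""
    if target_tb <= WEB_ARCHIVE_TB:
        return 0
    return math.ceil((target_tb - WEB_ARCHIVE_TB) / WEB_GROWTH_TB_PER_MONTH)


def average_bytes_per_tweet() -> float:
    """Average stored size of one tweet plus its metadata."""
    return TWITTER_ARCHIVE_TB * BYTES_PER_TB / TWEETS_2006_2010


if __name__ == "__main__":
    print(projected_web_archive_tb(12))      # size after one more year, in TB
    print(months_until_tb(1000))             # months until roughly 1 PB
    print(round(average_bytes_per_tweet()))  # bytes per tweet, on average
```

Under those assumptions, the Web archive would reach about 360 TB after another year and would take about 140 months to reach 1 PB, while each archived tweet and its metadata averages a little under 1 KB.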

The library has continued adding tweets to that collection, which now includes about 170 billion tweets, according to a library white paper on the project. Researchers have asked for access to the Twitter archive, but the library hasn't made it publicly available yet, and officials have said a single search through billions of tweets can take 24 hours.

Other major backup projects involve converting old sound recordings and newspapers into digital formats.

Library staffers worked in 2010 to convert more than 10,000 pieces of music from old 78 RPM records published from 1901 through 1925 as part of the National Jukebox project, Mandelbaum said.

According to the project's website, digital archival copies are stored at the Library of Congress Packard Campus for Audio Visual Conservation in Culpeper, Va. Smaller MP3-formatted versions are published at the National Jukebox website, where users can listen to the files.

The Library of Congress' Chronicling America project poses another backup challenge for the library, as it continues to build an online searchable database of image files that record American newspapers, which currently date from 1836 to 1922. Mandelbaum said they've added about 1 million new newspaper pages in the past year and have a total of about 5 million pages that are currently available.

Mandelbaum said the library's staff works with professional library and preservation organizations to stay up-to-date with the latest developments in technology and procedures, and follows legal requirements for federal agencies to protect data.

"The library's biggest challenge in terms of digital content designated for long-term storage is to invest wisely in plans to sustain the content in a way that will have the greatest likelihood of cost-effectively meeting the needs of future users -- Congress, the public, the educational community, researchers and other libraries. The changes in the technology platforms and user expectations make this a continuing challenge," said Mandelbaum.
