
Four tips for archiving data on AWS S3

Storing archives on S3 can be a cost-effective option with proper planning and long-term management.

AWS S3 can be a cost-effective option for storing archives. Moving archives to the cloud allows companies to eliminate on-premises hardware such as network-attached storage (NAS) file stores. It can also reduce the number of redundant copies you need to keep to mitigate the risk of media failure or other problems when retrieving files from an archive.

Proper planning is crucial to maximize the benefits of AWS Simple Storage Service (S3). There are a few things to keep in mind when moving archives to S3 and managing them long-term in the cloud.

Plan the organizational structure of your archives

Some organizations may want to organize archives by operational function and date; other companies find it easier to follow an organizational hierarchy. Whatever approach works best for your organization, consider how you will implement chargeback for archives. For example, if you plan to bill departments for all of their archives, you'll want to have a structure that allows you to easily generate billing reports.

Buckets are the logical unit of storage in AWS S3. Each bucket can have up to 10 tags, which are name-value pairs such as "Department: Finance." These tags are useful for generating billing reports, but it's important to use a consistent set of tags across all archive buckets.
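Tags can be applied programmatically as well as through the console. Here is a minimal sketch using boto3, the AWS SDK for Python; the bucket name and tag values are hypothetical, and note that this call replaces any existing tag set on the bucket.

```python
import boto3

s3 = boto3.client("s3")

# Apply a consistent set of cost allocation tags to an archive bucket.
# put_bucket_tagging replaces the bucket's entire existing tag set.
s3.put_bucket_tagging(
    Bucket="archive.finance.audit",  # hypothetical bucket name
    Tagging={
        "TagSet": [
            {"Key": "Department", "Value": "Finance"},
            {"Key": "Purpose", "Value": "Archive"},
        ]
    },
)
```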

AWS is planning to require that all bucket names follow DNS naming conventions: 3 to 63 characters, with distinct labels separated by periods. Use a hierarchical naming convention, with names such as archive.finance.audit and archive.finance.accountspayable.
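To make the convention concrete, the following sketch validates candidate names against the DNS-style rules described above (3 to 63 characters; lowercase labels of letters, digits and hyphens, separated by periods, each starting and ending with a letter or digit):

```python
import re

# Lookahead enforces overall length; the rest matches period-separated labels.
BUCKET_NAME_RE = re.compile(
    r"^(?=.{3,63}$)"                       # total length 3-63 characters
    r"[a-z0-9]([a-z0-9-]*[a-z0-9])?"       # first label
    r"(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$" # additional labels
)

def is_dns_compliant(name: str) -> bool:
    return BUCKET_NAME_RE.match(name) is not None

assert is_dns_compliant("archive.finance.audit")
assert not is_dns_compliant("Archive_Finance")  # uppercase, underscore not allowed
```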

Each AWS account can have up to 100 buckets at a time, so if a single account will manage all archives, plan accordingly. There is no limit to the number of objects stored in a bucket, and there is no performance penalty for storing objects in a few buckets versus many. Amazon S3 also supports file folders within buckets, providing an alternative to using multiple buckets; folders, however, do not support cost allocation tags.
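Folders in S3 are implemented as key prefixes rather than as real directories. A sketch of browsing one such "folder" with boto3 (the bucket name and prefix are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Listing with a Prefix and Delimiter groups keys the way a file
# browser would show the contents of a folder.
resp = s3.list_objects_v2(
    Bucket="archive.finance.audit",  # hypothetical bucket
    Prefix="2013/",                  # hypothetical "folder"
    Delimiter="/",
)
for item in resp.get("Contents", []):
    print(item["Key"], item["Size"])
```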

Determine the best way to transfer data

Depending on how much data you have to transfer to S3, you might want to consider using the AWS Import/Export data migration service. Instead of transferring files over the Internet, customers ship disks to Amazon and have the data loaded to S3 from within an Amazon data center. This service is available in the U.S. East (Northern Virginia), U.S. West (Oregon), U.S. West (Northern California), EU (Ireland), and Asia Pacific (Singapore) regions.

Whether the Import/Export service makes more sense than transferring files over the Internet depends on your network speed and the volume of data. At 10 Mbps, the service could be useful if you have more than 600 GB of data to transfer; at 100 Mbps, it becomes feasible once volumes exceed 5 TB. The AWS Import/Export Calculator can help you estimate the cost of using this service for your archives.
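A back-of-the-envelope calculation shows why those thresholds are reasonable. The sketch below assumes a fully utilized link, which real-world transfers rarely achieve, so actual times will be longer:

```python
# Estimate how long a transfer takes over a network link.
def transfer_days(gigabytes: float, mbps: float) -> float:
    bits = gigabytes * 8 * 1000**3       # decimal gigabytes to bits
    seconds = bits / (mbps * 1_000_000)  # link speed in bits per second
    return seconds / 86400               # seconds in a day

print(f"{transfer_days(600, 10):.1f} days")    # ~5.6 days at 10 Mbps
print(f"{transfer_days(5000, 100):.1f} days")  # ~4.6 days at 100 Mbps
```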

Verify, verify, verify

Regardless of how you transfer data, you'll need to verify it was correctly written to S3, because errors in transmission can result in differences between the source and target files. Most Linux distributions include the md5sum utility for calculating a hash value of a file. Compute a hash of the source file and of the file stored in S3; if the two values differ, there was an error in transmission and the file should be resent. Because these files are presumably archives of valuable information, verifying the integrity of data stored in AWS S3 is worth the effort.
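The same check can be scripted. A minimal verification sketch in Python, with a hypothetical bucket and key; note that an S3 object's ETag equals its MD5 digest only for single-part, unencrypted uploads, so multipart uploads need a different comparison:

```python
import hashlib

import boto3

def local_md5(path: str) -> str:
    """Compute an MD5 hex digest in chunks so large archives fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8 * 1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

s3 = boto3.client("s3")
# ETag is MD5 only for single-part, unencrypted uploads (hypothetical names).
head = s3.head_object(Bucket="archive.finance.audit", Key="2013/q1.tar.gz")
if head["ETag"].strip('"') != local_md5("q1.tar.gz"):
    print("Hash mismatch -- resend the file")
```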

Look to less expensive alternatives

Amazon Glacier is a specialized file archive service that costs $0.01 per gigabyte per month -- about one-third the current price of S3, depending on the amount of data stored. Consider moving archives from AWS S3 to Glacier if you won't need to retrieve or delete them anytime soon: retrieving data from Glacier can take hours, and Amazon charges a fee for deleting data from Glacier within the first three months after it's stored.

One way to realize the benefits of both AWS S3 and Glacier is to use lifecycle management rules to migrate files to Glacier according to company policies. Take, for example, an archive file that has been in S3 for six months. You probably won't access it, and if you do, a delay of several hours to retrieve it will not disrupt business operations. A lifecycle configuration rule can be associated with the S3 bucket so files are transferred to Glacier automatically after the specified time, lowering overall storage costs.
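A minimal sketch of such a rule with boto3; the bucket name and rule ID are hypothetical, and this call replaces any existing lifecycle configuration on the bucket:

```python
import boto3

s3 = boto3.client("s3")

# Transition every object in the bucket to Glacier 180 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="archive.finance.audit",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier-after-6-months",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "Transitions": [{"Days": 180, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```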

Companies can use AWS S3 for archiving -- but it's best to plan for the long term so you can streamline ongoing management operations, such as billing individual users and controlling costs by using Glacier when appropriate.

About the author:
Dan Sullivan holds a Master of Science degree and is an author, systems architect and consultant with more than 20 years of IT experience. He has had engagements in advanced analytics, systems architecture, database design, enterprise security and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail and education. Dan has written extensively about topics that range from data warehousing, cloud computing and advanced analytics to security management, collaboration and text mining.
