Parsing AWS CloudTrail logs for useful information

AWS CloudTrail logs all API calls, which means it generates a lot of data. But digging for useful data among all the logs takes some work.

AWS CloudTrail is the cloud provider's first step toward an auditing product. It creates a log record for each API call from any entity within the AWS cloud, which is both the power and the drawback of the system. CloudTrail has a "log all or nothing" approach, meaning it generates a lot of data. Some of that data is useful to admins; much of it is not. Finding the CloudTrail logs that are most meaningful to your enterprise can be difficult.

If a program asks for a list of EC2 instances, for example, CloudTrail logs that call. If a program asks for a Spot Instance price history, that call is logged, too. More interestingly, if a program changes one of your security group rules, that change is also logged. Routine reads and significant security changes all land in the same stream of records.

If you have hundreds of instances all performing a high volume of work and interacting with AWS, you can easily make thousands or tens of thousands of API calls per hour. All CloudTrail logs are placed into compressed JSON (JavaScript Object Notation) files under a special directory in your Simple Storage Service (S3) bucket. The directory structure takes the following form:

s3://<yourBucketName>/security/AWSLogs/<accountId>/CloudTrail/us-east-1/yyyy/mm/dd/<accountId>_CloudTrail_us-east-1_yyyymmddThhmmZ_<guid>.json.gz
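For example, a hypothetical account 123456789012 logging us-east-1 activity on Jan. 15, 2016, would produce keys along these lines; the bucket name, date and trailing identifier here are invented for illustration:

s3://mylogs/security/AWSLogs/123456789012/CloudTrail/us-east-1/2016/01/15/123456789012_CloudTrail_us-east-1_20160115T1210Z_AbCd1234EfGh5678.json.gz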

CloudTrail creates a new directory for each day of each month and then creates multiple files per day. Each file can contain the record of a single API call or of many calls. Many IT shops are only concerned with calls to security-related APIs.
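The records inside each file share a common shape. The abridged record below uses invented values, but the field names (eventVersion, eventName, sourceIPAddress, userAgent and so on) are the ones CloudTrail actually writes, and they explain why the script later in this tip splits records on "eventVersion" and trims everything from "userAgent" onward:

{"Records": [
  {"eventVersion": "1.02",
   "userIdentity": {"type": "IAMUser", "accountId": "123456789012", "userName": "example-user"},
   "eventTime": "2016-01-15T12:10:00Z",
   "eventSource": "ec2.amazonaws.com",
   "eventName": "AuthorizeSecurityGroupIngress",
   "awsRegion": "us-east-1",
   "sourceIPAddress": "203.0.113.10",
   "userAgent": "aws-cli/1.10.1",
   "requestParameters": {"groupId": "sg-12345678"}}
]}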

The trick with making AWS CloudTrail useful is being able to sift through a large collection of data to find the information that matters to your organization. Some organizations that use CloudTrail have developed a set of scripts to process log files. The scripts all tend to follow the same pattern:

1. Copy the files from S3 to a local machine using aws s3 cp.

2. Unzip all of the copied files using find . -name "*.gz" | xargs gunzip (a sketch of these first two steps follows this list).

3. Parse the files for the particular API call or calls your organization is most interested in. This can be tricky, because each line in a JSON file may contain one or more calls, covering one or more time periods.
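The first two steps might look something like the sketch below; the bucket name, prefix and account ID are placeholders to replace with your own values:

aws s3 cp s3://<yourBucketName>/security/AWSLogs/<accountId>/CloudTrail/us-east-1/ ./cloudtrail-logs/ --recursive

find ./cloudtrail-logs -name "*.gz" | xargs gunzip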

The following script uses "eventVersion" as the demarcation of an individual API call and then filters on a particular call that you're interested in.

find . -type f | xargs sed 's/eventVersion/\n&/g' | grep ParticularApiCallICareAbout | sed 's/^.* ParticularApiCallICareAbout / ParticularApiCallICareAbout /' | sed 's/userAgent.*//' | sort -u > toBeLookedAt

This sort of script narrows the enormous set of API calls down to the particular set you care about. You then still have to process that smaller list, either manually or with other scripting, to determine whether there was a problem.
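For a quick first pass over that smaller list, you could, for example, tally the matching calls by source address, assuming the sourceIPAddress field survived the trimming above (it appears before userAgent in each record):

grep -o '"sourceIPAddress": *"[^"]*"' toBeLookedAt | sort | uniq -c | sort -rn

Calls arriving from an unexpected address are good candidates for a closer look.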

About the author:
Brian Tarbox has been doing mission-critical programming since he created a timing-and-scoring program for the Head of the Connecticut Regatta back in 1981. Though primarily an Amazon Java programmer, Brian is a firm believer that engineers should be polylingual and use the best language for the problem. Brian holds patents in the fields of UX and VideoOnDemand with several more in process. His Log4JFugue open source project won the 2010 Duke's Choice award for the most innovative use of Java; he also won a JavaOne best speaker Rock Star award as well as a Most Innovative Use of Jira award from Atlassian in 2010. Brian has published several dozen technical papers and is a regular speaker at local Meetups.

Next Steps

Pairing REST and JSON to build APIs

Track user activity with CloudTrail
