
Improve query performance, cut costs with Amazon Athena

Amazon Athena can query more data formats than its competitor Google BigQuery, but developers must choose easy-to-process formats to enjoy Athena's benefits.

Data is more valuable than ever, and modern IT teams must ingest, manage and analyze data quickly. Enterprises understand the value of data analytics -- particularly in the cloud -- but many still run into common financial and performance barriers.

Not all data is good data. Improperly structured data can result in high costs for AWS customers. Several AWS tools can optimize data to improve query performance and reduce costs -- and pair well with Amazon Athena, AWS' interactive SQL-based query service.

Data analysts use Amazon Athena to query large amounts of data stored in Amazon Simple Storage Service (S3) with a simple SQL interface. Athena, which is entirely serverless, requires no provisioning and can quickly handle petabyte-scale searches. Athena is Amazon's answer to Google BigQuery, which has grown in popularity since its launch in 2011.

Unlike BigQuery, Amazon Athena doesn't require companies to store data in a specific format, nor does it require developers to write custom code to upload data directly into it. Instead, Athena executes queries directly against files stored in S3. Developers can optimize data for Athena to improve query performance and reduce costs, but the service also supports traditional formats, such as CSV files. To get started, upload files in any supported format to an S3 bucket and point Athena at that bucket.


IT teams should plan queries ahead of time and avoid running them repeatedly over large data sets. Because AWS charges per terabyte of data scanned -- $5 per TB at Athena's standard rate -- costs add up quickly if an application accidentally queries a large data set every few minutes: a query that scans 2 TB costs about $10, and run every five minutes, that's roughly $2,880 per day. To keep scans small, compress and store data in a columnar format, such as Apache Parquet, before uploading to S3.

Format data in S3

Amazon Athena uses standard SQL, and developers often use big data SQL back ends to track usage analytics, as they can handle and manipulate large volumes of data to form useful reports. For example, an application could log every action a user takes in a given session and then create a log file in Amazon S3. A typical analytics log might track user ID, action name, action value, browser, location, IP address, start time and end time.


The application could store a log from a single session in S3 as CSV or plain text, but Athena works more efficiently with data stored in Parquet format. Apache provides a C++ library for Parquet, and bindings exist for other languages, such as Node.js.

The node-parquet module is simple to use and shows how to write data in the compact Parquet format for S3 and Athena. This code writes the session data out to a Parquet file and then uploads that file to S3:

const parquet = require('node-parquet');
const fs = require('fs');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

// session_id identifies the user session being logged; in a real
// application it would come from the session context.
const session_id = 'example-session';

const schema = {
  user_id: {type: 'int32'},
  action_name: {type: 'byte_array'},
  action_value: {type: 'byte_array', optional: true},
  browser: {type: 'byte_array'},
  location: {type: 'byte_array'},
  ip_address: {type: 'byte_array'},
  start_time: {type: 'int32'},
  end_time: {type: 'int32'},
};

const data = [
  [1234, 'Login', null, 'Chrome', 'Ohio', '10.0.0.1', 1496941960, 1496942080],
  [1234, 'Search', 'Chris Moyer', 'Chrome', 'Ohio', '10.0.0.1', 1496941960, 1496942080],
  [1234, 'View', 'Book: Building Applications in the Cloud', 'Chrome', 'Ohio', '10.0.0.1', 1496941960, 1496942080],
  [1234, 'Add to Cart', 'Building Applications in the Cloud', 'Chrome', 'Ohio', '10.0.0.1', 1496941960, 1496942080],
  [1234, 'Checkout', '1 Item', 'Chrome', 'Ohio', '10.0.0.1', 1496951960, 1496952080],
];

// Write the rows to a local Parquet file and close it.
const writer = new parquet.ParquetWriter(`${session_id}.parquet`, schema);
writer.write(data);
writer.close();

// Upload the finished file to S3, where Athena can query it.
s3.upload({
  Bucket: 'my-parquet-data',
  Key: `${session_id}.parquet`,
  Body: fs.createReadStream(`${session_id}.parquet`),
}).promise().catch(e => {
  // Retry logic if there's an issue, or alert
  // if nothing can be retried.
});

Parquet files have a strict format, so define the schema before writing the file. This also ensures that every file conforms to the same schema, which lets Athena process the data more easily and correctly. Developers can also use GZIP compression to further reduce the amount of data scanned, which improves query performance and lowers costs.
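For logs that remain in plain text instead of Parquet, Athena can also query GZIP-compressed text files directly, and smaller objects mean fewer bytes scanned per query. Here's a minimal sketch of that approach with Node's built-in zlib module; the session-log.csv file and my-csv-data bucket are placeholder names:

const fs = require('fs');
const zlib = require('zlib');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

// Compress the CSV log with GZIP, then upload the .gz file;
// Athena can query it in place without decompressing it first.
fs.createReadStream('session-log.csv')
  .pipe(zlib.createGzip())
  .pipe(fs.createWriteStream('session-log.csv.gz'))
  .on('finish', () => {
    s3.upload({
      Bucket: 'my-csv-data',
      Key: 'session-log.csv.gz',
      Body: fs.createReadStream('session-log.csv.gz'),
    }).promise().catch(e => {
      // Handle or retry the failed upload.
    });
  });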

Query data with Athena

After uploading data to S3, create a virtual table in Athena to query the data. This tells Athena how to parse the S3 data into an SQL-compatible format: for text formats, a regular expression can describe each row; for Parquet, simply specify Parquet as the data format, and Athena maps the files onto the table schema based on how they're stored. Be sure to specify which columns to load into Athena, using the same names and types defined when the data was saved to S3.
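As a sketch of that step, the statement below defines a table over the Parquet files from the earlier example and submits it through the AWS SDK's Athena client; the analytics database and the my-athena-results output bucket are assumptions, not part of the original setup:

const AWS = require('aws-sdk');
const athena = new AWS.Athena();

// DDL describing the Parquet files; the columns mirror the
// schema used when the data was written. Assumes the 'analytics'
// database already exists in Athena.
const ddl = `
  CREATE EXTERNAL TABLE IF NOT EXISTS analytics.user_actions (
    user_id INT,
    action_name STRING,
    action_value STRING,
    browser STRING,
    location STRING,
    ip_address STRING,
    start_time INT,
    end_time INT
  )
  STORED AS PARQUET
  LOCATION 's3://my-parquet-data/'`;

athena.startQueryExecution({
  QueryString: ddl,
  ResultConfiguration: {OutputLocation: 's3://my-athena-results/'},
}).promise().catch(e => {
  // Surface DDL failures, such as a missing results bucket.
});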

Partitions, which are similar to indexes, can improve query performance. For example, partition on user ID and action name to quickly group all activities from a particular user or action type; because Athena bills by data scanned, a query restricted to a few partitions also costs less.
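Here's a minimal sketch of that scheme, assuming the Parquet files are laid out under Hive-style prefixes such as s3://my-parquet-data/user_id=1234/action_name=Login/ (the table and bucket names are again hypothetical):

const AWS = require('aws-sdk');
const athena = new AWS.Athena();

// Partition keys are declared separately from the regular columns
// and correspond to the user_id=.../action_name=... S3 prefixes.
const ddl = `
  CREATE EXTERNAL TABLE IF NOT EXISTS analytics.user_actions_partitioned (
    action_value STRING,
    browser STRING,
    location STRING,
    ip_address STRING,
    start_time INT,
    end_time INT
  )
  PARTITIONED BY (user_id INT, action_name STRING)
  STORED AS PARQUET
  LOCATION 's3://my-parquet-data/'`;

athena.startQueryExecution({
  QueryString: ddl,
  ResultConfiguration: {OutputLocation: 's3://my-athena-results/'},
}).promise();

// Once the table exists, running 'MSCK REPAIR TABLE
// analytics.user_actions_partitioned' registers the partitions
// Athena finds in the bucket; queries that filter on user_id or
// action_name then scan only the matching prefixes.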

After completing the Athena configuration, any SQL query run through the AWS Management Console or the Athena API searches S3 directly. There's no caching or synchronization process; queries always reflect the current contents of the bucket, and the IT team pays only when it performs a search.
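A sketch of that flow through the API might look like the following, reusing the hypothetical table and results bucket from the earlier examples:

const AWS = require('aws-sdk');
const athena = new AWS.Athena();

// Submit a query, poll until it finishes, then fetch the results.
async function queryUserActions(userId) {
  const {QueryExecutionId} = await athena.startQueryExecution({
    QueryString: `SELECT action_name, action_value
                  FROM analytics.user_actions
                  WHERE user_id = ${userId}`,
    ResultConfiguration: {OutputLocation: 's3://my-athena-results/'},
  }).promise();

  // startQueryExecution only submits the query, so wait for
  // Athena to leave the QUEUED/RUNNING states before reading results.
  let state = 'QUEUED';
  while (state === 'QUEUED' || state === 'RUNNING') {
    await new Promise(resolve => setTimeout(resolve, 1000));
    const {QueryExecution} = await athena
      .getQueryExecution({QueryExecutionId}).promise();
    state = QueryExecution.Status.State;
  }
  if (state !== 'SUCCEEDED') {
    throw new Error(`Query finished in state ${state}`);
  }
  return athena.getQueryResults({QueryExecutionId}).promise();
}

queryUserActions(1234)
  .then(results => console.log(results.ResultSet.Rows));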

Amazon Athena is helpful for developers who don't want to maintain a running SQL server, but it's only cost-effective for infrequent queries. Applications that run constant queries against the same data should load it into a full database, such as Amazon Aurora, or a data warehouse, such as Amazon Redshift.
