
Optimize Amazon Athena performance with these 5 tuning tips

Amazon Athena can provide an efficient, cost-effective method of data analysis. But have you tuned Athena properly to see those benefits?

Amazon Athena introduced a serverless approach to data analysis built to scale automatically and provide quick results. But that doesn't mean that administrators can set it and forget it.

AWS released Athena to expand its data analytics portfolio, which already included EMR and Redshift. Athena originally focused on analyzing data stored in Amazon S3 with SQL syntax. It now also queries sources such as DynamoDB and Redshift, as well as data cataloged in AWS Glue. Athena is built on the open source Trino and Presto engines and also supports Apache Spark. The serverless approach makes Athena an accessible option that doesn't require launching expensive compute infrastructure.

However, without proper optimization, poor query performance can degrade applications and the user experience. The following five tips, from partitioning data to monitoring usage patterns, can help.

Consider data files

First, consider how to further optimize any stored data files for analysis. Take a closer look at the status of stored files, including the following information:

  • Formats. Athena supports typical data file formats such as CSV, Parquet, ORC, JSON, text and binary.
  • File size. File size has an impact on both performance and cost. Avoid formats that result in larger data files, such as text, CSV or JSON.
  • Compression. Compress data files with Snappy, zlib, Deflate, bzip2 or gzip. Compressed files in columnar formats, such as ORC or Parquet, deliver better performance and scan less data, which lowers cost (see the conversion sketch after this list).
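
As a rough illustration, the following sketch uses a CREATE TABLE AS SELECT (CTAS) statement, run through the boto3 SDK, to rewrite a CSV-backed table as Snappy-compressed Parquet. The table, database and bucket names are hypothetical.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# CTAS statement: rewrite a CSV-backed table as Snappy-compressed Parquet.
# Table, database and bucket names are placeholders.
ctas = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://example-bucket/sales-parquet/'
) AS
SELECT * FROM sales_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```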

Partition data

Data partitioning reduces the amount of data each query scans, which can improve performance. The main principle behind partitioning is to split data files according to a value range contained in each file. If a date field is relevant, for example, files can be stored in folders by year, month and day, making queries that filter on dates more efficient.

Partitioning files can present a challenge. The field chosen to partition your organization's files must appear in most query conditions; a partition pattern that optimizes some queries may do nothing for others. Choose a partitioning scheme based on application requirements.
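
For example, the following sketch, with hypothetical table and bucket names, creates a table partitioned by year, month and day, then runs a query that prunes down to a single day's partition.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")
results = {"OutputLocation": "s3://example-bucket/athena-results/"}
context = {"Database": "example_db"}

# Partition by date parts so queries that filter on them scan fewer files.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events (
    event_id string,
    payload string
)
PARTITIONED BY (year string, month string, day string)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/'
"""
athena.start_query_execution(QueryString=ddl, QueryExecutionContext=context,
                             ResultConfiguration=results)

# After loading data, register new partitions, e.g. with MSCK REPAIR TABLE.
# This query then prunes to a single day's partition instead of the whole table.
query = "SELECT count(*) FROM events WHERE year = '2024' AND month = '06' AND day = '15'"
athena.start_query_execution(QueryString=query, QueryExecutionContext=context,
                             ResultConfiguration=results)
```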

Review queries

Another consideration is how the number of queries, and how long they take, can negatively affect performance.

Assess query cost and timing

Query conditions, statements or retrieved columns can result in suboptimal data access patterns. Running the EXPLAIN ANALYZE statement on a specific query returns valuable information, including the amount of data and number of rows accessed, the structure of the query, and its performance. This visibility helps optimize queries as well as file placement and partition patterns.
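
A minimal sketch of this workflow, assuming hypothetical table and bucket names, runs EXPLAIN ANALYZE through boto3 and prints the resulting plan:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Run EXPLAIN ANALYZE on a candidate query; the table name is a placeholder.
resp = athena.start_query_execution(
    QueryString="EXPLAIN ANALYZE SELECT customer_id, sum(total) FROM orders GROUP BY customer_id",
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then print the plan, which includes
# per-stage rows, bytes and timing.
state = "QUEUED"
while state in ("QUEUED", "RUNNING"):
    time.sleep(1)
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print(row["Data"][0].get("VarCharValue", ""))
```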

Limit queries

The number of files a particular query scans is also an important factor to consider. Balance the size of each file against the number of files common queries access. Queries that access thousands of files can perform poorly or fail with errors from exceeding request rate limits. Aim to store files larger than 128 MB but smaller than 1 GB. AWS offers the S3DistCp tool, which automates file placement and sizing in S3.
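
S3DistCp runs on an EMR cluster; as an Athena-native alternative, a bucketed CTAS statement can compact many small files into a bounded number of larger ones. The sketch below uses hypothetical names, and the bucket count is a placeholder to tune so each output file lands roughly in the recommended 128 MB to 1 GB range.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Compact a table made of many small files into a bounded number of larger
# ones. bucket_count is a placeholder; pick it based on total data volume
# so each output file falls in the 128 MB-1 GB range.
compact = """
CREATE TABLE events_compacted
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://example-bucket/events-compacted/',
    bucketed_by = ARRAY['event_id'],
    bucket_count = 16
) AS
SELECT * FROM events
"""

athena.start_query_execution(
    QueryString=compact,
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```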

Reuse query results

Reuse Athena query results for frequently executed queries or queries performed on predictable sets of data. With result reuse enabled, Athena looks for the results of a previous execution of the same query, similar to caching. Athena stores these results in S3 and returns them instead of rescanning the data, which optimizes both performance and cost.
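
Result reuse is requested per query execution. A minimal sketch, assuming a hypothetical query and bucket and a workgroup running Athena engine version 3, passes ResultReuseConfiguration through boto3:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Ask Athena to return cached results from an identical earlier execution,
# if one exists and is no older than 60 minutes, instead of rescanning data.
athena.start_query_execution(
    QueryString="SELECT region, count(*) FROM orders GROUP BY region",
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    ResultReuseConfiguration={
        "ResultReuseByAgeConfiguration": {"Enabled": True, "MaxAgeInMinutes": 60}
    },
)
```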

Manage costs

As a serverless option, Athena assigns compute capacity when executing queries by default. This setup costs $5 per TB of data scanned, based on U.S. East Region pricing. Athena also offers the Provisioned Capacity option, which assigns 4 vCPUs and 16 GB of memory per data processing unit (DPU) and doesn't incur data scanning costs. Application owners pay $0.30 per DPU-hour, with a one-hour minimum, billed per minute thereafter.

For heavy or high-volume queries, consider Provisioned Capacity, which provides a flat-rate, hourly pricing structure. However, it does introduce the operational task of monitoring and assigning appropriate compute capacity for performance and cost optimization.

Monitor usage patterns

Athena simplifies running queries on large amounts of data, but applications with a very high query volume should also evaluate server-based services such as EMR or Redshift. A good practice is to monitor CloudWatch metrics such as DPUAllocated, DPUConsumed and ProcessedBytes. These provide a view of usage patterns and are helpful when considering alternative configurations.
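
As a starting point, the following sketch pulls a day of ProcessedBytes data for a workgroup from CloudWatch. The workgroup name is hypothetical, and the available dimensions can vary with your account setup.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hourly sums of bytes scanned by a workgroup over the last 24 hours.
# "primary" is the default workgroup name; substitute your own.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Athena",
    MetricName="ProcessedBytes",
    Dimensions=[{"Name": "WorkGroup", "Value": "primary"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```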

Ernesto Marquez is the owner and project director at Concurrency Labs, where he helps startups launch and grow their applications on AWS. He particularly enjoys building serverless architectures, automating everything and helping customers cut their AWS costs.
