chris - Fotolia
The benefits of columnar storage and the Parquet file format
What's behind Apache Parquet's growing popularity? It may be the file format's columnar storage orientation, which leads to benefits including improved query performance.
As organizations consider migrating structured data sets to the cloud, data practitioners must decide how those data sets will be stored.
Naïve approaches will transfer the data sets in their raw forms, often as database extracts or field-separated files. Other alternatives, such as the Hadoop Distributed File System, are appealing because the data set is distributed across a number of storage nodes, allowing for parallel execution of data access requests.
However, one file format that has gained popularity is Apache Parquet. The Parquet file format is column-oriented. However, to understand its value, one must first gain an appreciation for columnar storage and how it differs from the conventional database storage layout.
How data is stored: Rows vs. columns
Tables in a relational database management system are stored in a row-oriented fashion, which means that all the values in one row are stored, followed by all the values in the next row, etc. This orientation is suited to transaction processing systems, which typically create or modify one or a few records at a time for performance reasons.
The trouble with columnar storage is that business intelligence (BI) and analytics applications execute queries that scan the selected attributes of a vast number of rows or records, needing only those columns or aggregates of those columns to support the user's needs. But because of their row-based functions, a row-oriented database must read the entire row of values to get to the attributes that are part of the filter conditions. As a result, analytic and BI queries often access a lot more data than is needed to satisfy the request.
The result of this increased I/O is much slower performance and response times for analytic queries. Tuning the environment by adding indexes, precomputing certain aggregates, or reorganizing the data using materialized views and online analytical processing -- or OLAP -- cubes helps, but this still requires additional processing time and storage.
An alternative approach organizes the data vertically along a columnar layout. In a column-oriented storage layout, each column's values are aligned and stored separately. This columnar alignment allows a query engine to determine the columns that are necessary to calculate the conditions, as well as the columns that are being returned, and retrieve only the values requested from those specific columns.
Because the data values within each column form a virtual index, there is a reduced need for those special indexes; the columnar layout enables rapid access to the data without the need to expand the database footprint, resulting in dramatically improved query performance.
The advantages of a columnar data layout
There are some clear benefits of using a file format with columnar storage, such as a reduced storage footprint and faster response times. When data values are stored in a columnar layout, those columns with a smaller data value cardinality -- such as those using small code sets, or 0/1 for binary flag-type values like male/female or true/false -- are eminently compressible. Applying data compression algorithms to some columnar data can lead to dramatic reductions in file size, sometimes reducing storage needs by a large order of magnitude.
As was previously noted, when executing queries, the database engine needs to load only the columns that are accessed in the query instead of the entire row. However, there are two other aspects of the columnar layout that contribute to increased performance.
Because the data is compressed, there is less data to stream from the disk to the CPU, and that significantly reduces the latency delays associated with data access. Second, consecutive data values loaded from disk fill the CPU cache with the values needed for the subsequent iterations of the query condition; data access pipelining can prefetch the next sets of values into the cache while the CPU is executing the current set of comparisons. This reduces data access latency even more.
Because the Apache Parquet file format was designed to support big data applications using columnar storage, it has been engineered to integrate what might be called semi-structured data, including nested structures, as well as sparsely populated columns. Parquet allows for using different encodings to compress different columns so users can determine the optimal compression for data. Parquet also allows multiple types of data in the columns, including more complex data types and structures, such as dictionaries/hash tables, indexes, bit vectors, etc.