What to consider when choosing big data file formats
While JSON may be the go-to data format for developers, Parquet, ORC or other options may be better for analytics apps. Learn more about big data file formats.
Modern enterprise applications and business analytics live in different worlds when it comes to managing data. Developers prefer JSON because it is readable and easy to weave into apps. Business analytics applications, by contrast, benefit from other file formats -- and from ways of structuring the data inside them -- that are smaller and faster to process.
The challenge is: There is no ideal format and structure for every use case. Enterprises can improve their analytics by creating processes to transform and restructure the data into the most appropriate big data file formats. This may not be a big issue for one-off analytics projects that require more work planning to extract, transform and load (ETL) data than to run the actual analytics. But the payoff can be significant if management decides to scale up a particular analytics workload.
Why JSON?
The JavaScript Object Notation, or JSON, data format has emerged as the de facto format for transmitting data among enterprise, cloud, IoT and mobile applications, due to its support for nested and complex data structures, as well as its readability. But it's not efficient when it comes to storing or analyzing data. As it turns out, neither is the comma-separated values (CSV) format, the old standby championed by spreadsheet jockeys.
"Using JSON as the storage format for analytics would lead to orders-of-magnitude performance degradation compared with the ideal case," said Reynold Xin, chief architect and co-founder of Databricks, a cloud-based analytics platform. This poor performance is caused by several reasons. JSON and other string formats, like CSV, are slow to parse. The compression ratio of a row-oriented format like JSON is much worse than a column-oriented format. Also, there are no statistics built into the JSON format for guiding an analytics app to skip unnecessary data.
Another problem is that JSON includes metadata with the data in every record. As a result, the volume of the data will be five to 10 times higher compared with other big data file formats. "If you have to do analysis on a 2 TB JSON file and all you need is to read three to four out of 30 columns of data for your analytics, then the process will read all the data -- 95% of which is not needed in this scenario," said Ramu Kalvakuntla, chief architect at Clarity Insights, a business management consultancy.
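Column pruning is the difference Kalvakuntla describes. The sketch below, which reuses the hypothetical files from the previous example, pulls only two columns: the JSON reader must parse every record in full, while the Parquet reader touches only the selected column chunks.

```python
# A minimal sketch of column pruning with pyarrow. "events.json" and
# "events.parquet" are the hypothetical files from the earlier example.
import pandas as pd
import pyarrow.parquet as pq

# Reading JSON: the whole file is parsed even if only a few fields matter.
all_rows = pd.read_json("events.json", lines=True)
subset_from_json = all_rows[["user_id", "amount"]]

# Reading Parquet: only the requested column chunks are read from disk.
subset_from_parquet = pq.read_table(
    "events.parquet", columns=["user_id", "amount"]
).to_pandas()
```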
The main issue lies in table joins. Modern databases like MongoDB, Cassandra and Amazon DynamoDB that are optimized for key-value reads perform well with JSON files. But many traditional analytics databases or cloud data warehouses require constant table joins. Mohit Bhatnagar, senior vice president of products at cloud data platform provider Qubole, said, "There is no wrong format, per se -- only flawed scenarios in which people choose to use it. There is no universal adaptive single database that can accommodate all scenarios. Thus, the reason multiple database products co-exist in one organization, increasing the management and maintenance costs overall."
Many alternatives to JSON
A variety of alternatives have emerged that are generally better suited for analytics applications, such as the Apache Parquet and Apache Optimized Row Columnar (ORC) file formats. These column-oriented formats make it easier to read a few columns in a large file. They achieve nearly 90% compression with five to 10 times higher performance than JSON file formats, Kalvakuntla said. Apache Avro is another popular row-oriented format that is quick to process and compresses well, but it isn't as fast when pulling data from a single column.
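As a rough illustration, the PySpark sketch below writes the same ingested JSON records in all three formats. The paths are hypothetical, and the Avro write assumes the external spark-avro package is on the classpath.

```python
# A minimal PySpark sketch writing one data set in columnar (Parquet, ORC)
# and row-oriented (Avro) formats. Paths and file names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()

# Ingest the raw newline-delimited JSON records.
df = spark.read.json("events.json")

df.write.mode("overwrite").parquet("/tmp/events_parquet")            # columnar
df.write.mode("overwrite").orc("/tmp/events_orc")                    # columnar
df.write.mode("overwrite").format("avro").save("/tmp/events_avro")   # row-oriented
```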
Although these formats can help store data more efficiently, they can also introduce data management challenges when building robust production pipelines with Parquet and Apache Spark, Databricks' Xin said. As a result, data engineers need to build complex pipelines with large amounts of plumbing code to prevent data corruption during concurrent reads and writes while they prepare data sets. Performance also suffers as data volumes grow, since data gets distributed across a larger number of files. It's also hard for data engineers to address poor-quality data in big data file formats, such as Parquet, as there is no way to enforce schemas.
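The plumbing code Xin mentions often amounts to hand-rolled checks like the hypothetical pyarrow sketch below, which validates an incoming batch against an expected schema before writing a Parquet file -- a check that nothing in the format itself forces a writer to run.

```python
# A minimal sketch of hand-rolled schema enforcement before a Parquet write.
# The schema, function name and path are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

expected = pa.schema([
    ("user_id", pa.int64()),
    ("country", pa.string()),
    ("amount", pa.float64()),
])

def append_batch(table: pa.Table, path: str) -> None:
    # Explicit check; a writer that skips it can still produce mismatched files.
    if not table.schema.equals(expected):
        raise ValueError(f"schema mismatch: {table.schema}")
    pq.write_table(table, path)
```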
An alternative approach is to separate the data format from the programming abstraction used by users, developers and analysts writing queries, said Justin Makeig, director of product management at MarkLogic, an enterprise database provider. Under the covers, some database management system (DBMS) applications can analyze the data and query patterns and adjust the indexes and query plans.
Bill Tolson, vice president of marketing at compliance storage archiving service Archive360, said many companies must store business data in formats that are required for compliance but are not optimal for analytics. One ETL process for making this data more useful for analytics involves transforming the raw archival data into hierarchical and related JSON records, which makes it easier to slot into different analytics formats. But this process must ensure that sensitive information, such as bank account numbers, is properly classified or masked during the transformation.
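As a hypothetical illustration of Tolson's point (not Archive360's actual process), the sketch below masks an account number while flattening an archived record into a JSON document for analytics.

```python
# A minimal sketch of masking a sensitive field during transformation to JSON.
# The record layout and masking rule are hypothetical.
import json
import re

def mask_account_number(value: str) -> str:
    # Keep the last four digits; replace the rest with asterisks.
    return re.sub(r"\d(?=\d{4})", "*", value)

def to_analytics_record(raw: dict) -> str:
    record = {
        "customer": raw["customer_name"],
        "account": mask_account_number(raw["account_number"]),
        "balance": raw["balance"],
    }
    return json.dumps(record)

print(to_analytics_record(
    {"customer_name": "Jane Doe", "account_number": "1234567890", "balance": 250.0}
))
# {"customer": "Jane Doe", "account": "******7890", "balance": 250.0}
```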
Automating data transformation
Enterprises might consider adopting tools to help dynamically tune for different analytics scenarios. The original creators of Apache Spark have created Databricks Delta, which combines a data processing engine and associated open format built on top of Apache Spark and Parquet. This can help generate optimized layouts and indexes for building data pipelines to support big data use cases for analytics and machine learning.
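For reference, here is a minimal PySpark sketch of writing and reading a Delta table. The session configuration assumes the open source delta-spark package is installed, and the paths are hypothetical.

```python
# A minimal sketch of Delta Lake's transactional layer on top of Parquet.
# Assumes the delta-spark package is available; paths are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.read.json("events.json")

# Writes are transactional; an append with a mismatched schema is rejected
# instead of silently corrupting the table.
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Readers see a consistent snapshot even while new data is being written.
events = spark.read.format("delta").load("/tmp/events_delta")
```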
Another approach is to take advantage of the indexing and query optimizers baked into modern databases. DBMS vendors are investing heavily in automating this process, Makeig said. Better automation could adjust indexing strategies on the fly without requiring query writers or database administrators to declare every query upfront. Other work is exploring more predictive query planning based on runtime behavior.
Balancing simplicity with efficiency
Some experts recommend data analytics managers default to columnar file formats, like Parquet and ORC, for efficiency, since they work well for a wide variety of use cases, said David Besemer, vice president of engineering at OmniSci, a GPU-accelerated analytics platform. If there is a need to store data in a different format, then it would be best to treat the analytical processing system as a sandbox into which data is ingested for processing.
Others suggest starting with technical needs. Gökhan Öner, senior solutions architect at Hazelcast, an in-memory data grid company, said: "When choosing a data format for analytics workloads, storage efficiency, processing speed and interoperability should be considered, along with the tools and languages used for processing the data." He has found that organizations often work with many different data sets, each of which may be stored and processed more efficiently in a different big data file format depending on the use case. Rather than selecting the most space-efficient or fastest format for each individual case, choosing a couple of interoperable data formats that deliver the best overall performance across all requirements pays off over time.
Start with the end in mind
One good practice is to start with the type of business insight that is expected to consume the data, taking into account the personas, applications, hardware and established processes. Database choice comes after choosing big data file formats, as you'll want to choose the database that can process the data most efficiently. "For organizations that have already made their database choice, new business typically introduces new database products, and it is usually a data architect's role to holistically consider these factors and find a balance when considering budget, team size, data growth rate and future business aspirations," Bhatnagar said.
It's also important not to get too bogged down in analytics efficiency, since that is but one part of a much larger data wrangling pipeline. "The inefficiencies of messy, ambiguous, ungoverned data are likely orders of magnitude more costly and risky than the runtime performance due to the underlying data structures," Makeig said. Many data scientists spend most of their time wrangling data, making sense of it and fixing upstream quality issues, rather than conducting analysis. This can be especially problematic when trying to work with data collected by other researchers. Moving to a centrally governed hub of semantically consistent data enables data scientists to focus on analysis, not costly one-off ETL, where many people might be doing the same prep work differently across the organization.