Cloudera adds data engineering capability to enable DataOps

Big data vendor Cloudera is looking to help data engineers use its platform with a new service that brings more power and management to running Spark for building data pipelines.

Big data vendor Cloudera is growing its portfolio with a series of efforts aimed at enabling a DataOps model.

Earlier this month, the company, based in Santa Clara, Calif., announced new and upcoming features for its Cloudera Data Platform, including Cloudera Data Engineering and Cloudera Data Visualization. The Data Engineering service makes use of Apache Spark for data queries and the Apache Airflow platform for workflow monitoring. The Data Visualization offering is based on technology that comes from Cloudera's 2019 acquisition of Arcadia Data, which provides reporting and charting functionality.

Cloudera Data Engineering is generally available now; Cloudera Data Visualization is in technical preview.

According to Doug Henschen, an analyst at Constellation Research, Cloudera makes a good case for the breadth and depth of capabilities it can deliver without the heavy lifting of knitting together multiple point solutions, like databases, analytics environments and streaming tools. That said, he added that Cloudera also knows it still has work to do on simplifying its platform to lower the cost of ownership and maximize value for customers looking to support data engineering, as well as data science, data warehousing and operational database use cases.

How Cloudera Data Engineering enables DataOps

David Menninger, a senior vice president and research director at Ventana Research, said Cloudera's announcements focus on rounding out the platform to provide a one-stop shop for everything related to big data, from streaming data to data engineering and machine learning.

Screenshot of Cloudera Data Platform Data Engineering: the new service is meant to give users visibility into, and management of, data pipelines and resource utilization.

"The new data engineering capabilities address a critical need in the market that many others are calling DataOps," Menninger said. "DataOps addresses the process of automating all the data pipelines that feed analytics to ensure these systems can be put into production and maintained as requirements change."

Shaun Ahmadian, senior manager of product management for data engineering at Cloudera, said the goal of the new data engineering service is to decouple a lot of the analytic workflows from the data engineering workflows. Data engineers will now get the tools they specifically need to build data pipelines and make sure the right data is available, he added.

Raja Aluri, director of engineering at Cloudera, explained that data engineers often write their own Spark jobs for data pipelines, as they want the programmatic power of Spark to do complex data transformations. Spark is nothing new for Cloudera, he said, but what is new is specific tooling in Cloudera Data Engineering that makes it easier for data engineers to build and manage data pipelines.

"We provide an optimized, autoscaling way to run Spark jobs," Aluri said.

Bringing Apache Airflow to data engineering

While Spark is a foundational element of Cloudera Data Engineering, so, too, is the Apache Airflow open source project. Airflow is a workflow orchestration platform originally developed by Airbnb in 2014 and contributed to the Apache Software Foundation in 2016.

Airflow is now a mature technology, Aluri said, adding that there was interest from the Cloudera customer base in making use of the platform to help improve data workflows. According to Ahmadian, a key benefit of Apache Airflow is that it's written in the open source Python programming language.

"By having the data pipeline primarily defined as Python code, it attracts a lot of developers, and it will help with any customization that is needed," Ahmadian said.
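
To make the idea of a pipeline defined as Python code concrete, here is a minimal sketch that uses only the Python standard library. It is not Airflow's API and not Cloudera's implementation; the task functions, the PIPELINE dictionary and the run_pipeline helper are hypothetical names chosen for illustration. The sketch only shows the general pattern that workflow tools build on: tasks are ordinary functions, dependencies are declared as data, and a scheduler runs each task after its upstream tasks finish.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract():
    # Stand-in for reading raw records from a source system (made-up data).
    return [
        {"user": "a", "amount": "10"},
        {"user": "b", "amount": "oops"},
        {"user": "a", "amount": "5"},
    ]


def clean(rows):
    # Keep only rows whose amount is a string of digits, and convert it to int.
    return [
        {"user": r["user"], "amount": int(r["amount"])}
        for r in rows
        if r["amount"].isdigit()
    ]


def aggregate(rows):
    # Sum the amounts per user.
    totals = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0) + r["amount"]
    return totals


# Hypothetical pipeline definition: task name -> (callable, upstream task names).
PIPELINE = {
    "extract": (extract, []),
    "clean": (clean, ["extract"]),
    "aggregate": (aggregate, ["clean"]),
}


def run_pipeline(pipeline):
    # Build a graph that maps each task to the set of tasks it depends on.
    graph = {name: set(deps) for name, (_, deps) in pipeline.items()}
    results = {}
    # static_order() yields every task after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        func, deps = pipeline[name]
        # Pass the outputs of the upstream tasks as positional arguments.
        results[name] = func(*(results[d] for d in deps))
    return results


results = run_pipeline(PIPELINE)
print(results["aggregate"])
```

Because the pipeline is plain code, adding a custom step means writing another function and adding one entry to the dictionary, which is the kind of customization Ahmadian describes.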
