
What you need to know about Cloudera vs. AWS for big data

Enterprises in need of a big data platform must run some analytics of their own to choose a vendor. AWS' integration between services can't be beat, but is Cloudera a better choice?

When it merged with fellow big data management vendor Hortonworks in January 2019, Cloudera Inc. gained a better chance to compete with cloud providers' Hadoop offerings -- setting up an AWS faceoff.

The upcoming Cloudera Data Platform (CDP) will be an open source, cloud-hosted big data offering meant to challenge Amazon Elastic MapReduce (EMR) -- AWS' Hadoop service -- and other cloud-oriented big data analytics applications also built on Hadoop. CDP does not have a release date yet.

Cloudera also partnered with IBM in June 2019 to collaborate on big data and AI offerings and resell each other's services: Cloudera Enterprise Data Hub and DataFlow as well as IBM Watson Studio and Big SQL.

Let's take a look at what this Cloudera and IBM partnership might mean for users with big data workloads on the cloud and how CDP changes the contest of Cloudera vs. Amazon EMR.

What IBM brings to Cloudera

The Cloudera and IBM partnership is first a reaffirmation of the Hortonworks-IBM partnership prior to the Cloudera merger, said Dave Mariani, founder and chief strategy officer at data warehouse virtualization provider AtScale.

Before they merged, Cloudera and Hortonworks focused on the Hadoop file system and tools for large data lakes. With these capabilities, enterprises could save all their data in one place and repurpose it for various analytics and AI purposes. In practice, though, enterprises have struggled with Hadoop performance problems, and as a result, many enterprises have turned to cloud providers to outfit their data management fabric.

Post-merger, Cloudera's partnership with IBM could help enterprise customers address Hadoop performance problems through IBM's extensive service and support organization and partnerships. In contrast, AWS provides a comprehensive set of tools for automating many aspects of big data deployments and is an attractive choice for companies with AWS development and deployment skills.

Cloudera vs. Amazon EMR

The Cloudera and IBM partnership and CDP offering should be most attractive to companies entering the early stages of a big data analytics strategy that have data and applications spread across on-premises and cloud environments. It is not likely to draw companies with a substantial AWS presence and skill set.

In partnering with IBM, Cloudera has tied itself to IBM's hybrid and multi-cloud agenda. Therefore, Cloudera and IBM should be the best fit for enterprises with a hybrid cloud data strategy, Mariani said. IBM asserts that a hybrid or multi-cloud approach is more realistic than locking in to one provider, he said.

IBM's approach to supporting modern app development is to use Kubernetes and containers so that workloads can run anywhere: on premises, private cloud or public cloud. AWS, on the other hand, wants all workloads to run only on its cloud.

While multi-cloud may be a viable approach, Mariani does not expect many enterprises to go that route soon. The cloud users he speaks with are all-in on their chosen public cloud vendor and contract a secondary vendor for backup only. The main benefit customers see in AWS and other vendors isn't easy access to servers, but the tightly integrated services and tools that take enterprise IT out of the systems integration business, he said.

For example, Amazon EMR uses S3 for storage and integrates with the AWS Glue data catalog and the Redshift data warehouse. AWS' strengths come from API integrations, availability and scale across geographic regions, and interoperability across its range of services. These native tie-ins put third-party technologies such as Cloudera at a disadvantage to EMR, especially if data platform buyers are trained and certified on AWS operations and management.

CDP vs. AWS

Cloudera wins vs. AWS, though, when organizations seek high-end service, support, implementation, security and compliance for the data platform, said Marty Puranik, president and CEO of Atlantic.net, a hosting provider.

Cloudera Data Platform will have security, governance and metadata baked into the exchange fabric between data sources and analytics workloads when it launches. Cloudera has created the Shared Data Experience connection fabric, called SDX, that manages and automates these processes. To build security into Amazon EMR, developers must set up the encryption between their apps.
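To make the EMR side of that comparison concrete, here is a minimal sketch of the kind of setup work involved. It assumes the JSON layout that EMR security configurations use for in-transit encryption; the helper name and the S3 path are hypothetical placeholders, and a real deployment would likely need at-rest encryption settings as well.

```python
import json


def in_transit_security_config(cert_bundle_s3_uri: str) -> str:
    """Build an EMR-style security configuration that enables TLS in transit.

    The certificate bundle location is a hypothetical placeholder; the field
    names follow the EMR security configuration layout as an assumption.
    """
    if not cert_bundle_s3_uri.startswith("s3://"):
        raise ValueError("certificate bundle must be an s3:// URI")

    config = {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": False,
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": cert_bundle_s3_uri,
                }
            },
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # With boto3 installed and credentials configured, this string could be
    # passed as SecurityConfiguration to the EMR client's
    # create_security_configuration call. That step is not run here.
    print(in_transit_security_config("s3://example-bucket/certs/my-certs.zip"))
```

The point of the sketch is that the team running EMR writes and maintains this document itself, whereas Cloudera positions SDX as handling the equivalent policy centrally.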

One valuable capability on the AWS side vs. Cloudera is that it supports Jupyter-based EMR notebooks that easily work across AWS products such as S3, DynamoDB and Redshift. CDP often involves more work connecting Jupyter-based notebooks to these services. Jupyter notebooks are useful for data visualization, cleaning, modeling and other tasks. The sharable documents can contain live code, equations, visualizations and narrative text.

Implementation and cost

The ultimate costs of Cloudera vs. AWS for big data management come down to implementation, compliance, security and performance. AWS caters to enterprises with in-house expertise and cloud centers of excellence, whereas Cloudera and IBM offer more guidance through professional services.

"AWS will have a lower sticker price, but could end up being much more if you don't know what you're doing," Puranik said.

For example, developers can incur significant egress charges if they send out more data from the cloud than required for a particular workload. Another big problem could come from a misconfiguration issue, such as the misconfigured firewall that let an attacker reach data stored in S3 in the 2019 Capital One breach.
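To put the egress point in rough numbers, the sketch below compares what a workload needs to send out with what it actually sends. The per-gigabyte rate is a made-up placeholder, not a quoted AWS price, and the function name and data volumes are hypothetical.

```python
def egress_cost_usd(gb_out: float, rate_per_gb: float = 0.09, free_gb: float = 0.0) -> float:
    """Estimate a data transfer bill for traffic leaving the cloud.

    rate_per_gb is an assumed placeholder; check the provider's current
    pricing before relying on any figure produced here.
    """
    billable_gb = max(gb_out - free_gb, 0.0)
    return round(billable_gb * rate_per_gb, 2)


# Hypothetical monthly volumes: the workload needs 500 GB out,
# but an unfiltered export sends 5,000 GB.
needed = egress_cost_usd(500)
actual = egress_cost_usd(5000)
overspend = round(actual - needed, 2)

if __name__ == "__main__":
    print(f"needed: ${needed}, actual: ${actual}, avoidable: ${overspend}")
```

Under these assumptions, most of the transfer bill is avoidable, which is the kind of cost Puranik warns about when teams are unfamiliar with the platform.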

If users aren't all-in on one particular public cloud yet, or are unsure of what they need, they should look at Cloudera and IBM first, even if the upfront cost is higher than Amazon EMR. To truly understand whether one data platform or the other is a fit, use trial workloads.

"Start with smaller projects, if possible, and see which one fits your organization best," Puranik said.
