Databricks' $100m acquisition of Arcion adds data ingestion

The data lakehouse specialist's latest purchase adds native data ingestion capabilities that will help users stream data in real time to inform AI and machine learning models.

Databricks on Monday made its second major acquisition in the past four months, reaching an agreement to purchase Arcion for $100 million in a move that adds new data ingestion and data replication capabilities.

The deal comes after Databricks acquired MosaicML for $1.3 billion in June to enable customers to more easily develop their own generative AI models.

Databricks was already an investor in Arcion, having participated in Arcion's $13 million series A funding round in February 2022. The 2016 startup, a data integration specialist based in San Mateo, Calif., had raised $18.2 million before being acquired by Databricks.

Databricks, now valued at about $43 billion, raised $500 million in new funding in September 2023 after raising more than $1.6 billion in financing in August 2022 and $1 billion in February 2021.

Databricks did not specify how it planned to use its most recent funding. But given that it has now acquired MosaicML and Arcion in rapid succession -- MosaicML coming shortly before the funding and Arcion shortly after -- acquisitions are a clear part of its roadmap.

Based in San Francisco, Databricks was founded in 2013 and was one of the pioneers of the data lakehouse.

Now, as AI and machine learning gain momentum and organizations develop generative AI models, lakehouses are gaining traction as well. They combine the storage capabilities of data lakes and data warehouses, and they enable organizations to bring together the disparate data types needed to train large language models.

Additive capabilities

Data ingestion and replication capabilities are becoming increasingly important.

As worldwide data increases at an exponential rate, the data that organizations collect is likewise growing in both volume and complexity. As a result, data ingestion pipelines into platforms such as Databricks' lakehouse are useful to avoid data silos that result if data is stored in disparate databases and applications.

Data replication, meanwhile, is vital to ensure that the same data can be used in different locations, such as on-premises and cloud databases, and remain consistent as it's used to feed different models and other data products.

Databricks, however, does not provide its own data ingestion and replication capabilities for data natively created by customers.

The vendor's platform can connect to data ingestion tools that move data directly from its source, such as Microsoft's Azure Data Factory and Fivetran. In addition, with Auto Loader, Databricks can ingest data from cloud storage on AWS, Google Cloud and Microsoft Azure.

Once the acquisition of Arcion is complete and Arcion's technology is integrated with the rest of the Databricks platform, customers will no longer have to use third-party platforms for ingesting and replicating data not already in the cloud.

As a result of those ingestion and replication capabilities, the acquisition is strategic for Databricks, according to Kevin Petrie, an analyst at Eckerson Group.

"This is a good move for Databricks because enterprises need to synchronize their on-premises operational databases with cloud analytics platforms such as Databricks," he said. "With this acquisition, Databricks makes it a little easier for lakehouse adopters to migrate and continuously update operational data with the lakehouse platform."

Databricks also adds relatively rare data ingestion and replication capabilities with its acquisition of Arcion, Petrie continued. He noted that Arcion is one of the few vendors able to ingest and replicate data in real time from certain databases and applications.

"There is a short list of vendors that can automatically extract live data updates from heritage systems … without slowing the performance of source workloads," he said. "A number of large enterprises trust Arcion to do this."

Arcion's platform is built on a change data capture engine that ingests data as it's created.
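
For readers unfamiliar with change data capture, the general idea can be sketched in a few lines of Python. This is an illustration of the technique only, not Arcion's or Databricks' implementation. The ChangeEvent type, the apply_changes function and the sample rows are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    """One row-level change read from a source database's log (hypothetical shape)."""
    op: str                                # "insert", "update" or "delete"
    key: int                               # primary key of the affected row
    row: Optional[Dict[str, Any]] = None   # new row contents; None for deletes


def apply_changes(replica: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[int, Dict[str, Any]]:
    """Apply change events in order so the replica mirrors the source table."""
    for event in events:
        if event.op in ("insert", "update"):
            # Upsert: the latest version of the row replaces any earlier one.
            replica[event.key] = dict(event.row or {})
        elif event.op == "delete":
            # Remove the row if present; a repeated delete is harmless.
            replica.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return replica


# Illustrative change log, in the order the source emitted the changes.
source_log = [
    ChangeEvent("insert", 1, {"name": "Ada", "plan": "free"}),
    ChangeEvent("insert", 2, {"name": "Lin", "plan": "pro"}),
    ChangeEvent("update", 1, {"name": "Ada", "plan": "pro"}),
    ChangeEvent("delete", 2),
]

replica = apply_changes({}, source_log)
```

Because only the changes are shipped rather than full table copies, the target can stay close to the source without repeatedly rescanning it, which is the property the analysts quoted here point to.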

In addition, Arcion connects with more than 20 enterprise databases and data warehouses, according to Databricks. As a result, the acquisition will enable Databricks to better ingest and replicate streaming data, apply Databricks' existing governance and security capabilities, and make that data actionable to inform real-time analysis.

Real-time data, meanwhile, is gaining value as organizations develop generative AI models that require continuous training to remain as accurate as possible.

Databricks' acquisition of Arcion, therefore, addresses a real need, according to Doug Henschen, an analyst at Constellation Research.

"The type of change data capture tech that Arcion provides is seeing growing demand for low latency use cases," he said. "The acquisition will complement Databricks' existing integration technologies and its growing streaming data capabilities."

Without tools such as those provided by Arcion, Databricks users would still need to develop their own integrations with integration platform as a service (iPaaS) vendors to build AI and machine learning pipelines, Henschen continued.

"Up-to-the-minute analyses and timely machine learning and AI predictions depend on fast access to the latest data," he said. "This acquisition will give Databricks a tightly integrated, platform-native option in addition to what customers might use from third-party iPaaS partners."

Next steps

Databricks declined to divulge details about its roadmap after revealing its acquisition of Arcion.

In early October, however, after unveiling new large language model and GPU optimization capabilities to help improve generative AI outcomes, Prem Prakash -- Databricks' principal product marketing manager of AI and machine learning -- said simplifying model development and deployment is an ongoing focus.

The vendor's acquisition of Arcion fits with that strategy because it makes ingesting the data used to train and maintain models more efficient.

Beyond simplifying model development and deployment, Prakash said improved model governance capabilities are part of Databricks' product development plans.

Data governance has long been an important way for organizations to ensure regulatory compliance and data security while safely enabling employees to work with data. Now, as generative AI models become more widespread, the same governance is needed to ensure that those models are consumed in ways that don't put organizations at risk while also leading to better decisions.

Petrie, meanwhile, suggested that Databricks would be wise to incorporate not only Arcion's ingestion and replication capabilities but also its no-code interface that lets users develop streaming pipelines with generative AI-fueled natural language processing.

"It will be interesting to see whether and how Arcion increases ease of use further within Databricks with a generative AI conversational interface," he said. "This is a clear trend among data pipeline vendors, helping data engineers increase their productivity."

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
