
Databricks simplifies data ingestion with Lakeflow Connect
With preconfigured connectors to Salesforce and Workday -- and more to follow -- the data cloud and AI vendor is taking aim at streamlining data ingestion.
Databricks on Wednesday launched Lakeflow Connect with the general availability of connectors for Salesforce and Workday.
Lakeflow Connect, first unveiled in July 2024, is a set of low-code/no-code connectors between the Databricks Data Intelligence Platform and SaaS applications, databases and other file sources. Together with Delta Live Tables (DLT) for data transformation and Databricks Workflows for data orchestration, it makes up the Lakeflow engineering suite.
Lakeflow Connect is powered by serverless compute, which enables users to run workflows without having to provision clusters -- Databricks manages and scales the requisite compute power -- and integrates with Databricks' governance, observability and security capabilities, including Unity Catalog.
Kevin Petrie, an analyst at BARC U.S., noted that BARC research shows over 90% of AI leaders are at least testing the use of structured data to inform applications while nearly two-thirds are using real-time data to train applications.
As a result, Lakeflow Connect is a significant addition, according to Petrie.
"Salesforce and Workday applications provide exactly this type of data as inputs for real-time machine learning and GenAI use cases," he said. "Databricks is right to simplify data access in this fashion."
Based in San Francisco, Databricks is a data platform vendor that helped pioneer the lakehouse format for storing data. Over the past two-plus years, like many other data management vendors, Databricks has expanded into AI development.
Connecting to data
Data ingestion is critical but complex.
It's simply the process of obtaining and importing data into systems such as databases, data warehouses, data lakes and data lakehouses. But building and maintaining pipelines that move data from the systems where it's created -- such as Salesforce and Workday -- into systems where it's stored and prepared to inform analysis is complicated.
It often involves developing an infrastructure that includes data extraction tools, streaming data platforms such as Apache Kafka and change data capture (CDC), among other capabilities.
The result is that engineers spend substantial time piecing together and maintaining disparate tools, some of which eventually fail when the scale exceeds their capabilities, with both the time and technology purchases adding up to a significant expense.
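To make the CDC idea concrete, here is a minimal Python sketch of applying a stream of change events to a target table. It is an illustration only and does not represent how Lakeflow Connect, Arcion or any vendor tool implements CDC. The `ChangeEvent` class, the `apply_changes` function and the sample account records are all hypothetical names invented for this example, and the target table is modeled as a plain in-memory dictionary keyed by primary key.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One hypothetical change record emitted by a source system."""
    op: str                               # "insert", "update" or "delete"
    key: str                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # changed column values; None for deletes


def apply_changes(
    target: Dict[str, Dict[str, Any]],
    events: Iterable[ChangeEvent],
) -> Dict[str, Dict[str, Any]]:
    """Apply change events, in order, to a target table held in memory."""
    for event in events:
        if event.op in ("insert", "update"):
            # Upsert: merge the changed columns over any existing row.
            merged = dict(target.get(event.key, {}))
            merged.update(event.row or {})
            target[event.key] = merged
        elif event.op == "delete":
            # Deleting a row that is already absent is treated as a no-op.
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op!r}")
    return target


# Hypothetical sample data: one existing row and three incoming changes.
target = {"001": {"name": "Acme", "stage": "Prospect"}}
events = [
    ChangeEvent("update", "001", {"stage": "Closed Won"}),
    ChangeEvent("insert", "002", {"name": "Globex", "stage": "Prospect"}),
    ChangeEvent("delete", "002"),
]
result = apply_changes(target, events)
```

Only the rows that changed are touched, which is the appeal of CDC over reloading a full table on every run. A production pipeline would also have to handle ordering, retries and schema changes, which is part of the maintenance burden described above.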
Databricks heard from customers about the trouble they were having ingesting data, and that feedback provided the impetus for developing Lakeflow Connect, according to Michael Armbrust, a distinguished software engineer at Databricks.
The vendor provided connectors to numerous data sources before Lakeflow Connect, but they had to be configured by customers and maintained as the APIs, schemas and other aspects of data sources changed. In October 2023, Databricks acquired Arcion for $100 million to add improved data ingestion capabilities. Lakeflow Connect represents Databricks' integration of Arcion with its Data Intelligence Platform.
"Customers need this data, but before this announcement they were forced to use third-party tools that often times at large scale would fall over, so they would have to build their own custom solutions," Armbrust said. "This makes [ingestion] point-and-click within Databricks."
Using Lakeflow Connect's first two connectors, engineers can create data ingestion pipelines with either a few clicks or a few lines of code so that data created in Salesforce and Workday can be quickly and easily extracted and moved into the Data Intelligence Platform.
In addition, because the connectors integrate with the Data Intelligence Platform, once in the Databricks environment, data governance developed in the Unity Catalog is automatically applied to Salesforce and Workday data as it's ingested.
Donald Farmer, founder and principal of TreeHive Strategy, noted that many other vendors provide connectors to data sources to simplify data ingestion. For example, Qlik provides Connector Factory for its customers.
Still, Lakeflow Connect is valuable for Databricks users, demonstrating progress on the part of the vendor and representing a "milestone." In particular, its integration with Unity Catalog and its CDC capabilities are notable, according to Farmer.
"It's difficult to say that [Lakeflow] Connect is unique, but the integration with Unity Catalog and the CDC which they acquired from Arcion are useful elements," he said.
In addition, Farmer highlighted serverless compute as an important aspect of Lakeflow Connect.
"The serverless compute may be quietly important, not just for its seamless scalability but for the rapid startup times which are important in reducing latency when running many complex pipelines," he said.
Beyond simplifying data ingestion, Lakeflow Connect is designed to make it easier for data engineers to transform and orchestrate data within the Lakeflow engineering environment to prepare it for analysis and AI development. In conjunction with DLT and Databricks Workflows, Lakeflow Connect helps provide engineers with a unified data preparation environment.
Looking ahead
With Lakeflow Connect for Salesforce and Workday now generally available, connectors for Google Analytics, Microsoft SQL Server, Oracle NetSuite, PostgreSQL, ServiceNow and SharePoint are part of Databricks' roadmap.
Armbrust declined to give a timeline for their general availability, but said there will be some developments related to Lakeflow Connect at Databricks' Data + AI Summit in June.
Beyond Lakeflow Connect, Databricks is focused on unifying and simplifying data engineering, according to Armbrust.
"If you are [an Apache Spark] or Scala expert, you could always use Databricks to do pretty cool things," he said. "This year, we want to make it possible for someone who knows just a bit of SQL or just knows how to point and click in the UI to build those same production quality pipelines."
Meanwhile, with Databricks now making AI development a significant part of its platform, Petrie suggested that the vendor would be wise to do more to help different personas work together to build generative AI tools, including agents that can autonomously take on certain repetitive tasks.
AI development involves data management, model management and application development, meaning data engineers, data scientists and developers need to collaborate.
"I'll be interested to see how Databricks helps [personas] collaborate across these lifecycles to build and manage safe, effective agentic AI," Petrie said. "This will require further integration of Lakeflow for the data lifecycle with Mosaic AI and MLflow for the model and application lifecycles."
Eric Avidon is a senior news writer for Informa TechTarget and a journalist with more than 25 years of experience. He covers analytics and data management.