Databricks Data Ingestion Network brings data to lakehouse
Databricks Ingest entered a public preview in a move by Databricks to enable a lakehouse that combines the best features of the data lake and data warehouse models.
Databricks introduced the Data Ingestion Network and Databricks Ingest, expanding on its open source Delta Lake efforts with a pair of services that aim to make it easier for organizations to create what is known as a lakehouse.
The Data Ingestion Network and Databricks Ingest write into the open Delta Lake format. Delta Lake was first open sourced in April 2019 and became a Linux Foundation project in October 2019. Delta Lake provides a layer on top of a data lake that describes the data within it and gives users a way to handle that data for both business intelligence and machine learning applications.
Databricks Ingest is the Delta Lake-based service, while the Data Ingestion Network includes a partner program, which initially includes, among others, Microsoft Azure Data Factory, Fivetran, Qlik, Infoworks, StreamSets and Syncsort. Databricks launched the Data Ingestion Network and Databricks Ingest as public previews on Feb. 24; it has not yet disclosed details on full general availability.
Databricks Ingest and the Data Ingestion Network have much promise, said Ventana Research analyst David Menninger.
"Accessing and preparing data is generally the most time-consuming part of the analytics process," Menninger said. "Anything vendors can do to streamline this process will provide immediate benefit to organizations using their technology."
Organizations need to work with hundreds of cloud data sources, in addition to myriad on-premises data sources, he noted. The key to success for Databricks will be the breadth and depth of the partnerships it establishes as part of the Data Ingestion Network, Menninger said.
From data lake to lakehouse
With the Delta Lake effort, Databricks is trying to help define the emerging approach known as a lakehouse for data, which combines the best elements of the data lake and data warehouse models.
Menninger said he isn't particularly a fan of the term "lakehouse," though the idea has merit.
"Names are a funny thing in the tech industry and vendors want to come up with new names and want those names associated with their brand," he said. "Regardless of what it is called, their concept is correct -- we're seeing a merging of data warehouses and data lakes."
His firm's research has shown for years that relational technology is important to organizations using big data, and the research also shows a strong relationship between data warehouses and data lakes, Menninger said.
"In more than eight out of 10 cases, organizations report their data warehouses and data lakes are related," he said. "In about a third of organizations, the data lake replaces the data warehouse. In about a third of organizations, the data lake feeds the data warehouse, and in about 15% of cases the data warehouse feeds the data lake."
The evolution of Delta Lake
Bharath Gowda, vice president of product marketing at Databricks, said "thousands" of enterprises have used Delta Lake since it was first open sourced in 2019 to improve the reliability and scalability of their data lakes.
"Databricks Ingest and the Data Ingestion Network are Databricks services, and write into the open Delta Lake format," Gowda explained. "It is possible to read tables created with the new ingestion services with open source Delta Lake."
Databricks says partnerships key to bringing data to the lakehouse
As part of the services rollout, Gowda noted that Databricks created a new partner gallery within the Databricks user interface that enables users to identify vetted partners to ingest data into their data lake. He added that Databricks has also streamlined and standardized the process for ingest partners to bring data into Delta Lake.
Databricks developed a new Copy Command to move data smoothly from cloud storage to Delta Lake without creating duplicate records if the same data is accidentally ingested twice, Gowda said.
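The article does not describe how the Copy Command avoids duplicates. As a rough illustration of the general idea, a loader can keep a record of which source files it has already ingested and skip them on later runs. The Python sketch below is a toy under that assumption, not Databricks' implementation; the class name `IdempotentLoader`, the method `copy_into` and the file paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentLoader:
    """Toy stand-in (hypothetical) for a target table plus a log of ingested files."""

    rows: list = field(default_factory=list)
    loaded_files: set = field(default_factory=set)

    def copy_into(self, files: dict) -> int:
        """Load every file not seen before; return how many files were ingested."""
        ingested = 0
        for path, records in sorted(files.items()):
            if path in self.loaded_files:
                continue  # already ingested, so skip it to avoid duplicate rows
            self.rows.extend(records)
            self.loaded_files.add(path)
            ingested += 1
        return ingested


# Hypothetical landing-zone files, keyed by path.
source = {
    "landing/a.json": [{"id": 1}, {"id": 2}],
    "landing/b.json": [{"id": 3}],
}

loader = IdempotentLoader()
first = loader.copy_into(source)   # both files are new
second = loader.copy_into(source)  # same files again, nothing is loaded
print(first, second, len(loader.rows))  # prints: 2 0 3
```

Running the same load twice leaves the table with three rows rather than six, which is the behavior Gowda describes. A real service would also have to persist the ingestion log and handle files that change after being loaded, which this sketch ignores.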
Now, Databricks is planning to expand the breadth and depth of the Data Ingestion Network with more partners, including Informatica, Segment and Talend.
"Beyond Databricks Ingest, we're exploring new ways to make it easier for customers to run business analytics and machine learning workloads from a lakehouse architecture," Gowda said.