Data management trends: GenAI, governance and lakehouses

The top data management trends of 2023 -- generative AI, data governance, observability and a shift toward data lakehouses -- are major factors for maximizing data value in 2024.

Generative AI hype dominated 2023, but GenAI isn't the only trend influencing data operations in 2024. The need for improved data observability and governance is increasing as data continues to be a core element of business operations, analytics, machine learning and AI.

Several data management trends from 2022 continued to evolve in 2023, including the move toward cloud data lake and data lakehouse architectures. Macroeconomic conditions -- including inflation -- continue to put pressure on organizations trying to maximize their potential data value. Despite challenges with the economy, some vendors were also able to raise money, though the volume of funding in 2023 was a shadow of the numbers seen in 2021 and 2022.

Generative AI dominates data

It's no surprise that generative AI was a dominant trend in data management, as it was in IT and other industries.

Nearly every major database and data platform vendor had some form of generative AI news in 2023. Some vendors added generative AI as an assistant that helps users carry out different tasks. Managing data platforms and writing different types of data queries have long been complicated exercises, and generative AI can simplify them.

Among the many vendors that integrated some form of AI assistant, Dremio launched its Text-to-SQL AI-powered tool in June, which enables users to generate SQL queries more easily. In August, Couchbase announced Capella iQ, a generative AI tool that helps developers write database application code. Also in August, SnapLogic rolled out its SnapGPT AI tool to help users build data pipelines using natural language. Alation announced its Allie AI tool in October to help improve productivity for its suite of data catalog and governance tools.

Beyond integrating an AI-powered assistant, database vendors added new capabilities to help enable large language models (LLMs). Databases can act as a knowledge base for retrieval-augmented generation (RAG), typically by providing vector-database-type capabilities. The capabilities generally involve supporting vector embeddings as a data type and providing vector search functionality. Many database vendors added support for vector search in 2023, including Rockset, Neo4j, Oracle Database 23c, MongoDB and SingleStore.
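To make the idea of vector embeddings and vector search concrete, here is a minimal sketch in Python. It is not how any of the vendors named above implement the feature. The stored rows, the three-dimensional embeddings and the `vector_search` helper are all hypothetical, and a real system would get its embeddings from a model and use an index rather than scanning every row.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def vector_search(query, rows, top_k=2):
    """Brute-force search: score every stored embedding against the query."""
    scored = [(cosine_similarity(query, emb), doc_id) for doc_id, emb in rows.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical table: document ID -> embedding stored as a vector column.
# Real embeddings usually have hundreds or thousands of dimensions.
ROWS = {
    "pricing_faq": [0.9, 0.1, 0.0],
    "refund_policy": [0.8, 0.3, 0.1],
    "office_map": [0.0, 0.2, 0.9],
}

# In a RAG flow, the query vector would be the embedding of the user's question,
# and the matching documents would be passed to the LLM as context.
print(vector_search([1.0, 0.2, 0.0], ROWS))
```

In this toy data set, the query vector points in nearly the same direction as the two billing-related documents, so those are returned ahead of the unrelated one.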

Data lakehouse momentum continues to build

The data lakehouse, which uses cloud object storage as a data lake while supporting data analytics uses similar to those of a data warehouse, continued to grow in popularity.

Databricks pioneered the basic concepts of the data lakehouse in 2020 and others have jumped into the market in the years since. Databricks pushed data lakehouse efforts forward during 2023 with multiple updates, among the most notable being the release of Delta Lake 3.0 in June. Delta Lake is one of the three leading open source data lake table formats, alongside Apache Iceberg and Apache Hudi.

To help limit any potential confusion and lock-in risks across the three open source data lake table formats, the OneTable open source project announced an interoperable metadata layer across Hudi, Delta Lake and Iceberg. Apache Hudi vendor Onehouse started OneTable with backing from Google and Microsoft.

Oracle got into the lakehouse action with the launch of its MySQL HeatWave Lakehouse service in July. MySQL HeatWave is a service that combines operational and analytical database capabilities in a converged database, another trend that is going strong overall.

Data governance and observability remain a top priority

Whether it's for AI, data operations or analytics, the topic of data governance is increasingly important.

Understanding where data comes from, how to make it available and how to use it is important for security, privacy, accuracy and reliability. Over the course of 2023, multiple vendors expanded and enhanced data governance capabilities to help manage data.

The need to boost data governance led Informatica to acquire startup Privitar in June, to help improve capabilities for the cloud data platform vendor. Collibra brought improvements to its data quality, lineage and discovery capabilities.

In November, Starburst updated its Galaxy cloud services with automated data governance, powered in part by GenAI.

Observability is part of being able to manage and govern data effectively. With the rise of generative AI and vector databases in 2023, the capability to observe and govern data used for AI is growing in importance. In November, Monte Carlo launched new data observability capabilities specifically focused on vector databases.

Investment funding slows down

One of the many indicators of health for the data management industry is the pace of funding activity for emerging vendors.

Though the volume of funding events was smaller than in the past two years, throughout 2023, several data platform vendors secured major funding rounds to power expansion and innovations.

InfluxData, creator of the InfluxDB time series database, secured $81 million in a funding round in February. The company released InfluxDB 3.0 in April, along with new deployment options including InfluxDB Clustered for private cloud and on-premises environments.

Onehouse raised $25 million in February to fuel data lakehouse interoperability with its OneTable effort. Databricks raised $500 million in September and plans to use the funds for R&D focused on generative AI, along with geographic growth. Databricks has introduced new tools for building generative AI applications powered by customers' own data, such as vector search and RAG pipelines.

Also in September, Denodo obtained a $336 million equity investment from private equity firm TPG Growth. Denodo recently added new data governance capabilities including data lineage and launched a free tier to reach new users.

Data management should remain a foundation for data analytics, operations and AI efforts in 2024 and beyond. Further integrating generative AI into data platforms, including data lakehouses, makes sense for both vendors and users as a way to improve efficiency and get more done with less effort.

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and has been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.
