
Beyond generative AI, data quality a data management trend

Vendors and users have mostly focused on generative AI so far in 2023, but data observability to ensure quality, along with governance, also remains key to successful analytics pipelines.

With improved efficiency as its main intended outcome, generative AI has so far been the primary data management trend in 2023.

However, just as many analytics vendors have unveiled generative AI capabilities without yet making them generally available, most data management vendors that have revealed generative AI plans have yet to release the resulting tools.

Alteryx, Informatica, Databricks and Snowflake as well as tech giants AWS, Google and Microsoft are among the many vendors that have said they are developing generative AI capabilities. But most of the tools that have been introduced are now in preview or in the product development stage rather than generally available.

"Data management is in the exploratory phase with generative AI because no one has really worked out what LLMs [large language models] are going to do in the world of data management," said Donald Farmer, founder and principal of TreeHive Strategy. "In some ways it's easy to see use cases [in analytics]. But how LLMs fit in at the back end is a little more difficult to see."

In particular, concerns about data privacy, compliance and governance are slowing the development of generative AI-fueled data management tools, he added.

Meanwhile, despite its prominence, generative AI has not been the only data management technology to get a lot of attention to date in 2023.

Data quality and data observability have also been in focus; so has data governance. And vendors have also emphasized developing data management and analytics ecosystems with integrations.

Where we stand

Generative AI is not a new technology. It was first developed in the 1960s and has been used in data management and analytics on a small scale for years.

OpenAI's release of ChatGPT in November 2022, however, marked a momentous advance in generative AI and LLM technology.

Suddenly, the simple, efficient and widespread use of analytics and data management that had always been hoped for seemed possible given the expansive natural language processing (NLP) capabilities of LLMs.

While many data platforms have NLP capabilities, they're limited and still require data literacy training to use. LLMs such as ChatGPT and Google Bard can eliminate that prerequisite, enabling more users within organizations to work with data.

In addition, by reducing the need to write code, generative AI promises to free the data engineers and data scientists who build and maintain data pipelines -- work that has long required writing significant amounts of code -- from time-consuming development and other repetitive tasks. Instead, they can spend more time on deep analysis.

"Essentially, [generative AI] is all about increasing the productivity of data teams, and especially data engineers who have been struggling for a long time to stay on top of their workloads," said Kevin Petrie, an analyst at Eckerson Group. "It can speed up everything they do."

But the data management technology now available hasn't yet incorporated the generative AI that will make data operations more efficient.

The tooling is being developed and, in some cases, previewed by some users. But it has not yet been released.

"We are still in the early days of applying generative AI to data management, but there is a lot of progress and potential," said Jitesh Ghai, Informatica's chief product officer. "Where things stand is most capabilities are still largely theoretical or in the research/prototype stage. We haven't seen widespread production deployment and use yet."

There are exceptions. For example, data observability specialist Monte Carlo in June launched a text-to-SQL tool that uses generative AI to let users type commands in natural language that are automatically converted to SQL code. It also developed a tool that uses generative AI to automatically alert users when there are problems with existing code.

Also in June, Dremio released its own generative AI-driven text-to-SQL translator.
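Neither vendor has detailed its implementation publicly, but the basic pattern behind such text-to-SQL features is simple: wrap a natural language question and a table schema in a prompt and ask an LLM for a query. Below is a minimal Python sketch of that pattern; the complete() helper, schema and prompt are hypothetical stand-ins for whatever model and data an organization actually uses, not Monte Carlo's or Dremio's implementation.

```python
# Minimal text-to-SQL sketch. complete() is a stand-in for any LLM completion
# call (a hosted API or an open source model); it is not a real library function.
from typing import Callable

SCHEMA = """
CREATE TABLE orders (
    order_id    INT,
    customer_id INT,
    amount      DECIMAL(10, 2),
    created_at  TIMESTAMP
);
"""

PROMPT_TEMPLATE = """You are a SQL assistant.
Given this schema:
{schema}
Write a single ANSI SQL query that answers the question below.
Return only the SQL, with no explanation.

Question: {question}
"""


def text_to_sql(question: str, complete: Callable[[str], str]) -> str:
    """Build a prompt from the schema and the user's question, send it to the
    model and return the generated SQL string."""
    prompt = PROMPT_TEMPLATE.format(schema=SCHEMA, question=question)
    return complete(prompt).strip()


if __name__ == "__main__":
    def fake_complete(prompt: str) -> str:
        # A real system would call an LLM here; this returns a canned answer.
        return "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id;"

    print(text_to_sql("What is each customer's total spend?", fake_complete))
```

A production tool would also need guardrails around this step, such as validating the generated SQL against the schema and limiting it to read-only queries before execution.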

But the few tools now generally available are basic compared to what vendors have planned, Ghai noted.

"Most current capabilities augment existing processes versus enabling brand new workflows," he said. "They provide productivity gains but still require substantial human guidance, supervision and risk management."

While many of the capabilities vendors have revealed are still in development and not yet publicly available, some organizations are applying generative AI to data management on their own, according to Petrie.

They are building their own smaller language models designed to address specific domains, such as fraud detection.

"Many companies also are starting to build language models themselves to handle domain-specific use cases, often feeding them their own data as prompts and training inputs," Petrie said. "But we're in early days."

Applications for generative AI

Analytics platforms generally have the same goal of enabling users to examine data to reach insights that lead to decisions. There are differences, such as Sisense and Domo focusing on embedded BI, Tableau and Qlik emphasizing self-service BI, and ThoughtSpot centering on natural language query.

But essentially they're trying to reach the same end point.

Data management, however, is more diverse. It encompasses the entire data pipeline from the point of ingestion through integration and preparation, right up to the point a BI tool -- or perhaps an e-commerce website or supply chain management platform -- takes over and enables action.

Within data management, vendors specialize in different parts of the pipeline.

Some, such as Fivetran and Talend, focus on data ingestion and integration. Acceldata and Monte Carlo, meanwhile, focus on data observability to monitor data as it moves from point to point throughout the pipeline to ensure its quality. Still others, including Alteryx and Informatica, have tools that address the entire data pipeline. Then there are databases, warehouses, lakes and lakehouses where data rests when not being deployed for analysis.

With so much specialization, the generative AI capabilities various vendors are developing are diverse.

Among others, Monte Carlo's generative AI capabilities are designed to make data observability more efficient. Databricks and Dremio aim to simplify lakehouse management. Alteryx's Aidin and Informatica's Claire GPT target the entire data pipeline.

"Data management vendors are using language models to enrich their pipeline tools in a few ways," Petrie said.

For example, he noted that Informatica's language models target data discoverability and exploration as well as lineage and quality, while Illumex's language models fuel its data catalog.

Farmer, meanwhile, said that he's beginning to see trends emerging within data management vendors' intended use of generative AI.

As they become more familiar with developing language models, they're learning the distinctions among pre-training models, fine-tuning them and using AI inference (applying a trained model to evaluate and analyze new information).

"These three things are very different; we're starting to see better understanding of the different phases of LLM development," he said. "That suggests that they're getting a bit more mature in their understanding."

Ghai likewise noted that some commonalities are starting to emerge, despite the different specializations within data management.

He noted that just as Monte Carlo and Dremio have released text-to-SQL tools, other vendors are doing something similar to make data engineers and data scientists more efficient. Likewise, many analytics vendors are using generative AI to summarize reports and visualizations.

Still another early application is automation of simple data preparation tasks, again reducing some of the manual burden normally placed on data engineers.

"There are some common themes emerging in how vendors are applying generative AI to analytics tools but also quite a bit of divergence in approaches," Ghai said.


Other data management trends

While the most prominent recent product development news has centered around generative AI, there have also been other notable data management developments so far in 2023.

Data observability, which is closely tied to data quality, is one that is gaining momentum, according to Ghai.

Data observability used to be a relatively simple process when organizations had just a few primary data sources and kept all their data in an on-premises database.

Now, however, many enterprises have not only moved some or all of their data to the cloud but also store it across multiple clouds. Meanwhile, as the amount of data collected worldwide grows exponentially, the data individual organizations collect is expanding just as rapidly. That data is also far more complex than in the past, coming in from more diverse sources and in disparate formats.

Monitoring all that data to ensure its quality, therefore, is more difficult now than even a handful of years ago.

As a result, data observability specialists such as Acceldata and Monte Carlo have emerged to address the need.
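At their simplest, the checks such tools automate boil down to tests for freshness, volume and completeness. The pandas sketch below illustrates the idea; the thresholds and column names are invented for the example and are far more basic than what commercial observability platforms offer.

```python
# Minimal data observability sketch: freshness, volume and null-rate checks
# over one table with pandas. Thresholds and column names are illustrative.
import pandas as pd


def check_table(df: pd.DataFrame, expected_min_rows: int = 1000,
                max_null_rate: float = 0.05,
                freshness_hours: int = 24) -> list[str]:
    """Return human-readable alerts for basic data quality issues."""
    alerts = []

    # Volume: did this load deliver roughly as many rows as expected?
    if len(df) < expected_min_rows:
        alerts.append(f"Low volume: {len(df)} rows (expected >= {expected_min_rows})")

    # Completeness: is any column unexpectedly sparse?
    for col in df.columns:
        null_rate = df[col].isna().mean()
        if null_rate > max_null_rate:
            alerts.append(f"Column {col}: {null_rate:.1%} nulls exceeds {max_null_rate:.0%}")

    # Freshness: is the newest record recent enough?
    if "updated_at" in df.columns:
        newest = pd.to_datetime(df["updated_at"], utc=True).max()
        age = pd.Timestamp.now(tz="UTC") - newest
        if age > pd.Timedelta(hours=freshness_hours):
            alerts.append(f"Stale data: newest record is {age} old")

    return alerts


if __name__ == "__main__":
    sample = pd.DataFrame({
        "order_id": range(1, 6),
        "amount": [10.0, None, 25.5, 40.0, 12.0],
        "updated_at": pd.to_datetime(["2023-07-01"] * 5),
    })
    for alert in check_table(sample, expected_min_rows=3):
        print(alert)
```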

"We have seen a belated -- but much required -- resurgence in the interest in data quality and supporting capabilities," Ghai said. "Generative AI may be one of the factors. But also organizations are realizing that their data is the only sustainable moat, and without data quality, this moat is not that strong."

Tied in with data quality and observability, which fall under the purview of data governance, Ghai added that data privacy continues to gain importance.

"There's a growing focus on building analytics with strong privacy protection via techniques like confidential computing and differential privacy," he said. "Driven by regulations like GDPR and rising consumer expectations, data privacy is important for ethically managing sensitive data."


Like Ghai, Petrie noted that many organizations are prioritizing data quality -- and overall data governance -- as their data volume grows and fuels new use cases, including generative AI.

"AI creates exciting possibilities," he said. "But without accurate and governed data, it will not generate the expected business results. I believe this realization will prompt companies to renew their investments in product categories such as data observability, master data management and data governance platforms."

Meanwhile, driven by the complexity of data management, developing connected ecosystems is a rising trend, according to Farmer.

Different systems need to work together for organizations to develop efficient data pipelines. While it's possible to deploy ingestion, integration, preparation, storage and analytics tools from a single vendor, most organizations want to avoid being locked in and completely beholden to a single vendor.

Now, more vendors are attempting to enable that freedom by developing connectors and integrations that allow an integration platform from one vendor to work with an observability tool from another, with both connecting easily to multiple cloud data storage platforms.

"There is a lot of good work happening on integration with data lakes and data lakehouses, driven by companies like Snowflake and Databricks," Farmer said. "They're making data available in simpler to consume ways. Analytics companies are building strong relationships with those platforms. We're seeing analytics at the back end becoming more and more integrated."

Beyond integrations, Farmer noted that vendors are developing data fabric and data mesh approaches to data management and analysis that enable domains or departments within organizations to manage their own data while keeping the organization's data connected rather than isolated.

Microsoft and Salesforce are among the vendors enabling data fabrics, while Qlik and Starburst are among those enabling data mesh.

"That's a very significant move in the long term," Farmer said.

The months ahead

The data management trends that shape the final months of 2023 will be the same ones that have shaped the year to date, according to Ghai.

Vendors will unveil more generative AI product development plans that demonstrate incremental improvement over the tools initially revealed following the launch of the new generation of LLMs. Few, however, will likely be released by the start of 2024.

"I believe we'll see announcements but not a huge flood of production-ready capabilities by end of 2023," Ghai said. "There will be many proofs of concepts and announcements as companies showcase potential but few full-fledged products ready for customers."

Beyond generative AI, data governance will continue to be an emphasis, especially as it relates to LLMs, he continued.

"There will be lots of discussion and debate around the data governance implications of generative AI but no definitive solutions," Ghai said. "By end of 2023, we'll have a clearer sense of what works well versus what remains aspirational when blending data management and generative AI."

Farmer likewise said the data management trends that mark the final months of 2023 will largely resemble those that dominated the first months of the year.

However, a wave of innovation looms beyond the end of 2023, he said.

Venture capital investments in technology companies, including startup data management vendors, have been difficult to come by since the end of 2021, when tech stock prices plummeted.

Eventually, that investment freeze will thaw, according to Farmer. When that time comes, there will be a new wave of data management innovation as startups develop new capabilities that advance the industry as a whole.

"This is a really difficult investment environment for startups. It has rarely been more difficult to raise money," Farmer said. "It means that there are a lot of ideas out there that are not getting funded. These cycles go around. And when the investing phase comes back, there's a pent-up catalog of innovation waiting to be released on the world."

He noted that venture capitalists are investing in AI startups, but that will slow as the hype around generative AI outstrips the reality.

"I think next year is going to be a much better year for startups in the data management space than this year," Farmer said.

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
