
GenAI demands greater emphasis on data quality

As GenAI explodes and enables enterprise decision-making at previously unseen speed and scale, accurate information to train models and applications becomes increasingly important.

Data quality has perhaps never been more important. And a year from now, and a year after that, it will likely matter even more than it does today.

The reason: AI, and in particular, generative AI.

Given its potential benefits, including exponentially increased efficiency and more widespread use of data to inform decisions, enterprise interest in generative AI is exploding. But for enterprises to benefit from generative AI, the data used to inform models and applications needs to be high-quality. The data must be accurate for the generative AI outputs to be accurate.

Meanwhile, generative AI models and applications require massive amounts of data to understand how to respond to a user's query. Their outputs aren't based on individual data points, but instead on aggregations of data. So, even if the data used to train a model or application is high-quality, if there's not enough of it, the model or application will be prone to delivering an incorrect output, known as an AI hallucination.

With so much data needed to reduce the likelihood of hallucinations, data pipelines need to be automated. And because humans can't monitor every data point or data set at every step of an automated pipeline, it's imperative that the data be high-quality from the start and that there be checks on outputs at the end, according to David Menninger, an analyst at ISG's Ventana Research.

Otherwise, the result could be not only inaccurate outputs, but also biased and potentially offensive ones.

"Data quality affects all types of analytics, but now, as we're deploying more and more generative AI, if you're not paying attention to data quality, you run the risks of toxicity, of bias," Menninger said. "You've got to curate your data before training the models, and you have to do some postprocessing to ensure the quality of the results."

In response, enterprises are placing greater emphasis on data quality than in the past, according to Saurabh Abhyankar, chief product officer at longtime independent analytics vendor MicroStrategy.

"We're actually seeing it more than expected," he said.

Likewise, Madhukar Kumar, chief marketing officer at data platform provider SingleStore, said he is seeing increased emphasis on data quality. And it goes beyond just accuracy, he noted. Security is an important aspect of data quality. So is the ability to explain decisions and outcomes.

"The reason you need clean data is because GenAI has become so common that it's everywhere," Kumar said. "That is why it has become supremely important."

However, ensuring data quality to get the benefits of AI isn't simple. Nor are the consequences of bad data quality.

The rise of GenAI

The reason interest in generative AI is exploding -- the "why" behind generative AI being everywhere and requiring that data quality become a priority -- is that it has transformative potential in the enterprise.

Data-driven decisions have proven to be more effective than those not informed by data. As a result, organizations have long wanted to get data in the hands of more employees to enable them to get in on the decision-making process.

But despite the desire to broaden analytics use, only about a quarter of employees within most organizations use data and analytics as part of their workflow. And that has been the case for years, perhaps dating back to the start of the 21st century.

The culprit is complexity. Analytics and data management platforms are intricate. They largely require coding to prepare and query data, and data literacy training to analyze and interpret it.

Vendors have attempted to simplify the use of their tools with low-code/no-code capabilities and natural language processing features, but to little avail. Low-code/no-code capabilities don't enable deep exploration, and the NLP capabilities developed by data management and analytics vendors have limited vocabularies and still require data literacy training to use.

Generative AI lowers the barriers that have held back wider analytics use. Large language models have vocabularies as large as any dictionary and therefore enable true natural language interactions that reduce the need for coding skills. In addition, LLMs can infer intent, further enabling NLP.

When generative AI is combined with an enterprise's proprietary data, suddenly any employee with a smartphone and proper clearance can work with data and use analytics to inform decisions.

"With generative AI, for the first time, we have the opportunity to use natural language processing broadly in various software applications," Menninger said. "That ... makes technology available to a larger portion of the enterprise. Not everybody knows how to use a piece of software. You don't have to know how to use the software; you just have to know how to ask a question."

Generative AI chatbots -- tools that enable users to ask questions using natural language and get responses in natural language -- are not foolproof, Menninger added.

"But they're a huge improvement," he said. "Software becomes easier to use. More people use it. You get more value from it."

Meanwhile, data management and analytics processes -- integrating and preparing data to make it consumable; developing data pipelines; building reports, dashboards and models -- require tedious, time-consuming work by data experts. Even more tedious is documenting all that work.

Generative AI changes that as well. NLP reduces coding requirements by enabling developers to write commands in natural language that generative AI can translate to code. In addition, generative AI can be trained to carry out certain repetitive tasks on its own, such as writing code, creating data pipelines and documenting work.

"There are a lot of tasks humans do," Abhyankar said. "People are overworked, and if you ask them what they are able to do versus what they'd like to be able to do, most will say they want to do five or 10 times more. One benefit of good data with AI on top of it is that it becomes a lever and a tool to help the human being be potentially multiple times more efficient than they are."

Eventually, generative AI could wind up being as transformational for knowledge workers as the industrial revolution was for manual laborers, he said. Just as an excavator is multiple times more efficient at digging a hole than a construction worker with a shovel, AI-powered tools have the potential to make knowledge workers multiple times more efficient.

Donald Farmer, founder and principal of TreeHive Strategy, likewise noted that one of the main potential benefits of effective AI is efficiency.

"It enables enterprises to scale their processes with greater confidence," he said.

However, the data used to train the AI applications that enable almost anyone within an organization to ask questions of their data and use the responses to inform decisions had better be right. Similarly, the data used to train the applications that take on time-consuming, repetitive tasks that dominate data experts' time had better be right.

The need for data quality

Data quality has always been important. It didn't just become important in November 2022 when OpenAI's launch of ChatGPT -- which represented a significant improvement in LLM capabilities -- initiated an explosion of interest in developing AI models and applications.

Bad data has long led to misinformed decisions, while good data has always led to informed decisions.

The six elements of data quality are accuracy, completeness, consistency, timeliness, uniqueness and validity.
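The elements listed above can be turned into measurable checks. As a minimal sketch, the hypothetical function below scores a batch of records on three of them (completeness, uniqueness and validity). The field names, sample records and age rule are assumptions made for illustration, and accuracy, consistency and timeliness would need reference data or timestamps that this example does not model.

```python
def quality_report(records, required_fields, key_field, validators):
    """Score a batch of records on completeness, uniqueness and validity.

    Each score is the fraction of records (0.0 to 1.0) that pass the check.
    All names here are hypothetical and chosen only for illustration.
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to score")

    # Completeness: every required field is present and non-empty.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields)
        for r in records
    )

    # Uniqueness: distinct key values relative to the number of records.
    unique_keys = len({r.get(key_field) for r in records})

    # Validity: every validated field passes its rule.
    valid = sum(
        all(check(r.get(f)) for f, check in validators.items())
        for r in records
    )

    return {
        "completeness": complete / total,
        "uniqueness": unique_keys / total,
        "validity": valid / total,
    }


# Made-up sample data: one blank email, one duplicate id, one impossible age.
sample_records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": 51},
    {"id": 2, "email": "c@example.com", "age": -4},
    {"id": 3, "email": "d@example.com", "age": 29},
]
age_rule = {"age": lambda v: isinstance(v, int) and 0 <= v <= 120}

report = quality_report(sample_records, ["id", "email", "age"], "id", age_rule)
print(report)
```

Scores like these are only a starting point. A team would still have to decide what threshold is acceptable for a given use.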

But the scale and speed of decision-making were different before generative AI. So were the checks and balances. As a result, both the benefits of good data quality and consequences of bad data quality were different.

Until the onset of self-service analytics spurred by vendors such as Tableau and Qlik some 15 years ago, data management and analytics were confined to teams of IT professionals working in concert with data analysts. Consumers -- the analysts -- usually had to submit a request to data stewards, who would then develop a report or dashboard that could be analyzed to inform a decision.

The process took days at a minimum and often took months. And even when the report or dashboard was developed, it often had to be redone multiple times as the end user realized the question they asked wasn't quite right or the resulting data product led to follow-up questions.

During the development process, IT teams worked closely with the data used to inform the reports and dashboards they built. They were hands-on, and they had time to make sure the data was accurate.

Self-service analytics altered the paradigm, removing some of the control from centralized IT departments and enabling end users with the proper skills and training to work with data on their own. In response, enterprises developed data governance frameworks to both set limits on what self-service users could do with data -- to protect against self-service users going too far -- and also give the business users freedom to explore within certain parameters.

The speed and scale of data management and analytics-based decision-making increased, but it was still limited to a group of trained users who, with their expertise, were usually able to recognize when something seemed off in the data and avoid acting hastily.

Now, just as generative AI changes who within an organization can work with data and what experts can do with it, it changes the speed and scale of data-informed decisions and actions. To feed that speed and scale with good data, automated processes -- overseen by humans who can intervene when necessary -- are required, according to Farmer.

"It puts an emphasis on processes that can be automated, identifying data-cleaning processes that require less expertise than before," Farmer said. "That's where it's changing. We're trying to do things at much greater scale, and you just can't have a human in the loop at that scale. Whether the process can be audited is very important."

Abhyankar compared the past and present to the difference between a small, Michelin-starred gourmet restaurant and a fast-food chain.

The chef at the small restaurant, each day, can shop for the ingredients of every dish and then oversee the kitchen as each dish gets made. At a chain, the scale of what needs to be bought and the speed with which the food needs to be made make it impossible for a chef to oversee every detail. Instead, a process ensures no bad meat or produce makes it into meals served to consumers.

"[Data quality] is really important in a world where you're going from hand-created dashboards and reports to a world where you want AI to do [analysis] at scale," Abhyankar said. "But you can't scale unless you have a system in place so [the AI application] can be precise and personalized to serve many more people with many more insights on the fly. To do that, the data quality simply has to be there."

Benefits and consequences

Enterprise interest in developing AI models and applications, and in using AI to inform decisions and automate processes, is rising because of the potential benefits. All of those uses need high-quality data as a foundation.

The construction worker who now has an excavator to dig a hole rather than a shovel can be multiple times more efficient. And in concert with a few others at the controls of excavators, they can dig the foundation for a new building perhaps a hundred times faster than they could by hand.

A construction worker with a cement mixer can follow up and pour the foundation multiple times faster than if they had to mix the cement and pour it by hand. Next, the girders can be moved into place by cranes rather than carried by humans, and so on.

It adds up to an exponentially more efficient construction process.

The same is true of AI in the enterprise. Just as construction teams can rely on the engines and controls in excavators, cement mixers, cranes and other vehicles that scale the construction process, if the data fueling AI models and applications is trustworthy, organizations can confidently scale business processes with AI, according to Farmer.

And scale in the business world -- being able to do exponentially more without having to expand staff -- means growth.

"Data quality enables enterprises to scale their processes with greater confidence," he said. "It enables them to build fine-grained processes like hyperpersonalization with greater confidence. Next-best offers, recommendation engines, things that can be highly optimized for an individual -- that sort of thing becomes very possible."

Beyond retail, another common example is fraud detection, according to Menninger. Detecting fraud amid millions of transactions can be nearly impossible. AI models can check all those transactions, while not even teams of humans have the capacity to look at them all, much less find patterns and relationships between them.

"If accurate data is being fed into the models to detect fraud, and you can improve the detection even just slightly, that ends up having a large impact," Menninger said.

But just as the potential benefits of good-quality data at the core of AI are greater than those of good data without AI, the consequences of bad data at the core of AI are greater than the consequences of bad data without AI. The speed and scale that AI models and applications enable result in the broader and faster spread of fallout from poor decisions and actions.

Back when IT teams controlled their organizations' data and when a limited number of self-service users contributed to decisions, the main risk of bad data was lack of trust in data-informed decisions and the resulting loss of efficiencies, according to MicroStrategy's Abhyankar. In rare cases, it could lead to something more severe, but there was usually time for someone to step in and stop something from happening before it spread.

Now, the potential exists to not only scale previous problems, but also create new ones.

If AI models and applications are running processes and making decisions without someone checking them before actions are taken, it could lead to significant ethical problems such as baselessly denying an applicant a credit card or mortgage. Similarly, if a human uses AI outputs to make decisions, but the output is misinformed, it could result in serious ethical issues.

"You scale the previous problems," Abhyankar said. "But it's actually worse than that. In scenarios where the AI is making decisions, you're making bad decisions at scale. If you run into ethical problems, it's catastrophically bad for an organization. But even when AI is just delivering information to a human being, you're scaling the problems."

Farmer noted that AI doesn't deliver outputs based on single data points. AI models and applications are statistical, looking at broad swaths of data to inform their actions. As long as most of the data used to train a model or application is correct, the model or application will be useful.

"If a data set is poor quality, you'll get poor results," Farmer said. "But if one piece of data is wrong, it's not going to make much difference to the AI because it's looking at statistics as a whole."

That is, unless it's a fine-grained decision about an individual, such as whether to approve a mortgage application. In that case, if the data is wrong, it can lead to serious ethical consequences. Even more catastrophically, in a healthcare scenario, bad data could mean the difference between life and death.

"If we're using AI to make decisions about individuals -- are we going to give someone a mortgage -- then having high-quality individual data becomes extremely important, because then we have given this system over," Farmer said. "If we're talking about AI making fine-grained decisions, then the data has to be very high-quality."

Ensuring data quality

With data quality so critical to the success of AI, as well as to reaping the benefits of broader use of technologies and exponentially increased efficiency, the obvious question is how enterprises can ensure good data goes into models and applications so that good outputs result.

There is, unfortunately, no simple solution -- no fail-safe.

Data quality is difficult. Enterprises have always struggled to ensure only good-quality data is used to inform decisions. In the era of AI, including generative AI, that's no different.

"The problem is still hard," Abhyankar said.

But there are steps that organizations can take to lessen the likelihood of bad data slipping through the cracks and affecting the accuracy of models and applications. There are technologies they can use and processes they can implement.

Ironically, many of the technologies that can detect bad data use AI to do so.

Vendors such as Informatica and Oracle offer tools designed specifically to monitor data quality. These tools can look at data characteristics such as metadata and data lineage, sometimes have master data management capabilities, and in general are built to detect problematic data. Other vendors such as Alation and Collibra provide data catalogs that help enterprises organize and govern data, including descriptions of data, to provide users with information before they operationalize any data.

Still other vendors, including Acceldata and Monte Carlo, offer data observability platforms that use AI to monitor data as it moves through data pipelines, detecting irregularities as they occur and automatically alerting customers to potential problems. Unlike data quality tools and data catalogs, which address data quality while data is at rest before being used to train AI models and applications, observability tools monitor data while it is in motion on its way to a model or application.
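As a rough illustration of the kind of in-motion check described above, the sketch below flags a pipeline batch whose row count departs sharply from earlier batches. It uses a simple z-score rule rather than the AI-based methods that commercial observability platforms apply, and the function name, threshold and sample counts are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(row_counts, threshold=3.0, min_history=5):
    """Return the indices of batches whose row count looks irregular.

    A batch is flagged when its count is more than `threshold` standard
    deviations away from the mean of earlier, unflagged batches.
    This is an illustrative rule, not any vendor's actual method.
    """
    flagged = []
    for i in range(min_history, len(row_counts)):
        # Compare against earlier batches only, skipping ones already flagged
        # so that a past irregularity does not distort the baseline.
        history = [c for j, c in enumerate(row_counts[:i]) if j not in flagged]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is irregular.
            if row_counts[i] != mu:
                flagged.append(i)
            continue
        if abs(row_counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up row counts for eight consecutive pipeline runs.
batch_row_counts = [1000, 1010, 990, 1005, 995, 1002, 120, 1001]
suspect_batches = find_anomalies(batch_row_counts)
print(suspect_batches)
```

In practice, a flagged batch would trigger an alert for a person to investigate rather than being dropped automatically.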

"Increasingly, AI is actually in a sense running its own data quality," Farmer said. "Many of those tools work on inferences, work on discovering patterns of the data. It turns out that AI is very good at that and doing it at scale."

More important than any tooling, however, is that humans always remain involved and check any output before it is used to take action.

Just as a hybrid approach emerged as ideal for cloud computing -- including on-premises, private cloud and public cloud -- a hybrid approach that uses technology to augment humans is emerging as the ideal approach to working with the data used to train AI, according to SingleStore's Kumar.

"First and foremost is to allow humans to have control," he said.

Humans simply know more about their organization's data than machines and can better spot when something seems off. Humans have been working with their organization's data since its founding. In some cases, that means decades' worth of code sits behind dashboards and reports, embodying conventions that humans can replicate perfectly but that a machine might not know.

Humans, in a simple example, know whether their company's fiscal year starts on Jan. 1 or some other date, while a model might assume it starts on Jan. 1.
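The fiscal-year example shows how a piece of organizational context can be written down explicitly rather than left for a model to guess. The sketch below is a minimal illustration. The function name and the `start_month` parameter are hypothetical, and it assumes a fiscal year is labeled by the calendar year in which it ends, which is a common convention but not a universal one.

```python
from datetime import date


def fiscal_year(day, start_month=1):
    """Return the fiscal year that `day` falls in.

    Assumes a fiscal year is labeled by the calendar year in which it ends.
    With start_month=1, the fiscal year equals the calendar year.
    """
    if not 1 <= start_month <= 12:
        raise ValueError("start_month must be between 1 and 12")
    if start_month == 1 or day.month < start_month:
        return day.year
    return day.year + 1


# The same date lands in different fiscal years depending on the convention.
sample_day = date(2024, 8, 15)
calendar_based = fiscal_year(sample_day)               # fiscal year starts Jan. 1
july_based = fiscal_year(sample_day, start_month=7)    # fiscal year starts July 1
print(calendar_based, july_based)
```

A model that silently assumes the Jan. 1 default would place this date in the wrong fiscal year for a company that starts its year in July.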

"Hybrid means human plus AI," Kumar said. "There are things AI is really good at, like repetition and automation, but when it comes to quality, there's still the fact that humans are a lot better because they have a lot more context about their data."

If there's a human at the end of the process to check outputs, organizations can better ensure actions taken will have their intended results, and some potentially damaging actions can be avoided.

If there's a person to confirm whether a mortgage application should be approved or rejected, it will benefit the organization's bottom line. An approved mortgage will generate profit and avoid the serious consequences of mistakenly declining an application based on biased data, while a declined mortgage will avoid potential losses related to a default.

If there's a healthcare worker to check whether a patient is allergic to a recommended medication or whether that medication might interact badly with another medication the patient is taking, it could save a life.
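One way to picture that human checkpoint is as a routing rule placed in front of any automated action. The sketch below is a hypothetical illustration, not a description of any vendor's product. The `Decision` fields, the confidence threshold and the route names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    subject: str              # who or what the decision is about
    action: str               # what the model proposes to do
    affects_individual: bool  # e.g., a mortgage or a medication
    confidence: float         # model's self-reported confidence, 0.0 to 1.0


def route(decision, min_confidence=0.9):
    """Send a proposed decision to a person or let it proceed.

    Anything that affects an individual, or that the model is unsure
    about, goes to a human reviewer before any action is taken.
    """
    if decision.affects_individual or decision.confidence < min_confidence:
        return "human_review"
    return "auto_apply"


# Hypothetical proposals from an AI application.
mortgage = Decision("applicant-1", "decline_mortgage", True, 0.99)
summary = Decision("pipeline-docs", "generate_summary", False, 0.95)
unsure = Decision("pipeline-docs", "rewrite_schema", False, 0.60)

print(route(mortgage), route(summary), route(unsure))
```

Under this rule, a high confidence score never bypasses review for a decision about an individual. Only low-stakes, repetitive work proceeds automatically.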

The AI models and applications, fueled by data, can be left to do their work. They can automate repetitive processes, generate code to develop applications, write summaries and documentation, respond to user questions in natural language and so on. They're good at those tasks, when informed by good-quality data.

But they're not perfect, even when the data used to train them is as good as possible.

"There always has to be human intervention," Menninger said.

Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.
