3 growing applications of AI in data management

There are plenty of ways AI can augment data professionals throughout the data pipeline, from sifting through large data sets for duplicates to easing the preparation process.

Science fiction has long anticipated the day artificial intelligence would be created, usually with dystopian results. The reality has proven more prosaic so far, with initial promises of expert systems replacing all manner of human experts turning out to be elusive.

There have been many advances in AI in recent years, leading to a level of excitement about its potential use in areas such as medicine, fraud detection and even generating email marketing subject lines -- an application eBay has already employed. How does this brave new world of AI apply to data management?

AI can streamline data management in several ways. Here are three of the main applications for this growing set of technologies.

AI in master data management

An obvious example of AI in data management is in data matching, which is a core element of data quality and master data management tools. 

It is quite common to find 20%-30% duplication in materials master files and other supposed master data sources. In large companies, data related to key subjects such as customers or products is often duplicated across multiple systems. The various versions of a customer name and address record may be incomplete, out of date or just plain wrong. And employees may enter data into assorted sales and marketing systems without realizing that a customer record already exists.

The need to root out duplicates has led to various tools that apply algorithms to detect common misspellings, verify postal codes and recognize that Robert and Bob may be the same person. However, only some records are obvious duplicates, and a further proportion of potential duplicates needs to be reviewed by a human expert.
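To make the matching step concrete, here is a minimal Python sketch of the idea: normalize names, score the similarity of two records and route borderline pairs to a human reviewer. The nickname table, the field names and the 0.9 and 0.7 cut-offs are illustrative assumptions, not values taken from any particular product, and commercial matching tools use far richer logic.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real tool would ship a much larger dictionary.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william", "liz": "elizabeth"}


def normalize_name(name):
    """Lowercase the name, drop periods and commas, and expand known nicknames."""
    tokens = name.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(NICKNAMES.get(token, token) for token in tokens)


def similarity(a, b):
    """Average the string similarity of the normalized names and the addresses."""
    name_score = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    address_score = SequenceMatcher(
        None, a["address"].lower(), b["address"].lower()
    ).ratio()
    return (name_score + address_score) / 2


def classify(a, b, auto_match=0.9, review=0.7):
    """Return 'match', 'review' (send to a human) or 'distinct'.

    The thresholds are assumed values chosen for illustration only.
    """
    score = similarity(a, b)
    if score >= auto_match:
        return "match"
    if score >= review:
        return "review"
    return "distinct"


if __name__ == "__main__":
    first = {"name": "Bob Smith", "address": "12 High Street"}
    second = {"name": "Robert Smith", "address": "12 high street"}
    print(classify(first, second))
```

The middle band between the two thresholds is where the human expert comes in: those pairs are neither safe to merge automatically nor safe to ignore.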

An expert system can be trained by watching a human expert review many hundreds of such records and then devising rules that allow the software to get gradually better at mimicking the human expert's behavior. In this way, the software can credibly match records automatically in a much higher percentage of cases.
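As a deliberately simple stand-in for that learning loop, the sketch below picks the similarity cut-off that agrees most often with a reviewer's past decisions. It assumes each reviewed pair already has a similarity score between 0 and 1 (however it was computed) and a yes/no verdict from the expert; the function name and the sample data are hypothetical. Real systems typically learn from many features rather than a single score.

```python
def learn_threshold(reviewed):
    """Return the cut-off that agrees with the most expert decisions.

    reviewed: list of (similarity_score, expert_said_duplicate) pairs.
    A pair is predicted to be a duplicate when its score >= the cut-off.
    """
    candidates = sorted({score for score, _ in reviewed})
    best_cut, best_correct = 1.0, -1
    for cut in candidates:
        # Count how many expert verdicts this cut-off would reproduce.
        correct = sum((score >= cut) == is_duplicate for score, is_duplicate in reviewed)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut


if __name__ == "__main__":
    # Hypothetical review history: (score, did the expert merge the records?)
    history = [
        (0.95, True), (0.88, True), (0.81, True),
        (0.79, False), (0.72, False), (0.60, False),
    ]
    print(learn_threshold(history))
```

As the reviewer works through more records, rerunning the function on the longer history moves the cut-off, which is one simple way software can get "gradually better" at mimicking the expert.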

AI in data catalogs

Another area where AI has promise is in data catalogs or metadata repositories, which have long suffered from getting out of date as the landscape of applications in an enterprise changes.

Business term tagging via machine learning can actively learn from expert user input and suggest terms based on previous human actions. The system can recognize the similarity between items in the data catalog and make suggestions on business terms to be assigned.
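A rough sketch of how such a suggestion might work: compare the name of an untagged column with columns that users have already tagged, and propose the business term of the closest one. Token overlap (Jaccard similarity) is used here as a simple stand-in for the machine learning a real catalog would apply; the column names, business terms and the 0.3 minimum score are all hypothetical.

```python
import re


def tokens(column_name):
    """Split snake_case or camelCase column names into a set of lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", column_name)
    return set(re.split(r"[^A-Za-z0-9]+", spaced.lower())) - {""}


def jaccard(a, b):
    """Share of tokens the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_term(column_name, tagged, min_score=0.3):
    """Suggest a business term from the most similar previously tagged column.

    tagged: dict mapping column names to the business terms users assigned.
    Returns None when nothing is similar enough (min_score is an assumed value).
    """
    target = tokens(column_name)
    best = max(tagged, key=lambda name: jaccard(target, tokens(name)), default=None)
    if best is None or jaccard(target, tokens(best)) < min_score:
        return None
    return tagged[best]


if __name__ == "__main__":
    # Hypothetical tags captured from earlier expert input.
    tagged = {
        "cust_postal_code": "Customer Postal Code",
        "order_total_amt": "Order Total",
        "custEmailAddr": "Customer Email",
    }
    print(suggest_term("cust_email", tagged))
```

Each time a user accepts or corrects a suggestion, adding that column and term to the `tagged` dictionary gives the next suggestion more to work from.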

AI in data preparation

A further area where AI in data management is beneficial is in data preparation, the process of taking raw data and preparing it for further processing and analysis.

Data preparation is an essential exercise as you identify your sources of data, which may overlap; figure out where the data is being used and whether it is trustworthy; decide whether it needs to be linked to other data sources; and possibly enrich it with additional attributes.

AI tools are well suited to analyzing relationships between data sources and applying survivorship rules to decide which sources are most trustworthy. For example, AI programs can determine that an address updated last month may be more reliable than one updated 10 years ago.
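A survivorship rule of that kind can be sketched in a few lines of Python. The version below keeps the most recently updated value and, when two values share the same date, prefers the more trusted source. The source names, the trust ranking and the record layout are assumptions made for illustration; in practice such rules are usually configured per attribute.

```python
from datetime import date

# Hypothetical trust ranking: a higher number means a more trusted source.
SOURCE_TRUST = {"crm": 3, "billing": 2, "web_form": 1}


def surviving_value(candidates):
    """Pick the record to keep: newest update first, then most trusted source."""
    return max(
        candidates,
        key=lambda c: (c["updated"], SOURCE_TRUST.get(c["source"], 0)),
    )


if __name__ == "__main__":
    addresses = [
        {"value": "1 Old Road", "source": "crm", "updated": date(2014, 5, 1)},
        {"value": "7 New Lane", "source": "web_form", "updated": date(2024, 4, 1)},
    ]
    print(surviving_value(addresses)["value"])
```

Here recency outranks source trust, so the recently updated web form address survives even though the CRM is ranked higher. Swapping the order of the two items in the key would encode the opposite policy.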

Just as with data matching, in many cases things are not clear-cut and require human judgement. By watching the actions of domain experts, an AI program can steadily learn how to mimic the judgement of an expert human.

Challenges with AI in data management

While there are many benefits to AI-driven data management, the technology is still growing and has proven challenging in some environments. Many AI models are black boxes, meaning that they struggle to explain their reasoning in a way that is accessible to humans. This makes trust an issue, especially when there are well-publicized examples where the AI did not deliver as expected.

In 2013, IBM partnered with the University of Texas MD Anderson Cancer Center to use IBM Watson to scour research and patient data to spot patterns that would help doctors fight cancer. An admirable goal, but after five years a review of the system found "multiple examples of unsafe and incorrect treatment recommendations," according to medical specialists on the project.

A 2018 survey of 200 CIOs by Databricks found some major challenges in deploying AI programs. Ninety-eight percent of survey respondents described the preparation of large data sets as challenging, 96% said the same for data exploration and iterative model training, and 90% found deploying AI models into production to be challenging.

Nonetheless, in well-defined areas such as data matching and data catalogs, there is clear potential to automate tasks humans have traditionally found tedious. In many cases, sensible application of AI in data management -- without overselling its capabilities -- may bring real benefits to enterprises.
