
How AI could save the enterprise data catalog

Enterprise data catalogs record all the databases and files in a corporation, but they can easily become out of date. Advances in AI are helping minimize the problem.

The idea of a data catalog isn't new. When I started working for Exxon in 1983 as a bright-eyed entry-level systems programmer, one of my first tasks was working on the IBM Data Dictionary. Even back in those days enterprises struggled to keep track of their data assets, which at that time were at least conveniently all in one place on a mainframe.

A data catalog stores data about data, or metadata. An enterprise data catalog records all the databases and files that exist across a corporation, adding a description of each and potentially noting relationships between files, for example. It allows a business user to quickly find the sources of information they are seeking -- whether that's asset data or information about company locations, products and suppliers. But a data catalog is only useful if it's kept up to date -- and that can be tough in a rapidly shifting business.
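To make the idea concrete, here is a minimal sketch of a catalog as a data structure. The class name, field names and sample entries are all hypothetical, invented for illustration; commercial catalog products store far richer records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """One catalog record: metadata about a data source, not the data itself."""
    name: str          # database or file name
    kind: str          # "database" or "file"
    description: str   # plain-language summary for business users
    related_to: List[str] = field(default_factory=list)  # names of related sources


def search(catalog, keyword):
    """Return entries whose name or description mentions the keyword."""
    needle = keyword.lower()
    return [
        entry for entry in catalog
        if needle in entry.name.lower() or needle in entry.description.lower()
    ]


# Hypothetical sample entries.
catalog = [
    CatalogEntry("supplier_master", "database",
                 "Approved suppliers and their contact details",
                 related_to=["product_catalog"]),
    CatalogEntry("product_catalog", "database",
                 "Products offered for sale, by region"),
    CatalogEntry("site_locations.csv", "file",
                 "Company locations and site codes"),
]

matches = [entry.name for entry in search(catalog, "supplier")]
print(matches)
```

A business user looking for supplier information would be pointed to the supplier_master database, along with a note that it relates to product_catalog.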

Metadata and catalogs explained

A simple way to understand metadata is to think about a movie. A film is stored in a broadcaster's library of films, but you need to store more than just its title. It's important to know how long the movie is, which actors are in it, who directed it, who wrote it and information about the screenplay. All of that is metadata about the movie.

The terms data catalog, data dictionary and business glossary have been loosely thrown around. There are subtle differences: a business glossary is aimed at business users, while a data dictionary targets a more technical audience. But all three are broadly concerned with metadata -- and they all have similar challenges.

Enterprise data catalog challenges

Early data catalogs focused on technical data, such as how many fields were in a database, whether a field was numeric or character-based, how long it was, and whether it had a range of valid values. Later, the definitions expanded to include information about types of business data and even definitions of that data, such as what is a "customer" or "product" or "asset."
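Technical metadata of this kind can usually be read straight from the database itself. The sketch below assumes a SQLite database and uses Python's built-in sqlite3 module; the customer table and its contents are made up for the example, and other database products expose the same information through their own system catalogs.

```python
import sqlite3


def describe_table(conn, table):
    """Return one technical-metadata record per column of a table.

    The table name is placed directly into the SQL text, so it must come
    from a trusted source.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {"field": name, "type": col_type,
         "required": bool(notnull), "primary_key": bool(pk)}
        for _cid, name, col_type, notnull, _default, pk in rows
    ]


def value_range(conn, table, column):
    """Return the smallest and largest values currently stored in a column."""
    return conn.execute(
        f"SELECT MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()


# Hypothetical table, built in memory so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, name VARCHAR(40) NOT NULL, credit_limit NUMERIC)"
)
conn.executemany(
    "INSERT INTO customer (name, credit_limit) VALUES (?, ?)",
    [("Acme", 5000), ("Globex", 12000)],
)

fields = describe_table(conn, "customer")
entry = {
    "table": "customer",
    "field_count": len(fields),
    "fields": fields,
    "credit_limit_range": value_range(conn, "customer", "credit_limit"),
}
print(entry)
```

Note that the observed minimum and maximum only describe the values stored today. A true range of valid values is a business rule, and someone still has to supply or confirm it.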

The key problem is that an enterprise data catalog can become out of date. Often, well-meaning employees type in information about the contents of various systems and databases, but there's little incentive to keep this information properly up to date. When new systems are implemented, entire companies are acquired and their systems added to the corporate portfolio, or reorganizations occur, that painstakingly entered descriptive metadata becomes outdated to the point where it's no longer trusted. Then it falls into disuse.

While many companies put a lot of effort into implementing an effective catalog, few sustain the effort needed to keep that enterprise data catalog in sync as the business changes. Consequently, although several software vendors have offered data catalog, data dictionary and business glossary products over the years, those products never really took off in terms of broad implementation.

AI and the enterprise data catalog

In an era of mounting volumes and varieties of data, keeping an enterprise data catalog up to date has become increasingly difficult. But technologies like artificial intelligence could transform the data catalog market.

Machine learning programs can sift through enterprise databases and file systems to collect metadata tags automatically. This process is similar to the way Google crawls the internet for websites to catalog and index. Applying this technology to a data catalog allows enterprises to populate and update it automatically -- without the need for human intervention. This could solve the critical and longstanding problem that's held enterprise data catalogs back: the need for humans to be diligent. And it could help the data catalog market flourish.
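The crawling half of that process can be sketched in a few lines. The example below is an illustration only, not how any particular vendor's product works: it walks a folder of CSV files and records each file's columns and a guessed type for each one. A simple rule-based guess stands in for the machine learning model, and the function names and the sample suppliers.csv file are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def infer_type(values):
    """Guess a column type from sample values (rule-based stand-in for a model)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    for label, cast in (("integer", int), ("decimal", float)):
        try:
            for value in non_empty:
                cast(value)
            return label
        except ValueError:
            continue
    return "text"


def scan_csv(path, sample_rows=100):
    """Collect metadata for one CSV file from its header and first rows."""
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        rows = [row for _, row in zip(range(sample_rows), reader)]
        names = reader.fieldnames or []
    return {
        "size_bytes": path.stat().st_size,
        "columns": {
            name: infer_type([row[name] or "" for row in rows])
            for name in names
        },
    }


def crawl(root, catalog):
    """Add or refresh a catalog entry for every CSV file under root."""
    for path in sorted(Path(root).rglob("*.csv")):
        catalog[str(path)] = scan_csv(path)
    return catalog


# Hypothetical sample file in a temporary folder.
with tempfile.TemporaryDirectory() as root:
    sample = Path(root) / "suppliers.csv"
    sample.write_text("supplier_id,name,rating\n1,Acme,4.5\n2,Globex,3\n")
    catalog = crawl(root, {})

entry = next(iter(catalog.values()))
print(entry["columns"])
```

Because crawl overwrites existing entries, running it on a schedule keeps the technical metadata current without anyone retyping it. Business definitions, such as what counts as a "customer," are the harder part, and that is where the machine learning techniques described above would come in.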
