Data catalog software takes on data lakes, privacy laws
Data catalogs form a hub for managing enterprise data. New products focus on machine learning and AI add-ons that help automate aspects of data governance.
GDPR and other data privacy measures did not slow big data application development, but they have reignited enterprises' interest in data governance.
In the case of the data lake, which began life as a dumping place for fast-arriving varieties of web and cloud data, governance has become more important. That, in turn, is driving interest in data catalog software to help bring order to big data environments.
Data catalogs are one of the hotbeds of what some call augmented data management -- a discipline that applies machine learning and AI to make the enterprise data management process more automated and repeatable. That can be especially valuable as data catalogs begin to span more and more departments within companies.
"The data catalog is a way to start curating your data, to find instances of customer information, and to point to where pertinent data is," said Wayne Eckerson, founder and principal consultant at Eckerson Group.
Is data catalog software just a new take on data repositories and data dictionaries -- systems that have formed the basis for many data governance efforts, and which go back to eras that preceded big data? Not really, Eckerson said.
"Data catalogs go out and continually crawl your company's data. They are more dynamic than data dictionaries or earlier products," he said. Moreover, rather than storing any data itself, today's data catalog points to various data sources, he continued.
In fact, data catalog software acts as a hub for metadata -- in effect providing "information about a company's information." That metadata can include data lineage, sourcing and measures of its usefulness.
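To make the "hub for metadata" idea concrete, here is a minimal sketch of a catalog that stores only pointers and descriptive metadata, never the data itself. The class names, fields and example locations are hypothetical illustrations, not taken from any vendor's product; the usage counter is just one crude stand-in for a "measure of usefulness."

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one dataset; the data stays in its source system."""
    name: str
    location: str        # pointer to where the data lives (hypothetical URI)
    source_system: str   # sourcing: which system produced the data
    upstream: list[str] = field(default_factory=list)  # lineage: parent datasets
    usage_count: int = 0  # crude usefulness measure (an assumption of this sketch)


class Catalog:
    """An in-memory metadata hub keyed by dataset name."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> CatalogEntry:
        # Each lookup counts as one use of the dataset.
        entry = self._entries[name]
        entry.usage_count += 1
        return entry

    def lineage(self, name: str) -> list[str]:
        """Return every known ancestor of a dataset, nearest first."""
        seen: list[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].upstream)
        return seen
```

Registering a raw dataset, a refined dataset derived from it and a report derived from the refined one would let `lineage("report")` walk back through both ancestors, while `lookup` hands back only the location of the data rather than a copy of it.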
Data lakes meet privacy concerns
As data lakes fill up with data, some of it personally identifiable, the data catalog provides a way to find and flag that sensitive information. That is useful for meeting data privacy strictures like those imposed by the European Union's GDPR law and those anticipated with next year's scheduled enactment of the California Consumer Privacy Act.
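As a rough illustration of how a catalog might flag personally identifiable data, the sketch below samples rows from a dataset and tags columns whose values look like email addresses or U.S. Social Security numbers. It uses simple regular expressions as a stand-in; the commercial products discussed here rely on machine learning, and the pattern set and function name are assumptions made for this example only.

```python
from __future__ import annotations

import re

# Hypothetical, deliberately small pattern set; real tools cover far more cases.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def tag_pii_columns(sample_rows: list[dict[str, str]]) -> dict[str, set[str]]:
    """Map each column name to the PII labels found in the sampled values."""
    tags: dict[str, set[str]] = {}
    for row in sample_rows:
        for column, value in row.items():
            for label, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    tags.setdefault(column, set()).add(label)
    return tags
```

The resulting tags would be stored as metadata alongside the dataset's catalog entry, so a data steward could search for every column labeled as an email address without opening the underlying files.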
But, Eckerson said, the data catalog is also a path to making data available more widely across an organization, particularly for line-of-business workers ready to take on roles as citizen data scientists.
"Finding data is a fundamental building block for self-service and data analytics," Eckerson said. "It gives power users the ability to use data."
Data catalog software lineup
A growing assortment of vendors is bringing data catalog software to market. Included among these are Alation, Collibra, Informatica, Io-Tahoe, Tamr, Unifi Software, Waterline Data and others. The vendors are continually adding AI and machine learning enhancements to their products.
Recent AI-flavored improvements include Io-Tahoe's debut last month of its Smart Data Discovery platform, with enhanced PII and sensitive data discovery capabilities. Meanwhile, Waterline Data released a version of its AI-driven Data Catalog that allows users to pull data from different systems and publish it as reusable data objects that co-workers can access. Also included is a data rationalization dashboard that identifies redundant data.
Such advanced AI features -- which automate functions, test data quality, search data indexes, and make repeatable project templates for end users to follow -- are becoming common to data catalogs.
"The cool thing is that the machine learning is built into the data catalog," Eckerson said. "And, it not only provides a way to find data -- it provides a way to link it to associated data as well."
Eckerson said data catalogs will prove to be useful in instances in which data is scattered widely in an organization. As such, the data catalog can take on a role similar to an integration tool, even though it just points to information instead of restaging it.
Selecting a data catalog
Gaining an understanding of the role data catalog software can play in the organization begins with a look at data-related assets within the company, according to Richard Thomas, principal solutions architect at Caserta, a New York-based consultancy.
A first step in data catalog selection requires the data manager to survey the data sources, data types and search criteria that are involved, Thomas said in a recent webinar, "Considerations for Data Catalogs," co-hosted by data catalog vendor Alation.
Questions teams should ask when setting search criteria, Thomas said, include "Will you search for technical metadata?" or "Will you search for business metadata?" It is also important to clarify whether product data will be accessible through the data catalog, or whether the company will fashion its catalog to work with business third parties.
Refining the data lake
Caserta's Thomas also discussed the role the data catalog can play in data lake refining.
"For the data lake, the catalog can be used to trace the data that is coming in, and whether it is going into a structured or modeled pipeline, Thomas said in the webinar." This increases the usefulness of the data lake, by helping team members to understand the process used to turn the raw data into useful information.
Developers and stewards
The data catalog as described by Thomas helps the organization discover whether the data is taking up space in the data lake, whether its format has been transformed, or whether the data is actually becoming part of a modeled analytics environment with a true schema.
As more organizations use data catalogs and vendors find new needs to address beyond those of the data lake developer or GDPR data steward, vendors will keep enhancing the catalogs.
While "data catalog" may conjure the image of a wooden card index in a sleepy library, it seems to be evolving into one of the hotter areas of innovation in data.