What are the main features of data catalog software?
Data catalogs are curated data portals for self-service analytics users. Learn about the capabilities of data catalog tools, vendors that sell them and how catalogs work.
Employees who rely on self-service BI and analytics tools to make data-driven business decisions need access to a lot of data, but they typically aren't allowed to just pull raw data out of a data lake or another repository; the data they use must be governed and curated to ensure that it's accurate and appropriate for the intended uses. That's where data catalog software comes in.
A data catalog is a type of metadata management system that is designed to be user-friendly enough for the average business user. Data catalogs enable organizations to build portals in which end users -- including data scientists and analysts -- can find data that has been curated for them by data stewards or other data professionals.
Catalogs may contain data from big data systems as well as traditional data warehouses, databases and BI systems. They classify the data in terms that business users can understand and provide context around it so it can be used correctly in analytics applications. They also include information on data governance policies and automated policy-enforcement mechanisms to help data stewards and governance managers make sure that the data in a catalog isn't accessed improperly or misused.
Cataloging tools are in high demand as businesses increasingly struggle to inventory all the data they create and collect, as well as to comply with new data privacy and protection rules that have made effective governance of data usage even more important. In particular, that includes the European Union's GDPR mandates and the California Consumer Privacy Act.
Analyst firm Gartner recommends the use of data catalog software to maintain and curate inventories of available data assets and to map information supply chains for both analytics users and data stewards themselves. These tools are now an essential component of corporate data management strategies, according to Gartner.
How data catalog software works
Sharon Graves, enterprise data and BI tools evangelist at web hosting giant GoDaddy, implemented data catalog software from Alation Inc. in 2015 to reduce the time that analytics users spend searching for the right data in the company's systems and ensure that the data they access has been vetted by data stewards.
"There is a problem where we have users who don't know anything about which data source to use or where to find the data. We needed to point users to a tool," she said. "We wanted our analysts to be spending their time doing analysis, and we wanted to support end users doing simple charting and crosstabs."
The data catalog pulls in metadata from various systems -- Hadoop, Amazon Redshift, Apache Hive, Tableau Server, Teradata and other sources -- and gathers it all in a portal where users can search for relevant data. The catalog ranks the data on a number of factors, including whether a data steward has endorsed it for use in certain applications and how popular it is with users -- a ranking that data experts can adjust so the right data surfaces first, Graves said. Data teams can also build unified or packaged data sets into the catalog that take care of data joins for users, she added.
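The ranking behavior Graves describes can be illustrated with a short Python sketch. This is not Alation's implementation or API. The `CatalogEntry` fields, the `boost` adjustment and the `search` function are hypothetical names invented for illustration, and the sample entries are made up. The sketch assumes that steward-endorsed entries always rank ahead of unendorsed ones and that popularity, plus any manual adjustment, breaks ties.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Metadata about one data set; all field names are hypothetical."""
    name: str
    source: str
    endorsed: bool     # assumed flag: a data steward has endorsed the data set
    query_count: int   # assumed popularity measure: how often users query it
    boost: int = 0     # assumed manual adjustment applied by data experts


def search(entries, term):
    """Return entries whose name contains the term, best match first."""
    term = term.lower()
    matches = [e for e in entries if term in e.name.lower()]
    # Endorsed entries sort first (False < True), then by adjusted popularity,
    # then alphabetically so the order is stable and predictable.
    return sorted(
        matches,
        key=lambda e: (not e.endorsed, -(e.query_count + e.boost), e.name),
    )


# Made-up sample metadata drawn from the kinds of systems named above.
entries = [
    CatalogEntry("orders_raw", "Hadoop", endorsed=False, query_count=900),
    CatalogEntry("orders_curated", "Amazon Redshift", endorsed=True, query_count=400),
    CatalogEntry("orders_daily_summary", "Teradata", endorsed=True, query_count=150, boost=500),
    CatalogEntry("customers", "Apache Hive", endorsed=True, query_count=700),
]

results = [e.name for e in search(entries, "orders")]
print(results)
```

In this sketch, the heavily used but unendorsed raw table ranks last, and the manual boost moves the packaged summary ahead of the curated table even though fewer users query it.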
Data catalog features and vendors
Traditional metadata management capabilities are at the core of data catalog software. In addition to the indexed data inventory, such features include business glossaries, which contain definitions of business terms that can be mapped to specific data assets, and data lineage documentation that helps end users understand data and supports root cause analysis and impact analysis -- two key functions for data stewards as part of data governance and data quality initiatives.
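To show how lineage documentation can support both kinds of analysis, here is a minimal Python sketch that treats lineage as a directed graph. The asset names, the `LINEAGE` mapping and the `downstream` and `upstream` functions are hypothetical and don't come from any particular vendor's product. The sketch assumes lineage is recorded as a mapping from each asset to the assets derived directly from it.

```python
from collections import deque

# Hypothetical lineage records: each asset maps to the assets built directly from it.
LINEAGE = {
    "crm.customers": ["warehouse.dim_customer"],
    "web.orders": ["warehouse.fact_orders"],
    "warehouse.dim_customer": ["bi.revenue_dashboard"],
    "warehouse.fact_orders": ["bi.revenue_dashboard", "bi.churn_report"],
}


def downstream(asset, lineage):
    """Impact analysis: every asset that depends, directly or not, on `asset`."""
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(asset, lineage):
    """Root cause analysis: every asset that feeds, directly or not, into `asset`."""
    # Reverse the edges, then reuse the same breadth-first traversal.
    reverse = {}
    for parent, children in lineage.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(asset, reverse)


impact = downstream("web.orders", LINEAGE)
causes = upstream("bi.revenue_dashboard", LINEAGE)
print(sorted(impact))
print(sorted(causes))
```

Walking the graph forward answers the impact question of what is affected if a source changes. Walking it backward answers the root cause question of where a bad number in a report could have originated.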
Modern data catalog tools combine those core capabilities with advanced features, such as self-generating topic extraction, taxonomy generation, semantic discovery, knowledge graphs and automated cataloging and pattern mapping driven by machine learning, according to Gartner. In a September 2019 report, Gartner analysts Guido De Simoni and Ehtisham Zaidi said so-called augmented data catalogs have become "an enterprise must-have" for data management and analytics teams faced with increasingly distributed and diverse pools of data.
Other common data catalog features include built-in integration with widely used data platforms, search functions for querying a catalog's contents and collaboration tools that let users annotate catalog entries and chat with one another. All in all, data catalogs enable companies to get the most value out of the data that sits in data warehouses, data lakes and other repositories by making it easy to find and apply in business analytics and data science applications.
Besides Alation, vendors that offer data catalog software either as stand-alone products or as part of their metadata management and data governance platforms include Ataccama, Alteryx, AWS, Boomi, Cambridge Semantics, Collibra, Data.world, Erwin, Google, IBM, Infogix, Informatica, Microsoft, Oracle, Reltio, SAP, Talend and Waterline Data.