Data lake governance: A big data do or die

Data lake governance may not be a sexy undertaking, but it's become a critical component in modern data architecture, according to experts.

When Steve Cretney took a hard look at storage numbers, he noticed something that helped upend the IT strategy at Colony Brands Inc.

"We observed, almost naively, that we have a couple of hundred terabytes of storage in our [storage area network (SAN)]," said Cretney, CIO at the mail-order and electronic retailer.

The bulk of that data came from operational systems; some of it was cherry-picked for analysis, but the majority was packed away in cold storage, where it sat idle. By comparison, Colony Brands' data warehouse contained just 10 to 15 terabytes of data, which was used for specific business analytics and reporting. The discrepancy between the two got Cretney and his team thinking: What might the data science team uncover if it had access to the data stuck in the SAN?

To make cold storage data available and to push the company in a cloud-first direction, Cretney, a big believer in cloud computing before he came to Colony Brands three years ago, turned to Amazon S3, a data storage service, and Amazon Redshift, a managed, clustered data warehouse service that will replace the company's existing data warehouse. His plan, to be rolled out in stages with the first completed in April, is to build a data lake, making more data more accessible for more analytics.

Data lakes or data hubs -- storage repositories and processing systems that can ingest data in its native format, without restructuring it -- have become synonymous with modern data architecture and big data management. The upside to the data lake is that it doesn't require a rigid schema or manipulation of the data at ingest, making it easy for businesses to collect data of all shapes and sizes. The harder part for CIOs and senior IT leaders is maintaining order once the data arrives. Without an upfront schema imposed on the data, data lake governance, including metadata management, plays a vital role in keeping the data lake pristine, according to experts.
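To make that schema-on-read idea concrete, here is a minimal Python sketch, assuming a hypothetical file-based raw zone and made-up order records: data lands in the lake exactly as it arrives, and structure is imposed only when an analysis reads it.

```python
import json
from pathlib import Path

# Hypothetical "raw zone" of a data lake: records land exactly as they
# arrive, with no schema enforced at ingest time.
raw_zone = Path("lake/raw/orders")
raw_zone.mkdir(parents=True, exist_ok=True)

incoming = [
    {"order_id": 1, "total": 42.50, "channel": "web"},
    {"order_id": 2, "total": 19.99},                       # missing field is fine
    {"order_id": 3, "total": 7.25, "coupon": "SPRING10"},  # extra field is fine
]

# Ingest: store the records untouched (schema-on-read, not schema-on-write).
for i, record in enumerate(incoming):
    (raw_zone / f"order_{i}.json").write_text(json.dumps(record))

# Analyze: structure is applied only when the data is read, using
# whichever fields this particular analysis cares about.
totals = [
    json.loads(p.read_text()).get("total", 0.0)
    for p in raw_zone.glob("*.json")
]
print(f"Revenue across {len(totals)} raw records: {sum(totals):.2f}")
```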

A central repository suited for big data

Historically, analytics and business intelligence workloads have run on a data warehouse, a technology that IT departments have tried -- and, in many cases, failed -- to make a central data repository. "Data warehouses and databases, by their very nature, are too expensive and too constrained by storage and performance to put all of your data in one place," said Phil Shelley, advisor and director at the Hadoop services company DataMetica Solutions in India.

IT departments began using extract, transform and load (ETL) tools to "break the data into manageable chunks and archive the data after a while," Shelley said. But doing so left analysts with the time-consuming task of tracking down and piecing together data sets that might be tucked away in data marts, databases and data archives. Even then, analysts could quickly get their hands only on the data sets deemed worthy of the available storage. "If they wanted a longer history or more detail, the data is usually not available in their data warehouses for performance and cost reasons," Shelley said.

Storage options built on cheap commodity hardware such as the Hadoop Distributed File System offered a different approach that was sorely needed as businesses sought to leverage more data -- and more complex data -- than ever before. "We can bring in all of the historical data we have and new data, in near real time, without the use of legacy ETL tools, into one single place," Shelley said.

The data lake offers another benefit: The lack of an imposed data structure gives data scientists a chance to analyze the data without a predetermined schema. Twenty years ago, the data warehouse was seen as a viable central repository because companies had, as one consultant put it, "control" over the data they used for analysis.

"When I say that, I mean it's within the walls of your enterprise -- like data from an SAP ERP system," said Joe Caserta, founder and president at Caserta Concepts in New York City. "But now, we're getting data from third parties that we don't know -- and don't have control over." Structuring third-party data upon ingest is tricky because basics -- such as how the data was generated and even the content of the data -- may not be known right away. With a data lake, companies can move away from the rigid structure-ingest-analyze process to a more flexible ingest-analyze-understand process. "Once we understand [the data], then we can structure it," Caserta said.

Governance: Do or die

The relative flexibility provided by data lakes comes at a price: Without data lake governance, businesses could be left without meaningful business intelligence -- or even jeopardize the business.

At the recent Gartner Business Intelligence and Analytics Summit in Grapevine, Texas, analyst Nick Heudecker recounted the story of a customer in the consumer services industry that decided to implement a data lake after experiencing poor performance with its relational database. But the company's project scope was too limited, focused primarily on data ingestion.

"All of the context of the data -- where it came from, why it was created, who created it -- was lost," Heudecker said. "By the time the company fixed this issue and went back to their old platform, they had lost two-thirds of their customers and almost went out of business."

It's an extreme story, to be sure, but the importance of data lake governance, including data cataloging, indexing and metadata management, should not be downplayed, according to CIOs. "It's a huge challenge," Colony Brands' Cretney said. "You lose context unless you have metadata over the top of it." And that's just one piece of the data lake management puzzle. Cretney also advised CIOs to think about overall data lake governance: who pulled the data in, who is responsible for it and what the definitions of that data are, so the data is not only properly tagged but also properly maintained and used.
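What that metadata might look like in practice depends on the platform. The sketch below is a rough illustration rather than Colony Brands' actual setup; it assumes an Amazon S3-based lake like the one Cretney is building and uses hypothetical bucket, key and tag names to attach governance context to a data set as it lands.

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Hypothetical example: attach governance metadata when a data set lands
# in the lake, so the context Cretney describes isn't lost after ingest.
s3.put_object(
    Bucket="example-data-lake",           # hypothetical bucket name
    Key="raw/sales/2016/orders.csv",      # hypothetical object key
    Body=b"order_id,total\n1,42.50\n",
    Metadata={
        "source-system": "order-management",
        "ingested-by": "data-engineering",
        "data-owner": "sales-analytics",
        "definition": "one row per customer order, pre-returns",
    },
)

# Later, anyone working with the lake can recover that context.
context = s3.head_object(
    Bucket="example-data-lake",
    Key="raw/sales/2016/orders.csv",
)["Metadata"]
print(context["data-owner"])
```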

David Saul, vice president and chief scientist at State Street Corp. in Boston, Mass., couldn't agree more. "If to start off you don't have a robust set of metadata to describe what the data is, what it represents, and now you've put it into this data lake, you're in a situation that's worse than if you built the data warehouse," he said. "It may be faster, but you have no idea what's in this lake."

Semantic database: 'Metadata on steroids'

Unlike a traditional data warehouse and its predefined schemas, the data lake requires CIOs to walk a fine line: enough management to provide the necessary context, but not so much that it stifles the flexibility the data lake offers, according to Heudecker.

"This is a lot of work, and it can go badly," he said. "So take your time, figure out what you're going to need to make this work and begin down that path."

At State Street, the data lake path included a semantic database, a conceptual model that leverages the same standards and technologies used to create Internet hyperlinks. The strength of the data lake -- that it doesn't recognize any data structure -- is also its weakness, at least in Saul's eyes. "It does not recognize anything about the semantics, the structure or the relationship of the data," he said. "Hadoop is a parallel file system, and it does that very well, it performs quickly. But you need to know more about what the data actually means than what a mere file system and the location of the data tells you."

The semantic database, which Saul referred to as "metadata on steroids," adds a layer of context over the data that defines the meaning of the data and its interrelationships with other data. Using World Wide Web Consortium standards such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL), State Street generates semantic information about the data that can be searched using a SQL-like query language known as SPARQL.

"[A] question like: What are my total exposures to a particular entity? Those are easy to do in SPARQL," Saul said. "They're hard to do -- you actually have to write a set of programs -- if the data is stored in Hadoop."

Saul said to think of the semantic database as a card catalog for a library of thousands of books. Without it, pinpointing a specific title becomes an extraordinary task. "That's what Hadoop is like," he said, referring to the popular file system technology that's practically become synonymous with the data lake. "You'd have to go book by book, page by page, word by word looking for what you wanted."

With a robust card catalog, that painstaking task disappears, thanks to the organization of the system and the metadata contained within. "Those are the kinds of things you can do with a semantic model that a file system is not capable of," he said.

For a financial institution like State Street, where regulators now require businesses to show data lineage -- where numbers came from and how they were derived -- strong data governance is a must. Yet keeping data in the rigid silos created by more traditional technologies can result in tunnel vision or, worse, in bad analytics. The data lake, a concept State Street is using, provides the flexibility to break down those silos. The semantic database adds a layer of governance and metadata over the data to keep the lake in good working order.

"I believe the data lake has been oversold to CIOs and [chief data officers] as some kind of silver bullet," Saul said. "Like everything in data management, if you don't do it in a thoughtful way, you're not going to get the results that you would expect."
