The idea of creating a data lake is pretty straightforward: You pull various sets of data, structured or not, into a single architecture, so they can be readily accessed and analyzed by data scientists and other end users. Even the term itself sounds pleasant and inviting -- much more so than, say, a data warehouse.
But taking the data lake concept from vision to big data reality isn't so simple. In a May 2017 blog post, TDWI analyst David Stodder somewhat incongruously compared data lakes to the title character of the 1958 science fiction movie The Blob, warning that they can become "an uncontrollable entity that only gets bigger, swallowing precious money and resources" without yielding the expected analytics returns.
Similarly, in a September 2017 report, three Gartner analysts said that data and analytics managers among the research and advisory firm's client base "are feeling the pressure of [dealing with] rapidly increasing amounts of unprocessed data in data lakes." Getting analytical value out of that data remains a challenge in many organizations, they added.
There are technology issues to contend with, particularly if companies are new to Hadoop -- the big data platform most commonly associated with data lakes -- and to Spark, Kafka, Amazon Simple Storage Service and other technologies often deployed alongside it.
But the biggest challenges involve managing all of the data flowing into a data lake so it can be used effectively. This handbook offers advice on how to handle those challenges and make the data lake concept work in the real world. Doing so will let your users focus on what Ventana Research analyst David Menninger calls the four A's of big data: analytics, awareness, anticipation and action. And that could make your data lake a pleasant place indeed.