Does your big data project need a data concierge?

Big data needs better questions; local data concierges might be the next information governance trend; and the downside of data: The Data Mill reports.

When CIOs decide to invest in big data, they tend to talk about new technologies or missing skills, but, according to Anthony J. Scriffignano, senior vice president of worldwide data and insight at Dun & Bradstreet, that talk will get businesses only so far. CIOs also need to learn how to ask better questions and invest in the "neural networks between their [employees'] ears," he said. (Aka: human biases.)

The typical ways of dealing with ballooning data problems -- to build bigger hard drives or more distributed storage environments -- provide only a partial solution for leveraging big data. Data volume, velocity, variety and veracity are evolving at such an accelerated clip that it's nearly impossible to keep up. "The elephant in the room is that the problem is changing faster than we are as a human race," Scriffignano said during his presentation at the Big Data Innovation conference in Boston.

Given our insatiable appetite for newer, better, faster technology, which inevitably generates more and more big data, that's a problem that doesn't seem to be going away anytime soon. Consider the iconic photograph comparing Pope Benedict XVI's installation in 2005 to Pope Francis' installation in 2013. In 2005, it was hard to spot more than a couple of flip phones; eight years later, the crowd is aglow from smart device screens. But rapid adoption of technology produces an unbalanced equation for enterprises: More devices mean more data, but more data doesn't necessarily mean better answers, Scriffignano said.

Better questions of big data should include factors that influence the answer but tend to go overlooked. Scriffignano sums this up with his four E's: exogeny, expression, enterprise and environment. Don't just ask the question, but understand the assumptions, beliefs and biases that influence the question you're asking, he said. Before sending in the brilliant data scientists, CIOs should evaluate their data curation practices: Do those analyzing the data know how it was collected and cleaned, and how it can be used? "Unless you understand the context of data, you often don't know how to use it," he said.

Even worse than a lackadaisical metadata program? "You don't assess the impact of the data you don't have that could have informed decisions," he said. One way Dun & Bradstreet unearths the known unknowns is with the use of proxy data, or data that can act as a stand-in for the stuff that doesn't exist. Think of these questions in terms of the moon landing. "When [NASA] was creating the models for how to land on the moon, no one knew what the surface of the moon was like," he said. "We do this with our data."

As the company gets smarter about data collection and query, Dun & Bradstreet is also aware that the environment is still morphing. Scriffignano pointed to two huge challenges: unstructured data, which accounts for nearly 80% of all data produced today, and foreign language and writing systems, where meaning can get lost in translation.

An example: Boxpark, founded by Roger Wade, is a pop-up mall in Shoreditch, England, constructed entirely from shipping containers. Businesses sign short-term leases of three to 12 months, making it possible for one shipping container to host four different businesses in a single year. "That's a big, big problem," Scriffignano said, especially for Dun & Bradstreet. The typical third-party data provider collects and sells data about businesses, which are instantiated based on name and address, so a record tied to one address can go stale when that address changes tenants several times a year.

Local data concierges curtail false positives

One idea Scriffignano and his team may want to adopt is what John Mattison, chief medical information officer at Kaiser Permanente, refers to as "local data concierges." "What I mean by data concierge is someone who is intimately familiar with how data was collected and the context, which might not be fully documented with the atomic level of the data," Mattison said during his presentation at the Big Data Innovation conference.

Such a role matters especially as businesses focus on breaking down silos to combine data from across the enterprise, a common goal with big data analytics. Mattison pointed to the example of Homeland Security, which struggled to combine CIA and FBI data. One problem? The CIA knew the FBI would misinterpret its data and vice versa.

"It's not just turf protection," Mattison said. The promise of combining data from different sources to discover new patterns is exciting, but Mattison also noted that the chance of "creating false positives and incidental moments" is higher today than ever. "And that's across all verticals," he said.

Big data is a big e-discovery problem

Data scientists will tell you all data is good and more data is better, but data does have its downsides, said Barclay Blair, founder and executive director of the Information Governance Initiative. "In between those two viewpoints, we have a fertile debate, one that will shape our world in profound ways as we move forward," Blair said during a panel discussion at the Big Data Innovation conference.

One aspect of the debate is cost. Storage is cheap, so businesses are inclined to keep everything. But they don't consider the potential discovery costs if a legal battle ensues. "When a lawsuit happens, both parties have an obligation to find, to preserve and to produce anything potentially relevant to the lawsuit," Blair said.

Today, businesses can purchase sophisticated technology to aid in e-discovery, but before anything is handed over, "a lawyer has to look at it," Blair said. "That process is expensive and data, in that context, is not free."

Case in point: In 2012, a study by the RAND Corporation included information about a company that spent $900,000 "to produce an amount of data that would consume less than one-quarter of the available capacity of an ordinary DVD."

Welcome to The Data Mill, a weekly column devoted to all things data. Heard something newsy (or gossipy)? Email me or find me on Twitter at @TT_Nicole.
