Dark data raises challenges, opportunities for cybersecurity
Dark data is the data enterprises didn't know they had. Splunk CTO Tim Tully explains where this data is hiding, why it's important and how to use and secure it.
It's hard enough for enterprises to track and secure the data they know they have, but dark data -- the data an organization creates unwittingly -- poses a different set of challenges. Key among those challenges are figuring out how to gain access to dark data, use it, keep it secure and prevent attackers from using it against the organization.
Determining how much of an enterprise's data is dark poses its own set of challenges, which Splunk, the big data software vendor based in San Francisco, recently tackled by sponsoring research by True Global Intelligence into the prevalence of dark data.
In this Q&A, Tim Tully, senior vice president and CTO at Splunk, explains what dark data is, why there's so much of it and how companies can use data governance and training to do a better job of finding, using and managing that data.
Editor's note: This interview has been edited for length and clarity.
How would you define dark data?
Tim Tully: We define dark data as being unknown, unidentified or unused [data], and the key stat from the report that I found most interesting is that the companies we surveyed felt that 55% of their data globally is dark. That number was higher than I imagined.
The reason I thought it would be much lower is that I was in the data business at Yahoo for about 14 years before I came to Splunk, and all I did was big data. I watched that log collection, or log ETL [extract, transform, load], and the consumption of data happen, and in my experience, that number would have been much lower, given that I saw us collect data from hundreds of thousands of servers around the world.
Where does all this dark data come from?
Tully: The way dark data is created falls into one of two categories. One is the data is not collected at all -- it's sort of zombie data. Oftentimes, that will happen when companies bring new servers online, especially in these days of ephemeral servers and serverless. It's so easy to bring these servers online and take them down again very quickly without ever having collected those logs.
The second half of the dark data has to do with people just collecting the data for various reasons -- such as compliance or just helping them sleep better at night -- and then just not consuming it. That would fall into that 'unused' category.
The other thing is everyone, despite the fact that they had a high percentage of dark data, still felt that data skills would be essential. The last part was that there's a global agreement that [using] AI is possibly the way to get a hold of this dark data moving forward.
In light of the surge in data privacy legislation, what should companies be doing once they discover dark data? Is the goal to use it or to destroy it?
Tully: I think it's a combination of both. If you have the data sitting there and it's not being looked at, that's a lost opportunity for companies to do better with security. You want to look at your firewall logs, for example, and know what inbound TCP connections are coming in and know who you are being attacked by. So, it's a lost opportunity to do a better job from a security standpoint.
On the flip side, companies that consume that data can do a better job of building AI-driven models and do a better job of figuring out how to model threat and anomaly detection moving forward. That's certainly something I saw happen at my last company, and there's massive impact from a cybersecurity standpoint.
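As a rough illustration of the firewall log example Tully gives, the sketch below counts inbound TCP connections per source address. The log line format, the sample addresses and the function name count_inbound_tcp are all assumptions made for illustration; real firewall logs vary by vendor and would need their own parser.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption, not any real firewall's output):
# <timestamp> <action> <proto> <src_ip>:<src_port> -> <dst_ip>:<dst_port> <IN|OUT>
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<action>\S+) (?P<proto>\S+) "
    r"(?P<src>[\d.]+):(?P<sport>\d+) -> "
    r"(?P<dst>[\d.]+):(?P<dport>\d+) (?P<dir>IN|OUT)$"
)


def count_inbound_tcp(lines):
    """Count inbound TCP connections per source IP, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # malformed or unrelated line
        if match.group("proto") == "TCP" and match.group("dir") == "IN":
            counts[match.group("src")] += 1
    return counts


# Made-up sample data using documentation-reserved IP ranges.
sample_log = [
    "2019-05-01T12:00:00Z DENY TCP 203.0.113.7:51234 -> 10.0.0.5:22 IN",
    "2019-05-01T12:00:01Z DENY TCP 203.0.113.7:51235 -> 10.0.0.5:22 IN",
    "2019-05-01T12:00:02Z ALLOW TCP 198.51.100.4:40000 -> 10.0.0.5:443 IN",
    "2019-05-01T12:00:03Z ALLOW UDP 198.51.100.4:40001 -> 10.0.0.5:53 IN",
    "2019-05-01T12:00:04Z ALLOW TCP 10.0.0.5:50000 -> 192.0.2.9:443 OUT",
    "malformed line",
]

inbound = count_inbound_tcp(sample_log)
for source_ip, hits in inbound.most_common():
    print(source_ip, hits)
```

Even a count this simple only works on logs that were actually collected and read, which is the point of the answer above.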
What kind of impact does this dark data have on cybersecurity?
Tully: The most obvious impact is just not consuming the data. If you've collected the data and you're not doing anything with it, not looking at those logs is probably a horrible mistake. You want to know that attacks are happening. If you don't actually look at the dark data, how do you know that people are trying to attack you? It's sort of a chicken-and-egg problem.
Then, there's a ton of uncollected data in the first place. It's not that you're not looking at it. You're bringing ephemeral servers online. God knows what's happening in those logs. Not collecting them is the first step to not consuming the data. If you're not consuming the data, not even looking at it or collecting it, you're not building a strong cybersecurity posture.
Other than unexamined log files, are there other places people should be looking for dark data?
Tully: For sure. The other one that comes to mind is the litany of devices that people bring online in the enterprise through bring your own device [BYOD]. I personally bring four or five devices to the office every day. They're all online, and given the ephemeral nature of those devices, I think they come online and offline very quickly, and I scratch my head and wonder whether corporations are consuming that data as well.
What dark data would be encompassed in those BYOD devices?
Tully: Well, certainly, your personal devices, your mobile phones, your tablets. I bring a personal laptop from time to time to do some nonwork stuff here and there. But people are traversing the internet; they're downloading things; they're possibly bringing malicious software to the office; and those devices have tons of logs associated with them. You want to be able to detect what those clients on the network are doing and what they're looking at and possibly what malicious viruses they're bringing in.
Is the data identified as dark data vulnerable to being picked up by attackers without being detected by the enterprises hosting the data?
Tully: I think any data an enterprise is logging or collecting, irrespective of whether it's dark or not, is vulnerable to attackers, so I would imagine the answer is yes.
There is a risk factor to this data which is lying dormant, and attackers want to consume it.
What is the first thing people should do about dark data? Is it to identify the data, collate and store it, or should they first decide whether or not they can or should use that data? And, if they don't need it, should they find a way to expunge it?
Tully: It's certainly all of the above. What it really comes down to is better data governance. I was on a couple of panels this week in Washington, D.C., and one of the questions I had was: 'What are the biggest challenges in the big data realm right now?' Outside of having to weld together a number of systems to get a reasonable solution from the open source realm, the people most successful with data overall are people who have strong data governance programs. That is, in terms of understanding which data is being collected, how it's being collected, the PII [personally identifiable information] involved in that data, and then in terms of who's consuming that data and what purposes it's being consumed for, how it's being cataloged and how it's being used.
Data governance can play a humongous and very strong role in helping customers get hold of their dark data.
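One way to picture the governance questions Tully lists (what is collected, whether PII is involved, who consumes it) is as a small data catalog. The sketch below is only an illustration: the DataSource fields, the classify function and the sample entries are hypothetical and do not come from Splunk or the survey.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSource:
    """One catalog entry; every field name here is an illustrative assumption."""
    name: str
    collected: bool          # is the data being gathered at all?
    contains_pii: bool       # does it hold personally identifiable information?
    retention_days: int      # how long it is kept before expiring
    consumers: List[str] = field(default_factory=list)  # who reads it


def classify(source):
    """Label a source using the two dark data categories from the interview."""
    if not source.collected:
        return "uncollected"
    if not source.consumers:
        return "unused"
    return "in use"


# Made-up catalog entries.
catalog = [
    DataSource("firewall-logs", True, False, 90, ["security-team"]),
    DataSource("compliance-archive", True, True, 365),
    DataSource("ephemeral-build-servers", False, False, 0),
]

labels = {source.name: classify(source) for source in catalog}
dark_with_pii = [
    source.name
    for source in catalog
    if classify(source) != "in use" and source.contains_pii
]
print(labels)
print(dark_with_pii)
```

Entries that are both dark and flagged as containing PII are the ones a governance program would likely want to review first.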
What should companies be doing to deal with all this dark data?
Tully: The first thing is making sure they're collecting the data. The bulk is data being logged and not being collected, and it becomes zombie data and then gets rolled out over time because of log expiration.
What they should be doing is applying strong data governance to it. There's expiration on the data; there's making sure PII is being applied to that data; and then there's teaching new skills to people within the company to help them be able to cope with it.
In one of the survey questions, business leaders said the top obstacles to recovering the dark data are the volume of the data and the lack of necessary skills. Training will be part of that solution. I oftentimes see this firsthand where, irrespective of whether the data is dark or not, just the sheer volume of data can be very overwhelming, and by the time most analysts consume that data, it shows up in dashboard form. Oftentimes, people who consume data in those types of dashboard environments feel a little bit reluctant to dig in and figure out how to go deeper.
A lot of it's about learning new skills and making sure you have strong data governance.
What are the top skills people should learn to deal with this type of data?
Tully: One is getting a better understanding of how the data even got there. Being willing to peel back the covers just a little bit more to understand how the data got there and who is behind the scenes preparing it. Being comfortable speaking to those people and understanding the process will better help them be less reluctant to take on the challenges of wanting to get different formats of the data or different reports.
Programming skills are very important. If you want to see dashboards in different forms, one thing you'll do is take the base data set offline and do some light coding against it. Some lightweight Python, some lightweight R -- even taking the data if it's small enough and putting it into Excel and being able to write macros against it is enough to have a basic approach to dealing with this type of data.
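The kind of lightweight Python Tully describes might look something like the sketch below, which takes a small exported data set and computes a figure a dashboard may not show: the share of server errors per host. The column names (host, status, bytes) and the inline sample data are assumptions standing in for a real export.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a small data set exported from a dashboard (made-up values).
EXPORT = """host,status,bytes
web-1,200,512
web-1,500,128
web-2,200,2048
web-2,200,1024
web-1,503,64
"""


def error_rate_by_host(csv_text):
    """Return the fraction of rows per host whose HTTP status is 500 or higher."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["host"]] += 1
        if int(row["status"]) >= 500:
            errors[row["host"]] += 1
    return {host: errors[host] / totals[host] for host in totals}


rates = error_rate_by_host(EXPORT)
for host, rate in sorted(rates.items()):
    print(f"{host}: {rate:.0%} of requests returned a server error")
```

With a real export, the io.StringIO wrapper would be replaced by opening the file, and the same few lines would apply.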