Dark data discovery: How and where to find it
Before an organization can use dark data, it must find it. Here's how an enterprise can build a plan to locate, use and manage dark data and overcome its challenges.
Data is the lifeblood of modern business. Businesses routinely collect vast quantities of data used for an array of tasks that include business analytics and direct monetization. But not all the data collected and stored will be used to benefit the business.
Instead, businesses overlook, ignore, forget or simply underuse some data. This data is referred to as dark data. It's important that IT and business leaders understand the meaning of dark data, recognize the challenges and risks it poses for the business and formulate a meaningful strategy to deal with dark data across the enterprise.
What is dark data?
Simply stated, dark data is information a business collects and stores, but rarely -- if ever -- uses for any business purpose. Dark data is typically unstructured and accumulates for a variety of reasons, such as:
- A business project or initiative designed to use the data gets scrapped, never reaches fruition or loses financial or management support.
- Applications or devices, such as IoT sensors, collect the data by default, and the business has little insight into or awareness of the fact that the data exists.
- Useful data is collected but becomes outdated because the business lacks the tools and processes to analyze or use all available data in a timely manner.
Dark data spans a wide range of data types and content, including former employee data; log files, such as system, server and customer transaction logs; customer profile information, such as geolocation data; financial statements; emails; surveillance and other video; and old document or presentation drafts.
In all cases, if data continues to be collected, stored and left unused, it remains dark. The presence and growth of dark data pose several important challenges for the business.
- Cost of storage. If the data goes unused, the storage resources that hold it and the energy needed to run them are wasted.
- Security and compliance obligations. IT organizations must secure and retain all business data to meet regulatory guidance. Dark data that is secured improperly or omitted from a data retention and deletion policy can compromise the organization's data security or compliance posture.
- Lost opportunities. Unused data can represent lost revenue or other unrealized business opportunities, commonly called an opportunity cost.
These factors usually mean that dark data poses more risk and expense to the business than it does value. However, making the effort to find and use dark data can turn those risks back into business value.
As organizations grapple with ever-greater volumes of business data, the challenges of dark data will continue to escalate over time. Thus, addressing dark data is critical and involves three major efforts:
- Business leaders must discover and identify dark data.
- The business must find ways to use the data productively.
- The business must take ongoing steps to manage and mitigate dark data.
Where is dark data?
The first step is dark data discovery. This can be the most challenging step of the process because data can be dark in almost any storage repository. It can exist on hard drives within individual servers, on storage arrays or subsystems, on hosted or colocated remote systems -- even within storage instances across public clouds and SaaS providers. Consequently, dark data rarely reveals itself. It requires a deliberate -- and often manual -- effort to locate, identify and remediate.
Typically, such a deliberate effort involves formal assessments. Although external consultants can help spearhead the project, a business usually will approach the effort internally with IT staff to audit the organization's storage content. An audit could correlate data to applications and devices, which enables a business to form a picture of the data types and quantities available and the applications or devices that produce it.
There are no readily available tools to automate dark data discovery and identification. To accelerate an assessment, IT admins must do the following:
- understand the applications and other data sources present across the business;
- recognize the storage assets provisioned to those data sources; and
- focus on the evaluation of those assets to start.
To locate secondary data that is not tied to a known application or device, conduct broad searches for common content types, such as log files, documents, images, video and PDFs.
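A broad search by content type can be sketched in a few lines of Python. This is a minimal illustration, not a discovery tool: the function name tally_content_types and the extension-to-type mapping are assumptions made for this example, and the sketch covers only a locally mounted file system, not cloud or SaaS storage.

```python
import os
from collections import Counter

# Hypothetical mapping of file extensions to content types (an assumption
# for illustration; extend it to match the organization's own data).
CONTENT_TYPES = {
    ".log": "log",
    ".txt": "document", ".doc": "document", ".docx": "document",
    ".pdf": "pdf",
    ".jpg": "image", ".png": "image",
    ".mp4": "video", ".avi": "video",
}

def tally_content_types(root):
    """Walk root and return (file counts, total bytes) per content type."""
    counts, sizes = Counter(), Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            kind = CONTENT_TYPES.get(ext, "other")
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that vanished or cannot be read
            counts[kind] += 1
            sizes[kind] += size
    return counts, sizes
```

Running the function against each storage asset identified in the audit gives a rough picture of how many files of each type exist and how much space they occupy.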
IT teams must also consider how much of the total data content is used for analytics or other identifiable business purposes. Additionally, IT and business leaders should use the audit to evaluate the security, retention and compliance postures of those newly discovered data assets.
Ultimately, an assessment answers an array of business questions:
- How much total data does the business possess?
- Where is that data coming from -- applications, systems or users?
- Where is that data stored -- servers, storage arrays or cloud?
- How much of that data is in use for analytics or monetization?
- How much of that data is not in use?
- How is discovered dark data secured, and is access to it monitored?
- How much dark data must the business keep for purposes such as compliance?
- Is dark data covered by data retention and deletion policies?
If the business repeats this assessment periodically, it can recognize data trends more easily, such as total data growth and costs, dark data growth, and gaps in security or compliance policies.
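Two of the questions above, how much data exists and how much of it is not in use, reduce to simple arithmetic once an inventory exists. The sketch below assumes each inventory record is a (size in bytes, last-used timestamp) pair and treats "not used within max_idle_days" as a proxy for dark data. Both the record shape and the 180-day default are assumptions for illustration, not a standard definition.

```python
from datetime import datetime, timedelta

def dark_share(records, now, max_idle_days=180):
    """Return (total_bytes, dark_bytes, dark_fraction).

    records: iterable of (size_bytes, last_used) pairs, where last_used
    is a datetime. A record counts as dark if it was last used before
    now minus max_idle_days (an assumed proxy for "not in use").
    """
    records = list(records)
    cutoff = now - timedelta(days=max_idle_days)
    total = sum(size for size, _ in records)
    dark = sum(size for size, last_used in records if last_used < cutoff)
    fraction = dark / total if total else 0.0
    return total, dark, fraction
```

Recording the three numbers at each assessment makes the trend in dark data growth visible from one audit to the next.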
How to use dark data
Ideally, a data assessment will reveal the presence of dark, unused data located on storage assets throughout the enterprise and into the cloud. Once a business identifies dark data and recognizes its sources, business leaders can move on to the second major effort to address dark data -- making prudent decisions about how to use the data.
Often, data becomes dark because the business lacks the tools, internal staff or skill sets needed to process it. IT leaders save the data because it's there, and business leaders hope to use the data in the future, but those plans don't materialize because of technological, skill or budgetary limitations.
One initial step to use discovered dark data is the integration of that data into everyday business processes. Business and IT leaders can examine the dark data from the standpoint of business processes, security, privacy and compliance. If the dark data is not secured properly, IT leaders can take steps to implement appropriate security and access measures.
The organization's retention policies can also include the dark data. Consider that data from IoT devices might be useful for only a few days. If the data has been dark for six months, there's little -- if any -- meaningful reason to save or use that data, and it can be deleted securely in accordance with prevailing data destruction policies, allowing the organization to recover the storage resources involved.
Ultimately, the potential uses of dark data are vast, but most focus on applying big data analytics and AI to help the business develop and refine its strategies. Add the dark data to existing analytics to help refine insights, or even surface new insights or opportunities for the business.
For example, a business collects operational data from IoT devices located within machinery across a manufacturing floor, but that IoT data becomes dark because it's left unused. A business could use that IoT data to analyze the manufacturing process for predictive equipment maintenance and then use those results to schedule preemptive maintenance downtime at more convenient and cost-effective times. This saves the business money that would otherwise be lost to unplanned downtime and wasted product materials.
Similarly, IT infrastructure produces enormous volumes of data from servers, firewalls, network monitoring tools and other sources across the data center. Much of this data becomes dark because nobody reviews it unless there's a specific problem or alert. But by analyzing the dark data, organizations can better understand infrastructure utilization, workloads, common issues and trends to improve infrastructure resilience and performance.
How to manage dark data
The third major effort in addressing dark data is management and mitigation. Dark data is an ocean of unrelated and unstructured data fed by countless applications, device logs and other tributaries across the enterprise. The trick for modern businesses is to make prudent decisions about current and future dark data assets. There are four general steps to dark data mitigation:
- Understand sources. All dark data should be traceable to a source. The audit used to discover dark data should reveal its sources, which can be customer transaction records, system and network logs or IoT device streams.
- Determine importance. Frankly, not all data is useful to the business. Even data that is useful has a finite shelf life. Keeping all data forever is not a prudent policy for security, compliance or infrastructure, so business and IT leaders must determine which data sets to keep and for how long.
- Set retention and deletion procedures. Businesses use data retention tools to create and enforce storage and security policies for prescribed periods and delete data securely when its retention periods expire. Such tools should include dark data and its sources to save storage costs, to ensure that analytics tools process only timely and relevant data, and to keep the business secure and compliant by destroying expired data according to a consistent established policy.
- Turn off unwanted sources. Just because a device or application produces data doesn't mean the business needs or wants that data. Collecting and storing data because it might be needed someday isn't a prudent compliance, security or infrastructure strategy. If the business doesn't need a given data set, disable the corresponding data source. For example, many applications and IoT devices have configuration options to disable certain actions, such as logging.
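The four steps above can be expressed as a small per-source policy table. The sketch below is illustrative only: the source names, the retention periods and the retention_action function are hypothetical, and real retention periods depend on the organization's regulatory and business requirements.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods, in days, per data source.
# None means the business has decided it does not want this data at all.
RETENTION_DAYS = {
    "iot_stream": 7,
    "system_log": 90,
    "customer_transaction": 2555,
    "debug_trace": None,
}

def retention_action(source, created, now, policy=RETENTION_DAYS):
    """Return 'keep', 'delete', 'disable_source' or 'review' for one data set."""
    if source not in policy:
        return "review"          # untraced source: needs a human decision
    days = policy[source]
    if days is None:
        return "disable_source"  # unwanted data: turn off collection
    if now - created > timedelta(days=days):
        return "delete"          # retention period has expired
    return "keep"
```

A data set from a source missing from the table returns "review", reflecting the first step: all dark data should be traced to a source before a retention decision is made.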
Dark data will remain a fact of life as business data sets grow, but businesses and IT leaders must take constructive action to locate, use and manage dark data now and well into the foreseeable future.