
8 tips to improve the data curation process

A data curation and modeling strategy can ensure accuracy and enhance governance. Experts offer eight best practices for curating data. First, start at the source.

The immense benefits of big data are attainable only when organizations can find ways to manage a massive volume of varied data.

"Most Fortune 500 companies are still struggling to manage their data, or what is called data curation," said Kuldip Pabla, senior vice president of engineering at K4Connect, a technology platform for seniors and individuals living with disabilities.

Data modeling complements the data curation process by creating a framework to guide how data sets can efficiently and accurately be integrated into new analytics applications.

Pabla said he sees data curation as the management of data throughout its lifecycle, from creation or ingestion until it is archived or becomes obsolete and is deleted. During this journey, data passes through various phases of transformation; data curation ensures that the data is securely stored and that it can be reliably and efficiently retrieved.

It's important to establish a data curation process that ensures accuracy and data governance, provides security, and makes it easier to find and use data sets. Although technology can help, it's better to start with a solid understanding of your goals rather than focusing on a particular tool.

1. Plan for accuracy at the source

It's much easier to validate data at the source than to assess its accuracy later. You may need different practices for data gathered in-house and data acquired from other sources.

One approach to ensuring data accuracy is to ask users to validate their own data; another is to use sampling and auditing to estimate accuracy levels.
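As a rough illustration of the sampling approach, the following Python sketch audits a random subset of records against a caller-supplied validation rule and reports an estimated accuracy rate. The function and field names are hypothetical and not tied to any particular tool.

import random

def estimate_accuracy(records, validate, sample_size=200, seed=42):
    # Audit a random sample and return the share of records that pass.
    # `validate` is any callable that returns True for an accurate record.
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    passed = sum(1 for record in sample if validate(record))
    return passed / len(sample)

# Example: treat a record as accurate if its email field contains "@".
records = [{"email": "a@example.com"}, {"email": "bad-address"}]
rate = estimate_accuracy(records, lambda r: "@" in r.get("email", ""))
print(f"Estimated accuracy: {rate:.0%}")

Running the same audit on every new batch of source data turns accuracy from a one-time guess into a number that can be tracked over time.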

2. Annotate and label

It's easier to manage data sets and troubleshoot problems if the data sets are annotated and labeled as part of the data curation process. This can include simple enrichments, like adding the time and location of an event.

However, "while tagging enriches the data, inaccurate metadata will lead to inaccuracies during transformation or processing of data," Pabla said.


3. Maintain strong security and privacy practices

Large curated data sets can also pose a risk if they are compromised by hackers or insiders. Good security practices include encryption, de-identification and a strong data governance model.

"At the minimum, CIOs and CTOs can use strong encryptions to encrypt a piece of data in flight and at rest, along with [using] a stronger firewall to guard their cloud infrastructure or data centers," Pabla said.

Enterprises should also consider separating personally identifiable information from the rest of the data. This makes it easier to safely distribute curated data sets to various analytics teams. Hybrid analytics and machine learning models could even run partly on a user's smartphone or set-top box, providing insight while keeping users in control of their data, Pabla said.
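One way to implement this separation, sketched here under the assumption that records are flat Python dictionaries and the PII columns are known in advance, is to move the sensitive fields into a separate vault keyed by a random token and hand analytics teams only the pseudonymized rows.

import uuid

PII_FIELDS = {"name", "email", "address", "phone"}   # assumed PII columns

def split_pii(record):
    # Return a vault entry holding the PII and a pseudonymized analytics row.
    token = str(uuid.uuid4())
    pii = {k: v for k, v in record.items() if k in PII_FIELDS}
    analytics = {k: v for k, v in record.items() if k not in PII_FIELDS}
    analytics["subject_token"] = token   # joins back to the vault only when authorized
    return {token: pii}, analytics

vault_entry, safe_row = split_pii(
    {"name": "Ada Lovelace", "email": "ada@example.com", "steps": 8421}
)

The vault can then sit behind stricter access controls than the analytics store, which supports the governance point below about limiting who sees raw personal data.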

Another way to provide stronger security is to create a strong and effective governance model that outlines who has access to what data -- especially raw personal data. The fewer human eyes that have access to data, the more secure it is, Pabla said.

4. Look ahead

It's important to start the data curation process with the end in mind. Managers need to track how analytics and machine learning apps are using data sets and work backward to improve how the data is aggregated, said Josh Jones, manager of analytics at Aspirent, an analytics consulting service. This includes maintaining at least three time periods of data so that trends can be analyzed.

It's also good to build repeatable, transparent data cleaning processes that can be reused later.

To start, create an inventory of basic steps to identify duplicates and outliers.


"Make sure these basics are applied to each data set consistently," Jones said.

It's also important to think about at what point you want to clean the data. Some organizations prefer to do it at the point of intake, while others find it works better right before reporting.

Another practice is to curate data with the tools in mind. For example, if your organization uses specific tools, like Tableau, certain data formats can facilitate faster dashboard development.

5. Balance data governance with agility

Organizations need to strike a balance between data governance and business agility.

"I'm seeing businesses shifting away from the Wild West of self-service data wrangling to team-based, enterprise data preparation and analytics solutions that support better search, collaboration and governance of curated data sets," said Jen Underwood, senior director at DataRobot, an automated machine learning platform.

Proper data curation and governance provide a management framework that supports the availability, usability, integrity and security of data across an enterprise. They improve visibility into, control over and trust in data, and by ensuring that data is safe and accurate, they build greater confidence in the resulting insights and analytics.

Some practices that can help strike this balance include engaging users, sharing experiences and focusing on the most-used data first. If users have a tool that encourages them to centralize their data securely, they are more likely to follow secure practices.

A centralized platform can also help users identify data, processes and other information that might be relevant to their analytics or machine learning project. Machine learning can be used to identify trends in usage, as well as potential risks.

6. Identify business needs

Data provides value only when its use satisfies a business need. Daniel Mintz, chief data evangelist at Looker, a data modeling platform, recommends starting with one question.

"What does the business need out of these data sets?" he said. "If you don't ask that upfront, you can end up with just a mess of data sources that no one actually needs."

It's important to pull in the business owners and the business subject-matter experts early. These people are your users. Not pulling them in at the start is just as bad as building software without talking to the intended audience.

"Always avoid curating a bunch of data without talking to the intended audience," Mintz said.

7. Balance analytics users and data stewards

A centralized data governance effort is important. But it's also a good idea to include the analytics users as part of this process, said Jean-Michel Franco, senior director of data governance product at Talend.

"They need to contribute to the data governance process, as they are the ones that know their data best," he said.

One strategy is to adopt a Wikipedia-like approach with a central place where data is shared, and where anyone can contribute to the data curation process under well-defined curation rules.

More centralized data stewardship roles can complement these efforts by implementing well-defined data governance processes covering several activities, including monitoring, reconciliation, refining, deduplication, cleansing and aggregation, to help deliver quality data to applications and end users.

8. Plan for problems

Developing a robust data curation process and data modeling strategy requires admins to account for imprecision, ambiguity and changes in the data.

"Spitting out some numbers at the end of a complex data pipeline is not very helpful if you can't trace the underlying data back to its source to assess its fitness for purpose at every stage," explained Justin Makeig, director of product management of MarkLogic Corp., an operational database provider.

Confidence in a source of data, for example, is a key aspect of how MarkLogic's Department of Defense and intelligence customers think about analytics. They need the ability to show their work but, more importantly, they need to update their findings if that confidence changes. This makes it easier to identify when decisions made in the past relied on a data source that is now known to be untrustworthy. Identifying the impact of untrustworthy data is only possible if all of that context is kept alongside the data in a queryable state.
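A lightweight way to keep that context queryable, sketched here with hypothetical function names rather than MarkLogic's own APIs, is to store the source identifier and confidence level alongside every derived value so that past findings can be re-queried when trust in a source changes.

from datetime import datetime, timezone

def with_provenance(value, source_id, confidence):
    # Wrap a derived value with the context needed to re-assess it later.
    return {
        "value": value,
        "source_id": source_id,          # which upstream feed produced this
        "confidence": confidence,        # trust level at computation time
        "computed_at": datetime.now(timezone.utc).isoformat(),
    }

def affected_findings(findings, downgraded_source):
    # Query past findings that relied on a source no longer considered trustworthy.
    return [f for f in findings if f["source_id"] == downgraded_source]

findings = [with_provenance(0.87, source_id="feed-a", confidence="high")]
stale = affected_findings(findings, "feed-a")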
