How to improve data governance for self-service analytics
Citizen data scientists and self-service analytics are on the rise as the data scientist shortage continues. Here are some data management best practices to support them.
The movement for improved self-service analytics to enable citizen data scientists is at a tipping point. According to analysts with Gartner, 2020 will be the year citizen data scientists surpass analytics professionals and data scientists in the amount of advanced analysis they produce.
To get the most out of this analytical groundswell, however, organizations will have to prioritize data governance for self-service analytics. Those that scale out self-service capabilities without a corresponding set of data governance and management best practices will likely find business users struggling to make decisions because of inconsistent data, stale data and hidden data, among other problems. Organizations also expose themselves to considerable data privacy and security risk when they open up data sources to self-service analytics users without appropriate governance.
To address these issues and better support citizen data scientists in the coming years, according to analytics and data management experts, organizations need to focus on the following best practices for governing and managing data.
Balancing data quality and timeliness
Data quality has increasingly moved to the center of a number of analytics trends, including self-service analytics, and it will be a key goal for organizations in 2020, according to Emily Washington, executive vice president of product management at data governance software vendor Infogix Inc.
"As organizations continue to push the limits with data storage and processing, we see data quality as the underlying theme to ensure they're leveraging data they can trust," Washington said.
The problem enterprises face is not just providing consistent, clean data, but also doing it in real time. A recent survey by Actian Corp. showed that while 94% of IT decision-makers today said it's important to receive current data to power a data-driven enterprise culture, over half of them admitted they're forced to use stale data at least some of the time.
As a result, organizations are seeking new ways to serve up data. Traditional batch processing, in which data moves from system to system on a fixed schedule, isn't meeting the needs of today's real-time analytics environments, Washington said.
To meet these demands, many companies are moving to event-driven architectures that handle large volumes of streaming data, leaning on distributed streaming platforms such as Apache Kafka, ActiveMQ, Apache Pulsar and Amazon Kinesis. The goal is not only to help citizen data scientists make decisions more quickly, but also to open up more analytics use cases.
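As a simple illustration of the event-driven pattern, the following Python sketch publishes and consumes individual business events using the open source kafka-python client. The broker address, topic name and event fields are hypothetical placeholders, not any vendor's recommended setup.

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Hypothetical broker address and topic name; adjust for your cluster.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each business event is published the moment it happens, rather than
# waiting for a scheduled batch job to move it between systems.
producer.send("customer-events", {"customer_id": 1042, "action": "page_view"})
producer.flush()

# Downstream consumers (dashboards, alerts, ML features) can react to
# each event within seconds of its arrival.
consumer = KafkaConsumer(
    "customer-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # stand-in for real stream processing
```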
"There are exciting new analytics use cases, like customer-360 and hyper-personalized real-time offers, that simply don't work with stale data," said Jack Mardack, a vice president at database and data integration vendor Actian. "This blurs the lines between traditionally separate transactional databases and data warehouses and places new demands on the data management infrastructure, where real-time availability is now a requirement."
The point is, though, that real-time data becomes a liability rather than an asset if it's not properly validated and governed at speed.
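In practice, validating at speed can be as simple as a lightweight gate applied to every record before it reaches an analyst's dashboard. The sketch below checks completeness and freshness; the field names and the five-minute threshold are illustrative assumptions.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = {"customer_id", "action", "ts"}  # hypothetical schema
MAX_AGE_SECONDS = 300  # illustrative freshness threshold

def is_valid(event: dict) -> bool:
    """Gate each streaming record on completeness and freshness."""
    if not REQUIRED_FIELDS.issubset(event):
        return False  # incomplete records never reach analysts
    event_time = datetime.fromisoformat(event["ts"].replace("Z", "+00:00"))
    age = (datetime.now(timezone.utc) - event_time).total_seconds()
    return 0 <= age <= MAX_AGE_SECONDS  # stale records get quarantined
```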
Putting an emphasis on data governance
Establishing strong data governance for self-service analytics is at the heart of addressing data quality issues that hamper effective citizen data scientists. It's also crucial for ensuring that the activities of citizen data scientists don't devolve into security and compliance nightmares.
"When enabling and managing citizen data scientists, governance should be a high priority," said Jen Underwood, an independent consultant and former senior director at machine learning vendor DataRobot, where she was in charge of product marketing for citizen data science uses. "For organizations in highly regulated industries -- financial services, pharmaceutical or biotechnology and energy -- effective data management solutions for supporting legal and regulatory compliance, mitigating risk and improving efficiency are simply not negotiable."
The good news is that data access policies for citizen data scientists don't have to be revolutionary. These policies can evolve from and mirror similar policies that organizations have been rolling out for enterprise self-service BI functions.
The trick is adapting those policies to new use cases as they crop up, such as accounting for how machine learning data access practices need to change in light of data privacy laws like the EU's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act.
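One concrete adaptation is to pseudonymize direct identifiers before a data set is exposed for self-service use, so that model building can proceed without raw personal data. The sketch below uses salted hashing; the column names are hypothetical, and real GDPR or CCPA compliance involves much more than this one step.

```python
import hashlib

PII_COLUMNS = {"email", "full_name", "phone"}  # hypothetical identifier columns

def pseudonymize(record: dict, salt: str) -> dict:
    """Replace direct identifiers with salted hashes before exposing a record."""
    masked = dict(record)
    for col in PII_COLUMNS & masked.keys():
        digest = hashlib.sha256((salt + str(masked[col])).encode()).hexdigest()
        masked[col] = digest[:12]  # consistent token, so joins still work
    return masked
```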
Amping up data discovery and data prep with augmented analytics
Organizations are increasingly turning to the machine learning capabilities of augmented analytics to automate how they discover and prepare the data citizen data scientists need to glean insights from organizational information.
Data discovery, aided by the deployment of data catalogs, is a crucial piece of the data management puzzle when it comes to getting the most out of self-service analytics.
"Acknowledged as important glue to enterprise software, delivery of a common catalog for finding, provisioning, securing and understanding data and other objects is important to customers," said Todd Wright, senior product marketing manager of data management and data privacy solutions at analytics software vendor SAS Institute. "Further, this discovered insight through application of advanced analytics delivers the ability to automate mundane data management tasks and find value in data that previously had been too difficult to discern."
In the meantime, augmented and smart analytics can drastically reduce the effort organizations must spend cleaning up data sets. According to Krzysztof Surowiecki, managing partner at analytics services firm Hexe Data, the extract, transform and load (ETL) process of preparing data for use takes up 80% of data analysts' time. Augmented analytics and AI stand to slash time-consuming data prep activities like ETL, he said.
Wright agreed, stating that this approach to data governance for self-service analytics will unlock the kind of data democratization that organizations need to empower citizen data scientists.
"To expand data manipulation activities to a wider audience, development of advanced data transformation using AI to automate cleansing and blending will empower nontechnical users," Wright said.