Human oversight enables automated data governance

Automating data governance makes tedious, time-consuming tasks more efficient. The human element remains critical to keeping automated systems in compliance with regulations.

Not all aspects of data governance should be automated, but you can deploy AI and ML to automate repetitive and time-consuming compliance checks with careful human oversight to avoid compliance violations.

Given the sheer scale and complexity of today's system architectures, automation is not just a benefit but a necessity for modern data governance. It can free up time for your data or IT teams to focus on important duties by automating repetitive tasks such as data discovery and classification. Automated systems can monitor any data changes and flag possible compliance issues. Depending on the complexity of the issue, an automated system might be able to correct it on its own or notify the relevant team to take a deeper look.

Despite automation's potential for more efficient data governance, you must carefully balance it with human oversight to ensure your policies and practices are effective and valuable.

Why you should use automated data governance

A digital enterprise is any business that has digitized its processes extensively and relies on data for operations and analytics. An organization like this typically manages data across multiple SaaS applications and on-premises systems, which may involve multiple cloud platforms, IoT sensors and edge devices.

Maintaining policies across different geographies, systems and organizational boundaries creates complexity. In the world of data management, it has long been a rule of thumb that about 80% of the effort on any project is spent on data preparation and governance rather than on analysis, presentation or decision support. If the complexity of governance increases, that leaves less time to spend gathering business value from data.

Failure to address complexity can lead to errors in routine management tasks, such as data classification and data quality. It also affects regulatory compliance by increasing the likelihood that policies aren't consistently applied across the data estate.

The final major consequence is financial. According to IBM's "Cost of a Data Breach Report 2024," the global average cost of a data breach is $4.88 million.

Key components of automated data governance

Automation can help you apply operational processes and governance practices at scale across your organization. It's also important to understand what you can automate and the importance of human oversight in the process.

As enterprise data architecture grows more diverse and complex, so does the variety of governance tools and techniques in use. Automation helps in three key areas: data discovery, data quality and policy management.

Data discovery and classification

Data discovery is the process of scanning your entire data infrastructure to identify what data exists in databases, file systems, cloud storage, SaaS applications or edge devices. Classification categorizes or catalogs discovered data based on its type, sensitivity and business context. Data teams use classification tools to inventory all data assets and map the relationships between different data elements.

Automated processes can perform both discovery and classification consistently and continuously. Automation applies correct categorization and mapping as you add new data sources. Businesses that innovate or expand rapidly via merger and acquisition need automation so their IT teams can keep on top of any changes.

Knowing what data exists is only half the problem for governance. You need to detect and tag sensitive data, such as personally identifiable information (PII), financial records, or health information, with governance tags and labels to ensure correct handling. One approach is to categorize data according to compliance requirements, such as the GDPR in Europe or HIPAA for healthcare, and label critical data elements.
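As a rough illustration of detecting and tagging sensitive data, the sketch below scans sample values from each column and attaches governance tags. The patterns, tag names and the `classify_column` and `tag_table` helpers are hypothetical and deliberately simplistic; commercial classifiers typically combine far broader rules with ML models.

```python
import re

# Hypothetical patterns for illustration only; real classifiers need
# broader, validated rules and usually ML-based detection as well.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def classify_column(values):
    """Return the sorted PII tags whose pattern matches any sample value."""
    tags = set()
    for value in values:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(str(value)):
                tags.add(tag)
    return sorted(tags)


def tag_table(columns):
    """Map each column name to its governance tags (empty list if none)."""
    return {name: classify_column(samples) for name, samples in columns.items()}


sample = {
    "contact": ["ann@example.com", "bob@example.org"],
    "tax_id": ["123-45-6789"],
    "notes": ["called on Tuesday"],
}
print(tag_table(sample))
```

Columns that come back with an empty tag list are not necessarily safe. They only failed to match these particular patterns, which is one reason human review of classification results remains useful.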

Data discovery and classification are critical capabilities: You can't automate governance if you don't know what data you have. Data catalog applications use machine learning (ML) to ensure the work is thorough, accurate and consistent across large, distributed data architectures.

Data quality and lineage

Knowing what data you have is an important first step in data governance, but data can vary greatly in quality, even within a single application. Common data quality problems raise questions that you need to answer:

  • Duplication. Does a record have multiple copies?
  • Range validation. Does the data fall within a predetermined range, such as customer ages?
  • Pattern matching. Does data, such as a phone number, fit an expected pattern?
  • Completeness and null values. Are there empty fields?
  • Consistency. Does the data stay consistent and accurate across all your databases? For example, the "State" field should contain standard two-letter abbreviations.
  • Timeliness. How recent is the data? When was its last update?

Once you establish rules for data quality, specialized tools or automated scripts can efficiently test large volumes of data. Data quality tools might also include AI features that can learn new patterns in data. AI can suggest patterns as new rules or flag unexpected records as anomalies for review.
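The checks listed above translate fairly directly into code. The following is a minimal sketch of an automated script, assuming records arrive as plain dictionaries; the field names, the age range, the truncated state list and the one-year freshness threshold are all illustrative assumptions rather than recommended values.

```python
import re
from datetime import date

US_STATES = {"CA", "NY", "TX", "WA"}  # truncated list, for illustration only
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
REQUIRED_FIELDS = ("id", "age", "phone", "state", "updated")


def check_record(record, today, max_age_days=365):
    """Return the names of the quality rules that a single record fails."""
    issues = []
    # Completeness: required fields must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            issues.append(f"missing:{name}")
    # Range validation: ages must fall within a predetermined range.
    age = record.get("age")
    if isinstance(age, int) and not 0 <= age <= 120:
        issues.append("range:age")
    # Pattern matching: phone numbers must fit the expected pattern.
    phone = record.get("phone")
    if phone and not PHONE_RE.match(phone):
        issues.append("pattern:phone")
    # Consistency: state must be a standard two-letter abbreviation.
    state = record.get("state")
    if state and state not in US_STATES:
        issues.append("consistency:state")
    # Timeliness: the record must have been updated recently enough.
    updated = record.get("updated")
    if updated and (today - updated).days > max_age_days:
        issues.append("timeliness:updated")
    return issues


def find_duplicates(records, key="id"):
    """Duplication: return key values that appear on more than one record."""
    seen, dupes = set(), set()
    for record in records:
        value = record.get(key)
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)


records = [
    {"id": 1, "age": 34, "phone": "206-555-0100", "state": "WA", "updated": date(2025, 1, 10)},
    {"id": 2, "age": 150, "phone": "5550100", "state": "Washington", "updated": date(2021, 3, 1)},
    {"id": 2, "age": 41, "phone": "", "state": "NY", "updated": date(2025, 2, 1)},
]
today = date(2025, 6, 1)
for record in records:
    print(record["id"], check_record(record, today))
print("duplicate ids:", find_duplicates(records))
```

Each rule is a separate, named check, so a failed record reports exactly which rule it broke. That makes it easier to route a failure to the right team or to retire a rule that no longer applies.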

An important process in data quality is lineage tracking, which enables an auditor or administrator to see where data comes from and the route it takes through your systems. It's important to know when, where and why any data changes -- even a correction -- occur.

Data quality tools or extract, transform and load tools often include automated lineage analysis.
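To make the idea concrete, here is a small sketch of what a lineage record might capture: what changed, where it came from, when, why and who did it. The `LineageEvent` and `LineageLog` classes and the dataset names are hypothetical, and the trace assumes each dataset has a single upstream source, which real lineage tools do not.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    dataset: str    # the dataset that was written or changed
    source: str     # where the data came from (same as dataset for in-place edits)
    operation: str  # e.g. "load", "transform", "correct"
    reason: str     # why the change happened
    actor: str      # job or person responsible
    at: datetime    # when it happened


@dataclass
class LineageLog:
    events: list = field(default_factory=list)

    def record(self, dataset, source, operation, reason, actor, at=None):
        """Append one lineage event, timestamped in UTC unless a time is given."""
        event = LineageEvent(dataset, source, operation, reason, actor,
                             at or datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def changes(self, dataset):
        """Return (operation, reason, actor) for every change to a dataset."""
        return [(e.operation, e.reason, e.actor)
                for e in self.events if e.dataset == dataset]

    def trace(self, dataset):
        """Walk upstream from a dataset and return its sources, nearest first."""
        path, current, visited = [], dataset, set()
        while current not in visited:
            visited.add(current)
            upstream = [e for e in self.events
                        if e.dataset == current and e.source != current]
            if not upstream:
                break
            current = upstream[-1].source  # simplification: latest single parent
            path.append(current)
        return path


log = LineageLog()
log.record("crm.customers_raw", "salesforce_export", "load", "nightly ingest", "etl_job")
log.record("warehouse.customers", "crm.customers_raw", "transform",
           "standardize state codes", "etl_job")
log.record("warehouse.customers", "warehouse.customers", "correct",
           "merge duplicate customer record", "data_steward")
print(log.trace("warehouse.customers"))
print(log.changes("warehouse.customers"))
```

Note that the manual correction is logged in the same way as the automated steps, so an auditor sees one continuous history.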

Policy enforcement and compliance

Data discovery and quality work well with automation because they involve well-defined, constantly repetitive actions. As AI and ML evolve, they can increasingly infer and apply the needed rules even when the actions are less clearly defined.

Policies are a little different. They are more complex than simple rules, often including multiple options. A policy typically includes a definition of what the policy applies to, such as PII. It has rules about allowable actions. For example, you might not be able to move certain data if it's not encrypted, or only certain people might be able to access it.

In addition to rules, a policy might include actions to take if a violation happens. Policies might also have a certain scope. For example, a policy might apply only in a particular geography or for a specific period.

Policies might seem less suitable for automation because of the increased complexity. However, you can deconstruct a policy into steps and automate each one, then monitor and generate audit trails for the complete policy.
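One way to picture this deconstruction is as a small data structure plus a step-by-step evaluation, sketched below. The `Policy` fields, the request keys and the `block_and_alert` action name are assumptions made for illustration, not the schema of any particular policy management product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Policy:
    name: str
    applies_to: str                   # classification covered, e.g. "PII"
    regions: frozenset                # geographic scope
    valid_from: date                  # start of the period the policy covers
    valid_to: date                    # end of the period the policy covers
    allowed_purposes: frozenset       # permitted uses of the data
    require_encryption_to_move: bool  # rule on moving data
    on_violation: str                 # action to take when a rule is broken


def evaluate(policy, request, today, audit):
    """Evaluate a request one step at a time and log the decision."""
    # Step 1: scope. Outside its scope, the policy does not apply.
    in_scope = (
        request["classification"] == policy.applies_to
        and request["region"] in policy.regions
        and policy.valid_from <= today <= policy.valid_to
    )
    if not in_scope:
        decision = "not_applicable"
    # Step 2: rules on movement.
    elif (request["action"] == "move" and policy.require_encryption_to_move
          and not request.get("encrypted", False)):
        decision = policy.on_violation
    # Step 3: rules on use.
    elif (request["action"] == "use"
          and request.get("purpose") not in policy.allowed_purposes):
        decision = policy.on_violation
    else:
        decision = "allow"
    # Step 4: audit trail for the complete policy.
    audit.append({"policy": policy.name, "request": request,
                  "decision": decision, "date": today})
    return decision


policy = Policy(
    name="eu-pii",
    applies_to="PII",
    regions=frozenset({"EU"}),
    valid_from=date(2025, 1, 1),
    valid_to=date(2025, 12, 31),
    allowed_purposes=frozenset({"analysis"}),
    require_encryption_to_move=True,
    on_violation="block_and_alert",
)
audit = []
today = date(2025, 6, 1)
base = {"classification": "PII", "region": "EU"}
print(evaluate(policy, {**base, "action": "use", "purpose": "analysis"}, today, audit))
print(evaluate(policy, {**base, "action": "use", "purpose": "direct_marketing"}, today, audit))
print(evaluate(policy, {**base, "action": "move", "encrypted": False}, today, audit))
print(len(audit), "audit entries")
```

Because every evaluation writes an audit entry, including the ones that are allowed or out of scope, reviewers can later check how the policy behaved as a whole rather than only where it blocked something.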

You can automate policies in different layers of architecture. Simpler rules about data access might apply in the data storage layer -- the database, data warehouse or data lake. However, policies often concern not only access to data, but its use. You might have permission to use customer data for analysis, but not for direct marketing. Some policies might apply in the data catalog or analytics catalog. More complex policies -- for example, those involving alerts, approvals or coordination across multiple systems -- might require specialized policy management or compliance management software.

The human element: Best practices for automation

Automation can handle routine tasks efficiently, but human judgment remains critical for setting data governance strategies and priorities in advance. It is also important to confirm that governance aligns with business objectives and complies with applicable legislation.

Many data governance decisions require contextual understanding and ethical considerations that machines cannot handle. For example, in healthcare, some cases involve experimental protocols that need validation by human expertise. Related billing or insurance processes might need a human override.

Automated data governance has three common patterns of human involvement:

  • Human-in-the-loop. Actively includes humans in the decision-making process. Humans often make final decisions based on automated suggestions. The goal is to maintain ethical and contextually appropriate actions, such as in health diagnostics, with human insight and oversight to prevent the system from operating unchecked.
  • Human-out-of-the-loop. Automated systems operate independently, without human interaction. The approach is ideal for high-volume, low-risk actions, such as some automated classification or data quality tasks. A lack of human oversight can lead to unchecked problems in complex situations.
  • Human-over-the-loop. Humans monitor operations, intervening if needed to apply policies and standards. This pattern balances the efficiency of automation with essential human oversight. Like the human-in-the-loop pattern, it lets automated processes act within ethical and operational boundaries, but without constant intervention.
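The three patterns differ mainly in when a person sees an automated decision. The sketch below shows one possible way to express that difference in code. The confidence score, the 0.9 threshold and the `handle` function are hypothetical; in practice, the criteria for intervention depend on your own risk assessment.

```python
from enum import Enum


class Oversight(Enum):
    IN_THE_LOOP = "human-in-the-loop"
    OVER_THE_LOOP = "human-over-the-loop"
    OUT_OF_THE_LOOP = "human-out-of-the-loop"


def handle(suggestion, pattern, confidence, review_queue, monitor_log,
           threshold=0.9):
    """Apply or defer an automated suggestion according to the pattern."""
    if pattern is Oversight.IN_THE_LOOP:
        # A human makes the final decision on every suggestion.
        review_queue.append(suggestion)
        return "pending_review"
    if pattern is Oversight.OVER_THE_LOOP:
        # Every action is visible to the person monitoring operations,
        # and low-confidence cases are escalated for intervention.
        monitor_log.append((suggestion, confidence))
        if confidence < threshold:
            review_queue.append(suggestion)
            return "escalated"
        return "applied"
    # Out of the loop: the system acts independently.
    return "applied"


review_queue, monitor_log = [], []
print(handle("tag column 'email' as PII", Oversight.OUT_OF_THE_LOOP, 0.99,
             review_queue, monitor_log))
print(handle("merge customer records", Oversight.OVER_THE_LOOP, 0.72,
             review_queue, monitor_log))
print(handle("override insurance claim code", Oversight.IN_THE_LOOP, 0.95,
             review_queue, monitor_log))
print(review_queue)
```

Keeping the pattern as an explicit setting per task makes it straightforward to change the level of oversight for one task without touching the others.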

Start small and scale gradually

Choosing the right pattern for your needs is an important first step in automating governance. Whichever of the three patterns you choose, start by automating well-defined, repetitive governance tasks before moving on to more involved processes. Regularly evaluate automated practices to determine their effectiveness, and adjust them as needed.

You might also find your needs change as you implement automation. For example, you might start with a human-in-the-loop pattern, but as your confidence in automation grows, you might move to applying human-over-the-loop.

An important component in any automated process is to define clear escalation paths for issues so that you have a feedback loop between automated systems and your governance team. Even in a highly automated system, some initial steps, such as generating business definitions, might still be manual.

Automated governance risks and mitigation strategies

Automation carries risk. Organizations that are too reliant on automated systems can potentially miss nuanced or complex governance issues, such as intellectual property infringement, which an automated process might not be able to evaluate.

Don't underestimate the technical challenges of integration across diverse systems and data sources at scale. If your governance requirements are "already beyond human complexity," then automation helps and makes better governance possible, but it remains a complex process that needs management.

The future outlook

AI has the potential to improve more complex data governance issues. For example, data catalogs can now use generative AI to annotate and describe data elements in detail, saving many hours of repetitive human work.

After training on historical examples of human actions, AI might automate some decisions that currently require a human in the loop. However, the regulatory landscape is only becoming more complex as international, national and state-level regulations expand to cover emerging consumer demands and new technologies. It is a challenge to keep automated decisions on pace with constantly evolving regulations.

Automated data governance is a powerful tool for managing modern data landscapes, but the future of the technologies and practices involved still lies in finding the right balance between automation and human expertise.

Donald Farmer is a data strategist with 30+ years of experience, including as a product team leader at Microsoft and Qlik. He advises global clients on data, analytics, AI and innovation strategy, with expertise spanning from tech giants to startups. He lives in an experimental woodland home near Seattle.
