Identifying data quality issues via data profiling, reasonability checks

In a book excerpt, author Laura Sebastian-Coleman explores data profiling, data issue management and the use of reasonability checks in assessing data quality.

This is part two of an excerpt from Chapter 4: Data Quality and Measurement, from the book Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework by Laura Sebastian-Coleman. Sebastian-Coleman is a data quality architect at Optum Insight, which provides analytics, technology and consulting services to health care organizations. In this section of the chapter, she explores data profiling and data quality issues management, and discusses reasonability checks and their importance to an overall quality framework. In the first part of the excerpt, she explains data quality assessment terminology and discusses quality measurement concepts.

Data Profiling

Data profiling is a specific kind of data analysis used to discover and characterize important features of data sets. Profiling provides a picture of data structure, content, rules and relationships by applying statistical methodologies to return a set of standard characteristics about data -- data types, field lengths and cardinality of columns, granularity, value sets, format patterns, content patterns, implied rules, and cross-column and cross-file data relationships and cardinality of those relationships.
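
To make those characteristics concrete, here is a minimal column-profiling sketch in Python. It assumes the pandas library is available; the function name profile_columns and the particular set of reported characteristics are illustrative choices, not a method prescribed by the book.

import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Return standard profiling characteristics for each column: inferred
    # data type, maximum field length, cardinality and null percentage.
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "max_length": int(s.dropna().astype(str).str.len().max()) if s.notna().any() else 0,
            "cardinality": int(s.nunique(dropna=True)),
            "null_pct": round(float(s.isna().mean()) * 100, 2),
        })
    return pd.DataFrame(rows)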

Profiling also includes inspection of data content through a column profile or percentage distribution of values. Distribution analysis entails counting all the records associated with each value and dividing these by the total number of records to see what percentage of the data is associated with any specific value and how the percentages compare to each other. Understanding the percentages is useful, especially for high-cardinality value sets and for data sets with a large number of records. Unless you calculate proportions as percentages of the total, it can be difficult to comprehend differences between the individual measurement results.
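
A similar sketch, again assuming pandas, implements the distribution analysis described above: it counts the records associated with each value and converts each count into a percentage of the total so that individual results can be compared directly.

def value_distribution(series: pd.Series) -> pd.DataFrame:
    # Count the records associated with each value and divide by the total
    # number of records to express each count as a percentage.
    counts = series.value_counts(dropna=False)
    return pd.DataFrame({
        "count": counts,
        "pct_of_total": (counts / len(series) * 100).round(2),
    })

# e.g., value_distribution(df["status_code"])  # hypothetical column name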

Copyright info

This excerpt is from the book Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework, by Laura Sebastian-Coleman. Published by Morgan Kaufmann Publishers, Burlington, Mass. ISBN 9780123970336. Copyright 2013, Elsevier BV.

Profiling results can be compared with documented expectations, or they can provide a foundation on which to build knowledge about the data. Though it is most often associated with the beginning of data integration projects, where it is used for data discovery and to prepare data for storage and use, data profiling can take place at any point in a data asset's lifecycle. Most data assets benefit from periodic reprofiling, which provides assurance that quality has not changed or that any changes are reasonable given changes in business processes. Section three will discuss periodic measurement and reassessment of the data environment as one of the three assessment scenarios supported by the data quality assessment framework (DQAF).
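
As an illustration of periodic reprofiling, the following sketch compares a current profile against a stored baseline and flags drift. It assumes the output format of the hypothetical profile_columns function above (in particular its column and null_pct fields), and the five-point tolerance is an arbitrary placeholder, not a recommended value.

def profile_drift(baseline: pd.DataFrame, current: pd.DataFrame,
                  tolerance: float = 5.0) -> pd.DataFrame:
    # Compare a current column profile against a stored baseline and flag
    # columns whose null percentage moved by more than `tolerance` points.
    merged = baseline.merge(current, on="column", suffixes=("_base", "_curr"))
    merged["null_pct_delta"] = (merged["null_pct_curr"] - merged["null_pct_base"]).abs()
    merged["flag"] = merged["null_pct_delta"] > tolerance
    return merged[["column", "null_pct_base", "null_pct_curr", "null_pct_delta", "flag"]]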

Data Quality Issues and Data Issue Management

Profiling and other forms of assessment will identify unexpected conditions in the data. A data quality issue is a condition of data that is an obstacle to a data consumer's use of that data -- regardless of who discovered the issue, where or when it was discovered, what its root cause(s) are determined to be, or what the options are for remediation. Data issue management is the process of removing or reducing the impact of obstacles that prevent effective use of data. It includes identification, definition, quantification, prioritization, tracking, reporting and resolution of issues; prioritization and resolution depend on data governance. To resolve a problem means to find and implement a solution. Issue resolution refers to the process of bringing an issue to closure, either through a solution or through a decision not to implement one. Any individual issue will have a specific definition of what constitutes its resolution.
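
One way to make this lifecycle concrete is a simple issue record. The Python structure below is an illustrative assumption, not a schema from the book; note that both "resolved" and "closed without action" count as resolution, matching the definition above.

from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class IssueStatus(Enum):
    IDENTIFIED = "identified"
    PRIORITIZED = "prioritized"
    IN_REMEDIATION = "in_remediation"
    RESOLVED = "resolved"                  # a solution was implemented
    CLOSED_NO_ACTION = "closed_no_action"  # governance decided not to act

@dataclass
class DataQualityIssue:
    issue_id: str
    definition: str            # the condition that obstructs the data consumer
    affected_records: int      # quantification of the issue's scope
    priority: int              # assigned through data governance
    resolution_criteria: str   # what constitutes resolution for this issue
    status: IssueStatus = IssueStatus.IDENTIFIED
    opened: date = field(default_factory=date.today)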

Reasonability Checks

Depending on the nature of the data being profiled, some results may provide reasonability checks for the suitability of data for a particular data asset, as well as its ability to meet requirements and expectations. A reasonability check provides a means of drawing a conclusion about data based on knowledge of data content rather than on a strictly numeric measurement. Reasonability checks can take many forms, many of which use numeric measurement as input. All such checks answer a simple question: Does this data make sense, based on what we know about it? The basis for judging reasonability ranges from simple common sense to deep understanding of what the data represents.
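
A minimal sketch of such a check, with hypothetical names and values: the expected range stands in for knowledge of the data content, and the function answers the question the text poses.

def reasonability_check(measured: float, expected_low: float,
                        expected_high: float, rationale: str) -> bool:
    # The expected range encodes knowledge of the data's content; the check
    # answers "does this value make sense, based on what we know about it?"
    ok = expected_low <= measured <= expected_high
    verdict = "makes sense" if ok else "investigate"
    print(f"{verdict}: {measured} vs expected [{expected_low}, {expected_high}] -- {rationale}")
    return ok

# e.g., reasonability_check(0.51, 0.45, 0.55, "share of female members")  # hypothetical values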


Reasonability checks are especially necessary during initial data assessment, when little may be known about the data. The answer to the question of whether the data makes sense determines the next steps in any assessment. If the data does make sense, you should document why and continue through the assessment. If it does not make sense, then you should define why it does not and determine what response is needed.

Because measurement adds to knowledge about the object being measured, what is reasonable may change as measurement produces deeper understanding of the data. For many consistency measurements, where the degree of similarity between data sets is being measured, initial reasonability checks can evolve into numeric thresholds. If, for example, historical levels of defaulted data range from 1% to 2% and the reasons for defaults being assigned are documented and understood, then 2% defaults might be considered a reasonable level and 2.5% might be cause for investigation. Even with such numbers in place, it is always the role of the data quality analyst to ask the question: Does this data make sense?
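
That example can be encoded as a simple threshold check. The sketch below uses the figures from the text (a documented 1% to 2% range, with 2.5% as the investigation trigger); the function name is hypothetical.

def check_default_rate(defaulted: int, total: int) -> str:
    # Historical default levels of 1-2% are documented and understood, so
    # up to 2% is reasonable and 2.5% or more is cause for investigation.
    rate = defaulted / total
    if rate <= 0.02:
        return f"reasonable: {rate:.2%} of records defaulted"
    if rate >= 0.025:
        return f"investigate: {rate:.2%} of records defaulted"
    return f"review: {rate:.2%} is above the documented range but below the alert level"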

Many of the DQAF measurement types are characterized as reasonability measures. They compare one instance of measurement results from a data set with the history of measurements of the same data set in order to detect changes. Changes do not necessarily mean the data is wrong. They indicate only that it is different from what it had been. Within the framework, reasonability checks are contrasted with controls, which can produce measurement results that confirm data is incomplete or incorrect. Controls can be used to stop data processing.
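
To illustrate the contrast, the sketch below (hypothetical names, standard-library Python) pairs a reasonability measure, which compares one measurement result against the history of the same measurement, with a control, which can confirm incompleteness and stop processing.

import statistics

def is_reasonable(current: float, history: list[float], max_sigma: float = 3.0) -> bool:
    # Reasonability measure: compare one measurement result with the history
    # of measurements of the same data set. A failure means the data is
    # different from what it had been, not necessarily wrong.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) <= max_sigma * stdev

def completeness_control(received: int, expected: int) -> None:
    # Control: a hard check whose result confirms data is incomplete and
    # that can be used to stop data processing outright.
    if received < expected:
        raise RuntimeError(f"halt load: received {received} of {expected} records")

The reasonability flag is advisory -- a failed check signals difference, not error -- while the control raises an exception and halts the load, mirroring the framework's distinction.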
