Using data profiling techniques -- and estimating the effort required
Data profiling is a key part of data quality efforts. Here's a simple formula for calculating the amount of time needed to profile a data set.
Data profiling has emerged as a necessary component of every data quality analyst's arsenal. Data profiling tools track the frequency, distribution and characteristics of the values that populate the columns of a data set; they then present the statistical results to users for review and drill-down analysis. There are a number of valuable usage scenarios for data profiling techniques, most frequently involving data quality assessment and metadata discovery.
The results of a data profile typically capture key aggregate measures associated with the values in individual columns of data, including the number and percentage of populated versus null fields; the number of unique values; frequency counts for each value and for patterns in the data; the maximum and minimum values; and information on data types and the length of character strings in data elements. More sophisticated data profiling software will provide additional details about dependencies between columns and relationships across tables. The goal is to identify data anomalies or shared characteristics among data elements.
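For illustration, here is a minimal Python sketch (using pandas) of the kinds of per-column measures described above. The DataFrame, column names and sample values are hypothetical, and a commercial profiling tool would compute far more than this.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, None],
    "state": ["MD", "VA", "MD", "md", None],
})

def profile_column(series: pd.Series) -> dict:
    """Capture the basic aggregate measures a profiling tool reports per column."""
    populated = series.dropna()
    return {
        "data_type": str(series.dtype),
        "populated_count": int(series.notna().sum()),
        "null_pct": round(100 * series.isna().mean(), 1),
        "unique_values": int(series.nunique()),
        "value_frequencies": series.value_counts().to_dict(),
        "min": populated.min() if not populated.empty else None,
        "max": populated.max() if not populated.empty else None,
        "max_length": int(populated.astype(str).str.len().max()) if not populated.empty else None,
    }

for name, column in df.items():
    print(name, profile_column(column))
```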
The danger is that the allure of the exploration aspect of the data profiling process can draw the data quality practitioner into a never-ending spiral of analysis. Any questionable aggregate value can trigger a new thread of detective work, with the analyst seeking to better understand the details, context or root cause of a potential data quality issue. Of course, some of these reviews may identify critical problems. But most either certify the validity of the data or, worse, fail to determine whether the perceived issue is a data flaw at all.
And that highlights one of the operational risks of data profiling: a lack of planning. The freedom to explore data often conflicts with the need to adequately plan for the process, including estimating the amount of time and resources needed to complete a data profiling task. Preemptively timeboxing the process, by asserting that it must be completed within a specific period, may stifle the ability of analysts to do the right amount of exploration. But not setting boundaries on the project time frame could allow the work to continue dribbling on after the most significant discoveries have been exposed.
Data profiling process: Pretty predictable
Fortunately, when one considers the expected outcomes of the most frequently performed data profiling activities, estimating the required time and level of effort is actually quite predictable. We can look at both of the usage scenarios already suggested to frame an approach for making such estimates.
In a data quality assessment, the objective is to identify potential anomalies that might reflect questionable data values. Common examples of the types of problems that data profiling techniques can uncover include the following (illustrated in the sketch after this list):
- Values missing when they should be present.
- Values present when they should be absent.
- Values appearing with unexpectedly high or low frequency.
- Values that don't match an expected pattern or format.
- Outlier values that fall far below or far above a defined range.
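As a rough illustration, the sketch below applies several of those checks to hypothetical sample columns. The 40% frequency threshold, the five-digit ZIP code pattern and the 0-to-120 age range are assumptions chosen for the example, not rules from the article.

```python
import pandas as pd

# Illustrative sample data; thresholds, patterns and ranges are assumptions.
ages = pd.Series([34, 41, None, 27, 27, 27, 27, 27, 250, 38], name="customer_age")
zips = pd.Series(["21204", "2120", "21030", "ABCDE", "21204"], name="zip_code")

findings = []

# 1. Values missing when they should be present (assume age is mandatory).
if ages.isna().any():
    findings.append(f"{int(ages.isna().sum())} null value(s) in mandatory column {ages.name}")

# 2. Values appearing with unexpectedly high frequency (assumed 40% threshold).
for value, share in ages.value_counts(normalize=True).items():
    if share > 0.40:
        findings.append(f"value {value} accounts for {share:.0%} of {ages.name}")

# 3. Values that don't match an expected pattern (assumed five-digit ZIP format).
bad_zips = zips[~zips.str.fullmatch(r"\d{5}", na=False)]
if not bad_zips.empty:
    findings.append(f"{len(bad_zips)} value(s) in {zips.name} violate the five-digit pattern")

# 4. Outliers far outside a defined range (assumed 0-120 for age).
outliers = ages[(ages < 0) | (ages > 120)]
if not outliers.empty:
    findings.append(f"out-of-range {ages.name} values: {sorted(outliers.unique())}")

print("\n".join(findings))
```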
The goal of metadata discovery and analysis is to document the structural and, potentially, semantic characteristics of each analyzed data attribute. In addition to data type and the size of data elements, commonly sought characteristics include the distribution of values throughout a data set, compliance with documented data types, and the identification of database keys and foreign key relationships.
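One simple way to approximate key and relationship discovery is to test columns for uniqueness and cross-table containment. The sketch below takes that approach under those simplifying assumptions; the tables, column names and the candidate_keys and possible_foreign_keys helpers are hypothetical.

```python
import pandas as pd

# Hypothetical parent and child tables for illustration only.
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4], "state": ["MD", "VA", "MD", "PA"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 1, 3]})

def candidate_keys(df: pd.DataFrame) -> list[str]:
    """Columns with no nulls and fully unique values are candidate keys."""
    return [c for c in df.columns if df[c].notna().all() and df[c].is_unique]

def possible_foreign_keys(child: pd.DataFrame, parent: pd.DataFrame) -> list[tuple[str, str]]:
    """Child columns whose values all appear in a parent candidate key."""
    relationships = []
    for key in candidate_keys(parent):
        for col in child.columns:
            if child[col].dropna().isin(parent[key]).all():
                relationships.append((col, key))
    return relationships

print(candidate_keys(customers))                 # ['customer_id']
print(possible_foreign_keys(orders, customers))  # [('customer_id', 'customer_id')]
```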
In those two scenarios, it's useful to differentiate those data elements that appear to comply with expectations, or whose metadata characteristics are easily documented, from those requiring additional analysis. For example, segregating questionable data elements and scheduling them for a subsequent deeper dive enables the data quality analyst to have the right level of concentration at the right phase of the assessment process. Alternatively, anomalies can be farmed out to other analysts to perform the deeper dive.
Do the math on data profiling techniques
Once a profiling tool has been applied to the data, a cursory review of each data element involves walking through a compliance checklist, which in most cases can be done in not more than 10 minutes per element. Data elements flagged as questionable require more investigation, which in my experience increases the average review time to around 90 minutes. Assuming that 20% of the elements in a data set typically need a deeper look, we can estimate the overall average time required per data element as follows:
(0.8 x 10) + (0.2 x 90) = 26 minutes
Let's be conservative and round that number up to 30 minutes. That means a good rule of thumb for estimating the amount of time needed to evaluate an entire data profile is to multiply the number of data elements by 30 to get a total number of minutes, which can then be divided by 60 to convert the estimate to hourly form. Or to simplify matters even further, you can get the same end result by dividing the number of data elements by two.
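Expressed as a small function, the rule of thumb might look like the sketch below. The 10- and 90-minute review times, the 20% deep-dive share and the 30-minute conservative floor are the assumptions stated above; swap in your own numbers as needed.

```python
def estimate_profiling_hours(num_elements: int,
                             quick_minutes: float = 10,
                             deep_minutes: float = 90,
                             deep_share: float = 0.20,
                             minimum_avg: float = 30) -> float:
    """Estimate total review effort, in hours, for a full data profile."""
    avg_minutes = (1 - deep_share) * quick_minutes + deep_share * deep_minutes  # 26 with these defaults
    avg_minutes = max(avg_minutes, minimum_avg)  # round up conservatively, per the rule of thumb
    return num_elements * avg_minutes / 60

print(estimate_profiling_hours(200))  # 100.0 hours -- the same as 200 elements / 2
```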
This easy math exercise should provide a relatively accurate way to scope the level of effort required for a data profiling project -- without a lot of up-front effort on your part. You then can verify the accuracy of the 30-minute average by starting out with a subset of the data elements and keeping track of the time it's actually taking to analyze them. If you find that the assumptions are somewhat off, revise the estimate using updated variables -- and proceed with the profiling.
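As a quick illustration of that verification step, this sketch recomputes the average from a tracked pilot sample and revises the estimate for the remaining elements. The timing values and the count of remaining elements are made up for the example.

```python
# Hypothetical tracked review times (in minutes) from a pilot subset of elements.
observed_minutes = [8, 12, 95, 9, 11, 7, 110, 10, 9, 14]

observed_avg = sum(observed_minutes) / len(observed_minutes)  # 28.5 minutes here
remaining_elements = 190                                      # assumed size of the remaining work
revised_hours = remaining_elements * observed_avg / 60

print(f"Observed average: {observed_avg:.1f} minutes per element")
print(f"Revised estimate for remaining work: {revised_hours:.1f} hours")
```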
About the author:
David Loshin is president of Knowledge Integrity Inc., a consulting, training and development services company that works with clients on business intelligence, big data, data quality, data governance and master data management initiatives. He also is the author of numerous books, including The Practitioner's Guide to Data Quality Improvement. Contact him at [email protected].