6 dimensions of data quality boost data performance
Poor data quality can lead to costly mistakes and bad decision making. The six dimensions of data quality ensure accurate, complete, consistent, timely, valid and unique data.
Higher levels of data quality produce better results, but determining quality is a challenge for many organizations. The six dimensions of data quality provide key measurements that identify a dataset's quality.
High-quality data is better able to serve its specified purpose, providing more value. Low-quality data can cause issues such as incorrect customer information entering a system, missed sales opportunities or regulatory fines due to inaccurate data reporting. Poor data quality costs organizations an average of $12.9 million annually because it results in poor decision-making, according to Gartner.
Dimensions of data quality can measure how good or bad data is. A dimension is a measurement that evaluates certain characteristics of a piece of data. For example, a dimension can measure how recent data is, which can help determine how relevant it is for use in customer behavior analysis. Customer behavior data from a decade ago isn't going to be as insightful or accurate as data from a month ago.
Six dimensions of data quality
Not all data dimensions are equal, and the characteristics might vary in importance depending on business goals. The six dimensions of data quality are the most common measurements many organizations use. It's important to understand each dimension, how to measure it and why it's valuable.
1. Accuracy
Data accuracy refers to the degree to which information is correct. The more closely data values reflect the truth, the more useful the data is. Accuracy is one of the most critical building blocks of data quality. If data is not accurate, it is functionally useless.
In practice, accurate data means having correct data entries. For example, a customer profile should list the correct name, phone number and other relevant information. If the phone number is one digit off, the data is incorrect. Even minuscule inaccuracies can cause issues, such as not being able to reach the right person.
Inaccuracies can occur for many reasons. During the information gathering process, they could happen due to human error, such as someone misspelling a name. Using a third-party dataset carries the risk of outdated information.
Organizations should regularly monitor, audit and cross-check data to ensure accuracy. When possible, compare data against another source of truth to verify accuracy. Regularly review and update datasets to correct discrepancies and errors.
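One way to cross-check data against another source of truth is a simple field-by-field comparison keyed on a shared identifier. The sketch below is a minimal illustration; the datasets, field names and IDs are hypothetical.

```python
# Minimal sketch: cross-check customer phone numbers against a reference
# source of truth. The datasets and field names here are hypothetical.

crm_records = {
    "C001": {"name": "Ana Silva", "phone": "555-0142"},
    "C002": {"name": "Ben Okoro", "phone": "555-0199"},
}

reference_records = {  # e.g. a verified export from a billing system
    "C001": {"name": "Ana Silva", "phone": "555-0142"},
    "C002": {"name": "Ben Okoro", "phone": "555-0198"},  # differs by one digit
}

def find_mismatches(primary, reference, field):
    """Return IDs whose field value disagrees with the reference dataset."""
    return [
        cid for cid, rec in primary.items()
        if cid in reference and rec.get(field) != reference[cid].get(field)
    ]

print(find_mismatches(crm_records, reference_records, "phone"))  # ['C002']
```

Flagged records can then be routed to a review queue or corrected automatically, depending on the organization's tolerance for error.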
In highly regulated industries, such as finance, data accuracy is paramount. One incorrect digit -- an extra zero mistakenly added to the end of a payment amount -- can result in financial loss, security risks and even regulatory fines. Audits are commonplace to catch such errors.
2. Completeness
Data completeness refers to having all necessary information. The more complete a dataset, the better informed decisions will be because they incorporate more variables. Completeness does not necessarily mean having every data field filled out. It's more important to have the required elements for a specific purpose.
For example, consider a purchase form for goods. Required data would be payment information and shipping address, and optional data could include a company name. Without a shipping address, the customer would never receive the goods. However, they could still receive the delivery without a company name.
Working off incomplete datasets or missing critical data can lead to inaccurate analyses and mistakes that waste time and money. A comprehensive dataset can paint a bigger, more accurate picture of reality. Share data across the organization to keep dataset records consistent.
To ensure data completeness, set mandatory fields for required information when gathering data so participants can't submit a form without providing the required values. Regular reviews and audits can scan for missing information. Completeness checks ensure data is fully represented.
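A completeness check can be as simple as verifying that every required field is present and non-empty. The sketch below follows the purchase-form example; the required fields and sample order data are assumptions for illustration.

```python
# Minimal sketch: flag records that are missing required fields.
# The required fields and sample order data are hypothetical.

REQUIRED_FIELDS = {"payment_info", "shipping_address"}  # company_name is optional

orders = [
    {"payment_info": "tok_123", "shipping_address": "12 Elm St", "company_name": "Acme"},
    {"payment_info": "tok_456", "shipping_address": None},  # incomplete record
]

def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return {field for field in REQUIRED_FIELDS if not record.get(field)}

for i, order in enumerate(orders):
    gaps = missing_fields(order)
    if gaps:
        print(f"Order {i} is incomplete; missing: {sorted(gaps)}")
```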
In sales and marketing, data completeness can ensure customer profiles are comprehensive. Outreach might be more effective, messaging might resonate more and relationships might be easier to foster with more knowledge about the customer.
3. Consistency
Data consistency refers to the degree to which information is consistent across formats, datasets and systems. When data taxonomies align throughout the organization regardless of data source, users don't have to worry about labeling, and they know they can base decisions on the right information.
For example, a customer inputs their formal name for an invoice but uses a nickname when speaking to a salesperson. The salesperson then inputs that name into their CRM platform. When the names don't match, the data is inconsistent, which can raise questions about which information is correct.
Formatting can cause inconsistencies, such as using a month/day/year format for the date in one system but a day/month/year format in another. The variance could result in a customer receiving a delivery on the wrong date because the format was different in the ordering system than it was in the fulfillment system.
Another aspect of consistency is data values. Values should stay within expected ranges, and anything outside those ranges can signal an inconsistency. For example, an employee is paid $1,000 weekly, but then one week they are suddenly paid $10,000. A system designed to flag anomalies can identify the error immediately.
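A basic range check catches this kind of anomaly. The sketch below is one way to do it; the acceptable weekly pay bounds are hypothetical thresholds, not prescribed values.

```python
# Minimal sketch: flag values that fall outside an expected range.
# The weekly pay bounds here are hypothetical thresholds.

EXPECTED_WEEKLY_PAY = (500, 5_000)  # assumed acceptable range in dollars

payroll = {"week_1": 1_000, "week_2": 1_000, "week_3": 10_000}

def flag_anomalies(values, low, high):
    """Return the entries whose value falls outside [low, high]."""
    return {k: v for k, v in values.items() if not low <= v <= high}

print(flag_anomalies(payroll, *EXPECTED_WEEKLY_PAY))  # {'week_3': 10000}
```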
To ensure that data is consistent, introduce standardizations for data entry, compare data stored across systems and synchronize data on a regular basis. Data observability tools and data cleansing processes can provide transparency for labeling, identify data contradictions and reveal inconsistencies.
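Standardizing formats before comparing data across systems prevents false discrepancies like the date-format mismatch above. The sketch below normalizes dates from two hypothetical systems into a single ISO 8601 representation; the formats and values are assumptions.

```python
# Minimal sketch: normalize dates recorded in different formats before
# comparing records across systems. The formats and values are hypothetical.
from datetime import datetime

def normalize_date(value, source_format):
    """Parse a date string in a known source format and return ISO 8601."""
    return datetime.strptime(value, source_format).date().isoformat()

ordering_system_date = "03/04/2025"     # month/day/year
fulfillment_system_date = "04/03/2025"  # day/month/year

normalized_order = normalize_date(ordering_system_date, "%m/%d/%Y")
normalized_fulfillment = normalize_date(fulfillment_system_date, "%d/%m/%Y")

# Both normalize to 2025-03-04, so the records agree once standardized.
print(normalized_order == normalized_fulfillment)  # True
```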
4. Timeliness
Data timeliness refers to the recency of information and its availability for use. Up-to-date data is more likely to be accurate and more immediately available to use to make faster decisions.
A common method of measuring timeliness compares when data is expected versus when it arrives. Closing that gap is key to increasing the overall value and quality of data. Old, stale data is not as accurate or useful as fresh information delivered in real time. Data with immediate access can help users act sooner, which might be critical in certain industries and scenarios.
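Measuring that gap can be as straightforward as comparing an expected delivery timestamp with the actual arrival time and flagging feeds that miss an agreed freshness target. The timestamps and threshold in the sketch below are hypothetical.

```python
# Minimal sketch: measure timeliness as the gap between when data was
# expected and when it actually arrived. Timestamps here are hypothetical.
from datetime import datetime, timedelta

expected_at = datetime(2025, 6, 1, 9, 0)   # data feed expected at 09:00
arrived_at = datetime(2025, 6, 1, 9, 47)   # data actually landed at 09:47

delay = arrived_at - expected_at
print(f"Delivery delay: {delay}")  # 0:47:00

# Flag the feed if the delay exceeds an agreed freshness threshold.
FRESHNESS_THRESHOLD = timedelta(minutes=30)  # assumed service-level target
if delay > FRESHNESS_THRESHOLD:
    print("Data is late; timeliness target missed.")
```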
For example, consider the global supply chain. Extreme weather events can occur suddenly and unexpectedly, which can majorly affect the transportation of goods and cause delays down the supply chain. When logistics and transportation receive weather updates in real time, they can make decisions to avoid delays.
To ensure data timeliness, use tools that simplify data entry, enhance data collection speed and delivery, minimize latency, and automatically refresh data according to time-based parameters. Depending on the industry and organizational goals, the importance of timeliness can vary. Measure the effectiveness of timeliness against organizational expectations.
5. Validity
Data validity identifies whether data conforms to its defined syntax. Syntax refers to the type, format, range or other parameter applied to a data item.
For example, a data item could represent the time of day in the format of a 12-hour clock. Validity would assess whether data conforms to that syntax. A time entered as "11:02 PM" is valid, but a time entered as "23:02" is not because it's using a 24-hour clock format.
Consistency is similar in that it can establish a usual range for data to fall in and identify anomalies. Validity is more about establishing hard limits -- data either is or is not valid.
More than any other dimension, validity is tied directly to data quality: if data isn't valid, it's effectively unusable and provides no value. A dataset that doesn't achieve 100% validity for every piece of data can still have uses, but an organization must establish acceptable validity levels.
To ensure high levels of validity, set validation rules at the point of data entry so every input must comply. Validation rules might set a minimum and maximum range for data, establish allowable values and identify acceptable formats.
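A format rule can be expressed as a simple pattern check applied before data is accepted. The sketch below mirrors the 12-hour clock example above; the regular expression itself is an assumption about how such a rule might be written.

```python
# Minimal sketch: apply a validation rule at data entry so only values that
# match the defined syntax are accepted. The pattern is a hypothetical rule
# for a 12-hour clock time.
import re

TWELVE_HOUR_CLOCK = re.compile(r"^(0?[1-9]|1[0-2]):[0-5]\d (AM|PM)$")

def is_valid_time(value):
    """Return True only if the value matches the 12-hour clock syntax."""
    return bool(TWELVE_HOUR_CLOCK.match(value))

print(is_valid_time("11:02 PM"))  # True
print(is_valid_time("23:02"))     # False -- 24-hour format, rejected
```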
Data cleansing tools can identify invalid data. Resolving it might require some manual work, but simple errors can often be fixed automatically.
6. Uniqueness
Data uniqueness means data only appears once -- the information is one of a kind. Each instance of data should have one record. More than one instance means the data was duplicated by accident or datasets are overlapping. Duplicated data reduces the accuracy of analyses and misrepresents statistics.
For example, a doctor's office creates a patient file, and when the patient needs to update their information a year later, the staff accidentally creates a new file instead of updating the old one. A doctor might use the outdated file and work off old information. A study of test results might count the patient twice, skewing the insights.
A high level of data uniqueness builds trust in the value of data. It can also be a sign that data is organized, well-structured and reliable.
To ensure data uniqueness, regularly compare datasets or data stored in multiple locations to identify and eliminate duplicates. Deduplication tools can merge duplicates and maintain uniqueness. Only save the most up-to-date and relevant information.
The larger the dataset, the higher the chance of duplicates. Data matching tools can apply a rules-based approach to remove duplicates, improving uniqueness and overall data quality.
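A rules-based deduplication pass typically picks a matching key and a rule for which record to keep. The sketch below keeps the most recently updated record per patient, following the doctor's office example; the matching key, fields and values are hypothetical.

```python
# Minimal sketch: a rules-based deduplication pass that keeps only the most
# recently updated record per patient. The key and fields are hypothetical.

records = [
    {"patient_id": "P100", "phone": "555-0101", "updated": "2023-05-01"},
    {"patient_id": "P100", "phone": "555-0177", "updated": "2024-06-12"},  # duplicate
    {"patient_id": "P101", "phone": "555-0123", "updated": "2024-01-20"},
]

def deduplicate(rows, key="patient_id", recency="updated"):
    """Keep one record per key, preferring the most recent update."""
    latest = {}
    for row in rows:
        existing = latest.get(row[key])
        if existing is None or row[recency] > existing[recency]:
            latest[row[key]] = row
    return list(latest.values())

for row in deduplicate(records):
    print(row)
```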
Additional dimensions
Other dimensions of data quality exist beyond these six. An organization can create its own dimensions to meet its unique goals. Additional dimensions might include the following:
- Representativeness. How well does data reflect reality?
- Transparency. How easy is it to trace data back to its source and original purpose?
- Flexibility. Can the data be repurposed or reused for other goals?
- Auditability. How easy is it to navigate data history and track changes over time?
- Integrity. How well-organized is the data, and how well do datasets connect with one another?
Achieving perfect data is a near-impossible goal. It often comes down to prioritizing dimensions that suit specific organizational goals or key performance indicators. For example, a healthcare institution might prioritize accuracy and consistency for security and regulatory purposes, and a sales organization might prioritize uniqueness and timeliness to make the most of current customer trends.
To decide which dimensions to prioritize, leaders across the organization should identify the most critical goals and align data quality needs accordingly.
Jacob Roundy is a freelance writer and editor, specializing in a variety of technology topics, including data centers and sustainability.