data integrity
What is data integrity?
Data integrity is the assurance that digital information is uncorrupted and can only be accessed or modified by those authorized to do so.
Data integrity describes data that's kept complete, accurate, consistent and safe throughout its entire lifecycle in the following ways:
- Complete. Data is maintained in its full form and no data elements are filtered, truncated or lost. For example, if 100 tests are performed, complete data reflects the results of all 100 tests. Tests that failed or yielded undesirable results aren't omitted from data requests.
- Accurate. Data isn't altered or aggregated in any way that affects data analytics. For example, test results aren't rounded up or down, and any test criteria or conditions are well-documented and understood. Repeating tests should return the same results.
- Consistent. Data remains unchanged regardless of how, or how often, it's accessed and no matter how long it's stored. For example, data accessed a year from now will be the same data that's generated or accessed today.
- Safe. Data is maintained in a secure manner and can only be accessed and used by authorized applications or individuals. Further, safe data can't readily be exploited by malicious actors. Data security involves considerations such as authentication, authorization, encryption, backup or other data protection, and access logging.
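The completeness example above (100 tests performed, 100 results reported) can be checked mechanically. The following is a minimal sketch rather than a prescribed method; the `missing_results` helper and the sample data are hypothetical and assume each test has a known numeric ID.

```python
def missing_results(expected_ids, recorded):
    """Return the IDs of expected tests that have no recorded result, sorted."""
    return sorted(set(expected_ids) - set(recorded))


# Hypothetical data: 100 tests were performed, numbered 1 through 100.
expected_ids = range(1, 101)
recorded = {test_id: "pass" for test_id in expected_ids}

recorded[17] = "fail"   # a failed test is still part of the complete data set
del recorded[42]        # an omitted result makes the data set incomplete

print(missing_results(expected_ids, recorded))  # [42]
```

A failed test still counts as present; only the omitted result is flagged.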
Data integrity is a broad discipline that influences how data is collected, stored, accessed and used. The idea of integrity is a central element of many regulatory compliance frameworks, such as the General Data Protection Regulation (GDPR).
Data integrity isn't a single product, platform or tool. Instead, it's a comprehensive environment that's created through an array of applicable standards, rules, processes and procedures that are implemented across an organization's infrastructure and observed by its employees, partners and users.
Data corruption occurs when any unwanted or unexpected changes to data take place during storage, access or processing -- all of which represent a failure or loss of data integrity. Data corruption can be caused by hardware failures, human error, a malicious action or failure of data security.
Why is data integrity important?
Where traditional businesses often focused on the construction and distribution of physical products, today's businesses typically prosper through the delivery of digital products and services. This transition demands an enormous amount of data, which has become the new raw material of the digital economy. This manifests itself in three major ways:
- Business analytics. A long-standing axiom of computing is garbage in, garbage out, and it holds true for the modern business analytics used in decision-making and product development. Data integrity is therefore critical to analytical results, as missing or inaccurate data might result in poor business decisions or product behaviors.
- Customer interactions. Businesses collect and use an enormous amount of customer data, including sensitive or personally identifiable data. Data integrity ensures that customers are treated correctly, such as receiving proper account crediting and reporting. Data security must keep that sensitive data safe from loss or theft.
- Compliance. Businesses are typically obligated to retain data for a period of time to ensure that business processes are followed in accordance with prevailing industry standards and government regulations. Data integrity is vital for complete, accurate and consistent reporting for all compliance purposes; otherwise, the business may be out of compliance and subject to fines and other legal remedies.
Consequently, data integrity fills the same essential role as any physical quality control effort needed within a traditional business, ensuring that the raw material is correct, secure and suited for its intended purpose.
Types of data integrity
Data integrity involves both physical and logical issues:
Physical integrity. This includes issues related to storing and retrieving data -- primarily the storage devices, memory components and any associated hardware. For example, if a hard drive or memory device is damaged, its stored data is unavoidably affected. There are many threats that can affect the data integrity of, or even damage, physical storage hardware:
- hardware faults and failures;
- design oversights and failures;
- natural deterioration such as corrosion;
- power disruptions and outages;
- natural disasters; and
- radiation and environmental extremes such as temperature and pressure.
Organizations can enhance data's physical integrity with a variety of hardware devices and techniques: deploying redundant storage subsystems, such as RAID with battery-protected write cache; using error-correcting memory devices; implementing clustered and distributed file systems; and using error-detecting algorithms to detect data changes in transit.
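As an illustration of detecting data changes in transit, the sketch below computes a SHA-256 digest at the source and compares it after transfer. It uses Python's standard `hashlib` module; the function names and the sample payload are hypothetical, and a real deployment might rely on checksums built into the storage or network layer instead.

```python
import hashlib


def sha256_digest(payload: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def verify_transfer(received: bytes, expected_digest: str) -> bool:
    """Return True when the received bytes hash to the digest computed at the source."""
    return sha256_digest(received) == expected_digest


# Hypothetical payload; the digest is computed before the data leaves the source.
original = b"order_id=1001;amount=250.00"
digest_at_source = sha256_digest(original)

intact = original
corrupted = b"order_id=1001;amount=950.00"  # a single character changed in transit

print(verify_transfer(intact, digest_at_source))     # True
print(verify_transfer(corrupted, digest_at_source))  # False
```

A plain digest like this catches accidental changes. It guards against deliberate tampering only if the digest itself reaches the receiver over a trusted channel.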
Logical integrity. Even when the hardware devices and infrastructure are working flawlessly, there are several considerations that affect the correctness or sensibility of data within its respective context. Does the data make sense, or has it changed unexpectedly? Logical integrity can be affected by poor software design and software bugs as well as human error and malfeasance. There are four principal types of logical integrity:
- Entity integrity. This ensures that every record can be uniquely identified, typically through a primary key, so that no record is repeated and no critical data entry is blank or null. This is a common logical integrity consideration in relational database systems.
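- Referential integrity. These rules keep the relationships between data elements valid. In a relational database, a foreign key value in one table must match an existing key value in the related table, and only authorized changes, additions or deletions can occur. These rules prevent records from referring to data that doesn't exist, help avoid duplicate data and eliminate inapplicable data.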
- Domain integrity. This reflects the format, type, amount and value range or scope of acceptable data values within a database. For example, if data is supposed to be numerical, an alphanumeric data element may be rejected.
- User-defined integrity. These are additional rules and constraints that are implemented in accordance with the organization's specific needs and aren't otherwise covered by the first three integrity types.
Physical and logical integrity are defined separately but can often be related. For example, a null data stream might violate entity integrity, which is a logical integrity issue, but the cause of the null data may be traced to a physical fault such as a failed internet of things sensor.
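The four types of logical integrity can be illustrated with ordinary database constraints. The following is a minimal sketch using Python's standard `sqlite3` module with an in-memory database; the `customers` and `orders` tables, their columns and the allowed status values are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- entity integrity: unique identifier
    email       TEXT NOT NULL UNIQUE   -- entity integrity: no blanks, no repeats
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customers(customer_id),   -- referential integrity
    quantity    INTEGER NOT NULL
                CHECK (typeof(quantity) = 'integer'
                       AND quantity BETWEEN 1 AND 1000),   -- domain integrity
    status      TEXT NOT NULL
                CHECK (status IN ('new', 'paid', 'shipped'))   -- user-defined rule
);
""")

# Valid rows are accepted.
conn.execute("INSERT INTO customers (customer_id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders (customer_id, quantity, status) VALUES (1, 5, 'new')")


def rejected(sql: str) -> bool:
    """Return True when the database refuses the statement as a constraint violation."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


violations = {
    "entity": "INSERT INTO customers (customer_id, email) VALUES (1, 'b@example.com')",
    "referential": "INSERT INTO orders (customer_id, quantity, status) VALUES (99, 5, 'new')",
    "domain": "INSERT INTO orders (customer_id, quantity, status) VALUES (1, 'five', 'new')",
    "user_defined": "INSERT INTO orders (customer_id, quantity, status) VALUES (1, 5, 'lost')",
}
results = {name: rejected(sql) for name, sql in violations.items()}
print(results)  # every value is True: each bad row is refused
```

Declaring the rules in the schema means the database rejects bad rows no matter which application writes them. Other database systems offer comparable constraints with their own syntax.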
What are data integrity risks?
Data integrity can be lost due to a variety of reasons:
- Human errors. Data may be accidentally deleted, entered or altered inaccurately -- such as a wrong customer address -- or left incomplete.
- Transfer errors. Data may be damaged or lost -- or even stolen -- in transit between two systems, such as through a network failure or delivery to an incorrect storage destination, or between two physical locations, such as when a storage device filled with data is transported to another site.
- Malicious acts. Malware, hacking and other cyber threats can steal, alter or destroy valuable data. This can result in a loss of both data integrity and data security.
- Improper infrastructure configurations. Poor network and infrastructure security can expose flaws that attackers can exploit to steal, alter or destroy data. Proper, well-documented and strongly enforced configuration standards are essential in all data integrity and data security situations.
The consequences of data integrity loss can range from a minor annoyance to a major business catastrophe -- depending on the amount of loss and the nature of the data involved. Business and technology leaders invest considerable time and resources to understand and prevent data integrity loss.
How to ensure data integrity compliance
Data integrity isn't a straightforward concept, and it can't be ensured with any single software tool or regulatory law. Rather, data integrity is a broad field of endeavor that involves people, processes, rules and various tools to provide guardrails and support. While there's no single universal solution for data integrity, there are numerous tactics that can help to build an environment that supports data integrity. Common tactics include the following:
- Train employees. Organizations typically create policies and procedures designed to govern the collection, access and protection of business-related data. Regular training sessions will help new employees understand their roles and responsibilities in data integrity and keep that knowledge fresh and updated as changes take place over time.
- Establish an integrity culture. Data integrity is more meaningful when business leaders and managers care about it as a business goal. Collaboration and top-down buy-in to integrity concepts can drive better data integrity efforts.
- Validate the data. Take the time to check or validate the data. This can be especially important when data is acquired from third-party sources. For example, if a business is running analytics on several different data sets, it's worth checking that all the data sets show consistent -- or at least sensible -- data. A data set with wildly disparate data may be suspect.
- Process data sensibly. Check and remove duplicate entries in data sets and ensure that any data pre-processing -- such as data normalization or aggregation -- doesn't affect the base data.
- Protect data. Regular data backups can help to ensure data integrity by copying data to a second location. This helps keep data available and recoverable in the event of hardware failures, infrastructure failures or security violations.
- Implement strong security. Ensure that data access is protected using strong authentication and authorization controls. Log all data access and retain logs to audit activity. Encryption for data at rest and in-flight can help to protect it from unauthorized access or theft.
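The validation and processing tactics above can be sketched in a few lines. The example below is an illustration under stated assumptions, not a complete data quality pipeline: the `Reading` record, the sample values and the acceptable range of -40.0 to 60.0 are hypothetical. It removes duplicates and flags out-of-range values while leaving the base data untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One immutable record in a hypothetical sensor data set."""
    sensor_id: str
    timestamp: str
    value: float


def deduplicate(readings):
    """Return a new list with exact duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(readings))


def validate(readings, low, high):
    """Split readings into those inside [low, high] and those that look suspect."""
    valid = [r for r in readings if low <= r.value <= high]
    suspect = [r for r in readings if not (low <= r.value <= high)]
    return valid, suspect


# The base data is a tuple of frozen records, so processing can't modify it.
raw = (
    Reading("s1", "2024-01-01T00:00", 21.5),
    Reading("s1", "2024-01-01T00:00", 21.5),   # duplicate entry
    Reading("s2", "2024-01-01T00:00", 22.1),
    Reading("s3", "2024-01-01T00:00", 999.0),  # wildly disparate value
)

unique = deduplicate(raw)
valid, suspect = validate(unique, low=-40.0, high=60.0)

print(len(raw), len(unique), len(valid))   # 4 3 2
print([r.sensor_id for r in suspect])      # ['s3']
```

Suspect records are set aside for review rather than deleted, and the original `raw` data remains available for audit.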
Data integrity vs. data security vs. data quality
The terms integrity, security and quality are sometimes improperly used as interchangeable terms. Although the three ideas are closely related, they possess unique attributes that distinguish them from their companion terms.
Data quality refers to the reliability of data. Good quality data must be accurate, complete, unique with no duplicates and timely enough to be useful.
Data security is the infrastructure, tools and rules used to ensure that only authorized applications and users can access data; that the data is used in a business-compliant manner; and that data is preserved or backed up against loss, theft or malfeasance.
Data integrity then provides a broader umbrella that embraces aspects of data quality and security, ensuring proper retention, appropriate destruction, and adequate compliance with relevant industry and government regulations.
It's essential for a data quality strategy to be aligned to an organization's core business goals.