Definition

data de-identification

By Stephen J. Bigelow, Senior Technology Editor

What is data de-identification?

Data de-identification is the process of decoupling or masking data to prevent certain data elements from being associated with an individual.

De-identification does not limit or prevent collecting and storing personally identifiable information (PII). However, de-identification ensures that collected and stored PII cannot be linked to specific individuals. This allows organizations to use and share data while minimizing the risks related to business governance, regulatory compliance and potential PII data breaches or misuse.

De-identification is normally associated with healthcare regulations -- such as HIPAA -- and is often discussed in the context of protected health information (PHI). However, the value of de-identification can readily be applied to other regulatory and data management frameworks such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act. Organizations looking to enhance data security or privacy protection should consider implementing a data de-identification scheme.

For example, a business might continue to collect PII such as date of birth, Social Security number or biometric identifiers -- such as fingerprints. Then, by employing de-identification techniques, the business can mask this data from the associated individual, making it safer to use and share.

De-identification employs several possible techniques to mask data from individuals, and methodologies such as Safe Harbor and Expert Determination guide which data elements are masked.

How does data de-identification work?

All data de-identification begins with a primary data source, such as a database, containing all the collected information. Business applications typically do not possess data directly. For example, an application -- such as a CRM platform -- will access and query a separate enterprise database. The application then processes the data returned by the database.

The key to successful data de-identification is to prevent the sensitive or restricted data elements within the database from being accessed, correlated or presented together.

For example, a medical research team wants to perform statistical data processing on data collected from patients with certain age and race characteristics participating in a drug trial. The researchers can enter their desired parameters into a query, but they receive only pertinent details of the medical records. Information in each file, such as names, might be replaced with a generic number. Other details, such as addresses, phone numbers, dates of birth and other PII, might be masked or blocked entirely. So, the data retrieved from the database cannot be directly related to any specific individual.
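The query scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names and the `query_deidentified` helper are hypothetical and do not come from any particular product.

```python
from itertools import count

# Hypothetical patient records; all field names are illustrative assumptions.
RECORDS = [
    {"name": "Ann Lee", "address": "1 Elm St", "phone": "555-0100",
     "dob": "1980-04-02", "age": 44, "race": "Asian", "outcome": "improved"},
    {"name": "Bob Ray", "address": "9 Oak Ave", "phone": "555-0101",
     "dob": "1975-11-30", "age": 49, "race": "White", "outcome": "no change"},
    {"name": "Cy Diaz", "address": "3 Pine Rd", "phone": "555-0102",
     "dob": "1992-01-15", "age": 32, "race": "Asian", "outcome": "improved"},
]

# Fields assumed to be masked or blocked entirely in query results.
BLOCKED_FIELDS = {"address", "phone", "dob"}


def query_deidentified(records, min_age, max_age, race):
    """Return matching records with names replaced and blocked fields dropped."""
    ids = count(1)
    results = []
    for rec in records:
        if min_age <= rec["age"] <= max_age and rec["race"] == race:
            view = {k: v for k, v in rec.items()
                    if k not in BLOCKED_FIELDS and k != "name"}
            view["subject_id"] = next(ids)  # generic number instead of a name
            results.append(view)
    return results


rows = query_deidentified(RECORDS, 30, 45, "Asian")
```

Here `rows` holds only the age, race, outcome and a generic subject number for each matching patient. Age and race remain in the result, so a real deployment would still need to consider whether those values could narrow down an individual.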

Data de-identification techniques operate on direct and indirect identifiers. Direct identifiers, such as passport numbers or Social Security numbers, are readily correlated to specific individuals. Indirect identifiers are not immediately correlated to specific individuals but can potentially be used to narrow down or infer an individual's identity. Indirect identifiers can include factors such as height, weight, eye color or race.
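To see how indirect identifiers can narrow down an identity, the following minimal Python sketch counts how many records share each combination of values. The sample data and the `group_sizes` and `unique_combinations` helpers are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical records holding only indirect identifiers.
people = [
    {"height_cm": 170, "eye_color": "brown", "race": "A"},
    {"height_cm": 170, "eye_color": "brown", "race": "A"},
    {"height_cm": 182, "eye_color": "green", "race": "B"},
    {"height_cm": 170, "eye_color": "blue", "race": "A"},
]


def group_sizes(records, fields):
    """Count how many records share each combination of the given fields."""
    return Counter(tuple(r[f] for f in fields) for r in records)


def unique_combinations(records, fields):
    """Return the combinations that match exactly one record."""
    return [combo for combo, n in group_sizes(records, fields).items() if n == 1]


by_race = unique_combinations(people, ["race"])
by_all = unique_combinations(people, ["height_cm", "eye_color", "race"])
```

In this sample, one record is unique by race alone, and two records are unique once height, eye color and race are combined. The more indirect identifiers are presented together, the smaller the matching groups become.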

Data tagging

The first step in data de-identification is to select the direct and indirect data elements to be de-identified. Data teams will typically apply tags to important data elements within the database, allowing a high degree of automation in the de-identification process. For example, data teams might apply tags to data elements such as first and last names or birth dates. Data de-identification techniques can then process the data store to mask each tagged data element.
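As a rough sketch of tag-driven automation, the Python below masks every field that carries a tag. The `TAGS` mapping, the field names and the `mask_tagged` helper are assumptions made for this example, not part of any specific tool.

```python
# Hypothetical tag catalog: field name -> identifier type.
TAGS = {
    "first_name": "direct",
    "last_name": "direct",
    "birth_date": "direct",
    "zip_code": "indirect",
}


def mask_tagged(record, tags, placeholder="***"):
    """Return a copy of the record with every tagged field masked."""
    return {field: (placeholder if field in tags else value)
            for field, value in record.items()}


record = {"first_name": "Ann", "last_name": "Lee",
          "birth_date": "1980-04-02", "diagnosis": "asthma"}
masked = mask_tagged(record, TAGS)
```

Because the masking logic reads the tag catalog rather than hard-coded field names, tagging a new data element is enough to bring it into the process.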

Access controls

De-identification involves more than masking data. Proper de-identification and data security also require the implementation of data access controls which define who can access data, what data can be accessed, and for what legitimate purposes. Access controls include technical measures such as identity and access management and comprehensive business policies surrounding the proper storage, safeguarding, access and use of business data.
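The who, what and why of an access control policy can be expressed as a simple check. The sketch below is a hypothetical illustration only: the roles, fields, purposes and the `fetch` helper are assumptions, and a production system would rely on an identity and access management platform rather than an in-code dictionary.

```python
# Hypothetical policy: role -> permitted fields and legitimate purposes.
POLICY = {
    "researcher": {"fields": {"age", "race", "outcome"},
                   "purposes": {"research"}},
    "billing": {"fields": {"name", "account_number"},
                "purposes": {"billing"}},
}


def fetch(record, role, purpose):
    """Return only the fields this role may see for this purpose."""
    rule = POLICY.get(role)
    if rule is None or purpose not in rule["purposes"]:
        raise PermissionError(f"{role!r} may not access data for {purpose!r}")
    return {k: v for k, v in record.items() if k in rule["fields"]}


rec = {"name": "Ann Lee", "account_number": "A-1",
       "age": 44, "race": "Asian", "outcome": "improved"}
research_view = fetch(rec, "researcher", "research")
```

A request from an unknown role, or for a purpose the role is not approved for, raises `PermissionError` instead of returning data.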

De-identification

Tagged data can then be effectively de-identified or masked. Typically, data is first preprocessed using pseudonymization, which replaces tagged data elements with placeholders or artificial identifiers. The key to pseudonymization is that data can be re-identified. Think of pseudonymization as a kind of encryption used to mask data elements, such as a keyed or salted hash technique. A hash alone is one-way, so the data can be "unencrypted" only by a party that holds the secret (the key or salt) along with a protected mapping between pseudonyms and original values, or that can recompute the hash for candidate values.
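One possible way to implement reversible pseudonymization is a keyed hash combined with a protected lookup table. The sketch below uses Python's standard `hmac` and `hashlib` modules. The `Pseudonymizer` class, the 16-character token length and the in-memory lookup table are assumptions made for illustration, and a keyed hash is one-way by itself, which is why the mapping is stored.

```python
import hashlib
import hmac


class Pseudonymizer:
    """Hypothetical helper: keyed-hash pseudonyms plus a re-identification table."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        # pseudonym -> original value; in practice this must be stored securely
        self._lookup = {}

    def pseudonymize(self, value: str) -> str:
        digest = hmac.new(self._key, value.encode("utf-8"),
                          hashlib.sha256).hexdigest()
        token = digest[:16]  # shortened for readability (an assumption)
        self._lookup[token] = value
        return token

    def reidentify(self, token: str) -> str:
        """Map a pseudonym back to the original value (key holder only)."""
        return self._lookup[token]


p = Pseudonymizer(b"demo-key")
token = p.pseudonymize("Ann Lee")
original = p.reidentify(token)
```

The same input and key always produce the same token, so records for one person can still be joined after masking. Without the key and the lookup table, the token cannot be directly tied back to the name.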

Pseudonymization is different from anonymization, which is intended to prevent data re-identification. In effect, anonymization is the same kind of "encryption" but with no key, making it irreversible. Anonymization is generally not applied in business de-identification environments where data might need to be re-identified. Instead, it is applied in situations where data should be rendered permanently inaccessible, such as when honoring the GDPR's "right to be forgotten."

Pseudonymization is only one of several common masking techniques. De-identification masking techniques include varied forms of the following:

Encryption. This traditional technique scrambles data according to an algorithm based on a single variable or key.
Noise. This deliberately introduces data errors by randomly misclassifying certain variables.
Perturbation. This places fake or random data in place of the masked data elements.
Pseudonymization. Data elements are replaced with placeholder or artificial data.
Swapping or shuffling. This exchanges data between records so that users cannot know which individuals the real data belongs to.
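Three of the techniques listed above (noise, perturbation and swapping) can be sketched with Python's standard `random` module. The function names, parameters and sample data are hypothetical, and the perturbation shown here, which replaces a number with a random value near the original, is only one of several possible interpretations.

```python
import random


def add_noise(values, flip_prob, categories, rng):
    """Noise: randomly misclassify a categorical variable."""
    out = []
    for v in values:
        if rng.random() < flip_prob:
            out.append(rng.choice([c for c in categories if c != v]))
        else:
            out.append(v)
    return out


def perturb(values, spread, rng):
    """Perturbation: replace each number with a random value near the original."""
    return [v + rng.randint(-spread, spread) for v in values]


def swap_column(records, field, rng):
    """Swapping: shuffle one field's values across records (returns copies)."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


rng = random.Random(0)  # seeded only so the example is repeatable
recs = [{"id": 1, "zip": "11111"}, {"id": 2, "zip": "22222"},
        {"id": 3, "zip": "33333"}]
swapped = swap_column(recs, "zip", rng)
noisy = add_noise(["A", "B", "C"], 1.0, ["A", "B", "C"], rng)
shifted = perturb([50, 60], 5, rng)
```

Swapping preserves the overall set of values in the column, which keeps aggregate statistics intact while breaking the link between a value and its original record.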

Identifiers removed during data de-identification

Once the underlying concept of and need for data de-identification are understood, a business must consider precisely what data is potentially subject to de-identification. Healthcare and public health organizations typically rely on two general HIPAA methods -- the Expert Determination method and the Safe Harbor method.

Expert Determination

This risk-based method relies on the knowledge and expertise of a data science professional who can meet three general criteria:

The expert knows how to apply data de-identification techniques successfully.
The expert can attest that information selected for de-identification will reduce the risk that the data could be used (alone or with other data) to identify an individual.
The expert can readily document the methodology and analysis used to reach their determinations.

The use of the Expert Determination method requires a highly skilled professional with strong scientific, data science and analytical expertise. These experts should seek regularly updated guidance from related industry sources or government resources such as the Health and Human Services website. However, the Expert Determination method provides a versatile environment that allows organizations to adapt de-identification to the specific information that is collected, stored and used.

Safe Harbor

This method offers a more rigid but straightforward approach to de-identification, which removes 18 specific data elements from a data set including the following:

Individual names.
Contact information (including telephone numbers, fax numbers and email addresses).
Locations (including geographic subdivisions smaller than a state).
All elements of dates other than the year, such as months and days (and all date elements, including the year, that indicate an age over 89).
Identifying numbers (including Social Security numbers, medical record numbers, and vehicle identifiers including license plate numbers and VINs).
Biometric identifiers including fingerprints and voice prints.
Digital identifiers including web URLs and Internet Protocol (IP) addresses.
Health plan beneficiary numbers.
Full-face photographs and comparable images.
Account numbers.
Any other unique identifying number, characteristic or code (often a variety of indirect identifiers).
Certificate and license numbers.

There is an additional consideration -- any remaining information cannot be used alone or together with other data to identify an individual. Consequently, additional information might be de-identified if necessary to meet this requirement. Safe Harbor is highly prescriptive and less flexible than Expert Determination, but it forgoes the need to engage a data science expert and avoids the uncertainty of the risk-based Expert Determination approach.
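A simplified, partial sketch of Safe Harbor-style processing might look like the following Python. It is an illustration under stated assumptions, not a compliance tool: the field names in `REMOVE_FIELDS`, the `safe_harbor` helper and the "90+" label are hypothetical, and the sketch covers only some of the 18 data elements.

```python
from datetime import date

# Hypothetical field names covering some, not all, Safe Harbor elements.
REMOVE_FIELDS = {
    "name", "phone", "fax", "email", "street_address", "city", "zip_code",
    "ssn", "medical_record_number", "health_plan_number", "account_number",
    "license_number", "vehicle_id", "url", "ip_address", "photo",
}


def safe_harbor(record):
    """Drop listed identifiers, reduce dates to the year, and cap ages at 90+."""
    over_89 = record.get("age", 0) > 89
    out = {}
    for field, value in record.items():
        if field in REMOVE_FIELDS:
            continue
        if isinstance(value, date):
            if not over_89:  # all date data is dropped for individuals over 89
                out[field + "_year"] = value.year
        elif field == "age" and over_89:
            out[field] = "90+"
        else:
            out[field] = value
    return out


patient = {"name": "Ann Lee", "ssn": "000-00-0000", "state": "OH",
           "age": 45, "admitted": date(2023, 5, 17), "diagnosis": "asthma"}
cleaned = safe_harbor(patient)
elder = safe_harbor({**patient, "age": 93})
```

After processing, `cleaned` keeps the state, age, diagnosis and admission year, while `elder` keeps no date data and reports the age only as "90+". Any remaining fields would still need review against the additional consideration described above.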

Non-healthcare methodologies

Organizations outside of the healthcare industry and not subject to HIPAA or other healthcare regulations can still opt to implement a data de-identification mechanism based on Expert Determination or Safe Harbor methodologies. In the strictest terms, a non-healthcare organization might not collect, store or use all 18 data elements involved in a Safe Harbor methodology. Consequently, the business might choose to adopt more of an Expert Determination approach where de-identification is applied to the sensitive PII that the business uses. The resulting methodology will then vary from business to business.

Any data de-identification initiative should be developed into a well-considered policy that complements existing data security, data protection and data management policies. In most cases, data de-identification policies will involve participation from business, technology, governance and compliance leaders within the business.

De-identification technique: Why is it important?

Data security is a serious concern for all business types and sizes. Modern businesses routinely collect, store, use and share sensitive PII about individuals. The potential for malicious use of stolen or improperly shared PII can result in serious harm to the individual exposed. Data de-identification is a relatively new technique intended to help safeguard PII.

De-identification renders a data set unable to identify a specific individual. This brings two important benefits to business data:

A business using de-identified data might no longer be required to report data breaches. This improves data security, reduces business risks and protects individual privacy.
De-identified data can more easily be shared (even monetized) without placing individuals at risk. For example, medical researchers -- who might otherwise be unable to access data due to patient privacy concerns -- can readily access and utilize data for research and analytical purposes without placing individual patient data at risk of breach or misuse.

Data de-identification is fundamentally a business practice and technical implementation. De-identification does not guarantee the fair or ethical use of any data. It's the responsibility of the organization possessing and providing the de-identified data to consider the ways that data is used and assess the outcomes.

This was last updated in June 2024

