Definition

data de-identification

What is data de-identification?

Data de-identification is the process of decoupling or masking data to prevent certain data elements from being associated with a specific individual.

De-identification does not limit or prevent collecting and storing personally identifiable information (PII). However, de-identification is intended to ensure that collected and stored PII cannot be linked to specific individuals. This allows organizations to use and share data while minimizing the risks related to business governance, regulatory compliance and potential PII data breaches or misuse.

De-identification is normally associated with healthcare regulations -- such as HIPAA -- and is often discussed in the context of protected health information (PHI). However, the value of de-identification can readily be applied to other regulatory and data management frameworks such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act. Organizations looking to enhance data security or privacy protection should consider implementing a data de-identification scheme.

For example, a business might continue to collect PII such as date of birth, Social Security number or biometric identifiers -- such as fingerprints. Then, by employing de-identification techniques, the business can mask this data from the associated individual, making it safer to use and share.

De-identification can follow several methodologies for deciding which data to mask, including the Safe Harbor and Expert Determination methods, and it can apply a range of masking techniques to that data.

How does data de-identification work?

All data de-identification begins with a primary data source, such as a database, containing all the collected information. Business applications typically do not possess data directly. For example, an application -- such as a CRM platform -- will access and query a separate enterprise database. The application then processes the data returned by the database.

The key to successful data de-identification is to prevent the sensitive or restricted data elements within the database from being accessed, correlated or presented together.

For example, a medical research team wants to perform statistical data processing on data collected from patients with certain age and race characteristics participating in a drug trial. The researchers can enter their desired parameters into a query, but they receive only pertinent details of the medical records. Information in each file, such as names, might be replaced with a generic number. Other details, such as addresses, phone numbers, dates of birth and other PII, might be masked or blocked entirely. So, the data retrieved from the database cannot be directly related to any specific individual.
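The research query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, the `PERTINENT_FIELDS` tuple and the `research_view` function are all hypothetical and do not represent any particular database or product.

```python
from itertools import count

# Hypothetical trial records; every field name and value is invented.
PATIENTS = [
    {"name": "Ann Lee", "address": "1 Elm St", "phone": "555-0100",
     "dob": "1961-04-02", "age": 63, "race": "Asian", "response": 0.82},
    {"name": "Bob Ray", "address": "9 Oak Ave", "phone": "555-0101",
     "dob": "1985-11-19", "age": 39, "race": "White", "response": 0.47},
    {"name": "Cy Diaz", "address": "4 Pine Rd", "phone": "555-0102",
     "dob": "1958-07-30", "age": 66, "race": "Asian", "response": 0.91},
]

# Fields the researchers are allowed to see (an assumption for this sketch).
PERTINENT_FIELDS = ("age", "race", "response")


def research_view(records, min_age, max_age, race):
    """Return matching records with names replaced by a generic number
    and all other identifying fields left out."""
    subject_numbers = count(1)
    view = []
    for record in records:
        if min_age <= record["age"] <= max_age and record["race"] == race:
            row = {"subject": next(subject_numbers)}
            row.update({field: record[field] for field in PERTINENT_FIELDS})
            view.append(row)
    return view


rows = research_view(PATIENTS, min_age=60, max_age=70, race="Asian")
print(rows)
```

The researchers still get the values they need for statistics, but the rows they receive carry no name, address, phone number or date of birth.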

Data de-identification techniques operate on direct and indirect identifiers. Direct identifiers are readily correlated to specific individuals, such as passport numbers or Social Security numbers. Indirect identifiers are not immediately correlated to specific individuals but can potentially be combined to narrow down or infer an individual's identity. Indirect identifiers can include factors such as height, weight, eye color or race.
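To see why indirect identifiers matter, consider a small sketch that counts how many records share each combination of indirect identifiers. The data set and the `group_sizes` and `singled_out` helpers are hypothetical and exist only to illustrate the idea.

```python
from collections import Counter

# Invented records containing only indirect identifiers plus a row id.
PEOPLE = [
    {"id": 1, "height_cm": 170, "eye_color": "brown"},
    {"id": 2, "height_cm": 170, "eye_color": "brown"},
    {"id": 3, "height_cm": 170, "eye_color": "blue"},
    {"id": 4, "height_cm": 185, "eye_color": "blue"},
    {"id": 5, "height_cm": 185, "eye_color": "blue"},
]


def group_sizes(records, indirect_fields):
    """Count the records that share each combination of the given fields."""
    return Counter(tuple(r[f] for f in indirect_fields) for r in records)


def singled_out(records, indirect_fields):
    """Return the records whose combination of values is unique."""
    sizes = group_sizes(records, indirect_fields)
    return [r for r in records
            if sizes[tuple(r[f] for f in indirect_fields)] == 1]


print(singled_out(PEOPLE, ["height_cm"]))
print(singled_out(PEOPLE, ["eye_color"]))
print(singled_out(PEOPLE, ["height_cm", "eye_color"]))
```

In this invented data set, neither height nor eye color alone isolates anyone, but the two together point to a single record.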

Data tagging

The first step in data de-identification is to select the direct and indirect data elements to be de-identified. Data teams will typically apply tags to important data elements within the database, allowing a high degree of automation in the de-identification process. For example, data teams might apply tags to data elements such as first and last names or birth dates. Data de-identification techniques can then process the data store to mask each tagged data element.
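A minimal sketch of tag-driven masking might look like the following. The tag catalog, the column names and the `mask_tagged` function are assumptions made for illustration; real data catalogs and masking tools define their own tagging formats.

```python
# Hypothetical tag catalog that maps a column name to an identifier type.
COLUMN_TAGS = {
    "first_name": "direct",
    "last_name": "direct",
    "birth_date": "indirect",
}


def mask_tagged(record, tags=COLUMN_TAGS, placeholder="***"):
    """Return a copy of the record with every tagged column masked."""
    return {
        column: (placeholder if column in tags else value)
        for column, value in record.items()
    }


record = {
    "first_name": "Ann",
    "last_name": "Lee",
    "birth_date": "1961-04-02",
    "diagnosis": "asthma",
}
masked = mask_tagged(record)
print(masked)
```

Because the masking step reads the tags rather than hard-coding column names, tagging a new column is enough to bring it into the automated process.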

Access controls

De-identification involves more than masking data. Proper de-identification and data security also require the implementation of data access controls which define who can access data, what data can be accessed, and for what legitimate purposes. Access controls include technical measures such as identity and access management and comprehensive business policies surrounding the proper storage, safeguarding, access and use of business data.
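The who, what and why of an access control policy can be sketched as a simple lookup. The roles, fields, purposes and the `read_record` function below are hypothetical; production systems would normally rely on an identity and access management platform rather than a dictionary in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    allowed_fields: frozenset    # what data can be accessed
    allowed_purposes: frozenset  # for what legitimate purposes


# Hypothetical roles (who can access data) and their policies.
POLICIES = {
    "researcher": AccessPolicy(frozenset({"age", "diagnosis"}),
                               frozenset({"research"})),
    "billing": AccessPolicy(frozenset({"name", "account"}),
                            frozenset({"billing"})),
}


def read_record(record, role, purpose):
    """Return only the fields this role may see for this purpose."""
    policy = POLICIES.get(role)
    if policy is None or purpose not in policy.allowed_purposes:
        raise PermissionError(f"{role!r} may not read data for {purpose!r}")
    return {field: value for field, value in record.items()
            if field in policy.allowed_fields}


patient = {"name": "Ann Lee", "account": "A-100", "age": 63,
           "diagnosis": "asthma"}
print(read_record(patient, "researcher", "research"))
```

A request from an unknown role, or from a known role with a purpose that is not on its list, raises `PermissionError` instead of returning data.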

De-identification

Tagged data can then be de-identified or masked. Data is typically first preprocessed using pseudonymization, which replaces tagged data elements with placeholders or artificial identifiers. The key to pseudonymization is that data can be re-identified by anyone who holds the additional information needed to do so, such as a secret key or a lookup table that is stored separately from the data. Think of pseudonymization as similar to encryption: the masked values are meaningless without the key, but a holder of the key can map them back to the original values. A keyed or salted hash is a common way to generate the artificial identifiers, although the hash itself is one-way, so re-identification relies on recomputing the hash with the secret or on a stored mapping.

Pseudonymization is different from anonymization, which is intended to prevent data re-identification. In effect, anonymization is like "encryption" with no key, making it irreversible. Anonymization is generally not applied in business de-identification environments where data might need to be re-identified, but it is applied in security situations where data should be rendered inaccessible, such as the GDPR's "right to be forgotten."
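One way to sketch reversible pseudonymization is a keyed hash combined with a lookup table that is held separately from the masked data. The key, the `pseudonymize` function and the `PseudonymVault` class are hypothetical names used for illustration only. Note that the hash in this sketch is never reversed; re-identification goes through the stored mapping.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored apart from the data.
SECRET_KEY = b"demo-key-store-me-separately"


def pseudonymize(value, key=SECRET_KEY):
    """Derive a stable artificial identifier from a value and a secret key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in this sketch


class PseudonymVault:
    """Keeps the token-to-value mapping needed for re-identification."""

    def __init__(self, key):
        self._key = key
        self._lookup = {}

    def mask(self, value):
        token = pseudonymize(value, self._key)
        self._lookup[token] = value
        return token

    def reidentify(self, token):
        return self._lookup[token]


vault = PseudonymVault(SECRET_KEY)
token = vault.mask("Ann Lee")
print(token != "Ann Lee", vault.reidentify(token))
```

Discarding both the key and the lookup table would remove the practical route back to the original values, which is the direction anonymization takes.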

Pseudonymization is only one of several common masking techniques. De-identification masking techniques include varied forms of the following:

  • Encryption. This traditional technique scrambles data according to an algorithm based on a single variable or key.
  • Noise. This deliberately introduces data errors by randomly misclassifying certain variables.
  • Perturbation. This places fake or random data in place of the masked data elements.
  • Pseudonymization. This replaces data elements with placeholder or artificial data.
  • Swapping or shuffling. This exchanges data between records so that users cannot know which individuals the real data belongs to.
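Three of the techniques listed above (noise, perturbation and swapping) can be sketched with the Python standard library. The function names, the sample records and the parameters are assumptions chosen for illustration, and the simple random choices here are not tuned to preserve statistical properties the way a production tool might.

```python
import random


def misclassify(values, categories, rate, rng):
    """Noise: with probability `rate`, swap a value for a different category."""
    noisy = []
    for value in values:
        if rng.random() < rate:
            value = rng.choice([c for c in categories if c != value])
        noisy.append(value)
    return noisy


def perturb(values, low, high, rng):
    """Perturbation: replace every value with a random integer in [low, high]."""
    return [rng.randint(low, high) for _ in values]


def swap_column(records, column, rng):
    """Swapping: shuffle one column's values across copies of the records."""
    shuffled = [record[column] for record in records]
    rng.shuffle(shuffled)
    return [{**record, column: value}
            for record, value in zip(records, shuffled)]


rng = random.Random(2024)  # fixed seed only so the sketch is repeatable
records = [
    {"subject": 1, "blood_type": "A", "weight_kg": 70},
    {"subject": 2, "blood_type": "B", "weight_kg": 82},
    {"subject": 3, "blood_type": "O", "weight_kg": 64},
]

print(misclassify([r["blood_type"] for r in records], ["A", "B", "AB", "O"], 0.3, rng))
print(perturb([r["weight_kg"] for r in records], 50, 100, rng))
print(swap_column(records, "weight_kg", rng))
```

Swapping keeps the overall set of values in a column intact, so column-level totals remain the same even though the link between a value and its record is broken.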

Identifiers removed during data de-identification

Once the underlying concept of and need for data de-identification are understood, a business must consider precisely what data is potentially subject to de-identification. Healthcare and other public health organizations typically rely on two general HIPAA methods -- the Expert Determination method and the Safe Harbor method.

Expert Determination

This risk-based method relies on the knowledge and expertise of a data science professional who can meet three general criteria:

  • The expert knows how to apply data de-identification techniques successfully.
  • The expert can attest that information selected for de-identification will reduce the risk that the data could be used (alone or with other data) to identify an individual.
  • The expert can readily document the methodology and analysis used to reach their determinations.

The use of the Expert Determination method requires a highly skilled professional with strong scientific, data science and analytical expertise. These experts should seek regularly updated guidance from related industry sources or government resources such as the Health and Human Services website. However, the Expert Determination method provides a versatile environment that allows organizations to adapt de-identification to the specific information that is collected, stored and used.

Safe Harbor

This method offers a more rigid but straightforward approach to de-identification, which removes 18 specific data elements from a data set including the following:

  • Individual names.
  • Contact information (including telephone numbers, fax numbers and email addresses).
  • Locations (including geographic subdivisions smaller than a state).
  • All months and days of date data (all date data for individuals over 89).
  • Identifying numbers (including Social Security numbers, medical record numbers, and vehicle identifiers including license plate numbers and VINs).
  • Biometric identifiers including fingerprints and voice prints.
  • Digital identifiers including web URLs and Internet Protocol (IP) addresses.
  • Health plan beneficiary numbers.
  • Full-face photographs and comparable images.
  • Account numbers.
  • Any other unique identifying number, characteristic or code (often a variety of indirect identifiers).
  • Certificate and license numbers.

There is an additional consideration -- any remaining information cannot be used alone or together with other data to identify an individual. Consequently, additional information might need to be de-identified to meet this requirement. Safe Harbor is highly prescriptive and less flexible than Expert Determination, but it forgoes the need to engage a data science expert and avoids the case-by-case risk assessment that Expert Determination involves.
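A Safe Harbor-style pass over a single record can be sketched as follows. This is a partial illustration rather than a compliant implementation: the field names, the `DROP_FIELDS` set and the `safe_harbor_sketch` function are hypothetical, and the sketch covers only a few of the 18 element types (removing listed identifiers, reducing dates to the year and withholding date data for individuals over 89).

```python
from datetime import date

# Hypothetical column names for identifiers that are removed outright.
DROP_FIELDS = {
    "name", "phone", "email", "street", "city", "zip",
    "ssn", "medical_record_number", "ip_address",
}


def safe_harbor_sketch(record, today):
    """Drop listed identifiers and generalize the birth date."""
    out = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    birth = out.pop("birth_date", None)
    if birth is not None:
        # Subtract one year if this year's birthday has not happened yet.
        age = today.year - birth.year - (
            (today.month, today.day) < (birth.month, birth.day)
        )
        if age > 89:
            out["age"] = "90+"  # no date data is kept for this group
        else:
            out["age"] = age
            out["birth_year"] = birth.year  # month and day are removed
    return out


patient = {
    "name": "Ann Lee", "ssn": "000-00-0000", "zip": "00000",
    "state": "OH", "birth_date": date(1961, 4, 2), "diagnosis": "asthma",
}
print(safe_harbor_sketch(patient, today=date(2024, 6, 1)))
```

Even after a pass like this, the remaining fields would still need to be reviewed against the additional requirement that they cannot identify an individual alone or in combination with other data.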

Non-healthcare methodologies

Organizations outside of the healthcare industry and not subject to HIPAA or other healthcare regulations can still opt to implement a data de-identification mechanism based on Expert Determination or Safe Harbor methodologies. In the strictest terms, a non-healthcare organization might not collect, store or use all 18 data elements involved in a Safe Harbor methodology. Consequently, the business might choose to adopt more of an expert determination approach where de-identification is applied to the sensitive PII that the business actually uses. The resulting methodology will then vary from business to business.

Any data de-identification initiative should be developed into a well-considered policy that complements existing data security, data protection and data management policies. In most cases, data de-identification policies will involve participation from business, technology, governance and compliance leaders within the business.

De-identification technique: Why is it important?

Data security is a serious concern for all business types and sizes. Modern businesses routinely collect, store, use and share sensitive PII about individuals. The potential for malicious use of stolen or improperly shared PII can result in serious harm to the individual exposed. Data de-identification is a relatively new technique intended to help safeguard PII.

De-identification renders a data set unable to identify a specific individual. This brings two important benefits to business data:

  • De-identification improves data security, reduces business risks and protects individual privacy. A business using de-identified data might also no longer be required to report data breaches.
  • De-identified data can more easily be shared (even monetized) without placing individuals at risk. For example, medical researchers -- who might otherwise be unable to access data due to patient privacy concerns -- can readily access and use data for research and analytical purposes without placing individual patient data at risk of breach or misuse.

Data de-identification is fundamentally a business practice and technical implementation. De-identification does not guarantee the fair or ethical use of any data. It's the responsibility of the organization possessing and providing the de-identified data to consider the ways that data is used and assess the outcomes.

This was last updated in June 2024
