
Understanding de-identified patient data, how to use it

Healthcare data de-identification provides significant opportunities to bolster medical research and patient care, but the process is not without its pitfalls.

Data de-identification has become an important tool in medical research and for providers looking to enhance patient care.

While data sharing between different organizations could violate the Health Insurance Portability and Accountability Act of 1996 (HIPAA), the de-identification process makes sharing information HIPAA-compliant by removing protected health information (PHI) and personally identifiable information (PII) during data processing.

De-identified data sharing can then assist medical researchers in advancing tools and treatments through the use of analytics. Additionally, HIPAA-compliant data de-identification holds promise for improving interoperability and bolstering healthcare outcomes.

What is de-identified data in healthcare?

The process of de-identification involves removing PII, such as name and social security number, as well as PHI, like medical history and insurance information, before that data can be shared for healthcare analytics or research purposes.

Removing all direct identifiers from patient data can allow healthcare organizations to share it without the potential of violating HIPAA, but stakeholders must understand the differences between PII and PHI to maintain HIPAA compliance.

While direct identifiers are removed from the data to keep a patient's identity confidential per the HIPAA Privacy Rule, indirect identifiers -- including race, age and gender -- can remain, as long as the Privacy Rule's de-identification standard for PHI is met, to allow researchers to study data trends.
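To make the distinction concrete, the following is a minimal sketch in Python of stripping direct identifiers from a single record while leaving indirect ones in place. The field names and the short identifier list are hypothetical and chosen only for illustration; they are not the Privacy Rule's full list, and running code like this does not by itself satisfy the de-identification standard.

```python
# Hypothetical field names: a real schema and a real identifier list will differ.
DIRECT_IDENTIFIERS = {"name", "ssn", "insurance_member_id"}


def strip_direct_identifiers(record: dict) -> dict:
    """Return a copy of the record without the direct-identifier fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS
    }


# Fabricated example record, not real patient data.
patient = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "insurance_member_id": "A1234",
    "age": 54,
    "gender": "F",
    "race": "Asian",
}

shared = strip_direct_identifiers(patient)
print(shared)
```

The indirect identifiers survive in `shared`, which is what allows trend analysis, and also why further checks are needed before the data leaves the organization.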

According to HHS, de-identification also "supports the secondary use of data for comparative effectiveness studies, policy assessment, life sciences research, and other endeavors."

De-identification is a crucial part of the healthcare data lifecycle and plays a vital role in advancing medical research while also protecting patient privacy.

The benefits and drawbacks of de-identified patient data

Data sharing enables healthcare and life sciences stakeholders to create better tools and treatments to advance patient care. However, HIPAA stipulates that patient information must be protected and cannot be shared with other entities without the patient's knowledge and consent.

By de-identifying data, providers can share information with other organizations to advance medical research and treatment. Additionally, de-identifying the data removes some liability regarding HIPAA violations.

Furthermore, the use of de-identified data can enhance collaborative research efforts in healthcare. In 2021, a group of providers came together to form Truveta, a company focused on using healthcare-driven big data analytics to enhance insights for researchers and clinicians.

By combining de-identified data from each healthcare provider's tens of millions of patients and from thousands of care facilities across the United States, Truveta can make large data sets available for use in medical research. As of December 2024, the collective boasts 30 health system members.

However, the use of de-identified data is nuanced and can be fraught with potential pitfalls. Experts indicate that the advent of technologies like connected devices and AI has changed the way healthcare organizations conceptualize patient privacy and data sharing.

Data can be de-identified to various degrees, with basic de-identification obscuring information such as name or date of birth. HIPAA requires healthcare organizations to take this a step further by hiding or removing both PII and PHI to ensure that patient privacy is protected.

However, removing this information doesn't necessarily eliminate the risk of patient re-identification. Information like an individual's IP address or the device ID associated with a pacemaker, for example, could be used to re-identify a patient.

AI technologies in healthcare are also known to be capable of re-identifying individuals, even though they are trained on de-identified data, raising questions about which privacy approaches should be utilized to address this phenomenon. Researchers recommend that HIPAA be amended to account for the use of machine learning (ML) on healthcare data.

Obscuring specific data elements that can be tied to an individual, as directed by HIPAA, is one aspect of data de-identification. The second aspect is more complicated: it deals with how combinations of factors within one or more datasets relate to one another.

For example, an analysis investigating the impact of social determinants of health (SDOH) on U.S. patients with a specific cancer type could contain enough data elements for some patients to be re-identified, even if the project has robust data de-identification protocols.

The combination of choice of cancer treatment regimen, timeframe and income could be used alongside additional information, such as social media posts, to re-identify a patient in this cohort. If there were a wealthy individual in the patient pool who could afford to receive a new treatment that was largely cost-prohibitive for the majority of patients, and their diagnosis was public knowledge and coincided with the timeframe of the analysis, bad actors could theoretically home in on and re-identify them.
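One common way to surface this kind of risk is to count how many records share each combination of indirect identifiers and flag combinations that appear fewer than k times, an idea usually called k-anonymity. The article does not name this technique; it is offered here as an illustrative sketch, and the column names, the cohort and the threshold are all assumptions.

```python
from collections import Counter

# Hypothetical quasi-identifier columns for the cancer cohort described above.
QUASI_IDENTIFIERS = ("treatment", "quarter", "income_band")


def small_groups(records, keys=QUASI_IDENTIFIERS, k=2):
    """Return each quasi-identifier combination shared by fewer than k records."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Fabricated cohort: three similar patients and one outlier.
cohort = [
    {"treatment": "standard", "quarter": "2023Q1", "income_band": "mid"},
    {"treatment": "standard", "quarter": "2023Q1", "income_band": "mid"},
    {"treatment": "standard", "quarter": "2023Q1", "income_band": "mid"},
    {"treatment": "novel", "quarter": "2023Q1", "income_band": "high"},
]

risky = small_groups(cohort, k=2)
print(risky)
```

In this made-up cohort, the lone patient on the novel treatment in the high income band is the only member of their group, so that record would need to be generalized, suppressed or otherwise protected before release.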

To prevent this, healthcare stakeholders can transform data -- cryptographically, mathematically or otherwise -- at the individual data point level to make it non-visible to the data user or ensure that the analytics being performed on the data are not designed, consciously or unconsciously, to identify individuals.
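As one illustration of a cryptographic transformation at the individual data point level, the sketch below replaces an identifier with a keyed hash (HMAC-SHA256) from Python's standard library. The key, the record number format and the function name are hypothetical. This is a pseudonymization sketch under those assumptions, not a statement of what HIPAA requires, and whether such tokens are acceptable in a given setting depends on how the key is managed and on expert review.

```python
import hashlib
import hmac

# Hypothetical key: in practice it would be generated randomly and stored
# securely, separate from the data it protects.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable token using HMAC-SHA256."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Fabricated medical record numbers.
token_a = pseudonymize("MRN-0001")
token_b = pseudonymize("MRN-0001")
token_c = pseudonymize("MRN-0002")

print(token_a == token_b)  # same input, same token
print(token_a == token_c)  # different input, different token
```

Because the same input always yields the same token, records for one patient can still be linked across tables, while the original value stays hidden from a data user who does not hold the key.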

Some tools, such as privacy-enhancing technologies (PETs), can assist healthcare stakeholders with these goals. There are three main types: algorithmic PETs, which alter how data are represented; architectural PETs, which focus on the structure of the data or computation environments; and augmentation PETs, which involve using historical data distributions to generate realistic synthetic data sets.
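The augmentation approach can be sketched in a deliberately simplified form: learn how often each value appears in historical records, then draw new rows from those frequencies. The column names and data below are fabricated, and sampling each column independently is an assumption made to keep the example short. It discards relationships between columns, which production synthetic data generators work to preserve, and it offers no formal privacy guarantee.

```python
import random
from collections import Counter


def fit_marginals(records, columns):
    """Count how often each value appears in each column."""
    return {col: Counter(record[col] for record in records) for col in columns}


def sample_synthetic(marginals, n, seed=0):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        rows.append(row)
    return rows


# Fabricated historical records with hypothetical columns.
historical = [
    {"age_band": "40-49", "gender": "F"},
    {"age_band": "50-59", "gender": "M"},
    {"age_band": "50-59", "gender": "F"},
    {"age_band": "60-69", "gender": "M"},
]

marginals = fit_marginals(historical, ["age_band", "gender"])
synthetic = sample_synthetic(marginals, n=5)
print(synthetic)
```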

By developing a de-identification protocol that complies with HIPAA while taking additional privacy considerations into account, providers can share patient data to assist in medical advances while also maintaining patient privacy.

How do healthcare stakeholders use de-identified patient data?

De-identified data is often leveraged in research to build advanced analytics tools for healthcare.

In a 2021 study, researchers used de-identified data to develop an AI tool to predict 30-day mortality risks in patients with cancer. Using the tool, medical professionals could discover patients who are at high risk of death and provide early intervention for reversible complications.

Additionally, the tool can identify patients who are approaching end of life (EoL) and refer them to early palliative and hospice care.

In this case, the use of de-identified data can support improved quality of life and symptom management for the patient. The study's authors noted that early referral for these services could transform cancer care by reducing unnecessary and expensive treatments at EoL, which can conflict with patient preferences and lower their quality of life.

De-identified data can also be used in developing predictive analytics tools.

To address healthcare gaps created by the COVID-19 pandemic, UnitedHealthcare developed one such tool that used de-identified data to address SDOH and enhance care quality.

Leaders of the initiative indicated that SDOH can have a greater influence on a person's health than their access to healthcare services or genetics, making tackling social determinants key to advancing population health management.

To eliminate care gaps, UnitedHealthcare created an advocacy system to assist members who might be struggling due to their social environment. Through predictive analytics and an ML model, the advocacy system can evaluate de-identified data from members and determine the need for social services.

Data is then loaded into an agent dashboard used by UnitedHealthcare advocates. When a member calls in, advocates can connect the caller to community resources at low or no cost.

Alongside de-identified patient data, real-world and synthetic data are also critical for informing a variety of research and patient outcome improvement efforts.

Shania Kennedy has been covering news related to health IT and analytics since 2022.
