Comparing real-world, synthetic and de-identified data

Real-world, synthetic and de-identified data all play a role in healthcare analytics efforts, but knowing which type of data to use is key for success.

Data has become increasingly important in the healthcare industry as stakeholders pursue value-based care. The rise of EHR systems has helped facilitate the use of clinical data to inform a variety of research and patient outcome improvement efforts, and emerging technologies like AI hold significant promise to bolster this work further.

Spearheading the healthcare data analytics projects that underpin strategic initiatives requires organizations to effectively select, process and analyze the appropriate data sources. However, accessing data that is complete, high-quality and relevant to a given project can be difficult.

For this reason, researchers and health systems often rely on three types of data -- real-world, synthetic and de-identified. But each is distinct, with its own pros and cons that healthcare stakeholders must understand prior to embarking on an analytics initiative.

This primer will explore each type of data, including benefits, pitfalls and use cases.

What is real-world data?

The FDA conceptualizes real-world data (RWD) as "data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources." RWD is closely linked to real-world evidence (RWE) -- "the clinical evidence about the usage and potential benefits or risks of a medical product derived from analysis of RWD."

RWD is typically derived from EHRs, claims and billing information, disease registries, medical device registries, patient-reported outcomes, pharmacy and medication data, digital health technologies -- such as wearables and sensors -- and other sources, which can provide insights into the health status of a patient or population.

RWD plays a key role in generating RWE, which is then used by stakeholders like the FDA to inform the development and regulation of therapeutic products and interventions, such as medical devices. The emergence of technology-driven services and enhanced data management capabilities in the healthcare and life sciences industries has contributed significantly to the rapid generation and availability of RWD.

A November 2022 BMC Medical Research Methodology paper emphasizes that RWD presents a host of opportunities to bolster evidence-based decision-making. The RWE generated through RWD analysis is particularly valuable for randomized controlled trial (RCT) design and drug development.

RWD was also critical for improving outcomes during the COVID-19 pandemic, as researchers sought to quickly and effectively gain insights into hospital capacity and utilization, intervention success and high-risk patient populations.

Precision medicine is another area in which the exploration of RWD has the potential to drive meaningful innovation. To date, much of the work in this area has focused on combining clinical insights with genomic data to inform cancer care and therapeutic development. Further, RWD has shown promise in reducing cancer care disparities.

But to successfully utilize RWD for these purposes, stakeholders have to address various challenges, such as data availability and completeness. One of the major potential hurdles that healthcare organizations and researchers face is identifying when RWD is useful and appropriate for an analytics project.

Doing so requires users to determine whether another data type -- like synthetic or de-identified data -- might be a better fit.

RWD vs. synthetic data

Synthetic data can be understood as a counterpoint to RWD, as it is artificially generated rather than pulled directly from real-world sources.

Because the quality of RWD varies, and stakeholders must be able to harmonize and standardize the data while also protecting patient privacy, synthetic data provides a valuable alternative in certain use cases.

Synthetic healthcare data is often designed to mimic the statistical characteristics and correlations of RWD, but it does not contain protected health information (PHI). This prevents patient privacy pitfalls without sacrificing the value of the information contained in the dataset.
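
To make the idea of mimicking statistical characteristics and correlations concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular synthetic data product works: the two variables (age and systolic blood pressure), the invented "real" rows and the simple two-variable Gaussian model are all hypothetical. Production generators are far more elaborate, and matching summary statistics does not by itself guarantee privacy.

```python
import math
import random


def fit_bivariate(rows):
    """Estimate means, standard deviations and correlation from (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n, rng):
    """Draw n artificial (x, y) pairs that share the fitted statistics."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 into y reproduces the fitted correlation rho.
        y = mean_y + sd_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(42)

# Invented stand-in for real-world rows: (age, systolic blood pressure).
real_rows = []
for _ in range(200):
    age = rng.uniform(30, 80)
    real_rows.append((age, 100 + 0.5 * age + rng.gauss(0, 8)))

real_params = fit_bivariate(real_rows)
synthetic_rows = sample_synthetic(real_params, 5000, rng)
synthetic_params = fit_bivariate(synthetic_rows)

print("real correlation:     ", round(real_params[4], 2))
print("synthetic correlation:", round(synthetic_params[4], 2))
```

No synthetic row corresponds to a row in the source data, yet the two correlations should land close together, which is the property the paragraph above describes.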

The privacy preservation aspect of synthetic data makes it desirable for a host of high-value healthcare use cases, including algorithm training, application development, clinical research and digital twin modeling.

However, utilizing this type of data requires stakeholders to consider the pros and cons of synthetic healthcare data. Preserving patient privacy and preventing potential data re-identification are major boons in an era where covered entities must comply with the HIPAA Privacy Rule, but synthetic data can create challenges related to data quality and bias, which can negatively impact analytics.

Factors like data leakage -- which occurs when information from the test dataset is used during an algorithm's training -- are also significant hurdles to synthetic data use for healthcare AI projects. Leakage can inflate a model's apparent performance, and models trained repeatedly on generated data risk the degradation known as AI model collapse.
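
One common guard against leakage can be sketched in a few lines of Python. The example below is a hypothetical illustration, not a complete safeguard: the field names `patient_id` and `systolic_bp` and the generated values are invented. It splits records by patient so that no individual appears in both sets, then derives preprocessing statistics from the training rows alone.

```python
import random
import statistics


def split_by_patient(records, test_fraction, rng):
    """Assign whole patients to train or test so no patient spans both."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng.shuffle(patient_ids)
    n_test = max(1, int(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


rng = random.Random(7)

# Invented data: 20 hypothetical patients with three visits each.
records = [
    {"patient_id": f"P{p:02d}", "systolic_bp": rng.gauss(125, 15)}
    for p in range(20)
    for _ in range(3)
]

train, test = split_by_patient(records, 0.2, rng)

# Scaling statistics come from the training rows only. Computing them
# over all records would let test information influence training.
train_values = [r["systolic_bp"] for r in train]
train_mean = statistics.mean(train_values)
train_sd = statistics.stdev(train_values)


def scale(value):
    """Standardize a value using training-set statistics."""
    return (value - train_mean) / train_sd


test_scaled = [scale(r["systolic_bp"]) for r in test]
```

Splitting by patient rather than by row matters for clinical data, because repeat visits from one person tend to resemble each other.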

Additionally, synthetic data generation can fall short in cases requiring the creation of patient cohorts. These data generation models often excel at generating information for a single synthetic patient, but struggle when tasked with doing so for an entire patient population.

These hurdles are not impossible to overcome, but they might discourage healthcare organizations from turning to synthetic data, depending on the use case. This is where de-identified data can come in.

The role of de-identified healthcare data

Data de-identification refers to the process of masking or decoupling data elements to prevent them from being associated with an individual. Healthcare organizations understand its importance because of its role in HIPAA compliance.

Healthcare data de-identification involves removing PHI and personally identifiable information (PII) when processing, or before sharing, that data. This enables HIPAA-compliant data sharing among healthcare organizations, which can enhance medical research and patient care.

Removing only direct identifiers -- such as name and Social Security number -- prioritizes confidentiality, while leaving some indirect identifiers -- including age, gender and race -- in the data can help researchers study real-world trends within a patient population or demographic.
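
As a rough illustration of that split between direct and indirect identifiers, the Python sketch below drops two direct identifiers from a record and keeps the demographic fields. The field names and the sample record are hypothetical, and the set of direct identifiers is deliberately short. It is not the full list of identifiers addressed by HIPAA and should not be read as a compliance procedure.

```python
# Assumption: an illustrative subset, not the complete HIPAA identifier list.
DIRECT_IDENTIFIERS = {"name", "ssn"}


def strip_direct_identifiers(record, direct=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed direct identifiers."""
    return {key: value for key, value in record.items() if key not in direct}


# Invented example record.
patient = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "age": 54,
    "gender": "F",
    "race": "Asian",
    "diagnosis": "type 2 diabetes",
}

deidentified = strip_direct_identifiers(patient)
print(sorted(deidentified))
```

The function returns a new dictionary and leaves the original untouched, so the identified source record can stay inside a controlled environment while only the reduced copy is shared.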

As long as PHI is de-identified according to the HIPAA Privacy Rule, healthcare data can be further de-identified to various degrees depending on the use case, allowing some flexibility in how that information can be used for an analytics project.

To date, researchers have successfully developed an automated EHR tool that produces de-identified data and launched initiatives to use de-identified information to improve health equity research.

Despite these successes, de-identified data is not infallible. Removing direct identifiers does not necessarily guarantee that a patient cannot be re-identified via information like their IP address or medical device ID number.

This re-identification risk is compounded by the growing use of emerging technologies like AI and machine learning (ML), alongside tools like connected devices. AI models have demonstrated that they can re-identify patients, even when trained on de-identified data, creating privacy concerns that regulatory bodies are struggling to address.

The rapid advancement of these tools has led some to call for updates to HIPAA that would account for the use of AI and ML in healthcare.

De-identification also cannot prevent data from being re-identified in other ways, such as through the interaction of multiple variables. The relationship between data points like income, treatment regimen and timeframe could provide enough information to re-identify an individual, requiring users to both de-identify and transform the data before using it for analytics.
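
The risk from combined variables can be checked with a simple count, sketched here in Python. The approach is in the spirit of k-anonymity, a technique the article itself does not name, and the records, field names and $50,000 income bands are invented for illustration. The sketch counts how many records share each combination of income, treatment and quarter, then repeats the count after income is coarsened into bands.

```python
from collections import Counter


def smallest_group_size(records, fields):
    """Size of the rarest combination of the given fields."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return min(counts.values())


def band_income(record, width=50_000):
    """Return a copy with income rounded down to the start of its band."""
    banded = dict(record)
    banded["income"] = (record["income"] // width) * width
    return banded


# Invented records with no direct identifiers.
records = [
    {"income": 41_000, "treatment": "A", "quarter": "2023-Q1"},
    {"income": 43_500, "treatment": "A", "quarter": "2023-Q1"},
    {"income": 47_000, "treatment": "A", "quarter": "2023-Q1"},
    {"income": 82_000, "treatment": "B", "quarter": "2023-Q2"},
    {"income": 88_000, "treatment": "B", "quarter": "2023-Q2"},
    {"income": 60_500, "treatment": "B", "quarter": "2023-Q2"},
]

fields = ["income", "treatment", "quarter"]

# Exact incomes make every combination unique (group size 1).
smallest_before = smallest_group_size(records, fields)

# After banding, each combination is shared by three records.
transformed = [band_income(r) for r in records]
smallest_after = smallest_group_size(transformed, fields)

print(smallest_before, smallest_after)
```

A group size of 1 means a single record is distinguishable by those fields alone. Coarsening a variable trades analytic detail for larger groups, which is one form the transformation step can take.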

These issues can be mitigated by developing a HIPAA-compliant data de-identification protocol that also considers the additional privacy concerns that currently fall outside of HIPAA's purview.

By understanding the benefits, pitfalls and best practices associated with RWD, synthetic data and de-identified data use in healthcare, stakeholders can make more informed decisions around their analytics efforts and improve care.

Shania Kennedy has been covering news related to health IT and analytics since 2022.
