How a synthetic data approach is helping COVID-19 research

Israeli researchers are using a system developed by big data vendor MDClone to collaborate on COVID-19 data to help find ways to deal with the pandemic.

As medical researchers around the world race to find answers to the COVID-19 pandemic, they need to gather as much clinical data as possible for analysis.

A key challenge many researchers face with clinical data is privacy and the mandate to protect confidential patient information. One way to overcome that privacy challenge is by using synthetic data, an approach that creates data that is not linked to personally identifiable information. Rather than encrypting or attempting to anonymize data to protect privacy, synthetic data represents a different approach that can be useful for medical researchers.

With synthetic data, there are no real people; rather, the data is a synthetic copy that is statistically comparable to the original but composed entirely of fictional patients, explained Ziv Ofek, founder and CEO of health IT vendor MDClone, based in Beer Sheba, Israel.

Other popular methods of protecting patient privacy, such as anonymization and encryption, aim to balance patient privacy and data utility. However, a privacy risk remains because real people are still embedded within the data, even after diligent attempts to protect privacy, Ofek argued.

"There are no real people embedded within the synthetic data," Ofek said. "Instead, the data is a statistical representation of the original and the risk of reidentification is no longer relevant, even though it may appear as real people and can be analyzed as if it were and yielding the same conclusions."
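As a rough illustration of what a "statistical representation of the original" can mean, the sketch below fits a simple two-variable Gaussian model to a table of made-up patient records and then samples entirely new records from that model. It is not MDClone's method, which is not described here; the function names, the columns (age and systolic blood pressure) and all numbers are assumptions invented for the example, and a production system would need far stronger modeling and privacy checks.

```python
import math
import random
import statistics


def fit_bivariate(rows):
    """Summarize (x, y) rows by means, standard deviations and correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n new (x, y) rows from the fitted bivariate Gaussian."""
    mx, my, sx, sy, rho = params
    # Clamp guards against tiny negative values from floating-point error.
    resid = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + resid * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Hypothetical "original" records: (age, systolic blood pressure).
# These values are generated here purely for illustration.
original = []
for _ in range(500):
    age = rng.gauss(60, 12)
    original.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

params = fit_bivariate(original)
synthetic = sample_synthetic(params, 5000, rng)

# The synthetic table should show similar summary statistics,
# while none of its rows is copied from the original table.
print("original  (mean age, mean bp, correlation):",
      params[0], params[1], params[4])
refit = fit_bivariate(synthetic)
print("synthetic (mean age, mean bp, correlation):",
      refit[0], refit[1], refit[4])
```

In this toy setup an analyst could compute averages or correlations on `synthetic` and reach roughly the same conclusions as on `original`, which is the property Ofek describes, though a Gaussian fit on its own captures only means, spreads and linear relationships.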

[Figure: The MDClone Synthetic Data Engine creates anonymous data statistically identical to the original.]

Synthetic data in the real world

Sheba Medical Center in Ramat Gan, near Tel Aviv, is using MDClone's synthetic data technology as part of its COVID-19 research.

The MDClone system is critical to his organization's data efforts to gain more insights into COVID-19, the disease caused by the novel coronavirus, said Eyal Zimlichman, M.D., deputy director general, chief medical officer and chief innovation officer at Sheba Medical.

By regulation, synthetic data is not considered patient data and therefore is not subject to the institutional review board (IRB) process. Ofek noted that, unlike real patient data, synthetic data can be accessed freely by researchers, as long as the institution agrees to provide access.

"Synthetic data provides an opportunity to get quick answers to data-related questions without the need for an IRB approval," Zimlichman said. "It also allows users to work on the data in their own environment, something we do not allow with real data."

Zimlichman added that data science groups both within and outside the hospital are using the MDClone system to help predict COVID-19 patient outcomes, as well as to aid in determining a course of action for therapy.

Synthetic data accelerates time to insight

The MDClone platform includes a data engine for collecting and organizing patient data, the discovery studio for analysis and the Synthetic Data Engine for creating synthetic data. On April 14, the vendor released the MDClone Pandemic Response Package, which includes a predefined set of visualizations and analyses that are COVID-19-specific. The engine enables clients and networks to ask questions of COVID-19-related data and generate meaningful analysis, including cohort and population-level insights.

If a client wants to share and compare its data and collaborate on it with others, it can convert the original data into a synthetic copy for shared review and insight development.

"A synthetic collaboration model allows for that conversation to take place with data flows and analysis performed across both systems without patient privacy and security risks," Ofek said.

Ofek added that the synthetic model and platform access capability enable clients to invite research and collaboration partners into their data environment rather than simply sharing files on demand. With MDClone, those partners can log in to the MDClone data lake and access the data and exploration tools with synthetic output.

"In the context of the pandemic, organizations leveraging the platform can offer partners unfettered synthetic access to accelerate exploration into new avenues for treatment," Ofek said. "Idea generation and data reviews that enable real-world analysis is our pathway to finding and broadcasting the best healthcare professionals can offer as we combat the disease."
