
How EHR Data, Survey Responses Can Mitigate Clinical Research Bias

The All of Us Research Program provides a novel clinical research repository that includes participants' EHR data and survey responses.

Disagreement between EHR data and survey responses can help identify possible missing health records and mitigate potential clinical research biases, according to a study published in JAMIA.

The All of Us Research Program provides a novel research repository that includes participants' EHR data and survey responses.

The program aims to enroll participants who reflect the country's diversity, including populations traditionally underrepresented in biomedical research, such as individuals from racial and ethnic minority groups, those who are 65 years or older, and those who identify as sexual or gender minorities.

Individuals aged 18 years or older can enroll in the program online or through a participating healthcare provider organization. Participants may share their EHR data, physical measurements, and biospecimens, and complete surveys through an electronic portal or a computer-assisted phone interview.

The program sends out seven surveys in total. Participants receive three surveys upon enrollment, focused on basic demographics, lifestyle, and overall health. The program then sends out three follow-up surveys related to medical history (MH), family MH, and healthcare access.

The medical history survey includes self-report questionnaires covering more than 150 medical condition diagnoses organized into 12 disease categories. The seventh survey collects information about participants' well-being during the COVID-19 pandemic.

The fourth release of the All of Us dataset included data from 314,994 individuals. Just over 28 percent of participants completed medical history surveys, and 65.5 percent contributed EHR data.

The researchers identified the three most and three least frequent self-reported diagnoses in each disease category and retrieved their analogs from participants' EHRs.

The survey's hearing and vision category had the highest number of responses but the second-lowest positive agreement with the EHR (0.21).

The infectious disease category had the lowest positive agreement with the EHR (0.12), while cancer conditions had the highest positive agreement (0.45) between the two data sources.
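The article does not spell out how agreement was scored. One common metric for this kind of two-source concordance analysis is the proportion of positive agreement, 2a / (2a + b + c), where a is the number of participants flagged positive in both sources and b and c are those positive in only one source. A minimal Python sketch under that assumption (the study's exact definition may differ):

    def positive_agreement(survey_positive: set, ehr_positive: set) -> float:
        """Proportion of positive agreement between two sets of participant
        IDs flagged positive for a condition (assumed metric)."""
        both = len(survey_positive & ehr_positive)         # positive in both sources
        survey_only = len(survey_positive - ehr_positive)  # self-report only
        ehr_only = len(ehr_positive - survey_positive)     # EHR code only
        denom = 2 * both + survey_only + ehr_only
        return 2 * both / denom if denom else float("nan")

    # Hypothetical toy data: self-reported diagnoses vs. matching EHR codes.
    survey = {"p1", "p2", "p3", "p4"}
    ehr = {"p2", "p3", "p5"}
    print(round(positive_agreement(survey, ehr), 2))  # 0.57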

"Conditions that are usually undocumented in EHRs had low agreement scores, demonstrating that survey data can supplement EHR data," the study authors wrote.

Additionally, the absence of a diagnosed condition is typically not documented in EHRs. However, the study authors noted that clinical researchers could confirm the absence using survey answers and proceed to use that data to define control cohorts in phenotype algorithms.
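As a concrete illustration, a control cohort could require both the absence of an EHR code and an explicit "no" survey response, rather than inferring health from EHR silence alone. A hypothetical pandas sketch (the table and column names are assumptions, not the program's schema):

    import pandas as pd

    # Hypothetical participant-level table: one row per participant, with an
    # EHR flag and a self-reported survey answer for a given condition.
    df = pd.DataFrame({
        "participant_id": ["p1", "p2", "p3", "p4"],
        "ehr_has_condition": [False, True, False, False],
        "survey_answer": ["no", "yes", "skip", "no"],  # "skip" = unanswered
    })

    # Naive controls: absence of an EHR code alone. This cannot distinguish
    # a truly healthy participant from one with missing records (e.g., p3).
    naive_controls = df[~df["ehr_has_condition"]]

    # Survey-confirmed controls: also require an explicit "no" self-report,
    # so the absence of the condition is documented rather than inferred.
    confirmed = df[~df["ehr_has_condition"] & (df["survey_answer"] == "no")]
    print(confirmed["participant_id"].tolist())  # ['p1', 'p4']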

"Identifying concordance and discordance between the two sources can aid researchers in developing more accurate phenotype models," they emphasized.

The authors noted that condition-level and patient-level factors might impact the agreement between self-reported and EHR-based medical history.

At the condition level, disagreement might stem from a lack of EHR coding for some diseases, since documentation processes are driven by billing.

Disagreement might also arise from high levels of EHR code aggregation and from generalized disease names in surveys.
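For example, several distinct diagnosis codes in the EHR may all roll up to one generalized survey item, so a single "yes" can correspond to any of them. A hypothetical mapping in Python (the survey wording and the ICD-10-CM codes shown are illustrative choices, not taken from the study):

    # One generalized survey item aggregates several more specific EHR codes.
    SURVEY_TO_EHR_CODES = {
        "Have you been diagnosed with skin cancer?": {
            "C43.9",   # malignant melanoma of skin, unspecified
            "C44.90",  # unspecified malignant neoplasm of skin, unspecified site
        },
    }

    def ehr_confirms(survey_item: str, participant_codes: set[str]) -> bool:
        """True if any EHR code mapped to the survey item appears in the
        participant's record."""
        return bool(SURVEY_TO_EHR_CODES[survey_item] & participant_codes)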

"Using more specific EHR codes with more descriptive survey responses might reduce disagreement between EHR and survey," the researchers suggested.

At the patient level, disagreement between the data sources might occur due to low EHR density, a measure of the quantity and temporal distribution of a participant's clinical data over time.
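The article does not give the study's exact density formula. A minimal sketch, assuming density is defined as the fraction of calendar months in a participant's observation window that contain at least one clinical record:

    from datetime import date

    def ehr_density(record_dates: list[date], start: date, end: date) -> float:
        """Fraction of calendar months in [start, end] containing at least
        one record (an assumed stand-in for the study's density measure)."""
        def month_index(d: date) -> int:
            return d.year * 12 + d.month
        total_months = month_index(end) - month_index(start) + 1
        covered = {month_index(d) for d in record_dates if start <= d <= end}
        return len(covered) / total_months

    # Hypothetical participant with sparse records over a two-year window.
    records = [date(2020, 1, 15), date(2020, 2, 3), date(2021, 11, 20)]
    print(round(ehr_density(records, date(2020, 1, 1), date(2021, 12, 31)), 2))  # 0.12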

"Participants who have negative agreement had lowest proportions of EHR density between 0.8 and 0.9, suggesting that those participants might be healthier, while the lower values of EHR density for disagreement cases might indicate data missingness," the authors wrote.

"Disagreement between survey and EHR might occur since EHR can lack the full patient record due to receiving care at multiple hospitals coupled with lack of interoperability or socioeconomic factors that prevent patients from seeking care such as income, insurance, and distance," they added.

The authors noted that this analysis could help the program identify participants who may have incomplete EHR data or who enrolled at a site that does not hold their primary record.

"Some researchers might exclude participants due to the low amount of information without accounting for their race, gender, sexual orientation, or the completeness of their records," they wrote. "Although missingness is one of the biggest challenges in EHR, leaning toward excluding those participants might cause bias in the models or recommendations towards participants who have more EHR data."

The authors suggested that assessing EHR missingness using MH surveys might help create models that are not biased toward participants with more data.
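Putting the pieces together, one hypothetical screening heuristic (not the study's algorithm) would treat self-reported conditions that are absent from a sparse EHR as a signal of missing data rather than of health:

    def likely_missing_data(survey_conditions: set[str],
                            ehr_conditions: set[str],
                            density: float,
                            density_threshold: float = 0.3) -> bool:
        """Hypothetical heuristic: self-reported conditions missing from the
        EHR plus a sparse record suggest data missingness, not health.
        The threshold is illustrative, not taken from the study."""
        unconfirmed = survey_conditions - ehr_conditions
        return bool(unconfirmed) and density < density_threshold

Such a flag could feed a sensitivity analysis rather than trigger automatic exclusion, keeping models from skewing toward participants with denser records.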
