Machine Learning Model Shows Higher COVID-19 Cases Than Reported

A machine learning model estimated that the number of US COVID-19 cases is nearly three times the reported figure.

Since the pandemic began, experts have looked to daily counts of laboratory-confirmed COVID-19 cases and deaths in an effort to contain the virus. Now, a machine learning algorithm has revealed that the true number of cases may be far higher than reported.

In a study published in PLOS ONE, researchers estimate that the number of COVID-19 cases in the US since the pandemic started is nearly three times that of confirmed cases. The machine learning algorithm provides daily updated estimates of total infections to date, as well as how many people are currently infected across the US and in 50 countries hardest hit by the pandemic.

According to the model, more than 71 million people in the US had contracted COVID-19 as of February 4, 2021. This is significantly greater than the 26.7 million confirmed cases publicly reported.

Of those 71 million Americans estimated to have had COVID-19, seven million had current infections and were potentially contagious on February 4, the algorithm showed.

The study is based on calculations completed in September 2020. At that time, the estimated number of actual cumulative cases in 25 of the 50 hardest-hit countries was five to 20 times greater than the confirmed case counts.

The current figures from the online version of the algorithm show that the estimates are now closer to the reported numbers, but still much higher. On February 4, Brazil had more than 36 million cumulative cases as estimated by the algorithm, almost four times the 9.4 million confirmed cases reported.

France had an estimated 14 million cases versus the 3.2 million reported, while the UK had almost 25 million versus about four million. The machine learning algorithm also showed that Mexico had nearly 15 times its reported number of cases, at 27.6 million instead of 1.9 million confirmed.

"The estimates of actual infections reveal for the first time the true severity of COVID-19 across the US and in countries worldwide," said Jungsik Noh, PhD, a UT Southwestern assistant professor in the Lyda Hill Department of Informatics and first author of the study.

To run its daily updates, the model uses COVID-19 death data from Johns Hopkins University and The COVID Tracking Project, a volunteer organization that aims to help track COVID-19.

The algorithm uses the number of reported deaths, which is thought to be more accurate than the number of lab-confirmed cases, as the basis of its calculations. The model then assumes an infection fatality rate of 0.66 percent, based on an earlier study of the pandemic in China, and considers factors like the average number of days from the onset of symptoms to death or recovery.
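The core arithmetic behind a death-based estimate can be illustrated with a minimal Python sketch. It is not the study's actual model, which is more involved; it only divides cumulative reported deaths by the assumed 0.66 percent infection fatality rate and, optionally, shifts the series by a lag between infection and death. The function names, the lag value, and the death counts below are hypothetical and chosen purely for illustration.

```python
# Illustrative sketch only: NOT the study's published algorithm.
# Assumption: cumulative infections ~= cumulative deaths / infection fatality rate.

IFR = 0.0066  # 0.66 percent infection fatality rate cited in the article


def estimate_cumulative_infections(cumulative_deaths, ifr=IFR):
    """Back-calculate cumulative infections from cumulative reported deaths."""
    if not 0 < ifr <= 1:
        raise ValueError("ifr must be in the interval (0, 1]")
    return cumulative_deaths / ifr


def lagged_estimates(deaths_by_day, lag_days, ifr=IFR):
    """Estimate infections for each day, using deaths reported lag_days later.

    Deaths reported up to day t + lag_days are taken to reflect infections
    that occurred up to day t, so the result is lag_days shorter than the input.
    """
    if lag_days < 0:
        raise ValueError("lag_days must be non-negative")
    return [estimate_cumulative_infections(d, ifr) for d in deaths_by_day[lag_days:]]


if __name__ == "__main__":
    # Hypothetical cumulative death counts for five consecutive days.
    hypothetical_deaths = [0, 10, 20, 33, 66]
    # Hypothetical two-day lag, far shorter than any realistic value.
    for day, infections in enumerate(lagged_estimates(hypothetical_deaths, lag_days=2)):
        print(f"day {day}: about {infections:,.0f} estimated cumulative infections")
```

Because the estimate scales inversely with the fatality rate, a small change in that assumption moves the result considerably, which is one reason such figures should be read as rough.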

The algorithm also compares its estimate with the number of confirmed cases to calculate a ratio of confirmed-to-estimated infections.
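As a rough illustration of that comparison, the short Python sketch below computes the confirmed-to-estimated ratio and its inverse for the February 4 figures quoted in this article. The function names are hypothetical, and the estimates are the article's rounded values (in millions), so the results are approximate.

```python
# Illustrative sketch: ratio of confirmed cases to model-estimated infections.
# Figures are the rounded February 4, 2021 values quoted in the article, in millions.

figures = {
    # country: (confirmed cases, estimated infections)
    "US": (26.7, 71.0),
    "Brazil": (9.4, 36.0),
    "France": (3.2, 14.0),
    "Mexico": (1.9, 27.6),
}


def confirmed_to_estimated_ratio(confirmed, estimated):
    """Fraction of estimated infections that were laboratory confirmed."""
    if estimated <= 0:
        raise ValueError("estimated must be positive")
    return confirmed / estimated


def underreporting_multiplier(confirmed, estimated):
    """How many times larger the estimate is than the confirmed count."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return estimated / confirmed


if __name__ == "__main__":
    for country, (confirmed, estimated) in figures.items():
        ratio = confirmed_to_estimated_ratio(confirmed, estimated)
        multiplier = underreporting_multiplier(confirmed, estimated)
        print(f"{country}: {ratio:.1%} of estimated infections confirmed, "
              f"estimate is {multiplier:.1f}x the confirmed count")
```

A low ratio suggests that testing in a region captured only a small share of infections, which is the pattern the article describes for Mexico.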

Experts are still uncertain about the death rate of COVID-19, so the algorithm’s estimates are rough. However, researchers believe that the model’s estimates are more accurate and miss fewer cases than the confirmed counts currently used to guide public health policies. The team noted that it’s critical to have a more comprehensive estimate of the prevalence of the disease.

"These are critical statistics about the severity of COVID-19 in each region. Knowing the true severity in different regions will help us effectively fight against the virus spreading," said Noh.

"The currently infected population is the cause of future infections and deaths. Its actual size in a region is a crucial variable required when determining the severity of COVID-19 and building strategies against regional outbreaks."

The study showed that in the US, infections vary significantly by state. The algorithm’s estimates for February 4 indicated that California had seen almost seven million infections since the pandemic started, compared with 5.7 million in New York. Additionally, the model estimated that California had 1.3 million active cases on that date, or 3.4 percent of the state’s population.

Researchers checked their findings by comparing results with existing prevalence rates found in several studies that used blood tests to check for antibodies to the virus causing COVID-19. For most of the areas tested, the algorithm’s estimates of infections closely corresponded to the percentage of people who had tested positive for the antibodies.

The team expects that their machine learning model can help inform public health policies during the pandemic.

“Our framework estimates the actual fraction of currently infected people in each region. To our knowledge this is the first model to provide this prediction. The estimated number of current infections can serve as an initial target in planning effective contact tracing,” the researchers concluded.

“Since the developed pipeline requires simple input, it is widely applicable to more granular analyses of specific regions or communities, for which the number of confirmed cases and deaths are being tracked.”
