Digital phenotype introduces new kinds of data to clinical setting

John Brownstein of Boston Children's Hospital explains how his team is 'putting the public back in public health' by mining troves of nonclinical data culled from social media.

John Brownstein, chief innovation officer at Boston Children's Hospital, is mining the digital phenotype. 

"It's this idea that all of the data you generate through your interactions with technology -- whether it's social media or with the devices -- all of those digital breadcrumbs can actually bring in unique insights about a patient," Brownstein said during his talk at the recent Harvard Institute for Applied Computational Science's annual symposium.

A patient's digital phenotype can be culled from search queries, internet traffic data, and virtual social settings such as Facebook and Twitter. Digital data sets like these can act as signals for potential infectious disease or foodborne illness outbreaks. They can also provide broader visibility into chronic disease, drug abuse and drug-diversion activity, where medication isn't used by the person to whom it's prescribed.

"We always say we're putting the public back in public health," he said.

Building the digital phenotype

Public health research that analyzes nonclinical data sets like these, a field of study known as computational or digital epidemiology, is a big data use case. Part of the digital phenotyping process is to ingest large data sets "from as many sources that we can identify or scrape on the web," said Brownstein, who is also a professor of biomedical informatics at Harvard Medical School.

The nonclinical internet data comes from news outlets and blog posts; social media sites such as Twitter, Facebook and Instagram; and sites such as Yelp and even OpenTable, an application for making restaurant reservations. Brownstein said he also leverages data from traditional sources, such as electronic medical records.

These troves of nonclinical internet data not only provide new early signals about public health events and populations, they also give researchers access to data at a global scale, according to Brownstein. "You can imagine that the data we have across clinical settings is very geographically refined," he said.

Once the data has been collected, tools are developed to organize it by location and by keyword. The organized data is then mapped to taxonomies and used as a structured database for analysis, according to Brownstein. This is where he and his team rely on machine learning tools to separate potential signals from noise.
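To make the organizing step concrete, here is a minimal sketch of tagging free-text posts by keyword and location and counting them in a structured form. It is an illustration only, not the team's actual pipeline: the `TAXONOMY` and `PLACES` tables, the function names and the sample posts are all hypothetical, and a production system would use far larger vocabularies and a proper geocoder.

```python
import re
from collections import Counter

# Hypothetical keyword-to-category taxonomy (illustrative only).
TAXONOMY = {
    "fever": "influenza-like illness",
    "cough": "influenza-like illness",
    "vomiting": "foodborne illness",
    "food poisoning": "foodborne illness",
}

# Hypothetical gazetteer of place names to look for in the text.
PLACES = ["Boston", "Guinea", "Chicago"]


def contains(phrase, text):
    """True if the phrase appears in the text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def structure_post(text):
    """Turn one free-text post into a record with categories and places."""
    lowered = text.lower()
    categories = sorted({cat for kw, cat in TAXONOMY.items() if contains(kw, lowered)})
    places = [p for p in PLACES if contains(p.lower(), lowered)]
    return {"text": text, "categories": categories, "places": places}


def count_by_place_and_category(posts):
    """Count posts per (place, category) pair, a simple structured view."""
    counts = Counter()
    for post in posts:
        record = structure_post(post)
        for place in record["places"]:
            for category in record["categories"]:
                counts[(place, category)] += 1
    return counts
```

Counts like these are only raw material. Deciding which of them represent a real signal rather than noise is the part the article attributes to machine learning tools.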

Making sense of the digital phenotype is no easy task. Part of the complexity arises because people talk about their medications or symptoms in unexpected ways that don't map to medical or even nonmedical taxonomies. Features that make the data hard to organize include typos, spelling variations, invented words and hashtags.
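As a rough illustration of why this is hard, the sketch below normalizes hashtags and stretched spellings and then uses fuzzy string matching from Python's standard `difflib` module to map a lay term to a standard label. The `LAY_TO_STANDARD` vocabulary, the function names and the 0.8 similarity cutoff are assumptions made for the example, not anything described by Brownstein.

```python
import difflib
import re

# Hypothetical lay-term vocabulary mapped to standard labels (illustrative only).
LAY_TO_STANDARD = {
    "flu": "influenza",
    "influenza": "influenza",
    "stomach bug": "gastroenteritis",
    "food poisoning": "foodborne illness",
    "diarrhea": "diarrhea",
}


def normalize(token):
    """Lowercase, drop a leading '#', and collapse letters repeated 3+ times."""
    token = token.lower().lstrip("#")
    return re.sub(r"(.)\1{2,}", r"\1", token)


def map_term(raw, cutoff=0.8):
    """Map a raw term to a standard label, or return None if nothing is close."""
    term = normalize(raw)
    if term in LAY_TO_STANDARD:
        return LAY_TO_STANDARD[term]
    # Fall back to the single closest known term above the similarity cutoff.
    match = difflib.get_close_matches(term, list(LAY_TO_STANDARD), n=1, cutoff=cutoff)
    return LAY_TO_STANDARD[match[0]] if match else None
```

With this table, `map_term("#Fluuuu")` returns `"influenza"` and the misspelling `map_term("diarrea")` returns `"diarrhea"`, while an unrelated word returns `None`. Invented words and slang would still slip through, which is where manual curation comes in.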

"It takes a huge amount of curation and development to get to a place where we can start to organize this content and take the ways in which people talk about illness and code them to more traditional taxonomies," he said.

An early warning system

This kind of digital phenotyping has already proved successful. Brownstein has built public health "surveillance" tools such as HealthMap, which launched in 2006. It is a patient-facing public health system that he co-created, and it uses internet data such as aggregated news stories, blog posts, government websites and social data for "disease outbreak monitoring and real-time surveillance of emerging public health trends," according to its website.

"It's a global tracking system that basically ties as many data sources as we can get access to across hundreds of thousands of websites and 15 different languages," Brownstein said. In 2014, HealthMap picked up on the deadly Ebola outbreak in West Africa a week before an official announcement was made. The early warning signal came from a news story of a "mystery hemorrhagic fever" killing eight in Guinea.

And, as the years tick by, social media data and internet data are helping produce more robust digital phenotypes. Brownstein and his team are monitoring traffic data, such as spikes in traffic to Wikipedia's influenza page, to gauge the state of global health. They're also looking at sites such as OpenTable -- reservation cancellations could be a signal of a potential influenza outbreak -- and Yelp, which crowdsources reviews of businesses and restaurants. "I'm not sure if people know this, but 10% of Yelp reviews are food-poisoning-related," he said.
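One simple way to think about a traffic "spike" is to compare each day's count with the recent baseline. The sketch below flags days whose count sits well above the mean of the preceding week, measured in standard deviations. It is a generic textbook approach under stated assumptions, not the method Brownstein's team uses; the function name, the seven-day window, the threshold of 3 and the `min_sigma` floor are all hypothetical choices.

```python
from statistics import mean, stdev


def find_spikes(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Return indices of days that exceed the trailing-window mean by more
    than `threshold` standard deviations.

    `min_sigma` is an assumed floor on the standard deviation so that a
    perfectly flat baseline does not cause a division by zero.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]          # the preceding `window` days
        mu = mean(baseline)
        sigma = max(stdev(baseline), min_sigma)
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes
```

Given made-up daily page views of `[100, 102, 98, 101, 99, 103, 100, 250, 101, 99]`, the function flags only index 7, the day with 250 views. A real system would also need to account for seasonality and for news-driven traffic that has nothing to do with illness.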

In fact, reviewers often mention the specific ingredients they believe caused their illness. When Brownstein and his team shared that information with the Centers for Disease Control and Prevention (CDC), the agency was incredulous, he said. After doing its own analysis, the CDC found that the reviewers were surprisingly accurate.

"From our perspective, the consumer, the patient is much smarter than we give them credit for," Brownstein said.
