Data Analytics in Healthcare: Defining the Most Common Terms
As data analytics becomes essential for healthcare organizations, stakeholders need to understand the basic vocabulary related to the process.
Big data analytics plays an increasingly significant role in improving care quality, patient outcomes, and operational efficiency and in reducing costs for healthcare stakeholders. But navigating the healthcare data analytics landscape requires understanding a few key concepts.
The American Health Information Management Association (AHIMA) characterizes data analytics as “analysis of the data in some way using quantitative and qualitative techniques to be able to explore for trends and patterns in the data,” noting that the concept is closely related to health informatics, which involves using that data to improve healthcare decision-making.
Health systems across the United States are at various points in their data analytics journey, with some working to develop advanced technologies like artificial intelligence (AI) and others still struggling with the transition to electronic health records (EHRs).
However, no matter where on the spectrum a healthcare organization finds itself, there is no question that leveraging patient information, claims, and other data is becoming critical within the healthcare sector.
Data analytics tools for population health management and for supporting accountable care organizations’ (ACOs) value-based care strategies have already demonstrated that healthcare data can help health systems improve their performance on key performance indicators (KPIs).
But for those looking to establish a data analytics strategy for the first time, the multifaceted nature of healthcare analytics can seem daunting. Implementing a strategy or tool in this area often requires stakeholders to balance financial, technological, clinical, legal, ethical, and HIPAA compliance considerations.
These decisions are best left to a healthcare organization’s leadership, but acknowledging the need for a common understanding of key data analytics terms to foster communication and collaboration is an important first step.
Below, HealthITAnalytics will outline some of the most common data analytics terms that healthcare stakeholders can expect to come across as they explore how to develop or deploy their analytics strategies.
After mastering these concepts, learning more about the four types of healthcare big data analytics and the most common machine learning (ML) terms in healthcare can help health systems move toward utilizing advanced analytics.
DATA COLLECTION AND PREPARATION
Before data, or pieces of information, can be analyzed, they need to be collected and prepared.
In healthcare, data is collected from a multitude of sources, including EHRs, claims, social determinants of health (SDOH) information, public health surveillance, disease registries, peer-reviewed research and literature, administrative data, and patient surveys.
Data collection or extraction is the process of systematically gathering this information to answer research questions or evaluate a given outcome. For example, healthcare organizations may collect SDOH data from patients to inform a health equity strategy or to gauge progress toward health equity goals.
Data can be qualitative or quantitative, depending on how the information is measured. The data can also be structured, unstructured, or semi-structured based on how the data are coded and formatted. Synthetic data can be used in some applications where appropriate real-world data is unavailable.
Regardless of data type, ensuring that the data is accurate and of high quality is critical to making this information usable in healthcare.
Data quality measures help stakeholders determine the completeness, accuracy, reliability, and consistency of the data. Data integrity involves monitoring data quality across formats and over time.
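For teams working with tabular extracts, a minimal sketch of what such quality checks might look like in Python with pandas could resemble the following; the column names, code sets, and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical patient extract; column names, codes, and values are illustrative only.
df = pd.DataFrame({
    "patient_id": [1001, 1002, 1002, 1003],
    "birth_date": ["1985-02-11", None, "1990-07-30", "1990-07-30"],
    "sex": ["F", "M", "M", "X"],
})

# Completeness: share of non-missing values in each column.
completeness = df.notna().mean()

# Consistency: flag values that fall outside an agreed-upon code set.
valid_sex_codes = {"F", "M", "U"}
inconsistent_sex = ~df["sex"].isin(valid_sex_codes)

# Reliability: surface duplicate records for the same patient identifier.
duplicate_ids = df["patient_id"].duplicated(keep=False)

print(completeness)
print(df[inconsistent_sex | duplicate_ids])
```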
Part of this process entails converting the data to be analyzed to a common format, which is where data standardization comes in. This stage of the analytics process assists stakeholders with making the data consistent, which is crucial for analysis later.
If datasets are not standardized, they cannot be compared to one another, even if the information within them quantifies the same metric or outcome. For example, one dataset may contain unstructured clinical notes related to patient diagnoses, while another contains International Classification of Diseases (ICD) codes.
In theory, some of the data within these sets are comparable. Still, meaningful insights cannot be gleaned from the comparison while the data remain in disparate formats, as the records may differ in other ways that could negatively impact the analysis.
When done correctly, the adoption of health data standards can also enable interoperability.
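To make the standardization step concrete, the sketch below harmonizes two hypothetical extracts, one using free-text diagnosis labels and US-style dates, the other using ICD-10-CM codes and ISO dates, into a single, consistent format. The crosswalk and field names are assumptions for the example, not a production terminology mapping.

```python
import pandas as pd

# Two hypothetical extracts that record the same information in different formats.
site_a = pd.DataFrame({"visit_date": ["02/11/2023"], "diagnosis": ["type 2 diabetes"]})
site_b = pd.DataFrame({"visit_date": ["2023-02-11"], "diagnosis": ["E11.9"]})

# Illustrative crosswalk from free-text labels to ICD-10-CM codes.
icd_crosswalk = {"type 2 diabetes": "E11.9"}

def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Parse each source's date format, then emit a single ISO 8601 representation.
    out["visit_date"] = pd.to_datetime(out["visit_date"]).dt.strftime("%Y-%m-%d")
    # Map free-text diagnoses to codes; values that are already codes pass through unchanged.
    out["diagnosis"] = out["diagnosis"].map(lambda d: icd_crosswalk.get(d, d))
    return out

combined = pd.concat([standardize(site_a), standardize(site_b)], ignore_index=True)
print(combined)
```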
Throughout the data preparation and standardization process, healthcare organizations will also need to clean and normalize their data.
Data cleaning involves finding, defining, and addressing abnormalities in the data, such as missing values, statistical outliers, or data entry errors. Experts emphasize that cleaning cannot fix a poorly designed study or analytics project, and they argue that practices like requiring data cleaning reports are needed to prevent unwanted data transformation or manipulation.
Data normalization takes standardization one step further to help make data less ambiguous and more usable across systems.
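A simplified sketch of cleaning and normalization on a hypothetical vitals extract might look like the following, where missing values and implausible outliers are flagged for review and weights recorded in different units are converted to kilograms; the thresholds and column names are illustrative only.

```python
import pandas as pd

# Hypothetical vitals extract; values, units, and thresholds are illustrative only.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "weight_value": [82.0, 180.0, None, 7000.0],
    "weight_unit": ["kg", "lb", "kg", "kg"],
})

# Cleaning: flag missing values and implausible outliers for review rather than
# silently dropping them, so the decisions can be documented in a cleaning report.
df["weight_missing"] = df["weight_value"].isna()
df["weight_outlier"] = df["weight_value"] > 500  # implausible whether recorded in kg or lb

# Normalization: convert every weight to a single unit (kilograms) so records are comparable.
LB_TO_KG = 0.453592
df["weight_kg"] = df.apply(
    lambda row: row["weight_value"] * LB_TO_KG if row["weight_unit"] == "lb" else row["weight_value"],
    axis=1,
)

print(df)
```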
During the data collection and preparation process, healthcare organizations should also take note of data de-identification best practices to protect patient data privacy and ensure HIPAA compliance.
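The snippet below is a deliberately simplified illustration of de-identification, dropping direct identifiers and generalizing quasi-identifiers such as ZIP code and birth date. It is not a substitute for HIPAA’s Safe Harbor or Expert Determination methods, which impose more detailed requirements; the fields and values are hypothetical.

```python
import pandas as pd

# Hypothetical record-level extract containing direct identifiers.
df = pd.DataFrame({
    "name": ["Jane Doe"],
    "ssn": ["123-45-6789"],
    "zip_code": ["30309"],
    "birth_date": ["1951-06-02"],
    "diagnosis_code": ["I10"],
})

# Drop direct identifiers outright.
deidentified = df.drop(columns=["name", "ssn"])

# Generalize quasi-identifiers: truncate ZIP to three digits and keep only the birth year.
deidentified["zip3"] = deidentified.pop("zip_code").str[:3]
deidentified["birth_year"] = pd.to_datetime(deidentified.pop("birth_date")).dt.year

print(deidentified)
```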
DATA MANAGEMENT AND STORAGE
Data management follows collection and preparation, and the process helps organizations obtain value from their data.
Oracle defines data management as “the practice of collecting, keeping, and using data securely, efficiently, and cost-effectively. The goal of data management is to help people, organizations, and connected things optimize the use of data within the bounds of policy and regulation so that they can make decisions and take actions that maximize the benefit to the organization.”
Health systems can use a data management framework combined with data architecture — which provides a blueprint for the data managed, including how it flows through storage systems — to better utilize their data to guide strategic decision-making.
Such a framework also helps organizations choose which data storage options may be right for them.
Organized collections of interrelated data are referred to as databases. These can be housed in various types of storage infrastructure, one of the most basic of which is a repository.
Data repositories are centralized locations that hold, organize, and make data available for use. Data centers serve a similar function, acting as the location within an enterprise where data and backend IT systems, such as databases, servers, and mainframes, live.
A data warehouse is a larger, centralized repository. Data warehousing allows organizations to store data from both internal operational databases and external sources. In healthcare, it can be beneficial to differentiate between repositories and warehouses depending on the intended use case.
A clinical data repository, for example, is designed to consolidate information from different clinical sources to provide data on individuals, while a data warehouse contains data from systems across the enterprise to support analytics efforts across a patient population or for multiple use cases.
Data lakes are similar to data warehouses in that both serve as repositories to store large volumes of data. But there are key differences between them that warrant consideration, as they can be used separately or included in a shared analytics ecosystem.
A data lake is designed to capture both relational and non-relational data, whether the data are structured, semi-structured, or unstructured. This means that the data lake can hold a variety of data types in which data points may or may not be related to one another.
The schema, or structure, of the data doesn’t need to be defined until it is read or pulled for processing. This allows raw and unfiltered data to be stored while prioritizing low-cost flexibility and scalability. Use cases for data lakes typically center on real-time and predictive analytics in addition to ML tasks.
Data warehouses, on the other hand, deal with structured, relational data in which the structure is defined when the data is written. The data contained in a warehouse have also already been processed and transformed for a specific purpose or use case.
These features make data warehouses a good option for predefined business use cases or business intelligence (BI) analysis, though they are more expensive and more difficult to scale than data lakes.
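One way to picture the schema-on-read versus schema-on-write distinction is the sketch below: raw, semi-structured records land in a file-based “lake” without any enforced structure, while a predefined, typed table stands in for the “warehouse.” The file names, fields, and SQLite stand-in are illustrative assumptions rather than a reference architecture.

```python
import json
import sqlite3
import pandas as pd

# "Data lake" side: land raw, semi-structured records as-is; no schema is enforced at write time.
raw_records = [
    {"patient_id": 1, "note": "BP 150/95, discussed diet", "vitals": {"sbp": 150, "dbp": 95}},
    {"patient_id": 2, "note": "Follow-up in 3 months"},  # missing fields are fine in the lake
]
with open("landing_zone.jsonl", "w") as f:
    for record in raw_records:
        f.write(json.dumps(record) + "\n")

# Schema-on-read: structure is imposed only when the data is pulled for analysis.
with open("landing_zone.jsonl") as f:
    lake_df = pd.json_normalize([json.loads(line) for line in f])

# "Data warehouse" side: schema-on-write, with a predefined, typed table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS vitals (patient_id INTEGER, sbp INTEGER, dbp INTEGER)")

# Transform and filter before loading, so only data matching the schema lands in the warehouse.
curated = lake_df.dropna(subset=["vitals.sbp", "vitals.dbp"]).rename(
    columns={"vitals.sbp": "sbp", "vitals.dbp": "dbp"}
)
curated[["patient_id", "sbp", "dbp"]].to_sql("vitals", conn, if_exists="append", index=False)
conn.commit()
conn.close()
```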
Data lakehouses marry data lakes and data warehouses into one repository platform. By combining the two, a data lakehouse allows data to be migrated or moved between repositories via a data pipeline without sacrificing enhanced data management capabilities, which helps break down data silos.
In a data lakehouse, the storage flexibility and ability to hold unstructured data typically associated with a data lake are integrated with the ability to implement schema and data governance protocols usually found in a data warehouse.
Data governance refers to the procedures, policies, responsibilities, and structure associated with an organization’s data management strategy. Data governance models vary across industries and businesses, but the goals are often shared. Healthcare organizations can leverage data governance to provide a framework for data quality, security, privacy, and stewardship.
DATA PROCESSING AND ANALYSIS
After establishing governance, management, and storage strategies, organizations can focus on utilizing their data for analytics and insight generation.
This often begins with data mining, a process stakeholders can use to find trends and patterns within large datasets. By identifying patterns that may not have been immediately obvious to the user, data mining can help highlight previously unknown areas of interest and potential queries for a future analytics project.
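As a small, hypothetical example of the kind of pattern-finding data mining enables, the sketch below counts how often pairs of diagnosis codes appear together in the same visit; the codes and visit data are made up for illustration.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical visit-level diagnosis data.
visits = pd.DataFrame({
    "visit_id": [1, 1, 2, 2, 2, 3, 3],
    "diagnosis": ["E11.9", "I10", "E11.9", "I10", "N18.3", "I10", "E11.9"],
})

# A simple frequent-pattern pass: count how often diagnosis pairs co-occur within a visit.
pair_counts = Counter()
for _, codes in visits.groupby("visit_id")["diagnosis"]:
    for pair in combinations(sorted(set(codes)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```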
Data modeling helps organize these data elements and communicate the relationships between them. Data models are often built around specific business needs or use cases to help create a visual representation of the data types, their attributes and formats, and how each group of data points is related to another.
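A data model can be as simple as a few related record types. The sketch below uses Python dataclasses to show one hypothetical way of expressing the relationship between patients and their encounters; the entities and attributes are illustrative, not a recommended clinical schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Encounter:
    # One clinical visit, with its own attributes and associated diagnosis codes.
    encounter_id: str
    encounter_date: date
    diagnosis_codes: list[str] = field(default_factory=list)

@dataclass
class Patient:
    # A patient is related to many encounters (a one-to-many relationship).
    patient_id: str
    birth_year: int
    encounters: list[Encounter] = field(default_factory=list)

patient = Patient(patient_id="P-001", birth_year=1975)
patient.encounters.append(Encounter("E-100", date(2023, 5, 4), ["E11.9", "I10"]))
print(patient)
```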
Modeling can aid data visualization, which refers to the creation of graphical representations of data. Visualization techniques are especially useful for exploring vast data quantities or complex patterns within a dataset, as these methods rely on humans’ propensity for environmental pattern recognition to clearly communicate information.
There are a plethora of data visualization methods to choose from, but healthcare stakeholders often take advantage of big data dashboards.
In the past, health systems have used these dashboards to reduce chemotherapy side effects, optimize cancer clinical decision support, track Medicare Advantage (MA) plans’ health equity performance, monitor COVID-19 spread among college students, and provide information on vaccine rollouts.
Federal healthcare stakeholders are also leveraging this visualization tool, as evidenced by the launch of the Heat-Related Illness EMS Activation Surveillance Dashboard, or EMS HeatTracker, by the United States Department of Health and Human Services (HHS) earlier this month.
Regardless of the technique used, visualization is one of the most important aspects of the analytics process because data aren’t usable or actionable unless the target audience can conceptualize and understand them.
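For a sense of how little code a basic visualization can require, the sketch below draws a simple bar chart of illustrative monthly admission counts with matplotlib; production dashboards layer interactivity and live data feeds on top of building blocks like this.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly admission counts; the point is the visual encoding, not the numbers.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
admissions = [412, 389, 455, 430, 470, 502]

fig, ax = plt.subplots()
ax.bar(months, admissions)
ax.set_xlabel("Month")
ax.set_ylabel("Admissions")
ax.set_title("Monthly hospital admissions (illustrative data)")
fig.tight_layout()
plt.show()
```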
Throughout the data analytics process, and particularly at the end, metadata will be generated. This type of data is often characterized as “data about data,” but the term specifically denotes information about how or when the data were collected, from where, by whom, and for what purpose. Typically, metadata can streamline the use and management of data. This can be particularly useful for healthcare organizations with vast amounts of data.
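A metadata record can be as simple as a small structured file stored alongside the dataset it describes. The sketch below writes one hypothetical example as JSON; the fields and values are illustrative only.

```python
import json

# A hypothetical metadata record describing how, when, where, by whom, and why a dataset was collected.
metadata = {
    "dataset": "ed_visits_2023.csv",
    "source_system": "EHR export",
    "collected_by": "Emergency department registration workflow",
    "collection_period": {"start": "2023-01-01", "end": "2023-12-31"},
    "purpose": "Quality improvement reporting",
    "row_count": 18452,
}

with open("ed_visits_2023.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```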