Data Management/Data Warehousing Definitions

This glossary explains the meaning of key words and phrases that information technology (IT) and business professionals use when discussing data management and related software products. You can find additional definitions by visiting WhatIs.com.

  • #

    5V's of big data

    The 5 V's of big data -- velocity, volume, value, variety and veracity -- are the five main and innate characteristics of big data.

  • A

    ACID (atomicity, consistency, isolation, and durability)

    In transaction processing, ACID (atomicity, consistency, isolation, and durability) is an acronym and mnemonic device used to refer to the four essential properties a transaction should possess to ensure the integrity and reliability of the data involved in the transaction.
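Atomicity, the first of the four properties, can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch with made-up account data: a transfer that fails partway through is rolled back in full, so the database never holds a half-applied change.

```python
import sqlite3

# In-memory database for illustration; a real system would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# Atomicity: the transfer either fully commits or fully rolls back.
try:
    with conn:  # the connection acts as a transaction context manager
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass

# The debit was rolled back, so the data remains consistent.
balance = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100
```

The `with conn:` block commits on success and rolls back if an exception escapes, which is what preserves atomicity here.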

  • Apache Hadoop YARN

    Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework.

  • Azure Data Studio (formerly SQL Operations Studio)

    Azure Data Studio is a Microsoft tool, originally named SQL Operations Studio, for managing SQL Server databases and cloud-based Azure SQL Database and Azure SQL Data Warehouse systems.

  • What is Apache Flink?

    Apache Flink is a distributed data processing platform for use in big data applications, primarily involving analysis of data stored in Hadoop clusters.

  • What is Apache Spark?

    Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers.

  • What is Azure SQL Data Warehouse?

    Azure SQL Data Warehouse is a managed data warehouse as a service (DWaaS) offering provided by Microsoft Azure.

  • B

    big data

    Big data is a combination of structured, semi-structured and unstructured data that organizations collect, analyze and mine for information and insights.

  • big data engineer

    A big data engineer is an information technology (IT) professional who is responsible for designing, building, testing and maintaining complex data processing systems that work with large data sets.

  • big data management

    Big data management is the organization, administration and governance of large volumes of both structured and unstructured data.

  • C

    C++

    C++ is an object-oriented programming (OOP) language that is viewed by many as the best language for creating large-scale applications. C++ is largely a superset of the C language.

  • columnar database

    A columnar database (column-oriented) is a database management system (DBMS) that stores data on disk in columns instead of rows.
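The storage difference can be sketched in plain Python with made-up sales records: a row store keeps each record together, while a column store keeps each column together, so an aggregate over one column reads only that column.

```python
# Row-oriented layout: each record stored together (good for whole-row lookups).
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 175},
]

# Column-oriented layout: each column stored together (good for scans and aggregates).
columns = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100, 250, 175],
}

# An analytic query like SUM(sales) touches only one contiguous column here,
# instead of skipping through every row.
total = sum(columns["sales"])
print(total)  # 525
```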

  • compliance

    Compliance is the state of being in accordance with established guidelines or specifications, or the process of becoming so.

  • conformed dimension

    In data warehousing, a conformed dimension is a dimension that has the same meaning to every fact with which it relates.

  • consumer privacy (customer privacy)

    Consumer privacy, also known as customer privacy, involves the handling and protection of the sensitive personal information provided by customers in the course of everyday transactions.

  • CRUD cycle (Create, Read, Update and Delete Cycle)

    The CRUD cycle describes the elemental functions of a persistent database in a computer.
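The four CRUD operations map directly onto SQL statements. A minimal sketch with sqlite3 and a made-up users table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Create
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
# Read
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]
print(name)  # Ada
# Update
conn.execute("UPDATE users SET name = ? WHERE id = 1", ("Grace",))
# Delete
conn.execute("DELETE FROM users WHERE id = 1")
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 0
```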

  • customer data integration (CDI)

    Customer data integration (CDI) is the process of defining, consolidating and managing customer information across an organization's business units and systems to achieve a "single version of the truth" for customer data.

  • What is corporate performance management (CPM)?

    Corporate performance management (CPM) encompasses the processes and methodologies used to align an organization's strategies and goals to its plans and actions as a business.

  • D

    data analytics (DA)

    Data analytics (DA) is the process of examining data sets to find trends and draw conclusions about the information they contain.

  • data catalog

    A data catalog is a software application that creates an inventory of an organization's data assets to help data professionals and business users find relevant data for analytics uses.

  • data classification

    Data classification is the process of organizing data into categories that make it easy to retrieve, sort and store for future use.

  • data cleansing (data cleaning, data scrubbing)

    Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set.
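A small sketch of typical cleansing steps, using made-up customer records: normalizing whitespace and casing, dropping duplicates, and flagging missing values.

```python
raw = [
    {"email": " Alice@Example.com ", "age": "34"},
    {"email": "alice@example.com", "age": "34"},   # duplicate once normalized
    {"email": "bob@example.com", "age": ""},        # missing age
]

def clean(records):
    seen, out = set(), []
    for r in records:
        email = r["email"].strip().lower()          # fix whitespace and casing
        if email in seen:                           # drop duplicates
            continue
        seen.add(email)
        age = int(r["age"]) if r["age"] else None   # mark missing values as None
        out.append({"email": email, "age": age})
    return out

cleaned = clean(raw)
print(len(cleaned))  # 2
```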

  • data de-identification

    Data de-identification is the process of decoupling or masking data to prevent certain data elements from being associated with an individual.
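Two common de-identification moves are masking a direct identifier and generalizing a quasi-identifier. This is a sketch, not a production technique; the salt value and record fields are made up for illustration.

```python
import hashlib

def de_identify(record, salt="example-salt"):  # salt is an illustrative placeholder
    masked = dict(record)
    # Mask the direct identifier by replacing it with a one-way pseudonym.
    masked["name"] = hashlib.sha256((salt + record["name"]).encode()).hexdigest()[:12]
    # Generalize a quasi-identifier: exact age becomes an age band.
    masked["age"] = f"{record['age'] // 10 * 10}s"
    return masked

out = de_identify({"name": "Alice", "age": 34, "diagnosis": "flu"})
print(out["age"])  # 30s
```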

  • data dredging (data fishing)

    Data dredging -- sometimes referred to as data fishing -- is a data mining practice in which large data volumes are analyzed to find any possible relationships between them.

  • data engineer

    A data engineer is an IT professional whose primary job is to prepare data for analytical or operational uses.

  • data integration

    Data integration is the process of combining data from multiple source systems to create unified sets of information for both operational and analytical uses.

  • data lake

    A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications.

  • data lakehouse

    A data lakehouse is a data management architecture that combines the key features and the benefits of a data lake and a data warehouse.

  • data mesh

    Data mesh is a decentralized data management architecture for analytics and data science.

  • data modeling

    Data modeling is the process of creating a simplified visual diagram of a software system and the data elements it contains, using text and symbols to represent the data and how it flows.

  • data observability

    Data observability is a process and set of practices that aim to help data teams understand the overall health of the data in their organization's IT systems.

  • data pipeline

    A data pipeline is a set of network connections and processing steps that moves data from a source system to a target location and transforms it for planned business uses.
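A pipeline's stages can be sketched as composable functions. This hypothetical example extracts rows from CSV text, transforms them, and loads them as JSON lines; the stage names and data are made up.

```python
import csv, io, json

def extract(csv_text):
    # Read source records from CSV text.
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    # Standardize names and convert amounts to numbers.
    return [{"name": r["name"].title(), "amount": float(r["amount"])} for r in rows]

def load(rows):
    # Serialize to the target format (JSON lines here).
    return "\n".join(json.dumps(r) for r in rows)

source = "name,amount\nalice,10.5\nbob,3\n"
result = load(transform(extract(source)))
print(result)
```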

  • data preprocessing

    Data preprocessing, a component of data preparation, describes any type of processing performed on raw data to prepare it for another data processing procedure.

  • data profiling

    Data profiling refers to the process of examining, analyzing, reviewing and summarizing data sets to gain insight into the quality of data.

  • data quality

    Data quality is a measure of a data set's condition based on factors such as accuracy, completeness, consistency, reliability and validity.

  • data stewardship

    Data stewardship is the management and oversight of an organization's data assets to help provide business users with high-quality data that is easily accessible in a consistent manner.

  • data structure

    A data structure is a specialized format for organizing, processing, retrieving and storing data.

  • data transformation

    Data transformation is the process of converting data from one format, such as a database file, XML document or Excel spreadsheet, into another.

  • data virtualization

    Data virtualization is an umbrella term used to describe an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data.

  • data warehouse

    A data warehouse is a repository of data from an organization's operational systems and other sources that supports analytics applications to help drive business decision-making.

  • data warehouse as a service (DWaaS)

    Data warehouse as a service (DWaaS) is an outsourcing model in which a cloud service provider configures and manages the hardware and software resources a data warehouse requires, and the customer provides the data and pays for the managed service.

  • database (DB)

    A database is a collection of information that is organized so that it can be easily accessed, managed and updated.

  • database administrator (DBA)

    A database administrator (DBA) is the information technician responsible for directing and performing all activities related to maintaining and securing a successful database environment.

  • database as a service (DBaaS)

    Database as a service (DBaaS) is a cloud computing managed service offering that provides access to a database without requiring the setup of physical hardware, the installation of software or the need to configure the database.

  • database management system (DBMS)

    A database management system (DBMS) is a software system for creating and managing databases.

  • database replication

    Database replication is the frequent electronic copying of data from a database in one computer or server to a database in another -- so that all users share the same level of information.

  • DataOps

    DataOps is an Agile approach to designing, implementing and maintaining a distributed data architecture that will support a wide range of open source tools and frameworks in production.

  • Db2

    Db2 is a family of database management system (DBMS) products from IBM that serve a number of different operating system (OS) platforms.

  • dimension

    In data warehousing, a dimension is a collection of reference information that supports a measurable event, such as a customer transaction.

  • dimension table

    In data warehousing, a dimension table is a database table that stores attributes describing the facts in a fact table.

  • disambiguation

    Disambiguation is the process of determining a word's meaning -- or sense -- within its specific context.

  • What are data silos and what problems do they cause?

    A data silo is a repository of data that's controlled by one department or business unit and isolated from the rest of an organization, much like grass and grain in a farm silo are closed off from outside elements.

  • What is a data architect?

    A data architect is an IT professional responsible for defining the policies, procedures, models and technologies used in collecting, organizing, storing and accessing company information.

  • What is a data fabric?

    A data fabric is an architecture and software offering a unified collection of data assets, databases and database architectures within an enterprise.

  • What is a data flow diagram (DFD)?

    A data flow diagram (DFD) is a graphical or visual representation that uses a standardized set of symbols and notations to describe a business's operations through data movement.

  • What is a data mart (datamart)?

    A data mart is a repository of data that is designed to serve a particular community of knowledge workers.

  • What is dark data?

    Dark data is digital information an organization collects, processes and stores that is not currently being used for business purposes.

  • What is data activation?

    Data activation is a marketing approach that uses consumer information and data analytics to help companies gain real-time insight into target audience behavior and plan for future marketing initiatives.

  • What is data aggregation?

    Data aggregation is any process whereby data is gathered and expressed in a summary form.
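For example, detail-level sales rows (made up here) can be aggregated into per-region totals with a few lines of Python:

```python
from collections import defaultdict

sales = [("east", 100), ("west", 250), ("east", 175), ("west", 50)]

# Aggregate detail rows into a summary keyed by region.
totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount

print(dict(totals))  # {'east': 275, 'west': 300}
```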

  • What is data architecture? A data management blueprint

    Data architecture is a discipline that documents an organization's data assets, maps how data flows through IT systems and provides a blueprint for managing data, as this guide explains.

  • What is Data as a Service (DaaS)?

    Data as a Service (DaaS) is an information provision and distribution model in which data files (including text, images, sounds, and videos) are made available to customers over a network, typically the Internet.

  • What is data egress? How it works and how to manage costs

    Data egress is when data leaves a closed or private network and is transferred to an external location.

  • What is data governance and why does it matter?

    Data governance is the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal standards and policies that also control data usage.

  • What is data management and why is it important? Full guide

    Data management is the process of ingesting, storing, organizing and maintaining the data created and collected by an organization, as explained in this in-depth guide.

  • What is data management as a service (DMaaS)?

    Data management as a service (DMaaS) is a type of cloud service that provides enterprises with centralized storage for disparate data sources.

  • What is data validation?

    Data validation is the practice of checking the integrity, accuracy and structure of data before it is used for or by one or more business operations.
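A minimal validation sketch with made-up rules: each check returns a description of the problem, and an empty list means the record passed.

```python
import re

def validate(record):
    """Return a list of problems found; an empty list means the record is valid."""
    errors = []
    # Structural check: the email field must look like an address.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")):
        errors.append("malformed email")
    # Range check: age must be an integer within plausible bounds.
    if not isinstance(record.get("age"), int) or not 0 <= record["age"] <= 130:
        errors.append("age out of range")
    return errors

print(validate({"email": "a@b.com", "age": 30}))       # []
print(validate({"email": "not-an-email", "age": -5}))  # two problems
```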

  • What is data?

    In computing, data is information translated into a form that is efficient for movement or processing.

  • What is database normalization?

    Database normalization is intrinsic to most relational database schemas. It is a process that organizes data into tables so that results are always unambiguous.
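A sketch of a normalized design with sqlite3, using made-up customer and order data: customer details are stored once and referenced by key, and a join reassembles the information unambiguously.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: customers stored once; orders reference them by foreign key,
# instead of repeating the customer's name and city on every order row.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    INSERT INTO customers VALUES (1, 'Alice', 'Oslo');
    INSERT INTO orders VALUES (101, 1, 20.0), (102, 1, 35.0);
""")

# A join reassembles the data without ambiguity or duplication.
row = conn.execute("""
    SELECT c.name, SUM(o.total) FROM customers c
    JOIN orders o ON o.customer_id = c.id GROUP BY c.id
""").fetchone()
print(row)  # ('Alice', 55.0)
```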

  • What is denormalization and how does it work?

    Denormalization is the process of adding precomputed redundant data to an otherwise normalized relational database to improve read performance.

  • What is deterministic/probabilistic data?

    Deterministic and probabilistic are opposing terms that describe customer data and how it is collected. Deterministic data is also referred to as first-party data, while probabilistic data is information based on relational patterns and the likelihood of a certain outcome.

  • E

    entity relationship diagram (ERD)

    An entity relationship diagram (ERD), also known as an 'entity relationship model,' is a graphical representation that depicts relationships among people, objects, places, concepts or events in an information technology (IT) system.

  • What is Extract, Load, Transform (ELT)?

    Extract, Load, Transform (ELT) is a data integration process for transferring raw data from a source server to a data system (such as a data warehouse or data lake) on a target server and then preparing the information for downstream uses.

  • F

    fact table

    In data warehousing, a fact table is a database table in a dimensional model. The fact table stores quantitative information for analysis.
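A minimal star-schema sketch in plain Python, with made-up data: the fact table holds measures plus keys into a dimension table, and measures can be rolled up by any dimension attribute.

```python
# Dimension table: descriptive reference attributes, keyed by surrogate key.
date_dim = {1: {"date": "2024-01-01", "quarter": "Q1"},
            2: {"date": "2024-07-01", "quarter": "Q3"}}

# Fact table: quantitative measures plus foreign keys into the dimension.
sales_fact = [
    {"date_key": 1, "amount": 120.0},
    {"date_key": 1, "amount": 80.0},
    {"date_key": 2, "amount": 200.0},
]

# Roll up the measures by a dimension attribute.
q1_total = sum(f["amount"] for f in sales_fact
               if date_dim[f["date_key"]]["quarter"] == "Q1")
print(q1_total)  # 200.0
```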

  • flat file

    A flat file is a collection of data stored in a two-dimensional database in which similar yet discrete strings of information are stored as records in a table.

  • What is feature engineering?

    Feature engineering is the process that takes raw data and transforms it into features that can be used to create a predictive model using machine learning or statistical modeling, such as deep learning.

  • G

    Google BigQuery

    Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets.

  • Google Bigtable

    Google Bigtable is a distributed, column-oriented data store created by Google Inc. to handle very large amounts of structured data associated with the company's Internet search and Web services operations.

  • H

    Hadoop

    Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers.

  • Hadoop data lake

    A Hadoop data lake is a data management platform comprising one or more Hadoop clusters.

  • Hadoop Distributed File System (HDFS)

    The Hadoop Distributed File System (HDFS) is the primary data storage system Hadoop applications use.

  • hashing

    Hashing is the process of transforming any given key or a string of characters into another value.
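Python's hashlib shows the key properties: the output has a fixed size regardless of input length, the same input always produces the same hash, and a tiny change to the input produces a different one.

```python
import hashlib

# Hashing maps an arbitrary-length key to a fixed-size value.
digest = hashlib.sha256(b"data warehouse").hexdigest()
print(len(digest))  # 64 hex characters for SHA-256

# Deterministic: the same input always yields the same hash.
assert hashlib.sha256(b"data warehouse").hexdigest() == digest
# A one-character change in the input yields a different hash.
assert hashlib.sha256(b"Data warehouse").hexdigest() != digest
```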

  • I

    information

    Information is the output that results from analyzing, contextualizing, structuring, interpreting or in other ways processing data.

  • M

    MariaDB

    MariaDB is an open source relational database management system (DBMS) that is a compatible drop-in replacement for the widely used MySQL database technology.

  • master data

    Master data is the core data that is essential to operations in a specific business or business unit.

  • Microsoft SQL Server

    Microsoft SQL Server is a relational database management system (RDBMS) that supports a wide variety of transaction processing, business intelligence (BI) and data analytics applications in corporate IT environments.

  • Microsoft SQL Server Management Studio (SSMS)

    Microsoft SQL Server Management Studio (SSMS) is an integrated environment to manage a SQL Server infrastructure.

  • Microsoft SSIS (SQL Server Integration Services)

    Microsoft SSIS (SQL Server Integration Services) is an enterprise data integration, data transformation and data migration tool built into Microsoft's SQL Server database.

  • MongoDB

    MongoDB is an open source NoSQL database management program.

  • MPP database (massively parallel processing database)

    An MPP database is a database optimized for massively parallel processing, in which many operations are performed simultaneously by many processing units.

  • What is a multimodel database?

    A multimodel database is a data processing platform that supports multiple data models, which define the parameters for how the information in a database is organized and arranged.

  • What is master data management (MDM)?

    Master data management (MDM) is a process that creates a uniform set of data on customers, products, suppliers and other business entities from different IT systems.

  • What is Microsoft Visual FoxPro (VFP)?

    Microsoft Visual FoxPro (VFP) is an object-oriented programming (OOP) environment with a built-in relational database engine.

  • N

    What is NoSQL (Not Only SQL database)?

    NoSQL is an approach to database management that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats.

  • O

    OLAP (online analytical processing)

    OLAP (online analytical processing) is a computing method that enables users to easily and selectively extract and query data in order to analyze it from different points of view.

  • OPAC (Online Public Access Catalog)

    An OPAC (Online Public Access Catalog) is an online bibliography of a library collection that is available to the public.

  • P

    primary key (primary keyword)

    A primary key, also called a primary keyword, is a column in a relational database table whose value uniquely identifies each record.
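The uniqueness guarantee is enforced by the database itself. A sketch with sqlite3 and a made-up products table: inserting a second row with the same key value is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO products VALUES ('A-1', 'widget')")

# The primary key constraint rejects a second record with the same key.
try:
    conn.execute("INSERT INTO products VALUES ('A-1', 'gadget')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False
```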

  • What is a pivot table?

    A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows of data in a spreadsheet or database table to obtain a desired report.
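The reorganization can be sketched in plain Python with made-up order data: one column's values (year) become the rows, another's (region) become the columns, and a measure (amount) is summed in each cell.

```python
from collections import defaultdict

orders = [
    ("2023", "east", 100), ("2023", "west", 200),
    ("2024", "east", 150), ("2024", "west", 50),
]

# Pivot: years become rows, regions become columns, amounts are summed per cell.
pivot = defaultdict(lambda: defaultdict(int))
for year, region, amount in orders:
    pivot[year][region] += amount

print(pivot["2024"]["east"])  # 150
```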

  • Q

    query

    A query is a question or a request for information expressed in a formal manner.

  • R

    raw data (source data or atomic data)

    Raw data is the data originally generated by a system, device or operation that has not been processed or changed in any way.

  • RDBMS (relational database management system)

    A relational database management system (RDBMS) is a collection of programs and capabilities that enable IT teams and others to create, update, administer and otherwise interact with a relational database.