tensor processing unit (TPU)

What is a tensor processing unit (TPU)?

A tensor processing unit (TPU) is an application-specific integrated circuit (ASIC) specifically designed to accelerate high-volume mathematical and logical processing tasks typically involved with machine learning (ML) workloads.

Google designed the TPU and began using it for in-house neural network ML projects as early as 2015, together with its custom TensorFlow software. Google released the TPU for third-party use in 2018. Today, the evolving TPU chips and the TensorFlow software framework are ML infrastructure mainstays, including on Google Cloud Platform (GCP).

How do TPUs work?

TPUs provide a limited set of features and functions that are directly useful for ML and artificial intelligence (AI) tasks but not necessarily for everyday general computing. ML models and the AI platforms that use them, such as deep learning and neural networks, require extensive mathematical processing. While it's possible to execute these tasks on ordinary central processing units (CPUs) or more advanced graphics processing units (GPUs), neither is optimized for them.

Just as GPUs arose to speed the math processing required for gaming and data visualization, TPUs accelerate the mathematical tasks used for neural networks and other ML models -- chiefly multiply-and-accumulate operations, in which values are multiplied and the products added to a running total.

A TPU employs one or more large arrays of multiply-and-accumulate arithmetic logic units (ALUs) configured as a matrix. This matrix processing solves extensive mathematical tasks much faster and with far lower power consumption than more traditional processor types. In short, a TPU takes input data, breaks the data into smaller pieces called vectors, performs multiplication and addition on many vectors simultaneously and in parallel, and then delivers the output to the ML model.
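
To make this concrete, here is a minimal sketch in Python of the multiply-and-accumulate math described above. The array sizes and variable names are illustrative only, not a model of any particular TPU generation:

    import numpy as np

    # Illustrative sizes only; real MXU dimensions vary by TPU generation.
    inputs = np.random.rand(128)        # one input vector
    weights = np.random.rand(128, 128)  # a weight matrix

    # Scalar view: each output element is a running sum of products.
    acc = np.zeros(128)
    for i in range(128):
        for j in range(128):
            acc[i] += weights[i, j] * inputs[j]  # multiply, then accumulate

    # Vectorized view: the same result as one matrix-vector product,
    # which is what the TPU's ALU array computes in parallel.
    assert np.allclose(acc, weights @ inputs)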

Recent TPU designs automatically adjust performance depending on the supported application type. TPUs also handle low-level dataflow graphs and tackle sophisticated graph calculations that tax traditional CPUs and GPUs. TPUs support 16-bit floating point operations and use high-bandwidth memory; late-model TPUv5p chips list a memory bandwidth of 2,765 GBps.
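
As an aside on that 16-bit support, the following minimal Python sketch casts values to TensorFlow's bfloat16 type, a reduced-precision 16-bit format commonly used on TPUs; it assumes only a working TensorFlow installation and no TPU hardware:

    import tensorflow as tf

    # bfloat16 keeps float32's exponent range while halving the memory
    # and bandwidth each value consumes -- the point of 16-bit math here.
    x = tf.constant([1.5, 2.25, 3.125], dtype=tf.float32)
    x16 = tf.cast(x, tf.bfloat16)  # store and move values in 16 bits

    print(x16.dtype)                 # <dtype: 'bfloat16'>
    print(tf.cast(x16, tf.float32))  # cast back up for wider accumulation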

How do TPUs compare to GPUs and CPUs?

Every processor does the same fundamental job: execute a set of instructions that move and operate on data. Performing a task directly in hardware is fast and efficient; a processor typically completes a hardware-supported task within a few billionths of a second.

However, if a processor is not designed or optimized to perform a certain task, that task becomes difficult, even impossible, to perform in hardware. Instead, software must use the processor's available instruction set to emulate the intended function. Unfortunately, software emulation almost always results in poor or inefficient performance because the processor needs far more time to accomplish far more work.

Virtualization, for example, requires continuous translation between physical and virtual hardware resources. Early virtualization software used emulation to process these translations, severely limiting the performance and number of virtual machines (VMs) a computer could support. When processor designers added virtualization instruction sets to modern processors, performance improved dramatically, allowing computers to handle many VMs simultaneously at near-native speeds. Processors are often tailored and updated in this way to handle new processing problems.

Conversely, a processor is sometimes selected for its simplicity or suitability. Consider an automatic coffee maker. While programmable, it needs only a small subset of processor-type instructions to function, so a general-purpose processor would be wasteful and expensive. An ASIC instead provides a stripped-down chip, allowing much faster performance and far lower power demands.

Ultimately, the correct CPU, GPU or TPU is the one that's best suited for the computing problem at hand.

CPU

The CPU is a general-purpose device designed to support more than 1,500 different instructions in hardware, or on chip. Several processing units, or cores, might be incorporated into the same processor package that plugs into the computer's motherboard.

CPUs process instructions and data one at a time along an internal pipeline. Pipelining speeds up individual operations but limits the number that can execute simultaneously. CPUs can indeed support many ML models and are best applied when the model has the following properties:

  • Requires high flexibility or is expected to change, which is common in prototype models.
  • Does not demand significant training time.
  • Is small and uses small batches to perform training.
  • Uses limited system input/output (I/O) and network bandwidth.

GPU

The GPU provides high levels of parallel processing and supports detailed mathematical tasks that general-purpose CPUs cannot handle without emulation. Such characteristics are typically useful for visualization applications, including computer games, math-intensive software, and three-dimensional (3D) rendering tools, such as AutoCAD. Because GPUs typically lack the general-purpose instructions needed to run a system on their own, they are paired with CPUs in the same computer.

Yet the GPU is not simply a CPU with more instructions. Instead, it's a fundamentally different approach to solving specific computing problems. The limited number of functions performed by a GPU means each core is far smaller, but its highly parallel architecture allows thousands of cores to manage massive parallel computing tasks and high data throughput. Still, the GPU cannot multitask well, and it generally has limited memory access.

GPUs are well suited to many demanding ML models and are best employed when the model has the following properties:

  • Is not flexible or is unlikely to change significantly.
  • Uses operations that are specific to a GPU.
  • Is medium or large in size and requires larger batch sizes for training, where high parallelism is beneficial.

TPU

The TPU is much closer to a pure ASIC, providing a limited set of math functions, primarily matrix processing, expressly intended for ML tasks. A TPU is noted for the high throughput and parallelism normally associated with GPUs, but taken to extremes in its design.

Typical TPU chips contain one or more TensorCores, each employing matrix-multiply units (MXUs), a vector unit and a scalar unit. Each MXU incorporates a 128 x 128 array of multiply-accumulate ALUs, so an MXU performs 16,384 multiply-accumulate operations per clock cycle using floating point math.
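
The back-of-the-envelope arithmetic behind those figures is worth spelling out. In the short Python sketch below, the 128 x 128 array size comes from the paragraph above, while the clock rate is a hypothetical placeholder, since per-generation clock speeds are not given here:

    # The MAC count follows directly from the array dimensions above.
    macs_per_cycle = 128 * 128             # 16,384 multiply-accumulates per cycle
    flops_per_cycle = macs_per_cycle * 2   # each MAC = one multiply + one add

    clock_hz = 940e6                       # hypothetical ~940 MHz, for illustration
    peak_flops = flops_per_cycle * clock_hz
    print(f"{peak_flops / 1e12:.1f} TFLOPS per MXU")  # ~30.8 TFLOPS at this clock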

TPUs are purpose-built chips ideally suited for ML models with the following properties:

  • Rely primarily on matrix processing.
  • Do not rely on custom TensorFlow or other custom training operations.
  • Require extended periods -- weeks or even months -- to finish training.
  • Use large batch sizes for training, as convolutional neural networks do.

What are the best use cases and applications for TPUs?

As with any type of processor, a TPU chip does nothing without software capable of employing its functions. TensorFlow provides the framework that delivers data to the TPU and returns results to the associated ML models (a usage sketch follows the list below). TPUs are used in a variety of tasks, the most popular of which include the following:

  • Machine learning. The TPU's matrix processing prowess vastly accelerates algorithms and ML models commonly associated with demanding tasks, such as natural language processing, image recognition and speech recognition.
  • Data analytics. TPUs are primarily suited for math tasks that use matrix processing. Any data analytics or other data processing that includes matrix processing, whether or not it is related to ML or AI projects, benefits from TPUs.
  • Edge computing. Edge computing is attractive when data must be processed at or near the data source, such as a factory or autonomous vehicle. TPUs are worthwhile when an edge computing environment must support high-throughput matrix processing, such as training or updating ML models in the field using real-time IoT device data.
  • Cloud computing. TPUs are used as the foundation of Google TensorFlow cloud services applied to ML and AI workloads in the GCP, including chatbots, recommendation engines, code generation platforms, generative AI systems, speech generation, computer vision and other ML/AI projects.
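
For a sense of how that software side looks in practice, the sketch below shows TensorFlow's standard TPUStrategy pattern for placing a Keras model on a TPU. It assumes a Google Cloud TPU VM with an attached TPU; the two-layer model is a placeholder, and on machines without a TPU the resolver lookup will fail:

    import tensorflow as tf

    # Locate and initialize the attached TPU (Cloud TPU VM environment).
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    # Variables created inside the strategy scope are placed on TPU cores.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )

    # model.fit(train_dataset) would then run each training step on the TPU.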

History of TPU hardware and product development

TPUs are proprietary ASIC devices developed by Google and used in GCP data centers since 2015. The chips support Google's TensorFlow symbolic math software platform and other ML tasks involving matrix mathematics. Google also produces TPUs for commercial use and makes TPU-based services available through GCP.

Google TPUs have spanned five major generations since their introduction. The latest v5 TPU is available as both a low-power v5e economy model and a full-performance v5p model.

Feature | TPUv1 | TPUv2 | TPUv3 | TPUv4 | TPUv5e (economy) | TPUv5p (performance)
Year introduced | 2016 | 2017 | 2018 | 2021 | 2023 | 2023
Performance (floating point) | 23 TFLOPS | 45 TFLOPS | 123 TFLOPS | 275 TFLOPS | 197 TFLOPS | 459 TFLOPS
Memory capacity | 8 GB | 16 GB | 32 GB | 32 GB | 16 GB | 95 GB
Memory bandwidth | 34 GBps | 600 GBps | 900 GBps | 1,200 GBps | 819 GBps | 2,765 GBps
Chips per pod | unspecified | 256 | 1,024 | 4,096 | 256 | 8,960

Editor's note: This data is from Google.

This was last updated in July 2024
