REPORT: The NPU – The Newest Chip on the Block

May 19, 2024 / Ben Bajarin and Max Weinbach

The New Chip on the Block

The Neural Processing Unit (NPU) has evolved significantly since the introduction of deep learning models like AlexNet in 2012. NPUs are specialized hardware accelerators designed to efficiently process neural network operations, such as the numerous multiplies and accumulations required for deep learning models. These units offer optimized control and arithmetic logic targeting extensive computing operations, making them particularly adept at handling the computational complexity and large datasets characteristic of machine learning and AI applications.
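To make that concrete, the sketch below (illustrative Python/NumPy, not any vendor's implementation) spells out the multiply-accumulate (MAC) pattern behind a single dense neural-network layer. An NPU's datapath is essentially built to run enormous numbers of these MACs in parallel, at low precision and low power.

```python
# Illustrative sketch only: the multiply-accumulate (MAC) pattern at the
# heart of a dense neural-network layer. An NPU's job is to execute
# millions of these MACs per inference efficiently.
import numpy as np

def dense_layer(x, weights, bias):
    """y[j] = sum_i x[i] * weights[i, j] + bias[j] -- one MAC per (i, j) pair."""
    out = np.zeros(weights.shape[1], dtype=np.float32)
    for j in range(weights.shape[1]):
        acc = bias[j]
        for i in range(x.shape[0]):
            acc += x[i] * weights[i, j]   # multiply, then accumulate
        out[j] = acc
    return out

# Even a modest 256-in, 128-out layer requires 256 * 128 = 32,768 MACs.
x = np.random.rand(256).astype(np.float32)
w = np.random.rand(256, 128).astype(np.float32)
b = np.zeros(128, dtype=np.float32)
y = dense_layer(x, w, b)
print(y.shape)  # (128,)
```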

Neural networks, the foundation of modern AI, excel at identifying patterns within datasets, often outperforming human capabilities even with incomplete or noisy data. Initially, neural network models were implemented on Graphics Processing Units (GPUs), which provided a robust platform for training and early prototyping. However, the shift towards NPUs was driven by the need for more area and power-efficient solutions suitable for high-volume, cost-sensitive applications like mobile devices. NPUs offer a balance between being programmable for flexibility and optimized for the intensive mathematics of neural networks.

Apple first introduced the Neural Engine in the A11 Bionic chip that debuted in the iPhone 8, iPhone 8 Plus, and iPhone X in September 2017. This first-generation Neural Engine could perform up to 600 billion operations per second and was used for Face ID, Animoji, and other machine learning tasks. Since then, Apple has included an increasingly powerful Neural Engine in each generation of its A-series chips for iPhone and iPad, as well as its M-series chips for Mac.

In 2017, Qualcomm started integrating dedicated NPU hardware into the Hexagon DSP, with the first Hexagon Tensor Accelerator appearing in the Snapdragon 845 in 2018. Subsequent generations have significantly boosted AI performance. 

While Apple and Qualcomm have been integrating NPUs for some time, the rest of the industry is now taking the trend seriously: all new client-device SoCs from Intel and AMD are being introduced with NPUs dedicated to AI acceleration.

Why the Need for an NPU? Isn’t the CPU or GPU Good Enough?

While CPUs excel at general-purpose computing and GPUs have proven valuable for parallel processing, they fall short of the power and thermal constraints of AI workloads running on mobile devices. Below we look at some of the high-level differences between CPUs and GPUs that help set the stage for what makes the NPU architecture unique.

CPU Cores – Versatile but Limited for AI: CPUs have been the backbone of general-purpose computing for years. They are designed to handle a wide range of tasks, from simple arithmetic operations to complex decision-making processes. CPU cores are optimized for sequential processing, focusing on executing a broad range of instructions with high accuracy and reliability. This versatility makes CPUs suitable for a variety of applications, from everyday computing to running operating systems and complex software.

However, when it comes to AI and ML workloads, CPUs face limitations. While they can execute AI algorithms, they lack the specialized hardware necessary for efficient parallel processing, which is crucial for handling the massive amounts of data and computation involved in AI and ML tasks. As a result, CPUs may struggle to keep pace with the demands of modern AI applications. For example, benchmarks of LLM AI workloads on the CPU consistently show a much slower time to first token (TTFT) and far fewer tokens per second (TPS), while visual workloads (LVMs or computer vision models) show lower frames per second (FPS).
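For context, TTFT and TPS are simple wall-clock measurements. The sketch below shows one common way to compute them; `generate_stream` is a hypothetical stand-in for whatever runtime is streaming tokens back, not a specific library API.

```python
# A minimal sketch of how TTFT and TPS are typically measured.
# `generate_stream` is a hypothetical callable that yields tokens one at a
# time; assumes at least two tokens are produced.
import time

def benchmark_llm(generate_stream, prompt):
    start = time.perf_counter()
    first_token_time = None
    token_count = 0
    for _token in generate_stream(prompt):        # tokens arrive as a stream
        if first_token_time is None:
            first_token_time = time.perf_counter()  # moment the first token lands
        token_count += 1
    end = time.perf_counter()

    ttft = first_token_time - start                      # seconds to first token
    tps = (token_count - 1) / (end - first_token_time)   # steady-state tokens/sec
    return ttft, tps
```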

GPU Cores – Parallel Processing Powerhouses: GPUs, originally designed for rendering graphics, have found a new purpose in the era of AI and ML. Their architecture is optimized for parallel processing, allowing them to handle multiple tasks simultaneously. GPUs excel at matrix and vector computations, which are prevalent in deep learning algorithms. By leveraging their large number of cores and high memory bandwidth, GPUs can significantly accelerate the training and inference of deep neural networks.
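As a rough illustration of why this architecture maps so well to deep learning, the snippet below (assuming PyTorch is available) expresses a transformer-style layer as a single large matrix multiply. Every output element can be computed independently, so the GPU spreads the work across thousands of cores in one kernel launch.

```python
# Rough illustration (assumes PyTorch is installed): a deep-learning layer
# reduced to one large matrix multiply, the kind of operation GPUs
# parallelize across thousands of cores.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

activations = torch.randn(64, 4096, device=device)   # batch of 64 token vectors
weights = torch.randn(4096, 4096, device=device)     # one transformer-style weight matrix

# ~64 * 4096 * 4096 ≈ 1.07 billion MACs, dispatched as a single parallel kernel
outputs = activations @ weights
print(outputs.shape)  # torch.Size([64, 4096])
```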

However, GPUs are primarily oriented towards high-throughput floating-point operations, which can be more than certain AI tasks require. This focus on floating-point precision can mean suboptimal efficiency for AI workloads that tolerate lower-precision arithmetic. Additionally, GPUs may not be as power-efficient as specialized AI accelerators, which is a concern for edge devices and resource-constrained environments. The GPU will remain a relevant part of AI workloads, but our view is that specific workloads will be better suited to the NPU. GPU benchmarks continue to yield fast TTFT and TPS, but that comes at the price of thermals: the spike in performance on AI workloads also pushes the GPU to much higher wattages. This is why we feel the GPU will still play a role in more visually oriented AI workloads and those that benefit from a burst of performance, rather than AI workloads that require persistent inference running in the background on device.
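To illustrate what lower-precision arithmetic buys, the sketch below (plain NumPy, a simplified symmetric scheme rather than any production quantizer) converts float32 weights to int8: memory traffic drops 4x, and the integer MACs that NPUs favor are far cheaper in silicon area and energy than float32 MACs.

```python
# Simplified sketch of int8 quantization, the kind of lower-precision
# arithmetic NPUs are built around. Not a production quantization scheme.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                          # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

# Dequantized values stay close to the originals for inference purposes
error = np.abs(w - q.astype(np.float32) * scale).max()
print(q.nbytes / w.nbytes, error)   # 0.25 (4x smaller), small max error
```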

