REPORT: The NPU Wattage Advantage

May 20, 2024 / Max Weinbach and Ben Bajarin

While all compute cores are capable of handling AI workloads, the NPU stands out because its architecture allows it to run these workloads at significantly lower wattage than CPU and GPU cores. NPUs include advanced power management techniques, such as dynamic voltage and frequency scaling (DVFS), which adjust power consumption to match workload requirements, so the NPU can run at lower power when full performance is not needed. Ultimately, this scaling is what gives the NPU a greater TOPS-per-watt advantage over the other blocks on the SoC.

Given the controversy over TOPS, and our acknowledgement that it is not effective for measuring real-world workloads on client devices, especially mobile devices, our benchmark methodology measures tokens per Joule (TPJ). The benefits of using this metric for local AI workloads are:

  • Tokens per Joule (TPJ) accounts for the total energy used to process a certain number of tokens, which provides a more comprehensive view of energy efficiency over the entire duration of the workload.
  • When running AI workloads, the total energy cost (measured in joules) is more relevant for understanding the impact on battery life for mobile devices or overall energy costs for stationary systems.
  • TPJ takes into account the time it takes to complete the workload, giving a more accurate representation of the total energy consumption.
  • TPJ gives a clearer picture of how long a device can sustain a workload before depleting its energy resources.
  • While TPJ is a good metric for transformer models like SLMs/LLMs, which will be relevant in the near future, Stable Diffusion and other diffusion models do not use tokens as a unit of generation. For those workloads we instead report Joules of energy consumed per image generated, as that better reflects the overall energy resources this AI workload needs (see the sketch after this list).
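
As a concrete illustration of these two metrics, here is a minimal sketch in Python. All sample numbers are hypothetical placeholders, not measurements from our runs:

```python
# Minimal sketch of the two efficiency metrics used in this report.
# All sample numbers below are hypothetical placeholders, not measured data.

def energy_joules(avg_power_watts: float, seconds: float) -> float:
    """Energy (Joules) = Power (Watts) * Time (Seconds)."""
    return avg_power_watts * seconds

def tokens_per_joule(tokens: int, joules: float) -> float:
    """TPJ: tokens an SLM/LLM produces per joule of energy consumed."""
    return tokens / joules

def joules_per_image(joules: float, images: int = 1) -> float:
    """Diffusion models have no token count, so we report Joules per image."""
    return joules / images

# Hypothetical example: a 4 W average draw over a 20-second generation.
e = energy_joules(4.0, 20.0)       # 80 J
print(tokens_per_joule(512, e))    # 6.4 tokens per Joule for a 512-token reply
print(joules_per_image(e))         # 80 J for a single generated image
```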

For our primary test, we ran Stable Diffusion V2.1 across 25 runs and normalized for averages. Below are our results; the lower the number, the better the score.

On the M3 MacBook Air (8-core CPU, 10-core GPU, 16GB RAM), we see an average of 87.63 Joules used per image generated, where Energy (Joules) = Average Power of NPU + CPU (Watts) × Time (Seconds).¹

For the Snapdragon X Elite system, we used a prototype 15-inch Surface Laptop with 16GB RAM and the X1E78100 SKU of the Snapdragon X Elite. We see an average of 41.23 Joules used per image generated, where Energy (Joules) = Average NPU Power (Watts) × Time (Seconds).²
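
For reference, here is a minimal sketch of the two per-platform energy formulas described in the footnotes. The function names and all input values are illustrative assumptions, not our tooling or our measurements:

```python
# Sketch of the per-platform energy calculations described in the footnotes.
# All wattage and timing inputs below are illustrative, not our measurements.

def m3_energy(cpu_watts: float, cpu_idle_watts: float,
              ane_watts: float, seconds: float) -> float:
    """M3/CoreML: idle-corrected CPU power plus ANE (NPU) power, times time."""
    return ((cpu_watts - cpu_idle_watts) + ane_watts) * seconds

def x_elite_energy(npu_watts: float, seconds: float) -> float:
    """Snapdragon X Elite: the workload ran entirely on the Hexagon NPU."""
    return npu_watts * seconds

# Illustrative inputs only:
print(m3_energy(cpu_watts=1.5, cpu_idle_watts=0.5, ane_watts=3.2, seconds=21.0))
print(x_elite_energy(npu_watts=2.3, seconds=17.6))
```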


Analysis

  • Energy Efficiency: The Snapdragon X Elite system is more energy-efficient, using only 41.23 Joules per image compared to the M3 MacBook Air’s 87.63 Joules. This indicates that the Snapdragon X Elite system, running the task on the NPU, uses less than half the energy of the M3 MacBook Air for generating an image.
  • Time Efficiency: The Snapdragon X Elite system also completes the task faster, with an average time of 17.59 seconds per image, compared to 20.89 seconds for the M3 MacBook Air. The arithmetic behind both bullets is spelled out below.
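
Spelled out with the reported averages, the two bullets reduce to simple ratios:

```python
# The arithmetic behind the two analysis bullets, using the reported averages.
m3_joules, elite_joules = 87.63, 41.23   # average energy per image (J)
m3_secs, elite_secs = 20.89, 17.59       # average time per image (s)
print(elite_joules / m3_joules)          # ~0.47: less than half the energy
print(m3_secs - elite_secs)              # ~3.3 s faster per image
```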


Run        M3 CoreML Energy (J)   M3 CoreML Time (s)   Snapdragon X Elite Energy (J)   Snapdragon X Elite Time (s)
Run 1      108.13                 20.51                41.33                           18.05
Run 2      92.11                  20.31                41.89                           17.90
Run 3      92.45                  20.29                40.66                           17.27
…          (runs 4–23 not shown)
Run 24     90.05                  20.34                41.91                           18.22
Run 25     90.71                  20.26                41.52                           17.67
Averages   87.63 J                20.89 s              41.23 J                         17.59 s

Additional averages (M3): CPU power 0.99 Watts; ANE power 3.23 Watts.
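
As a rough cross-check on the table, average power times average time should approximate the average energy. This assumes the reported 0.99 W CPU figure is already corrected for idle draw:

```python
# Cross-check: (CPU + ANE average power) * average time ≈ average energy.
# Assumes the reported 0.99 W CPU figure is already idle-corrected.
cpu_watts, ane_watts = 0.99, 3.23   # average rail power from the table (W)
avg_seconds = 20.89                 # average M3 generation time (s)
print((cpu_watts + ane_watts) * avg_seconds)  # ~88.2 J vs. 87.63 J reported
```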


While Stable Diffusion is only one example of an AI workload, it is an extremely compute-intensive one, which makes it a useful test for showing the efficiency of these workloads when they are offloaded to the NPU.

Currently, Stable Diffusion is the best example of how efficient the NPU can be compared to other blocks of the SoC. As developers build more applications to take advantage of the NPU in their AI workloads, we can show this advantage using TPJ or total joules per generation in other applications. 


  • 1
    On macOS and Apple Silicon, the stack that allows ML models to run on the NPU is Apple CoreML. Unlike Microsoft’s DirectML or Qualcomm’s QNN, there is no way to run a model directly on the NPU alone. The CoreML runtime will utilize two blocks of the SoC; in our tests of Stable Diffusion 2.1 (8-bit quantized), these happened to be the CPU and NPU, so we calculated the Joules of energy used per generation by adding the idle-corrected CPU wattage to the NPU wattage and multiplying by the seconds per generation.
  • 2
    The Stable Diffusion 2.1 demo was optimized for the Qualcomm Hexagon NPU, and the entire workload ran 100% on the NPU.
