Geekbench AI and the State of the NPU
Key Takeaways:
- Geekbench AI 1.0 is a new cross-platform benchmark suite designed for machine learning, deep learning, and AI-centric workloads.
- It provides three main scores: Single Precision, Half Precision, and Quantized, reflecting different precision levels used in AI tasks.
- The benchmark includes both computer vision and natural language processing workloads.
- Accuracy measurements are incorporated alongside performance metrics for each test.
- Geekbench AI supports various AI frameworks across different platforms, including OpenVINO, ONNX, and vendor-specific TensorFlow Lite Delegates.
- The benchmark uses extensive datasets to better reflect real-world AI use cases.
- All workloads run for a minimum of one second to account for performance tuning and real-world usage patterns.
What’s Important:
- The benchmark aims to provide a standardized way to measure AI performance across different devices and platforms.
- It considers both speed and accuracy, recognizing that AI performance isn’t just about how quickly a task is completed.
- The inclusion of different precision levels (single, half, and quantized) allows for a more comprehensive evaluation of AI hardware capabilities.
- Geekbench AI’s workloads are designed to reflect real-world applications of AI, making the benchmark results more relevant to actual use cases.
- The benchmark’s support for various AI frameworks makes it adaptable to different development ecosystems.
- The longer runtime for each test (minimum 1 second) helps to better capture sustained performance, which is crucial for real-world AI applications.
- Geekbench AI is designed to evolve with the rapidly changing AI landscape, with plans for regular updates to keep pace with new developments in the field.
- The benchmark is already being used by major tech companies, indicating its potential to become an industry standard for AI performance measurement.
Inference Workloads:
- Image Classification (using MobileNetV1)
- Image Segmentation (using DeepLabV3+ with MobileNetV2 backbone)
- Pose Estimation (using OpenPoseV2 with VGG19 backbone)
- Object Detection (using SSD with MobileNetV1 backbone)
- Face Detection (using RetinaFace with MobileNetV2 backbone)
- Depth Estimation (using ConvNets with EfficientNet-Lite3 backbone)
- Image Super Resolution (using Residual Feature Distillation Network – RFDN)
- Style Transfer (using Fast Real-Time Style Transfer approach)
- Text Classification (using Compressed BERT – BERT-Tiny)
- Machine Translation (using Transformer architecture)
Let’s put this simply: Geekbench AI v1.0 was just announced by Primate Labs, and it is on track to become one of the best standardized benchmarks for AI. Geekbench AI is a torture test for the NPU, exercising not only the silicon but also the software stack around it that allows models to run. The benchmark reports three precision options: full precision fp32, half precision fp16, and quantized int8. At the risk of spoiling some of the results below, the only usable data types for these devices are full precision and half precision; the quantized int8 models lose too much quality to be useful. The int8 numbers may look impressive, but those quantized models aren’t usable in production.
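To make the precision tradeoff concrete, here is a minimal sketch, in plain NumPy, of what fp16 and int8 storage do to a tensor of small weights. The tensor size and distribution are made up for illustration, and real quantizers (per-channel scales, calibration data) are more sophisticated than this single-scale example.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)  # stand-in for a layer's weights

# Half precision: a straight cast; roughly 3 decimal digits of mantissa survive.
fp16_roundtrip = weights.astype(np.float16).astype(np.float32)

# int8: symmetric per-tensor quantization with a single scale and a zero-point of 0.
scale = np.abs(weights).max() / 127.0
int8_codes = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
int8_roundtrip = int8_codes.astype(np.float32) * scale

for name, approx in (("fp16", fp16_roundtrip), ("int8", int8_roundtrip)):
    err = np.abs(weights - approx)
    print(f"{name}: mean abs error {err.mean():.2e}, max abs error {err.max():.2e}")
```

The point isn’t the exact numbers; it’s that the int8 round-trip error is orders of magnitude larger, which is exactly the quality loss that shows up in the benchmark’s accuracy scores.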
Geekbench AI v1.0 focuses on smaller models covering very common system- and application-level tasks, listed above. If you’re Instagram or Snapchat looking to do real-time filters, you’ll use the models above. If you’re a developer trying to index photos for editing software or build on-device search, you’ll reach for the text classification and machine translation models, again from the list above. Each of these is more or less the industry-standard open-source model for its task, and the benchmark follows each platform’s best practices for inference to get the best performance and efficiency.
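As a rough illustration of how lightweight these workloads are to integrate, here is a hedged sketch of running an image-classification model of this class through the TensorFlow Lite interpreter. The model file name, input shape, and preprocessing are placeholders, not the exact assets Geekbench AI ships.

```python
import numpy as np
import tensorflow as tf

# Load a MobileNetV1-class image classifier (placeholder file name).
interpreter = tf.lite.Interpreter(model_path="mobilenet_v1.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Dummy NHWC image; a real app would feed a decoded, resized, normalized photo.
image = np.random.random_sample(input_details["shape"]).astype(input_details["dtype"])

interpreter.set_tensor(input_details["index"], image)
interpreter.invoke()
scores = interpreter.get_tensor(output_details["index"])[0]
print("top-5 class ids:", np.argsort(scores)[-5:][::-1])
```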
GenAI is noticeably missing from this list, mostly because there really is no set of best practices for inference yet. Most implementations, like Apple Intelligence and Gemini Nano, are proprietary, and they expose some LLM access to developers, which makes it easier to just use those. There are also memory requirements, and not every device can run these models. For now, local GenAI is just too up in the air to benchmark fairly or properly.
One of the most important things to note with Geekbench AI is that all of these performance scores are a snapshot in time. Just as GPU drivers improve game performance on PCs, NPU compiler and runtime updates will improve performance for AI models. How models are mapped onto the NPU, the data types and conversion layers used (for example, running floating-point calculations on silicon designed for integer data types), and further optimization of the AI inference engines can all raise scores, so a simple software update can improve performance across models and hardware.
Another point to mention is how these updates happen. On Windows using ONNX and DirectML/QNN, both the applications and the models need to be updated and recompiled against the latest releases of ONNX, QNN, or DirectML to see improved performance. It is worth noting that right now there is a bug with ONNX on Windows that is causing decreased and inconsistent performance on Snapdragon X Elite machines. While this will likely be fixed soon, this is simply the reality for developers looking to use edge models: performance may not be great now, and it may be worth looking into different APIs or optimizing for specific platforms. Qualcomm gives you the option to run QNN-compiled models or ONNX, so while ONNX may be cross-platform and support GPU, CPU, and NPU across AMD, Intel, Nvidia, and Qualcomm, it might be worth using Intel’s OpenVINO, Nvidia’s CUDA, or Qualcomm’s QNN until ONNX is stable enough to provide consistent performance.
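For a sense of what "choosing the backend" looks like in practice, here is a hedged sketch of selecting an ONNX Runtime execution provider on Windows. Which providers actually appear depends on the ONNX Runtime build installed (onnxruntime-directml, onnxruntime-qnn, and so on), and the model path and QNN option shown are assumptions for illustration.

```python
import onnxruntime as ort

available = ort.get_available_providers()
print("available providers:", available)

# Prefer the NPU via QNN, then DirectML on the GPU, then the CPU fallback.
preferred = [
    ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),  # assumed option for the Hexagon backend
    ("DmlExecutionProvider", {}),
    ("CPUExecutionProvider", {}),
]
providers = [(name, opts) for name, opts in preferred if name in available]

session = ort.InferenceSession("model.onnx", providers=providers)  # placeholder model
print("session is running on:", session.get_providers())
```

The fallback list is the whole point: when a provider is buggy or missing, the same application code silently lands on a slower backend, which is how the inconsistent Snapdragon X Elite numbers show up without the app changing at all.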
On Android, it’s a little up in the air. Google recently announced the transition from NNAPI (the Neural Networks API) for AI model acceleration to a new TensorFlow Lite in Play Services architecture, and it’s yet to be seen how this will work. At the moment, the new hardware acceleration service in TensorFlow Lite only supports CPU and GPU, so there is no way to use the DSP or NPU for model acceleration outside of the silicon provider’s API. This means, as far as I can tell from public documentation, there is no way for developers to use TPU/NPU acceleration on the new Pixel 9 series.
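To illustrate the gap, here is a hedged Python-side sketch of how TensorFlow Lite only reaches an NPU through a vendor-supplied delegate. The delegate library name is a placeholder, and a real Android app would do the equivalent through the Java/Kotlin or C APIs rather than Python.

```python
import tensorflow as tf

try:
    # A vendor NPU delegate, if the silicon provider's SDK ships one (placeholder name).
    delegates = [tf.lite.experimental.load_delegate("libvendor_npu_delegate.so")]
except (ValueError, RuntimeError, OSError):
    delegates = []  # no NPU path exposed; TFLite falls back to its CPU (or GPU) kernels

interpreter = tf.lite.Interpreter(
    model_path="model.tflite",  # placeholder model
    experimental_delegates=delegates,
)
interpreter.allocate_tensors()
```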
Samsung has ENN for Exynos devices, and Qualcomm has QNN for Qualcomm devices; both are free and well documented. MediaTek has its NeuroPilot SDK for hardware acceleration of AI models, but you need to apply to MediaTek for access to precompiled models and basic documentation on how to use it. Right now, it looks like Samsung- and Qualcomm-powered devices are the only ones that can even harness the NPU on Android.
CoreML for iOS, macOS, and iPadOS, on the other hand, is bundled with the operating system. For example, the iOS 18 beta saw a 25% increase in Neural Engine performance, with up to a 40% increase in some workloads. The same model, just a system update. This isn’t to say updates to coremltools, Apple’s model conversion software, won’t help, but the majority of the work is done at the system level. This makes it easier for developers to ship one model and let the OS and device handle the rest. These models should also work cross-device, so a single model can run on macOS, iOS, and iPadOS.
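For comparison, here is a hedged sketch of the CoreML path: convert a model once with coremltools and let the OS decide where it runs. The MobileNetV2 example, input shape, and deployment target are illustrative choices, not a prescribed recipe.

```python
import coremltools as ct
import torch
import torchvision

# Any exportable PyTorch model works; MobileNetV2 is just a small, familiar example.
model = torchvision.models.mobilenet_v2(weights=None).eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    compute_units=ct.ComputeUnit.ALL,           # let CoreML dispatch to the Neural Engine when it can
    minimum_deployment_target=ct.target.iOS16,  # illustrative target
)
mlmodel.save("MobileNetV2.mlpackage")
```

Because scheduling happens inside the OS, the same .mlpackage can get faster with a system update, which is exactly the Neural Engine gain described above.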
With all of this in mind, where do we go from here? What is this really useful for? I view this as a great way to track improvements over time, not only in pure silicon performance but also in inference backends and model architectures. Sure, the Snapdragon 8 Gen 4 will perform better than the Snapdragon 8 Gen 3, the A18 will be better than the A17 Pro, and so on. These improvements are to be expected year over year as process nodes and silicon architectures improve. Geekbench AI’s design also lets us see the gains as Microsoft updates DirectML and ONNX or Google updates TensorFlow Lite in Play Services; we’ll be able to watch performance improve over time on the same silicon.
All of this sounds confusing, and to be honest, it is. There is no easy option for developers of any size; you have to optimize for each silicon vendor on each platform and hope for the best. The end result is that developers will likely just choose the platforms with the largest user base and optimize for those, or choose suboptimal inference pipelines and hope they improve with time.
These are still the early days for AI and edge inference, and a lot of work needs to be done to improve the experience and the accessibility of information for developers. No company is perfect, but this is a good first step. As I said above, we can now basically watch model inference and the technologies around it progress at both the hardware and software layers; that isn’t something that has been easy to do historically.