More Core, More Power: The Apple M4 Pro Advantage

November 1, 2024 / Max Weinbach

Key Takeaways:

  • The M4 series emphasizes performance cores on the CPU side and increased memory bandwidth.
  • Several configurations are available across the M4 series: M4, M4 Pro, and M4 Max.
  • CPU core counts range from 8 to 16, mixing performance and efficiency cores, with higher-end SKUs favoring performance cores.
  • GPU options range from 8 to 40 cores.
  • Memory bandwidth varies significantly across models, from 120GB/s to 546GB/s.

Key Specs of Apple M3 vs. M4 Chip Series

| Chip Model | CPU Cores | GPU Cores | Memory Bandwidth (GB/s) | RAM Capacities per SoC | NPU TOPS (int4) |
|---|---|---|---|---|---|
| Apple M3 | 8 (4P + 4E) | 10 | 102.4 | 8GB, configurable to 16GB or 24GB | 36 |
| Apple M4 | 8 (4P + 4E) | 8 | 120 | 16GB or 24GB | 38 |
| Apple M4 | 10 (4P + 6E) | 10 | 120 | 16GB or 24GB | 38 |
| Apple M3 Pro | 11 (5P + 6E) | 14 | 153.6 | 18GB, configurable to 36GB | 36 |
| Apple M3 Pro | 12 (6P + 6E) | 18 | 153.6 | 18GB, configurable to 36GB | 36 |
| Apple M4 Pro | 12 (8P + 4E) | 16 | 273 | 24GB, configurable to 48GB | 38 |
| Apple M4 Pro | 14 (10P + 4E) | 20 | 273 | 24GB, configurable to 48GB | 38 |
| Apple M3 Max | 14 (10P + 4E) | 30 | 307.2 | 36GB, configurable to 96GB or 128GB | 36 |
| Apple M3 Max | 16 (12P + 4E) | 40 | 409.6 | 48GB, configurable to 64GB or 128GB | 36 |
| Apple M4 Max | 14 (10P + 4E) | 32 | 410 | 36GB | 38 |
| Apple M4 Max | 16 (12P + 4E) | 40 | 546 | 48GB, configurable to 64GB or 128GB | 38 |

This week Apple unveiled the M4 Pro and M4 Max chips, which join the existing M4 to expand its lineup for personal computers. These chips use TSMC’s N3E 3-nanometer node, improving performance and efficiency, with memory bandwidth up to 546GB/s, up from 409.6GB/s on the M3 Max. The M4 Pro and M4 Max also add Thunderbolt 5 support for faster data transfer. The updated chips are now available in Apple’s iMac, Mac mini, and MacBook Pro, enhancing performance for a range of professional and creative uses.

The M4 Pro and M4 Max are built for users with demanding tasks like 3D rendering, AI workloads, and video production. The M4 Pro offers up to a 14-core CPU and a 20-core GPU, while the M4 Max scales up to a 16-core CPU and a 40-core GPU. Both chips include an upgraded Neural Engine, twice as fast as the previous generation’s, enhancing AI processing capabilities. These chips power the latest Mac mini and MacBook Pro, delivering improved performance for both everyday tasks and complex creative work.

The base model M4 chip features an 8-core CPU with four performance cores and four efficiency cores, providing a balance between power and efficiency. With an 8-core GPU and 120GB/s memory bandwidth, the M4 delivers solid graphics performance for a variety of tasks, from everyday computing to creative work. The Neural Engine also enhances AI-related tasks, making it a strong option for users seeking consistent performance in an efficient package.

The M4 Pro and M4 Max MacBook Pro models now carry improved battery life ratings, moving from 22 hours to 24 hours, even with the addition of more performance cores in the M4 Pro. This improvement is largely due to Apple’s efficient core design, which continues to lead in performance per watt. The performance cores are now more efficient, and the efficiency cores more capable, allowing the machines to handle intensive tasks without significantly increasing power consumption. The result is better battery life, demonstrating Apple’s focus on optimizing both hardware and software for energy efficiency even as it pushes raw performance.

Not All TOPS are Created Equal

Analyzing Geekbench AI scores shows that a higher TOPS (Tera Operations Per Second) value does not always mean better real-world performance. Both Apple’s M4 series and Qualcomm’s Snapdragon Oryon use Neural Processing Units (NPUs) for AI workloads, but efficiency varies. Qualcomm’s NPU is rated at 45 TOPS, compared to Apple’s 38 TOPS, yet Apple’s M4 Max scores higher across various benchmarks.

For single precision tasks, the M4 Max scored 6006, while the Snapdragon-powered Galaxy Book4 Edge scored 2385. In half precision, Apple’s M4 Max reached 36,044 versus Qualcomm’s 11,218. In quantized performance, the M4 Max scored 48,799 compared to Qualcomm’s 22,565. This suggests that while Qualcomm’s NPU may have the higher theoretical peak, Apple’s hardware, Core ML integration, and Neural Engine deliver more efficient and consistent AI performance in practice.
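
One rough way to quantify “not all TOPS are created equal” is to divide each chip’s quantized score by its rated TOPS. A minimal sketch using the figures cited above; “points per rated TOP” is an illustrative ratio of my own, not a standard metric, and the TOPS ratings are the vendors’ marketing numbers:

```python
# Quantized Geekbench AI score divided by rated NPU TOPS, using the
# figures cited in the text. TOPS ratings are vendor marketing numbers;
# "points per rated TOP" is an illustrative ratio, not a standard metric.

chips = {
    "Apple M4 Max (38 TOPS)": (48_799, 38),
    "Snapdragon X Elite (45 TOPS)": (22_565, 45),
}

for name, (score, rated_tops) in chips.items():
    print(f"{name}: {score / rated_tops:,.0f} points per rated TOP")

# Apple M4 Max (38 TOPS): 1,284 points per rated TOP
# Snapdragon X Elite (45 TOPS): 501 points per rated TOP
```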

While I can’t yet say for certain why this is, I believe it comes down to memory bandwidth, the size of the SLC (system-level cache), and how Core ML optimizes tensor structures for the hardware. Specifically, overall efficiency is likely shaped by how effectively data moves between memory and the compute units: making full use of the available memory bandwidth and reducing latency when tensor blocks are shuffled around. The SLC can hold multiple tensor loads at once, which makes more efficient use of the available processing power by reducing the frequency of memory swaps. When tensor and vector blocks are optimized for this balance, the hardware can perform closer to its theoretical maximum.
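
One way to reason about this is a back-of-the-envelope roofline check: compare how many operations a layer performs per byte it must move against the ratio the chip can sustain. A minimal sketch; the layer shape and int8 datatype are illustrative assumptions of mine, while the TOPS and bandwidth figures come from the spec table above:

```python
# Back-of-the-envelope roofline check for a single matmul-style layer:
# is it limited by compute (TOPS) or by memory bandwidth?
# The layer shape and int8 datatype are illustrative assumptions.

M, K, N = 1024, 4096, 4096      # hypothetical layer: (M x K) @ (K x N)
BYTES_PER_ELEM = 1              # int8 weights and activations (assumption)

ops = 2 * M * K * N             # each multiply-accumulate counts as 2 ops
# Ideal traffic: every element of A, B, and the output crosses the bus once.
bytes_moved = (M * K + K * N + M * N) * BYTES_PER_ELEM
intensity = ops / bytes_moved   # ops per byte at perfect reuse

for name, tops, bw_gbs in [("M4 Max", 38, 546), ("M4 Pro", 38, 273)]:
    # Ops/byte the chip can feed before bandwidth becomes the ceiling.
    ridge = (tops * 1e12) / (bw_gbs * 1e9)
    bound = "compute-bound" if intensity > ridge else "bandwidth-bound"
    print(f"{name}: ridge {ridge:.0f} ops/byte, layer {intensity:.0f} ops/byte -> {bound}")

# With poor blocking, real traffic is many times the ideal, intensity falls
# below the ridge, and the same layer becomes bandwidth-bound.
```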

Optimizing tensor and vector blocks for NPU hardware is about arranging data to minimize memory access and maximize parallel processing. By efficiently organizing tensors into appropriate block sizes and aligning them with compute units, data movement between memory and processing units is reduced, allowing for better use of memory bandwidth.
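
To make the idea concrete, here is a minimal NumPy sketch of blocking: the output is computed tile by tile, so each block of the inputs is loaded once and reused across an entire tile of results instead of being re-streamed for every element. The tile size here is an arbitrary illustrative choice, not a value tuned for any particular chip:

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, tile: int = 128) -> np.ndarray:
    """Tiled matrix multiply: each (tile x tile) block of the inputs is
    loaded once and reused for a whole output tile, cutting the traffic
    between memory and the compute units compared to a naive loop."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=a.dtype)
            for p in range(0, k, tile):
                # Both operand blocks stay cache-resident while every
                # element of the output tile consumes them.
                acc += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
            out[i:i + tile, j:j + tile] = acc
    return out

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-2)
```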

Loading tensor blocks in a way that fits well into cache can significantly decrease memory swaps, which is especially important on hardware like the Apple M4 and Qualcomm Snapdragon Oryon, where memory bandwidth and cache size can be limiting factors. Core ML on the M4 likely helps by using the cache effectively and avoiding unnecessary memory access, improving throughput.
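
Following that logic, the tile size itself can be derived from the cache budget: pick the largest block whose working set, two input tiles plus an output tile, still fits. A small sketch; the 16MB SLC figure and fp16 datatype are placeholder assumptions, since Apple does not publish the M4’s SLC capacity:

```python
# Pick the largest power-of-two tile whose working set fits the cache:
# one block of each operand plus an output block must be resident at once.
# The SLC size and fp16 datatype are placeholder assumptions; Apple does
# not publish the M4's SLC capacity.

SLC_BYTES = 16 * 1024 * 1024    # hypothetical system-level cache budget
BYTES_PER_ELEM = 2              # fp16 tensors (assumption)

def max_tile(cache_bytes: int, elem_bytes: int) -> int:
    tile = 1
    # Working set of a doubled tile: 3 blocks of (2*tile)^2 elements.
    while 3 * (2 * tile) ** 2 * elem_bytes <= cache_bytes:
        tile *= 2
    return tile

print(max_tile(SLC_BYTES, BYTES_PER_ELEM))  # -> 1024 under these assumptions
```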

Ultimately, efficient tensor optimization helps maximize NPU processing power while minimizing latency. By balancing tensor loading and computation, Apple’s M4 chips can achieve strong real-world performance even with fewer theoretical TOPS compared to other hardware, as reflected in the benchmarks.

Put simply: even though other hardware, like the Snapdragon X Elite, Intel’s Lunar Lake, and AMD’s Strix Point, posts higher TOPS numbers in int4 data types, the Apple M4 still outperforms it thanks to silicon and software optimizations that maximize the utility of that block of the SoC.

As workloads become increasingly heterogeneous, the entire system-on-chip (SoC), not just the NPU, plays a role in accelerating AI. Different components, such as the CPU, GPU, and NPU, are all leveraged for AI workloads, each contributing to overall TOPS. With tasks distributed across multiple cores, including both efficiency and performance cores, we are seeing a shift where the whole core architecture is used for AI processing. This holistic approach makes more effective use of computational resources, letting the SoC capture the total TOPS benefit rather than relying solely on NPU TOPS, and the result is a more comprehensive performance boost that better supports modern, diverse AI workloads.
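
On Apple platforms this dispatch is visible at the API level: Core ML lets a model declare which parts of the SoC it may run on, and the framework partitions the graph across them. A minimal sketch using the coremltools Python package; the model filename is a placeholder:

```python
import coremltools as ct

# Load a compiled Core ML model (the filename is a placeholder) and let
# the framework partition the graph across the whole SoC: CPU, GPU, and
# Neural Engine, rather than pinning everything to the NPU.
model_all = ct.models.MLModel(
    "MyModel.mlpackage",                  # hypothetical model file
    compute_units=ct.ComputeUnit.ALL,     # CPU + GPU + Neural Engine
)

# The same model restricted to the CPU and Neural Engine, for comparison.
model_cpu_ne = ct.models.MLModel(
    "MyModel.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```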
