A weekend with Apple’s Mac Studio with M3 Ultra: The only real AI workstation today

March 11, 2025 / Max Weinbach

Executive Summary:

The Apple Mac Studio featuring the M3 Ultra represents the most powerful AI workstation currently available, tailored specifically toward AI developers’ demanding workflows. With its unprecedented unified memory (up to 512GB) and robust GPU performance, the M3 Ultra Mac Studio excels at running large language models (LLMs) efficiently, surpassing even high-end PCs in practical AI workloads. Its integration with Apple’s MLX framework provides optimized, user-friendly performance, establishing the Mac Studio as a uniquely capable machine for both current and future AI development.

Key Points:

  • Memory Matters Most:
    • Apple’s Unified Memory approach significantly enhances performance, particularly for memory-intensive AI tasks like running LLMs.
    • M3 Ultra Mac Studio offers up to 512GB of Unified Memory, essential for high-quality AI models and large context windows.
  • AI Development Optimized:
    • M3 Ultra’s powerful GPU (80-core) paired with Apple’s MLX framework provides unmatched efficiency in running models without excessive memory overhead.
    • Unlike other systems, MLX dynamically manages memory, enhancing workflow efficiency and enabling higher precision models.
  • Performance Benchmarking:
    • Real-world benchmarks demonstrate superior LLM inference speeds on the M3 Ultra Mac Studio compared to high-end PCs (Intel i9 13900K, RTX 5090).
    • Context windows and model sizes are growing rapidly; Mac Studio’s extensive memory future-proofs workflows against this trend.
  • Comparative Advantage:
    • While Nvidia GPUs like RTX 5090 excel in GPU benchmarks and optimized client AI scenarios, Apple Silicon offers unmatched ease of use and consistent practical performance.
    • Recommended developer setup: M3 Ultra Mac Studio for desktop AI workflows combined with rented Nvidia H100s for intensive server-based tasks.
  • Practical Implications:
    • Apple Silicon’s simplicity and native optimization (MLX) significantly reduce barriers to entry and improve development productivity.
    • No competing workstation matches the combination of ease-of-use, practical performance, and available memory for running sophisticated LLMs.

I’ve been a huge fan of Apple Silicon since I got my first M1 MacBook Pro in 2020. Going from an M1 MacBook Pro to an M1 Max MacBook Pro to an M3 Max MacBook Pro, one of my favorite parts of these machines has been the memory. That’s not only because Chrome is my browser of choice, but because memory has more or less always been the limiting factor, in my opinion.

When I got the M3 Max, I decided to get the 128GB memory option because LLMs were finally running well on these machines, with frameworks like llama.cpp and MLX steadily becoming more popular and eating as much memory as you could provide. To be frank, for modern models and agentic workflows, even 128GB isn’t enough for a lot of AI devs.

Apple’s M3 Ultra SoC in the Mac Studio is insane because it’s the first workstation that seems to actually target AI developers, pairing an extremely powerful GPU with up to 512GB of LPDDR5x Unified Memory and memory bandwidth of up to 819GB/s. It’s essentially the perfect workstation for AI developers, especially since almost every AI developer I know uses a Mac. I’m generalizing, but: every major lab, every major developer, everyone uses a Mac.

The Mac I’ve been using for the past few days is the Mac Studio with the M3 Ultra SoC, 32-core CPU, 80-core GPU, 256GB of Unified Memory (192GB usable as VRAM), and a 4TB SSD. It’s the fastest computer I have. It is faster in my workflows, even for AI, than my gaming PC (which will be used for comparisons below; it has an Intel Core i9-13900K, an RTX 5090, 64GB of DDR5, and a 2TB NVMe SSD).


Update: Since posting this article, I’ve gotten my hands on a Mac Studio with M3 Ultra SoC, 32-core CPU, 80-core GPU, and 512GB Unified Memory with an 8TB SSD. I’ve updated relevant sections based on new capabilities with more Unified Memory.


Just to get it out of the way, here are some benchmarks from the M3 Ultra, M3 Max, and my gaming PC.

*Geekbench AI scores are listed as Full Precision / Half Precision / Quantized, in that order

| Benchmark | M3 Ultra (32C/80G) | M3 Max (16C/40G) | Gaming PC (13900K/5090) |
| --- | --- | --- | --- |
| Geekbench 6 CPU Single/Multi | 3,227 / 27,115 | 3,242 / 21,152 | 2,666 / 17,278 |
| Geekbench 6 GPU | 263,100 | 164,076 | 406,460 |
| Geekbench AI CPU | 5,303 / 8,192 / 6,427 | 4,750 / 7,430 / 6,425 | 5,447 / 5,370 / 13,660 |
| Geekbench AI GPU | 28,320 / 23,115 / 20,734 | 18,983 / 21,642 / 17,032 | 39,726 / 56,188 / 29,875 |
| Geekbench AI NPU | 5,348 / 30,012 / 33,339 | 4,697 / 27,188 / 29,809 | N/A (no NPU) |
| Cinebench 2024 CPU Single | 143 | 139 | 122 |
| Cinebench 2024 CPU Multi | 2,503 | 1,657 | 1,951 |
| Cinebench 2024 GPU | 19,493 | 10,247 | N/A |

A Little Bit About LLMs

Before actually talking about how LLMs run on the M3 Ultra Mac Studio (hint: the best of any machine I’ve tested), let’s talk about how LLMs work and use memory. If you already know this, feel free to skip ahead; it’s just good context for understanding the need for memory.

There are two main parts of running an LLM that use a lot of memory, and both can be optimized for. The first is the model itself; models are generally stored in fp16, so 2 bytes per parameter. Essentially, parameter count in billions x 2 ≈ size in GB.

Llama 3.1 8B is about 16GB, for example. Models like DeepSeek R1 are natively FP8, so its 685B parameters come to about 685GB. The best open-source model as of this writing is Alibaba’s QwQ 32B, which matches DeepSeek R1! It’s BF16, so around 64GB for the full model.

When you quantize down to 4-bit, you cut that to a half or a quarter of the original size, depending on the model’s native precision. An 8B-parameter model in 4-bit quantization is around 4GB, QwQ 32B is around 20GB, and DeepSeek R1 is around 350GB. You can find models with smaller 1.5 to 2-bit quantizations, but these generally lose so much quality they’re not worth using outside of demonstrations. They could be good enough for larger models like DeepSeek R1, but that’s still ~250GB of memory needed to load it. The smallest version of DeepSeek R1 is about 180GB, but this isn’t the full story.
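To make the arithmetic concrete, here’s a minimal back-of-the-envelope sketch in Python. The bytes-per-parameter figures are approximations, and real checkpoints (especially quantized ones, which also store scaling metadata and keep some layers at higher precision) come out a bit larger than these numbers.

```python
# Rough model-weight memory estimate: parameter count x bytes per parameter.
# Approximations only; quantized files store scales and keep some layers at
# higher precision, so real sizes run a little larger than this.

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "fp8": 1.0,
    "4-bit": 0.5,
}

def model_size_gb(params_billions: float, precision: str) -> float:
    """Approximate weight size in GB: billions of parameters x bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]

print(f"Llama 3.1 8B @ bf16:      ~{model_size_gb(8, 'fp16/bf16'):.0f} GB")
print(f"QwQ 32B @ bf16:           ~{model_size_gb(32, 'fp16/bf16'):.0f} GB")
print(f"QwQ 32B @ 4-bit:          ~{model_size_gb(32, '4-bit'):.0f} GB")
print(f"DeepSeek R1 685B @ fp8:   ~{model_size_gb(685, 'fp8'):.0f} GB")
print(f"DeepSeek R1 685B @ 4-bit: ~{model_size_gb(685, '4-bit'):.0f} GB")
```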

The other part that uses a lot of memory is the context window: essentially, how much can you feed the model to generate a response from? Most models now offer a 128K-token context window, but users need significantly less; around 32K tokens is sufficient for most people (this is what the ChatGPT Plus tier has). The memory behind this is the KV cache, which stores the keys and values for the prompt and generated tokens used to produce the output.
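For a sense of scale, the KV cache grows linearly with context length: every token stores a key and a value vector per KV head in every layer. The sketch below uses illustrative architecture numbers (64 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache) that I’m assuming for a 32B-class model, not exact published specs.

```python
# Back-of-the-envelope KV cache size:
#   2 (K and V) x layers x KV heads x head dim x bytes per element x tokens

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    total = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_tokens
    return total / 1e9

# Assumed 32B-class model with grouped-query attention at a full 128K window.
print(f"~{kv_cache_gb(64, 8, 128, 128 * 1024):.0f} GB of KV cache at 128K tokens")
```

With those assumed numbers, the full 128K window works out to roughly 34GB on top of the weights, which lines up with the gap between QwQ’s ~19GB of weights and the ~51GB fully loaded footprint described below.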

The most common framework for running LLMs on the client, llama.cpp, essentially allocates the entire context window’s cache along with the model, so loading QwQ, which by itself is only 19GB, uses a total of ~51GB of system memory! This isn’t a bad thing, and it makes sense for a lot of use cases.
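As a rough illustration of that up-front allocation, here’s what loading a 4-bit QwQ with the llama-cpp-python bindings might look like; the GGUF path is a placeholder, and exact memory numbers will vary by build and quantization.

```python
# Sketch with llama-cpp-python (pip install llama-cpp-python).
# llama.cpp reserves the KV cache for the whole n_ctx window when the model is
# created, which is why a ~19GB 4-bit QwQ can occupy ~51GB at a 128K context.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwq-32b-q4_k_m.gguf",  # placeholder path to a 4-bit GGUF
    n_ctx=128 * 1024,                    # full 128K context reserved at load time
    n_gpu_layers=-1,                     # offload all layers to the GPU (Metal/CUDA)
)

out = llm("Explain the KV cache in one paragraph.", max_tokens=200)
print(out["choices"][0]["text"])
```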

Some frameworks, like Apple’s MLX, only use system memory for the KV cache as it’s filled, so QwQ takes just 19GB when loaded and uses more memory as the model runs, peaking at ~51GB once the full context window is filled. Given that the M3 Ultra and M4 Max have far more memory than that available, you can run much higher-precision models. For example, the native BF16 version of QwQ 32B could use over 180GB of memory at its maximum context window and quality. That’s 180GB for a 32B-parameter model! This is the reality of these models: they will use as much memory as you can give them.
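Here’s roughly what that looks like with Apple’s mlx-lm package. The model name is an example MLX-community conversion, and exact keyword arguments may vary slightly between mlx-lm versions; the point is that memory grows with the KV cache during generation rather than being reserved at load.

```python
# Sketch with mlx-lm (pip install mlx-lm). Memory starts near the weight size
# and grows as the KV cache fills while tokens are generated.
from mlx_lm import load, generate

# Example MLX-community conversion; any MLX-converted model from Hugging Face works.
model, tokenizer = load("mlx-community/QwQ-32B-4bit")

prompt = "Summarize why unified memory matters for local LLM inference."
# verbose=True prints the response along with prompt/generation speed stats.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```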

Context windows are growing, and that seems to be where the majority of future-proofing around memory should happen for client devices. Qwen already has a 1M-token context window model, and Grok 3 from xAI is 1M tokens as well (and will be open-sourced in the future). Scaling laws exist for model sizes; we’ll get smaller models that are powerful, but big models will still be better. Context windows, and actually using the models, small or large, will only require more memory. RAG works in some workflows, but I believe context windows will matter more, and that requires A LOT of memory. Mix a big model with a large context window? You need 512GB of memory or more.

You can actually connect multiple Mac Studios using Thunderbolt 5 (and Apple has dedicated bandwidth for each port as well, so no bottlenecks) for distributed compute using 1TB+ of memory, but we’ll save that for another day.

Look, all of this is to say: you can run an SLM or LLM on your phone or any laptop, and it’ll work. For it to work well, to be usable in production, and to actually evaluate models, that is, to have a REAL AI workstation, you need a lot of memory available to the GPU. The Mac Studio with M3 Ultra is the only machine that allows for this today. While buying H100s or AMD Instinct cards may be faster for the actual inference step, it’s also going to be 6-80x the price to actually own the silicon, especially since nearly everyone will be testing models that’ll be deployed to a cloud for production.

Training is a whole other mess that I’m sure the Exo Labs team will talk more about, since they are building an Apple Silicon-only training node for LLMs! I’m sure they’ll be a better resource on the memory requirements for training, but at the end of the day, I think the best way to put it is: more memory is better.


LLM Performance

OK, the important part now! I’ll keep it brief; the LLM performance is essentially as good as you’ll get for the majority of models. You’ll be able to run better models faster with larger context windows on a Mac Studio or any Mac with Unified Memory than essentially any PC on the market. This is simply the inherent benefit of not only Apple Silicon but Apple’s MLX framework (the reason we can efficiently run the models without preloading KV Cache into memory, as well as generate tokens faster as context windows grow). 

This is not what I would consider a good comparison, because Blackwell as an architecture is great in data centers for AI and, on the client, for consumer-focused AI. That’s not really what I’m testing here, so use this as a frame of reference for the practical performance of an LLM on a workstation.

Below is just a quick ballpark of the same prompt, same seed, and same model on the machines above (including the 512GB M3 Ultra from the update). This is all at a 128K-token context window (or the largest the model supports), using llama.cpp on the gaming PC and MLX on the Macs.

*OOM = Out of Memory

| Model | M3 Ultra 256GB | M3 Ultra 512GB | M3 Max | RTX 5090 |
| --- | --- | --- | --- | --- |
| QwQ 32B 4-bit | 33.32 tok/s | 36.87 tok/s | 18.33 tok/s | 15.99 tok/s (32K context; 128K OOM) |
| Llama 8B 4-bit | 128.16 tok/s | 135.22 tok/s | 72.50 tok/s | 47.15 tok/s |
| Gemma2 9B 4-bit | 82.23 tok/s | 88.50 tok/s | 53.04 tok/s | 35.57 tok/s |
| IBM Granite 3.2 8B 4-bit | 107.51 tok/s | 112.87 tok/s | 63.32 tok/s | 42.75 tok/s |
| Microsoft Phi-4 14B 4-bit | 71.52 tok/s | 75.91 tok/s | 41.15 tok/s | 34.59 tok/s |
| DeepSeek R1 4-bit | OOM | 19.69 tok/s | OOM | OOM |
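If you want to reproduce this kind of ballpark on the Mac side, a simple approach is to time a fixed generation with mlx-lm and divide tokens by wall-clock seconds. This is a rough sketch, not the exact harness behind the table above, and it includes prompt processing in the elapsed time, so it slightly understates pure generation speed.

```python
# Rough tokens-per-second measurement with mlx-lm; a quick ballpark, not the
# exact methodology used for the table above.
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/QwQ-32B-4bit")  # example converted model

prompt = "Write a detailed explanation of speculative decoding."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
elapsed = time.perf_counter() - start

n_generated = len(tokenizer.encode(text))  # tokens in the generated text only
print(f"{n_generated / elapsed:.2f} tok/s over {elapsed:.1f}s")
```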

I will say, there are ways to run larger models on the RTX 5090 with CPU offload and lazy loading, mixing system memory and the CPU into the inference or loading layers of the model as needed, but that introduces latency and frankly isn’t how you’d want to use the RTX card in practice. There are also frameworks like TensorRT-LLM that let you quantize to the native fp4 datatype supported by Blackwell, but when I tried to compile the model for the RTX 5090, I received quite a few errors and frankly didn’t have the time to debug it. The theoretical performance of an optimized RTX 5090 using proper Nvidia optimizations is far greater than what you see above on Windows, but this again comes down to memory: the RTX 5090 has 32GB, while the M3 Ultra has a minimum of 96GB and a maximum of 512GB.
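For reference, partial offload with llama-cpp-python might look something like this; the model path and layer count are arbitrary examples, and the more layers that spill over to the CPU, the slower generation gets.

```python
# Sketch of partial GPU offload with llama-cpp-python: keep some layers in the
# RTX 5090's 32GB of VRAM and let the rest run from system RAM on the CPU.
# Larger models become loadable, at the cost of slower token generation.
from llama_cpp import Llama

llm = Llama(
    model_path="./deepseek-r1-q4.gguf",  # placeholder path to a large 4-bit GGUF
    n_ctx=32 * 1024,                     # smaller context to keep the cache manageable
    n_gpu_layers=40,                     # arbitrary example: offload only 40 layers
)
```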

This leads to another Apple Silicon advantage: it’s just easy. Everything here is optimized and works. MLX is the best framework for the platform, constantly updated not only by Apple but by the community; it’s a wonderful open-source project for taking advantage of the unified memory on Apple Silicon. As great as the RTX 5090’s performance is (and yes, it does outperform the M3 Ultra’s GPU at its peak for AI), software like CUDA and TensorRT is a limiting factor when you’re not going for scale in a data center, where those tools are second to none.

I see one of the best combos any developer can have as an M3 Ultra Mac Studio paired with a rented Nvidia 8xH100 rack. Hopper and Blackwell are outstanding for servers; the M3 Ultra is outstanding for your desk. They’re different machines for different uses, and while it’s fun to compare them for sport, that’s not the reality.

There really is no competition for an AI workstation today. The reality is, the only option is a Mac Studio.
