Explainer: What Gemini powering Siri really means
Apple and Google made their joint announcement today, confirming Mark Gurman’s report from August (I swear this guy lives in the walls of Apple Park) that the new Siri and Apple Foundation Models will be based on Gemini models and technology.
There’s a lot of nuance to this that many are overlooking from a product, architecture, and investment perspective, so I thought it was worth clarifying a bit about how it’s likely to work. I want to split this up into a few parts to do that.
- Gemini technology as a foundation of Apple Foundation Models
- Siri and Apple Intelligence architecture
- Private Cloud Compute and inference silicon
Gemini technology as a foundation of Apple Foundation Models
I’d rather not get into too much detail about model training (that’s for another day), but for this to make sense you need to understand the basics. Two concepts are relevant here: pre-training and post-training.
Pre-training is when you take all your data and throw it at compute. The model learns and processes that data, making connections between topics. This is where the knowledge of the internet gets compressed into weights, essentially a weighted mathematical model of all the world’s knowledge. At the end of pre-training, you end up with a base model.
A base model is special because it doesn’t chat or interact like you or I are used to; it just completes text. Its entire purpose is next-token prediction based on input tokens. For example, if I enter “Hi” the base model would likely generate the tokens “ how are you?” It’s not conversational, it simply completes text.
To get to what we are used to, models go through post-training. This is where you tune the model to understand structured conversation; there are hidden tokens you never see that delineate this, like <|start|>, <|end|>, and <|message|>. The model is trained to understand conversation, reason, and take action with tools via these tokens. Reasoning, after all, is just the model generating English tokens to reason through problems between <|start_think|> and <|end_think|> tokens.
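To make the token plumbing concrete, here’s a toy sketch of how a chat template flattens a conversation into the single token stream the model actually completes. The delimiter tokens are the illustrative ones from the paragraph above, not any vendor’s real template:

```python
# Illustrative chat template: flatten a conversation into the single token
# stream a post-trained model completes. Delimiter tokens are illustrative.

def render_chat(messages):
    """Flatten [(role, text), ...] into one prompt string."""
    parts = [f"<|start|>{role}<|message|>{text}<|end|>" for role, text in messages]
    # Leave an open assistant turn: the model generates from here,
    # one token at a time, until it emits <|end|> itself.
    parts.append("<|start|>assistant<|message|>")
    return "".join(parts)

print(render_chat([("user", "Hi, how are you?")]))
```

The base model never “knows” it is chatting; it just keeps predicting tokens after the open assistant turn until it produces the end delimiter.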
Post-training is also where models learn how to interact within their harness, the framework for how they will be used. Harnesses are things like Claude Code, the ChatGPT website, Cursor, and basically any surface you use that has an AI model powering it, including all AI agents.
This stage is also where the model is given its personality, guardrails, and tuned for safety and to reduce bias.
The reason I think it’s important to explain this is that when Google licenses and gives a model to Apple, they will likely provide the base model. This leaves Apple to train the model, likely with Google Cloud, to behave how they want Siri to behave. This means Apple gives Siri its personality and values, trains it to understand and work within the Apple ecosystem with App Intents, and trains it to work with their harness.
When Google serves the model via the Gemini API, it needs to be a blank canvas with certain limits, so any developer or user can pick it up and expect it to work within any surface. Siri is more structured: after Apple’s training, it ONLY needs to work within Siri. This means the model can better understand how to operate within its harness versus a standard model served over an API. It also gives Apple the ability to use a smaller model, like Gemini 3 Flash at ~1.2T parameters vs. Gemini 3 Pro, which is rumored to be as large as 7T parameters, while retaining or exceeding the quality of the larger model.
This is also important for the on-device model, which Apple and Google implied would also get a Gemini technology overhaul. Apple and Google both train on-device models with model distillation, training small student models from larger teacher models. Apple’s on-device foundation model is very good; in practical application it’s even industry leading because of its Neural Engine acceleration, speed, and support for function calling and structured outputs.
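Distillation itself is conceptually simple: the student is trained to match the teacher’s full probability distribution over next tokens, not just the single correct answer. A toy sketch of the standard KL-divergence loss term, with made-up numbers:

```python
import math

# Toy sketch of distillation's core loss: the student is pushed to match the
# teacher's whole next-token distribution. All numbers here are made up.

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student), the standard soft-target distillation term."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

teacher = [0.70, 0.20, 0.10]  # large cloud model's next-token distribution
student = [0.60, 0.25, 0.15]  # small on-device model's distribution
loss = kl_divergence(teacher, student)
# Training updates the student's weights to drive this loss toward zero.
```

When the student perfectly mimics the teacher, the loss is zero; the gap between the two distributions is exactly what training pushes down.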
If Apple takes their stronger cloud model and distills it into a new on-device model, some parts of the Apple Intelligence and Siri workloads can be offloaded to the user’s device, reducing overall compute demand pretty significantly. More on that later.
The terms of the deal aren’t public, but I’m sure this deal does give Apple this sort of access and control over the model. Even without the base model and pure weights, Google can do a ton of work, similar to what I mentioned above, to make sure it works well for them.
Siri and Apple Intelligence architecture

Apple Intelligence has a bunch of features; some are powered by LLMs and SLMs, and some aren’t. For the sake of this, we’re just going to be discussing the LLM portion.
Today, depending on the Apple feature and the context of the request, it will use either the on-device SLM or the Apple Foundation Models server. For example, Fitness Buddy, the personalized coach/trainer that encourages you during workouts, tells you your stats and milestones, and provides context to your workout, uses Apple’s server model. Email and notification summaries use Apple’s on-device model. Some things, like summaries of voice recordings, can happen on-device or on the server model depending on length. The same goes for rewriting text; it just depends.
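A hypothetical sketch of that routing logic. The feature names and the token cutoff are my assumptions for illustration, not Apple’s actual policy; the point is just that routing depends on both the feature and the input size:

```python
# Hypothetical routing between the on-device SLM and Private Cloud Compute.
# Feature names and the token cutoff are assumptions, not Apple's policy.

ON_DEVICE_FEATURES = {"notification_summary", "email_summary"}
SERVER_FEATURES = {"fitness_buddy"}
LENGTH_CUTOFF_TOKENS = 4096  # assumed: long inputs exceed the small model

def route(feature, input_tokens):
    if feature in SERVER_FEATURES:
        return "private_cloud_compute"
    if feature in ON_DEVICE_FEATURES:
        return "on_device"
    # Length-dependent features: recording summaries, rewriting, etc.
    if input_tokens <= LENGTH_CUTOFF_TOKENS:
        return "on_device"
    return "private_cloud_compute"
```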
You can actually check this yourself! In Accessibility and Privacy settings, Apple stores (on your device only) all of your Apple Intelligence requests, the prompts, and whether each was processed locally or on Private Cloud Compute.

I’ve spent a lot of time reading Apple’s research on agents, tool calling, and LLMs and how to make them work, especially when looking towards Siri. The image above is from a 2024 Apple research report titled “Context Tuning for Retrieval Augmented Generation” and shows roughly how I believe the new Siri will work.
The user will make a request to Siri. If it’s basic, something like searching the web for XYZ or setting a timer, it will run locally and use the on-device model. When we get into more complex questions, like “Where did I meet John for coffee last week?”, it works in multiple steps. It will likely use the cloud model, searching multiple times through the different indexes on the device. This could be calendar, photos, messages, calls, maps, etc. The model exists in the cloud, on Apple’s Private Cloud Compute, searching the data on your device.
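Here’s a rough, hypothetical sketch of that multi-step flow: a planner (the cloud model) picks which on-device indexes to search, the device runs each query locally, and only the retrieved snippets go back up as context. The keyword matching is an invented stand-in for the model’s planning step:

```python
# Hypothetical multi-step retrieval: the cloud model plans which on-device
# indexes to search; the device runs each search locally and returns only
# the results. Keyword matching stands in for the model's planning.

KEYWORDS = {
    "calendar": ("meet", "event", "week"),
    "messages": ("said", "sent", "replied"),
    "maps": ("where", "coffee", "place"),
}

def plan_indexes(question):
    """Stand-in for the cloud model deciding which indexes look relevant."""
    q = question.lower()
    return [idx for idx, words in KEYWORDS.items()
            if any(w in q for w in words)]

def gather_context(question, search_index):
    """search_index(name, query) is the device-side lookup; the model only
    ever sees what each search returns, never the raw index."""
    return {idx: search_index(idx, question) for idx in plan_indexes(question)}

# "Where did I meet John for coffee last week?" hits calendar and maps.
```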
Siri will also be able to use App Intents to connect to non-Apple apps. This will allow developers to expose their apps and data to Siri and Apple Intelligence. For example, Tesla supports App Intents. This would allow me to just tell Siri “warm up my car,” and it will know my car and warm it up using the Tesla app. This requires some sort of notepad or memory, which Apple will likely include. Something like this could run on the on-device model.
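A hypothetical sketch of how an App Intents-style registry could map a phrase to a third-party action. The real App Intents framework is Swift and far richer; every name and the matching logic here are invented stand-ins:

```python
# Invented App Intents-style registry: a third-party app exposes an action
# plus trigger keywords, and Siri-like matching routes the utterance to it.
# Nothing here is Apple's actual API.

INTENTS = {}

def register_intent(app, keywords, action):
    """keywords is a tuple of words that must all appear in the utterance."""
    INTENTS[(app, keywords)] = action

def handle(utterance):
    words = set(utterance.lower().split())
    for (app, keywords), action in INTENTS.items():
        if set(keywords) <= words:
            return action()
    return "No matching intent"

register_intent("Tesla", ("warm", "car"), lambda: "Preconditioning started")
print(handle("warm up my car"))
```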
I am taking an educated guess at roughly how the new Siri will work, obviously we will not know until Apple announces it officially later this year.
This is just to say: of the Apple Intelligence features, the ones you use most often and that happen in the background, like summaries, happen on-device, while the larger, more sensitive tasks, or ones that require multiple steps, happen in the cloud. This is how hybrid should work, and because of how bespoke Apple’s systems are, they can make sure it works in a way I’m not sure any other company could.
Private Cloud Compute and inference silicon
The last part is Private Cloud Compute, Apple’s infrastructure for running LLMs and Apple Intelligence models. The servers are powered by Apple Silicon, designed by Apple, for Apple, to power Apple Intelligence.
You might be wondering: what silicon does it run on? Unfortunately, nobody knows. Through supply chain sources and conversations, I’ve heard many, many different things, including that it runs on A18 Pro, A19 Pro, M3 Ultra, M4 Max, M4 Ultra, M5, and M5 Max. Given there has been absolutely zero overlap, nobody actually knows! Apple are the best secret keepers in the world, and they are keeping this one tight.
It may be any of these, or even a custom version of an M-series chip designed for AI inference. What we know is that all Apple cloud AI inference happens on these Private Cloud Compute blades in Apple-owned servers. They are now assembled in Houston, Texas.
Apple’s American-made advanced servers are now shipping from our new Houston facility to Apple data centers!
These servers will help power Private Cloud Compute and Apple Intelligence, as part of our $600 billion US commitment. pic.twitter.com/maOd3lCGfK
— Tim Cook (@tim_cook) October 23, 2025
You may be wondering what makes Private Cloud Compute special, and why Apple is building their own AI inference silicon rather than just buying from Nvidia, Google, or anyone else. There are a few reasons, like cost and scale, but let’s go with what they’ve spoken about.
The first is, obviously, privacy. PCC does not have any persistent storage; it is all stateless. There are no logs saved on Apple servers. Your data isn’t saved and, as far as I know, has no ability to be saved; the customized OS doesn’t support that. Apple uses the same security chips from iPhone and Mac to authenticate the PCC servers, so they will not boot if the OS is tampered with. When a device makes a request to the PCC servers for AI inference, your device receives a key from the PCC server that your phone must verify as legitimate before sending the data. That data is end-to-end encrypted in transport and decrypted on the PCC server using a hardware security key.
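The trust ordering described above (attest first, send data only after) can be sketched like this. Real PCC uses hardware attestation and proper public-key cryptography; the keyed digest here is just a stand-in to show that data never leaves the device unless the server’s OS checks out:

```python
import hashlib
import hmac

# Toy model of the PCC trust ordering: the device checks the server's OS
# measurement against known-good builds BEFORE any data leaves the device.
# A keyed digest stands in for real end-to-end encryption and hardware
# attestation; none of this is Apple's actual protocol.

KNOWN_GOOD_OS_MEASUREMENTS = {hashlib.sha256(b"pcc-os-build-1").hexdigest()}

def device_send(request, server_key, server_os_measurement):
    # 1. Refuse to talk to a server whose OS isn't a known-good build.
    if server_os_measurement not in KNOWN_GOOD_OS_MEASUREMENTS:
        raise PermissionError("attestation failed; data never leaves device")
    # 2. Only after attestation passes is the request sealed for transport.
    return hmac.new(server_key, request.encode(), hashlib.sha256).hexdigest()
```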
During manufacturing, the devices are X-rayed and checked at a board and silicon level to make sure nothing is tampered with or added to snoop on data. They are then checked again at the data center before installation to confirm nothing was added and each is the same as it was when it left the factory.
Apple does not log or debug requests on the server, but on the client device they do store logs of the requests. This means Apple has no ability to see, store, or access any data without the client device (your iPhone, Mac, iPad, etc.) manually pulling that data to provide it. The Apple Intelligence report with this logging is, by default, only stored for 24 hours. You can change it to store for 7 days at a maximum, no longer.
The other reason is cost. Realistically, Apple can make as many of these chips as they want; they have a lot, I MEAN A LOT, of TSMC capacity for M-series-type chips. My theory is that Apple delayed the M5 Pro/Max until this year rather than October 2025 (which would match the M3/M4 Pro/Max time frame) to put that silicon capacity towards building M5-based Private Cloud Compute modules for the upcoming launch of the Gemini-based model powering Siri. With the new tensor cores, custom networking for multi-chip interconnect, and higher memory bandwidth, they could totally have a competitive AI inference stack with reasonable efficiency.
The other thing about all of this is that Apple does not need to support 2B Apple users on day one. One thing I’ve seen come up plenty of times is that this will roll out to all Apple users, giving Gemini models the most surfaces globally. This is not true.
Apple Intelligence and the new Siri will only run on Apple Intelligence-supported devices. This is iPhone 15 Pro, iPhone 16 series, iPhone 17 series, all Apple silicon Macs, various iPads, and Apple Vision Pro. My ballpark is maybe 450-500 million devices. When you account for regional availability day one, maybe 200-300 million devices. This is also powering, according to Mark Gurman (again, he must live in the walls of Apple Park), just Siri and Apple Health+ AI features. Will these have the same sort of constant traffic as something like ChatGPT or Gemini? Absolutely not. Apple must be able to support max throughput, but their scale doesn’t need to be as large as Google’s or OpenAI’s day one. This means Apple can spend time building out Private Cloud Compute until the Broadcom/Apple AI Accelerator is ready, which is reportedly coming towards the end of this year.
Apple doesn’t need to run on Nvidia GPUs or Google TPUs; they can manage with their own existing silicon until their better silicon is available. It is a luxury Apple has that many others do not.
You might be thinking after all of this: why did Apple even use Gemini-based models in the first place? Are they just behind in AI? The answer is basically yes. Apple has their own foundation models. The on-device foundation model is arguably the best in practice, but the server model is not. It roughly matches Llama 4 models while being more efficient with a smaller total parameter count.
The server model is not up to par, and if they could have made a competitive one, they wouldn’t be using Gemini. While this isn’t ideal, it did give them a decent advantage: it’s far cheaper than training a model from scratch.
To train a competitive model, they would need to buy all of the data, likely from Google or companies like Scale AI. That is extremely expensive. They would then need to process and optimize that data. Once they did that, they would need training infrastructure, which they reportedly don’t have. That means renting at market rates from AWS, GCP, or Azure. They would need to run multiple training runs, likely over $50-100M each. They would need to fine-tune and manage the model. If we look at OpenAI, they likely spent $10-15B on R&D in 2025! For Apple to catch up and exceed that would cost as much or more.
$1B a year is a steal for the access they are getting, and it costs Google nothing; they already had the weights, so it’s just an added bonus. This also allows Apple to not rush development of their own in-house model and take their time with it. If their product works with a Gemini engine, there is no immediate need to swap to an Apple engine until the Apple engine is better or significantly cheaper than the Gemini one.
You can think about this like Qualcomm and Apple’s 5G modems. Apple couldn’t make a 5G modem overnight, so they used Qualcomm’s 5G modem until they could make their own. Once they could and it was competitive and cheaper, they started to phase out the Qualcomm modem for their own. I would expect the same here.
All of this, I hope, clarifies what is going on a little more. It’s wild times, so hopefully this helps!