We Do Not Have Enough Compute

April 2, 2025 / Max Weinbach

We are in a huge AI compute shortage right now. At this very moment! When you hear most people talk about this, it’s generally in regards to model scaling and compute scaling for training. Not enough compute so labs can’t scale up the scaling laws. Right now is different, it’s the Ghibilification of AI, there’s too much consumer demand for AI!

The goal of this is to outline a problem, there is now more demand for AI than can reasonably support without development setbacks for future models, products, and inevitably AGI. There are a few solutions, all of which are… difficult, which we’ll also outline below.

A brief overview of the past week: The xAI team, thanks to Grok 3 in India, had to move around their available compute towards inference. Google’s Gemini 2.5 Pro experimental model, currently the most intelligent and most useful all-around agentic model, is pushing their TPU capacity, so Google is actively allocating more TPU allotment towards that model. Then the big one, ChatGPT 4o native image output and the Ghibli photos.

After the new image gen last week, Sam Altman has spent a lot of time on Twitter talking about how popular ChatGPT has become after this update. They were getting 5 million users an hour at a certain time and now have 20 million plus, $20 a month, users. The viral nature of the image generation and the Studio Ghibli-style images led to such insane growth that they don’t have enough GPUs for inference. Sam Altman went as far as to ask for access to GPUs in chunks of 100,000!

With these massive numbers and proper optimization techniques, I wanted to estimate around how many concurrent requests 100,000 GPUs could handle depending on models and parameters of the models. With some good prompting, context, and some time with Gemini 2.5 Pro, here are about what I’d consider realistic numbers.

Note there are multiple scenarios here and different GPUs. For ChatGPT 4o, there really isn’t a good estimate of model size. It’s estimated to be between 400B and 1.2T parameters

GPU Model Deployment Scenarios

Scenario Description	Assumed Precision	Est. GPUs / Instance	Est. Concurrent Users (@ ~45 tok/s per user)	Key Assumptions & Notes
~1T Params, MoE Model on H200				Target Model: Large, hypothetical ~1 Trillion parameter Mixture-of-Experts model. Target Speed: 40-50 tok/s per user. Framework: Triton.
Instance: 16 x H200	FP16	16	~30 – 90 (Midpoint: ~60)	Assumes model + 8K KV cache fits comfortably. Focus on balancing latency & throughput.
~1T Params, MoE Model on H100				Target Model: Same ~1T MoE model. Target Speed: 40-50 tok/s per user. Framework: Triton.
Instance: 20 x H100	INT8	20	~40 – 50 (Midpoint: ~45)	INT8 assumed necessary for efficient fit & performance on H100s. Slightly lower perf. vs. 16xH200 due to hardware differences.
~800B Params, MoE Model				Target Model: Hypothetical ~800 Billion parameter MoE model. Target Speed: 40-50 tok/s per user. Framework: Triton.
Instance: 14 x H200	FP16	14	~40 – 80 (Midpoint: ~60)	Fits comfortably at FP16.
Instance: 14 x H100	INT8	14	~20 – 45 (Midpoint: ~33)	INT8 likely preferred for better VRAM headroom/efficiency on H100s.
~500B Params, MoE Model				Target Model: Hypothetical ~500 Billion parameter MoE model. Target Speed: 40-50 tok/s per user. Framework: Triton.
Instance: 10 x H200	FP16	10	~30 – 60 (Midpoint: ~45)	Fits comfortably at FP16 on fewer H200s.
Instance: 16 x H100	FP16	16	~15 – 40 (Midpoint: ~28)	Fit Constraint: Very tight/may not fit with 8K context at FP16. Requires optimization or compromise. Lower perf. due to VRAM limits.
Instance: 16 x H100	INT8	16	~22 – 55 (Midpoint: ~39)	INT8 provides comfortable VRAM headroom, allowing better batching & potentially higher throughput than constrained FP16 scenario.
Large Scale Deployment (100k GPUs)				Target Speed: 40-50 tok/s per user. Scaling: Assumes near-linear scaling (optimistic). Subject to large-scale caveats below.
Model: ~1T Params
100,000 x H200 GPUs	FP16 (per inst.)	~16 / instance	~190k – 560k (Midpoint: ~375k)	Creates ~6,250 instances.
100,000 x H100 GPUs	INT8 (per inst.)	~20 / instance	~200k – 250k (Midpoint: ~225k)	Creates ~5,000 instances (using INT8 for efficiency/fit).
Model: ~800B Params
100,000 x H200 GPUs	FP16 (per inst.)	~14 / instance	~290k – 570k (Midpoint: ~430k)	Creates ~7,140 instances.
100,000 x H100 GPUs	INT8 (per inst.)	~14 / instance	~140k – 320k (Midpoint: ~230k)	Creates ~7,140 instances (using INT8).
Model: ~500B Params
100,000 x H200 GPUs	FP16 (per inst.)	~10 / instance	~300k – 600k (Midpoint: ~450k)	Creates ~10,000 instances.
100,000 x H100 GPUs	INT8 (per inst.)	~16 / instance	~140k – 340k (Midpoint: ~240k)	Creates ~6,250 instances (using INT8 for comfort/efficiency).

With 5 million users, that’s not 5 million active requests, but you need to be ready to handle some amount of those requests at a reasonable speed, which is ~40 tok/s for ChatGPT. If you slow down the tok/s and batch more requests, which OpenAI will likely do at some point.

Reasonably, around 100,000 GPUs in the Hopper generation can handle between 250-500K concurrent requests. There’s some wiggle room here, with token speed and batching, context window size, speculative decoding, as well as a few other optimizations, but given we have no clue what OpenAI does, this is the best I can give. Obviously, Blackwell chips could handle far more. I can’t give any good estimates but ballpark maybe 50% more requests per node, so 375-800K concurrent requests per 100K GPUs.

If we assume OpenAI is mostly on Nvidia GPUs, there may not be enough global GPUs accessible to OpenAI to handle inference at the scale they want. Nvidia has shipped around 3.7 million Hopper GPUs. Meta has around 400K Hopper GPUs, xAI around 200K, and the other hyperscalers with numbers around there. Azure likely has a larger amount as well, just due to their exclusivity with OpenAI for current models. I’m not certain OpenAI will have access to enough Nvidia GPUs to actually serve this model, meaning other GPUs like Intel Gaudi or AMD Instinct, or custom ASICs, may be the only option to continue serving LLM inference at scale and have enough compute to allow growth.

When you need to move GPUs dedicated to training, research, and experimentation to inference to match the demand of consumer products, products in the pipeline get pushed back. OpenAI has publicly said this as well! This burst in demand will push back the future product pipeline.

On the other side of things, xAI Grok and Google Gemini. These are a little simpler. Grok has had immense growth in India, leading the xAI team not only to move more focus towards inference, but also Android app development! Up until recently, the Android app was developed by a single person. After the explosive growth in India, xAI grew out the developer team to focus more on that platform and country. They had to move more GPUs from training to inference as well, in their multiple data centers with over 200K Hopper GPUs. Their current, likely, Memphis data center has a dedicated 12K Hopper GPUs which I believe are used for inference. Unfortunately, there are also no good estimates on Grok 3 and the inference stack or how it works.

Gemini 2.5 Pro is a more interesting story. It was launched early last week, easily having the best performance of any model. On current benchmarks and IQ tests, it’s leading them all. By far the best model.

Google has a similar issue to OpenAI, they have a limited number of TPUs they can use, I’m guessing Gemini 2.5 Pro reasonably requires TPUv6 or TPUv5p, and these are allocated to different business units in Google outside of just Gemini AI. The weather model they use and developed, likely the embedding algorithms for Search, YouTube, YouTube Music, etc. These all likely use TPUs. Even if Google gives Gemini priority to TPU access, there’s still multiple models! This means splitting allocation between Gemini 2.0 Flash, 2.0 Flash Lite, 2.0 Flash Thinking, 1.5 series, etc.

One of the big parts of the Gemini 2.5 Series of models was that all base models would be thinking models. Another way to read this is hybrid models, a single model instance that can think or one-shot answers depending on the developer or model settings. Currently, only the thinking model is available, but the non-thinking setting will be soon as well. This makes sense when it comes to allocation, and likely where all models are heading: one that thinks or doesn’t depending on the input. This allows for better utilization of hardware because there are fewer models to serve overall, leading to better pricing and more capacity.

While there is no finalized pricing on Gemini 2.5 Pro yet, I’m hearing it’s around the same price as Deepseek R1 while being significantly more intelligent. It’s more intelligent than o1-Pro from OpenAI while being around 150x cheaper as well. It also supports function calling and is an agent-first model.

Simply put, as of today, Google has the best model for agents, the cheapest model, the fastest model (150 tokens per second), and it is running fully on a custom-designed vertical stack from silicon to compiler to training data to data center. To compete, everyone is going to need something like this or WAY more Nvidia GPUs.

We’re in an era now where we need more compute. This isn’t even compute for inference or training, just more compute for both. We’ve essentially saturated both. As much as edge AI feels like a solution to this, these models are still too large, and there are too many issues with memory, compute, software optimization, model performance, model latency, model architecture, and many more things I could list off. We simply need more data center compute, not less, and as much of it as possible as quickly as possible.

Join the newsletter and stay up to date

We Do Not Have Enough Compute

April 2, 2025 / Max Weinbach

GPU Model Deployment Scenarios

Join the newsletter and stay up to date

Trusted by 80% of the top 10 Fortune 500 technology companies