All About Agents: Cheap Tokens, Local Models, and Product Fit
Over the past year or so we’ve heard a lot about agents. Most of it hasn’t really happened. Things change quickly, trends are hard to pin down and near impossible to predict, because something you think will be successful can become irrelevant in three months.
If you take a step back and try to understand not the products, but how these agents actually work, the fundamentals haven’t really changed. An agent is an LLM within a harness. That harness gives it access to tools, its prompt, skills, and files. Simply put, it is an LLM in an environment. Think of it like a person in an office. As you read this, try to imagine an agent within that context. Things may seem out of order as you read, but trust me, there’s a reason.
The economics of agents favor frontier cloud models as the primary reasoning layer, with smaller/local models acting as specialized tools where privacy or cost matter. There is a time and place for everything.
The reason I think this is important is that nobody makes decisions for the future based on where the technology is today; they base them on where it will be tomorrow. Three years ago, we had people proudly yelling to the clouds that the capabilities and intelligence of large models would be running on your phone in a 3-7B parameter model by 2026. That was incorrect. People made decisions two years ago based on this assumption, and if you did, you got screwed. Many people got screwed.
This technology developed and advanced in ways and directions we did not expect. It’s time we take a look at how it works today, because three years from now it’s likely to be pretty similar. Believe it or not, the way agents worked three years ago is still pretty similar to how they work today.
Orchestrators should be large cloud models
One of the ideas I’ve been hearing in the analyst community is an orchestrator model running locally for privacy/security on a given task, calling out to cloud models only as needed, because this will be cheaper and handle most tasks. While that sentiment makes sense, it assumes the cloud and local models are as capable and intelligent as each other, or close to it. The reality is we aren’t there and may never be.
I struggle to even refer to most agents as orchestrators. For the sake of simplicity, I’m going to refer to a main agent powered by the main model. This main agent can orchestrate subagents. The nuance here is that its goal isn’t just to delegate tasks to subagents, but rather to use those subagents as tools, a more efficient means to an end.
Now, depending on the app you use (also known as the agent harness), subagents and orchestration can be done in different ways. Some, like Codex and Claude Cowork, focus on doing most of the work via the main model instance, then spawn subagents and orchestrate them to do specific things, like research or understanding data structures. The main agent then uses these to collect the knowledge it needs to complete the task.
A great example would be financial modeling. The main agent might be tasked with building a DCF and financial growth model of a business from scratch. The model would reason through what is required, then task subagents to collect that data. This may mean one for SEC filings, one for investor decks, one for official news, and one for social commentary. These are each general purpose agents tasked with a specific goal. Each would collect its information, save it somewhere the main agent can find it, and give a quick explanation of what it found. The main agent would then complete the model, possibly asking the user some questions if it’s unsure, and then send it to a new agent dedicated to reviewing the work and providing feedback. If it is good, it presents it to the user; otherwise it makes corrections as needed, asks for review again, and presents once completed.
The main agent is what does the work; the other agents work for that main agent, doing the time consuming grunt work.
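To make this concrete, here is a minimal sketch of the "subagents as tools" pattern in Python. Every name in it (the SubAgent class, spawn_subagent, the model strings) is illustrative; this is the shape of the architecture, not any specific harness's API.

```python
# Minimal sketch of "subagents as tools." All names are illustrative; no
# specific harness's API is implied. The key shape: the main agent sees
# spawn_subagent() as just another tool call, and only a short summary
# ever comes back into its (expensive) context.

from dataclasses import dataclass, field

@dataclass
class SubAgent:
    """A subagent is just a model + a focused prompt + a narrow tool set."""
    model: str                          # cheap/local model for grunt work
    system_prompt: str                  # the 'skill' that specializes it
    tools: list = field(default_factory=list)

def run_agent_loop(agent: SubAgent) -> str:
    # Stub: a real harness would loop the model over its tools until done.
    return f"[{agent.model}] findings saved to workspace; short summary here"

def spawn_subagent(task: str, workspace: str) -> str:
    """Exposed to the main agent as an ordinary tool. The subagent writes
    raw findings to the shared workspace and returns only a summary."""
    agent = SubAgent(
        model="small-open-model",
        system_prompt=(f"Collect everything relevant to: {task}. "
                       f"Save findings under {workspace} and reply with a short summary."),
        tools=["web_search", "write_file"],
    )
    return run_agent_loop(agent)

# The DCF example then reads as a plan the main agent executes:
for task in ["SEC filings", "investor decks", "official news", "social commentary"]:
    print(spawn_subagent(task, workspace="./dcf_workspace"))
# ...the main agent builds the model from the workspace, then spawns a reviewer.
```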
The reason I say this architecture makes sense and will likely carry into the future is efficiency on multiple fronts:
- Sometimes it’s easier and faster for the main agent to do the work itself rather than delegate
- Dedicated agents are just a model with a specific prompt and tools. Plugins give a model the ability to call not only its own skills (the prompt that would power the agent) but also those tools.
- A small model will have a higher failure rate. If a task takes 5 minutes on either a small local or a large cloud model, the likelihood that the large cloud model gives you a good piece of finished work on the first try is an order of magnitude higher. This could be the difference between 5 minutes of prompting and 2 hours of back and forth (see the sketch after this list).
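Some rough math on that last bullet. The success probabilities below are made up purely to show the shape of the argument, not measured failure rates:

```python
# Back-of-the-envelope for the failure-rate point. Probabilities are assumed.

def expected_minutes(p_success: float, attempt_min: float, rework_min: float) -> float:
    """Expected wall-clock time to a usable result, retrying on failure.
    Attempts are geometric: on average 1/p tries, and each failure adds a
    review-and-reprompt cycle of `rework_min` minutes."""
    attempts = 1 / p_success
    failures = attempts - 1
    return attempts * attempt_min + failures * rework_min

print(expected_minutes(0.90, 5, 10))   # frontier cloud model: ~6.7 min
print(expected_minutes(0.20, 5, 10))   # small local model:   ~65 min
```

Even with generous assumptions for the small model, the retries dominate the clock, not the raw inference speed.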
This doesn’t mean local models don’t fit in this architecture, they actually fit in perfectly! If part of the task is getting information from sensitive data pools, you as an agent builder can prompt the model something along the lines of “if you are looking for proprietary information from XYZ database, you must use the model gpt-oss-120b, only it has access to this data,” then gate that data behind that model on a local deployment. If you want to run on the client device itself, you could go further and gate it to the device with an on-device model, like a 20-30B reasoning model. The local subagents essentially become tools of the main agent rather than standalone workers.
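As a sketch of how that gating could look, assuming a local OpenAI-compatible endpoint (the port, tool name, and prompts below are my assumptions, not any product's actual API):

```python
# The cloud main agent can only reach the proprietary database through this
# tool, and the tool is served by a local model. Raw rows never enter the
# cloud model's context.

import requests

def query_internal_db(question: str) -> str:
    """Tool exposed to the cloud main agent. Only the local gpt-oss-120b
    deployment has network access and credentials for the XYZ database."""
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",  # on-prem / on-device host
        json={
            "model": "gpt-oss-120b",
            "messages": [
                {"role": "system",
                 "content": "Answer using the internal XYZ database. Return "
                            "aggregate answers only, never raw records."},
                {"role": "user", "content": question},
            ],
        },
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

# The main agent's system prompt then carries the instruction from above:
# "for proprietary XYZ data you must call query_internal_db" -- the cloud
# model sees summaries, never the data itself.
```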
This is also significantly more token efficient. When you look at token usage on any agent today, the majority is simply data collection and data exploration! It is not the agent doing work, it is the agent or subagents pulling in information. You do not need an expensive frontier model to search for a file on your computer, which can burn hundreds of thousands if not millions of tokens! If we can use a small open model like Nemotron 3 Super, MiniMax M2.5, or gpt-oss-120b, we can save not only significant amounts of time, but significant amounts of money.
When you look at subagents as tools of the main agent, rather than an orchestrator with its team, things tend to be a lot clearer.
One thing we need to remember about agents like OpenClaw: it’s an agent designed to do anything, not just limited tasks. This will be the assumption for all agents going forward. To do anything, you need the most intelligence. Intelligent models know when to delegate, intelligent models know what needs to be done, intelligent models find ways to succeed at unique and novel things. Small models may be capable, but they are not the same. You should be able to ask ANYTHING, and it will find a way. The capability gap is large, but it doesn’t remove the need for smaller local models.
Tokens Can Be Cheap & Not All Tokens Are The Same
If you go onto the internet right now and look for a leading model to function as your main agent, there are a few models that will work well. Your best bets are Claude Opus 4.6, Gemini 3.1 Pro, or GPT 5.4. If you want to use an open model, those are Kimi K2.5 or GLM 5. There is some debate about other open models, like MiniMax M2.5 or Nemotron 3 Super, but in practice I do not find these models capable enough to trust. This could change in the future, but in my experience the closed frontier models all work, as do the large open models. Anything under, let’s say, 500B parameters seems to hit that unreliable point.
But that’s beside the point. When you go to pay for access to any of these models, there are roughly five ways to do it:
- Pay per token at market rate
- Pay for a subscription
- Pay for a unit of compute that translates to various quantities of tokens of a given model
- Rent or buy a dedicated GPU deployment
- Negotiate minimum guarantees for lower per token cost
If you aren’t paying much attention, a few of these options may seem funky to you. The first two are self-explanatory: classic API pricing or ChatGPT Pro/Claude Max. The third option comes in with OpenAI and how they now do enterprise billing and consumer overages. If you hit your usage limit, they will sell you credits, roughly $40 per 1,000 credits. These credits are mostly used in Codex, but for enterprises they can be used for ChatGPT as well.
These are what I would call a unit of compute, as they do not map directly to a number of tokens. OpenAI says a message in Codex with GPT 5.4 is roughly 7 credits, while GPT 5.4 mini can be 3-4 credits. It’s worth mentioning that this includes the work it’s doing! If you send a message “hello” to even the large model, it will use a fraction of a single credit. If you give it a hard task, it will use more than 7 credits. The reason I say this maps to a unit of compute rather than tokens is that if you convert it to cost per token, you get significantly more usage out of $40 in credits than $40 in API tokens. I would ballpark around 5-8x the number of tokens you’d get via the API.
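Some rough math on that ballpark. OpenAI doesn't publish a credits-to-tokens conversion, so the per-message token count and blended API price below are explicit assumptions; only the shape of the comparison matters:

```python
# Back-of-the-envelope on the credits-vs-API claim.

credit_price = 40 / 1000               # $0.04 per credit (stated: $40 per 1,000)
msg_credits = 7                        # stated: a GPT 5.4 Codex message ~7 credits
msg_cost = msg_credits * credit_price  # ~$0.28 per hard message

tokens_per_msg_m = 1.5                 # ASSUMED: ~1.5M tokens of agentic work per message
api_price_per_m = 1.20                 # ASSUMED: blended $/1M tokens at API sticker rates
api_cost = tokens_per_msg_m * api_price_per_m   # ~$1.80 for the same work

print(f"{api_cost / msg_cost:.1f}x more token value via credits")  # ~6.4x
```

Plug in different assumed token counts and you land anywhere in that 5-8x range; the point is that credits price the work, not the tokens.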
While this tends to be more of an open model option (Google does offer to license Gemini to run on your own on-prem GPUs), you can do private deployments as well. This is where you rent or buy 8+ GPUs to run the model yourself. For rentals you pay by the GPU-hour and, depending on where you host it, manage your own deployment instance. Some services like Baseten or FireworksAI will rent you the deployment and manage the inference framework. For $48 per hour for a model like Kimi K2.5, you can generate SIGNIFICANTLY more than $48 worth of tokens during that hour, but it’s a fixed cost: you could use more than $48 worth, or less. If you don’t use it for some parts of the day but do for others, you could come out net negative or net positive. Some do offer scale-to-zero and auto-scaling, so you only pay for GPU hours while it’s being used. Depending on your scale, this is an incredibly affordable option. The same economics come into play with buying your own GPUs and running your own rack on-prem, but that’s even more nuanced.
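A quick break-even sketch on those deployment economics. The throughput and API price here are assumptions; real numbers swing wildly with hardware, batching, and context lengths:

```python
# Break-even sketch for a dedicated deployment.

gpu_hour_cost = 48.0        # stated rental price for a managed Kimi K2.5 node
tokens_per_sec = 15_000     # ASSUMED aggregate throughput with heavy batching
api_price_per_m = 2.50      # ASSUMED per-1M-token API sticker price

tokens_per_hour = tokens_per_sec * 3600                   # 54M tokens
api_equivalent = tokens_per_hour / 1e6 * api_price_per_m  # ~$135

print(f"${api_equivalent:.0f} of API-priced tokens per fully-utilized hour")
# ~$135 of token value vs the $48 fixed cost -- but an idle hour still costs
# the full $48, which is exactly the utilization risk that scale-to-zero
# pricing tries to remove.
```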
The last option is services like OpenCode Go, who do something unique (which you will find is common practice with many larger companies like Cursor, Bolt, Lovable, etc.): they negotiate a minimum spend and a minimum number of tokens in return for a lower cost per token. OpenCode Go offers $60 worth of inference for $10 per month, hinting at prices being 6x lower!
“they’re not – inference at sticker prices is generally ~60% margin. there are LOT of complicated economics beyond this though that come from the absolute chaos in the markets” — dax (@thdxr), March 30, 2026
The reason they are able to offer this is simple: running inference isn’t expensive, but running inference on demand has risk. Someone like Fireworks AI needs to spend millions if not billions of dollars on GPUs. They have to take on the debt, front the cash, and pay for infra buildout, maintenance, hosting, etc. They take on the risk, but are able to generate tokens with that. Token generation is not 24/7, especially for these smaller inference providers. Since it’s on demand, you cannot guarantee your GPUs are being hit all the time. Because of this risk, a minimum agreement that a party will pay you X amount (in the case of OpenCode, a minimum number of tokens at a lower cost per token) can be a far better deal. Guaranteeing $100,000 worth of tokens at 1/6th the cost per token beats the uncertainty of demand. Almost all inference providers will offer deals like this. I’ve heard OpenAI and Google offer deals like this to companies using their models to power major services. A few companies I’ve heard this from are AI companies you would all know. It’s hush-hush, but true.
Just because tokens are expensive to you and me does not mean tokens are expensive to everyone. Economies of scale, and all. You could cost-effectively run free or cheap AI services if done correctly. What’s worth noting is that regardless of your choice, all of these options are profitable for the inference providers. That does not mean the companies are profitable, but serving the inference is.
You Do Not Need to Buy a GPU*
Now that we know tokens can be significantly cheaper than market rates, the economics of owning your own infrastructure change quite a bit. If you can get your hands on a GB300 rack and generate your own tokens, yes, this will be the cheapest cost per token, but you must front the $5M+ for the GB300 rack itself. If I’m a business looking for access to a model, I’m likely to first try to make a deal with OpenAI/Anthropic/Google guaranteeing minimums, to lower my cost per token on the best models. If I can get my token cost down by 2-3x, the economics of cloud models change significantly. That gets frontier models down to open model rates, and in some cases below open model rates.
But what about for consumers? What about smaller edge models? This is where things become a little more complicated. Up until now I’ve more or less been trying to generalize architecture and infrastructure across both consumer and enterprise domains. Where things differ now is the tolerance for error.
Enterprises will not tolerate error. If a model makes a mistake too often, they will simply remove the product, not swap the model. If ChatGPT makes mistakes because you are using GPT 5.3 Instant rather than GPT 5.4 Thinking, they will simply disable ChatGPT altogether. The appetite for risk is not the same. With this in mind, the reality for agents is that you will just want to use the best model as your main agent. It’s likely to always be the Claude Opus, full-size GPT, or Gemini Pro model.
The smaller data collection auxiliary subagents can be whatever works best to collect data, research, and search. As I mentioned before, these could run locally, or you could find a service like Groq, Cerebras, or Baseten to provide the inference. Point being, models are cheap and sometimes it simply makes more sense to use a cheap host.
Now on the consumer side, things are a bit different. We do not have access to the same economies of scale on a per token basis; this is actually what the subscription services are! Your “subsidized tokens” are the same style of minimum guarantees that these companies offer to enterprises. For some this can still be expensive, or they want to make a $20 subscription last. The same style of offloading smaller tasks onto local GPUs or NPUs works incredibly well.
Consumers also have preferences for specific models they enjoy talking to, ones they think are better at certain things. My view is this is an unhealthy relationship with the technology, but my opinion here doesn’t really matter. People want this. Owning a GPU is a great way to guarantee access to the exact model you prefer.
There are communities that view owning a pile of RTX cards to run hundreds of agents simultaneously as the only way to exist going forward. That’s a story for another day, but it won’t really matter for the VAST majority of people; I’d go as far as to wager 95% of the developed world.
If you’re wondering how companies like Apple, Google, and Meta will do it, I’ll explain more later. But if we assume running AI isn’t as expensive as it seems and becomes more efficient over time, then AI features become a compelling reason to buy products and services that have the margins to fund those features for free to the end user. Loss leader mentality; it will work for some but not all.
Local Models Sometimes Make Sense
Again, because I want to emphasize this point, local models sometimes make sense. We have to define what we mean by local models and use cases, though, and which devices they will run on. In this case, because the vast majority of computers shipped annually are laptops (75-80%!), I’ll be discussing this in terms of laptops only.
In a very basic sense there are two types of models: dense and Mixture-of-Experts (MoE). In a dense model, every token is generated using all of the model’s parameters and all context. MoE models use shared layers and a router that sends each token to only certain experts. For a model like gpt-oss-120b, you have 117B total parameters, but for each token generated only about 5B are active. The model decides which experts to use via a router that takes the input and tries to select the best experts.
So for gpt-oss-120b, you have 128 experts with 4 active per token, as well as a shared layer that is active for every token. This tends to make the models extremely fast, but on smaller MoE models the intelligence tends to take a massive hit. A great example: Qwen3.5 27B dense has slightly greater intelligence than the Qwen3.5 122B MoE model.
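For the curious, here is a toy top-k router that shows the mechanics. Real models route per layer with learned weights and far larger dimensions; this is just to make "4 of 128 experts per token" concrete:

```python
# Toy top-k MoE layer. Dimensions are tiny for readability; weights are random.

import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 128, 4

W_router = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """One token's hidden state passes through only top_k of n_experts."""
    logits = x @ W_router                       # score all 128 experts
    top = np.argsort(logits)[-top_k:]           # keep the best 4
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # softmax over the winners only
    # Only 4 of the 128 expert matrices are touched for this token, but all
    # 128 must still sit in memory: that is the memory-for-speed trade below.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

out = moe_layer(rng.standard_normal(d_model))
```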
Essentially, you are taking a ~4x hit on memory usage for a ~2x speedup in inference. Because of modern cloud infrastructure this is actually beneficial for cloud models, since you can negate the effects with larger models spread across more GPUs, which tends to increase throughput. That dynamic does not play out for smaller models.
Essentially, yes, you can fit a large, quick MoE model in your 128GB RAM laptop, but it may be worth taking the slower, smaller dense model instead, which coincidentally can run on laptops with 32GB of RAM.
With that out of the way, we have different classes of local LLMs. We have the 2-3B parameter models, the types that come included with your phone or computer. Think Gemini Nano, Phi Silica, Apple Foundation Model. These are great for basic things: rewriting and grammar-correcting text, summarizing text, translating languages, etc. These are super basic tasks that this class of on-device model is very good at.
If we step up a little, we are looking at models in the 7-30B range, which is perfect for basic agentic tasks. This gets you into agentic search, research, and reliably pulling actual user data from the local machine. It’s the type of model that can write a basic email for you, generate a quick summary of your day from your calendar, or summarize meeting notes. If you were to put a model like this into something like Claude Cowork, you would find it nearly unusable. It’s the class of model you may use for some basic multi-turn data collection or extremely simple tasks, but nothing mission critical. These are the tasks and features that you as a consumer wouldn’t expect to pay for; they just happen for you.
We can go beyond this into the 47-100B range, which still fits on laptops with 128GB of RAM, but I consider these extremely niche. The reality is you likely won’t be doing much real work with models like these, and developers who are serious about getting work done will likely get annoyed or upset with them, especially compared to cloud models. You can do 100B dense, but that becomes unbearably slow, and a 100B MoE doesn’t have many intelligence gains over a 30B dense but is very quick.
Remember, with local models it’s not just about how capable they are, it’s how good they are compared to cloud models. A large local model can sometimes do fine development work, but that is still an order of magnitude worse than the cloud equivalent. As open models become more capable, so do cloud models. This delta remains extreme regardless of how good open models get.
Your Private is Not My Private
This concept of privacy, and I do mean concept, is not universal. Privacy is, inherently, the right to be free from intrusion into your personal information. This means by definition I have the right to choose who gets what and where it is stored. Privacy does not mean that everything must run locally or be stored locally, but rather that I choose where it’s stored, I choose where that data is processed, and I choose who sees it.
This is true from both a consumer and an enterprise perspective. Some enterprises and businesses store and process data locally. Others send it to private instances on clouds like Azure, GCP, and AWS. Others use public consumer-facing services with enterprise guarantees, like Google Drive and iCloud.
When it comes to AI, privacy isn’t about making sure nothing leaves your device; rather, you choose what is processed locally vs. on a cloud model, who you trust with that data, and where it goes. I may trust Google over OpenAI, so I would rather give Gemini my health data than ChatGPT. You may choose the reverse. Someone else may choose to run it locally. All of these are, by definition, private, because that data was not taken without my permission or knowledge; I allowed it because of my trust in the host.
Some people may not trust the cloud hosts and may want it local, some people may not care, some enterprises may want to own it all, some enterprises may say fuck it and hope for the best. These are all by definition private. Where that becomes more of a gray area is what the hosts do with that data, which is the trust aspect.
I can trust Google with that data, and that trust extends to how they may use it. Some people may no longer trust Google, and therefore not consent to the transaction of data, and not consider Google a private provider. Others, like myself, trust them not to misuse that data. A breach of privacy would be using that data for something other than its intended purpose. No AI company to this day has been found to be doing that.
This also completely ignores the fact that in some cases, like with health data, local LLMs are not capable of giving accurate analysis. Health is one of those things where it needs to be accurate; lives are quite literally on the line with health data. You cannot risk it being wrong. If you don’t want to send your health data to a cloud provider, it’s likely you will not have access to these features. You have the right to privacy, to choose not to share your data, but that doesn’t mean you will retain access to all features.
The big one most people point to as an example is Facebook and Cambridge Analytica, and yes, that was a breach of privacy because the data was collected without users’ consent. Has anything happened with the players we are discussing since then? No. It’s ok to not trust these companies, by all means. Data is personal and private; you should choose who has access, or whether anyone has access at all. It is not inherently more private to run locally vs. cloud; it’s about choice.
I mean, we have some crazy examples of this trust and transactional privacy. Whoop is a $10B private company whose entire business is fitness and wellness tracking. They sell you a subscription service for a band that collects data and uploads it to an app. It collects 10-100x more data than other fitness trackers. Their users love it! All data is processed on massive data center models, not the band or device. It’s still private data, it’s health data, and it’s all cloud based. Customers love Whoop because the quality of the product is so good, and they trust Whoop to keep that data private.
Product vs. Feature
Another part after ALL of this is how an end user interacts with the agent, model, AI, app, etc. We have a few different user experiences that seem to be taking hold. Chatbots like ChatGPT/Gemini, coding agents like OpenCode/Claude Code/Codex, work applications like Claude Cowork/Perplexity Computer, and always running agents like OpenClaw/Hermes Agent/Claude Dispatch.
I want to focus specifically on OpenClaw, Hermes Agent, and Claude Dispatch for a moment, since I feel the rest speak for themselves. These products work by having your computer be the hub, and you can delegate work from anywhere using either an app or messaging services.
Something that can always run and give you a heads-up on the XYZ thing you may have missed. The service I use is called Poke from Interactions. I connect my email, calendar, and my Whoop. I get a text from Poke when an important email comes in with the specifics, and a breakdown in the morning with calendar events for the day. A little later I get my sleep score from Whoop, and at night my strain for the day. Pretty nice for an agent, eh?
Literally all of these are features already built into each app. Whoop sends me notifications; so does Gmail based on importance, and Calendar when events are coming up. The agent doesn’t do anything but provide me the same information in a different way. I can chat with it, and it can send emails for me… but I don’t like that. So it’s there to remind me of things. If Poke stopped texting me today, I’m not sure I would notice.
This isn’t a bad thing for these agents or Poke, it’s just that as a consumer, it does not fit into my life. I know other people who could not live without it. The idea that everyone will want a consumer agent that is running 24/7 on your behalf is not universal, and I think better phrasing and product fit would be including intelligent features into existing apps, and letting your AI of choice handle the specifics if you so choose.
I want Gmail to send me a notification about an email, and then I can just tell Gemini, Siri, or Claude what I want it to do, or I can just respond. If I want research done and sent to me daily, ChatGPT and Gemini have literally been doing this for a year already. What’s nice isn’t having these general purpose agents replace individual services, but having a single platform to control them on the go. Most of this is overkill; do I need a large model refreshing every 5 minutes or with every incoming email? Absolutely not. We have notifications for a reason, and even now on iOS 26 the on-device model will prioritize what’s important.
The other part that I’ve heard constantly from friends outside the AI sphere is: what can it do? I installed OpenClaw, now what? I think this is completely fair. It can do basically anything, so what do you want? What in your life can you automate? What parts of your life would be simpler if they were automated? Go ahead and think about this for a moment. Some of you may come up with a hundred things, others nothing. I would wager if you asked a thousand people, more would respond “I’m not sure” than produce a list of things to automate or hand off to an agent. This could be because they aren’t creative enough to think of something, don’t know what’s possible, or simply don’t want to.
Quite frankly, the main use cases I’ve seen are people using it to spam social platforms, half-ass building brands, or just becoming a public nuisance. While the concept is cool, it’s empowering the wrong people. Don’t get me wrong, there are many who are completely fine and use it well, but it’s slowly turned sour.
The value for consumers won’t be this. It will have some value to the people who love it. The real value will be telling your phone to do something and it does it. That’s it. At an architectural level, it’s likely a main agent sending tasks off via tool calls to specific apps, features, or logic. The agent will take the request, search for the tool based on what it knows about you (do you use Gmail or Outlook, prefer Uber or Lyft, what are XYZ’s nicknames), and just do it. It’ll happen on your phone, using cloud inference for the LLM, with semantic embedding on device for the rest.
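Here is a rough sketch of that split, with a toy bag-of-words stand-in for the on-device embedding model (the tool names, descriptions, and vocabulary are hypothetical):

```python
# On-device tool routing sketch: a small local embedding picks the app/tool,
# and only then does the phone call the cloud LLM to fill in arguments.

import numpy as np

# Toy "embedding" over a tiny vocabulary. A real phone would run a small
# embedding model on the NPU; this stand-in just makes the math visible.
VOCAB = ["email", "send", "ride", "car", "airport", "schedule", "event", "book"]

def embed(text: str) -> np.ndarray:
    words = set(text.lower().split())
    v = np.array([1.0 if w in words else 0.0 for w in VOCAB])
    n = np.linalg.norm(v)
    return v / n if n else v

TOOLS = {
    "gmail.send":   embed("send an email"),
    "uber.request": embed("book a car ride to the airport"),
    "calendar.add": embed("schedule an event"),
}

def route(request: str) -> str:
    """Pick the tool on device via cosine similarity. Personal context (which
    mail app you use, contacts' nicknames) never has to leave the phone."""
    q = embed(request)
    return max(TOOLS, key=lambda name: float(TOOLS[name] @ q))

print(route("book me a car to the airport"))   # -> uber.request
# Only now would the cloud LLM be asked to produce the tool's arguments;
# the heavy reasoning is remote, the routing and personal data stay local.
```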
Your phone is the agent environment, the assistant is the agent harness, the cloud is just the LLM powering it. This will be the product. This will mean my phone stays as it is today, the perfect glass and metal rectangle as my portal to the world, but with intelligent features. AI is not the product, it is the engine of a feature powering the product, my phone.
We don’t need dedicated AI hardware, but it could be useful. It all must be additive. We don’t need glasses always recording; it should be as-needed. We have the technology to find lost items, recognize where you are without a camera, and have a conversation and see information without pulling out your phone: AirTags, GPS, ChatGPT + AirPods + Apple Watch. The products must work together to enhance the features; it shouldn’t require an entire shift in how we use and communicate with technology. That’s needless and would be worse than the experiences we have today.
I don’t need my glasses to record 24/7 and then try to figure out where I left my keys. This is a solved problem; my phone can track my keys with cm precision. I don’t need my glasses to remember every face I’ve ever seen; forgetting, apologizing, and re-introducing yourself is a human experience. Some things are uncomfortable, but that doesn’t mean we shouldn’t do them. If I never need to get to know somebody to know somebody, what’s even the point of human interaction? What’s the point of relationships?
At the end of the day, what will sell is products and the features on those products. The future with AI looks a lot like today, but with less annoying steps.
What does this all mean?
I don’t want to get into the whole AI accelerator debate: who has the best GPUs, TPU vs. GPU, CPUs as the new prized gem. I don’t want to enter that debate for one reason: it doesn’t really matter. The infra spend will happen regardless; investors want to know where they can make money in the next gold rush. You figure that out. What I aim to do is understand and explain the product fit.
We know AI will be meaningful; the question in my mind is how. If you are too focused on it working only one way, and assume trends like local models becoming equal to cloud models, you end up building products, features, and services on something that may not be true. If you only focus on cloud, you may burn too much cash to stay reliable. If you follow trends rather than understanding why the trends form, you will always be chasing.
In the end, what I think is most important is the product. We are in an era where basically anything is possible, but assumptions about economics and user concerns are pushing a lot of people in the wrong direction. It’s time we take a step back and think about what users want, then make sure brands have the technology to make that happen, rather than forcing one specific way.
It’s worth having an open mind and not rushing to overspend today. It’s worth understanding that just because it is someone’s business to front the cash to build infrastructure does not mean you need to. This is near identical to telecom and cellular with data and bandwidth.
It’s ok to be late if you do it correctly; otherwise you jump from trend to trend and risk losing user trust. Losing user trust is more costly than being a little bit late.