What’s up with ChatGPT & GPT-5?
OpenAI finally released GPT-5 last week, and there has been quite a lot of discourse on the internet about its quality. Some users have a great impression, others not so much. Some are mourning the loss of a friend, others are just happy to have a more intelligent model with free access. There is a lot of confusion, a lot of questions, and a lot of varying opinions.
To be clear, there are a few models in the GPT-5 series: GPT-5, GPT-5 Mini, GPT-5 Nano, and GPT-5 Chat. All of these are unified reasoning models with four reasoning-effort options: minimal (effectively a non-reasoning mode), low, medium, and high. GPT-5 and GPT-5 Chat are similarly sized, but the Chat variant is a finetune optimized for multi-turn conversational chat. GPT-5 Mini and GPT-5 Nano are smaller, cheaper versions for tasks that don't require as much intelligence: simple things such as web search or summarizing an email are great with Mini and Nano, but agents and high-value workloads are meant for full GPT-5! There is also a difference between the API and the ChatGPT platform, as well as third-party surfaces like Cursor, Codex, Perplexity, and Copilot. Let's try to explain!
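To make the variants concrete, here is a minimal sketch of picking a model and reasoning effort through the OpenAI Python SDK's Responses API. The model names and `reasoning` parameter follow OpenAI's GPT-5 launch documentation as I understand it; verify the exact fields against the current API reference.

```python
# Minimal sketch: selecting a GPT-5 variant and reasoning effort via the
# OpenAI Python SDK (Responses API). Field names follow OpenAI's GPT-5
# launch docs as recalled here; double-check against the live reference.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Cheap, fast path: GPT-5 Mini with minimal reasoning for a simple summary.
summary = client.responses.create(
    model="gpt-5-mini",
    reasoning={"effort": "minimal"},
    input="Summarize this email in two sentences: ...",
)

# High-value path: full GPT-5 with high reasoning effort for agentic work.
analysis = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "high"},
    input="Plan and execute the multi-step task described below: ...",
)

print(summary.output_text)
```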
When you use ChatGPT, you are chatting with "ChatGPT 5" (from now on, when I refer to a model as ChatGPT 5 or ChatGPT 5 Thinking, I mean it was accessed from the ChatGPT app), which is a unified router: depending on the prompt, your message is forwarded to one of the above models with Thinking enabled or disabled. Paid users can force Thinking, for 3,000 queries per month on the Plus tier or unlimited on the Pro tier. The Pro tier also gets access to GPT-5 Pro, a more advanced version of GPT-5 Thinking that far outperforms any other model, but for the sake of this report I'm not going to mention it, because it makes an already confusing setup more confusing.
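For intuition, here is a purely illustrative sketch of the routing idea. OpenAI has not published how the real router works, so every detail below, the heuristic, the thresholds, and the internal model names, is an assumption, not a description of their system.

```python
# Purely illustrative: a toy version of the ChatGPT 5 routing idea.
# OpenAI has not published its router's internals; the heuristic,
# thresholds, and model names here are all assumptions.

def looks_hard(prompt: str) -> bool:
    """Toy difficulty heuristic standing in for a learned classifier."""
    keywords = ("analyze", "prove", "debug", "step by step", "spreadsheet")
    return len(prompt) > 400 or any(k in prompt.lower() for k in keywords)

def route(prompt: str, force_thinking: bool = False) -> tuple[str, bool]:
    """Return (model, thinking_enabled) for a given prompt."""
    if force_thinking or looks_hard(prompt):
        return ("gpt-5-thinking", True)   # Plus: counts against 3,000/month
    return ("gpt-5-chat", False)          # fast path for simple chat

print(route("how many Rs in strawberry"))          # ('gpt-5-chat', False)
print(route("analyze this Qualtrics export ..."))  # ('gpt-5-thinking', True)
```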
The reason I mention all of this is that depending on how you use GPT-5, where you use it, and what you use it for, your experience will be different. In the ChatGPT app, your day-to-day experience is unlikely to change much.
ChatGPT 5 Thinking is roughly as capable as the previous o3 model, and ChatGPT 5 is roughly as capable as GPT-4o/GPT-4.1. For search, basic conversation, or image generation, it's unlikely you will notice a meaningful difference outside of its personality. Will ChatGPT 5 still hallucinate when not given external context? Yes. Will it sometimes be wrong? Yes. If it doesn't search and doesn't have external context, it will still have the classic problems: it makes up URLs, it does the classic AI things. If this is how you use ChatGPT, you will likely think GPT-5 is worse, and frankly, in some of these cases it probably is.
On the other hand, for more sensitive workloads, such as data analytics, business processing, health queries, and coding, it's a significant upgrade. These are the workloads where you give the model information or data and ask it to work with that. One of the major improvements in GPT-5 was a reduction in hallucinations, and this is where it comes into play: a longstanding problem with ChatGPT was hallucination in the chain of thought when working with uploaded data, which led to bad data processing.
One of the examples we use internally to evaluate models is uploading the raw Qualtrics data from a study we ran and seeing whether the model can accurately answer questions from it. Note that this is not testing raw model performance but rather platforms powered by models. For example, when you give ChatGPT the data in CSV format, it writes Python code to inspect and understand the data. Not every platform or model does this well or consistently, so the environment where you run the model matters just as much as the model itself. Letting an agent run without access to a proper environment or harness will leave it unable to complete its task!
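For a sense of what that generated code looks like, here is a sketch in the spirit of what ChatGPT writes when handed a raw export. The filename and column names are hypothetical, and the skipped rows reflect the extra metadata rows Qualtrics typically includes in its CSV exports.

```python
import pandas as pd

# Qualtrics CSV exports typically carry two extra header rows (question
# text and import metadata) below the column names; skip them so numeric
# columns parse as numbers. Filename and columns here are hypothetical.
df = pd.read_csv("qualtrics_export.csv", skiprows=[1, 2])

print(df.shape)           # how many responses and columns survived parsing
print(df.dtypes.head())   # sanity-check that scale items parsed as numeric

# Answer a question like "mean satisfaction by condition":
summary = df.groupby("condition")["satisfaction"].agg(["mean", "count"])
print(summary)
```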
The first model to successfully understand the data was OpenAI's o3 in ChatGPT, but only when given specific information and hand-holding about how it was formatted. The second was the ChatGPT agent; it did not require hand-holding, but it did hallucinate information. ChatGPT 5 Thinking, on the other hand, was able to look at the data, process it, and provide accurate answers without detailed prompting about the data's format.
This is a common trend I've noticed in testing ChatGPT 5 Thinking: it extracts information better, understands that information, and uses it more effectively. When I upload equity research notes, workshop a financial model, or ask it to create an interactive dashboard in a Canvas, ChatGPT 5 Thinking handles it far better than most other models, comparable to something like Claude 4.1 Opus but usually better. My opinion is that ChatGPT 5 Thinking is the best model to use for work that requires precision when given context.
The other side of things is GPT-5 in the OpenAI API. Essentially, if you are anyone but OpenAI, this is what you will use to integrate GPT-5 into your application, and it is one of the best models on the market for this, bar none. Note I'm not talking about ChatGPT or the value proposition of a ChatGPT Plus subscription versus Gemini, Grok, or Claude. There are a few reasons for this, but let's start with the most important one: it's natively agentic, with a proper understanding of tool calls and how to interact with its environment.
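As a concrete example of what "natively agentic" means at the API level, here is a hedged sketch of handing GPT-5 a function tool through the Responses API. The flattened tool schema matches OpenAI's Responses API docs as I recall them, and `search_tickets` is a hypothetical tool, so check field names against the current reference.

```python
# Hedged sketch: wiring a function tool into GPT-5 through the Responses
# API. The flattened tool schema follows OpenAI's Responses API docs as
# recalled here; `search_tickets` is a hypothetical tool, not a real API.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "name": "search_tickets",
    "description": "Search the internal ticket tracker by keyword.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

resp = client.responses.create(
    model="gpt-5",
    tools=tools,
    input="Find open tickets about the billing outage and summarize them.",
)

# The model decides when to call the tool; function calls come back as
# structured items in resp.output for your code to execute and return.
for item in resp.output:
    if item.type == "function_call":
        print(item.name, item.arguments)
```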
One of my favorite examples is Cursor, the massively popular coding tool. With a model like Claude 4 Sonnet, the Cursor Agent would make roughly 15-25 tool calls trying to understand context, and it would roughly do what you asked, but it took liberties to go above and beyond. For a model assisting with programming, this is not good: you would ask for help with one thing, and it would refactor half a codebase, adding unnecessary complexity. It was really good, but too much. GPT-5, on the other hand, does not do this. It does exactly what you ask, bringing in the right information, making sure it does what you asked, and doing it properly. It's far better, and it does not go out of its way to add complexity.
This sort of capability jump matters because it carries over to other platforms, like the AI finance platform Hebbia. They noted that simply swapping their model to GPT-5 took their workflows from 40-50% completed by the agent to 90% completed by the agent.
"Every existing workflow is now 50-60% automated. With GPT-5 and Hebbia, Excel outputs are now 90% completed for you. The output isn't just text, it's an Excel model built from scratch." — Hebbia (@HebbiaAI), August 7, 2025
METR runs a benchmark that tests how long a model can work autonomously to complete a task. How long a model can work on its own, testing and confirming its own work along the way, is one of the major measures of an agentic model. Right now, GPT-5 leads all models: it's able to work autonomously for roughly 2 hours and 15 minutes, ahead of every other model, including Grok 4 and OpenAI's own o3.
This is, quite frankly, the best and only model that could power a reliable agent at scale without the need for extreme scaffolding to make it work.
From developers building agents, to enterprise teams, to anyone using the API to create high-quality AI applications, GPT-5 has had outstanding feedback. Rork, a platform that lets you vibe-code mobile apps, saw a 190% improvement in error rates with the model. Cursor set GPT-5 as its default. Hebbia found it nearly completing full workflows. When this model powers an agent in a proper environment, it is incredibly powerful. Chatbots are only part of what AI can do, and a model's "intelligence" shows less in chat than in how long it can work, as in the aforementioned METR benchmark.
Depending on who you ask about the GPT-5 models, you'll get a different answer on whether it's a good or bad model, and that's honestly all down to OpenAI's poor naming and UX handling around the new ChatGPT 5 router. If we look at the pure model, its capabilities, its value, and how it functions in actual workloads, it is objectively a great model: the best at the price it's offered at, and the most capable today. This isn't to say that will last forever. Google likely has Gemini 3 coming soon, which may beat GPT-5 in intelligence and value, and Anthropic has more updates coming to the Claude 4 series that may match it. The space is heating up, and things aren't slowing down.
The reality is, even if we have the most intelligent model in the world, it's unlikely you or I could ask it a question that would elicit a response that really shows it. We can always find the places it won't be better than us; we can always find the limits. But if the limits are messing up questions like "how many Rs in strawberry" while being able to look through thousands of rows of raw study data and perform an accurate statistical analysis, do the dumb questions really matter? Probably, but we can be willing to ignore those for now for the value we are otherwise getting.