The rollout of GPT-5 is a lesson in how to hurt your brand while delivering a big upgrade. First I dive into why the rollout is landing so badly and then dive into the capabilities.
Rollout
OpenAI dropped GPT-5, the supposed everything model which was going to save people from the faff of selecting models and cleverly adapt into deeper thinking modes, web search, images, etc. as needed automatically. And without much warning, they stripped access to the other models.
The problem is that the model called ‘GPT-5’ in the drop down is a bit crap, a bit stern and happy to disagree in a cold data led way and the internet wasn’t happy. So much so that OpenAI caved and brought back GPT-4o!
So let's address this step by step, is it really crap? The first thing I asked it for was what's the best way to run a Claude Code equivalent with GPT-5 to check out its agentic coding chops. It gave a flurry of different answers most of which were not that relevant. My wife asked me what I thought and my first response was "I hate it". I've seen other people complain that it can't answer relatively simple questions. One example doing the rounds is:
So yes it's a bit crap but if you switch over to ‘GPT-5 Thinking’ in the dropdown on paid plans, then it feels like it actually starts, well "thinking" which is rather useful in an assistant.
My guess is that this is the intended experience. They created a cut down version for the free tier that may be a minor improvement over 4o but for people used to the other models it feels like a major step back. My gut tells me that it was probably originally going to be GPT-5 mini or lite or whatever but some marketing genius thought ‘don't let them feel they’re getting a cut down version’, just call the non lobotomised version GPT-5 and then tell them real GPT-5 is smarter.
They did create a little hack where you can tell GPT-5 (Non-thinking) to think deeply and it does actually try a little harder but telling someone to think every time they respond, just to get a half decent reply seems like a good way to boost keyboard sales as people trash their computers in a fit of frustration. Anybody who is a parent will know the deep pain of having to remind someone to follow a simple instruction for the 10,000th time and now OpenAI is bringing this experience to you at work! 😂
While you might be forgiven for thinking that telling everyone GPT-5 is like having your very own PhD researcher in every topic and then people finding out that the default can’t handle tasks children in elementary school can handle quite easily would be the biggest issue for most people, it wasn’t even close. The decision to remove GPT-4o was the thing the vast majority were shouting about.
It takes some serious lack of empathy to not realise that a lot of people are going to miss the warm positive model that tells you everything you do is amazing, if you suddenly replace it with a colder, more analytical model that’s very happy to disagree with you.
It’s not to say that an overly sycophantic model is a good thing and besides the amusing anecdote below there have been worrying cases where it’s reinforcing unhelpful ideas for the mentally unwell.
Sama assured us he was going to fix this ‘glazing’ issues with GPT-4o but it feels like he might have overcooked it on the correction for GPT-5 and made a model that is a bit too cold and bitter for more of the companionship use cases and less of a pleasure to work with.
GPT-5 Capabilities
So they’ve blundered the launch but is it a dud? Not at all, I ran my reasoning test. If you want to get more info on the Kortical Schema, you can see this article here:
It’s basically an advanced Winograd Schema that addresses some of the weaknesses to create what has thus far been LLM Kryptonite, even if the questions are pretty easy for a reasonably smart person.
Model scores on the Kortical Schema:
GPT-3.5 - 50% (As each question is A or B, it’s essentially random)
GPT-4o - 75% (Starting to see real signal)
o3 - 92.5% (Now we’re cooking)
o3 - High Compute - 93.7%
GPT-5 - 95%
GPT-5 - High Compute - 100%!
This test is not designed to test for superhuman reasoning capabilities, just fairly average reasoning (No ‘feel the AGI’ here) but it is designed to be especially hard for LLMs where they can’t rely on just parroting answers. I haven’t had too much time with it yet but as models go up this scale they tend to be much better for agentic use, where we get them to do chains of tasks and they need to run many times in a row without making an error in any step. So this is a pretty exciting and significant result, at least for me.
Another thing that’s pretty cool is that you can get it write code and run it right in ChatGPT. It’s also much better at design, so the UI feels better and it seems to be pretty good at one shotting something kinda workable.
I had a go at creating an equation game with my kid and it couldn’t make kid friendly graphics or make the game work without bugs but as far as it did get with a couple of minutes and zero coding was actually pretty damn impressive.
Though if it’s anything like my experience with Claude Code (The best LLM coding assistant so far), the real trick is learning when you’ve reached the limits of the model and should start working by hand. With Claude Code it’s really easy to get stuck in a feedback loop making little progress and burning the productivity gains.
Time will tell if it’s really the start of personal non expert app creation or cool for prototyping or for getting started but still needs pros to make it real.
As a writing assistant I’m still preferring GPT-4.5. Writing is so personal, it’s hard to nail down exactly why I prefer it but I think it’s a combination of the insight, the delivery of the insight and the suggestions. My goto writing prompt is ‘critique what I wrote, do not rewrite it’, so It use it more akin to an editor than for producing text. Maybe the suggestions are just as good and it’s just I prefer a more positive tone but I feel like 4.5 just has a better grasp of how things will land emotionally too.
As a technical sounding board that stronger tendency to disagree seems useful, rather than telling me everything is excellent, it’s been able to disagree and cite research that backs up the direction it’s suggesting. I tested it with a project where I had originally been using o3 and Claude and it gave a bunch of useful ideas to try out that the other models hadn’t. On top of this, it gave a sense of understanding my project better. That’s with very limited testing so far though.
Wrapping Up
Anyway, that’s my 2c on GPT-5 so far. GPT-5 seems like a genuinely useful addition and it’s got some intriguing results that could make it very good at certain use cases but I have to agree with the model choice crowd, it’s not the everything model. Let me know how you’ve got on!
Also let me know if you’d like me to dive in on anything like LLM reasoning, how we get to Artificial Super-Intelligence, Artificial Consciousness or anything else.











