GPT-4.5 is actually a big deal

An in-depth analysis of the features and implications of OpenAI's GPT-4.5 model.

Mar 1, 2025

The OpenAI GPT-4.5 System Card just dropped, and 𝕏 is filled with complaints. Most say it’s too expensive. Others are calling it a half of a half-step.

My take? Greatness in base models will feel subtle now. And if you know what to look for (like you've been building around the limitations of 4o or 3.5 Sonnet), then you'll know 4.5 is hiding some real wins.

My second take? They've got something more important to work on now. And this release feels like a way of cleaning the slate so they can turn their attention to it. After all, attention is all you need. Or, perhaps it just feels that way because the general population doesn't know enough about the two scaling laws to understand the difference between a foundation model and a reasoning model... and OpenAI didn't do a good enough job of explaining it.

In either case, we're in for better launches soon.


4.5 is Quietly Great

Title page of the GPT-4.5 system card, positioning it as a research preview.
  • Their first low-key announcement
    OpenAI isn’t exactly flexing here. The tone feels more like “here it is, don’t @ us.” Makes me think they’re shipping this to move on to something bigger.
  • Price Shock
    Yeah, it’s expensive. Sam Altman straight up admitted they don’t have enough GPUs to make it widely available yet. That could mean the price is artificially high—supply-constrained more than anything.
  • Why Care?
    If you’ve ever been burned by hallucinations or flaky multi-step reasoning, this model might be the fix. And if that’s your bottleneck, maybe it’s worth the premium.

It's Wildly Expensive

GPT-4.5 is expensive
  • Artificial Analysis always has the best charts for these things. It's pricey.
  • I'm not going to belabor the point
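To make "pricey" concrete, here's a back-of-the-envelope cost comparison per API call. The per-token rates below reflect OpenAI's published launch pricing as I understand it; double-check current rates before relying on these numbers.

```python
# Rough per-request cost comparison. Prices are dollars per 1M tokens,
# based on published launch rates -- verify against current pricing.
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single API call."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical RAG-style call: 8k tokens of context in, 1k tokens out.
print(f"gpt-4.5: ${request_cost('gpt-4.5', 8_000, 1_000):.2f}")  # $0.75
print(f"gpt-4o:  ${request_cost('gpt-4o', 8_000, 1_000):.2f}")   # $0.03
```

At these rates, the same call costs roughly 25x more on 4.5 than on 4o, which is why the "is it worth the premium?" question dominates the discussion.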

Its Hallucination Rate is a Big Deal

Graph comparing GPT-4.5 to past models on PersonQA hallucination rates.
  • From 28% to 78% Accuracy
    This is a massive jump. If hallucinations have been wrecking your app, 4.5 might be a lifesaver.
  • Outperforms o1 on Raw Facts
    In pure knowledge recall, 4.5 is already ahead; imagine what happens when they build reasoning on top of this.
  • No one is talking about this, but they should be.
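If you want to track this for your own app, the scoring logic behind a PersonQA-style eval is simple. This is a minimal sketch, not OpenAI's actual harness: the grading labels would come from a reference-answer grader (human or LLM), which I'm stubbing out here.

```python
# Minimal sketch of a PersonQA-style eval: each answer is graded as
# "correct", "hallucinated" (confidently wrong), or "refused", then we
# report the two headline numbers. The graded inputs are a stand-in --
# in practice you'd grade against reference answers.
def score(results: list[dict]) -> dict:
    graded = [r["grade"] for r in results]
    attempted = [g for g in graded if g != "refused"]
    return {
        "accuracy": graded.count("correct") / len(graded),
        # Hallucination rate counts wrong answers over attempts only, so a
        # model that refuses when unsure scores better here.
        "hallucination_rate": (
            attempted.count("hallucinated") / len(attempted) if attempted else 0.0
        ),
    }

results = [
    {"grade": "correct"}, {"grade": "correct"}, {"grade": "correct"},
    {"grade": "hallucinated"}, {"grade": "refused"},
]
print(score(results))  # {'accuracy': 0.6, 'hallucination_rate': 0.25}
```

The key design choice is that accuracy and hallucination rate are separate axes: a model can raise accuracy either by knowing more or by refusing less, and the system card reports both for exactly that reason.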

It's Capable of Doing Actual Tasks

GPT-4.5 vs. GPT-4o in agentic tests
  • 3x Time Horizon
    GPT-4.5 can handle much longer chains of reasoning before losing the plot. That’s huge for agentic workflows.
  • In my experience, Claude 3.7 is very overeager on agentic tasks. GPT-4.5 is more reliable; it might be that nuanced EQ taking effect.
  • That EQ loaded in a reasoning model? That’s a difference maker in many apps.
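"Time horizon" becomes tangible once you look at the shape of an agent loop: every extra dependent step is another chance for the model to derail, so a 3x horizon means far longer loops complete. Here's a toy sketch of that control flow; `call_model` is a placeholder for a real API call, and the stub model exists only to show the mechanics.

```python
# Toy agent loop illustrating "time horizon": how many dependent steps a
# model can chain before it derails. `call_model` is a placeholder for a
# real API call; the loop structure is the point.
def run_agent(task: str, call_model, max_steps: int = 30) -> str:
    history = [{"role": "user", "content": task}]
    for step in range(max_steps):
        reply = call_model(history)  # model proposes the next action or finishes
        history.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply.removeprefix("DONE:").strip()
        # ... execute the proposed action, append the observation ...
        history.append({"role": "user", "content": f"observation for step {step}"})
    return "gave up"  # longer-horizon models hit this cutoff less often

# Stub model that finishes on its third turn, just to show the control flow.
def stub(history):
    turns = sum(1 for m in history if m["role"] == "assistant")
    return "DONE: task complete" if turns >= 2 else f"step {turns}"

print(run_agent("refactor the billing module", stub))  # -> task complete
```

A model that stays coherent for 3x more of these iterations isn't 3x better at any single step; it compounds, which is why the agentic numbers matter more than they look.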

It can MakeMePay

GPT-4.5 outperforms other models in a persuasion-based test.
  • In persuasion tests, GPT-4.5 beat deep research, which is built on o3. That’s wild.
  • Persuasion is a scary capability for AI, and it's a skill with real value.
  • Who knows what happens if you fine-tune this model on insecure code. If they aren't careful, this model could be too good at manipulation.

It Has Deep Code Understanding

Capture the Flag (CTF) benchmark: GPT-4.5 pre/post training vs. other models.
  • 2x to 4x Better Than 4o
    Web security tasks? Major gains.
  • Not Yet o1-Level, But Close
    The true reasoning models are still ahead, but what happens if OpenAI builds chain-of-thought on top of this?
  • Feels Like a Missing Piece
    If they haven’t already trained an “o1-style” reasoning model from this base, they’re probably about to.

It's good at real-world code

GPT-4.5 vs. GPT-4o, o1, and deep research in a SWE-Lancer benchmark.
  • I think SWE-Lancer is the right benchmark for coding evaluations.
  • In it, GPT-4.5 is >2x better than GPT-4o and 3% better than o1. Again, imagine this foundation in a reasoning model.
  • Maybe even more interesting: it's better than o3-mini, which we know was tuned for coding tasks.
  • I'm waiting for 3.7 Sonnet to get on this chart.

Note the gains aren't as large on SWE-Bench, which was the best coding benchmark prior to SWE-Lancer (imo). This suggests the model is better suited to real-world coding tasks in real-world repos, where broader knowledge pays off, than to smaller code changes in smaller libraries.

GPT-4.5 vs. GPT-4o, o1, and deep research in a SWE-Bench benchmark.

Parting Thoughts

GPT-4.5 isn’t the “GPT-5” moment people were waiting for, but it’s clearly a meaningful step up. OpenAI’s quiet tone makes me think they’re looking ahead to something bigger, but the data shows that hallucination rate, agentic capability, and persuasion power all saw major upgrades. And remember, this is a base model. The real magic will come when they build reasoning on top of this.