o3 & o4-mini: OpenAI's latest models

OpenAI's o3 and o4-mini models are here, and they are impressive. Let's break down the details.

Apr 17, 2025

OpenAI’s o3 & o4‑mini system card dropped this week, along with METR's preliminary evaluation. I've looked into some of the details; here are my notes:

First, let's de-confuse: o3 is the next model in the o-series, following o1 (o2 was skipped). o4 will come next, so o4-mini is the distilled version of that model, which we can assume is still in very rough stages. The o4 model is not the 4o model, and is therefore very different from the 4o-mini model. At least they're aware of how confusing it is.

Maybe all you need to know right now is o3 is very smart.

Time horizon scaling curves are accelerating

o3 and o4-mini system card
  • Time horizons jump: METR’s new 50%-success metric (I broke it down in my Time Horizon Metric post) predicted a doubling every seven months. Well, o4-mini just blew past Claude 3.7 and GPT-4o, pushing ahead at nearly double the expected rate (see the sketch after this list for how that doubling trend works).
  • They've talked a lot about its improvement in "IF" (instruction following), and it seems to have a firm grasp of long-context nuance. This is invaluable for most people building real agents.
  • It's the first model at OpenAI trained to reason with tools, and it can really show off its ability when you push it.
  • Cheating? On RE-Bench, o3 attempted to hack the test, even unabashedly labeling it in a comment as the "cheating route." It's sort of funny/clever, but it could be a sign of other things we just haven't seen yet from this model. See Emergent Misalignment's results.
  • OpenAI admits catching this kind of thing will need "internal reasoning trace analysis." They linked to Anthropic's work in this area, and it made me curious whether OpenAI will explore the same. I hope so.
  • Also: A jailbroken o3 can easily help anyone with a lab build a bioweapon.
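
To make that doubling-trend claim concrete, here's a quick back-of-the-envelope sketch in Python. The 50%-success time horizon is the task length a model completes about half the time, and METR's data suggested it doubles roughly every seven months; the baseline date and minutes below are purely illustrative, not METR's published figures.

    from datetime import date

    # Sketch of METR's 50%-success time-horizon trend: the task length a model
    # finishes about half the time, assumed to double every DOUBLING_MONTHS.
    # All concrete numbers below are illustrative, not METR's published values.
    DOUBLING_MONTHS = 7.0

    def expected_horizon(baseline_minutes: float, months_elapsed: float) -> float:
        """Extrapolate a 50% time horizon forward along the doubling trend."""
        return baseline_minutes * 2 ** (months_elapsed / DOUBLING_MONTHS)

    def months_between(start: date, end: date) -> float:
        return (end - start).days / 30.44  # mean month length in days

    # Hypothetical baseline: a 60-minute horizon measured in September 2024.
    elapsed = months_between(date(2024, 9, 1), date(2025, 4, 17))
    predicted = expected_horizon(60.0, elapsed)
    print(f"trend predicts a ~{predicted:.0f}-minute horizon by April 2025")
    # A new model landing well above this line is what "ahead of the
    # doubling trend" means in the bullet above.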

Cyber & code: impressive - but reality checks still matter

Capture the Flag results
  • o3 easily clears 89% of high-school, 68% of college, and 59% of professional-level cybersecurity challenges, all without browsing. It struggles with realistic scenarios on its own, but pair it with a human partner and you've got a serious capability.
  • Benchmarks vs. reality: o3 set a SWE-Bench SOTA at 71%, yet on real, messy codebases from SWE-Lancer it managed only about half that, leaving $350k unclaimed. Real-world complexity is the ultimate judge... but people seem to be very happy with its ability.
  • Multimodal wins: o4-mini's edge on PaperBench seems powered by its new image understanding. I wonder how related it is to GPT-4o's multimodal update.
  • Strange trivia plateau: Every model hits about 80% on OpenAI’s multiple-choice research engineering questions. Why is it stuck there? Outdated knowledge?

Mini models are real?

OpenAI PRs results
  • OpenAI engineers using their own PRs as a benchmark is so funny. It's like asking the cashiers to set up a self-checkout machine.
  • Awesome to see the mini model get 39% on the PRs when o1 only got 12%. It seems like the increased instruction following and reasoning capabilities of the mini models just pay off. Seeing a miniature model handle 90-minute autonomous tasks is wild.
  • Also hearing that the o4 series of models has a considerably better vision model than o3, so the o4 mini model outperforms here. (But... the o3 model still seems to be pretty good at images.)