o1 to o3 - A Bitter Lesson in Coding
OpenAI's o3 model teaches the bitter lesson to its o1 model
Feb 13, 2025
OpenAI shared how its o3 model was able to become roughly the 175th-best competitive programmer in the world.
I took a few notes while reading (below). My main takeaway is that RL on frontier models is the path forward for most tasks. We might not need much more "raw intelligence" than what o3/o4 gives us, but we will need harnesses to make that intelligence useful. I wrote about that in a prior blog post.
- OpenAI pushed o1 hard to compete on Codeforces.
- Additional RL training and a complex hand-designed test-time workflow were used, including clustering and reranking of sampled solutions (a rough sketch of that kind of harness follows these notes).
- With significant human-designed guardrails, o1 achieved 213 points, placing it in the 49th percentile—a strong performance for an AI, though still below the top competitors.
- o3 abandons human-engineered heuristics and learns its own test-time reasoning.
- RL alone converged on an equivalent strategy while outperforming the human-designed methods.
- Despite generating 1,024 solutions per problem compared to 10,000 for o1, o3 delivered superior results.
- Specific RL methods were applied during o3’s evaluation that may not persist in the final model, raising questions about generalization.
- o3 surpassed the gold threshold with just 50 submissions.
- Achieved this without human heuristics or relaxed submission limits.
- RL generalized problem-solving more effectively than human strategies.
- More RL = Less reliance on human heuristics = Better results.
- RL-trained models continue to improve.
- o3 vastly outperforms earlier versions on real-world SWE benchmarks.
- Strange that o3 was absent from the HackerRank AI benchmark results in the report. Perhaps it was too new?
- Reasoning models + RL is saturating the code/software benchmarks.
- My personal experiences don't match the benchmarks I see. Models are smart but blind.
- Need benchmarks/gyms that are more reflective of real-world reward functions (a toy sketch of what such a reward could look like is below).
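
For intuition, here is a minimal, illustrative sketch of the kind of hand-designed "sample, cluster, rerank, submit" harness described for o1. This is not OpenAI's actual code: `model.generate`, `run_program`, the cluster rule, and the budgets are all hypothetical placeholders.

```python
# Toy sketch of a hand-designed test-time harness: sample many candidate
# programs, group the ones that behave identically on shared test inputs,
# rank the groups, and spend the submission budget on representatives.
# All names, budgets, and the ranking heuristic are hypothetical.
from collections import defaultdict


def run_program(program_text: str, test_input: str) -> str:
    """Hypothetical sandboxed runner: execute one candidate on one input."""
    raise NotImplementedError  # stand-in for a real sandbox/judge


def solve_with_harness(problem, model, test_inputs, n_samples=1024, budget=50):
    # 1. Sample many candidate programs from the model.
    candidates = [model.generate(problem) for _ in range(n_samples)]

    # 2. Record each candidate's "behavior": its outputs on the shared inputs.
    behaviors = {c: tuple(run_program(c, t) for t in test_inputs) for c in candidates}

    # 3. Cluster candidates that produce identical outputs.
    clusters = defaultdict(list)
    for cand, outputs in behaviors.items():
        clusters[outputs].append(cand)

    # 4. Rerank clusters -- here simply by size (majority-vote style);
    #    a learned reranker could slot in instead of this heuristic.
    ranked = sorted(clusters.values(), key=len, reverse=True)

    # 5. Submit one representative per cluster, up to the submission budget.
    return [cluster[0] for cluster in ranked[:budget]]
```

The point of the sketch is how much of this pipeline is human-designed scaffolding; o3's result suggests RL can learn an equivalent (or better) strategy without it.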
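
And a toy sketch of what a reward signal closer to real-world software work might look like in a coding gym: rather than a single pass/fail on hidden tests, it mixes signals a working engineer actually cares about. The signal names and weights here are entirely made up for illustration.

```python
# Toy reward for a "real-world" coding gym. Every weight and field below is a
# hypothetical illustration, not a proposal for the actual numbers.
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    tests_passed: int      # repo tests passing after the agent's change
    tests_total: int
    builds_cleanly: bool   # does the project still compile / lint?
    lines_changed: int     # size of the diff the agent produced
    regressions: int       # previously-passing tests that now fail


def reward(result: EpisodeResult) -> float:
    pass_rate = result.tests_passed / max(result.tests_total, 1)
    score = 1.0 * pass_rate
    score += 0.2 if result.builds_cleanly else -0.5
    score -= 0.001 * result.lines_changed   # prefer small, focused diffs
    score -= 0.3 * result.regressions       # breaking things is worse than not fixing them
    return score


# Example: passes 9/10 tests, builds, touches 40 lines, breaks nothing -> ~1.06
print(reward(EpisodeResult(9, 10, True, 40, 0)))
```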