o1 to o3 - A Bitter Lesson in Coding
OpenAI's o3 model teaches the bitter lesson to its o1 model
Feb 13, 2025
OpenAI shared how its o3 model was able to become roughly the 175th-best competitive programmer in the world.
I took a few notes while reading (below). My main takeaway is that RL on frontier models is the path forward for most tasks. We might not need much more "raw intelligence" than what o3/o4 gives us, but we will need harnesses to make that intelligence useful. I wrote about that in a prior blog post.
- OpenAI pushed o1 hard to compete on Codeforces.
- Additional RL training and a complex hand-designed test-time workflow were used, including clustering and reranking of sampled solutions (a rough sketch of that kind of harness follows these notes).
- With significant human-designed guardrails, o1 achieved 213 points, placing it in the 49th percentile—a strong performance for an AI, though still below the top competitors.
- o3 abandons human-engineered heuristics and learns its own test-time reasoning.
- RL alone converged on an equivalent strategy while outperforming the human-designed methods.
- Despite generating 1,024 solutions per problem compared to 10,000 for o1, o3 delivered superior results.
- Specific RL methods were applied during o3’s evaluation that may not persist in the final model, raising questions about generalization.
- o3 surpassed the gold threshold with just 50 submissions.
- Achieved this without human heuristics or relaxed submission limits.
- RL generalized problem-solving more effectively than human strategies.
- More RL = Less reliance on human heuristics = Better results.
- RL-trained models continue to improve.
- o3 vastly outperforms earlier versions on real-world SWE benchmarks.
- Strange that o3 was absent from the HackerRank AI benchmark results in the report. Perhaps it was too new?
- Reasoning models + RL is saturating the code/software benchmarks.
- My personal experiences don't match the benchmarks I see. Models are smart but blind.
- Need benchmarks/gyms that are more reflective of real-world reward functions (a toy sketch of what such a reward could look like is below).
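
For intuition, here is a minimal, illustrative sketch of the kind of hand-designed "sample, cluster, rerank, submit" harness described for o1. This is not OpenAI's actual code: `model.generate`, `run_program`, the cluster rule, and the budgets are all hypothetical placeholders.

```python
# Toy sketch of a hand-designed test-time harness: sample many candidate
# programs, group the ones that behave identically on shared test inputs,
# rank the groups, and spend the submission budget on representatives.
# All names, budgets, and the ranking heuristic are hypothetical.
from collections import defaultdict


def run_program(program_text: str, test_input: str) -> str:
    """Hypothetical sandboxed runner: execute one candidate on one input."""
    raise NotImplementedError  # stand-in for a real sandbox/judge


def solve_with_harness(problem, model, test_inputs, n_samples=1024, budget=50):
    # 1. Sample many candidate programs from the model.
    candidates = [model.generate(problem) for _ in range(n_samples)]

    # 2. Record each candidate's "behavior": its outputs on the shared inputs.
    behaviors = {c: tuple(run_program(c, t) for t in test_inputs) for c in candidates}

    # 3. Cluster candidates that produce identical outputs.
    clusters = defaultdict(list)
    for cand, outputs in behaviors.items():
        clusters[outputs].append(cand)

    # 4. Rerank clusters -- here simply by size (majority-vote style);
    #    a learned reranker could slot in instead of this heuristic.
    ranked = sorted(clusters.values(), key=len, reverse=True)

    # 5. Submit one representative per cluster, up to the submission budget.
    return [cluster[0] for cluster in ranked[:budget]]
```

The point of the sketch is how much of this pipeline is human-designed scaffolding; o3's result suggests RL can learn an equivalent (or better) strategy without it.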
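
And a toy sketch of what a reward signal closer to real-world software work might look like in a coding gym: rather than a single pass/fail on hidden tests, it mixes signals a working engineer actually cares about. The signal names and weights here are entirely made up for illustration.

```python
# Toy reward for a "real-world" coding gym. Every weight and field below is a
# hypothetical illustration, not a proposal for the actual numbers.
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    tests_passed: int      # repo tests passing after the agent's change
    tests_total: int
    builds_cleanly: bool   # does the project still compile / lint?
    lines_changed: int     # size of the diff the agent produced
    regressions: int       # previously-passing tests that now fail


def reward(result: EpisodeResult) -> float:
    pass_rate = result.tests_passed / max(result.tests_total, 1)
    score = 1.0 * pass_rate
    score += 0.2 if result.builds_cleanly else -0.5
    score -= 0.001 * result.lines_changed   # prefer small, focused diffs
    score -= 0.3 * result.regressions       # breaking things is worse than not fixing them
    return score


# Example: passes 9/10 tests, builds, touches 40 lines, breaks nothing -> ~1.06
print(reward(EpisodeResult(9, 10, True, 40, 0)))
```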