Blueprints for the Next-Gen Code Agents
SWE-Lancer is a much-needed improvement over SOTA coding benchmarks. It rewards agentic iteration rather than static correctness of code. Setting RL loose on the benchmark will dramatically improve AI coding performance.
Feb 18, 2025
SWE-Lancer is a much-needed improvement over SOTA coding benchmarks. Unlike SWE-Bench, which evaluates models on GitHub issues with associated pull requests, SWE-Lancer shifts the focus to freelance engineering work, pulling from 1,488 Upwork tasks worth $1M in real payouts.
Below are my notes, but you can check out the full paper on arXiv if you want the details.
Overall my take is that this is much more usable as a real-world benchmark. It rewards agentic iteration rather than static correctness of code, meaning the code has to work in our projects rather than just work in isolation. Also: setting RL loose on the benchmark will dramatically improve AI coding performance.
Benchmarking Results Align with Reality

- Best models achieve ~20% success on coding (IC) tasks on the first try, but closer to 50% on manager-style tasks, where they only have to judge which proposed approach is the good one.
- This matches my real-world experience, where AI assistants suggest plausible solutions but lack the ability to carry them through correctly.
- These numbers are a far cry from what we're seeing on SWE-Bench.
Iteration Significantly Boosts Success

- o1 can push to a ~50% success rate with 7 attempts, roughly a 2.5x improvement over its first attempt (a rough back-of-the-envelope sketch follows this list).
- Shows why we humans will repeatedly type "Still broken" into Cursor in hopes that some randomness will just do the work for us.
- Leads me to believe an RL pass over this kind of benchmark is coming, which will probably arrive in o4? o5?
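
As a back-of-the-envelope check on that iteration gain: if each retry were an independent draw with a ~20% per-attempt success rate, seven attempts would clear roughly 79%, so the observed ~50% tells you the retries are correlated but still worth a lot. A minimal sketch (my own arithmetic, not the paper's methodology):

```python
# Minimal pass@k sketch, assuming attempts are independent draws with
# per-attempt success probability p (they aren't in practice, which is
# why the observed ~50% at 7 attempts sits below the ~79% ceiling).
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

for k in (1, 3, 7):
    print(k, round(pass_at_k(0.20, k), 2))  # 1 -> 0.2, 3 -> 0.49, 7 -> 0.79
```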
SOTA Models Rarely Verify Their Own Work

- Models often assume correctness without running their code. They don't check, rarely verify, and don't seem to care (a sketch of the missing verify-before-submit loop follows this list).
- This feels like the first mainstream benchmark to highlight the issue (and thus create pressure for it to go away).
- BTW: performance only slightly improves when user feedback is introduced (e.g., screenshots of failures), so there's still work to do on interpreting vague userland feedback.
- Reflects that these models are trained on input-output paradigms rather than on validating their own work.
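
What the missing behavior could look like, as a minimal sketch: an agent that actually runs the project's tests before declaring victory and only submits once they pass. The `propose_patch` / `apply` / `revert` calls are hypothetical placeholders, not the paper's harness.

```python
import subprocess

def run_tests(repo_dir: str) -> bool:
    """Run the project's test suite; True only if it exits cleanly."""
    return subprocess.run(["pytest", "-q"], cwd=repo_dir).returncode == 0

def solve_with_verification(task, model, repo_dir: str, max_attempts: int = 7):
    """Hypothetical loop: draft a patch, run the tests, and only submit if they pass."""
    for attempt in range(1, max_attempts + 1):
        patch = model.propose_patch(task)   # placeholder model call
        patch.apply(repo_dir)               # placeholder patch application
        if run_tests(repo_dir):
            return patch                    # verified before submission
        patch.revert(repo_dir)
        task = task.with_feedback(f"tests failed on attempt {attempt}")
    return None                             # be honest about failure
```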
Adversarial Optimization in SWE-Lancer?

- ICs submit proposals, while manager agents evaluate them, hopefully enforcing not just static correctness but long-term, future-proof maintainability.
- I also think this speaks to the generator-vs-critic asymmetry: it's harder to generate perfect code than it is to critique imperfect code. IMO, that means current-gen LLMs can aid the training of the next gen (a rough sketch of that loop is below).
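
Here's roughly what I mean, as a sketch rather than anything from the paper: a critic that can only rank candidates still gives you a best-of-n selector, and the resulting (patch, score) pairs are exactly the kind of signal an RL pass could train the next generator on. `generate_patch` and `score_patch` are hypothetical placeholders.

```python
# Hypothetical generator/critic loop; generate_patch and score_patch are
# placeholder calls, not SWE-Lancer's actual harness.
def best_of_n(task, generator, critic, n: int = 5):
    """Sample n candidate patches and keep the one the critic scores highest."""
    candidates = [generator.generate_patch(task) for _ in range(n)]
    scored = [(critic.score_patch(task, patch), patch) for patch in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best_patch = scored[0]
    return best_patch, best_score  # the (patch, score) pairs double as training signal
```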
Reward Hacking and Future-Proofing

- Traditional benchmarks are currently susceptible to grader hacking; the given example is o1 changing a function name to bypass unit tests. LOL!
- SWE-Lancer pushes end-to-end testing, which is more similar to how we verify work as humans. The authors hope this is harder to hack (IMO it's not immune), but it is much slower to run. Hope you have a lot of compute. (A toy contrast is sketched after this list.)
- Will future iterations need adversarial judges?
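
To make the unit-test-vs-end-to-end distinction concrete, here's a toy contrast of my own (not from the paper); the module, routes, and test-client fixture are all hypothetical:

```python
# Toy illustration: why end-to-end checks are harder to game than narrow unit tests.

# A narrow unit test only pins one internal symbol, so an agent can "pass" by
# stubbing or special-casing apply_discount without fixing the real flow.
def test_discount_unit():
    from shop.pricing import apply_discount           # hypothetical module
    assert apply_discount(100, "SAVE10") == 90

# An end-to-end test drives the public checkout flow, so the observable result
# (what the user actually pays) has to be right, whatever the internals are named.
def test_discount_end_to_end(client):                 # e.g. a web test-client fixture
    client.post("/cart", json={"item": "widget", "price": 100})
    client.post("/cart/coupon", json={"code": "SAVE10"})
    assert client.get("/checkout").json()["total_due"] == 90
```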
Economic Considerations: AI vs. Freelancer Costs

- Each task in the benchmark has real-world monetary value!
- It will be interesting to compare freelancer payouts vs. AI compute costs. The only other place we've seen this framing is the ARC-AGI comparison of o-series compute costs against Mechanical Turk workers. (A quick sketch of the payout bookkeeping follows this list.)
- Measuring AI effectiveness in $$$ is interesting because, contractually, OpenAI only achieves AGI once it has built a system that can generate $100 billion in profits; once it does, Microsoft will no longer be able to access the most powerful models. This matters more and more as Microsoft is able to financially benefit from the models.
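
Since every task carries a real payout, the natural headline metric is dollars earned as a fraction of dollars available, which you can then set against compute spend. A minimal sketch of that bookkeeping; the task values below are hypothetical placeholders, not figures from the benchmark:

```python
# Minimal "payout earned" sketch; the task values are hypothetical placeholders.
tasks = [
    {"id": "upwork-001", "payout_usd": 250,  "solved": True},
    {"id": "upwork-002", "payout_usd": 1000, "solved": False},
    {"id": "upwork-003", "payout_usd": 500,  "solved": True},
]

earned = sum(t["payout_usd"] for t in tasks if t["solved"])
available = sum(t["payout_usd"] for t in tasks)
print(f"earned ${earned} of ${available} ({earned / available:.0%})")  # earned $750 of $1750 (43%)
```

Divide the earned figure by the compute spent per run and you get the freelancer-vs-compute comparison directly.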