GEPA: From Evals to Shipping Prompts
Reflective prompt evolution turns PMs into prompt curators—outperforming GRPO and MIPRO with a fraction of the rollouts.
Sep 29, 2025
I’ve been tracking GEPA for months, waiting for it to escape PDF land and enter code land. A few Python implementations landed relatively quickly, but I've been trying to keep things in TypeScript for now. Lars surfaced a neat TypeScript implementation by @swiecki. No more excuses, time to dive in.
My initial thought was: If evals tell us what’s broken with our agents, GEPA is the first practical tool for startups that helps you fix it.
Prompt optimizer for AI SDK (community project by @swiecki)
Below are my notes from reading the paper, running the code, and thinking through how to plug it into a startup workflow.
The TLDR: you should probably be using this if your prompts work 80%+ of the time, you've shipped them, but you don't have time to circle back and iterate.
The App Layer Has Been Running on Vibes

- GRPO was big for researchers. But app-layer teams don't have time for 24k rollouts, and with how quickly model capabilities double, there's little point. Just pick a frontier model and swap in the newer one in four months.
- Beyond that, early datasets are tiny, so vibe tests are the only dial we have before customers churn.
- GEPA aims to help: "genetic" evolution of your prompt to optimize eval scores. To me, it leans into the bitter lesson, spend compute at test time to iterate, though that framing isn't entirely accurate. What it actually does is read traces, reflect on them in plain English, propose a tighter prompt, then test it (a sketch of that loop follows this list).
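Here's a minimal TypeScript sketch of how I read that reflective-mutation step. This is my reading of the paper, not the community project's API; `runTask` and `reflect` are hypothetical hooks you'd wire to your own agent and a reflection LM.

```ts
// Sketch of one reflective-mutation step, as I understand it from the paper.
// `runTask` and `reflect` are hypothetical hooks, not a real library's API.
type Task = { input: string; expected: string };
type Trace = { task: Task; output: string; score: number };

interface Hooks {
  runTask: (prompt: string, task: Task) => Promise<Trace>;       // run your agent and score the output
  reflect: (prompt: string, traces: Trace[]) => Promise<string>; // reflection LM proposes a tighter prompt
}

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

async function evaluate(prompt: string, tasks: Task[], hooks: Hooks): Promise<Trace[]> {
  return Promise.all(tasks.map((task) => hooks.runTask(prompt, task)));
}

// Run the current prompt on a minibatch, let the reflection LM read the traces
// and rewrite the prompt in plain English, keep the rewrite only if it scores better.
async function reflectiveStep(prompt: string, minibatch: Task[], hooks: Hooks): Promise<string> {
  const traces = await evaluate(prompt, minibatch, hooks);
  const proposal = await hooks.reflect(prompt, traces);
  const proposalTraces = await evaluate(proposal, minibatch, hooks);
  return mean(proposalTraces.map((t) => t.score)) > mean(traces.map((t) => t.score))
    ? proposal
    : prompt;
}
```

The real algorithm keeps a whole pool of candidates rather than a single prompt, which is where the Pareto sampling below comes in.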
Sample Efficiency That Finally Fits Startup Budgets

- ~10% aggregate gains over GRPO with up to 35× fewer rollouts. Baselines jump ~15%—from “useless” to “ship it”.
- Roughly 2× the lift of MIPROv2 (aggregate +14% vs. +7%).
- Pareto sampling preserves multiple strong prompts per task; you're not betting the release on one lucky candidate (see the sketch after this list).
- GEPA+Merge is there when modules need to share DNA, but the core win is reflective mutation: ever-stricter, higher-signal instructions.
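To make "frontier > single winner" concrete, here's a minimal sketch under my own assumptions; the `Candidate` shape is hypothetical, not the library's. The idea: keep every candidate that is best, or tied for best, on at least one eval task.

```ts
// Pareto-style candidate pool: keep every prompt that wins at least one eval
// task, instead of collapsing the pool to one global winner.
interface Candidate {
  prompt: string;
  scores: number[]; // one score per eval task
}

function paretoFrontier(candidates: Candidate[]): Candidate[] {
  if (candidates.length === 0) return [];
  const numTasks = candidates[0].scores.length;
  const keep = new Set<Candidate>();
  for (let task = 0; task < numTasks; task++) {
    const best = Math.max(...candidates.map((c) => c.scores[task]));
    for (const c of candidates) {
      if (c.scores[task] === best) keep.add(c); // ties survive too
    }
  }
  return [...keep];
}
```

If I'm reading the paper right, the next parent is then sampled from this set, weighted by how many tasks it wins, which is what keeps the search from tunnel-visioning on one lucky prompt.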
Where The Nuance Shows Up

- One hop = table stakes. It’s the “ask ChatGPT to write the prompt” move.
- Multi‑hop is where GEPA sings. The second‑hop retrieval prompt calls out missing entities, reframes search breadth, and avoids restating known facts—stuff that’s hard to vibe-check.

- Each branch of the prompt's evolution tree encodes a lesson: "ban partial redaction," "justify abstraction," "transparent rationale only." By iteration eleven, I'd ship this.
How I’m Plugging GEPA Into My Stack

- Working on evals first. Bad evals + GEPA --> bad prompts, no matter how clever the reflection LM.
- Seed GEPA with prompts that pass my vibe tests (~80% "done") and let it grind out the stubborn 20%; a rough wiring sketch follows this list.
- Pareto frontier > “best single candidate” — diversity catches regressions.
- "Product" acts as taste curator ensuring the scores align with the product's vision.
Parting Thoughts
Evals always felt like half the story for the app layer. GEPA gives us the other half: automated, sample-efficient fixes that a startup can actually afford to run.
If your prompts work ~80% of the time, this is how you close the gap—no RL team required. Weekly GEPA runs, a quick frontier review, ship the upgrades.