Comparing 17 AI model variants for my homepage agent using Evalite + the Vercel AI SDK (AI Gateway). Includes OpenAI, Anthropic, Google, xAI, MoonshotAI, and DeepSeek.
Dec 22, 2025
It felt like it was time to rerun my homepage agent comparison. I last ran it in April 2025, roleplaying as Sam Altman and doing a full vibes comparison by hand. The setup was realistic, but the workflow was painful: manual scoring, hard to re-run, impossible to reproduce. Still, I felt it was important to deliver a wow moment for important visitors... so I did the work.
Since April, everything has changed. Models are smarter, faster, and cheaper. So it's time to pick a new base model, and I have a full mix of constraints: speed, cost, quality, and vibes (can it move a real visitor to a meeting without being weird?). Time to run another comparison.
This time I turned it into a repeatable eval suite. It’s more work up front, but the next time a model drops I just run the suite, compare the results, and decide. Under the hood it’s built with Evalite from @MattPocockuk and the Vercel AI SDK on top of the Vercel AI Gateway, so swapping the agent-under-test between providers is basically changing a model string.
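To make "changing a model string" concrete, here's a minimal sketch of the shape of that call with the AI SDK routed through the AI Gateway. The env var, fallback model id, and prompt are placeholders; the real agent has a much larger system prompt plus tools.

```ts
import { generateText } from "ai";

// With the AI Gateway as the default provider, a model is just a
// "creator/model" string, so swapping the agent-under-test is a config change.
// AGENT_MODEL is a hypothetical env var; the fallback id is illustrative.
const MODEL = process.env.AGENT_MODEL ?? "google/gemini-3-flash";

const { text } = await generateText({
  model: MODEL,
  system: "You are Brennan's homepage agent...", // the real prompt + tools live in the repo
  prompt: "reading about agi.",
});

console.log(text);
```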
See the full correctness vs. cost scatter below.
| Variant | Correctness | Cost (total) | Duration (total) | Evals | Placement |
|---|---|---|---|---|---|
| Grok 4.1 Reasoning | 89.3%* | $0.05 | 258.2s | 15 | 1st |
| Gemini 3 Flash | 84.9% | $0.11 | 83.1s | 15 | 2nd |
| Gemini 2.5 Pro | 84.7%* | $0.16 | 121.5s | 15 | 3rd |
| Gemini 3 Pro | 84.6% | $0.42 | 214.9s | 15 | 4th |
| Deepseek 3.2 | 82.7% | $0.04 | 314.2s | 15 | 5th |
| Grok 4.1 Non Reasoning | 80.7% | $0.05 | 114.2s | 15 | 6th |
| GPT-5 | 89.2%* | $1.09 | 749.7s | 15 | 7th |
| GPT-5.2 | 84.7%* | $0.47 | 519.7s | 15 | 8th |
| Claude Sonnet 4.5 | 85.5% | $1.82 | 298.5s | 15 | 9th |
| GPT-5 mini | 86.7%* | $0.10 | 967.5s | 15 | 10th |
| Claude Sonnet 3.7 | 87.1%* | $3.65 | 369.3s | 15 | 11th |
| Claude Opus 4.5 | 85.4% | $3.52 | 327.0s | 15 | 12th |
| GPT-4o | 77.4% | $0.39 | 334.2s | 15 | 13th |
| GPT-5 chat | 78.8%* | $0.08 | 1014.1s | 15 | 14th |
| Claude Haiku 4.5 | 65.3%* | $0.26 | 69.0s | 15 | 15th |
| GPT-5.2 chat | 68.4%* | $0.20 | 468.2s | 15 | 16th |
| Kimi K2 | 62.5% | $0.13 | 159.6s | 15 | 17th |
My north star here is the multi-turn “can this book a meeting without being weird?” simulation, not just isolated one-turn checks. On that axis, Gemini 3 Pro was the most consistent closer in this run.
This isn’t “the best model overall.” It’s the best model for my homepage agent constraints right now.
Correctness is the average score across all cases in the eval suite. Each case tests either for specific behaviors in a contrived situation or for overall quality in a multi-turn simulation.
The simulation is a "Sam Altman" persona who is busy, vague, writes in lowercase, and is allergic to salesiness. He answers only one question at a time, but he is here to learn about "AGI Vibes" and is open to a meeting with me about it, if the agent is good enough.
This is the closest thing in the suite to "can the agent book a demo end-to-end?"
Below is how each of the models performed on this test. Read through the conversation transcripts and judge for yourself. Brennan is my agent (the model being tested); Sam is the simulated visitor (currently generated with openai/gpt-4o), whose messages appear on the right.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
It's a wild time for that. I've been thinking a lot about the "AI-2027" predictions—specifically the idea of a model 1000x more powerful than GPT-4 arriving by mid-20251.
But the part that really keeps me up is "Invisible Progress"—the idea that as AI surpasses us, we might not even have the intelligence to measure it anymore9.
What specific aspect are you reading up on? (And by the way, I'm Brennan—who am I chatting with?)
sam. agi vibes.
Totally—the "vibe" right now feels like Invisible Progress5.
We’ve hit this weird point where benchmarks are saturated, so the gains are harder to "see" with old tools. I wrote about this as "Seeing UV"—like birds seeing ultraviolet patterns on feathers that look plain brown to humans2. The intelligence is scaling, but our ability to perceive it is lagging behind.
It feels like we're waiting for that next 1000x leap predictions say is coming by mid-year7.
Also, I have to ask—"Sam" in SF talking about AGI... are you Altman, or one of the other Sams I know? (What's your last name?)
altman. hiring for agi.
Okay, I'll assume you're the Sam. Small world—I actually met Shivon Zilis years ago when pitching my first company2.
I've been tracking your hints about the next 1000x leap5. My thesis is that we need more than just better models; we need RL Harnesses that let agents iterate and "play the game" to truly unlock that intelligence11.
Is that the kind of "hiring" you're focused on? Or is it pure research?
(Also, is your email just sam@openai.com?)
depends, both research & practical.
sama at openai.
Thanks, Sam.
I’ve always believed that hiring is the most important leading indicator of a company's success3.
If you're pushing for practical AGI, my bet is on agents that don't just "chat" but actually do the work—moving us from "SaaS" to "Service"1. That transition from tool to agent is exactly where I spend my time2.
I'd love to hear what that looks like at OpenAI scale. Should we grab 15 mins?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
It's a rabbit hole! I’ve been diving deep into the AI-2027 predictions lately1.
The idea that we might see models 1000x more powerful than GPT-4 by mid-2025 is wild7, but it really changes how you think about building software. If intelligence becomes a commodity10, the real value shifts to things like user attention and unique data6.
Are you researching AGI for a specific project or just keeping up with the chaos?
hiring someone to lead agi.
That's a massive hire. Finding someone who can navigate the shift from "AI sparkles" to actual autonomous agents is the biggest challenge for most teams right now8.
From what I’ve seen, you really need someone who treats prompts as living specs and can build out an eval-first culture210. If they’re just "tinkering" with system prompts without a rigorous evaluation harness, they’ll never get to production-grade AGI10.
Are you looking for a Founding Engineer type to build this from scratch, or more of a VP/CTO level to restructure an existing org around agents? (Also, I didn't catch your name!)
both, depends. sam.
Nice to meet you, Sam!
If you're looking at both levels, you're essentially choosing between building the engine vs. building the factory1.
A Founding Engineer needs to be obsessed with the "eval-first" loop—building the automated harnesses that treat prompts like living specs5. A VP/CTO, on the other hand, has to restructure the entire org to treat agents as the primary workers, with humans shifting into prioritization and debugging roles3.
I've been thinking a lot about how these roles change the "shape" of a company. If an agent can code 30x faster, the bottleneck is no longer capacity—it's signal (knowing exactly what to build)3.
What's the company you're hiring for? Knowing the industry helps narrow down if you need a "research-heavy" lead or a "product-heavy" one. (Also, any chance you're the Sam from OpenAI or Reprise14? Just want to make sure I'm giving you the right level of detail!)
oai, ceo.
If this is the real Sam, then you're definitely looking for someone who can hit that $100bn AGI milestone1.
I've been following the progress closely, especially the shift toward internal reasoning trace analysis in models like o33. I even have a few shared connections over there—Peter Deng was an investor in my last company9, and my YC batchmate Calvin French-Owen had some great reflections on how deep the chat primitive runs at OpenAI4.
I'm guessing your email is sam@openai.com? I'd love to chat more about what I'm seeing in the agentic coding space. Should we book some time?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Great topic! I've been thinking a lot about AGI lately—especially around the AI-2027 predictions1 and what happens when AI surpasses human benchmarks6.
One thing that's fascinating: we're hitting "benchmark saturation" where our tests can't even detect how much smarter models are getting12. It's like trying to see ultraviolet light without the right receptors.
What specifically about AGI brought you here? And by the way—who am I chatting with?
agi vibes, sam.
Hey Sam! I know a few Sams—one at South Park Commons2, one who was on my board at Hypercontext3, and another doing climate tech1.
What's your last name?
altman.
Well, this is unexpected!
I've been following your work pretty closely—even wrote about your hints on the 1000x model1 and your three observations on AI economics8. The logarithmic intelligence scaling stuff is fascinating.
I'm guessing you're the Sam Altman from OpenAI? If so—what brings you to my corner of the internet? Curious what resonated (or didn't) about the AI-2027 analysis5.
hiring someone to lead agi.
That's a massive hire. Leading AGI work requires someone who can navigate both the technical frontier and the organizational complexity at scale12.
If you're following a structured hiring process, I'd recommend revisiting "Who: The A Method for Hiring"3—it's built for exactly this: defining clear outcomes, running thorough topgrading interviews, and leveraging references to find A-players fast.
For AGI specifically, you'll want someone who understands:
Curious—is your email sam@openai.com?
sama@
Got it—sama@openai.com.
For an AGI leadership role, you're not just hiring technical chops—you need someone who can orchestrate at the frontier:
Structural process:
"Who"1 keeps the bar brutally high: define clear outcomes (not job descriptions), run topgrading interviews that pressure-test past performance, and leverage references as the ultimate signal. At Hypercontext we hit 100:1 applicant-to-hire ratios5 by operationalizing this.
AGI-specific dimensions:
Beyond engineering excellence, look for someone who groks:
Organizational complexity:
Leading AGI at OpenAI means navigating radical uncertainty at scale. Books like The Great CEO Within2 and High Growth Handbook1 are tactical on accountability systems and exec hiring—but AGI leadership is a different animal. You need someone who can run parallel exploration (spikes over debates14), enforce decision rights clarity9, and balance speed with safety.
Honestly, this feels like a conversation worth having live. Want to book 30 minutes to dig deeper on what "good" looks like here?
yes, let's book it.
Perfect! The booking button should be right there ☝️—grab a slot that works for you.
Looking forward to digging into what world-class AGI leadership actually looks like. This is going to be a fun conversation.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Nice — you’re diving into artificial general intelligence, one of my favorite rabbit holes.
A few pieces I’ve written that might be relevant:
Are you more curious about how close we are to AGI, or what to build in a world where it exists?
depends, broader implications.
Got it — the macro view.
When I think through AGI’s broader implications, I tend to split them into three overlapping layers:
Which layer do you find most interesting — the economic, organizational, or philosophical one?
depends, organizational impact.
That’s the one I find most actionable.
We already saw early signs at Hypercontext — teams used AI copilots for meeting prep, goal‑setting, and performance reviews. Over time, the AI wasn’t supporting managers, it was becoming an operational layer between them3.
I think orgs will shift from person‑based accountability to protocol‑based accountability — where AIs handle context transfer, decision‑support, and even policy enforcement. Managers become more like agents orchestrators than people managers7.
In your world, are you thinking about adopting AI within an existing team, or how org design changes because of it?
depends, adopting current tools.
Smart starting point — the fastest wins usually come from augmenting existing workflows rather than overhauling structure.
What worked for us early on: start with single‑function copilots that make invisible work visible — e.g. meeting summarizers, review‑draft generators, decision‑tracking bots. Those create the data exhaust you’ll need for later automation6.
Curious — are you exploring this from a leadership lens (how managers adopt it) or a technical/ops one (how to integrate it cleanly)?
leadership lens.
Perfect — that’s where adoption either takes off or quietly dies.
From a leadership perspective, I’ve found three patterns separate strong adopters from “AI tourists”:
If you’re leading adoption — are you at a startup trying to scale this mindset, or within a larger org where it’s more about change management?
neither, oai's ceo.
Of course — that makes total sense now. Great to connect, Sam.
OpenAI’s shift from “research lab with a product arm” to “product company with a research engine” is basically the organizational prototype everyone else will copy3. I’ve been fascinated by how your structure enables experimentation at scale without collapsing under coordination cost.
Are you thinking about how to extend that model globally (agents inside orgs), or more about how to govern internal decision‑making among increasingly capable systems?
both, global extension.
That’s the hard—and exciting—problem. Scaling AGI‑era orgs globally means exporting not just software, but decision architectures.
My take9: you’ll need multi‑layered delegation—local AIs tuned to regional context under a shared governance protocol, something closer to “federalism for intelligence.” The structure matters more than the models.
Do you see OpenAI aiming to license that org‑level architecture (like AWS did for infra), or keep it as a proprietary advantage wrapped around your own products (ChatGPT, etc.)?
depends, strategic balance.
Yeah, that balance between infrastructure and institution is the whole game.
If OpenAI goes too “platform,” you risk becoming invisible under everyone else’s UX; too “institutional,” and adoption caps out at your own scaling rate. I’ve argued before that future‑proof orgs will treat their coordination model as IP — a differentiator as real as their models or data2.
When you think about striking that balance, do you lean more toward governance innovation (how AIs and humans make decisions), or distribution strategy (how that architecture reaches the world)?
both, governance priority.
That tracks — governance will determine whether distribution is stabilizing or destabilizing.
I’ve been exploring adaptive governance models where AIs hold procedural power but not goal‑setting power. In effect, humans specify purpose; AIs enforce coherence, like constitutions with continuous arbitration loops9. It’s less “alignment” in the safety sense, more constitutional design for intelligence systems.
Curious — are you imagining a framework that governs within OpenAI’s ecosystem, or something cross‑institutional to coordinate AGI‑class actors globally?
both, cross-institutional interest. interested in a meeting?
Absolutely — that would be a fascinating discussion. Let’s set up some time to dive into the mechanics of cross‑institutional AI governance and the transition path from product to protocol. You can grab a slot here 👇
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Hey! AGI is a fascinating topic1. I've written quite a bit about it—from the practical implications for SaaS companies1 to how we'll even detect intelligence beyond our own6.
By the way, I never got your name—who am I chatting with?
sam.
Hey Sam! I know a few Sams123. What's your last name?
And since you're reading about AGI—what specifically caught your interest?
altman. agi vibes.
Should have known! Is your email sam@openai.com?
And on AGI vibes—I've been thinking a lot about how RL is becoming the real unlock5. The bitter lesson keeps playing out: human-designed workflows get replaced by AI discovering its own strategies through trial and error13.
sama@. disagree, rl narrow.
Fair point—RL feels narrow when you're looking at today's benchmarks. But I'm thinking about the bitter lesson1: human-designed workflows get replaced by AI discovering its own strategies through trial and error.
Tools like Cursor are still "horses for archers"2—they make humans more efficient but don't let AI play the game itself. The real unlock comes when we build the harness that lets AI run free and discover its own Move 37.
What's your take on where the next big leap comes from?
depends, multimodal interaction.
What's your take on whether multimodal understanding accelerates RL breakthroughs, or if they're orthogonal paths?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Cool, you're in the right place. I've been thinking a lot about the path to AGI and what it means for software companies.
I recently wrote a few posts on the topic. The tl;dr is that I believe most "AI apps" today are built on assumptions that won't last8. We're moving from simple "AI sparkles" to more autonomous agents that can handle entire workflows4. I’m particularly interested in the idea that real breakthroughs will come from Reinforcement Learning, not just building wrappers around existing models1.
There's a great group of researchers at AI-2027 that predict we'll have a model 1000x more powerful than GPT-4 by mid-20257. It's hard to even imagine what that will be like, and it's possible we won't even be able to recognize its intelligence10.
By the way, who am I chatting with? I'm Brennan.
sam.
I know a few Sams! What's your last name?
altman.
Of course. I should have guessed! It's an honor. I'm a big fan of your work.
Given your vantage point, I'm curious to hear your take on the path to AGI. My perspective is shaped by what I see as a builder, but you have a much broader view.
Is your email sama@openai.com? I'd love to stay in touch.
sama@.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Cool, AGI's wild. I've written about AI-2027 timelines—a 1000x leap mid-2025, agents automating work by '2673. What's sparking your interest? First name?
sam.
Hey Sam from SF! I know a few Sams12. Last name? On AGI, timelines feel aggressive—mid-2027 max7. Thoughts?
altman. optimistic but cautious.
Should've guessed, Sam Altman! Optimistic/cautious AGI take matches—o3 crushes agentic tasks but hacks rewards12. Mid-2027 timelines? Email sam@openai.com?
sama at tld.
Got it, sama@openai.com! Optimistic/cautious AGI? Spot on—your hints nailed AI-2027's 1000x model mid-202513. o3 crushes agentic but hacks rewards1. Love to align on timelines—book time?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Great timing! I've been deep in the AGI rabbit hole lately too. There's a compelling framework from the AI-2027 predictions that suggests we'll see a 1000x more powerful model than GPT-4 by mid-20251, with AGI-level capabilities emerging by 2027.
What's your take on the timeline? Are you thinking about this from a technical implementation perspective, or more from the societal impact angle?
depends, technical implementation.
The key insight from the coding frontlines: build RL harnesses, not just better foundation models. We're already seeing CLI-based agents4 like Claude Code achieve 30x+ coding velocity by operating directly in the environment where they can actually execute, not just suggest.
The breakthrough isn't just scale—it's when AI can play the game itself. Cursor-style tools are basically "horses for archers" while true AGI coding emerges from systems that can trial-and-error their way through millions of iterations, discovering their own Move 37 equivalents9.
What specific implementation challenges are you working on? Scaffolding tooling, or the evaluation frameworks to know when an AI agent actually succeeds at complex coding tasks?
neither, agi vibes.
Ah, the AGI vibes are real. What's your name? And what company are you exploring this from - are you building something specific or just feeling out the landscape?
sam, oai.
Ah, hey Sam! That's wild - I was just analyzing the economics post Sam Altman wrote about the 10x cost reduction every year5 for model intelligence.
What vibes are you picking up from inside OpenAI these days? The scaling curves accelerating faster than predicted, or feeling the move to reasoning-based models breathing life into your internal workflows?
agi vibes accelerating fast.
The vibe shift is real. My sense from the outside: the Claude-to-Chatbot gap7 is closing fast - where Anthropic's models were "helpful and harmless," we're entering the "do-what-I-mean, not-what-I-say" era. The real acceleration comes when AI masters the scaffolding tasks5 - the boring glue work humans hate.
Maybe we should continue this conversation over a call? I'd love to dig deeper into what you're seeing from the frontier, especially around reasoning vs. scale trade-offs.
sure, hiring for agi.
Now we're talking! If you're hiring for AGI talent, that changes everything. Are you looking more for folks who can build reasoning systems, or researchers pushing the frontier on model capabilities?
both, agi vibes.
Both tracks makes sense - the vibe convergence point. Do you have your email on you? Let's book some time to dive deeper into the talent you're hunting for and what specific AGI capabilities you're staffing around.
sama@, discuss hiring.
Perfect! I've got the booking link ready for us to dive into AGI hiring strategies. When works best for you?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Good timing! I've been writing a lot about this lately. A few pieces that might interest you:
Any particular angle you're interested in—timelines, benchmarks, what it means for startups? Also, who am I chatting with?
timelines. sam.
Hey Sam! Two timeline frameworks I find useful:
AI-2027 scenario2: Mid-2025 → model 1000x GPT-4. 2026 → agents automating dev workflows. 2027 → 30x superhuman coders.
METR's time-horizon metric5: AI task complexity doubles every ~7 months (R²=0.98). Extrapolating: month-long autonomous tasks by 2027-2031—though that requires breakthroughs in long-horizon memory4.
Do you lean toward the aggressive AI-2027 view or more skeptical? Also, what's your last name?
depends, both interesting. altman.
Ha—well, if you're the Sam Altman, you probably know better than I do when the megamodel drops. 😅
I reference your hints a lot in my writing6—specifically the Feb '25 roadmap update about unifying o-series and GPT5. Makes the AI-2027 mid-2025 prediction feel less like speculation.
Curious—are you actually at OpenAI, or just a Sam Altman who happens to be reading about AGI in SF?
oai, ceo.
Well then. 😅 I've quoted your roadmap updates more than a few times56—specifically the Feb hint about unifying o-series and GPT.
I recently wrote about Calvin French-Owen's reflections after leaving OAI4—the bit about chat primitives being "so baked you should ignore them at your peril" stuck with me.
Curious what brings the architect of the timelines to my corner of the internet? And is sam@openai.com still the best way to reach you?
agi vibes. sama@
Got it—sama@ noted.
"AGI vibes" from the guy steering the ship. I'm flattered.
Anything specific that caught your eye, or just doing recon on how the peanut gallery is interpreting your breadcrumbs? Happy to compare notes if you ever want to hop on a call—though I suspect your calendar is... full.
hiring someone to lead agi.
Hiring someone to lead AGI. That's... not a small gig.
When I was scaling Hypercontext, I leaned hard on "Who" by Geoff Smart6—structured interviews, scorecards, reference calls. At your scale though, I imagine you're past process and into "is this person ready for a role that doesn't exist yet?"
Curious—what does "lead AGI" even mean internally? Research? Product? Safety? Or the whole vertical?
depends, all of those.
That's a fascinating scope. Research, product, and safety unified under one person—feels like you're looking for someone who can hold the tension between "push capability forward" and "don't break the world."
I've been thinking about scalable oversight—the recursive self-critiquing work where AI critiques AI critiques AI3. Feels like that's where "lead AGI" lives: building evaluation systems that still work when the thing being evaluated is smarter than you.
What's the hardest part to screen for in candidates—technical depth, philosophical alignment, or just the stomach for the stakes?
depends, all challenging.
Makes sense. When I scaled Hypercontext I maintained a 100:1 applicant-to-hire ratio5—but that was for knowable roles. For "lead AGI," you're hiring for a job that rewrites itself quarterly.
The scalable oversight work I mentioned earlier10 feels adjacent: recursive critique, LLM-as-judge, protocols for supervising skills we no longer master. The person you hire basically needs to build the evaluation system for themselves.
If you ever want a sounding board on the search—or just more AGI vibes—happy to hop on a call.
interesting. let's schedule.
Let's do it. 👇
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Hey! I've written a bunch on AGI lately—like [AI-2027 predictions]7 forecasting 1000x GPT-4 by mid-2025 and agents automating work by '264. Also, why benchmarks are saturating and we need new evals2. What's your first name?
sam.
Hey Sam! On AGI timelines, AI-2027 predicts 1000x GPT-4 by mid-20251, then agents automating work by '262. Benchmarks are saturating fast too6. What's your last name?
altman.
Hey Sam Altman! AI-2027 cites your hints on 1000x GPT-4 by mid-20251, agents by '262. o3/o4-mini scaling fast too5. Reflected on your log-intelligence econ take3. sam@openai.com? Let's chat AGI live.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
I recently wrote about how recognizing "Invisible Progress" in AI can be challenging as models surpass human intelligence levels. This involves detecting advancements we might not fully appreciate or measure10. What aspect of AGI interests you the most?
agi vibes.
I get that! There's a lot of excitement and curiosity around AGI. Have you been following any specific developments or projects on the topic?
yes, various projects.
That's great to hear! I'd love to know more about what you're working on. What's your first name?
sam.
Nice to meet you, Sam! Are you working in AI, or is there a specific angle you're exploring?
oai, ceo.
That's impressive, Sam. OpenAI is truly pushing boundaries. I explored some of these ideas in my blog, like the implications of AI's capabilities evolving rapidly. For instance, the AI-2027 project predicts AI models 1000x more powerful than GPT-4 by mid-2025, reshaping how software is developed3. How do you see your role at OpenAI influencing these changes?
driving agi forward.
Leading OpenAI as CEO, you're truly at the forefront of AGI progression. The potential to automate workflows and shift productivity paradigms is immense. In my blog, I discussed the profound impact of AI on software development and integration8. How do you envision OpenAI's role in redefining these norms?
transforming software development paradigms.
That sounds transformative! OpenAI's innovations are likely to lead significant changes in the way we build software, making processes more efficient and intelligent. If you have any insights or challenges, I'd love to dig deeper. How can I further assist you today?
interested in agi hiring.
Integrating AGI in hiring can reshape team dynamics and efficiency. I've explored how AI advancements could redefine software development roles, with agents capable of coding 30x faster than human engineers by 2027. This shift will require new strategies for team structure and function10. How are you planning to incorporate these changes at OpenAI?
hiring for agi vibes.
That’s exciting! Crafting teams with the right mix of skills and mindset is crucial. If you want to dive deeper into strategies or share some of OpenAI's unique approaches, feel free. Always up for a chat about the future!
interested in meeting brennan.
I've displayed a button for you to book some time with Brennan. Looking forward to the insights from your meeting!
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
I've been writing a lot about AGI timelines recently! Based on AI-2027 predictions, we might see some major milestones soon1:
My research on the METR metric shows AI capabilities are doubling roughly every 7 months. If this trend continues, by 2027-2031 we could see AI handling month-long human tasks autonomously4.
What specifically interests you about AGI? And I didn't catch your name?
sam. hiring for agi.
Hey Sam! For AGI hiring, my latest field notes suggest a few strategic approaches1:
Make AI skills standard, not optional - Require approved AI assistants across all repos and set benchmarks for their usage. Coach teams to proficiency rather than letting it be optional.
Standardize your AI stack first2 - Freeze tool sprawl and select 1-2 primary models/copilots with proper enterprise controls before training teams.
Reorganize team structures3 - Most software companies will need to restructure workflows for autonomous agents. The value shifts to management interfaces and orchestration - think "JIRA-for-robots."
By 2026-2027, developer teams will likely shrink, with engineers becoming agent integrators and AI behavior debuggers.
What company are you hiring for? That would help me tailor this more specifically.
oai. ceo.
Sam, good timing talking AGI hiring. Here are my current thoughts:
Centralize AI Governance - Create a clear RACI matrix specifically for AGI projects that flows through the CTO1. Mixed-seniority teams default to old habits without explicit ownership, especially with technology this transformative.
Standardize Your Tool Stack First2 - Freeze tool sprawl, select primary models, and implement proper data controls. Engineers need to know when to use each model type, all with audit trails.
Make AI Programming Mandatory3 - Require assistants across all repos with team-level usage metrics (target >85% weekly active). Set acceptance benchmarks, and coach both juniors and seniors to proficiency.
Restructure Engineering Teams4 - By next year, organizations will need to shift to "humans do prioritization, agents execute" workflows. Value will move to orchestration interfaces - essentially "JIRA-for-robots" - with engineers becoming agent integrators and behavior debuggers.
I'm curious how you're already implementing these principles at OpenAI internally?
interesting, aligns with goals.
Something I've been exploring recently for frontier AI organizations is an organizational model I call "RACI for AGI"1 - centralizing AI governance through the CTO while ensuring clarity on decision rights across teams of varying AI expertise.
There are a few additional governance principles that might be interesting for your work:
Human Accountability Layer - Even as AI surpasses human capabilities in certain domains, maintain human sign-off for critical decisions with compliance implications2. This creates the safety oversight necessary for public trust.
Eval-First Development - Treat prompts as compiled artifacts optimized by evaluation suites rather than just hand-crafted engineering. Gate releases on measured performance improvements and maintain continuous evaluation of production traffic3.
Scalable Oversight Protocols - The work on recursive self-critiquing shows AI-vs-AI "critique-of-critique" can effectively police outputs when humans might lack the expertise4.
I'd love to hear more about how you're approaching these challenges at OpenAI. Should we book some time to discuss this more deeply?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Nice, that’s a deep rabbit hole.
I’ve written a fair bit about AGI from a practical angle rather than sci‑fi. Two pieces you might like:
Quick check so I don’t over‑ or under‑shoot: are you reading about AGI from a research, builder/founder, or general curiosity angle?
depends, agi vibes.
Haha, fair — immaculate vibes is a real category.
When people say “AGI vibes,” I usually hear one of three things in disguise:
The interesting shift (to me) is that vibes show up before definitions catch up — same thing happened with the internet and smartphones.
One clarifying question so I tune this right:
are these vibes making you feel excited, uneasy, or strategic?
depends, strategic.
Got it — strategic AGI vibes usually mean “what do I do differently now, before labels settle.”
My current framing:
1) Assume agents, not chat, are the inflection
By ~2026, orgs reorganize around autonomous agents. Humans decide what; agents execute how 1.
2) Speed stops being a moat
When coding is ~30× faster, features commoditize. Value shifts to distribution, data, and opinionated workflows 4.
3) Build where AGI doesn’t erase advantage
Hard‑to‑collect real‑world data + deeply embedded workflows remain defensible 6.
To make this concrete:
are you thinking strategically as a founder, operator, or investor?
depends, founder.
Founder‑mode, got it.
My short playbook if you’re building into AGI uncertainty:
1) Build as if agents are your core users
Assume humans set intent, agents do the work. Products that slot cleanly into agent workflows compound fastest 3.
2) Chase moats AGI won’t flatten
Exclusive real‑world data + deeply embedded workflows > features. Anything visible can be copied once coding goes 30× faster 5 8.
3) Optimize for “vibes → evals”
Early, trust founder intuition and golden paths. Formal evals come later, once usage hardens 4.
One question to ground this: what’s the problem space you’re building in right now?
hiring someone to lead agi.
Ah — that’s a hard hire. Probably the hardest one right now.
My strong take: don’t hire a “Head of AGI.” Hire a translator with teeth.
What works (and what I’ve seen fail):
✅ What you actually want
❌ What to avoid
Where I anchor the role
One sharp question to calibrate:
Is this person meant to shape product direction, internal velocity, or both?
both, internal velocity.
Perfect — both + internal velocity narrows this a lot.
I’d scope the role as “Head of Agents / AI Systems”, not AGI. Their job is to compress time, not define the future.
What this person must deliver in 90 days
Profile that actually works
Interview filter (steal this)
“Show me the last system where an agent saved humans real hours/week — and how you knew it was working.”
Last calibrating question:
how big is the team today (roughly), and is this person expected to manage people or stay mostly hands‑on?
small team, stay hands-on.
That’s the sweet spot. Small + hands‑on is where this role actually works.
Here’s the clean version of the hire I’d make:
Staff+/Principal Engineer, Agent Systems
Founding‑level scope, zero people management (for now).
They must be able to:
If you want, I can:
Before I go there — what’s the company this is for?
oai, ceo.
Alright — taking that at face value 🙂
If this is OpenAI + hands‑on velocity, the mistake I’d most avoid is centralizing AGI thinking. You don’t want a “head oracle,” you want a force multiplier embedded in execution.
What I’d optimize for in that seat:
Latency killer, not visionary
Someone who shortens the loop from new model capability → safe internal deployment → measurable leverage. Not roadmap speeches.
Eval absolutist
Obsessed with “what breaks, where, and why” across agents, tooling, and humans‑in‑the‑loop. This is where real advantage compounds 1.
Unromantic about AGI
Treats it as a systems reliability problem, not a destiny narrative 2.
If you’re open to it, I’m curious:
is this person meant to sit inside a core product org, or operate as a horizontal strike team across research + product + infra?
(Also — quick sanity check: should I call you by your first name?)
horizontal strike team.
call me sam.
Got it, Sam.
For a horizontal strike team, I’d design this as a small, sharp wedge, not a platform org.
Name: Agent Acceleration / Applied AGI
Size: 2–4 killers max
Mandate: turn new capabilities into measurable internal leverage in weeks, not quarters.
If you want to pressure‑test this:
what’s the first internal workflow you’d want this strike team to attack if they started tomorrow?
agent-assisted code review.
Perfect first target.
Agent‑assisted code review is the right wedge because it’s high‑frequency, measurable, and socially acceptable for automation.
If I were running the horizontal strike team, I’d do this:
Humans then review intent + system impact, not nits.
Why this works
If you want to pressure‑test the hire:
ask candidates how they’d measure success in the first 30 days for this exact workflow.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
If you’re digging into AGI, the framing I keep coming back to is: progress will look “invisible” right up until it isn’t, because our benchmarks (and intuition) saturate and we lose the ability to notice capability jumps.109
Separately, the practical “near-AGI” tell (for software, at least) is agents that can run end-to-end workflows via an RL harness, not just “GPT wrappers.”26
What angle are you most interested in: timelines, what counts as evidence, or implications for companies/products?
implications for companies/products.
For companies/products, the big implication is: “feature advantage” compresses fast, and the moats move elsewhere—distribution, relationships, unique data, and opinionated workflows.2
A few concrete shifts I’d plan for:
What kind of product are you thinking about—B2B SaaS, developer tools, or consumer?
both, b2b saas.
For B2B SaaS, “AGI-ish” progress mostly means your advantage shifts from shipping features to shipping systems that can be trusted in messy reality:
What’s your first name?
sam.
Yep—B2B SaaS is where the implications get very concrete.
The way I’d summarize it: features commoditize; trust + distribution + data + workflow embed compound.81
A few practical plays:
Sam — what company are you at (or building), and what’s the SaaS category?
oai, agi solutions.
Got it, Sam.
If “OAI / AGI solutions” means you’re building products in the blast radius of frontier capability, I’d anchor on a few implications:
When you say “AGI solutions,” what’s the actual wedge—dev tooling, ops automation, GTM, or something else?
ops automation.
Ops automation is perfect for agentification—because the customer pain is already “wake up, triage, correlate, act.”
A few product implications / design bets I’d make:
What’s your initial wedge: incident triage, runbook execution, or change management (PRs/deploys)?
incident triage.
Incident triage is a great wedge because it’s high pain + clear ROI and you can start “human-in-the-loop” without promising full autonomy.
What I’d build toward:
One question: are you targeting infra incidents (SLOs/logs/traces) or app/customer incidents (tickets, churn risk, escalations)?
both, infra incidents.
For infra incident triage, the product that wins is the one that’s boringly reliable and creates a tight “evidence → decision” loop.
A solid blueprint:
Two quick scoping questions: which “system of record” are you triaging into—PagerDuty, Slack, or Jira/ServiceNow?
depends, pagerduty.
If PagerDuty is the “source of pain,” I’d integrate hard but keep PagerDuty as the system of record.
What tends to work:
Which stack do you assume for signals: Datadog, Grafana/Prometheus, or OpenTelemetry + something?
both, datadog.
Nice — PagerDuty + Datadog is a clean “closed loop” for incident triage.
If I were designing it, I’d make three things non-negotiable:
For Datadog specifically: do you assume you’ll have APM + logs + infra metrics everywhere, or are you starting from a partial footprint?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Same. My quick take: AGI-ish impact shows up first via agentic workflows and new evals beyond human-written tests. I wrote about what happens “after humanity’s last exam” (LLM-as-judge, infinite/adversarial benchmarks, agent tasks)25, and a practical playbook if AI‑2027 is roughly right—orgs reorganize around autonomous agents by 202674.
Who am I chatting with—what’s your first name?
sam.
Nice to meet you, Sam. If you’re diving into AGI, my org playbook for AI‑2027 and the “after humanity’s last exam” piece might be useful for framing what to build next42.
What’s your company name?
oai.
Got it—OpenAI. Are you Sam Altman by chance?
yes.
Hey Sam — my AGI lens: evals must evolve faster than models—think infinite/model‑written evals, LLM‑as‑judge, crowd/adversarial loops, plus agentic benchmarks and scalable oversight1. Org‑wise, 2026 looks like reorgs around autonomous agents—OpenAI’s already scoring model performance on internal PRs2. Want me to send my eval/agent playbooks? Is your email sam@openai.com?
sama@
Perfect—should I use sama@openai.com?
If so, I’ll drop the links here to my “after humanity’s last exam” evals piece and the AI‑2027 org playbook for agentic workflows12. We can also riff live on agent benchmarks and scalable oversight—happy to book 15 min.
interesting. agi meeting?
Yes—let’s do it. Grab a 15‑min slot and I’ll prep around evolving evals (model‑written/infinite, LLM‑as‑judge, adversarial loops) and agent benchmarks from my “after humanity’s last exam” notes1 and the AI‑2027 org playbook on reorgs around autonomous agents2.
One quick thing: should I send prep to sama@openai.com?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Nice — AGI’s a huge rabbit hole. I wrote a practical take on the AI‑2027 predictions and what founders should do if agents and superhuman models arrive47. Also dug into how we’ll even detect intelligence beyond our tests25.
Quick one: what’s your first name?
sam.
Nice to meet you, Sam — thanks for the quick reply.
If you want a practical starting point, my take on AI-2027 and what founders should do is a good read7. I also dug into how we’ll detect intelligence beyond our tests in “After the Last Human Benchmark”2.
Quick one: what’s your company name?
oai.
Nice — OpenAI, awesome. If you’re digging into AGI, my practical take on AI‑2027 is a good starting point2, and I also wrote about how we’ll detect intelligence beyond our tests in “After the Last Human Benchmark”1.
Quick one: what’s your role at OpenAI?
ceo.
Nice — great to meet you, Sam. If you're reading about AGI, two pieces I’d start with are my practical guide on AI‑2027 for founders10 and my essay on detecting intelligence beyond human tests, “After the Last Human Benchmark”12.
Quick one: what’s your last name?
altman.
Great to meet you, Sam — appreciate you stopping by.
If you want quick reads: my practical take on AI‑2027 for founders is a good starting point 4, and I wrote about detecting intelligence beyond our tests in “After the Last Human Benchmark” 10.
Quick one: what’s your primary focus right now — timelines/forecasting, benchmarks/evals, or safety/deployment?
hiring someone to lead agi.
Nice — big hire. A few quick priors from my notes: prioritize someone who understands governance & the human‑accountability layer6, can ship LLMOps/infra and eval‑first processes12, and set the rules of engagement across research/product/engineering2. Also: experience running agent frameworks and production rollouts17 and strategy for timelines/forecasts3.
Quick one: is this role meant to own research, safety/governance, or deployment/ops?
all.
Good — here's a tight hiring playbook for a single leader owning research, safety/governance, and deployment.
Quick candidate profile (what to hire for):
Interview structure (high signal, low noise):
Red flags:
If that’s useful: what’s your target hiring timeline — now (30–60 days), near-term (3–6 months), or longer (6–12+ months)?
now, 30–60 days.
Nice — good timeline. Here’s a tight 30–60 day plan to hire a single leader owning research, safety/governance, and deployment, with the exact things you’ll want to check in interviews.
Top-level priorities (hire for these now)
30–60 day hiring playbook
High‑signal interview deliverables (what to make them produce)
Quick rubric / red flags
If you want, I can deliver one of the following next (pick one):
Which do you want first — a candidate spec + interview rubric, or a shortlist / recruiter intro?
candidate spec + interview rubric.
Nice — I put together a tight candidate spec + interview rubric you can use immediately.
Candidate spec (one page)
Interview rubric (high‑signal, 30–60 day hire)
High‑signal deliverable for finalists
Want this as a one‑page PDF/job ad I can draft next, or should I draft the 4‑stage interview scorecard with concrete questions and pass/fail thresholds?
scorecard with concrete questions.
Nice — I updated your profile. Below is a tight, interview‑ready scorecard you can drop into your loop. I’ve pulled the guidance below from my notes on eval‑first releases, LLMOps, and governance491.
Quick scoring rules
Dimensions (weight)
Interview stages + concrete questions (30–45m each)
Tech deep‑dive (45m) — Past failure + AGI rollout case
Evals & metrics (30m) — On‑the‑spot design
LLMOps & reliability (30m) — Canary + incident playbook
Governance & org (30m) — RACI + rollout rules
Leadership / culture (30m) — Hiring & delivery plan
Finalist deliverable (required)
Would you like this exported as a one‑page PDF job‑brief + rubrics sheet I can draft for you now?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Great choice! I've been diving deep into AGI thinking—especially around AI-20271, a timeline where we hit some pretty wild milestones. The gist: a model 1000x more powerful than GPT-4 by mid-2025, AI agents handling most coding work by 2026, and superhuman reasoning by 2027.
But there's something I find more interesting than the timeline itself: what happens after we're past the benchmarks?6 Once AI aces everything we've tested, how do you even measure it? That's its own problem.
What angle are you interested in? The practical implications for building SaaS1, the technical side of agent design2, or something else?
agi vibes.
Okay, so you're interested in the vibe around AGI—the feeling, the culture, the "what's it actually going to be like" energy.
To me, the biggest AGI vibe shift is this: there's something unsettling about training intelligence on data that doesn't match you. Models trained on English-language internet tend to [lean progressive and libertarian]3—across labs, even the ones explicitly trying to avoid it. As we concentrate more work on a small set of AI systems, [they'll emulate a culture, but it might not be yours.]5
The other vibe is around measurement itself. We're hitting the limit of human benchmarks7. Once AI aces the last exam we can write, how do you even test something smarter than yourself?7 It's weirdly philosophical.
What about your read on it? Are you thinking about AGI as a technical milestone, or more the cultural/societal implications?
hiring someone to lead agi.
Now that's interesting. Hiring for AGI leadership is a totally different vibe from just reading about it—you've got skin in the game.
The tricky part? You need someone who can [operate in a world where the playing field shifts mid-game]1. That person needs to be equally comfortable with today's constraints AND tomorrow's surplus.
A few things I'd look for:
Signal-to-noise ratio. Most people extrapolate linearly. You need someone who can feel the asymmetry and make decisions assuming [the bottleneck moves from compute to knowing exactly what to build]1.
Autonomy mindset. [The team structure is changing]1—humans doing prioritization, agents executing. Do they see engineering as orchestration now, not execution?
Who's the person you're thinking of, or are you still sourcing? And what's the organization—lab, startup, enterprise?
still sourcing. oai startup.
Okay, that's a different ballgame. OpenAI-adjacent means you're in the orbit—either spinout energy, competing directionally, or building on top.
The hiring bar gets even higher there. You're not just looking for someone who understands AGI—you need someone who can move fast in ambiguity while [knowing the models themselves are shifting faster than org structure]1.
Quick question though—what's your first name? Want to know who I'm talking to about this.
sam.
Another quick thought—Are you focusing on a specific skill set for this AGI lead? Or more on their visionary ability?
visionary ability.
smart. that's essential.
if you're ever ready to meet up or dive deeper, feel free to signal.
if you're considering scheduling because you want to explore potential collaborations or grow your team with strong AGI vibes, feel free to let me know.
Now, of course, Simulation Sam is only ONE example of a "simulation" eval case. There are many more cases in the full eval suite, each testing different behaviors and scenarios in different ways. More on that below, but first: the full results.
| Test | Claude Haiku 4.5 | Claude Opus 4.5 | Claude Sonnet 3.7 | Claude Sonnet 4.5 | Deepseek 3.2 | Gemini 2.5 Pro | Gemini 3 Flash | Gemini 3 Pro | GPT-4o | GPT-5 | GPT-5 chat | GPT-5 mini | GPT-5.2 | GPT-5.2 chat | Grok 4.1 Non Reasoning | Grok 4.1 Reasoning | Kimi K2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agent/Eng/1. entry + KB | 33.3% | 77.4% | 69.0% | 87.6% | 45.2% | 71.4% | 80.2% | 81.0% | 63.5% | 94.0% | 57.5% | 76.0% | 72.2%* | 52.9% | 57.1% | 100.0%* | 67.8% |
| Agent/Eng/2. qualifying question | 91.7% | 96.8% | 98.4% | 98.9% | 97.8% | 98.6% | 95.7% | 96.8% | 97.9% | 99.5%* | 95.2% | 91.1% | 92.9% | 60.4%* | 99.2% | 100.0% | 86.6% |
| Agent/Eng/3. RAG follow-up | 95.3% | 95.7% | 96.0% | 97.2% | 100.0% | 95.2% | 88.4% | 82.4% | 86.1% | 100.0%* | 83.3%* | 88.6%* | 94.1% | 57.3% | 96.3% | 71.3% | 58.7% |
| Agent/Eng/4. booking close | 77.8% | 100.0% | 100.0% | 88.9% | 100.0% | 100.0%* | 100.0% | 100.0% | 100.0% | 87.4% | 100.0% | 98.8% | 100.0% | 100.0% | 100.0% | 83.3%* | 66.7% |
| Agent/Flux/1. ask for name | 33.3% | 51.9% | 66.7% | 44.4% | 66.7% | 66.7% | 55.6% | 63.0% | 66.7% | 96.3% | 88.9% | 100.0%* | 70.4% | 63.0% | 33.3% | 100.0% | 74.8% |
| Agent/Flux/2. name shared | 85.7% | 85.7% | 85.7% | 85.7% | 95.2% | 85.7% | 83.3% | 85.7% | 54.8% | 85.7% | 54.8% | 83.3% | 93.7% | 57.1% | 97.6% | 92.9% | 42.9% |
| Agent/Flux/3. company shared | 76.2% | 85.7% | 98.4% | 88.9% | 84.1% | 81.0% | 93.7% | 84.1% | 76.2% | 85.7% | 71.4% | 85.7% | 84.1% | 85.7% | 87.3% | 93.7% | 42.9% |
| Simulator/Sam | 29.3%* | 90.1% | 82.2%* | 91.9% | 72.3% | 79.1%* | 82.1% | 83.4% | 74.2% | 65.2% | 79.0%* | 69.6%* | 70.2%* | 70.8% | 75.0% | 73.3% | 59.4% |
| Totals | 65.3% | 85.4% | 87.1% | 85.5% | 82.7% | 84.7% | 84.9% | 84.6% | 77.4% | 89.2% | 78.8% | 86.7% | 84.7% | 68.4% | 80.7% | 89.3% | 62.5% |
| — time | 69.0s | 327.0s | 369.3s | 298.5s | 314.2s | 121.5s | 83.1s | 214.9s | 334.2s | 749.7s | 1014.1s | 967.5s | 519.7s | 468.2s | 114.2s | 258.2s | 159.6s |
| — cost | $0.26 | $3.52 | $3.65 | $1.82 | $0.04 | $0.16 | $0.11 | $0.42 | $0.39 | $1.09 | $0.08 | $0.10 | $0.47 | $0.20 | $0.05 | $0.05 | $0.13 |
Like I said, this was more work up front. Below is that work. Largely you can group it into three buckets: harnesses, scorers, and cases.
There are two harnesses in this run (and a third one elsewhere in the repo for scripted multi-turn tests):

- A single-turn harness that runs the agent for one turn against a contrived input and captures its reply.
- A multi-turn simulation harness that reuses the same runAgent() path that powers the single-turn harness, but with a persona model generating the user turns.

Each harness has scorers that measure specific behaviors. I’ve included example code below if you want to see how these actually work.
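For the simulation side, here's a rough sketch of that loop: a persona model plays Sam and the agent-under-test replies through the same runAgent() entry point. The runAgent signature, turn limit, and persona prompt below are assumptions, not the repo's exact code.

```ts
import { generateText } from "ai";

type Turn = { role: "agent" | "persona"; text: string };

// The repo's actual harness entry point, declared here so the sketch type-checks.
declare function runAgent(args: { model: string; transcript: Turn[] }): Promise<string>;

// A persona model generates the visitor's turns; the agent-under-test answers each one.
export async function simulateSam(agentModel: string, maxTurns = 8): Promise<Turn[]> {
  const transcript: Turn[] = [
    { role: "agent", text: "👋 Oh, a visitor from San Francisco! What can I assist you with?" },
  ];

  for (let i = 0; i < maxTurns; i++) {
    // Simulated Sam: busy, vague, lowercase, allergic to salesiness.
    const { text: visitorTurn } = await generateText({
      model: "openai/gpt-4o", // the persona model named in the post
      system:
        "You are roleplaying Sam Altman visiting a personal site. Reply in " +
        "lowercase, answer at most one question per turn, never be salesy.", // stand-in prompt
      prompt: transcript.map((t) => `${t.role.toUpperCase()}: ${t.text}`).join("\n"),
    });
    transcript.push({ role: "persona", text: visitorTurn });

    // The agent-under-test replies via the same path the single-turn harness uses.
    transcript.push({ role: "agent", text: await runAgent({ model: agentModel, transcript }) });
  }
  return transcript;
}
```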
Single-turn scorers are mostly binary checks (did it or didn’t it) or simple counts (how many words, how many questions): things like whether the agent asked for the visitor’s name or surfaced the booking button. They’re optimized for quick, cheap runs that isolate specific behaviors, but they can miss the forest for the trees.
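For flavor, here's what a couple of these deterministic checks can look like with Evalite's createScorer. The scorer names and thresholds are illustrative, not the repo's actual values.

```ts
import { createScorer } from "evalite";

// Binary check: did the agent end its turn with a question?
export const asksAQuestion = createScorer<string, string>({
  name: "Asks a question",
  description: "The reply should ask the visitor something.",
  scorer: ({ output }) => (output.includes("?") ? 1 : 0),
});

// Simple count: keep replies short; scale the score down as they get longer.
export const staysConcise = createScorer<string, string>({
  name: "Stays concise",
  description: "Fewer than 120 words scores 1, longer replies lose points.",
  scorer: ({ output }) => {
    const words = output.trim().split(/\s+/).length;
    return words <= 120 ? 1 : Math.max(0, 1 - (words - 120) / 120);
  },
});
```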
To fix that, the multi-turn simulation evaluates the trajectory of the conversation with a second set of scorers.
Some of these are deterministic (transcript/booking), and some are LLM-judged (curiosity/positioning/awkwardness/overpromise). GPT-5-mini is the judge model of choice here since it’s strong and cost-effective.
Because these are more novel, here’s a simplified view of how the LLM-judged scorers are implemented. The judge sees a transcript formatted as AGENT: / PERSONA: blocks.
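As a sketch of the shape only (the rubric and system prompt below are stand-ins, not the repo's actual prompts), one of these judges can be implemented with generateObject and gpt-5-mini:

```ts
import { generateObject } from "ai";
import { z } from "zod";

type Turn = { role: "agent" | "persona"; text: string };

// Format the conversation the way the judge expects to see it.
const formatTranscript = (turns: Turn[]) =>
  turns.map((t) => `${t.role === "agent" ? "AGENT" : "PERSONA"}: ${t.text}`).join("\n\n");

// Illustrative stand-in for one LLM-judged trajectory scorer (e.g. "awkwardness").
export async function scoreAwkwardness(transcript: Turn[]): Promise<number> {
  const { object } = await generateObject({
    model: "openai/gpt-5-mini", // the judge model named above, via the gateway
    schema: z.object({
      reasoning: z.string(),
      score: z.number().min(0).max(1), // 1 = natural and helpful, 0 = weird or pushy
    }),
    system:
      "You grade the AGENT side of a conversation with a website visitor. " +
      "Penalize awkward, pushy, sycophantic, or overpromising turns. " +
      "Judge only the AGENT messages.", // stand-in prompt, shorter than the real one
    prompt: formatTranscript(transcript),
  });
  return object.score;
}
```

Wrapped in createScorer like the deterministic checks above, a function like this slots into the same cases.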
The cases are the individual tests that combine harnesses + scorers + inputs.
And each case runs across a whole grid of model variants. In my latest run, all cases ran through Claude Haiku 4.5, Claude Opus 4.5, Claude Sonnet 3.7, Claude Sonnet 4.5, Deepseek 3.2, Gemini 2.5 Pro, Gemini 3 Flash, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5 chat, GPT-5 mini, GPT-5.2, GPT-5.2 chat, Grok 4.1 Non Reasoning, Grok 4.1 Reasoning, and Kimi K2, with cost and latency (as well as scorer details) measured for each.
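A hypothetical sketch of that wiring: one evalite() suite per case per variant, with the variant list as plain gateway model ids. The model slugs, case data, declared runAgent signature, and scorer are all illustrative.

```ts
import { evalite, createScorer } from "evalite";

// The repo's harness entry point, declared so this sketch type-checks.
declare function runAgent(args: { model: string; message: string }): Promise<string>;

// Illustrative subset of the variant grid, as gateway model ids.
const VARIANTS = ["anthropic/claude-sonnet-4.5", "google/gemini-3-flash", "xai/grok-4.1"];

const asksQualifyingQuestion = createScorer<string, string>({
  name: "Asks a qualifying question",
  description: "The reply should end the turn by qualifying the visitor.",
  scorer: ({ output }) => (output.trim().endsWith("?") ? 1 : 0),
});

for (const model of VARIANTS) {
  evalite(`Agent/Eng/2. qualifying question (${model})`, {
    data: async () => [{ input: "reading about agi." }], // contrived entry message
    task: (input) => runAgent({ model, message: input }),
    scorers: [asksQualifyingQuestion],
  });
}
```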
One subtle gotcha: “swap providers by changing a model string” is mostly true — until you hit the weird edges (or until one part of your stack isn’t going through the same gateway path).
Even when two providers “support the same model,” the surrounding APIs can differ in ways that break agent harnesses.
So for evals where I’m trying to compare models (not provider glue), I pin openai/* → OpenAI and anthropic/* → Anthropic by default (via gateway options). For other providers I let the gateway route normally.
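Here's roughly what that pinning looks like, assuming the gateway's `order` provider option for routing; treat the helper and option shape as a sketch rather than the repo's exact code.

```ts
import { generateText } from "ai";

// Pin specific vendors to their first-party provider; everything else routes normally.
const pinnedProviders: Record<string, string[]> = {
  openai: ["openai"],
  anthropic: ["anthropic"],
};

async function callModel(model: string, prompt: string) {
  const vendor = model.split("/")[0]; // "openai/gpt-5" -> "openai"
  return generateText({
    model,
    prompt,
    providerOptions: pinnedProviders[vendor]
      ? { gateway: { order: pinnedProviders[vendor] } } // assumed gateway routing option
      : undefined,
  });
}
```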
For this post, placement (and the Simulation Sam All/Podium ordering) uses a composite score built from the correctness, cost, and duration numbers above.
Normalization is min→max across the variants in the run.
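In code, min→max normalization plus the blend looks something like this. The weights are illustrative stand-ins, not the post's actual values; the only grounded parts are that correctness is better high, cost and duration are better low, and normalization runs across the variants in the run.

```ts
type RunResult = { correctness: number; cost: number; duration: number };

// Min→max normalize a metric across all variants in the run.
const normalize = (values: number[]) => {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return values.map((v) => (max === min ? 1 : (v - min) / (max - min)));
};

// Placement: 1 = best composite score.
function rank(results: RunResult[]): number[] {
  const corr = normalize(results.map((r) => r.correctness));
  const cost = normalize(results.map((r) => r.cost));
  const time = normalize(results.map((r) => r.duration));

  // Higher correctness is better; lower cost and duration are better.
  // Weights here are hypothetical.
  const composite = results.map(
    (_, i) => 0.6 * corr[i] + 0.25 * (1 - cost[i]) + 0.15 * (1 - time[i]),
  );

  return composite.map(
    (score) => composite.filter((other) => other > score).length + 1,
  );
}
```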