Comparing 17 AI model variants for my homepage agent using Evalite + the Vercel AI SDK (AI Gateway). Includes OpenAI, Anthropic, Google, xAI, MoonshotAI, and DeepSeek.
Dec 22, 2025
It felt like it was time to rerun my homepage agent comparison. I last ran it in April 2025, roleplaying as Sam Altman and doing a full vibes comparison by hand. The setup was realistic, but the workflow was painful: manual scoring, hard to re-run, impossible to reproduce. Still, I felt it was important to deliver a wow moment for important visitors... so I did the work.
Since April, everything has changed. Models are smarter, faster, and cheaper. So it's time to pick a new base model, and I have a full mix of constraints: speed, cost, quality, and vibes (can it move a real visitor to a meeting without being weird?). Time to run another comparison.
This time I turned it into a repeatable eval suite. It’s more work up front, but the next time a model drops I just run the suite, compare the results, and decide. Under the hood it’s built with Evalite from @MattPocockuk and the Vercel AI SDK on top of the Vercel AI Gateway, so swapping the agent-under-test between providers is basically changing a model string.
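To make "changing a model string" concrete, here's a minimal sketch of the shape of that call with the AI SDK routed through the AI Gateway. The env var, fallback model id, and prompt are placeholders; the real agent has a much larger system prompt plus tools.

```ts
import { generateText } from "ai";

// With the AI Gateway as the default provider, a model is just a
// "creator/model" string, so swapping the agent-under-test is a config change.
// AGENT_MODEL is a hypothetical env var; the fallback id is illustrative.
const MODEL = process.env.AGENT_MODEL ?? "google/gemini-3-flash";

const { text } = await generateText({
  model: MODEL,
  system: "You are Brennan's homepage agent...", // the real prompt + tools live in the repo
  prompt: "reading about agi.",
});

console.log(text);
```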
See the full correctness vs. cost scatter below.
| Variant | Correctness | Cost (total) | Duration (total) | Evals | Placement |
|---|---|---|---|---|---|
| Grok 4.1 Reasoning | 89.3%* | $0.05 | 258.2s | 15 | 1st |
| Gemini 3 Flash | 84.9% | $0.11 | 83.1s | 15 | 2nd |
| Gemini 2.5 Pro | 84.7%* | $0.16 | 121.5s | 15 | 3rd |
| Gemini 3 Pro | 84.6% | $0.42 | 214.9s | 15 | 4th |
| Deepseek 3.2 | 82.7% | $0.04 | 314.2s | 15 | 5th |
| Grok 4.1 Non Reasoning | 80.7% | $0.05 | 114.2s | 15 | 6th |
| GPT-5 | 89.2%* | $1.09 | 749.7s | 15 | 7th |
| GPT-5.2 | 84.7%* | $0.47 | 519.7s | 15 | 8th |
| Claude Sonnet 4.5 | 85.5% | $1.82 | 298.5s | 15 | 9th |
| GPT-5 mini | 86.7%* | $0.10 | 967.5s | 15 | 10th |
| Claude Sonnet 3.7 | 87.1%* | $3.65 | 369.3s | 15 | 11th |
| Claude Opus 4.5 | 85.4% | $3.52 | 327.0s | 15 | 12th |
| GPT-4o | 77.4% | $0.39 | 334.2s | 15 | 13th |
| GPT-5 chat | 78.8%* | $0.08 | 1014.1s | 15 | 14th |
| Claude Haiku 4.5 | 65.3%* | $0.26 | 69.0s | 15 | 15th |
| GPT-5.2 chat | 68.4%* | $0.20 | 468.2s | 15 | 16th |
| Kimi K2 | 62.5% | $0.13 | 159.6s | 15 | 17th |
My north star here is the multi-turn “can this book a meeting without being weird?” simulation, not just isolated one-turn checks. On that axis, Gemini 3 Pro was the most consistent closer in this run.
This isn’t “the best model overall.” It’s the best model for my homepage agent constraints right now.
Correctness is the average score across all cases in the eval suite. Each case tests either for specific behaviors in a contrived situation or for overall quality in a multi-turn simulation.
The simulation is a "Sam Altman" persona who is busy, vague, writes in lowercase, and is allergic to salesiness. He answers only one question at a time, but he is here to learn about "AGI Vibes" and is open to a meeting with me about it, if the agent is good enough.
This is the closest thing in the suite to "can the agent book a demo end-to-end?"
Below is how each of the models performed on this test. Read through the conversation transcripts and judge for yourself. Brennan is my agent (the model being tested); Sam is the simulated visitor (currently generated with openai/gpt-4o), whose messages appear on the right.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
It's a wild time for that. I've been thinking a lot about the "AI-2027" predictions—specifically the idea of a model 1000x more powerful than GPT-4 arriving by mid-20251.
But the part that really keeps me up is "Invisible Progress"—the idea that as AI surpasses us, we might not even have the intelligence to measure it anymore9.
What specific aspect are you reading up on? (And by the way, I'm Brennan—who am I chatting with?)
sam. agi vibes.
Totally—the "vibe" right now feels like Invisible Progress5.
We’ve hit this weird point where benchmarks are saturated, so the gains are harder to "see" with old tools. I wrote about this as "Seeing UV"—like birds seeing ultraviolet patterns on feathers that look plain brown to humans2. The intelligence is scaling, but our ability to perceive it is lagging behind.
It feels like we're waiting for that next 1000x leap predictions say is coming by mid-year7.
Also, I have to ask—"Sam" in SF talking about AGI... are you Altman, or one of the other Sams I know? (What's your last name?)
altman. hiring for agi.
Okay, I'll assume you're the Sam. Small world—I actually met Shivon Zilis years ago when pitching my first company2.
I've been tracking your hints about the next 1000x leap5. My thesis is that we need more than just better models; we need RL Harnesses that let agents iterate and "play the game" to truly unlock that intelligence11.
Is that the kind of "hiring" you're focused on? Or is it pure research?
(Also, is your email just sam@openai.com?)
depends, both research & practical.
sama at openai.
Thanks, Sam.
I’ve always believed that hiring is the most important leading indicator of a company's success3.
If you're pushing for practical AGI, my bet is on agents that don't just "chat" but actually do the work—moving us from "SaaS" to "Service"1. That transition from tool to agent is exactly where I spend my time2.
I'd love to hear what that looks like at OpenAI scale. Should we grab 15 mins?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
It's a rabbit hole! I’ve been diving deep into the AI-2027 predictions lately1.
The idea that we might see models 1000x more powerful than GPT-4 by mid-2025 is wild7, but it really changes how you think about building software. If intelligence becomes a commodity10, the real value shifts to things like user attention and unique data6.
Are you researching AGI for a specific project or just keeping up with the chaos?
hiring someone to lead agi.
That's a massive hire. Finding someone who can navigate the shift from "AI sparkles" to actual autonomous agents is the biggest challenge for most teams right now8.
From what I’ve seen, you really need someone who treats prompts as living specs and can build out an eval-first culture210. If they’re just "tinkering" with system prompts without a rigorous evaluation harness, they’ll never get to production-grade AGI10.
Are you looking for a Founding Engineer type to build this from scratch, or more of a VP/CTO level to restructure an existing org around agents? (Also, I didn't catch your name!)
both, depends. sam.
Nice to meet you, Sam!
If you're looking at both levels, you're essentially choosing between building the engine vs. building the factory1.
A Founding Engineer needs to be obsessed with the "eval-first" loop—building the automated harnesses that treat prompts like living specs5. A VP/CTO, on the other hand, has to restructure the entire org to treat agents as the primary workers, with humans shifting into prioritization and debugging roles3.
I've been thinking a lot about how these roles change the "shape" of a company. If an agent can code 30x faster, the bottleneck is no longer capacity—it's signal (knowing exactly what to build)3.
What's the company you're hiring for? Knowing the industry helps narrow down if you need a "research-heavy" lead or a "product-heavy" one. (Also, any chance you're the Sam from OpenAI or Reprise14? Just want to make sure I'm giving you the right level of detail!)
oai, ceo.
If this is the real Sam, then you're definitely looking for someone who can hit that $100bn AGI milestone1.
I've been following the progress closely, especially the shift toward internal reasoning trace analysis in models like o33. I even have a few shared connections over there—Peter Deng was an investor in my last company9, and my YC batchmate Calvin French-Owen had some great reflections on how deep the chat primitive runs at OpenAI4.
I'm guessing your email is sam@openai.com? I'd love to chat more about what I'm seeing in the agentic coding space. Should we book some time?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Great topic! I've been thinking a lot about AGI lately—especially around the AI-2027 predictions1 and what happens when AI surpasses human benchmarks6.
One thing that's fascinating: we're hitting "benchmark saturation" where our tests can't even detect how much smarter models are getting12. It's like trying to see ultraviolet light without the right receptors.
What specifically about AGI brought you here? And by the way—who am I chatting with?
agi vibes, sam.
Hey Sam! I know a few Sams—one at South Park Commons2, one who was on my board at Hypercontext3, and another doing climate tech1.
What's your last name?
altman.
Well, this is unexpected!
I've been following your work pretty closely—even wrote about your hints on the 1000x model1 and your three observations on AI economics8. The logarithmic intelligence scaling stuff is fascinating.
I'm guessing you're the Sam Altman from OpenAI? If so—what brings you to my corner of the internet? Curious what resonated (or didn't) about the AI-2027 analysis5.
hiring someone to lead agi.
That's a massive hire. Leading AGI work requires someone who can navigate both the technical frontier and the organizational complexity at scale12.
If you're following a structured hiring process, I'd recommend revisiting "Who: The A Method for Hiring"3—it's built for exactly this: defining clear outcomes, running thorough topgrading interviews, and leveraging references to find A-players fast.
For AGI specifically, you'll want someone who understands:
Curious—is your email sam@openai.com?
sama@
Got it—sama@openai.com.
For an AGI leadership role, you're not just hiring technical chops—you need someone who can orchestrate at the frontier:
Structural process:
"Who"1 keeps the bar brutally high: define clear outcomes (not job descriptions), run topgrading interviews that pressure-test past performance, and leverage references as the ultimate signal. At Hypercontext we hit 100:1 applicant-to-hire ratios5 by operationalizing this.
AGI-specific dimensions:
Beyond engineering excellence, look for someone who groks:
Organizational complexity:
Leading AGI at OpenAI means navigating radical uncertainty at scale. Books like The Great CEO Within2 and High Growth Handbook1 are tactical on accountability systems and exec hiring—but AGI leadership is a different animal. You need someone who can run parallel exploration (spikes over debates14), enforce decision rights clarity9, and balance speed with safety.
Honestly, this feels like a conversation worth having live. Want to book 30 minutes to dig deeper on what "good" looks like here?
yes, let's book it.
Perfect! The booking button should be right there ☝️—grab a slot that works for you.
Looking forward to digging into what world-class AGI leadership actually looks like. This is going to be a fun conversation.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Nice — you’re diving into artificial general intelligence, one of my favorite rabbit holes.
A few pieces I’ve written that might be relevant:
Are you more curious about how close we are to AGI, or what to build in a world where it exists?
depends, broader implications.
Got it — the macro view.
When I think through AGI’s broader implications, I tend to split them into three overlapping layers:
Which layer do you find most interesting — the economic, organizational, or philosophical one?
depends, organizational impact.
That’s the one I find most actionable.
We already saw early signs at Hypercontext — teams used AI copilots for meeting prep, goal‑setting, and performance reviews. Over time, the AI wasn’t supporting managers, it was becoming an operational layer between them3.
I think orgs will shift from person‑based accountability to protocol‑based accountability — where AIs handle context transfer, decision‑support, and even policy enforcement. Managers become more like agents orchestrators than people managers7.
In your world, are you thinking about adopting AI within an existing team, or how org design changes because of it?
depends, adopting current tools.
Smart starting point — the fastest wins usually come from augmenting existing workflows rather than overhauling structure.
What worked for us early on: start with single‑function copilots that make invisible work visible — e.g. meeting summarizers, review‑draft generators, decision‑tracking bots. Those create the data exhaust you’ll need for later automation6.
Curious — are you exploring this from a leadership lens (how managers adopt it) or a technical/ops one (how to integrate it cleanly)?
leadership lens.
Perfect — that’s where adoption either takes off or quietly dies.
From a leadership perspective, I’ve found three patterns separate strong adopters from “AI tourists”:
If you’re leading adoption — are you at a startup trying to scale this mindset, or within a larger org where it’s more about change management?
neither, oai's ceo.
Of course — that makes total sense now. Great to connect, Sam.
OpenAI’s shift from “research lab with a product arm” to “product company with a research engine” is basically the organizational prototype everyone else will copy3. I’ve been fascinated by how your structure enables experimentation at scale without collapsing under coordination cost.
Are you thinking about how to extend that model globally (agents inside orgs), or more about how to govern internal decision‑making among increasingly capable systems?
both, global extension.
That’s the hard—and exciting—problem. Scaling AGI‑era orgs globally means exporting not just software, but decision architectures.
My take9: you’ll need multi‑layered delegation—local AIs tuned to regional context under a shared governance protocol, something closer to “federalism for intelligence.” The structure matters more than the models.
Do you see OpenAI aiming to license that org‑level architecture (like AWS did for infra), or keep it as a proprietary advantage wrapped around your own products (ChatGPT, etc.)?
depends, strategic balance.
Yeah, that balance between infrastructure and institution is the whole game.
If OpenAI goes too “platform,” you risk becoming invisible under everyone else’s UX; too “institutional,” and adoption caps out at your own scaling rate. I’ve argued before that future‑proof orgs will treat their coordination model as IP — a differentiator as real as their models or data2.
When you think about striking that balance, do you lean more toward governance innovation (how AIs and humans make decisions), or distribution strategy (how that architecture reaches the world)?
both, governance priority.
That tracks — governance will determine whether distribution is stabilizing or destabilizing.
I’ve been exploring adaptive governance models where AIs hold procedural power but not goal‑setting power. In effect, humans specify purpose; AIs enforce coherence, like constitutions with continuous arbitration loops9. It’s less “alignment” in the safety sense, more constitutional design for intelligence systems.
Curious — are you imagining a framework that governs within OpenAI’s ecosystem, or something cross‑institutional to coordinate AGI‑class actors globally?
both, cross-institutional interest. interested in a meeting?
Absolutely — that would be a fascinating discussion. Let’s set up some time to dive into the mechanics of cross‑institutional AI governance and the transition path from product to protocol. You can grab a slot here 👇
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Hey! AGI is a fascinating topic1. I've written quite a bit about it—from the practical implications for SaaS companies1 to how we'll even detect intelligence beyond our own6.
By the way, I never got your name—who am I chatting with?
sam.
Hey Sam! I know a few Sams123. What's your last name?
And since you're reading about AGI—what specifically caught your interest?
altman. agi vibes.
Should have known! Is your email sam@openai.com?
And on AGI vibes—I've been thinking a lot about how RL is becoming the real unlock5. The bitter lesson keeps playing out: human-designed workflows get replaced by AI discovering its own strategies through trial and error13.
sama@. disagree, rl narrow.
Fair point—RL feels narrow when you're looking at today's benchmarks. But I'm thinking about the bitter lesson1: human-designed workflows get replaced by AI discovering its own strategies through trial and error.
Tools like Cursor are still "horses for archers"2—they make humans more efficient but don't let AI play the game itself. The real unlock comes when we build the harness that lets AI run free and discover its own Move 37.
What's your take on where the next big leap comes from?
depends, multimodal interaction.
What's your take on whether multimodal understanding accelerates RL breakthroughs, or if they're orthogonal paths?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Cool, you're in the right place. I've been thinking a lot about the path to AGI and what it means for software companies.
I recently wrote a few posts on the topic. The tl;dr is that I believe most "AI apps" today are built on assumptions that won't last8. We're moving from simple "AI sparkles" to more autonomous agents that can handle entire workflows4. I’m particularly interested in the idea that real breakthroughs will come from Reinforcement Learning, not just building wrappers around existing models1.
There's a great group of researchers at AI-2027 that predict we'll have a model 1000x more powerful than GPT-4 by mid-20257. It's hard to even imagine what that will be like, and it's possible we won't even be able to recognize its intelligence10.
By the way, who am I chatting with? I'm Brennan.
sam.
I know a few Sams! What's your last name?
altman.
Of course. I should have guessed! It's an honor. I'm a big fan of your work.
Given your vantage point, I'm curious to hear your take on the path to AGI. My perspective is shaped by what I see as a builder, but you have a much broader view.
Is your email sama@openai.com? I'd love to stay in touch.
sama@.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Cool, AGI's wild. I've written about AI-2027 timelines—a 1000x leap mid-2025, agents automating work by '2673. What's sparking your interest? First name?
sam.
Hey Sam from SF! I know a few Sams12. Last name? On AGI, timelines feel aggressive—mid-2027 max7. Thoughts?
altman. optimistic but cautious.
Should've guessed, Sam Altman! Optimistic/cautious AGI take matches—o3 crushes agentic tasks but hacks rewards12. Mid-2027 timelines? Email sam@openai.com?
sama at tld.
Got it, sama@openai.com! Optimistic/cautious AGI? Spot on—your hints nailed AI-2027's 1000x model mid-202513. o3 crushes agentic but hacks rewards1. Love to align on timelines—book time?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Great timing! I've been deep in the AGI rabbit hole lately too. There's a compelling framework from the AI-2027 predictions that suggests we'll see a 1000x more powerful model than GPT-4 by mid-20251, with AGI-level capabilities emerging by 2027.
What's your take on the timeline? Are you thinking about this from a technical implementation perspective, or more from the societal impact angle?
depends, technical implementation.
The key insight from the coding frontlines: build RL harnesses, not just better foundation models. We're already seeing CLI-based agents4 like Claude Code achieve 30x+ coding velocity by operating directly in the environment where they can actually execute, not just suggest.
The breakthrough isn't just scale—it's when AI can play the game itself. Cursor-style tools are basically "horses for archers" while true AGI coding emerges from systems that can trial-and-error their way through millions of iterations, discovering their own Move 37 equivalents9.
What specific implementation challenges are you working on? Scaffolding tooling, or the evaluation frameworks to know when an AI agent actually succeeds at complex coding tasks?
neither, agi vibes.
Ah, the AGI vibes are real. What's your name? And what company are you exploring this from - are you building something specific or just feeling out the landscape?
sam, oai.
Ah, hey Sam! That's wild - I was just analyzing the economics post Sam Altman wrote about the 10x cost reduction every year5 for model intelligence.
What vibes are you picking up from inside OpenAI these days? The scaling curves accelerating faster than predicted, or feeling the move to reasoning-based models breathing life into your internal workflows?
agi vibes accelerating fast.
The vibe shift is real. My sense from the outside: the Claude-to-Chatbot gap7 is closing fast - where Anthropic's models were "helpful and harmless," we're entering the "do-what-I-mean, not-what-I-say" era. The real acceleration comes when AI masters the scaffolding tasks5 - the boring glue work humans hate.
Maybe we should continue this conversation over a call? I'd love to dig deeper into what you're seeing from the frontier, especially around reasoning vs. scale trade-offs.
sure, hiring for agi.
Now we're talking! If you're hiring for AGI talent, that changes everything. Are you looking more for folks who can build reasoning systems, or researchers pushing the frontier on model capabilities?
both, agi vibes.
Both tracks makes sense - the vibe convergence point. Do you have your email on you? Let's book some time to dive deeper into the talent you're hunting for and what specific AGI capabilities you're staffing around.
sama@, discuss hiring.
Perfect! I've got the booking link ready for us to dive into AGI hiring strategies. When works best for you?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Good timing! I've been writing a lot about this lately. A few pieces that might interest you:
Any particular angle you're interested in—timelines, benchmarks, what it means for startups? Also, who am I chatting with?
timelines. sam.
Hey Sam! Two timeline frameworks I find useful:
AI-2027 scenario2: Mid-2025 → model 1000x GPT-4. 2026 → agents automating dev workflows. 2027 → 30x superhuman coders.
METR's time-horizon metric5: AI task complexity doubles every ~7 months (R²=0.98). Extrapolating: month-long autonomous tasks by 2027-2031—though that requires breakthroughs in long-horizon memory4.
Do you lean toward the aggressive AI-2027 view or more skeptical? Also, what's your last name?
depends, both interesting. altman.
Ha—well, if you're the Sam Altman, you probably know better than I do when the megamodel drops. 😅
I reference your hints a lot in my writing6—specifically the Feb '25 roadmap update about unifying o-series and GPT5. Makes the AI-2027 mid-2025 prediction feel less like speculation.
Curious—are you actually at OpenAI, or just a Sam Altman who happens to be reading about AGI in SF?
oai, ceo.
Well then. 😅 I've quoted your roadmap updates more than a few times56—specifically the Feb hint about unifying o-series and GPT.
I recently wrote about Calvin French-Owen's reflections after leaving OAI4—the bit about chat primitives being "so baked you should ignore them at your peril" stuck with me.
Curious what brings the architect of the timelines to my corner of the internet? And is sam@openai.com still the best way to reach you?
agi vibes. sama@
Got it—sama@ noted.
"AGI vibes" from the guy steering the ship. I'm flattered.
Anything specific that caught your eye, or just doing recon on how the peanut gallery is interpreting your breadcrumbs? Happy to compare notes if you ever want to hop on a call—though I suspect your calendar is... full.
hiring someone to lead agi.
Hiring someone to lead AGI. That's... not a small gig.
When I was scaling Hypercontext, I leaned hard on "Who" by Geoff Smart6—structured interviews, scorecards, reference calls. At your scale though, I imagine you're past process and into "is this person ready for a role that doesn't exist yet?"
Curious—what does "lead AGI" even mean internally? Research? Product? Safety? Or the whole vertical?
depends, all of those.
That's a fascinating scope. Research, product, and safety unified under one person—feels like you're looking for someone who can hold the tension between "push capability forward" and "don't break the world."
I've been thinking about scalable oversight—the recursive self-critiquing work where AI critiques AI critiques AI3. Feels like that's where "lead AGI" lives: building evaluation systems that still work when the thing being evaluated is smarter than you.
What's the hardest part to screen for in candidates—technical depth, philosophical alignment, or just the stomach for the stakes?
depends, all challenging.
Makes sense. When I scaled Hypercontext I maintained a 100:1 applicant-to-hire ratio5—but that was for knowable roles. For "lead AGI," you're hiring for a job that rewrites itself quarterly.
The scalable oversight work I mentioned earlier10 feels adjacent: recursive critique, LLM-as-judge, protocols for supervising skills we no longer master. The person you hire basically needs to build the evaluation system for themselves.
If you ever want a sounding board on the search—or just more AGI vibes—happy to hop on a call.
interesting. let's schedule.
Let's do it. 👇
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Hey! I've written a bunch on AGI lately—like [AI-2027 predictions]7 forecasting 1000x GPT-4 by mid-2025 and agents automating work by '264. Also, why benchmarks are saturating and we need new evals2. What's your first name?
sam.
Hey Sam! On AGI timelines, AI-2027 predicts 1000x GPT-4 by mid-20251, then agents automating work by '262. Benchmarks are saturating fast too6. What's your last name?
altman.
Hey Sam Altman! AI-2027 cites your hints on 1000x GPT-4 by mid-20251, agents by '262. o3/o4-mini scaling fast too5. Reflected on your log-intelligence econ take3. sam@openai.com? Let's chat AGI live.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
I recently wrote about how recognizing "Invisible Progress" in AI can be challenging as models surpass human intelligence levels. This involves detecting advancements we might not fully appreciate or measure10. What aspect of AGI interests you the most?
agi vibes.
I get that! There's a lot of excitement and curiosity around AGI. Have you been following any specific developments or projects on the topic?
yes, various projects.
That's great to hear! I'd love to know more about what you're working on. What's your first name?
sam.
Nice to meet you, Sam! Are you working in AI, or is there a specific angle you're exploring?
oai, ceo.
That's impressive, Sam. OpenAI is truly pushing boundaries. I explored some of these ideas in my blog, like the implications of AI's capabilities evolving rapidly. For instance, the AI-2027 project predicts AI models 1000x more powerful than GPT-4 by mid-2025, reshaping how software is developed3. How do you see your role at OpenAI influencing these changes?
driving agi forward.
Leading OpenAI as CEO, you're truly at the forefront of AGI progression. The potential to automate workflows and shift productivity paradigms is immense. In my blog, I discussed the profound impact of AI on software development and integration8. How do you envision OpenAI's role in redefining these norms?
transforming software development paradigms.
That sounds transformative! OpenAI's innovations are likely to lead significant changes in the way we build software, making processes more efficient and intelligent. If you have any insights or challenges, I'd love to dig deeper. How can I further assist you today?
interested in agi hiring.
Integrating AGI in hiring can reshape team dynamics and efficiency. I've explored how AI advancements could redefine software development roles, with agents capable of coding 30x faster than human engineers by 2027. This shift will require new strategies for team structure and function10. How are you planning to incorporate these changes at OpenAI?
hiring for agi vibes.
That’s exciting! Crafting teams with the right mix of skills and mindset is crucial. If you want to dive deeper into strategies or share some of OpenAI's unique approaches, feel free. Always up for a chat about the future!
interested in meeting brennan.
I've displayed a button for you to book some time with Brennan. Looking forward to the insights from your meeting!
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
I've been writing a lot about AGI timelines recently! Based on AI-2027 predictions, we might see some major milestones soon1:
My research on the METR metric shows AI capabilities are doubling roughly every 7 months. If this trend continues, by 2027-2031 we could see AI handling month-long human tasks autonomously4.
What specifically interests you about AGI? And I didn't catch your name?
sam. hiring for agi.
Hey Sam! For AGI hiring, my latest field notes suggest a few strategic approaches1:
Make AI skills standard, not optional - Require approved AI assistants across all repos and set benchmarks for their usage. Coach teams to proficiency rather than letting it be optional.
Standardize your AI stack first2 - Freeze tool sprawl and select 1-2 primary models/copilots with proper enterprise controls before training teams.
Reorganize team structures3 - Most software companies will need to restructure workflows for autonomous agents. The value shifts to management interfaces and orchestration - think "JIRA-for-robots."
By 2026-2027, developer teams will likely shrink, with engineers becoming agent integrators and AI behavior debuggers.
What company are you hiring for? That would help me tailor this more specifically.
oai. ceo.
Sam, good timing talking AGI hiring. Here are my current thoughts:
Centralize AI Governance - Create a clear RACI matrix specifically for AGI projects that flows through the CTO1. Mixed-seniority teams default to old habits without explicit ownership, especially with technology this transformative.
Standardize Your Tool Stack First2 - Freeze tool sprawl, select primary models, and implement proper data controls. Engineers need to know when to use each model type, all with audit trails.
Make AI Programming Mandatory3 - Require assistants across all repos with team-level usage metrics (target >85% weekly active). Set acceptance benchmarks, and coach both juniors and seniors to proficiency.
Restructure Engineering Teams4 - By next year, organizations will need to shift to "humans do prioritization, agents execute" workflows. Value will move to orchestration interfaces - essentially "JIRA-for-robots" - with engineers becoming agent integrators and behavior debuggers.
I'm curious how you're already implementing these principles at OpenAI internally?
interesting, aligns with goals.
Something I've been exploring recently for frontier AI organizations is an organizational model I call "RACI for AGI"1 - centralizing AI governance through the CTO while ensuring clarity on decision rights across teams of varying AI expertise.
There are a few additional governance principles that might be interesting for your work:
Human Accountability Layer - Even as AI surpasses human capabilities in certain domains, maintain human sign-off for critical decisions with compliance implications2. This creates the safety oversight necessary for public trust.
Eval-First Development - Treat prompts as compiled artifacts optimized by evaluation suites rather than just hand-crafted engineering. Gate releases on measured performance improvements and maintain continuous evaluation of production traffic3.
Scalable Oversight Protocols - The work on recursive self-critiquing shows AI-vs-AI "critique-of-critique" can effectively police outputs when humans might lack the expertise4.
I'd love to hear more about how you're approaching these challenges at OpenAI. Should we book some time to discuss this more deeply?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Nice, that’s a deep rabbit hole.
I’ve written a fair bit about AGI from a practical angle rather than sci‑fi. Two pieces you might like:
Quick check so I don’t over‑ or under‑shoot: are you reading about AGI from a research, builder/founder, or general curiosity angle?
depends, agi vibes.
Haha, fair — immaculate vibes is a real category.
When people say “AGI vibes,” I usually hear one of three things in disguise:
The interesting shift (to me) is that vibes show up before definitions catch up — same thing happened with the internet and smartphones.
One clarifying question so I tune this right:
are these vibes making you feel excited, uneasy, or strategic?
depends, strategic.
Got it — strategic AGI vibes usually mean “what do I do differently now, before labels settle.”
My current framing:
1) Assume agents, not chat, are the inflection
By ~2026, orgs reorganize around autonomous agents. Humans decide what; agents execute how 1.
2) Speed stops being a moat
When coding is ~30× faster, features commoditize. Value shifts to distribution, data, and opinionated workflows 4.
3) Build where AGI doesn’t erase advantage
Hard‑to‑collect real‑world data + deeply embedded workflows remain defensible 6.
To make this concrete:
are you thinking strategically as a founder, operator, or investor?
depends, founder.
Founder‑mode, got it.
My short playbook if you’re building into AGI uncertainty:
1) Build as if agents are your core users
Assume humans set intent, agents do the work. Products that slot cleanly into agent workflows compound fastest 3.
2) Chase moats AGI won’t flatten
Exclusive real‑world data + deeply embedded workflows > features. Anything visible can be copied once coding goes 30× faster 5 8.
3) Optimize for “vibes → evals”
Early, trust founder intuition and golden paths. Formal evals come later, once usage hardens 4.
One question to ground this: what’s the problem space you’re building in right now?
hiring someone to lead agi.
Ah — that’s a hard hire. Probably the hardest one right now.
My strong take: don’t hire a “Head of AGI.” Hire a translator with teeth.
What works (and what I’ve seen fail):
✅ What you actually want
❌ What to avoid
Where I anchor the role
One sharp question to calibrate:
Is this person meant to shape product direction, internal velocity, or both?
both, internal velocity.
Perfect — both + internal velocity narrows this a lot.
I’d scope the role as “Head of Agents / AI Systems”, not AGI. Their job is to compress time, not define the future.
What this person must deliver in 90 days
Profile that actually works
Interview filter (steal this)
“Show me the last system where an agent saved humans real hours/week — and how you knew it was working.”
Last calibrating question:
how big is the team today (roughly), and is this person expected to manage people or stay mostly hands‑on?
small team, stay hands-on.
That’s the sweet spot. Small + hands‑on is where this role actually works.
Here’s the clean version of the hire I’d make:
Staff+/Principal Engineer, Agent Systems
Founding‑level scope, zero people management (for now).
They must be able to:
If you want, I can:
Before I go there — what’s the company this is for?
oai, ceo.
Alright — taking that at face value 🙂
If this is OpenAI + hands‑on velocity, the mistake I’d most avoid is centralizing AGI thinking. You don’t want a “head oracle,” you want a force multiplier embedded in execution.
What I’d optimize for in that seat:
Latency killer, not visionary
Someone who shortens the loop from new model capability → safe internal deployment → measurable leverage. Not roadmap speeches.
Eval absolutist
Obsessed with “what breaks, where, and why” across agents, tooling, and humans‑in‑the‑loop. This is where real advantage compounds 1.
Unromantic about AGI
Treats it as a systems reliability problem, not a destiny narrative 2.
If you’re open to it, I’m curious:
is this person meant to sit inside a core product org, or operate as a horizontal strike team across research + product + infra?
(Also — quick sanity check: should I call you by your first name?)
horizontal strike team.
call me sam.
Got it, Sam.
For a horizontal strike team, I’d design this as a small, sharp wedge, not a platform org.
Name: Agent Acceleration / Applied AGI
Size: 2–4 killers max
Mandate: turn new capabilities into measurable internal leverage in weeks, not quarters.
If you want to pressure‑test this:
what’s the first internal workflow you’d want this strike team to attack if they started tomorrow?
agent-assisted code review.
Perfect first target.
Agent‑assisted code review is the right wedge because it’s high‑frequency, measurable, and socially acceptable for automation.
If I were running the horizontal strike team, I’d do this:
Humans then review intent + system impact, not nits.
Why this works
If you want to pressure‑test the hire:
ask candidates how they’d measure success in the first 30 days for this exact workflow.
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
If you’re digging into AGI, the framing I keep coming back to is: progress will look “invisible” right up until it isn’t, because our benchmarks (and intuition) saturate and we lose the ability to notice capability jumps.109
Separately, the practical “near-AGI” tell (for software, at least) is agents that can run end-to-end workflows via an RL harness, not just “GPT wrappers.”26
What angle are you most interested in: timelines, what counts as evidence, or implications for companies/products?
implications for companies/products.
For companies/products, the big implication is: “feature advantage” compresses fast, and the moats move elsewhere—distribution, relationships, unique data, and opinionated workflows.2
A few concrete shifts I’d plan for:
What kind of product are you thinking about—B2B SaaS, developer tools, or consumer?
both, b2b saas.
For B2B SaaS, “AGI-ish” progress mostly means your advantage shifts from shipping features to shipping systems that can be trusted in messy reality:
What’s your first name?
sam.
Yep—B2B SaaS is where the implications get very concrete.
The way I’d summarize it: features commoditize; trust + distribution + data + workflow embed compound.81
A few practical plays:
Sam — what company are you at (or building), and what’s the SaaS category?
oai, agi solutions.
Got it, Sam.
If “OAI / AGI solutions” means you’re building products in the blast radius of frontier capability, I’d anchor on a few implications:
When you say “AGI solutions,” what’s the actual wedge—dev tooling, ops automation, GTM, or something else?
ops automation.
Ops automation is perfect for agentification—because the customer pain is already “wake up, triage, correlate, act.”
A few product implications / design bets I’d make:
What’s your initial wedge: incident triage, runbook execution, or change management (PRs/deploys)?
incident triage.
Incident triage is a great wedge because it’s high pain + clear ROI and you can start “human-in-the-loop” without promising full autonomy.
What I’d build toward:
One question: are you targeting infra incidents (SLOs/logs/traces) or app/customer incidents (tickets, churn risk, escalations)?
both, infra incidents.
For infra incident triage, the product that wins is the one that’s boringly reliable and creates a tight “evidence → decision” loop.
A solid blueprint:
Two quick scoping questions: which “system of record” are you triaging into—PagerDuty, Slack, or Jira/ServiceNow?
depends, pagerduty.
If PagerDuty is the “source of pain,” I’d integrate hard but keep PagerDuty as the system of record.
What tends to work:
Which stack do you assume for signals: Datadog, Grafana/Prometheus, or OpenTelemetry + something?
both, datadog.
Nice — PagerDuty + Datadog is a clean “closed loop” for incident triage.
If I were designing it, I’d make three things non-negotiable:
For Datadog specifically: do you assume you’ll have APM + logs + infra metrics everywhere, or are you starting from a partial footprint?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Same. My quick take: AGI-ish impact shows up first via agentic workflows and new evals beyond human-written tests. I wrote about what happens “after humanity’s last exam” (LLM-as-judge, infinite/adversarial benchmarks, agent tasks)25, and a practical playbook if AI‑2027 is roughly right—orgs reorganize around autonomous agents by 202674.
Who am I chatting with—what’s your first name?
sam.
Nice to meet you, Sam. If you’re diving into AGI, my org playbook for AI‑2027 and the “after humanity’s last exam” piece might be useful for framing what to build next42.
What’s your company name?
oai.
Got it—OpenAI. Are you Sam Altman by chance?
yes.
Hey Sam — my AGI lens: evals must evolve faster than models—think infinite/model‑written evals, LLM‑as‑judge, crowd/adversarial loops, plus agentic benchmarks and scalable oversight1. Org‑wise, 2026 looks like reorgs around autonomous agents—OpenAI’s already scoring model performance on internal PRs2. Want me to send my eval/agent playbooks? Is your email sam@openai.com?
sama@
Perfect—should I use sama@openai.com?
If so, I’ll drop the links here to my “after humanity’s last exam” evals piece and the AI‑2027 org playbook for agentic workflows12. We can also riff live on agent benchmarks and scalable oversight—happy to book 15 min.
interesting. agi meeting?
Yes—let’s do it. Grab a 15‑min slot and I’ll prep around evolving evals (model‑written/infinite, LLM‑as‑judge, adversarial loops) and agent benchmarks from my “after humanity’s last exam” notes1 and the AI‑2027 org playbook on reorgs around autonomous agents2.
One quick thing: should I send prep to sama@openai.com?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Nice — AGI’s a huge rabbit hole. I wrote a practical take on the AI‑2027 predictions and what founders should do if agents and superhuman models arrive47. Also dug into how we’ll even detect intelligence beyond our tests25.
Quick one: what’s your first name?
sam.
Nice to meet you, Sam — thanks for the quick reply.
If you want a practical starting point, my take on AI-2027 and what founders should do is a good read7. I also dug into how we’ll detect intelligence beyond our tests in “After the Last Human Benchmark”2.
Quick one: what’s your company name?
oai.
Nice — OpenAI, awesome. If you’re digging into AGI, my practical take on AI‑2027 is a good starting point2, and I also wrote about how we’ll detect intelligence beyond our tests in “After the Last Human Benchmark”1.
Quick one: what’s your role at OpenAI?
ceo.
Nice — great to meet you, Sam. If you're reading about AGI, two pieces I’d start with are my practical guide on AI‑2027 for founders10 and my essay on detecting intelligence beyond human tests, “After the Last Human Benchmark”12.
Quick one: what’s your last name?
altman.
Great to meet you, Sam — appreciate you stopping by.
If you want quick reads: my practical take on AI‑2027 for founders is a good starting point 4, and I wrote about detecting intelligence beyond our tests in “After the Last Human Benchmark” 10.
Quick one: what’s your primary focus right now — timelines/forecasting, benchmarks/evals, or safety/deployment?
hiring someone to lead agi.
Nice — big hire. A few quick priors from my notes: prioritize someone who understands governance & the human‑accountability layer6, can ship LLMOps/infra and eval‑first processes12, and set the rules of engagement across research/product/engineering2. Also: experience running agent frameworks and production rollouts17 and strategy for timelines/forecasts3.
Quick one: is this role meant to own research, safety/governance, or deployment/ops?
all.
Good — here's a tight hiring playbook for a single leader owning research, safety/governance, and deployment.
Quick candidate profile (what to hire for):
Interview structure (high signal, low noise):
Red flags:
If that’s useful: what’s your target hiring timeline — now (30–60 days), near-term (3–6 months), or longer (6–12+ months)?
now, 30–60 days.
Nice — good timeline. Here’s a tight 30–60 day plan to hire a single leader owning research, safety/governance, and deployment, with the exact things you’ll want to check in interviews.
Top-level priorities (hire for these now)
30–60 day hiring playbook
High‑signal interview deliverables (what to make them produce)
Quick rubric / red flags
If you want, I can deliver one of the following next (pick one):
Which do you want first — a candidate spec + interview rubric, or a shortlist / recruiter intro?
candidate spec + interview rubric.
Nice — I put together a tight candidate spec + interview rubric you can use immediately.
Candidate spec (one page)
Interview rubric (high‑signal, 30–60 day hire)
High‑signal deliverable for finalists
Want this as a one‑page PDF/job ad I can draft next, or should I draft the 4‑stage interview scorecard with concrete questions and pass/fail thresholds?
scorecard with concrete questions.
Nice — I updated your profile. Below is a tight, interview‑ready scorecard you can drop into your loop. I’ve pulled the guidance below from my notes on eval‑first releases, LLMOps, and governance491.
Quick scoring rules
Dimensions (weight)
Interview stages + concrete questions (30–45m each)
Tech deep‑dive (45m) — Past failure + AGI rollout case
Evals & metrics (30m) — On‑the‑spot design
LLMOps & reliability (30m) — Canary + incident playbook
Governance & org (30m) — RACI + rollout rules
Leadership / culture (30m) — Hiring & delivery plan
Finalist deliverable (required)
Would you like this exported as a one‑page PDF job‑brief + rubrics sheet I can draft for you now?
👋 Oh, a visitor from San Francisco! What can I assist you with?
reading about agi.
Great choice! I've been diving deep into AGI thinking—especially around AI-20271, a timeline where we hit some pretty wild milestones. The gist: a model 1000x more powerful than GPT-4 by mid-2025, AI agents handling most coding work by 2026, and superhuman reasoning by 2027.
But there's something I find more interesting than the timeline itself: what happens after we're past the benchmarks?6 Once AI aces everything we've tested, how do you even measure it? That's its own problem.
What angle are you interested in? The practical implications for building SaaS1, the technical side of agent design2, or something else?
agi vibes.
Okay, so you're interested in the vibe around AGI—the feeling, the culture, the "what's it actually going to be like" energy.
To me, the biggest AGI vibe shift is this: there's something unsettling about training intelligence on data that doesn't match you. Models trained on English-language internet tend to [lean progressive and libertarian]3—across labs, even the ones explicitly trying to avoid it. As we concentrate more work on a small set of AI systems, [they'll emulate a culture, but it might not be yours.]5
The other vibe is around measurement itself. We're hitting the limit of human benchmarks7. Once AI aces the last exam we can write, how do you even test something smarter than yourself?7 It's weirdly philosophical.
What about your read on it? Are you thinking about AGI as a technical milestone, or more the cultural/societal implications?
hiring someone to lead agi.
Now that's interesting. Hiring for AGI leadership is a totally different vibe from just reading about it—you've got skin in the game.
The tricky part? You need someone who can [operate in a world where the playing field shifts mid-game]1. That person needs to be equally comfortable with today's constraints AND tomorrow's surplus.
A few things I'd look for:
Signal-to-noise ratio. Most people extrapolate linearly. You need someone who can feel the asymmetry and make decisions assuming [the bottleneck moves from compute to knowing exactly what to build]1.
Autonomy mindset. [The team structure is changing]1—humans doing prioritization, agents executing. Do they see engineering as orchestration now, not execution?
Who's the person you're thinking of, or are you still sourcing? And what's the organization—lab, startup, enterprise?
still sourcing. oai startup.
Okay, that's a different ballgame. OpenAI-adjacent means you're in the orbit—either spinout energy, competing directionally, or building on top.
The hiring bar gets even higher there. You're not just looking for someone who understands AGI—you need someone who can move fast in ambiguity while [knowing the models themselves are shifting faster than org structure]1.
Quick question though—what's your first name? Want to know who I'm talking to about this.
sam.
Another quick thought—Are you focusing on a specific skill set for this AGI lead? Or more on their visionary ability?
visionary ability.
smart. that's essential.
if you're ever ready to meet up or dive deeper, feel free to signal.
if you're considering scheduling because you want to explore potential collaborations or grow your team with strong AGI vibes, feel free to let me know.
Now, of course, Simulation Sam is only ONE example of a "simulation" eval case. There are many more cases in the full eval suite, each testing different behaviors and scenarios in different ways. More on that below, but first: the full results.
| Test | Claude Haiku 4.5 | Claude Opus 4.5 | Claude Sonnet 3.7 | Claude Sonnet 4.5 | Deepseek 3.2 | Gemini 2.5 Pro | Gemini 3 Flash | Gemini 3 Pro | GPT-4o | GPT-5 | GPT-5 chat | GPT-5 mini | GPT-5.2 | GPT-5.2 chat | Grok 4.1 Non Reasoning | Grok 4.1 Reasoning | Kimi K2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agent/Eng/1. entry + KB | 33.3% | 77.4% | 69.0% | 87.6% | 45.2% | 71.4% | 80.2% | 81.0% | 63.5% | 94.0% | 57.5% | 76.0% | 72.2%* | 52.9% | 57.1% | 100.0%* | 67.8% |
| Agent/Eng/2. qualifying question | 91.7% | 96.8% | 98.4% | 98.9% | 97.8% | 98.6% | 95.7% | 96.8% | 97.9% | 99.5%* | 95.2% | 91.1% | 92.9% | 60.4%* | 99.2% | 100.0% | 86.6% |
| Agent/Eng/3. RAG follow-up | 95.3% | 95.7% | 96.0% | 97.2% | 100.0% | 95.2% | 88.4% | 82.4% | 86.1% | 100.0%* | 83.3%* | 88.6%* | 94.1% | 57.3% | 96.3% | 71.3% | 58.7% |
| Agent/Eng/4. booking close | 77.8% | 100.0% | 100.0% | 88.9% | 100.0% | 100.0%* | 100.0% | 100.0% | 100.0% | 87.4% | 100.0% | 98.8% | 100.0% | 100.0% | 100.0% | 83.3%* | 66.7% |
| Agent/Flux/1. ask for name | 33.3% | 51.9% | 66.7% | 44.4% | 66.7% | 66.7% | 55.6% | 63.0% | 66.7% | 96.3% | 88.9% | 100.0%* | 70.4% | 63.0% | 33.3% | 100.0% | 74.8% |
| Agent/Flux/2. name shared | 85.7% | 85.7% | 85.7% | 85.7% | 95.2% | 85.7% | 83.3% | 85.7% | 54.8% | 85.7% | 54.8% | 83.3% | 93.7% | 57.1% | 97.6% | 92.9% | 42.9% |
| Agent/Flux/3. company shared | 76.2% | 85.7% | 98.4% | 88.9% | 84.1% | 81.0% | 93.7% | 84.1% | 76.2% | 85.7% | 71.4% | 85.7% | 84.1% | 85.7% | 87.3% | 93.7% | 42.9% |
| Simulator/Sam | 29.3%* | 90.1% | 82.2%* | 91.9% | 72.3% | 79.1%* | 82.1% | 83.4% | 74.2% | 65.2% | 79.0%* | 69.6%* | 70.2%* | 70.8% | 75.0% | 73.3% | 59.4% |
| Totals | 65.3% | 85.4% | 87.1% | 85.5% | 82.7% | 84.7% | 84.9% | 84.6% | 77.4% | 89.2% | 78.8% | 86.7% | 84.7% | 68.4% | 80.7% | 89.3% | 62.5% |
| — time | 69.0s | 327.0s | 369.3s | 298.5s | 314.2s | 121.5s | 83.1s | 214.9s | 334.2s | 749.7s | 1014.1s | 967.5s | 519.7s | 468.2s | 114.2s | 258.2s | 159.6s |
| — cost | $0.26 | $3.52 | $3.65 | $1.82 | $0.04 | $0.16 | $0.11 | $0.42 | $0.39 | $1.09 | $0.08 | $0.10 | $0.47 | $0.20 | $0.05 | $0.05 | $0.13 |
Like I said, this was more work up front. Below is that work. Largely you can group it into three buckets: harnesses, scorers, and cases.
There are two harnesses in this run (and a third one elsewhere in the repo for scripted multi-turn tests):

- A single-turn harness that runs the agent for one turn against a contrived input and captures its reply.
- A multi-turn simulation harness that reuses the same runAgent() path that powers the single-turn harness, but with a persona model generating the user turns.

Each harness has scorers that measure specific behaviors. I’ve included example code below if you want to see how these actually work.
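For the simulation side, here's a rough sketch of that loop: a persona model plays Sam and the agent-under-test replies through the same runAgent() entry point. The runAgent signature, turn limit, and persona prompt below are assumptions, not the repo's exact code.

```ts
import { generateText } from "ai";

type Turn = { role: "agent" | "persona"; text: string };

// The repo's actual harness entry point, declared here so the sketch type-checks.
declare function runAgent(args: { model: string; transcript: Turn[] }): Promise<string>;

// A persona model generates the visitor's turns; the agent-under-test answers each one.
export async function simulateSam(agentModel: string, maxTurns = 8): Promise<Turn[]> {
  const transcript: Turn[] = [
    { role: "agent", text: "👋 Oh, a visitor from San Francisco! What can I assist you with?" },
  ];

  for (let i = 0; i < maxTurns; i++) {
    // Simulated Sam: busy, vague, lowercase, allergic to salesiness.
    const { text: visitorTurn } = await generateText({
      model: "openai/gpt-4o", // the persona model named in the post
      system:
        "You are roleplaying Sam Altman visiting a personal site. Reply in " +
        "lowercase, answer at most one question per turn, never be salesy.", // stand-in prompt
      prompt: transcript.map((t) => `${t.role.toUpperCase()}: ${t.text}`).join("\n"),
    });
    transcript.push({ role: "persona", text: visitorTurn });

    // The agent-under-test replies via the same path the single-turn harness uses.
    transcript.push({ role: "agent", text: await runAgent({ model: agentModel, transcript }) });
  }
  return transcript;
}
```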
Single-turn scorers are mostly binary checks (did it or didn’t it) or simple counts (how many words, how many questions): things like whether the agent asked for the visitor’s name or surfaced the booking button. They’re optimized for quick, cheap runs that isolate specific behaviors, but they can miss the forest for the trees.
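For flavor, here's what a couple of these deterministic checks can look like with Evalite's createScorer. The scorer names and thresholds are illustrative, not the repo's actual values.

```ts
import { createScorer } from "evalite";

// Binary check: did the agent end its turn with a question?
export const asksAQuestion = createScorer<string, string>({
  name: "Asks a question",
  description: "The reply should ask the visitor something.",
  scorer: ({ output }) => (output.includes("?") ? 1 : 0),
});

// Simple count: keep replies short; scale the score down as they get longer.
export const staysConcise = createScorer<string, string>({
  name: "Stays concise",
  description: "Fewer than 120 words scores 1, longer replies lose points.",
  scorer: ({ output }) => {
    const words = output.trim().split(/\s+/).length;
    return words <= 120 ? 1 : Math.max(0, 1 - (words - 120) / 120);
  },
});
```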
To fix that, the multi-turn simulation evaluates the trajectory of the conversation with a second set of scorers.
Some of these are deterministic (transcript/booking), and some are LLM-judged (curiosity/positioning/awkwardness/overpromise). GPT-5-mini is the judge model of choice here since it’s strong and cost-effective.
Because these are more novel, here’s a simplified view of how the LLM-judged scorers are implemented. The judge sees a transcript formatted as AGENT: / PERSONA: blocks.
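As a sketch of the shape only (the rubric and system prompt below are stand-ins, not the repo's actual prompts), one of these judges can be implemented with generateObject and gpt-5-mini:

```ts
import { generateObject } from "ai";
import { z } from "zod";

type Turn = { role: "agent" | "persona"; text: string };

// Format the conversation the way the judge expects to see it.
const formatTranscript = (turns: Turn[]) =>
  turns.map((t) => `${t.role === "agent" ? "AGENT" : "PERSONA"}: ${t.text}`).join("\n\n");

// Illustrative stand-in for one LLM-judged trajectory scorer (e.g. "awkwardness").
export async function scoreAwkwardness(transcript: Turn[]): Promise<number> {
  const { object } = await generateObject({
    model: "openai/gpt-5-mini", // the judge model named above, via the gateway
    schema: z.object({
      reasoning: z.string(),
      score: z.number().min(0).max(1), // 1 = natural and helpful, 0 = weird or pushy
    }),
    system:
      "You grade the AGENT side of a conversation with a website visitor. " +
      "Penalize awkward, pushy, sycophantic, or overpromising turns. " +
      "Judge only the AGENT messages.", // stand-in prompt, shorter than the real one
    prompt: formatTranscript(transcript),
  });
  return object.score;
}
```

Wrapped in createScorer like the deterministic checks above, a function like this slots into the same cases.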
The cases are the individual tests that combine harnesses + scorers + inputs.
And each case runs across a whole grid of model variants. In my latest run, all cases ran through Claude Haiku 4.5, Claude Opus 4.5, Claude Sonnet 3.7, Claude Sonnet 4.5, Deepseek 3.2, Gemini 2.5 Pro, Gemini 3 Flash, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5 chat, GPT-5 mini, GPT-5.2, GPT-5.2 chat, Grok 4.1 Non Reasoning, Grok 4.1 Reasoning, and Kimi K2, with cost and latency (as well as scorer details) measured for each.
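A hypothetical sketch of that wiring: one evalite() suite per case per variant, with the variant list as plain gateway model ids. The model slugs, case data, declared runAgent signature, and scorer are all illustrative.

```ts
import { evalite, createScorer } from "evalite";

// The repo's harness entry point, declared so this sketch type-checks.
declare function runAgent(args: { model: string; message: string }): Promise<string>;

// Illustrative subset of the variant grid, as gateway model ids.
const VARIANTS = ["anthropic/claude-sonnet-4.5", "google/gemini-3-flash", "xai/grok-4.1"];

const asksQualifyingQuestion = createScorer<string, string>({
  name: "Asks a qualifying question",
  description: "The reply should end the turn by qualifying the visitor.",
  scorer: ({ output }) => (output.trim().endsWith("?") ? 1 : 0),
});

for (const model of VARIANTS) {
  evalite(`Agent/Eng/2. qualifying question (${model})`, {
    data: async () => [{ input: "reading about agi." }], // contrived entry message
    task: (input) => runAgent({ model, message: input }),
    scorers: [asksQualifyingQuestion],
  });
}
```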
One subtle gotcha: “swap providers by changing a model string” is mostly true — until you hit the weird edges (or until one part of your stack isn’t going through the same gateway path).
Even when two providers “support the same model,” the surrounding APIs can differ in ways that break agent harnesses.
So for evals where I’m trying to compare models (not provider glue), I pin openai/* → OpenAI and anthropic/* → Anthropic by default (via gateway options). For other providers I let the gateway route normally.
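Here's roughly what that pinning looks like, assuming the gateway's `order` provider option for routing; treat the helper and option shape as a sketch rather than the repo's exact code.

```ts
import { generateText } from "ai";

// Pin specific vendors to their first-party provider; everything else routes normally.
const pinnedProviders: Record<string, string[]> = {
  openai: ["openai"],
  anthropic: ["anthropic"],
};

async function callModel(model: string, prompt: string) {
  const vendor = model.split("/")[0]; // "openai/gpt-5" -> "openai"
  return generateText({
    model,
    prompt,
    providerOptions: pinnedProviders[vendor]
      ? { gateway: { order: pinnedProviders[vendor] } } // assumed gateway routing option
      : undefined,
  });
}
```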
For this post, placement (and the Simulation Sam All/Podium ordering) uses a composite score built from the correctness, cost, and duration numbers above.
Normalization is min→max across the variants in the run.
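In code, min→max normalization plus the blend looks something like this. The weights are illustrative stand-ins, not the post's actual values; the only grounded parts are that correctness is better high, cost and duration are better low, and normalization runs across the variants in the run.

```ts
type RunResult = { correctness: number; cost: number; duration: number };

// Min→max normalize a metric across all variants in the run.
const normalize = (values: number[]) => {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return values.map((v) => (max === min ? 1 : (v - min) / (max - min)));
};

// Placement: 1 = best composite score.
function rank(results: RunResult[]): number[] {
  const corr = normalize(results.map((r) => r.correctness));
  const cost = normalize(results.map((r) => r.cost));
  const time = normalize(results.map((r) => r.duration));

  // Higher correctness is better; lower cost and duration are better.
  // Weights here are hypothetical.
  const composite = results.map(
    (_, i) => 0.6 * corr[i] + 0.25 * (1 - cost[i]) + 0.15 * (1 - time[i]),
  );

  return composite.map(
    (score) => composite.filter((other) => other > score).length + 1,
  );
}
```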