User’s Attention is All You Need
GPT Wrappers Are Still a Bad Idea: The Bitter Lesson in AI
Feb 10, 2025
It might turn out that the GPT wrapper has more of a moat than GPT.
When Balaji tweeted that, I couldn't shake the feeling that it was both true in the short term and completely wrong in the long term. It made me think deeply about where we're really headed.
The tweet stems from a core idea: intelligence is about to be a commodity good. When DeepSeek's R1 was released, it spooked markets because it was a SOTA model that was open-weights and trained on a fraction of the compute of the big players. Stocks collapsed, and panic spread through the major American research labs.
Maybe great intelligence is "easy"? Once you know the recipe, you can bake the cake, and the hyperscalers seem to have the recipe.
Maybe, Balaji posits, the people who wrap AI have a more lasting business than the people who build it?
Over the past few months, I’ve pushed AI to its limits—experimenting with every coding assistant, every model, every image generator, building test apps, running AI-driven GTM experiments, and playing with hundreds of new AI apps.
What’s clear to me is this: most "AI apps" today are built on assumptions that won’t last.
Right now, we see AI being tacked onto existing products like an afterthought. SaaS platforms bolt on sparkly AI buttons, AI copilots pop up to assist within the software we already use, and companies build thin GPT wrappers around LLMs, assuming (hoping) they're ready for the next generation. But the deeper I go, the more I see people viewing AI as an evolution of existing products.
But it's not.
AI is a revolution for software, and we haven’t fully grasped what that means yet.
From AI-Sprinkled SaaS to Something Else Entirely
AI applications today evolve through three clear phases: AI sparkles, GPT-wrappers, and agents.
1. AI sparkles
This is where most companies are sitting right now—it’s the extra button in Gmail/Slack/Salesforce, prefixed with ✨, to summarize the thread or to analyze the report. Neat, maybe even useful, but far from the transformative value that was promised.
2. GPT-wrappers
Wrappers are the next stage—it's the copilots that sit beside the app and can do all the sparkle-buttoned things in one place. Great! Other early variants included blog-post-writing apps (like gen 1 of Copy.ai) or the "chat with your PDF" sites.
And while these apps are clearly better than the buttons above, they only give the illusion of an autonomous workflow. In reality they just ask an LLM on your behalf, perhaps more conveniently by saving you a copy/paste step. They don't do anything new, and we're left a little disappointed.
3. Agents
This is where we are now: systems designed with an AI-first mindset at their core. Consider how Jace.ai approaches email versus how Gmail does. The agent is designed to follow the workflows of an awesome EA and will just do 80% of the work.
While these AI-driven entities perform tasks autonomously, they're still following human-defined workflows and often explicitly require a human in the loop (pitched as a feature, but really a flaw).
And yet again, the market is left wanting. The agent shows so much promise, appears brilliant, then gets stuck and becomes annoying.
The start of something else entirely?
Consider OpenAI's Operator, which doesn't encode a workflow at all but attempts to complete a data-entry goal online (using 3 tools: a screenshot, a keyboard, and a mouse). Or look at OpenAI's DeepResearch, which doesn't encode a workflow either, yet tries to complete multi-step research tasks for you (also using 3 tools: search, web browsing, and python).
These might look like just another agent — but they're not! These agents are built with end-to-end reinforcement learning (RL)!
This fact was easily missed in the press, and completely missed by the OSS knockoffs trying to replicate it with basic agentic workflows.
People saw a clean conversational interface, a more autonomous workflow, and thought that was the innovation.
It wasn’t.
The real innovation was removing human-encoded workflows entirely and letting the AI find the best way to complete tasks through RL. The open-source world is about to relearn the Bitter Lesson—hard. They built around what they thought mattered (task automation, multi-step workflows, prompt chaining), but they missed why these models actually work.
It’s tempting to assume that the future is just fewer buttons.
But the deeper shift is this: AI isn’t executing workflows anymore—it’s creating its own.
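To make that concrete, here's a minimal sketch in Python. Everything in it is hypothetical: call_llm, the tool stubs, and the policy are stand-ins for illustration, not OpenAI's actual implementation. The wrapper approach hard-codes the workflow and asks the model to fill in each step; the end-to-end approach hands the model tools and a goal, and leaves the sequence of actions to a policy trained with RL.

```python
# Hypothetical sketch -- stand-in names, not any real product's code.
from collections import namedtuple

def call_llm(prompt: str) -> str:
    """Stub for a chat-completion call; swap in a real client."""
    return "..."

# --- 1. The GPT-wrapper way: a human wrote the workflow. ---
def research_report_wrapper(topic: str) -> str:
    queries = call_llm(f"List 3 search queries for: {topic}")
    notes = call_llm(f"Summarize findings for these queries:\n{queries}")
    return call_llm(f"Write a report on {topic} from these notes:\n{notes}")

# --- 2. The end-to-end way: the model picks the next action itself. ---
# The loop below is fixed plumbing, but the *policy* choosing actions is
# trained with RL against a task reward, so the workflow is learned, not scripted.
Action = namedtuple("Action", ["tool", "argument"])

TOOLS = {
    "search": lambda q: f"search results for {q!r}",      # stub tool
    "browse": lambda url: f"page contents of {url}",       # stub tool
    "python": lambda code: "stdout of running the code",   # stub tool
}

def agent_episode(goal: str, policy, max_steps: int = 50) -> str:
    observation = goal
    for _ in range(max_steps):
        action = policy(observation)        # learned, not hand-coded
        if action.tool == "finish":
            return action.argument          # the final answer
        observation = TOOLS[action.tool](action.argument)
    return "ran out of steps"
```

The OSS knockoffs rebuilt the first half of this sketch and called it an agent; the innovation was in how the policy in the second half gets trained.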
Let me back up a bit.
The Bitter Lesson and Why Everything That Flows from GPT Wrappers Is a Losing Bet
When DeepSeek released R1—an open-weights model that rivaled OpenAI's SOTA models—the world was reminded, again, of the Bitter Lesson.
Rich Sutton’s Bitter Lesson lays out an unavoidable truth: AI does not progress by encoding human expertise—it progresses by leveraging exponentially increasing computation. The long arc of AI research shows that whenever we try to handcraft domain-specific knowledge, those methods are inevitably outpaced by general-purpose, compute-driven models.
- Chess & Go: Decades of handcrafted heuristics beaten overnight by brute-force search and self-play learning.
- Speech Recognition: Phoneme models and linguistics replaced by deep learning crunching raw audio at scale.
- Computer Vision: Edge detection and SIFT handcrafted features discarded in favor of CNNs brute-forcing patterns from pixels.
- Language Models: Grammar trees and handcrafted ontologies abandoned in favor of transformers scaling on raw text, with meaning emerging on its own.
Most AI applications today are still trapped in workflows designed around human intuition. GPT wrappers are a prime example—a last attempt to encode structured human reasoning into software. But what happens when AI models become capable of generating, optimizing, and discarding these wrappers dynamically? The moat collapses entirely.
Karpathy shares that AI can learn in two ways: imitation (pretraining, supervised fine-tuning) and trial-and-error (RL/reinforcement learning). And the most jaw-dropping AI breakthroughs—AlphaGo's unexpected strategies, the paddle in Breakout discovering trick shots, o1/DeepSeek's self-directed reasoning—all came from trial and error.
I don't have too too much to add on top of this earlier post on V3 and I think it applies to R1 too (which is the more recent, thinking equivalent). I will say that Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed…
DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M). For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being…
These skills weren't designed. They emerged.
No one programmed AlphaGo to invent new Go strategies—it discovered them through trial and error. The paddle in Breakout wasn’t told how to do a trick-shot—it figured it out by playing the game millions of times. GPT-4 didn’t learn humor from labeled examples—it just absorbed enough data through raw compute to have it emerge.
The Bitter Lesson isn’t just that compute wins—it’s that at scale, ML develops an understanding that far surpasses what humans could hope to encode—abilities we didn’t even know to look for.
It's Move 37.
Enabling Move 37: You're building Stirrups
AlphaGo’s Move 37 in 2016 wasn’t just a surprising move—it was a glimpse into the kind of intelligence that emerges when AI isn’t bound by human intuition. It wasn’t a move of human genius; it was machine creativity. Move 37 wasn’t designed—it was invented by a machine.
This is the real lesson: AI breakthroughs don’t come from encoding human intuition. They come from creating the right harness—the right reward function, the right training environment—and allowing AI to explore, iterate, and find solutions beyond human conception.
Most software today still isn’t built for this. But it could be.
Cavalry without Stirrups
Before the stirrup, horses were used in warfare, but they were limited to support roles—as mobile archery platforms or transport. Cavalry existed, but riders lacked the stability to fully harness the raw power and momentum of a charging horse.
The stirrup changed everything. It gave riders the stability to remain in the saddle while striking with full force (i.e., a lance could now carry the horse's force instead of just the rider's). It allowed for shock cavalry—the ability to charge, absorb impact, and dominate the battlefield in ways no pre-stirrup force ever could.
This wasn’t a marginal improvement in military effectiveness—it was a fundamental shift in what was possible. It led to the rise of professional heavy cavalry, knights, and feudal military dominance, reshaping all of society thereafter. Countries that harnessed it won.
Right now, AI is still in its pre-stirrup phase.

White's Medieval Technology and Social Change is the most fabulous book. goo.gl/cPujcr
It's 2025. I'm here 10 years later to say thanks. I bought this book based on this tweet... I still feel like a crazy person talking about the impact of stirrups
What Makes a True AI Harness?
The real AI revolution won’t come from GPT-wrapped software. It will come from AI-harnesses (there's a minimal sketch after this list) that:
- Have a clear reward function (not “assist the user,” but “discover optimal strategies humans wouldn’t find”).
- Enable large-scale iteration (not just responding to prompts, but playing entire workflows millions of times).
- Aren’t constrained by human intuition (the best AI systems won’t be copilots—they’ll be autonomous agents that make copilots obsolete and become the pilots).
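Here's a minimal sketch of what such a harness could look like, shaped like a standard RL environment. The email-triage task, the action names, and the reward are placeholders I'm assuming for illustration, not any real product's setup. Notice what the harness does and doesn't define: tools, episode boundaries, and an outcome-based reward, but no workflow.

```python
# Hypothetical harness sketch: it defines tools, episode boundaries, and a
# reward -- it deliberately does NOT define a workflow.

class EmailTriageHarness:
    """Toy episode: the agent may read, draft, send, archive, or finish.
    Reward arrives only at the end, from an outcome we can score."""

    def reset(self, inbox):
        self.inbox = list(inbox)
        self.history = []
        return {"inbox": self.inbox}                     # initial observation

    def step(self, action):
        self.history.append(action)
        done = action["name"] == "finish"
        observation = {"inbox": self.inbox, "history": self.history}
        reward = self._outcome_reward() if done else 0.0
        return observation, reward, done

    def _outcome_reward(self):
        # The crucial design decision: score the *outcome* (did the right
        # emails get answered, judged against ground truth?), not whether
        # the agent followed the steps a human EA would have taken.
        return 0.0                                       # placeholder scorer


# Training is then classic RL: run many episodes, reinforce what scores
# well, and let the workflow itself be discovered rather than designed.
def train(policy, harness, inboxes, optimizer_step):
    for inbox in inboxes:
        observation, done = harness.reset(inbox), False
        trajectory = []
        while not done:
            action = policy(observation)
            observation, reward, done = harness.step(action)
            trajectory.append((action, reward))
        optimizer_step(policy, trajectory)               # e.g. a policy-gradient update
```

The point isn't the specific task; it's that the reward comes from a checkable outcome, so the strategy the policy learns is free to drift away from how a human would have done it.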
If we use another example from Medieval Technology and Social Change... we can point to the horse collar: a major innovation that simply stopped choking the horse, letting it pull 3x the weight and unlocking the fertile soils of the north. And if history has taught us anything in AI research, it's that we are the ones oft choking the machines.
History tells us that infrastructure precedes transformation. The stirrup wasn’t just a convenience—it was the foundation that enabled the full power of cavalry warfare, creating the knightly class, military feudalism, and a new era of dominance on the battlefield.
AI needs the same.
The real question isn’t what AI app should I build? It’s what part of the economy is waiting for its warhorse?
That’s where the biggest AI breakthroughs will happen. The companies that don’t just build AI-powered tools, but build the harness that lets AI run free—those are the ones that, like AlphaGo, will stumble into their own Move 37.
Accounting for Cursors
Look at VS Code forks like Cursor. According to the marketing copy, it's the future: an AI-first coding environment, deeply integrated into the development workflow. The financial metrics back that up: $1M -> $100M ARR in ~12 months (the fastest ever)!
But it has a ways to go to become the future:
- Cursor doesn’t let AI play the game itself—it just makes humans more efficient within an existing paradigm.
- It isn’t an RL harness—it’s a GPT wrapper that helps humans write code; it's not how AI learns to make software.
- It doesn’t let AI discover its own Move 37—it’s still a tool for human-guided execution, not an environment where AI can play millions of iterations toward better architectures and debugging strategies.
Meaning Cursor and its peers are, right now, horses for archers. They let AI assist, but they don't unlock its power. At least not yet. (I know Cursor has team members internally who know this, and I know they think of themselves as a research lab focused on automating coding.)
But currently, "human-in-the-loop" can at best become RLHF, not RL, and that path risks a lot of wasted effort and more bitter lessons.
RLHF is just barely RL. Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely…
We can contrast this with something like GitHub's Project Padawan or the OSS project All Hands, both of which have been working to build an AI-coding harness—one where AI can learn to execute workflows beyond human intuition. I haven't seen any RL used with these yet, but they feel like harnesses where RL can be introduced. And I'm sure by the end of 2025 we will start to see some RL in these projects.
But
Cursor has something else that matters: User attention.
Is User Attention All You Need?
Now that we know OpenAI isn’t just making better models or exemplary GPT-wrappers, we can see what they are doing: They're deploying end-to-end RL agents.
And we know that RL, with a well-defined reward function (win a game, maximize engagement, optimize workflows), can brute-force its way to superhuman skill.
Then, to me, we can safely assume the big players will dominate any of the easy-to-measure agentic tasks. Tasks that end in something checkable: factual correctness, valid logic, verifiable strategy — even softer things like persuasion and negotiation with clear winners. At least it feels safer to assume the labs will figure out the RL reward function for these tasks before companies that don't employ ML engineers do.
So, if you’re building an LLM-wrapped workflow “agent” around any of these obvious tasks, you’re about to be blindsided by the Bitter Lesson. You can’t build a moat around a commodity workflow when RL is staring down the barrel at you.
Unless, of course, you have user attention.
300M people a week go to chatgpt.com. That's the verb. Not claude, gemini, perplexity, or deepseek.
Millions of engineers are using Cursor as their default editor, with every debounced keystroke triggering an LLM call. There aren't millions of engineers using All Hands.
That seems to be the only thing research labs can't copy.
Advancements with RL, on the other hand, are just a matter of time. Maybe Apple's approach isn't so embarrassing after all.
So What’s Left to Do?
Look for the next major shift—the “stirrup” moment that lets AI do entire tasks from start to finish. Here’s a three-part approach:
1. Own the User’s Attention
You can’t brute force loyalty—be the entry point where users begin (and finish) their tasks.
- Task Initiation: Make your product the first place users go. If they trust you to start the job, AI can handle the rest.
- Engagement Loops: Every interaction refines the AI’s understanding of user intent.
- Habit Creation: Integrate AI so deeply that switching away feels cumbersome.
2. Build an RL Harness (Not Just a GPT Wrapper)
Prompt engineering can work, but true breakthroughs will come from Reinforcement Learning (RL). (A minimal reward-signal sketch follows this list.)
- Reward Signals: Find objectives that are verifiable, so you can point ML at them and let it self-improve without drift.
- Iterative Environment: Let AI play the game itself with a simple set of tools (perhaps the same ones a human would use) and let it attempt to win.
- Scalable Autonomy: Focus RL on end-to-end tasks, and let the AI figure out the workflow.
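Coding is a domain where a verifiable reward is already sitting there: run the project's tests. Here's a minimal sketch of such a reward signal, with no human in the loop. The git/pytest invocation and the idea of an agent-produced patch are my illustrative assumptions, not anyone's production harness.

```python
# Hypothetical reward signal for a coding harness: the fraction of the
# project's tests that pass after the agent's patch is applied.
import re
import shutil
import subprocess
import tempfile


def coding_reward(repo_dir: str, patch: str) -> float:
    """Apply `patch` to a scratch copy of the repo, run pytest, and return
    the pass rate in [0, 1]. Assumes git and pytest are on PATH."""
    workdir = tempfile.mkdtemp()
    try:
        shutil.copytree(repo_dir, workdir, dirs_exist_ok=True)
        applied = subprocess.run(["git", "apply", "-"], input=patch, text=True,
                                 cwd=workdir, capture_output=True)
        if applied.returncode != 0:
            return 0.0                                  # patch didn't even apply
        result = subprocess.run(["pytest", "--tb=no", "-q"], cwd=workdir,
                                capture_output=True, text=True, timeout=600)
        # pytest's summary line looks like "3 passed, 1 failed in 0.42s"
        passed = sum(int(n) for n in re.findall(r"(\d+) passed", result.stdout))
        failed = sum(int(n) for n in re.findall(r"(\d+) (?:failed|error)", result.stdout))
        total = passed + failed
        return passed / total if total else 0.0
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

Anything you can score this mechanically (tests passing, a form filled correctly, a booking confirmed) is a candidate objective to point RL at.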
3. Look for Warhorse Level of Societal Change
Identify areas stuck behind human constraints and let AI’s “force” multiply impact. Where we're currently "scaling" with humans, not electricity.
- Automation Potential: Where do endless, repeatable tasks bog humans down?
- High-Impact Targets: Find processes where small improvements unlock massive value.
- Limited by Human Effort: If a job is too big, complex, or slow for people, an AI harness can charge through.
The Final Thought: From SaaS to “Service as Software”
As intelligence itself becomes a commodity, the real edge lies in owning the entire service flow—no more shipping partial tools and hoping users piece them together. The shift is from Software-as-a-Service (helping humans do work) to Service-as-Software (letting AI do the work end-to-end). If you:
- Control the user’s starting point,
- Provide an RL-driven environment,
- And apply AI force to high-impact tasks,
then you’re not just building a product—you’re redefining a service category.