The Disposable Intelligence Era
How the shift towards disposable intelligence is changing everything
Aug 19, 2025
We’re crossing a threshold in software where intelligence itself (the code, plans, and drafts our systems produce) has become cheap to generate, fast to throw away, and shockingly effective when you pick the best of several tries. That changes how we think about the last 70 years of handcrafted, peer-reviewed, bespoke intelligence.
If you’ve ever run 4 code tasks in parallel, selected the one(s) that pass type checks/lint/build steps, and tested it in a preview build… only to then iterate 4x on that winning task, throwing out the rest… you’ve felt this shift.
We have careers’ worth of habits that treated every line as precious. But we can now toss away 75% or more of what the machines create and ship faster, with better quality. We probably need to throw away some of those habits and ideals too.
From pottery to paper cups
Our old model prized artisanal code: slow, deliberate, expensive to revise. It needed to be; we used some of our most expensive, in-demand humans to create it, and revising it was a painstaking process. This approach emphasized quality and craftsmanship, but at the cost of speed and flexibility. Code is poetry, we said. Pottery.
The new model treats intelligence as a disposable commodity we can mass-produce, over-sample, and select from. To keep the metaphor alive: paper cups. But it's not just a metaphor; the research backs it up:
- Best-of-N beats single-shot. “Self-consistency” (sample multiple reasoning paths, keep the consensus) reliably boosts accuracy over one-take chain-of-thought. (See Grok 4 Heavy.)
- The same holds for almost every image/video model: generate three videos and have a second model select the best, and you get significantly better results.
- Search over thoughts helps. “Tree of Thoughts” formalizes exploring many candidate reasoning paths before deciding, rather than betting on a single trajectory.
- Test-time compute is now a first-class lever. OpenAI’s o1/o3 system cards explicitly discuss trading inference-time compute for better reasoning: more sampling, deeper deliberation, verification, and reflection steps. This feels more like a universal law of these systems than an OpenAI-specific trick.
- Multi-agent swarms outperform. xAI’s Grok 4 Heavy shipped with “swarms,” i.e., collaborating agents tackling subproblems: another way to generate many candidates and arbitrate. Claude Code spawns sub-agents. Codex runs best-of-N for code generation tasks.
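The self-consistency idea above is mechanically simple, which is part of why it generalizes so well. Here's a minimal sketch; the `sample` callable is a stand-in for any stochastic model call, and the canned answers are purely illustrative:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample: Callable[[str], str], prompt: str, n: int = 5) -> str:
    """Sample n independent answers and keep the consensus (self-consistency).

    `sample` stands in for any stochastic model call; only the
    selection logic matters here.
    """
    answers = [sample(prompt) for _ in range(n)]
    # Majority vote: the most common answer wins; ties break arbitrarily.
    return Counter(answers).most_common(1)[0][0]

# Toy sampler: a "model" that is right 3 times out of 5.
_canned = iter(["42", "41", "42", "42", "40"])
winner = self_consistency(lambda _: next(_canned), "what is 6*7?", n=5)
print(winner)  # "42" — the consensus beats any single draw
```

One take would be wrong 40% of the time; five takes plus a vote are wrong far less often, without retraining anything.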
Put simply: sampling, search, and selective verification move us from perfecting a single artifact to curating from many. Once you switch mindsets, throwing away most of the intelligence you generate stops feeling wasteful and starts feeling inevitable.
We're seeing this productize in models and apps: Inference-time scaling is compounding. Multiple trajectories, verifier passes, and reflection steps improve performance without retraining—akin to dialing up “thinking time” at run-time. Recent papers even study how to spend that compute optimally.
This is also why "cost per token" is now the wrong unit: what matters is cost per shipped artifact, counting everything you throw away along the way.
Why this works (and when it doesn’t)
Judging is easier than creating. Humans (and models) are better judges than authors; we're better at critiquing movies than making them. So the focus shifts from 1x synthesis to 4x synthesis and 1x selection.
It's even a best practice from the GPT-5 prompting guide:
That last line, "Remember that if your response is not hitting the top marks across all categories in the rubric, you need to start again," is effectively best-of-N run in serial. You still throw out the past work, but you wait the full time for each attempt. Running multiple candidates in parallel and selecting the best is just a more efficient way to do the same thing.
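The serial-vs-parallel point can be made concrete with a few lines. This is a sketch, not anyone's production setup: `generate` and the length-based judge are toy stand-ins, but the shape is the point — n candidates launched concurrently cost roughly one generation's latency, while "start again" costs up to n in sequence:

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n_parallel(generate, score, n=4):
    """Launch n candidate generations concurrently, keep the top scorer.

    Serial "start again if the rubric isn't met" pays up to n * latency;
    parallel best-of-N pays ~1x latency for the same selection pressure.
    `generate` and `score` are stand-ins for a model call and a judge.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, range(n)))
    return max(candidates, key=score)

# Toy stand-ins: canned "drafts" and a judge that prefers longer ones.
canned = ["ok", "better draft", "best draft of all", "meh"]
pick = best_of_n_parallel(lambda i: canned[i], len, n=4)
print(pick)  # "best draft of all"
```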
My disposable-intelligence loop (how I actually build)
- Over-generate on purpose. For meaningful changes I ask agents to produce four parallel versions by default. I'd rather run this in the cloud because I can't for the life of me isolate everything well enough locally. Think of this as running a "swarm" for every commit.
- Automate the first cut. Type checks, schema validation, ESLint, smoke tests, and preview deploys gate candidates. Most die here without human time.
- Cross-model review. Have a different model critique the surviving candidates. Different systems have different “holes in the Swiss cheese,” so their critiques diversify the failure modes. In dev that's Gemini Code Assist for me; these reviewers are stochastic too, so swarm them on every commit as well.
- Pick, merge, repeat. Merge the best branch and immediately spin the next round. I keep meticulous, living docs (personas, constraints, acceptance checks) in the repo so agents and humans share the same context.
- Expect to throw most of it away. The speed comes from not defending bad artifacts. Generate, filter, and move.
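One round of the loop above can be sketched in a few lines. Everything here is a stand-in — `generate` for the agent swarm, `checks` for the type/lint/build gates, `judge` for the cross-model reviewer — but the control flow is the loop itself:

```python
def run_round(generate, checks, judge, n=4):
    """One round of the disposable-intelligence loop (a sketch).

    generate(i) -> candidate; checks: cheap boolean gates (type check,
    lint, build); judge(candidate) -> score from a different model.
    All three are stand-ins for real tooling.
    """
    candidates = [generate(i) for i in range(n)]
    # Automate the first cut: most candidates should die here,
    # without consuming any human time.
    survivors = [c for c in candidates if all(check(c) for check in checks)]
    if not survivors:
        return None  # regenerate rather than defend bad artifacts
    # Cross-model review picks the winner; everything else is discarded.
    return max(survivors, key=judge)

# Toy round: candidates are ints 0..3, "lint" rejects odd ones,
# and the judge prefers larger values.
winner = run_round(lambda i: i, checks=[lambda c: c % 2 == 0], judge=lambda c: c, n=4)
print(winner)  # 2
```

Note the `None` branch: when every candidate fails the gates, the cheap move is another round of generation, not hand-repair of a bad artifact.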
This pattern started for code, but it’s quickly spilling outward.
Code is the beachhead
I keep coming back to this: we’re at the Apple I/II moment for Disposable Intelligence. It's rough yet obviously transformative, and code is simply where the economics flipped first.
As test-time strategies (best-of-N, reflection, verifier-guided search, multi-agent swarms) mature, you should expect the disposable intelligence pattern to colonize adjacent domains:
- Digital work: generate multiple interactive mocks, auto-wire basic data, and select—already common with front-end agents and specialized UI models.
- Knowledge Work: compose three versions of an explainer, run a rubric-based judge, ship the winner. Same for sales sequences, onboarding flows, even short product videos.
- Ops work: propose candidate runbooks or dashboards from logs, rank by coverage and clarity, and adopt the top choice—then iterate.
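The "rubric-based judge" in the knowledge-work bullet is worth making concrete, since it's what turns structured evaluation into something you can automate. A minimal sketch, assuming a hypothetical `grade(candidate, criterion)` judge call that returns a 0–1 score and a rubric that is just criterion weights:

```python
def rubric_score(grade, candidate, rubric):
    """Weighted rubric score for one candidate.

    grade(candidate, criterion) -> 0..1 would come from a judge model;
    rubric maps criterion -> weight. Both are illustrative stand-ins.
    """
    total = sum(rubric.values())
    return sum(w * grade(candidate, c) for c, w in rubric.items()) / total

rubric = {"accuracy": 3.0, "clarity": 2.0, "brevity": 1.0}

# Toy judge: pretends shorter drafts are clearer and more brief.
def toy_grade(draft, criterion):
    if criterion == "accuracy":
        return 1.0 if "correct" in draft else 0.5
    return min(1.0, 20 / len(draft))

drafts = ["a long correct explainer that rambles on and on", "correct, short"]
best = max(drafts, key=lambda d: rubric_score(toy_grade, d, rubric))
print(best)  # "correct, short"
```

Swap the toy judge for a model call and the same `max` over candidates is "compose three versions, run the judge, ship the winner."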
Each of these is the same loop: cheap generation → structured evaluation → ruthless selection. As the evaluators improve (mix of human taste, better rubrics, and verifier models), the flywheel accelerates.
Human in the loop is "cope." There should be no humans in these pipelines. It's far too much information, far too much selection, far too much trash and iteration. Humans to provide taste and context? Sometimes. Humans as the final judge? Sure.
Ultimately we'll be both the creator and the user. What the PC did for computing, this will do for “Personal Software,” i.e., software tailored for an audience of one.
Taste is the new craft
If intelligence is cheap and disposable, what remains scarce? Taste. The ability to specify constraints crisply, to know what “good” looks like, and to select the right idea fast.
I don't mean taste in the typical "looks" sense; AI will be better at design than us. I mean it in the sense of knowing what is worth pursuing and what is not. When you can unleash an infinite swarm at a problem, knowing what to aim at matters.
There's a quote I've always loved from Ira Glass, one I've often repeated to new founders. While it still holds true, I think the advice now is to work on your taste as the primary skill.
We’re leaving the pottery era behind. When intelligence is cheap to make and cheaper to toss, the work stops being authorship and becomes editing—aiming well, sampling wide, and selecting without sentiment. The edge isn’t “a bigger model,” it’s sharper taste—and taste isn’t mysticism, it’s a muscle. Train it: look at a lot of work, watch what people actually choose, seek feedback until you can predict resonance. Beneath the vibe are rules you can encode—spacing, rhythm, defaults—so your stack ships taste by default. Be terminally online with intention: inhale the avant-garde to calibrate what “good” feels like, and keep an ear out for what’s about to be possible so you can start before it’s mainstream.
Parting thoughts
What’s next looks like domain-specific agents with taste—builders and judges that know the tools, constraints, and style of a field. That’s the startup wedge: outcomes over buttons, under-UI the incumbents. Prototype cultures will outrun memo cultures here; they spin the loop faster and let selection teach.
If intelligence is disposable, direction is scarce. Practice taste, encode it in your pipeline, and let the factory learn from its own discard pile. Point the swarm well.