Upload an image, let Pi grind on HTML/CSS, and watch an agent try to close the visual diff.
Mar 18, 2026
I’ve been watching Andrej Karpathy’s autoresearch with a lot of interest. The idea is simple and potent: don’t ask the model for one answer. Give it a harness, a metric, and enough room to keep trying.
When Tobi started showing what that kind of loop could do for Liquid performance, it clicked for me. I didn’t want to optimize a model. I wanted a toy battleground I could actually see.
So I built Pi Visual Autoresearch: upload a target image, let Pi edit only HTML/CSS, score it with a visual diff, and let the agent keep iterating.
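To give a sense of what the scoring step looks like, here is a minimal Python sketch of a pixel-level visual diff. It is not the project's actual scorer (that is a separately tuned diff algorithm, credited below); frames here are plain lists of (r, g, b) tuples so the sketch stays dependency-free.

```python
# Minimal visual-diff sketch: mean absolute per-channel error between two
# same-sized RGB frames, normalized to [0, 1]. 0.0 means identical; 1.0
# means maximally different. NOT the tuned scorer the real harness uses.
def visual_diff_score(target, candidate):
    if len(target) != len(candidate):
        raise ValueError("frames must have the same pixel count")
    total = sum(
        abs(t - c)
        for target_px, candidate_px in zip(target, candidate)
        for t, c in zip(target_px, candidate_px)
    )
    # Normalize by pixel count, 3 channels, and max per-channel error (255).
    return total / (len(target) * 3 * 255)
```

In practice you would render candidate.html to an image first (e.g. with a headless browser) and diff that against the uploaded target; the score is what gives the agent something to push against.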
I recorded a quick video on how this works. Check it out:

"OK, over the weekend I got 'visual autoresearch' working. Upload a target image, and a model of your choice will make HTML/CSS, best it can.
- Inspired by @karpathy's autoresearch
- Uses autoresearch-pi from @tobi & @davebcn87
- Uses @wesbos's diff algo, which was tuned to human…"
What I like about autoresearch is that it moves the intelligence one layer up.
The magic is not "the model wrote a clever div." The magic is: the model can try, measure, keep, discard, and try again.
That is a different category of leverage.
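Stripped to its skeleton, that try/measure/keep/discard loop fits in a few lines. Everything here is hypothetical scaffolding: propose_edit stands in for the model rewriting the HTML/CSS, and score for the visual diff (lower is better); the real harness obviously does much more.

```python
# Sketch of a greedy keep-the-best iteration loop, the shape of the
# "try, measure, keep, discard, try again" idea described above.
def autoresearch_loop(initial, propose_edit, score, steps=50):
    best, best_score = initial, score(initial)
    for _ in range(steps):
        candidate = propose_edit(best)   # try: model proposes an edit
        s = score(candidate)             # measure: run the visual diff
        if s < best_score:               # keep only strict improvements
            best, best_score = candidate, s
        # otherwise discard and try again from the current best
    return best, best_score
```

The point is that the intelligence lives in the proposer, but the leverage lives in the loop: a mediocre proposal strategy plus a reliable score still ratchets forward.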
I keep coming back to the same thing I wrote in You. Are. Not. Using. Enough. Compute.: the bottleneck is increasingly the harness, not the raw model. 2026 feels like the year we start putting a suit and tie on the RL loop. Better evals. Better guardrails. Better iteration surfaces. More work on the loop, less work inside the loop.
This repo is my version of that idea for HTML/CSS: the agent can touch only candidate.html and candidate.css, and the visual diff keeps score. Simple, but weirdly satisfying.
First: these agents are shockingly good on attempt one.
They can see a lot: gradients, proportions, spacing, shape language, little details. And when they go to write code, they can reproduce a lot of that detail.
But the translation layer between "I see it" and "I can move my code toward it by 2 pixels" is still shaky.
That’s the fascinating part.
They are much better at broad reconstruction than tiny controlled improvement.
A human doing this would nudge, check, nudge, check, and slowly ratchet upward. The agents feel much more like table-flippers. They try a strategy, miss, get annoyed, and then often abandon the whole lane instead of carefully exploiting the delta between two candidates.
They know they’re close. They just don’t have great instincts yet for inching closer.
That feels important. The vision is strong. The code generation is strong. The bridge between the two is still messy.
This was also my first real time using Pi, and I ended up liking it more than I expected.
The plugin system is one of the stars. Seeing scorer images come back into the terminal loop is a bit absurd in the best way. It makes the whole thing feel much more alive.
The other funny part is that I mostly used Codex to build the Pi battleground.
So whenever Pi got confused inside the arena, I’d have Codex read Pi’s session history, understand what went wrong, and patch the harness. Which meant I was basically prompting one coding assistant to improve the working conditions of another coding assistant.
That feels extremely 2026.
I don’t think the biggest opportunities this year are inside the model loop. I think they’re around it.
Give a model a constrained workspace, a clean feedback signal, a way to keep score, and a reason to try again, and suddenly a bunch of "that was a neat demo" ideas start turning into real systems.
This project is mostly just a fun toy. But it’s also a good reminder: the multiplier is increasingly in the harness.
If you want to play with it, the repo is here, the launch thread is here, and the battleground lineage runs through Karpathy’s autoresearch plus pi-autoresearch.