Upload an image, let Pi grind on HTML/CSS, and watch an agent try to close the visual diff.
Mar 18, 2026
I’ve been watching Andrej Karpathy’s autoresearch with a lot of interest. The idea is simple and potent: don’t ask the model for one answer. Give it a harness, a metric, and enough room to keep trying.
When Tobi started showing what that kind of loop could do for Liquid performance, it clicked for me. I didn’t want to optimize a model. I wanted a toy battleground I could actually see.
So I built Pi Visual Autoresearch: upload a target image, let Pi edit only HTML/CSS, score it with a visual diff, and let the agent keep iterating.
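To give a sense of what the scoring step looks like, here is a minimal Python sketch of a pixel-level visual diff. It is not the project's actual scorer (that is a separately tuned diff algorithm, credited below); frames here are plain lists of (r, g, b) tuples so the sketch stays dependency-free.

```python
# Minimal visual-diff sketch: mean absolute per-channel error between two
# same-sized RGB frames, normalized to [0, 1]. 0.0 means identical; 1.0
# means maximally different. NOT the tuned scorer the real harness uses.
def visual_diff_score(target, candidate):
    if len(target) != len(candidate):
        raise ValueError("frames must have the same pixel count")
    total = sum(
        abs(t - c)
        for target_px, candidate_px in zip(target, candidate)
        for t, c in zip(target_px, candidate_px)
    )
    # Normalize by pixel count, 3 channels, and max per-channel error (255).
    return total / (len(target) * 3 * 255)
```

In practice you would render candidate.html to an image first (e.g. with a headless browser) and diff that against the uploaded target; the score is what gives the agent something to push against.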
I recorded a quick video on how this works. Check it out:

"OK, over the weekend I got 'visual autoresearch' working. Upload a target image, and a model of your choice will make HTML/CSS, best it can.
- Inspired by @karpathy's autoresearch
- Uses autoresearch-pi from @tobi & @davebcn87
- Uses @wesbos's diff algo, which was tuned to human…"
What I like about autoresearch is that it moves the intelligence one layer up.
The magic is not "the model wrote a clever div." The magic is: the model can try, measure, keep, discard, and try again.
That is a different category of leverage.
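Stripped to its skeleton, that try/measure/keep/discard loop fits in a few lines. Everything here is hypothetical scaffolding: propose_edit stands in for the model rewriting the HTML/CSS, and score for the visual diff (lower is better); the real harness obviously does much more.

```python
# Sketch of a greedy keep-the-best iteration loop, the shape of the
# "try, measure, keep, discard, try again" idea described above.
def autoresearch_loop(initial, propose_edit, score, steps=50):
    best, best_score = initial, score(initial)
    for _ in range(steps):
        candidate = propose_edit(best)   # try: model proposes an edit
        s = score(candidate)             # measure: run the visual diff
        if s < best_score:               # keep only strict improvements
            best, best_score = candidate, s
        # otherwise discard and try again from the current best
    return best, best_score
```

The point is that the intelligence lives in the proposer, but the leverage lives in the loop: a mediocre proposal strategy plus a reliable score still ratchets forward.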
I keep coming back to the same thing I wrote in You. Are. Not. Using. Enough. Compute.: the bottleneck is increasingly the harness, not the raw model. 2026 feels like the year we start putting a suit and tie on the RL loop. Better evals. Better guardrails. Better iteration surfaces. More work on the loop, less work inside the loop.
This repo is my version of that idea for HTML/CSS: the agent can touch only candidate.html and candidate.css, and the visual diff keeps score. Simple, but weirdly satisfying.
First: these agents are shockingly good on attempt one.
They can see a lot: gradients, proportions, spacing, shape language, little details. And when they go to write code, they can reproduce a lot of that detail.
But the translation layer between "I see it" and "I can move my code toward it by 2 pixels" is still shaky.
That’s the fascinating part.
They are much better at broad reconstruction than tiny controlled improvement.
A human doing this would nudge, check, nudge, check, and slowly ratchet upward. The agents feel much more like table-flippers. They try a strategy, miss, get annoyed, and then often abandon the whole lane instead of carefully exploiting the delta between two candidates.
They know they’re close. They just don’t have great instincts yet for inching closer.
That feels important. The vision is strong. The code generation is strong. The bridge between the two is still messy.
This was also my first real time using Pi, and I ended up liking it more than I expected.
The plugin system is one of the stars. Seeing scorer images come back into the terminal loop is a bit absurd in the best way. It makes the whole thing feel much more alive.
The other funny part is that I mostly used Codex to build the Pi battleground.
So whenever Pi got confused inside the arena, I’d have Codex read Pi’s session history, understand what went wrong, and patch the harness. Which meant I was basically prompting one coding assistant to improve the working conditions of another coding assistant.
That feels extremely 2026.
I don’t think the biggest opportunities this year are inside the model loop. I think they’re around it.
Give a model a constrained workspace, a clean feedback signal, a way to keep score, and a reason to try again, and suddenly a bunch of "that was a neat demo" ideas start turning into real systems.
This project is mostly just a fun toy. But it’s also a good reminder: the multiplier is increasingly in the harness.
If you want to play with it, the repo is here, the launch thread is here, and the battleground lineage runs through Karpathy’s autoresearch plus pi-autoresearch.