Seeing UV: Detecting Intelligence Beyond Ours

Once we solve Humanity’s Last Exam, we move to a new problem: how can we detect, evaluate, and improve upon intelligence when we don't have the intelligence to test it?

Apr 18, 2025

Imagine looking at a bird. To your human eyes, its feathers are brown—plain, ordinary brown. But to its mate, those feathers glow vividly in ultraviolet patterns. Dogs sniff with ~300 million olfactory receptors versus our ~6 million, resolving scents thousands of times finer than we can, as shown in this overview. Mantis shrimp wield up to 16 photoreceptor types, detecting polarized light we don't even have names for (study).

Here's the uncomfortable truth: as AI surpasses human intelligence, we risk becoming like humans staring blankly at feathers they can't appreciate—unable to notice, understand, or measure the leaps occurring right in front of us.


Invisible Progress

I think people felt this when GPT-4.5 came out. Despite two years of work by the smartest people in AI and 10x more compute than GPT-4, the reaction was "Meh."

Even OpenAI sheepishly described it as better in nuanced ways and at things like creative writing. In blind tests, people disagreed about whether it was better at all. Personally, I thought this mostly showed how well 4o had been tuned to human preferences.

But it was great in such subtle ways that you either had to pay close attention to the details to see them, or push the old models to their limits to notice that this one didn't break down. It's so expensive to run, and the payoff so small, that they're shutting it down after ~2 months of access. Though they did give us a nice video to say goodbye:

AI benchmarks that seemed impossible five years ago are becoming trivial today. Models like OpenAI’s o3 now match or surpass human performance on standard tests like MMLU and BIG‑Bench, according to the technical report. We've hit benchmark saturation—our tests no longer detect meaningful differences between increasingly smarter models, a trend flagged by Stanford’s piece on AI benchmarks hitting saturation.
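
To make "saturation" concrete, here's a back-of-the-envelope sketch (the two model scores and the 1,000-question eval size are hypothetical, not real results): near the ceiling, the gap between two models can be smaller than the benchmark's own statistical noise.

```python
import math

def benchmark_sigma(accuracy: float, n_questions: int) -> float:
    """Standard error of a benchmark score, treating each question
    as an independent pass/fail trial."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

# Hypothetical: two frontier models on a 1,000-question eval
n = 1_000
score_a, score_b = 0.886, 0.892
sigma = benchmark_sigma((score_a + score_b) / 2, n)

print(f"gap = {score_b - score_a:.3f}")          # 0.006
print(f"~2-sigma noise band = {2 * sigma:.3f}")  # ~0.020
# The 0.6-point gap sits well inside the noise band: the test
# literally cannot see the difference between these two models.
```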

Yet scaling laws clearly show AI capabilities rise smoothly with more computational power (paper). The gains haven't stopped; we've just lost the ability to perceive them.
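
For intuition, here's a toy version of that curve. The constants are illustrative, loosely in the neighborhood of what Kaplan et al. report, and not fitted to any real training run; the point is the shape, not the numbers. Each 10x of compute keeps buying improvement, but a smaller absolute slice of it each time.

```python
# Illustrative Kaplan-style power law: loss L(C) = (Cc / C) ** alpha.
# Cc and alpha are ballpark assumptions, not fitted values.
Cc, alpha = 3.1e8, 0.050  # compute scale (PF-days) and compute exponent

def loss(compute: float) -> float:
    return (Cc / compute) ** alpha

prev = None
for c in [1e3, 1e4, 1e5, 1e6]:  # four successive 10x jumps in compute
    l = loss(c)
    delta = "" if prev is None else f"  (improvement: {prev - l:.3f})"
    print(f"compute = {c:.0e} PF-days -> loss = {l:.3f}{delta}")
    prev = l
# Each 10x of compute buys a smaller absolute drop in loss: steady
# progress on a log axis, but ever harder to *feel* from the outside.
```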

When OpenAI’s o3‑mini subtly out‑performed GPT‑4o on the challenging multimodal MMMU evaluation, most users shrugged—"it feels about the same." MMMU’s own leaderboard shows the gap, but our “wow” meters stayed flat. The ultraviolet feathers were there; we just lacked the retinal cones—our evals—to register them.


The Writer's Ceiling

Think about the genius hacker in your favorite movie. The dialogue probably sounds complex, until you actually know something about computers (see the Instant Hacker trope). Or take the brilliant detective who solves a case with a single glance at the crime scene, yet whose explanation is so convoluted that the entire show relies on a forced confession to make sense of it.

Writers inevitably fail when scripting characters smarter than themselves, because they can't convincingly write beyond their own intelligence. Psychologists call this the “Illusion of Explanatory Depth”: we think we grasp complex ideas until forced to articulate them clearly (original paper).

If even skilled storytellers can't convincingly portray intelligence beyond their own, why would we trust our current tools—or our own judgment—to evaluate an AI that's genuinely smarter than we are?

| Blindspot | How It Blinds Us |
| --- | --- |
| Dunning‑Kruger Effect | Beginners over‑estimate AI; experts under‑estimate its leaps (study). |
| ELIZA Effect | We project meaning onto shallow AI behavior—and miss deeper intelligence when style shifts (paper). |
| Benchmark Saturation | Models outpace existing tests, forcing us to lean on subjective “vibes” (article). |

After the Last Human Benchmark

Given the above, I wanted to know: what happens when AI finally aces Humanity’s Last Exam? Here are some of the next tricks the research community is working on to keep up with the new models:

| Frontier strategy | What it does now (2024‑25) | Why it matters next |
| --- | --- | --- |
| Infinite Benchmarks | Anthropic released 150+ model‑written evals that regenerate on demand and surface odd behaviours like inverse‑scaling and hidden sycophancy (paper). | Tests grow as fast as models—no more fixed syllabus. |
| LLM‑as‑a‑Judge | GPT‑4 scores peer models with ≈80% human agreement on MT‑Bench & Chatbot Arena (study). 2024 follow‑ups add “trust‑or‑escalate” guarantees for bias control. A minimal sketch follows this table. | Cheap, repeatable grading when human panels can’t keep up. |
| Crowd / Adversarial Loops | OpenAI Evals turns every real‑world failure prompt into a permanent test, while external red‑team datasets flow straight into automated checks (repo & overview). | Keeps the bar moving—models can’t overfit yesterday’s exam. |
| Scalable Oversight Protocols | 2025 work on recursive self‑critiquing shows AI‑vs‑AI “critique‑of‑critique” can police superhuman outputs when humans can’t (paper). A new benchmark directly compares oversight schemes (benchmark). | A roadmap for supervising skills we no longer master. |
| Agentic Benchmarks | AgentBench v1.4 and METR’s RE‑Bench pit models against multi‑step tool‑use tasks; humans still win 32‑hour marathons, but models dominate five‑minute sprints (paper, blog). | Measures process competence—not just one‑shot answers. |
| Automated Interpretability Gates | OpenAI’s automated‑interpretability toolkit plus Anthropic’s Transformer Circuits 2024 updates aim to flag dishonest circuits before deployment (repo, essay). | Shifts evaluation from outputs to internal reasoning. |
| Hard‑as‑Possible Exams | Humanity’s Last Exam itself iterates: bug‑bounty hunts (Mar‑2025) and a live leaderboard now drive question quality upward (Scale blog). | Even the “final exam” is no longer static. |
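
As promised above, here's what LLM‑as‑a‑judge grading can look like in practice. This is a minimal sketch, not the MT‑Bench authors' actual harness: the judge model name and prompt wording are my assumptions. The position swap is the standard trick for controlling a judge's bias toward whichever answer appears first.

```python
# Minimal sketch of pairwise LLM-as-a-judge grading, in the spirit of
# MT-Bench. Model name and prompt are assumptions, not the paper's setup.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading two AI answers to the same question.
Question: {question}

Answer A: {a}

Answer B: {b}

Reply with exactly one letter: "A" if A is better, "B" if B is better."""

def judge(question: str, ans1: str, ans2: str, model: str = "gpt-4o") -> str:
    # Randomly swap positions so the judge can't favor "Answer A" by habit.
    swapped = random.random() < 0.5
    a, b = (ans2, ans1) if swapped else (ans1, ans2)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()[:1]
    # Map the positional verdict back to the original answer order.
    if verdict == "A":
        return "model_2" if swapped else "model_1"
    return "model_1" if swapped else "model_2"

# winner = judge("Explain UV vision in birds.", answer_from_m1, answer_from_m2)
```

In the real papers this is run over many questions with both orderings graded, so position bias averages out rather than relying on a single coin flip.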

Bottom line: the yardsticks are mutating right alongside the models. We may never see the UV feathers unaided, but we’re busy inventing microscopes, spectral filters, and AI‑powered ornithologists to point them out.


Staring at Ultraviolet

Perhaps we’re approaching a future where our instruments for measuring intelligence must themselves evolve beyond human design constraints. If we can’t trust our eyes, tests, or intuition to gauge the limits of superhuman AI, we’ll need radically novel ways to detect progress.

Like birds seeing ultraviolet feathers, AI may soon perform cognitive feats right in front of us—completely invisible to our current senses. The urgent challenge ahead isn’t just building smarter AI; it’s inventing entirely new ways to notice it.