A New “Time Horizon” Metric and Doubling Trends

METR's new metric for measuring AI progress, and the surprising consistency of AI task complexity doubling every seven months.

Mar 21, 2025

I initially scoffed at the idea of measuring AI progress using a "50%-task-completion time horizon" (basically, the time it takes a skilled human to do tasks that AI models can complete half the time).

But, after diving into the latest METR paper, I'm starting to appreciate its value. It's intriguing to see AI progress framed as a kind of "Moore's Law," with task complexity doubling roughly every seven months since 2019.

METR's paper on measuring AI progress

The "Time Horizon" Metric

How the METR team's process works
  • The idea is simple: if a task typically takes a human 30 minutes and a given AI can complete it autonomously 50% of the time, that AI's "time horizon" is 30 minutes.
  • METR gathered 169 diverse tasks from three benchmark suites, tested subsets of them rigorously with human baselines to understand human completion time, and plotted AI capabilities across multiple model generations.
  • Tasks range from trivial (seconds) to substantial (hours), giving a broad view of capability growth.
  • Task selection matters here, so let's assume this roughly measures the doubling time of coding ability (a minimal sketch of how a horizon gets computed follows this list).
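To make that concrete, here's a minimal sketch of how a single model's horizon could be computed, assuming (as I read the paper) a logistic fit of task success against the log of human completion time. The task data is made up for illustration; none of this is METR's code.

```python
# Minimal sketch of the 50% time-horizon idea: fit a logistic curve of
# success against log2(human completion time), then solve for the time
# at which P(success) = 0.5. Data below is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
model_passed  = np.array([1, 1, 1, 1, 1,  1,  0,  1,   0,   0])

X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_passed)

# P(success) = sigmoid(b + w * log2(t)); the 50% point is where b + w * log2(t) = 0.
b, w = clf.intercept_[0], clf.coef_[0][0]
horizon_minutes = 2 ** (-b / w)
print(f"50% time horizon ≈ {horizon_minutes:.0f} human-minutes")
```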

Doubling Every Seven Months

METR's doubling trend chart
  • Models went from struggling with seconds-long tasks (GPT-2, GPT-3) to reliably completing tasks that skilled humans spend hours on (Claude 3.7 Sonnet).
  • Notably, the doubling trend has been surprisingly consistent, with an R² of 0.98, which is insanely tight for something this complex (a sketch of what that fit looks like follows this list).
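Here's roughly what that fit looks like in code, with placeholder numbers standing in for METR's actual data: regress log2(horizon) on release date, and the inverse slope is the doubling time.

```python
# Sketch of the trend fit: log2(time horizon) vs. release date.
# The dates and horizons below are illustrative placeholders.
import numpy as np

release_days = np.array([0, 400, 800, 1200, 1600, 2000])    # days since the earliest model
horizon_min  = np.array([0.1, 0.5, 2.0, 8.0, 30.0, 120.0])  # 50% horizon in minutes

x, y = release_days, np.log2(horizon_min)
slope, intercept = np.polyfit(x, y, 1)   # slope is doublings per day
doubling_days = 1 / slope

y_hat = slope * x + intercept
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"doubling time ≈ {doubling_days:.0f} days, R² ≈ {r2:.2f}")
```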

Implications of the Doubling Trend

Predictions based on the doubling trend
  • If the trend holds, by around 2027–2031 we might see AI handling month-long human tasks autonomously (rough back-of-the-envelope math after this list).
  • Imagining a "month-long horizon" in the next few years is exciting (and scary). It would be hard to measure, but it's easy enough to imagine.
  • However, the blocker to reaching tasks at this scale isn't raw intelligence. They require real breakthroughs in long-horizon memory and learning, or dramatic breakthroughs in context windows. There's been talk of some of that with Titans, but not much since.
  • So my gut says, oof, that's a stretch. But I'm reminded of how people reacted to Moore's Law too (which was hypothesized on only 4 data points). It required future breakthroughs to keep coming, and they did come. And commentary like this might not be as goofy as it seems today.
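For what it's worth, here's the back-of-the-envelope math behind that date range, assuming today's ~59-minute horizon, a 212-day doubling time, and my own (not METR's) definition of a month-long task as roughly 167 working hours:

```python
# Rough projection: how many doublings from a 59-minute horizon to a
# "month-long" task, and roughly when would that land? All assumptions
# (167-hour work-month, 212-day doubling) are mine.
import math
from datetime import date, timedelta

current_horizon_min = 59          # Claude 3.7 Sonnet, per METR
target_horizon_min = 167 * 60     # ~one work-month in minutes
doubling_days = 212

doublings = math.log2(target_horizon_min / current_horizon_min)
eta = date(2025, 3, 21) + timedelta(days=doublings * doubling_days)
print(f"{doublings:.1f} doublings → around {eta.year}")
```

That pencils out to roughly 2029, comfortably inside the 2027–2031 window.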

Context and Nuance

Length of tasks by model comparison
  • Anthropic's Claude 3.7 Sonnet's time horizon is 59 minutes. Released Feb 2025 / 25 days ago.
  • OpenAI's o1 sits at 39 mins. Released in preview Sept 2024 / 190 days ago.
  • Curious to see where things like o1-pro and o3 would sit (a quick check of how these two data points line up with the trend follows below).
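A quick consistency check on those two data points (my arithmetic, not the paper's): does the 165-day gap between their releases, at a 212-day doubling time, predict the jump we see between their horizons?

```python
# Do these two models roughly match a 212-day doubling time?
gap_days = 190 - 25                      # o1 preview release vs. Claude 3.7 Sonnet release
predicted_ratio = 2 ** (gap_days / 212)  # growth the trend would predict over that gap
observed_ratio = 59 / 39                 # ratio of the two 50% time horizons
print(f"predicted {predicted_ratio:.2f}x vs. observed {observed_ratio:.2f}x")
```

Same ballpark (about 1.7x predicted vs. 1.5x observed), which is about all you can ask of two noisy points.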

Real-World Tasks vs. Benchmark Tasks

Real-world vs. benchmark tasks
  • Tasks in benchmarks are self-contained and well-defined, unlike typical messy, ambiguous real-world tasks people face daily.
  • The human baseliners weren't deeply familiar with the specific tasks or tools. That feels unfair, but upon reflection, I suppose it levels the playing field for the AI.
  • Real-world tasks involve task discovery and iterative feedback loops. Imagine if every manager provided benchmark-level clarity to their teams; the world would be a less stressful place.
  • Human participants completed only around 61% of their tasks, and they did have internet access (AI did not).

How does the doubling trend hold up at higher success rates?

Doubling trend at 80% success rate
  • Initially, I criticized the 50% success threshold as irrelevant for real-world deployment (where near-100% is needed).
  • @swyx correctly nudged me to read to the end.
  • The 80% success rate shows... basically the same doubling trend: every 213 days rather than the 212 days we get at 50%. Huh! That's a lot more robust than I expected.
  • Looks like even pushing to a 99% completion rate doesn't change things much (see the sketch after this list for how a non-50% threshold falls out of the same fit).
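For intuition, here's how a threshold other than 50% falls out of the same kind of logistic fit (illustrative coefficients, not the paper's). The absolute horizon shrinks a lot as the bar rises, but the paper's point is that the doubling trend itself barely moves.

```python
# Horizon at success threshold p, given a logistic fit of
# P(success) = sigmoid(b + w * log2(human_minutes)). Coefficients are hypothetical.
import math

def horizon_minutes(p: float, b: float, w: float) -> float:
    """Human-minutes at which the model's success probability equals p."""
    return 2 ** ((math.log(p / (1 - p)) - b) / w)

b, w = 3.5, -0.6   # hypothetical intercept and slope
for p in (0.5, 0.8, 0.99):
    print(f"{int(p * 100)}% horizon ≈ {horizon_minutes(p, b, w):.1f} min")
```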

Also, scaffolding plays a major role in pushing beyond a 50% success rate:

  • Better prompting techniques, iterative frameworks, and search-enhanced strategies can boost success rates 2–3x on identical models.
  • Looking at SWE-bench, you'll see Claude 3.5 Sonnet ranging from 33% to 62% success on the benchmark, a roughly 2x difference based purely on how it's harnessed (a toy illustration follows below).
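One toy way to see how harnessing alone moves the needle (not how any particular SWE-bench agent actually works): sample several candidate solutions and keep one only if it passes the task's tests.

```python
# Best-of-k with a verifier: if one attempt succeeds with probability p,
# k independent verified attempts succeed with 1 - (1 - p)^k.
# Assumes independent attempts and a perfect verifier, which is generous.
def best_of_k(p_single: float, k: int) -> float:
    return 1 - (1 - p_single) ** k

for k in (1, 2, 4, 8):
    print(f"k={k}: {best_of_k(0.33, k):.0%}")
```

Even this crude mechanism roughly doubles a 33% single-shot rate with a handful of retries, which is why harness choices swing benchmark numbers so much.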

Parting Thoughts

AGI and the future

At Hypercontext, we aimed to break tasks into ~4-hour chunks; that was our sweet spot for accurate scoping. AI capabilities entering the 4-hour to 8-hour range are especially exciting because that's the scale of real-world developer work.

It's wild to contemplate, but the data clearly points toward this future. Even factoring in skepticism about extrapolating trends, there's no ignoring the rapidly expanding capabilities.

Maybe this is the first tangible glimpse into what's coming.