What the next seven years could look like—and how to use it today

TL;DR. A clear way to track real‑world AI utility is its time horizon: the length of tasks (measured in human time) that an AI can finish with a given reliability. Recent measurements show that this horizon has doubled about every seven months since 2019. Today’s best generalist systems reliably handle tasks a human completes in ~50–60 minutes; they’re near‑certain on tasks under ~4 minutes and falter above ~4 hours. If the long‑running trend holds, day‑, week‑, and even month‑scale projects become automatable—with the right guardrails—over the next few years. metr.org; arXiv

Chart: METR-Horizon-v1, long-task time horizons by model release (log scale).

Notes: Y-axis is log scale. Points show estimated hours for the p50/p80 horizons; vertical bars are 95% confidence intervals.

Why “time horizon” matters more than leaderboard scores

Traditional tests (quizzes, benchmarks) tell us whether a model knows things. They don’t tell us whether it can carry a multi‑step project from start to finish.

The time‑horizon metric fixes this by asking a different question: how long a task, measured by the time skilled humans take to complete it, can this AI reliably finish? On METR’s diverse, multi‑step software and reasoning suite, success is ~100% for “human‑under‑4‑minute” tasks and <10% for “human‑over‑4‑hour” tasks; the leading models cluster around a ~1‑hour 50% horizon. metr.org

Key point: METR finds the 50% horizon has risen exponentially with a ~7‑month doubling time. A stricter 80% reliability horizon is roughly 5× shorter but rises in parallel. metr.org; arXiv

Chart: AI model progress on METR-Horizon-v1, showing the p50 horizon length of various models over time on a logarithmic y-axis to highlight exponential growth.


Diagram: The long‑task horizon, illustrated

  • Horizon curve (log scale): shows a simple projection anchored to the March 19, 2025 baseline (~1‑hour 50% horizon) with a 7‑month doubling.
  • Year‑by‑year bars: sample the 50% horizon on each “anniversary” (Mar 19).

Note: These visuals are illustrative (not METR’s original plots). They use METR’s reported doubling time and baseline to communicate the practical magnitude of change. metr.org


The next seven years (illustrative base case)

Assuming the ~7‑month doubling persists and using March 19, 2025 as time zero:

Year (anniversary) | ~50% horizon (human time) | What becomes feasible (with tests & oversight)
2025 | 1.0 hr | Short, well-scoped tasks with strong checkers.
2026 | 3.3 hr | “Day-part” tickets; multi-doc synthesis; small feature PRs with tests.
2027 | 10.8 hr (~1 day) | End-to-end one-day work packages and multi-app tool use.
2028 | 35.3 hr (~1.5 days) | Mini-sprints: prototype features; full-pipeline research → draft → figures.
2029 | 116 hr (~4.8 days) | Week-long projects: bug bash + triage across repos; A/B test setup.
2030 | 380 hr (~2.3 weeks) | Multi-team integration, procurement cycles, compliance evidence collection.
2031 | 1,248 hr (~7.4 weeks) | Month-scale initiatives become hit-and-miss but tractable with scaffolding.
2032 | 4,096 hr (~5.6 months) | Cross-functional projects possible under strict guardrails & staged rollout.
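The table values follow from simple compounding. A minimal sketch of the arithmetic, assuming METR’s ~1-hour 50% horizon on March 19, 2025 and a clean 7-month doubling:

```python
from datetime import date

BASELINE = date(2025, 3, 19)   # reported ~1-hour 50% horizon at this date
BASE_HOURS = 1.0               # 50% horizon at the baseline, in human-hours
DOUBLING_MONTHS = 7            # reported doubling time

def horizon_hours(on: date) -> float:
    """Projected 50% horizon (in human-hours) on a given date."""
    months = (on.year - BASELINE.year) * 12 + (on.month - BASELINE.month)
    return BASE_HOURS * 2 ** (months / DOUBLING_MONTHS)

for year in range(2025, 2033):
    # e.g. 2027 -> ~10.8 hr, 2030 -> ~380 hr, 2032 -> ~4,096 hr
    print(f"{year}: {horizon_hours(date(year, 3, 19)):,.1f} hr")
```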

Independent coverage has drawn similar conclusions—while warning that any long‑range extrapolation deserves caution. Domain‑by‑domain analyses also suggest some areas may move faster or slower, but broadly continue on exponential tracks. Nature; metr.org


What this means for teams right now

You can’t wish a 6‑hour task into 6 minutes. But you can compress end‑to‑end time by (1) atomizing work into small, checkable units, (2) writing checkers before doers, and (3) shrinking the human loop to review/approval. This is where today’s systems shine—and where tomorrow’s will scale.

High‑leverage patterns

  1. Test‑oriented automation: Convert deliverables into code‑checkable artifacts (validators, linters, schema checks, golden examples).
  2. Self‑verification loops: The agent must explain its plan, run the checks, repair failures, then seek sign‑off (patterns 1 and 2 are sketched together after this list).
  3. Parallelism by default: Fan‑out retrieval, candidate solutions, or data pulls; auto‑rank with your checkers.
  4. Policy & permissions: Least‑privilege keys, explicit “allowed actions,” auditable logs.
  5. Budgets & SLAs: Token/runtime/cost ceilings; auto‑stop on runaway jobs.
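A minimal sketch of patterns 1 and 2 combined, assuming a hypothetical run_agent_step function that wraps whatever model or agent framework you use; the JSON fields checked here are illustrative:

```python
import json
from typing import Callable

# Pattern 1: the checker is written first and is ordinary, trusted code.
def check_output(candidate: str) -> list[str]:
    """Return a list of failure messages; an empty list means the artifact passes."""
    try:
        data = json.loads(candidate)                 # schema check: must be valid JSON
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    return [f"missing field: {field}"
            for field in ("title", "summary", "owner") if field not in data]

# Pattern 2: plan -> do -> verify -> repair, then hand off to a human gate.
def self_verify_loop(task: str,
                     run_agent_step: Callable[[str], str],
                     max_attempts: int = 3) -> str:
    plan = run_agent_step(f"Propose a step-by-step plan for: {task}")
    draft = run_agent_step(f"Execute this plan and return JSON only:\n{plan}")
    failures: list[str] = []
    for _ in range(max_attempts):
        failures = check_output(draft)
        if not failures:
            return draft                             # passes checks; await human sign-off
        draft = run_agent_step(
            f"Your output failed these checks: {failures}. Fix it and return JSON only:\n{draft}")
    raise RuntimeError(f"gave up after {max_attempts} attempts: {failures}")
```

The point of the split is that the checker never calls the model: “done” means the artifact passed code-level checks and a human approved it, not that the model said so.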

Reality check: A July 2025 study with experienced OSS developers found that naïve use of coding AIs could slow work; the fix was better scaffolding, tests, and workflow design—not abandoning AI. Build the runbook around the tool, not the other way around. metr.org; TIME


Drop‑in SOP checklist (copy‑paste)

Use this to convert recurring workflows into reliable, agent‑driven runbooks.

1) Scope & safety

  • Define the task boundary (inputs/outputs, success criteria, non‑goals)
  • Assign a risk tier and required approvals
  • Provide least‑privilege credentials
  • List allowed tools and disallowed actions
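One lightweight way to make the allowed-tools item concrete is a small policy object the runner consults before every tool call; the tool names, tiers, and actions below are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskPolicy:
    risk_tier: str                      # e.g. "low", "medium", "high"
    allowed_tools: frozenset[str]       # least privilege: only what this task needs
    disallowed_actions: frozenset[str]  # explicit deny list, checked first
    requires_approval: bool = True      # human gate for anything above "low"

POLICY = TaskPolicy(
    risk_tier="medium",
    allowed_tools=frozenset({"read_repo", "run_tests", "open_draft_pr"}),
    disallowed_actions=frozenset({"push_to_main", "delete_branch", "send_email"}),
)

def authorize(tool: str, policy: TaskPolicy = POLICY) -> None:
    """Raise before the agent acts, and leave an auditable record of the decision."""
    if tool in policy.disallowed_actions:
        raise PermissionError(f"{tool} is explicitly disallowed")
    if tool not in policy.allowed_tools:
        raise PermissionError(f"{tool} is not on the allowed-tool list")
    print(f"AUDIT allow tool={tool} tier={policy.risk_tier}")  # append to the audit log
```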

2) Tests before tasks

  • Create unit checks (validators, schema checks, linters)
  • Add golden examples + edge cases
  • Build an end‑to‑end test the agent must pass before “done”
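A sketch of the golden-example and end-to-end checks in pytest style; summarize_ticket, the tests/golden layout, and the required fields are placeholders for whatever artifact your agent actually produces:

```python
import json
from pathlib import Path

import pytest

GOLDEN_DIR = Path("tests/golden")  # one sub-directory per golden case
CASES = sorted(GOLDEN_DIR.iterdir()) if GOLDEN_DIR.exists() else []

def summarize_ticket(ticket_text: str) -> dict:
    """Stand-in for the agent pipeline's entry point; wire up the real call here."""
    return {"title": ticket_text[:60], "priority": "unknown", "owner": "unassigned"}

@pytest.mark.parametrize("case_dir", CASES)
def test_matches_golden(case_dir: Path):
    ticket = (case_dir / "input.txt").read_text()
    expected = json.loads((case_dir / "expected.json").read_text())
    actual = summarize_ticket(ticket)
    # Compare only the fields that matter so cosmetic drift does not fail the build.
    for key in ("title", "priority", "owner"):
        assert actual[key] == expected[key]

def test_end_to_end_has_required_fields():
    """The 'done' gate: the artifact must at least carry the required fields."""
    result = summarize_ticket("Login page returns 500 after yesterday's deploy")
    for key in ("title", "priority", "owner"):
        assert key in result
```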

3) Plan → Do → Verify loop

  • Agent proposes a step‑by‑step plan and confirms assumptions
  • Structured action logs (who/what/when/why; example after this list)
  • Self‑repair on failure with explanations and retries
  • Human gate after tests and diffs pass
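For the structured action log, one JSON line per action is usually enough; the field set below is a suggested minimum, not a standard:

```python
import json
import time
import uuid

def log_action(actor: str, action: str, reason: str, inputs: dict,
               run_id: str, path: str = "actions.jsonl") -> None:
    """Append one who/what/when/why record per agent action."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),  # when
        "run_id": run_id,     # groups every action in one run
        "actor": actor,       # who: agent name or human reviewer
        "action": action,     # what: tool or step name
        "reason": reason,     # why: the agent's stated intent
        "inputs": inputs,     # enough context to replay or inspect the step
    }
    with open(path, "a") as handle:
        handle.write(json.dumps(record) + "\n")

run_id = uuid.uuid4().hex
log_action("agent", "run_tests", "verify the draft PR before requesting review",
           {"suite": "unit"}, run_id)
```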

4) Parallelism & reuse

  • Fan‑out long steps; rank results (sketched after this list)
  • Save artifacts (plans, drafts, datasets, diffs) in an artifact store
  • Have the agent search prior playbooks first
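A standard-library sketch of fan-out plus checker-based ranking; generate_candidate and score are stand-ins for your own agent call and checker:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate_candidate(prompt: str, seed: int) -> str:
    """Stand-in for one agent/model call; vary the seed (or temperature) per branch."""
    rng = random.Random(seed)
    return f"candidate {seed} for '{prompt}' (score={rng.random():.2f})"

def score(candidate: str) -> float:
    """Stand-in for your checker: higher is better, 0 means it failed the checks."""
    return float(candidate.rsplit("=", 1)[-1].rstrip(")"))

def fan_out(prompt: str, n: int = 8) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: generate_candidate(prompt, s), range(n)))
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0]  # best candidate moves forward; keep the rest as reusable artifacts

print(fan_out("summarize the incident report"))
```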

5) Budgets & SLA

  • Set token/runtime/cost budgets + timeout (sketched after this list)
  • Track success rate, retries, lead time per task type
  • Auto‑escalate on stalling, policy violations, or low confidence
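Budgets and auto-stops can live in a small guard the agent loop charges after every step; the ceilings below are illustrative:

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a run should stop and escalate to a human."""

class RunBudget:
    """Hard ceilings the agent loop checks after every model or tool call."""
    def __init__(self, max_tokens: int = 200_000, max_seconds: float = 900.0,
                 max_cost_usd: float = 5.0):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.max_cost_usd = max_cost_usd
        self.tokens = 0
        self.cost_usd = 0.0
        self.started = time.monotonic()

    def charge(self, tokens: int, cost_usd: float) -> None:
        self.tokens += tokens
        self.cost_usd += cost_usd
        elapsed = time.monotonic() - self.started
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens}")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"runtime budget exceeded: {elapsed:.0f}s")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost budget exceeded: ${self.cost_usd:.2f}")

# In the runner: catch BudgetExceeded, stop the job, and escalate to a human.
budget = RunBudget()
budget.charge(tokens=1_200, cost_usd=0.03)  # call after each model/tool step
```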

6) Deployment hygiene

  • Staging‑first; promote via canaries & automated diffs
  • Maintain audit trails (PII, data residency)
  • Run blameless post‑mortems; update tests and playbooks

Sources & further reading

  • METR, “Measuring AI Ability to Complete Long Tasks” (blog + figures, Mar 19, 2025) — time‑horizon method; ~7‑month doubling; success vs. task length. metr.org
  • METR (arXiv), paper preprint — ~50‑minute 50% horizon for top models; 80% horizons ~5× shorter. arXiv
  • Nature news, “AI could soon tackle projects that take humans weeks” — contextualizes forecasts, cautions on extrapolation. Nature
  • METR, “How Does Time Horizon Vary Across Domains?” (Jul 14, 2025) — domain‑specific growth rates and possible acceleration. metr.org
  • METR, developer productivity study and coverage — why workflow design matters for real‑world gains. metr.org; TIME