What the next seven years could look like—and how to use it today

TL;DR. A clear way to track real‑world AI utility is its time horizon: the length of tasks (measured in human time) that an AI can finish with a given reliability. Recent measurements show that this horizon has doubled about every seven months since 2019. Today’s best generalist systems reliably handle tasks a human completes in ~50–60 minutes; they’re near‑certain on tasks under ~4 minutes and falter above ~4 hours. If the long‑running trend holds, day‑, week‑, and even month‑scale projects become automatable—with the right guardrails—over the next few years. metr.org; arXiv

Chart: METR-Horizon-v1, long-task time horizons by model release (log scale).

Notes: Y-axis is log scale. Points show estimated hours for the p50/p80 horizons; vertical bars are 95% confidence intervals.

Why “time horizon” matters more than leaderboard scores

Traditional tests (quizzes, benchmarks) tell us whether a model knows things. They don’t tell us whether it can carry a multi‑step project from start to finish.

The time‑horizon metric fixes this by asking a different question: how long a task, measured by the time skilled humans take to complete it, can this AI reliably finish? On METR’s diverse, multi‑step software and reasoning suite, success is ~100% for “human‑under‑4‑minute” tasks and <10% for “human‑over‑4‑hour” tasks; the leading models cluster around a ~1‑hour 50% horizon. metr.org

Key point: METR finds the 50% horizon has risen exponentially with a ~7‑month doubling time. A stricter 80% reliability horizon is roughly 5× shorter but rises in parallel. metr.org; arXiv

Chart: AI model progress on METR-Horizon-v1, showing the p50 horizon length of various models over time on a logarithmic y-axis to highlight exponential growth.


Diagram: The long‑task horizon, illustrated

  • Horizon curve (log scale): shows a simple projection anchored to the March 19, 2025 baseline (~1‑hour 50% horizon) with a 7‑month doubling.
  • Year‑by‑year bars: sample the 50% horizon on each “anniversary” (Mar 19).

Note: These visuals are illustrative (not METR’s original plots). They use METR’s reported doubling time and baseline to communicate the practical magnitude of change. metr.org


The next seven years (illustrative base case)

Assuming the ~7‑month doubling persists and using March 19, 2025 as time zero:

Year (anniversary) | ~50% horizon (human time) | What becomes feasible (with tests & oversight)
2025 | 1.0 hr | Short, well-scoped tasks with strong checkers.
2026 | 3.3 hr | “Day-part” tickets; multi-doc synthesis; small feature PRs with tests.
2027 | 10.8 hr (~1 day) | End-to-end one-day work packages and multi-app tool use.
2028 | 35.3 hr (~1.5 days) | Mini-sprints: prototype features; full-pipeline research → draft → figures.
2029 | 116 hr (~4.8 days) | Week-long projects: bug bash + triage across repos; A/B test setup.
2030 | 380 hr (~2.3 weeks) | Multi-team integration, procurement cycles, compliance evidence collection.
2031 | 1,248 hr (~7.4 weeks) | Month-scale initiatives become hit-and-miss but tractable with scaffolding.
2032 | 4,096 hr (~5.6 months) | Cross-functional projects possible under strict guardrails & staged rollout.
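The table values follow from simple compounding. A minimal sketch of the arithmetic, assuming METR’s ~1-hour 50% horizon on March 19, 2025 and a clean 7-month doubling:

```python
from datetime import date

BASELINE = date(2025, 3, 19)   # reported ~1-hour 50% horizon at this date
BASE_HOURS = 1.0               # 50% horizon at the baseline, in human-hours
DOUBLING_MONTHS = 7            # reported doubling time

def horizon_hours(on: date) -> float:
    """Projected 50% horizon (in human-hours) on a given date."""
    months = (on.year - BASELINE.year) * 12 + (on.month - BASELINE.month)
    return BASE_HOURS * 2 ** (months / DOUBLING_MONTHS)

for year in range(2025, 2033):
    # e.g. 2027 -> ~10.8 hr, 2030 -> ~380 hr, 2032 -> ~4,096 hr
    print(f"{year}: {horizon_hours(date(year, 3, 19)):,.1f} hr")
```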

Independent coverage has drawn similar conclusions—while warning that any long‑range extrapolation deserves caution. Domain‑by‑domain analyses also suggest some areas may move faster or slower, but broadly continue on exponential tracks. Nature; metr.org


What this means for teams right now

You can’t wish a 6‑hour task into 6 minutes. But you can compress end‑to‑end time by (1) atomizing work into small, checkable units, (2) writing checkers before doers, and (3) shrinking the human loop to review/approval. This is where today’s systems shine—and where tomorrow’s will scale.

High‑leverage patterns

  1. Test‑oriented automation: Convert deliverables into code‑checkable artifacts (validators, linters, schema checks, golden examples).
  2. Self‑verification loops: The agent must explain its plan, run the checks, repair failures, then seek sign‑off (patterns 1 and 2 are sketched together after this list).
  3. Parallelism by default: Fan‑out retrieval, candidate solutions, or data pulls; auto‑rank with your checkers.
  4. Policy & permissions: Least‑privilege keys, explicit “allowed actions,” auditable logs.
  5. Budgets & SLAs: Token/runtime/cost ceilings; auto‑stop on runaway jobs.
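A minimal sketch of patterns 1 and 2 combined, assuming a hypothetical run_agent_step function that wraps whatever model or agent framework you use; the JSON fields checked here are illustrative:

```python
import json
from typing import Callable

# Pattern 1: the checker is written first and is ordinary, trusted code.
def check_output(candidate: str) -> list[str]:
    """Return a list of failure messages; an empty list means the artifact passes."""
    try:
        data = json.loads(candidate)                 # schema check: must be valid JSON
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    return [f"missing field: {field}"
            for field in ("title", "summary", "owner") if field not in data]

# Pattern 2: plan -> do -> verify -> repair, then hand off to a human gate.
def self_verify_loop(task: str,
                     run_agent_step: Callable[[str], str],
                     max_attempts: int = 3) -> str:
    plan = run_agent_step(f"Propose a step-by-step plan for: {task}")
    draft = run_agent_step(f"Execute this plan and return JSON only:\n{plan}")
    failures: list[str] = []
    for _ in range(max_attempts):
        failures = check_output(draft)
        if not failures:
            return draft                             # passes checks; await human sign-off
        draft = run_agent_step(
            f"Your output failed these checks: {failures}. Fix it and return JSON only:\n{draft}")
    raise RuntimeError(f"gave up after {max_attempts} attempts: {failures}")
```

The point of the split is that the checker never calls the model: “done” means the artifact passed code-level checks and a human approved it, not that the model said so.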

Reality check: A July 2025 study with experienced OSS developers found that naïve use of coding AIs could slow work; the fix was better scaffolding, tests, and workflow design—not abandoning AI. Build the runbook around the tool, not the other way around. metr.org; TIME


Drop‑in SOP checklist (copy‑paste)

Use this to convert recurring workflows into reliable, agent‑driven runbooks.

1) Scope & safety

  • Define the task boundary (inputs/outputs, success criteria, non‑goals)
  • Assign a risk tier and required approvals
  • Provide least‑privilege credentials
  • List allowed tools and disallowed actions
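One lightweight way to make the allowed-tools item concrete is a small policy object the runner consults before every tool call; the tool names, tiers, and actions below are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskPolicy:
    risk_tier: str                      # e.g. "low", "medium", "high"
    allowed_tools: frozenset[str]       # least privilege: only what this task needs
    disallowed_actions: frozenset[str]  # explicit deny list, checked first
    requires_approval: bool = True      # human gate for anything above "low"

POLICY = TaskPolicy(
    risk_tier="medium",
    allowed_tools=frozenset({"read_repo", "run_tests", "open_draft_pr"}),
    disallowed_actions=frozenset({"push_to_main", "delete_branch", "send_email"}),
)

def authorize(tool: str, policy: TaskPolicy = POLICY) -> None:
    """Raise before the agent acts, and leave an auditable record of the decision."""
    if tool in policy.disallowed_actions:
        raise PermissionError(f"{tool} is explicitly disallowed")
    if tool not in policy.allowed_tools:
        raise PermissionError(f"{tool} is not on the allowed-tool list")
    print(f"AUDIT allow tool={tool} tier={policy.risk_tier}")  # append to the audit log
```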

2) Tests before tasks

  • Create unit checks (validators, schema checks, linters)
  • Add golden examples + edge cases
  • Build an end‑to‑end test the agent must pass before “done”
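A sketch of the golden-example and end-to-end checks in pytest style; summarize_ticket, the tests/golden layout, and the required fields are placeholders for whatever artifact your agent actually produces:

```python
import json
from pathlib import Path

import pytest

GOLDEN_DIR = Path("tests/golden")  # one sub-directory per golden case
CASES = sorted(GOLDEN_DIR.iterdir()) if GOLDEN_DIR.exists() else []

def summarize_ticket(ticket_text: str) -> dict:
    """Stand-in for the agent pipeline's entry point; wire up the real call here."""
    return {"title": ticket_text[:60], "priority": "unknown", "owner": "unassigned"}

@pytest.mark.parametrize("case_dir", CASES)
def test_matches_golden(case_dir: Path):
    ticket = (case_dir / "input.txt").read_text()
    expected = json.loads((case_dir / "expected.json").read_text())
    actual = summarize_ticket(ticket)
    # Compare only the fields that matter so cosmetic drift does not fail the build.
    for key in ("title", "priority", "owner"):
        assert actual[key] == expected[key]

def test_end_to_end_has_required_fields():
    """The 'done' gate: the artifact must at least carry the required fields."""
    result = summarize_ticket("Login page returns 500 after yesterday's deploy")
    for key in ("title", "priority", "owner"):
        assert key in result
```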

3) Plan → Do → Verify loop

  • Agent proposes a step‑by‑step plan and confirms assumptions
  • Structured action logs (who/what/when/why; example after this list)
  • Self‑repair on failure with explanations and retries
  • Human gate after tests and diffs pass
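For the structured action log, one JSON line per action is usually enough; the field set below is a suggested minimum, not a standard:

```python
import json
import time
import uuid

def log_action(actor: str, action: str, reason: str, inputs: dict,
               run_id: str, path: str = "actions.jsonl") -> None:
    """Append one who/what/when/why record per agent action."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),  # when
        "run_id": run_id,     # groups every action in one run
        "actor": actor,       # who: agent name or human reviewer
        "action": action,     # what: tool or step name
        "reason": reason,     # why: the agent's stated intent
        "inputs": inputs,     # enough context to replay or inspect the step
    }
    with open(path, "a") as handle:
        handle.write(json.dumps(record) + "\n")

run_id = uuid.uuid4().hex
log_action("agent", "run_tests", "verify the draft PR before requesting review",
           {"suite": "unit"}, run_id)
```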

4) Parallelism & reuse

  • Fan‑out long steps; rank results (sketched after this list)
  • Save artifacts (plans, drafts, datasets, diffs) in an artifact store
  • Have the agent search prior playbooks first
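A standard-library sketch of fan-out plus checker-based ranking; generate_candidate and score are stand-ins for your own agent call and checker:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate_candidate(prompt: str, seed: int) -> str:
    """Stand-in for one agent/model call; vary the seed (or temperature) per branch."""
    rng = random.Random(seed)
    return f"candidate {seed} for '{prompt}' (score={rng.random():.2f})"

def score(candidate: str) -> float:
    """Stand-in for your checker: higher is better, 0 means it failed the checks."""
    return float(candidate.rsplit("=", 1)[-1].rstrip(")"))

def fan_out(prompt: str, n: int = 8) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: generate_candidate(prompt, s), range(n)))
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0]  # best candidate moves forward; keep the rest as reusable artifacts

print(fan_out("summarize the incident report"))
```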

5) Budgets & SLA

  • Set token/runtime/cost budgets + timeout (sketched after this list)
  • Track success rate, retries, lead time per task type
  • Auto‑escalate on stalling, policy violations, or low confidence
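Budgets and auto-stops can live in a small guard the agent loop charges after every step; the ceilings below are illustrative:

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a run should stop and escalate to a human."""

class RunBudget:
    """Hard ceilings the agent loop checks after every model or tool call."""
    def __init__(self, max_tokens: int = 200_000, max_seconds: float = 900.0,
                 max_cost_usd: float = 5.0):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.max_cost_usd = max_cost_usd
        self.tokens = 0
        self.cost_usd = 0.0
        self.started = time.monotonic()

    def charge(self, tokens: int, cost_usd: float) -> None:
        self.tokens += tokens
        self.cost_usd += cost_usd
        elapsed = time.monotonic() - self.started
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens}")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"runtime budget exceeded: {elapsed:.0f}s")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost budget exceeded: ${self.cost_usd:.2f}")

# In the runner: catch BudgetExceeded, stop the job, and escalate to a human.
budget = RunBudget()
budget.charge(tokens=1_200, cost_usd=0.03)  # call after each model/tool step
```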

6) Deployment hygiene

  • Staging‑first; promote via canaries & automated diffs
  • Maintain audit trails (PII, data residency)
  • Run blameless post‑mortems; update tests and playbooks

Sources & further reading

  • METR, “Measuring AI Ability to Complete Long Tasks” (blog + figures, Mar 19, 2025) — time‑horizon method; ~7‑month doubling; success vs. task length. metr.org
  • METR (arXiv), paper preprint — ~50‑minute 50% horizon for top models; 80% horizons ~5× shorter. arXiv
  • Nature news, “AI could soon tackle projects that take humans weeks” — contextualizes forecasts, cautions on extrapolation. Nature
  • METR, “How Does Time Horizon Vary Across Domains?” (Jul 14, 2025) — domain‑specific growth rates and possible acceleration. metr.org
  • METR, developer productivity study and coverage — why workflow design matters for real‑world gains. metr.org; TIME