By Thorsten Meyer — May 2026

The most persuasive part of Jack Clark’s Import AI #455 is not the 60%/2028 forecast or the implications discussion. It is the evidence section, where Clark walks through six specific benchmarks that measure the precise skills required to automate AI research and shows that every one of them has either been saturated or is being saturated on a timeline of months rather than years. The pattern across the six benchmarks is the structural argument. Any single benchmark might be saturated through some combination of overfitting, data contamination, evaluation methodology problems, or measurement noise. Six benchmarks saturated in the same time window, measuring substantively different aspects of AI engineering and research capability, with consistent rates of improvement — that is not noise. That is a curve.

This piece is the catalog of the six benchmarks with the actual numbers, the time windows, and what each one measures. Anyone modeling AI capability trajectories (for investment, policy, workforce planning, or personal positioning) should know these numbers cold. They are the public-record answer to the question “is the AI capability trajectory actually moving as fast as the discourse suggests?”

The short answer, after reading the six benchmarks together, is: yes, it is moving that fast, and Clark’s forecast in the companion piece on his 60%/2028 estimate is not eccentric. It is what the public data says.






Every Benchmark Launched 2023-2024 Has Fallen — The Benchmark Saturation Cascade


DISPATCH / MAY 2026 · CLARK SERIES · 2 OF 5 · BENCHMARK CASCADE

▲ Clark Series 02 The Cascade · 6 Benchmarks · May 2026

The Benchmark Saturation Cascade · 2023-2026

Six benchmarks.
Six saturations.

Every benchmark launched 2023-2024 to measure AI R&D capability has fallen — or is falling on the same cadence. The pattern across the six is the structural argument.

Jack Clark’s 60%/2028 forecast on automated AI R&D rests on six benchmarks measuring different facets of AI engineering and research. Any single benchmark could be noise. Six benchmarks saturating in the same window — that is a curve. This is the catalog of the numbers everyone modeling AI deployment should know cold.

Thorsten Meyer / ThorstenMeyerAI.com / May 2026

Benchmarks measuring AI R&D capability

6 / 6 tracking toward saturation

All six benchmarks selected to measure AI R&D capability have either saturated, been declared solved, or are tracking toward saturation on a cadence of months, not years.

93.9%

SWE-Bench · Claude Mythos Preview · May 2026

From 2% in late 2023 · 47× improvement · saturated

12hr

METR time horizon · Opus 4.6 · 2026

From 30 seconds in 2022 · 1,440× growth in 4 years

52×

CPU training speedup · Mythos · April 2026

From 2.9× in May 2025 · human baseline 4×

95.5%

CORE-Bench · Opus 4.5 · Dec 2025

From 21.5% Sept 2024 · author declared SOLVED

● SWE-BENCH 2% → 93.9% IN 30 MONTHS · SATURATED · 47× IMPROVEMENT
● METR TIME HORIZONS 30 SECONDS → 12 HOURS IN 4 YEARS · 1,440× GROWTH · ~7 MONTH DOUBLING
● CORE-BENCH 21.5% → 95.5% IN 15 MONTHS · DECLARED SOLVED BY AUTHORS
● MLE-BENCH 16.9% → 64.4% IN 16 MONTHS · TRACKING TOWARD SATURATION
● POSTTRAINBENCH AI 28% · HUMAN BASELINE 51% · THE INNER LOOP
● CPU SPEEDUP 2.9× → 52× IN 11 MONTHS · 13× HUMAN BASELINE

The cascade · six benchmarks, six trajectories

Six tests. Same pattern.

Every benchmark in the set was launched with the explicit goal of being challenging for AI systems. Every benchmark has either been saturated or is being saturated on a timeline of months rather than years. The pattern is the structural finding.

The six benchmarks · status as of May 2026

Selected to measure AI R&D capability specifically. All six tracking toward or past saturation.

SWE-Bench · Real-world software engineering · GitHub issues

▼ Saturated

Late 2023 · 2%

May 2026 · 93.9%

30 months · 47× improvement. Claude Mythos Preview at noise floor. The deployment manifestation: frontier-lab researchers code “entirely through AI systems” per Clark.

METR Time Horizons · Duration of tasks AI completes at 50% reliability

▲ Continuing exponential

2022 · GPT-3.5 · 30 sec

2026 · Opus 4.6 · 12 hr

4 years · 1,440× growth · 7-month doubling cadence. Extrapolation: end-2026 ~100hr (Cotra forecast), end-2027 ~1,000hr, end-2028 ~10,000hr — a full research project end-to-end.

CORE-Bench · Research paper reproduction · install · run · verify

▼ Solved · declared

Sept 2024 · 21.5%

Dec 2025 · 95.5%

15 months · 4.4× improvement. Opus 4.5 saturated; benchmark author publicly declared the benchmark “solved.” Research reproduction as a discrete task: closed chapter.

MLE-Bench · End-to-end ML engineering · 75 Kaggle competitions

▶ Tracking

Oct 2024 · 16.9%

Feb 2026 · 64.4%

16 months · 3.8× improvement. Gemini 3 with search now completes two-thirds of competitive ML engineering projects autonomously. Saturation cadence projects to early 2027.

PostTrainBench · AI fine-tuning AI · the inner loop · Qwen / SmolLM / Gemma

▶ Early but rapid

Apr 2026 · AI · 25-28%

vs

Human baseline · 51%

The meta-benchmark. AI doing the actual work humans at frontier labs do. When AI matches the 51% baseline, the inner loop of frontier development closes. 28% in 2 months of measurement is the runway, not the destination.

Anthropic CPU Speedup Task · Optimize LM training implementation · human baseline 4×

▲ Past human baseline · 13×

May 2025 · Opus 4 · 2.9×

Apr 2026 · Mythos · 52×

11 months · 18× improvement · ~5.7× on the first step, then ~1.7-1.8× per release. The recursive-self-improvement story made operational. AI compressing the compute requirements of producing the next AI system.

Six benchmarks. One cadence. The pattern is the structural argument.

METR time horizons · the master curve

Thirty seconds. Then ten thousand hours.

The single most important chart in the AI capability literature. Task duration AI can complete with 50% reliability — translating capability into time units humans can intuitively reason about. Naive extrapolation hits Clark’s automated-AI-R&D threshold by end-2028.

METR time horizons · 2022 – 2028 (observed + extrapolated)

~7-month doubling cadence observed for 4 years · no inflection visible · extrapolation continues until it doesn’t.

2022 · GPT-3.5 · baseline year · ~30 sec

2023 · GPT-4 · ~4 min · 8× prior year

2024 · o1 · ~40 min · 10× prior year

2025 · GPT 5.2 (High) · ~6 hr · 9× prior year

2026 (now) · Opus 4.6 · current frontier · ~12 hr · 2× prior, year to date

End 2026 · Frontier model · Cotra forecast · ~100 hr · ~8× · 2 weeks of work

End 2027 · Frontier model · naive extrapolation · ~1,000 hr · 6 months of work

End 2028 · Frontier model · Clark threshold · ~10,000 hr · 5 years · one research project

~10,000 hours per task = autonomous research project end-to-end.

What the pattern actually says

Four implications. Four stakeholders.

Six benchmarks all moving toward their ceilings in the same window, on the cadence observed, is unlikely to be a noise artifact. The pattern is more robust than any individual data point. Specific implications follow for who needs to update what.

What the cascade implies, by stakeholder

The benchmarks are not noise. The trajectories translate to specific obligations.

▲ Engineering leaders

SWE-Bench saturation is the writing on the wall.

Frontier-lab researchers already code through AI systems. Same pattern reaches broader engineering market on 12-24 month delay. Productivity-per-engineer baseline rising rapidly. Workforce structure adjustment follows. The mechanism behind the labor displacement signal is here.

▲ Frontier-lab capex planners

METR + CPU speedup imply compute compression.

Compute requirements per unit of capability advance may compress materially. Marginal value of new capex may shift from training capacity to inference deployment capacity. Different infrastructure problem. Different geographic and physical constraints. The $500B+ capex allocation needs to rebalance.

▲ Policy professionals

Static benchmarks are structurally inadequate.

If insiders are right that benchmarks become uninformative within 1-3 years of release, evaluation policy depending on fixed benchmarks has a problem. Evaluation needs to be continuous process with rapidly evolving instruments. The institutional capacity for this does not exist at sufficient scale.

▲ Knowledge workers

The cascade reveals the order of impact.

Engineering: nearest. Research reproduction: solved. ML pipelines: 12-18 months out. Fine-tuning: 18-30 months. Research direction selection: possibly years out. The order matters for career planning. Execution + judgment combined remains durable; pure execution does not.

What’s notably absent · the missing benchmark

Six measured. One missing.

The benchmarks measure execution — given a target, can the AI system achieve it? None of them measure judgment — given an open research question, can the AI system pick the productive direction? This is why Clark’s forecast is 60% rather than 95%.

What the cascade measures · what it doesn’t

The engineering side is on a clear trajectory. The research side has fewer benchmarks and harder-to-interpret results.

▼ Measured · on trajectory

AI engineering · the schlep

  • Writing code against existing codebases — SWE-Bench saturated
  • Reproducing research with full toolchain — CORE-Bench solved
  • Building ML pipelines from scratch — MLE-Bench tracking
  • Fine-tuning models against benchmarks — PostTrainBench rapid
  • Optimizing training code — CPU speedup past human baseline
  • Long-horizon execution — METR exponential continuing

▲ Unmeasured · open question

AI research · the taste

  • Research direction selection — no good benchmark
  • Novel hypothesis generation — Erdős results suggestive only
  • Identifying productive variations — sparse literature
  • Recognizing surprising results — judgment-side capability
  • Integration into broader programs — long-context taste
  • Math centaur results exist but thin and recent

The numbers are public. The trajectories are real. Anyone modeling AI deployment over the next 32 months without engaging with this data is operating on weaker information than the public record actually provides.

— The structural read · May 2026

Colophon

Set in IBM Plex Serif, Inter Tight, & JetBrains Mono. Composed for ThorstenMeyerAI.com, May 2026. Free to embed with attribution.

thorstenmeyerai.com


The frame: what these benchmarks are measuring

Before the individual numbers, it is worth being clear about what the benchmarks are collectively trying to measure. Clark’s argument is not that AI systems are improving on arbitrary tasks. Clark’s argument is that AI systems are improving specifically on the tasks required to automate the work of building AI systems. The selection of benchmarks is therefore not arbitrary — each one corresponds to a specific component of the AI R&D pipeline:

SWE-Bench measures general software engineering capability — the ability to solve real-world GitHub issues in production codebases. AI engineering is fundamentally a software engineering problem. If you cannot write code, you cannot build AI systems.

METR time horizons measure the duration of tasks AI systems can complete autonomously. AI research involves chains of multi-hour or multi-day work. If you cannot maintain context and pursue objectives across long horizons, you cannot do research.

CORE-Bench measures the ability to reproduce existing research papers — read the paper, install the dependencies, run the experiments, verify the results. Research builds on prior work. If you cannot reproduce existing work, you cannot extend it.

MLE-Bench measures the ability to build entire machine learning systems from scratch to solve novel problems. This is the closest available benchmark to “do an ML research project end-to-end.”

PostTrainBench measures the ability of AI systems to fine-tune other AI systems — to take a base model and improve its performance on a target benchmark. This is the inner loop of frontier AI development.

Anthropic’s CPU speedup task measures the ability to optimize AI training code. Training optimization is a recurring bottleneck in frontier AI work; improvements compound across the entire research pipeline.

Six benchmarks. Six different facets of AI engineering and research. All six showing similar improvement patterns over similar time windows. That is the structural finding.


SWE-Bench · 2% → 93.9% in 30 months

SWE-Bench launched in late 2023 as a benchmark for “real-world software engineering”: the agent is given a GitHub repository and an issue, and is asked to produce a patch that resolves the issue and passes the project’s tests. The tasks are drawn from open-source projects with documented issue histories, which means the benchmark measures genuine engineering competence rather than synthetic exercises.

The trajectory:

  • Late 2023 launch · Claude 2 · ~2%
  • 2024 · GPT-4 and Claude 3 variants with agent scaffolds · ~15-30%
  • 2025 · Frontier models with mature agent harnesses · 60-70%
  • May 2026 (Clark essay) · Claude Mythos Preview · 93.9%

A 47× improvement over 30 months. The benchmark is, per Clark’s own framing, effectively saturated: the remaining 6.1% gap is at the noise floor of the benchmark itself, given that some fraction of the underlying tasks have ambiguous correct answers or imperfect test coverage. A commonly cited reference point for benchmark noise floors is the ImageNet validation set, in which approximately 6% of labels have been documented as wrong or ambiguous. Hitting 93.9% on SWE-Bench is functionally hitting the ceiling.

The structural implication: software engineering as a discrete benchmark task is solved. AI systems can write code that resolves real-world GitHub issues at rates close to the benchmark’s effective ceiling. Clark’s observation that “the vast majority of people I meet at frontier labs and around Silicon Valley now code entirely through AI systems” is the deployment manifestation of the benchmark saturation. The capability shows up in benchmark scores; the behavior shows up in actual developer workflows.

What SWE-Bench does not measure: novel software design from a blank page, large-scale architecture decisions, performance optimization across system boundaries, communication with non-technical stakeholders, judgment about what to build. These are real engineering skills that remain less benchmarked. But the schlep of resolving documented issues against existing codebases — which is the majority of what working engineers spend their time on — has been mostly automated by frontier-lab AI systems in production.


METR Time Horizons · 30 seconds → 12 hours in 4 years

The METR (Model Evaluation and Threat Research) time horizons measurement is the most important single chart in the AI capability literature. It measures the duration of tasks AI systems can complete with 50% reliability — translating capability into time units that humans can intuitively reason about.

The full trajectory Clark cites:

| Year | Model | Time horizon |
|---|---|---|
| 2022 | GPT-3.5 | ~30 seconds |
| 2023 | GPT-4 | ~4 minutes |
| 2024 | o1 | ~40 minutes |
| 2025 | GPT 5.2 (High) | ~6 hours |
| 2026 | Opus 4.6 | ~12 hours |
| End 2026 (forecast) | | ~100 hours |

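The forecast for end-2026 (~100 hours) is attributed to Ajeya Cotra, METR researcher and longtime AI forecaster. The doubling cadence over the observed period is cited as roughly 7 months, although the year-over-year steps in the table above (8×, 10×, 9×) are steeper than that. Extrapolating forward from the Cotra forecast at roughly 10× per year: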

  • End of 2026: ~100 hours per task (matches Cotra forecast)
  • Mid-2027: ~300 hours per task
  • End of 2027: ~1,000 hours per task (about 6 months of focused human researcher work)
  • End of 2028: ~10,000 hours per task (about 5 years of focused human researcher work)

An AI system with a 10,000-hour task horizon is, by definition, an AI system that can pursue a research project end-to-end. That is what Clark means when he says automated AI R&D may arrive by end of 2028. The METR curve, extrapolated naively, hits that threshold.
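The extrapolation arithmetic can be checked with a short sketch. It uses only the two endpoints quoted above (about 30 seconds in 2022, about 12 hours in 2026) and assumes clean exponential growth between them. The function names, the 48-month window, and the 32-month look-ahead are illustrative assumptions, not METR's methodology.

```python
import math

def implied_doubling_months(start_hours: float, end_hours: float, months: float) -> float:
    """Doubling time implied by two endpoints, assuming pure exponential growth."""
    doublings = math.log2(end_hours / start_hours)
    return months / doublings

def project(hours_now: float, months_ahead: float, doubling_months: float) -> float:
    """Project a time horizon forward at a fixed doubling time."""
    return hours_now * 2 ** (months_ahead / doubling_months)

# Endpoints quoted in the article: ~30 seconds (2022) and ~12 hours (2026).
start = 30 / 3600      # 30 seconds, expressed in hours
end = 12.0             # 12 hours
window_months = 48     # assumption: exactly four years between the two points

endpoint_doubling = implied_doubling_months(start, end, window_months)
print(f"growth factor: {end / start:.0f}x")
print(f"doubling time implied by endpoints: {endpoint_doubling:.1f} months")

# Horizon ~32 months out (roughly May 2026 to end of 2028) under two cadences.
for label, cadence in [("7-month doubling", 7.0), ("endpoint-implied", endpoint_doubling)]:
    print(f"{label}: {project(end, 32, cadence):,.0f} hours")
```

Under these assumptions the quoted endpoints imply a doubling time closer to four and a half months than seven, and neither cadence carries 12 hours to 10,000 hours by end-2028. The ~10,000-hour figure corresponds to hitting the ~100-hour forecast for end-2026 and then growing roughly 10× per year.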

The two ways the curve could fail to deliver the forecast:

First, the curve could be sigmoid rather than exponential. Every exponential improvement curve in technology eventually flattens. The question is whether the inflection point is before or after the 10,000-hour threshold. The current data cannot rule out a sigmoid, because the early part of a sigmoid is indistinguishable from an exponential, but no inflection has been visible in 4 years. The honest read is that we will know whether the curve is sigmoid only when we see the inflection, and we have not seen it yet.

Second, the time horizon measurement could decouple from the underlying capability. It is possible that AI systems get better at long-horizon tasks in benchmark conditions while remaining brittle in production conditions. The METR curve measures task completion at 50% reliability; the production threshold for “autonomous AI research” is probably more like 95% reliability across long sequences. The reliability gap might not close even as the time-horizon gap closes.
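The reliability gap can be made concrete with a toy model. The sketch below assumes a long task is a chain of independent steps that each succeed with the same probability. That is a simplification for illustration and not how METR defines its metric; `chain_success` and `required_step_reliability` are hypothetical helper names.

```python
def chain_success(step_reliability: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return step_reliability ** steps

def required_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed for the whole chain to hit a target success rate."""
    return target ** (1 / steps)

for steps in (1, 10, 100):
    need_50 = required_step_reliability(0.50, steps)
    need_95 = required_step_reliability(0.95, steps)
    print(f"{steps:>3} steps: per-step reliability {need_50:.4f} for 50% overall, "
          f"{need_95:.4f} for 95% overall")
```

In this toy model, moving a 100-step task from 50% to 95% end-to-end reliability means raising per-step reliability from about 99.3% to about 99.95%, which is why the two gaps need not close together.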

Both failure modes are real. Neither is currently visible in the public data. The curve continues as it has continued. Until the inflection arrives, the working assumption should be that the curve continues — because that is what curves do until they don’t.
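Why four years of data cannot settle the sigmoid question can be shown numerically. The sketch below compares a logistic curve with the pure exponential it approximates far below its ceiling. The ceiling, growth rate, and midpoint are arbitrary illustrative values, not parameters fitted to METR data.

```python
import math

def logistic(t: float, ceiling: float, rate: float, midpoint: float) -> float:
    """Logistic (sigmoid) growth curve."""
    return ceiling / (1 + math.exp(-rate * (t - midpoint)))

def early_exponential(t: float, ceiling: float, rate: float, midpoint: float) -> float:
    """Exponential that the logistic approximates when t is far below the midpoint."""
    return ceiling * math.exp(rate * (t - midpoint))

# Hypothetical parameters chosen only for illustration.
# A rate of 2.3 per year corresponds to roughly 10x growth per year.
CEILING, RATE, MIDPOINT = 10_000.0, 2.3, 8.0

for year in range(0, 9, 2):
    s = logistic(year, CEILING, RATE, MIDPOINT)
    e = early_exponential(year, CEILING, RATE, MIDPOINT)
    print(f"t={year}: logistic={s:10.4f}  exponential={e:10.4f}  ratio={s / e:.3f}")
```

With these parameters the two curves agree to within about 1% until two years before the midpoint, then diverge to a factor of two at the midpoint itself. Observations taken well before the inflection carry almost no information about where it sits.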


CORE-Bench · 21.5% → 95.5% in 15 months · declared “solved”

CORE-Bench (Computational Reproducibility Agent Benchmark) launched in September 2024 with a specific test: given a research paper and its repository, can the AI system install the dependencies, run the experiments, and answer questions about the outputs? The benchmark targets a specific component of AI research workflow — reproduction of existing work.

The trajectory:

  • September 2024 launch · GPT-4o in CORE-Agent scaffold · 21.5% on the hardest set
  • December 2025 · Opus 4.5 · 95.5% — benchmark author publicly declared the benchmark “solved”

A 4.4× improvement over 15 months, with formal declaration of saturation by the benchmark designers. Per Clark’s framing, this is the cleanest available evidence that AI systems can read, understand, and execute existing research papers, the schlep work of reproduction that occupies a large fraction of actual AI research workflows.

The structural implication: research reproduction is solved as a benchmark task. The benchmark designers’ decision to publicly declare the benchmark solved is itself meaningful. Benchmark designers do not generally declare benchmarks solved unless the benchmark has lost its ability to discriminate between systems. The fact that one of the benchmark authors publicly closed the chapter on CORE-Bench is the equivalent of a referee waving off the rest of the round.

What CORE-Bench does not measure: research direction selection, novel hypothesis generation, decisions about which experiments to run when the answer is not predetermined, integration of results into a broader research program. Reproduction is the floor of research, not the ceiling. But the floor has been built. The remaining question is the ceiling.


MLE-Bench · 16.9% → 64.4% in 16 months

MLE-Bench is OpenAI’s benchmark for “Machine Learning Engineering” — competing in 75 Kaggle competitions across natural language processing, computer vision, signal processing, and other domains. The benchmark requires the AI system to build entire ML pipelines from scratch, train models, and submit predictions. The human baseline is established Kaggle competitor performance.

The trajectory:

  • October 2024 launch · o1 in agent scaffold · 16.9%
  • February 2026 · Gemini 3 in agent harness with search · 64.4%

A 3.8× improvement over 16 months. Not yet saturated, but tracking toward saturation on the same cadence as the other benchmarks. Read at face value, the 64.4% score means AI systems can now complete roughly two-thirds of competitive ML engineering projects autonomously: pipelines, training, evaluation, submission, all of it.

The Kaggle competition framing is particularly relevant because Kaggle is the closest publicly visible analogue to “ML research as a series of discrete projects with measurable outcomes.” Top Kaggle competitors are professional ML engineers, often working at frontier labs or comparable institutions. The 64.4% score is not a measure of the ability to do ML in some abstract sense; it is a measure of the ability to be a credible Kaggle competitor. The trajectory implies that by end-2026 or early 2027, AI systems will be approaching saturation against top human Kaggle competitors.

The structural implication: end-to-end ML engineering is on a clear path to saturation. Whether this translates to ML research is the open question, but the engineering layer is moving on the same curve as everything else.


PostTrainBench · 0 → 28% (humans at 51%) · the meta-benchmark

PostTrainBench (introduced in Import AI #449, March 2026) is the most important benchmark in the set because it measures something different from the others: the ability of AI systems to fine-tune other AI systems. The benchmark tests how well a frontier model, given a CLI agent harness, can take smaller open-weight base models and post-train them to improve performance on standard evaluation benchmarks. The base models tested are Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B. The target benchmarks include AIME 2025, Arena Hard, BFCL, GPQA Main, GSM8K, HealthBench, and HumanEval.

The trajectory and human baseline:

  • April 2026 · Opus 4.6 and GPT 5.4 · 25-28% uplift
  • Human baseline · existing instruct-tuned versions by frontier-lab researchers · 51% uplift

The AI systems are getting about half as much uplift as the human researchers who originally produced these models. This is, in Clark’s words, “already quite meaningful.” The benchmark has only been published for ~6 weeks at the time of Clark’s essay; based on the trajectory of the other benchmarks, the gap to human performance plausibly closes within 12-18 months.

The structural significance of this benchmark deserves emphasis. PostTrainBench is not measuring AI doing AI engineering on toy problems. It is measuring AI doing the actual work that humans at frontier labs do to produce the AI systems we use. The 51% human baseline is not “what an undergrad could do” — it is the work product of skilled professional researchers operating at the technical frontier, deployed into production systems that real users interact with.

When AI systems match the human baseline on PostTrainBench, the inner loop of frontier AI development becomes automatable. The recursive-self-improvement story that Clark sketches — AI systems training their successors — runs through exactly this kind of capability. The 28% score in April 2026 is not the destination; it is the runway. The destination is 51%, at which point the inner loop closes.

The honest disagreement here: even at 51%, the AI systems would be matching human researchers on benchmark uplift, which is not the same as matching human researchers on research judgment. Picking which fine-tuning approach to try is part of the work; if the AI system is doing well at the implementation but the human is still picking the approach, that is not full automation. PostTrainBench scoring does not fully resolve this distinction. But the trajectory is the trajectory. The implementation half of the work is on a clear path to closure.


Anthropic CPU Speedup Task · 2.9× → 52× in 11 months

The CPU speedup task is an internal Anthropic benchmark, and it is interesting for what it measures: how well models can optimize the training code for a small language model. The task is to take an unmodified CPU-only LM training implementation and make it run as fast as possible. The score is the speedup factor. The human baseline is documented at approximately 4× speedup achievable in 4-8 hours of focused researcher work.

The trajectory across Anthropic’s frontier models:

  • May 2025 · Opus 4 · 2.9× mean speedup
  • November 2025 · Opus 4.5 · 16.5× mean speedup
  • February 2026 · Opus 4.6 · 30× mean speedup
  • April 2026 · Claude Mythos Preview · 52× mean speedup

The trajectory is in some ways the most striking of any benchmark in the set. The speedup factor went from below human (2.9× vs 4× human baseline) in May 2025 to 13× the human baseline by April 2026, within 11 months. The first step, from Opus 4 to Opus 4.5, was a jump of roughly 5.7×; each release after that improved on its predecessor by roughly 1.7-1.8×.
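The release-over-release ratios can be read off the four figures listed above. The sketch below only recomputes them; the variable names are illustrative, and the 4× human baseline is the approximate figure quoted earlier.

```python
HUMAN_BASELINE = 4.0  # approximate human speedup quoted in the article

# (release label, mean speedup) as listed above
releases = [
    ("Opus 4, May 2025", 2.9),
    ("Opus 4.5, Nov 2025", 16.5),
    ("Opus 4.6, Feb 2026", 30.0),
    ("Claude Mythos Preview, Apr 2026", 52.0),
]

# Ratio of each release's speedup to the one before it.
step_ratios = [curr[1] / prev[1] for prev, curr in zip(releases, releases[1:])]
for (label, speedup), ratio in zip(releases[1:], step_ratios):
    print(f"{label}: {speedup}x, {ratio:.1f}x the prior release")

total_gain = releases[-1][1] / releases[0][1]
vs_human = releases[-1][1] / HUMAN_BASELINE
print(f"overall gain: {total_gain:.1f}x, final result vs human baseline: {vs_human:.0f}x")
```

The overall 18× gain is dominated by the first step; the later releases each add a little under a doubling.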

This matters specifically because training optimization is a recurring bottleneck in frontier AI work. Every frontier model training run involves optimization work that compounds across the entire training pipeline. An AI system that can produce a 52× speedup on a CPU training implementation is an AI system that could plausibly help compress the compute requirements of producing the next AI system. This is the recursive-self-improvement story made operational, at small scale.

The honest caveat: CPU optimization is a different problem from GPU/TPU optimization at frontier scale. The CPU benchmark is a proxy for the harder problem, not the harder problem itself. The speedup numbers do not directly translate to “frontier training cost reduces 52×.” But the trajectory of capability on the proxy is the relevant signal. If frontier training optimization follows a trajectory similar to the proxy, the compounding implications follow.


The pattern across the six benchmarks

Six benchmarks. Six trajectories. Side-by-side:

| Benchmark | Window | Trajectory | Status |
|---|---|---|---|
| SWE-Bench | 30 months | 2% → 93.9% (47×) | Saturated |
| METR time horizons | 4 years | 30s → 12 hours (1,440×) | Continuing exponential |
| CORE-Bench | 15 months | 21.5% → 95.5% (4.4×) | Saturated · declared solved |
| MLE-Bench | 16 months | 16.9% → 64.4% (3.8×) | Tracking toward saturation |
| PostTrainBench | ~2 months observed | 0 → 28% (humans 51%) | Early but rapid |
| CPU speedup task | 11 months | 2.9× → 52× (18×) | Past human baseline |
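The improvement factors in the table can be recomputed from the endpoints, along with the compound monthly growth each one implies. This is a sketch for checking the arithmetic only; the `Trajectory` class is a hypothetical helper, and PostTrainBench is left out because a starting value of 0 has no multiplicative factor.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    name: str
    start: float
    end: float
    months: int

    @property
    def factor(self) -> float:
        """Overall improvement factor between the two endpoints."""
        return self.end / self.start

    @property
    def monthly_growth(self) -> float:
        """Compound monthly growth rate implied by the two endpoints."""
        return self.factor ** (1 / self.months) - 1

rows = [
    Trajectory("SWE-Bench (%)", 2.0, 93.9, 30),
    Trajectory("METR horizon (seconds)", 30, 12 * 3600, 48),
    Trajectory("CORE-Bench (%)", 21.5, 95.5, 15),
    Trajectory("MLE-Bench (%)", 16.9, 64.4, 16),
    Trajectory("CPU speedup (x)", 2.9, 52.0, 11),
]

for r in rows:
    print(f"{r.name:<24} {r.factor:8.1f}x  {r.monthly_growth:6.1%}/month")
```

One caution when reading the output: multiplicative factors on percentage scores are capped at 100% and depend heavily on the starting point, so they are not directly comparable with the unbounded METR and speedup figures.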

The pattern is consistent. Every benchmark in the set was launched with the explicit goal of being challenging for AI systems. Every benchmark has either been saturated or is being saturated on a timeline of months rather than years. The benchmark designers — researchers who specifically aim to construct evaluations that distinguish capable AI systems from less capable ones — keep finding that their benchmarks become uninformative within 1-3 years of release.

This pattern is not unique to AI-research-relevant benchmarks. It is the broader pattern of AI capability advancement over the 2022-2026 period. The reason it matters specifically for Clark’s argument is that these particular benchmarks were chosen to measure the components of AI R&D automation. If the pattern continued only on tangential benchmarks while AI-R&D-relevant benchmarks resisted saturation, the trajectory toward automated AI R&D would be weaker. The fact that the pattern is most aggressive on exactly the benchmarks that measure AI engineering and research capability is the structural signal.


What’s notably absent · the benchmark Clark doesn’t have

There is one capability that none of these benchmarks measure well: research direction selection. The benchmarks measure execution — given a target, can the AI system achieve it? They do not measure judgment — given an open research question, can the AI system pick the productive direction to explore?

Clark acknowledges this gap explicitly. He frames it as the distinction between AI engineering (which he believes is approaching full automation) and AI research (where the creativity question is unresolved). The benchmarks in the set are all on the engineering side of the line. The research side has fewer benchmarks, and the ones that exist are harder to interpret.

The closest benchmarks for research-side capability are:

  • Erdős Problems (referenced by Clark): Gemini attempted ~700 Erdős mathematical conjectures, produced 13 solutions, 1 deemed “interesting” by the math researchers. The base rate of interesting AI-generated mathematical results is currently low but non-zero.
  • Centaur math discovery: A 2026 math proof published with significant contributions from Google’s Gemini-based math tools, with explicit researcher acknowledgment that “the proofs of the main results were discovered with very substantial input from Google Gemini and related tools.”
  • The MLE-Bench tail: The current 64.4% MLE-Bench score might mask interesting structure. Are the failures concentrated in the harder competitions? Are they failures of execution or of direction? The granular data is not fully public.

The honest read on the research-side question: we have suggestive but not conclusive evidence that AI systems can produce novel research insights. The math results are real. The Centaur partnership results are real. But these are early data points in a thin literature. A confident statement that AI systems can autonomously do research, not just engineering, is not yet supported by the public benchmark data.

This is why Clark’s forecast is 60%/2028 rather than 95%/2028. The remaining 40% probability space includes the scenario in which engineering automation goes as expected but research-side automation requires additional capability breakthroughs that don’t arrive in time. The honest analytical move is to keep this scenario in the picture even while taking the benchmark cascade seriously.


What honest disagreement with the cascade interpretation looks like

The benchmark trajectories are public. The numbers are not in serious dispute. What is in dispute is what the trajectories mean. Three positions worth taking seriously:

The “benchmarks are not capability” position. Saturating benchmarks measures the ability to do specific tasks under specific conditions. The leap from “saturates SWE-Bench” to “can do real software engineering” is non-trivial. The leap from “saturates CORE-Bench” to “can do real AI research” is larger still. This is the most intellectually serious form of disagreement. It does not deny the trajectories; it denies that the trajectories translate to the capability claims being made.

The honest counter: deployment behavior provides additional evidence. The fact that frontier lab researchers, per Clark’s report, code “entirely through AI systems” suggests that the benchmark capability is translating to practical capability. The behavior is consistent with the benchmark scores. The disagreement is then about whether the behavior generalizes to research-level work, not whether it generalizes to engineering-level work. The latter seems clearly established.

The “saturation noise floor” position. Once a benchmark’s error rate falls into the single digits, the remaining gap may reflect the noise floor of the benchmark itself rather than the capability frontier. SWE-Bench at 93.9% might mean “AI is competent at the underlying task” or might mean “the benchmark has 6% measurement error and AI is hitting the floor.” This is a real concern for individual benchmarks. It is a weaker concern for the pattern. Six benchmarks all approaching their ceilings in the same window, on the cadence observed, are unlikely to be a shared noise artifact. The pattern is more robust than any individual data point.
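The claim that the pattern is more robust than any single data point can be made concrete with a small sketch. It assumes, purely for illustration, that each benchmark result has some probability of being a measurement artifact and that those artifacts are independent across benchmarks. Neither the probabilities nor the independence assumption comes from Clark’s piece, and the function name is hypothetical.

```python
# Illustrative only: the per-benchmark artifact probabilities are hypothetical,
# and independence between benchmarks is an assumption, not a measured fact.

def joint_artifact_probability(p_artifact: float, n_benchmarks: int = 6) -> float:
    """Probability that all n benchmarks are artifacts at once, assuming each
    one is an artifact independently with probability p_artifact."""
    if not 0.0 <= p_artifact <= 1.0:
        raise ValueError("p_artifact must be between 0 and 1")
    return p_artifact ** n_benchmarks


# Try a skeptical, a neutral, and a very skeptical prior per benchmark.
for p in (0.2, 0.5, 0.8):
    joint = joint_artifact_probability(p)
    print(f"p={p:.1f} per benchmark -> all six are artifacts: {joint:.4f}")
```

Even granting a 50% chance that any one result is an artifact, the chance that all six are falls below 2% under independence. The conclusion is only as strong as that assumption: a shared failure mode, such as training-data contamination affecting several benchmarks at once, would correlate the errors and raise the joint probability.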

The “curve breaks at the frontier” position. Benchmarks measure capability in known territory. Frontier AI R&D requires extending into unknown territory in ways that benchmark performance may not predict. This form of the disagreement makes a concrete claim: the curve will encounter a capability ceiling that current benchmarks aren’t testing for, and that ceiling will arrive before automated AI R&D does. The honest counter is that we don’t know what the ceiling looks like or when it arrives. Clark’s bet is that the ceiling is above the threshold for automated AI R&D. Disagreement with the bet is reasonable; it just requires articulating where the ceiling sits and why.

My own read: the trajectories are the strongest available evidence for short timelines on AI engineering automation, and decent-but-weaker evidence for short timelines on AI research automation. The 60%/2028 estimate Clark publishes is consistent with the public data even if it is at the more confident end of the range that the data supports.


What this means for everyone reading this

The six benchmarks have specific implications by stakeholder:

For engineering leaders. SWE-Bench saturation is the writing on the wall. The deployment manifestation — frontier-lab researchers coding through AI systems — is already in production at the institutions that build the AI systems. The same deployment pattern will reach the broader engineering market on a delay of 12-24 months. Workforce planning for engineering organizations needs to assume that the productivity-per-engineer baseline is increasing rapidly and that the workforce structure will adjust accordingly. The labor displacement piece covers the early signal in junior-engineer cohorts; the benchmark cascade is the underlying mechanism.

For frontier-lab capex planners. The METR curve combined with the CPU speedup curve implies that the compute requirements per unit of capability advance may compress materially over 2026-2028. This has implications for how to allocate the $500B+ in committed compute capex across labs and hyperscalers. If the inference compute requirements per task drop as fast as the speedup numbers suggest, the marginal value of new capex may shift from “more training capacity” toward “more inference deployment capacity” — which is a different infrastructure problem with different geographic and physical constraints.

For policy professionals working on AI evaluation. The benchmark saturation cadence has implications for evaluation policy. If frontier-lab insiders are correct that benchmarks become uninformative within 1-3 years of release, then evaluation policy that depends on specific benchmarks as static measurement instruments has a structural problem. Evaluation needs to be a continuous process with rapidly evolving instruments, not a fixed-point measurement against stable benchmarks. The institutional capacity for this kind of evaluation does not currently exist at sufficient scale.

For workers in cognitive-task labor markets. The pattern of benchmark saturation provides specific evidence for which categories of cognitive work are nearest to automation. Software engineering: nearest to fully automatable. Research reproduction: solved. ML pipeline construction: 12-18 months from saturation. Fine-tuning and optimization: 18-30 months from human-baseline parity. Research direction selection: harder to forecast, possibly years out. The order matters for individual career planning. The roles that combine execution-side skills with judgment-side skills retain more durable value than roles that are pure execution.

For investors. The benchmark cascade is the strongest available public evidence that AI capability is on the trajectory Clark forecasts. Investment models that assume slower trajectories are in tension with the public data. The honest analytical move is to update probability distributions on automated AI R&D arrival upward and adjust portfolio positioning accordingly — both in frontier-lab equity and in the broader knowledge-work economy that the cascade will reshape.


The honest assessment

Six benchmarks tracking the components of AI R&D capability have shown the same pattern over the same window. Two are saturated (SWE-Bench and CORE-Bench), and CORE-Bench has been formally declared solved by its designers. One is tracking toward saturation (MLE-Bench). One has crossed the human baseline materially (CPU speedup). One is in early stages but on the same trajectory (PostTrainBench). The sixth, the METR time horizons curve, underlies all of the others and has continued its exponential progression for four consecutive years without visible inflection.

The structural finding is clear: the engineering-side capabilities required to automate AI R&D are approaching saturation on a cadence of months. The research-side capabilities — direction selection, novel hypothesis generation, taste — are less well-measured and less clearly on the same trajectory, but the engineering-side trajectory alone produces a profound acceleration of AI development.
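The “cadence of months” can be checked against the figures in the Sources list. The sketch below is a rough calculation, not part of Clark’s analysis: it uses only the dates and scores listed there, treats progress as a simple average rate (an assumption), and leaves out SWE-Bench because the list gives its launch only as “late 2023.” The helper names are hypothetical.

```python
from math import log2


def months_between(start: tuple[int, int], end: tuple[int, int]) -> int:
    """Whole months between two (year, month) pairs."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


def doubling_time_months(v0: float, v1: float, months: int) -> float:
    """Average doubling time, assuming steady exponential growth from v0 to v1."""
    return months / log2(v1 / v0)


# CORE-Bench: launched September 2024, declared solved December 2025.
core_bench_months = months_between((2024, 9), (2025, 12))

# MLE-Bench: 16.9% at the October 2024 launch, 64.4% in February 2026.
mle_months = months_between((2024, 10), (2026, 2))
mle_points_per_month = (64.4 - 16.9) / mle_months

# CPU speedup task: 2.9x in May 2025, 52x in April 2026.
cpu_months = months_between((2025, 5), (2026, 4))
cpu_doubling = doubling_time_months(2.9, 52.0, cpu_months)

print(f"CORE-Bench launch to solved: {core_bench_months} months")
print(f"MLE-Bench average gain: {mle_points_per_month:.2f} points/month")
print(f"CPU speedup average doubling time: {cpu_doubling:.1f} months")
```

On these figures CORE-Bench went from launch to “solved” in 15 months, MLE-Bench gained roughly 3 percentage points per month, and the CPU speedup result doubled about every 2.6 months. These are averages between two reported points each, so they say nothing about whether the rate is steady, accelerating, or slowing inside the window.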

This is the evidence base for Clark’s 60%/2028 forecast. The forecast may be wrong. The forecast may be right but for different reasons than the benchmarks indicate. The forecast may be self-defeating in the way the companion piece describes. But the benchmark cascade is not a marketing artifact. The numbers are public. The trajectories are real. Anyone modeling AI deployment over the next 32 months without engaging with this data is operating on weaker information than the public record actually provides.

The next piece in this series — The Compounding Error Problem — examines why 99.9% alignment decays to 60% in 500 generations, and what that mathematical fact implies if the benchmark cascade does deliver automated AI R&D by end of 2028.
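The arithmetic behind that teaser can be reproduced under one simple reading, which is an assumption here since the companion piece holds the actual argument: each generation preserves 99.9% of the previous generation’s alignment, and the losses compound multiplicatively.

```python
# Assumed model (not confirmed by the source): alignment fidelity compounds
# multiplicatively, with the same per-generation retention every generation.

def retained_after(per_generation_fidelity: float, generations: int) -> float:
    """Fraction retained after the given number of generations."""
    return per_generation_fidelity ** generations


retained = retained_after(0.999, 500)
print(f"Retained after 500 generations at 99.9% each: {retained:.1%}")
```

Under this reading the result is roughly 60.6%, which matches the 60% figure quoted above. A different decay model, for example one where errors are partly corrected each generation, would give a different number.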


About the Author

Thorsten Meyer is a Munich-based futurist, post-labor economist, and recipient of OpenAI’s 10 Billion Token Award. He spent two decades managing €1B+ portfolios in enterprise ICT before deciding that writing about the transition was more useful than managing quarterly slides through it. More at ThorstenMeyerAI.com.



Sources

  • Jack Clark · Import AI 455: Automating AI Research · May 4, 2026 · jack-clark.net
  • SWE-Bench launch · late 2023 · Claude 2 baseline 2%
  • Claude Mythos Preview · 93.9% SWE-Bench · May 2026
  • METR · time horizons measurement curve · 2022-2026 · GPT-3.5 to Opus 4.6
  • Ajeya Cotra · METR · ~100 hour forecast for end-2026 (Import AI #448)
  • CORE-Bench launch · September 2024 · arxiv.org/abs/2409.11363
  • CORE-Bench declared “solved” · December 2025 · Opus 4.5 at 95.5%
  • MLE-Bench · OpenAI · October 2024 launch · 16.9% baseline
  • MLE-Bench · February 2026 · Gemini 3 in agent harness · 64.4%
  • PostTrainBench · Import AI #449 · March 2026
  • Anthropic CPU speedup task · May 2025 (2.9×) → April 2026 (52×)
  • ImageNet validation set · ~6% label error rate · arxiv.org/abs/2103.14749
  • Erdős Problems · Gemini · 700 attempted, 13 solutions, 1 interesting · Import AI #444
  • Centaur math discovery · UBC/UNSW/Stanford/DeepMind · Import AI #441