Last updated: August 10, 2025 • Estimated reading time: 6–8 minutes

TL;DR

Apple researchers introduced a lightweight way to let standard autoregressive LLMs predict several future tokens at once and then verify them. In their Tulu3‑8B tests, this delivered ~2.5× speedups on general chat/QA and up to ~5× on code and math, with no measured quality loss. The recipe adds mask tokens, a small gated‑LoRA adapter, a sampler head, and linear or quadratic verification.


What Apple actually published

In the paper “Your LLM Knows the Future,” the authors argue that vanilla autoregressive models already contain latent information about upcoming tokens. The team shows how to elicit and use that knowledge with minimal retraining, reporting ~2.5× speedups on general chat and ~5× on code/math, without quality degradation in their evaluations. A research summary and an arXiv preprint accompanied the release.

Figure 1. Multi‑token prediction pipeline: masks appended to the prompt, backbone + gated‑LoRA propose several tokens, a sampler improves coherence, and verification (linear or quadratic) commits the accepted prefix.


How the method works (plain English)

  1. Mask the future. Append k learned mask tokens to the prompt and train the model to jointly predict those future positions from the same prefix.
  2. Preserve the base model with gated LoRA. A gated‑LoRA adapter activates only for masked positions, so standard next‑token prediction (NTP) behavior stays intact while multi‑token prediction (MTP) is enabled.
  3. Sample coherent chunks. A small sampler head conditions each predicted token on the previously sampled token plus the hidden state, so the speculative sequence reads naturally.
  4. Verify to stay accurate. During decoding, the model proposes k+1 tokens (the regular next token plus k speculative ones), then checks the speculative block against standard decoding and commits only the part that matches. Two strategies: linear (simple) and quadratic (interleaved masks to guarantee steady progress).
  5. Tiny, cheap fine‑tune. The team demonstrates the approach by fine‑tuning Tulu3‑8B to predict eight future tokens, which suggests the technique can be retrofitted onto existing models.
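
To make step 2 concrete, here is a minimal sketch of a gated low‑rank update on a single linear layer, in pure Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the shapes, and the use of a boolean is_mask vector as the gate are all hypothetical. The point it shows is that positions holding regular tokens receive exactly the frozen base output, while mask positions also receive the adapter's contribution.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gated_lora_linear(x, W, A, B, is_mask, scale=1.0):
    """Linear layer whose LoRA update is applied only at mask positions.

    x: seq_len x d_in inputs, W: frozen d_in x d_out base weight,
    A (d_in x r) and B (r x d_out): low-rank adapter factors,
    is_mask: one boolean per position (hypothetical gate signal).
    """
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    return [
        [b + (scale * d if gate else 0.0) for b, d in zip(base_row, delta_row)]
        for base_row, delta_row, gate in zip(base, delta, is_mask)
    ]

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
seq_len, d_model, rank = 6, 4, 2
x = rand_matrix(seq_len, d_model)
W = rand_matrix(d_model, d_model)
A = rand_matrix(d_model, rank)
B = rand_matrix(rank, d_model)
is_mask = [False] * 4 + [True] * 2  # four regular tokens, then two mask tokens

out = gated_lora_linear(x, W, A, B, is_mask)
base = matmul(x, W)
print("regular positions unchanged:", out[:4] == base[:4])
print("mask positions changed:", out[4:] != base[4:])
```

In a full transformer the gate would have to be applied in every adapted layer, and the attention pattern also matters; this sketch covers only the per‑layer gating idea.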

Reported results (and when you’ll see speedups)

  • Throughput: ~2.5× on general chat/knowledge; up to ~5× on code and math.
  • Quality: No measured quality loss in their evaluations, credited to gated‑LoRA preserving next‑token behavior.
  • Best domains: Predictable/structured text (code, math) where acceptance rates are higher.
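
To see why acceptance rates drive the gains, a back‑of‑the‑envelope model helps. The sketch below assumes that one token is always committed per step and that each of the k speculative tokens is accepted independently with the same probability p, with acceptance stopping at the first rejection. Both the independence assumption and the function name are ours, not the paper's, and the result counts tokens per step rather than wall‑clock speedup, which is lower once verification overhead is included.

```python
def expected_tokens_per_step(p, k):
    """Expected tokens committed per decoding step.

    Assumes one guaranteed token plus an accepted prefix of up to k
    speculative tokens, each accepted independently with probability p.
    This equals 1 + p + p**2 + ... + p**k.
    """
    return sum(p ** i for i in range(k + 1))

# Compare a low-acceptance and a high-acceptance workload at k = 8.
for p in (0.5, 0.8, 0.95):
    print(f"p={p}: {expected_tokens_per_step(p, 8):.2f} tokens per step")
```

Under this simplified model, raising p matters far more than raising k once k is moderately large, which is consistent with structured text such as code and math benefiting most.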

How it compares to other acceleration tricks

  • Speculative decoding (draft + verify): Typically uses a second, smaller draft model to propose multiple tokens that the large model verifies. Apple’s method keeps a single model and adds adapters + masks, reducing system complexity.
  • Medusa (multi‑head decoding): Adds extra decoding heads and verifies candidates in a tree. Apple’s route (masks, gated‑LoRA, sampler, and verification) targets similar or larger gains in some domains while preserving a single‑model architecture.

Why this matters

  • Lower latency on the same hardware: Fewer full forward passes per generated response mean faster replies for users.
  • Retrofit‑friendly: Light LoRA fine‑tuning makes it feasible to apply to many existing LLMs.
  • Composable with other infra: Works alongside KV‑cache, paged attention, and verifier‑style methods; exact stacking gains depend on acceptance rates and sampling settings.

Quick start for practitioners

  1. Add k mask tokens to your tokenizer/vocab.
  2. LoRA‑fine‑tune the backbone (freeze base weights), training jointly on next‑token prediction (NTP) and multi‑token prediction (MTP) objectives.
  3. Train a sampler MLP that conditions on the previous sampled token.
  4. Decode with linear or quadratic verification; commit only the verified prefix each step.
  5. Benchmark latency + quality on your workloads; expect bigger wins on code/math.
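
The decoding loop in step 4 can be sketched with a toy model. Everything here is a hypothetical stand‑in: next_token plays the role of greedy next‑token decoding, propose plays the role of the multi‑token draft (with wrong guesses injected on purpose), and verify checks the draft token by token. A real implementation would fold verification into the model's forward passes rather than calling the model once per token, and this sketch does not model the difference between linear and quadratic verification. What it does show is that committing only the verified prefix reproduces plain greedy output in fewer steps.

```python
def next_token(prefix):
    """Toy stand-in for greedy next-token decoding (hypothetical rule)."""
    return (prefix[-1] * 3 + len(prefix)) % 7

def propose(prefix, k):
    """Toy draft: the exact next token followed by k speculative guesses.

    Guesses are deliberately corrupted at some positions so that the
    rejection path is exercised.
    """
    draft, ctx = [], list(prefix)
    for i in range(k + 1):
        tok = next_token(ctx)
        if i > 0 and len(ctx) % 4 == 0:
            tok = (tok + 1) % 7  # inject a wrong guess
        draft.append(tok)
        ctx.append(tok)
    return draft

def verify(prefix, draft):
    """Return the longest prefix of the draft that matches greedy decoding."""
    accepted, ctx = [], list(prefix)
    for tok in draft:
        if tok != next_token(ctx):
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

def decode(prompt, max_new_tokens, k=4):
    """Propose k+1 tokens per step and commit only the verified prefix."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new_tokens:
        accepted = verify(out, propose(out, k))
        remaining = max_new_tokens - (len(out) - len(prompt))
        out.extend(accepted[:remaining])
        steps += 1
    return out[len(prompt):], steps

def greedy(prompt, n):
    """Reference: one token per step."""
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]

tokens, steps = decode([1, 2], 20)
print("matches greedy:", tokens == greedy([1, 2], 20))
print("steps used:", steps, "vs 20 for one-token-at-a-time decoding")
```

Because the first proposed token is always the true next token in this toy, every step commits at least one token, so the loop cannot stall.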

FAQs

Is this the same as speculative decoding?
Not exactly. Classic speculative decoding uses a second (draft) model. Apple’s method equips the same model to propose and verify multiple tokens with mask‑based training and adapters.

How many tokens can it predict at once?
The demo shows k = 8 on Tulu3‑8B; other values are possible with trade‑offs in coherence and verification cost.

Does this degrade quality?
The authors emphasize no loss in measured quality; always validate against your own evals and sampling settings.


References

  • Apple ML Research summary: “Your LLM Knows the Future.”
  • arXiv preprint for the same paper.
  • Context: Medusa (multi‑head decoding); speculative decoding (draft + verify).
