Last updated: August 10, 2025 • Estimated reading time: 6–8 minutes

TL;DR

Apple researchers introduced a lightweight way to let standard autoregressive LLMs predict several future tokens at once—then verify them—delivering ~2.5× speedups on general chat/QA and up to ~5× on code & math in their Tulu3‑8B tests, with no measured quality loss. The recipe adds mask tokens, a tiny gated‑LoRA adapter, a sampler head, and linear/quadratic verification.


What Apple actually published

In the paper “Your LLM Knows the Future,” the authors argue that vanilla autoregressive models already contain latent information about upcoming tokens. The team shows how to elicit and use that knowledge with minimal retraining, reporting ~2.5× speedups on general chat and ~5× on code/math—without quality degradation in their evaluations. A research summary and an arXiv preprint accompanied the release.

Tip for editors: add links to the Apple ML Research page and arXiv preprint in the References section below.

Figure 1. Multi‑token prediction pipeline: masks appended to the prompt, backbone + gated‑LoRA propose several tokens, a sampler improves coherence, and verification (linear or quadratic) commits the accepted prefix.


How the method works (plain English)

  1. Mask the future. Append k learned mask tokens to the prompt and train the model to jointly predict those future positions from the same prefix.
  2. Preserve the base model with gated LoRA. A gated‑LoRA adapter activates only for masked positions, so next‑token (standard) behavior stays intact while enabling multi‑token prediction (MTP).
  3. Sample coherent chunks. A small sampler head conditions each predicted token on the last one plus the hidden state so the speculative sequence reads naturally.
  4. Verify to stay accurate. During decoding, the model proposes k+1 tokens, then verifies the speculative block against standard decoding. Two strategies: linear (simple) and quadratic (interleaved masks to guarantee steady progress); a decoding sketch follows this list.
  5. Tiny, cheap fine‑tune. The team demonstrates the approach by fine‑tuning Tulu3‑8B to predict eight future tokens—evidence that you can retrofit this onto existing models.
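
To make the propose‑and‑verify loop concrete, here is a minimal, self‑contained Python sketch of the linear variant. The helpers propose_block and next_token are hypothetical stand‑ins for the MTP proposal (masks + gated‑LoRA + sampler) and for ordinary greedy decoding; in the actual method the whole speculative block is checked in a single forward pass rather than token by token as written here.

```python
from typing import Callable, List

def decode_with_verification(
    prefix: List[int],
    propose_block: Callable[[List[int], int], List[int]],
    next_token: Callable[[List[int]], int],
    k: int,
    max_new_tokens: int,
) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        block = propose_block(out, k)          # k speculative future tokens
        accepted: List[int] = []
        for tok in block:
            # Accept a speculative token only if standard decoding, given
            # everything accepted so far, would have produced the same token.
            if next_token(out + accepted) == tok:
                accepted.append(tok)
            else:
                break
        # The first mismatching position (or the slot after the block) still
        # yields one correct token, so every iteration makes progress.
        accepted.append(next_token(out + accepted))
        out.extend(accepted)
    return out[: len(prefix) + max_new_tokens]

# Toy demo: a "model" that just counts upward, so every proposal is accepted.
if __name__ == "__main__":
    propose = lambda seq, k: [seq[-1] + i + 1 for i in range(k)]
    nxt = lambda seq: seq[-1] + 1
    print(decode_with_verification([0], propose, nxt, k=4, max_new_tokens=10))
    # -> [0, 1, 2, ..., 10]
```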

Reported results (and when you’ll see speedups)

  • Throughput: ~2.5× on general chat/knowledge; up to ~5× on code and math.
  • Quality: No measured quality loss in their evaluations, credited to gated‑LoRA preserving next‑token behavior.
  • Best domains: Predictable/structured text (code, math) where acceptance rates are higher; a back‑of‑envelope sketch of why acceptance drives the speedup follows below.
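
As rough intuition for the last bullet, here is a back‑of‑envelope sketch (our simplification, not the paper's analysis): assume each speculative token is accepted independently with probability p. The expected number of tokens committed per verification step grows quickly with p, which is why predictable domains like code and math see larger gains than open‑ended chat.

```python
# Back-of-envelope only: assume each of the k speculative tokens is accepted
# independently with probability p. A simplification for intuition, not the
# paper's analysis.
def expected_tokens_per_step(p: float, k: int) -> float:
    """Accepted prefix of the k proposals, plus the one token the
    verification step always supplies."""
    return sum(p ** i for i in range(k + 1))

for p in (0.5, 0.7, 0.9):           # higher acceptance on structured text
    print(f"p={p}: ~{expected_tokens_per_step(p, k=8):.1f} tokens per step")
# p=0.5: ~2.0, p=0.7: ~3.2, p=0.9: ~6.1; committing more tokens per forward
# pass is what roughly translates into the observed speedups.
```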

How it compares to other acceleration tricks

  • Speculative decoding (draft + verify): Typically uses a second, smaller draft model to propose multiple tokens that the large model verifies. Apple’s method keeps a single model and adds adapters + masks, reducing system complexity.
  • Medusa (multi‑head decoding): Adds extra decoding heads and verifies candidates in a tree. Apple’s route—masks + gated‑LoRA + sampler + verification—targets similar or larger gains in some domains while preserving a single‑model architecture.

Why this matters

  • Lower latency on the same hardware: Fewer full passes per paragraph mean snappier UX.
  • Retrofit‑friendly: Light LoRA fine‑tuning makes it feasible to apply to many existing LLMs.
  • Composable with other infra: Works alongside KV‑cache, paged attention, and verifier‑style methods; exact stacking gains depend on acceptance rates and sampling settings.

Quick start for practitioners

  1. Add k mask tokens to your tokenizer/vocab.
  2. LoRA‑fine‑tune the backbone (freeze base weights) to jointly train NTP + MTP objectives; a PyTorch sketch of the gated adapter and sampler head follows this list.
  3. Train a sampler MLP that conditions on the previous sampled token.
  4. Decode with linear or quadratic verification; commit only the verified prefix each step.
  5. Benchmark latency + quality on your workloads; expect bigger wins on code/math.
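
For steps 2 and 3, here is a minimal PyTorch sketch of what a gated‑LoRA linear layer and sampler head could look like. Class names, rank, and shapes are illustrative assumptions rather than Apple's released implementation; the point is that the low‑rank update is multiplied by a gate so it only touches mask‑token positions, leaving standard next‑token behavior untouched. A typical setup would wrap the backbone's projection layers this way and train only the adapter and sampler parameters.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Frozen base linear plus a low-rank update applied only where
    mtp_mask is True (i.e., at appended mask-token positions)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # preserve next-token behavior
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter starts as an exact no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, mtp_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); mtp_mask: (batch, seq) bool
        delta = self.lora_b(self.lora_a(x)) * self.scale
        gate = mtp_mask.unsqueeze(-1).to(x.dtype)
        return self.base(x) + gate * delta     # update only at masked slots

class SamplerHead(nn.Module):
    """Tiny MLP that scores the next speculative token from the hidden state
    concatenated with the embedding of the previously sampled token."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.SiLU(),
            nn.Linear(d_model, vocab_size),
        )

    def forward(self, hidden: torch.Tensor, prev_tok_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([hidden, prev_tok_emb], dim=-1))

# Quick shape check: the last 4 positions play the role of appended mask tokens.
if __name__ == "__main__":
    layer = GatedLoRALinear(nn.Linear(64, 64))
    x = torch.randn(2, 10, 64)
    mask = torch.zeros(2, 10, dtype=torch.bool)
    mask[:, -4:] = True
    print(layer(x, mask).shape)                # torch.Size([2, 10, 64])
```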

FAQs

Is this the same as speculative decoding?
Not exactly. Classic speculative decoding uses a second (draft) model. Apple’s method equips the same model to propose and verify multiple tokens with mask‑based training and adapters.

How many tokens can it predict at once?
The demo shows k = 8 on Tulu3‑8B; other values are possible with trade‑offs in coherence and verification cost.

Does this degrade quality?
The authors emphasize no loss in measured quality; always validate against your own evals and sampling settings.


References

  • Apple ML Research summary: “Your LLM Knows the Future.”
  • arXiv preprint for the same paper.
  • Context: Medusa (multi‑head decoding); speculative decoding (draft + verify).

Suggested SEO elements

  • Meta title (≤60 chars): Apple’s Multi‑Token LLM: 2–5× Faster, Same Quality
  • Meta description (≤160 chars): Apple’s new multi‑token prediction speeds LLMs by 2–5× using mask tokens, gated‑LoRA, and verification—without quality loss.
  • URL slug: /apple-multi-token-prediction-llm-speedup
  • Primary keywords: multi‑token prediction, gated LoRA, Apple LLM, speculative decoding, Medusa, Tulu3‑8B, quadratic decoding