Last updated: August 10, 2025 • Estimated reading time: 6–8 minutes

TL;DR

Apple researchers introduced a lightweight way to let standard autoregressive LLMs predict several future tokens at once—then verify them—delivering ~2.5× speedups on general chat/QA and up to ~5× on code & math in their Tulu3‑8B tests, with no measured quality loss. The recipe adds mask tokens, a tiny gated‑LoRA adapter, a sampler head, and linear/quadratic verification.


What Apple actually published

In the paper “Your LLM Knows the Future,” the authors argue that vanilla autoregressive models already contain latent information about upcoming tokens. The team shows how to elicit and use that knowledge with minimal retraining, reporting ~2.5× speedups on general chat and ~5× on code/math—without quality degradation in their evaluations. A research summary and an arXiv preprint accompanied the release.

Tip for editors: add links to the Apple ML Research page and arXiv preprint in the References section below.

Figure 1. Multi‑token prediction pipeline: masks appended to the prompt, backbone + gated‑LoRA propose several tokens, a sampler improves coherence, and verification (linear or quadratic) commits the accepted prefix.


How the method works (plain English)

  1. Mask the future. Append k learned mask tokens to the prompt and train the model to jointly predict those future positions from the same prefix.
  2. Preserve the base model with gated LoRA. A gated‑LoRA adapter activates only for masked positions, so next‑token (standard) behavior stays intact while enabling multi‑token prediction (MTP).
  3. Sample coherent chunks. A small sampler head conditions each predicted token on the last one plus the hidden state so the speculative sequence reads naturally.
  4. Verify to stay accurate. During decoding, the model proposes k+1 tokens, then verifies the speculative block against standard decoding. Two strategies: linear (simple) and quadratic (interleaved masks to guarantee steady progress); a decoding sketch follows this list.
  5. Tiny, cheap fine‑tune. The team demonstrates the approach by fine‑tuning Tulu3‑8B to predict eight future tokens—evidence that you can retrofit this onto existing models.
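
To make the propose‑and‑verify loop concrete, here is a minimal, self‑contained Python sketch of the linear variant. The helpers propose_block and next_token are hypothetical stand‑ins for the MTP proposal (masks + gated‑LoRA + sampler) and for ordinary greedy decoding; in the actual method the whole speculative block is checked in a single forward pass rather than token by token as written here.

```python
from typing import Callable, List

def decode_with_verification(
    prefix: List[int],
    propose_block: Callable[[List[int], int], List[int]],
    next_token: Callable[[List[int]], int],
    k: int,
    max_new_tokens: int,
) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        block = propose_block(out, k)          # k speculative future tokens
        accepted: List[int] = []
        for tok in block:
            # Accept a speculative token only if standard decoding, given
            # everything accepted so far, would have produced the same token.
            if next_token(out + accepted) == tok:
                accepted.append(tok)
            else:
                break
        # The first mismatching position (or the slot after the block) still
        # yields one correct token, so every iteration makes progress.
        accepted.append(next_token(out + accepted))
        out.extend(accepted)
    return out[: len(prefix) + max_new_tokens]

# Toy demo: a "model" that just counts upward, so every proposal is accepted.
if __name__ == "__main__":
    propose = lambda seq, k: [seq[-1] + i + 1 for i in range(k)]
    nxt = lambda seq: seq[-1] + 1
    print(decode_with_verification([0], propose, nxt, k=4, max_new_tokens=10))
    # -> [0, 1, 2, ..., 10]
```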

Reported results (and when you’ll see speedups)

  • Throughput: ~2.5× on general chat/knowledge; up to ~5× on code and math.
  • Quality: No measured quality loss in their evaluations, credited to gated‑LoRA preserving next‑token behavior.
  • Best domains: Predictable/structured text (code, math) where acceptance rates are higher; a back‑of‑envelope sketch of why acceptance drives the speedup follows below.
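
As rough intuition for the last bullet, here is a back‑of‑envelope sketch (our simplification, not the paper's analysis): assume each speculative token is accepted independently with probability p. The expected number of tokens committed per verification step grows quickly with p, which is why predictable domains like code and math see larger gains than open‑ended chat.

```python
# Back-of-envelope only: assume each of the k speculative tokens is accepted
# independently with probability p. A simplification for intuition, not the
# paper's analysis.
def expected_tokens_per_step(p: float, k: int) -> float:
    """Accepted prefix of the k proposals, plus the one token the
    verification step always supplies."""
    return sum(p ** i for i in range(k + 1))

for p in (0.5, 0.7, 0.9):           # higher acceptance on structured text
    print(f"p={p}: ~{expected_tokens_per_step(p, k=8):.1f} tokens per step")
# p=0.5: ~2.0, p=0.7: ~3.2, p=0.9: ~6.1; committing more tokens per forward
# pass is what roughly translates into the observed speedups.
```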

How it compares to other acceleration tricks

  • Speculative decoding (draft + verify): Typically uses a second, smaller draft model to propose multiple tokens that the large model verifies. Apple’s method keeps a single model and adds adapters + masks, reducing system complexity.
  • Medusa (multi‑head decoding): Adds extra decoding heads and verifies candidates in a tree. Apple’s route—masks + gated‑LoRA + sampler + verification—targets similar or larger gains in some domains while preserving a single‑model architecture.

Why this matters

  • Lower latency on the same hardware: Fewer full passes per paragraph mean snappier UX.
  • Retrofit‑friendly: Light LoRA fine‑tuning makes it feasible to apply to many existing LLMs.
  • Composable with other infra: Works alongside KV‑cache, paged attention, and verifier‑style methods; exact stacking gains depend on acceptance rates and sampling settings.

Quick start for practitioners

  1. Add k mask tokens to your tokenizer/vocab.
  2. LoRA‑fine‑tune the backbone (freeze base weights) to jointly train NTP + MTP objectives; a PyTorch sketch of the gated adapter and sampler head follows this list.
  3. Train a sampler MLP that conditions on the previous sampled token.
  4. Decode with linear or quadratic verification; commit only the verified prefix each step.
  5. Benchmark latency + quality on your workloads; expect bigger wins on code/math.
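
For steps 2 and 3, here is a minimal PyTorch sketch of what a gated‑LoRA linear layer and sampler head could look like. Class names, rank, and shapes are illustrative assumptions rather than Apple's released implementation; the point is that the low‑rank update is multiplied by a gate so it only touches mask‑token positions, leaving standard next‑token behavior untouched. A typical setup would wrap the backbone's projection layers this way and train only the adapter and sampler parameters.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Frozen base linear plus a low-rank update applied only where
    mtp_mask is True (i.e., at appended mask-token positions)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # preserve next-token behavior
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter starts as an exact no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, mtp_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in); mtp_mask: (batch, seq) bool
        delta = self.lora_b(self.lora_a(x)) * self.scale
        gate = mtp_mask.unsqueeze(-1).to(x.dtype)
        return self.base(x) + gate * delta     # update only at masked slots

class SamplerHead(nn.Module):
    """Tiny MLP that scores the next speculative token from the hidden state
    concatenated with the embedding of the previously sampled token."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.SiLU(),
            nn.Linear(d_model, vocab_size),
        )

    def forward(self, hidden: torch.Tensor, prev_tok_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([hidden, prev_tok_emb], dim=-1))

# Quick shape check: the last 4 positions play the role of appended mask tokens.
if __name__ == "__main__":
    layer = GatedLoRALinear(nn.Linear(64, 64))
    x = torch.randn(2, 10, 64)
    mask = torch.zeros(2, 10, dtype=torch.bool)
    mask[:, -4:] = True
    print(layer(x, mask).shape)                # torch.Size([2, 10, 64])
```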

FAQs

Is this the same as speculative decoding?
Not exactly. Classic speculative decoding uses a second (draft) model. Apple’s method equips the same model to propose and verify multiple tokens with mask‑based training and adapters.

How many tokens can it predict at once?
The demo shows k = 8 on Tulu3‑8B; other values are possible with trade‑offs in coherence and verification cost.

Does this degrade quality?
The authors emphasize no loss in measured quality; always validate against your own evals and sampling settings.


References

  • Apple ML Research summary: “Your LLM Knows the Future.”
  • arXiv preprint for the same paper.
  • Context: Medusa (multi‑head decoding); speculative decoding (draft + verify).

Suggested SEO elements

  • Meta title (≤60 chars): Apple’s Multi‑Token LLM: 2–5× Faster, Same Quality
  • Meta description (≤160 chars): Apple’s new multi‑token prediction speeds LLMs by 2–5× using mask tokens, gated‑LoRA, and verification—without quality loss.
  • URL slug: /apple-multi-token-prediction-llm-speedup
  • Primary keywords: multi‑token prediction, gated LoRA, Apple LLM, speculative decoding, Medusa, Tulu3‑8B, quadratic decoding