Apple’s Multi‑Token LLM: how it makes models 2–5× faster (without hurting quality)

Last updated: August 10, 2025 • Estimated reading time: 6–8 minutes

TL;DR

Apple researchers introduced a lightweight way to let standard autoregressive LLMs predict several future tokens at once and then verify them, delivering ~2.5× speedups on general chat/QA and up to ~5× on code and math in their Tulu3‑8B tests, with no measured quality loss. The recipe adds mask tokens, a small gated‑LoRA adapter, a sampler head, and linear or quadratic verification.

What Apple actually published

In the paper “Your LLM Knows the Future,” the authors argue that vanilla autoregressive models already contain latent information about upcoming tokens. The team shows how to elicit and use that knowledge with minimal retraining, reporting ~2.5× speedups on general chat and up to ~5× on code/math, with no quality degradation in their evaluations. A research summary and an arXiv preprint accompanied the release.

Figure 1. Multi‑token prediction pipeline: masks appended to the prompt, backbone + gated‑LoRA propose several tokens, a sampler improves coherence, and verification (linear or quadratic) commits the accepted prefix.

How the method works (plain English)

Mask the future. Append k learned mask tokens to the prompt and train the model to jointly predict those future positions from the same prefix.
Preserve the base model with gated LoRA. A gated‑LoRA adapter activates only for masked positions, so next‑token (standard) behavior stays intact while enabling multi‑token prediction (MTP).
Sample coherent chunks. A small sampler head conditions each predicted token on the previously sampled token plus the hidden state, so the speculative sequence reads coherently.
Verify to stay accurate. During decoding, the model proposes k+1 tokens per step (the usual next token plus k speculative ones), then checks the speculative block against what standard decoding would produce and commits only the matching prefix. Two strategies: linear (simple) and quadratic (interleaved masks to guarantee steady progress).
Tiny, cheap fine‑tune. The team demonstrates the approach by fine‑tuning Tulu3‑8B to predict eight future tokens—evidence that you can retrofit this onto existing models.
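
To make the gating idea concrete, here is a minimal sketch of a single gated‑LoRA projection in plain Python. It is not the paper's code: the function names, the per-position `is_mask` flag, and the tiny matrices are assumptions chosen for illustration. The point it shows is that ordinary positions go through the frozen base weight only, while mask positions also receive the low-rank update.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_lora_linear(x, is_mask, W, A, B, scale=1.0):
    """Base projection plus a low-rank update gated by position type.

    x       : hidden vector for one position
    is_mask : True if this position is one of the appended mask tokens
    W       : frozen base weight, shape (d_out x d_in)
    A, B    : trainable low-rank factors, shapes (r x d_in) and (d_out x r)
    """
    base = matvec(W, x)
    if not is_mask:
        # Gate closed: the ordinary next-token path is exactly the base model.
        return base
    # Gate open: add the low-rank correction B(Ax) for mask positions only.
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


# Toy weights (hypothetical values, rank r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [-0.5]]
x = [2.0, 3.0]

print(gated_lora_linear(x, is_mask=False, W=W, A=A, B=B))  # same as base output
print(gated_lora_linear(x, is_mask=True, W=W, A=A, B=B))   # base + LoRA update
```

Because the gate is a hard switch per position, whatever the adapter learns cannot shift the logits of regular tokens, which is the property the article credits for preserving next-token quality.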

Reported results (and when you’ll see speedups)

Throughput: ~2.5× on general chat/knowledge; up to ~5× on code and math.
Quality: No measured quality loss in their evaluations, credited to gated‑LoRA preserving next‑token behavior.
Best domains: Predictable/structured text (code, math) where acceptance rates are higher.
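
A rough way to see why acceptance rate drives the speedup: the sketch below computes the expected number of tokens committed per decoding step under a deliberately simple model. The independence assumption, the function name, and the example acceptance rates are all illustrative assumptions, not figures from the paper.

```python
def expected_tokens_per_step(p, k):
    """Expected tokens committed per decoding step.

    Assumes (for illustration only) that each of the k speculative tokens is
    accepted independently with probability p, and that the block is cut at
    the first rejection. The ordinary next token is always committed.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # 1 (guaranteed token) + p + p^2 + ... + p^k
    return sum(p ** i for i in range(k + 1))


# Hypothetical acceptance rates, with k = 8 as in the Tulu3-8B setup.
for p in (0.3, 0.6, 0.9):
    print(f"p={p:.1f}, k=8 -> {expected_tokens_per_step(p, 8):.2f} tokens/step")
```

Tokens per step is only an upper bound on wall-clock speedup, since the appended masks and the verification pass add compute to every step. It does show the trend the results describe: predictable text with high acceptance rates commits far more tokens per step than open-ended chat.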

How it compares to other acceleration tricks

Speculative decoding (draft + verify): Typically uses a second, smaller draft model to propose multiple tokens that the large model verifies. Apple’s method keeps a single model and adds adapters + masks, reducing system complexity.
Medusa (multi‑head decoding): Adds extra decoding heads and verifies candidates in a tree. Apple’s route—masks + gated‑LoRA + sampler + verification—targets similar or larger gains in some domains while preserving a single‑model architecture.

Why this matters

Lower latency on the same hardware: Fewer full forward passes per generated token mean faster responses.
Retrofit‑friendly: Light LoRA fine‑tuning makes it feasible to apply to many existing LLMs.
Composable with other infra: Works alongside KV‑cache, paged attention, and verifier‑style methods; exact stacking gains depend on acceptance rates and sampling settings.

Quick start for practitioners

Add k mask tokens to your tokenizer/vocab.
LoRA‑fine‑tune the backbone (freeze base weights) to jointly train NTP + MTP objectives.
Train a sampler MLP that conditions on the previous sampled token.
Decode with linear or quadratic verification; commit only the verified prefix each step.
Benchmark latency + quality on your workloads; expect bigger wins on code/math.
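
The decoding step above can be sketched as a toy loop. This is a simplified greedy illustration of "propose, verify, commit the matching prefix," not the paper's linear or quadratic algorithm: `next_token` and `propose` are hypothetical stand-ins for the model's standard and speculative outputs, and the checks run one at a time here, whereas a real system would batch them into a single forward pass.

```python
def decode_with_verification(prompt, next_token, propose, k, max_new_tokens):
    """Greedy decode that commits only speculative tokens matching standard decoding.

    next_token(prefix) -> token the standard (next-token) path would emit
    propose(prefix, k) -> list of k speculative guesses for the following positions
    Returns (new_tokens, number_of_decoding_steps).
    """
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        steps += 1
        # The ordinary next-token prediction is always committed.
        out.append(next_token(out))
        # Check speculative guesses left to right; stop at the first mismatch.
        for guess in propose(out, k):
            if len(out) - len(prompt) >= max_new_tokens:
                break
            if guess != next_token(out):
                break  # discard the rest of the speculative block
            out.append(guess)
    return out[len(prompt):], steps


# Toy "model": counts upward modulo 10.
def next_token(prefix):
    return (prefix[-1] + 1) % 10


# Toy proposer: usually right, but always gets the token 7 wrong.
def propose(prefix, k):
    guesses = [(prefix[-1] + i) % 10 for i in range(1, k + 1)]
    return [0 if g == 7 else g for g in guesses]


tokens, steps = decode_with_verification([0], next_token, propose, k=3, max_new_tokens=8)
print(tokens, steps)
```

The committed tokens are identical to what one-token-at-a-time greedy decoding would produce; only the number of steps changes. In this toy run the eight tokens take three steps instead of eight, and the one bad guess costs a truncated block rather than a wrong output.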

FAQs

Is this the same as speculative decoding?
Not exactly. Classic speculative decoding uses a second (draft) model. Apple’s method equips the same model to propose and verify multiple tokens with mask‑based training and adapters.

How many tokens can it predict at once?
The demo shows k = 8 on Tulu3‑8B; other values are possible with trade‑offs in coherence and verification cost.

Does this degrade quality?
The authors emphasize no loss in measured quality; always validate against your own evals and sampling settings.

References (add your links)

Apple ML Research summary: “Your LLM Knows the Future.”
arXiv preprint for the same paper.
Context: Medusa (multi‑head decoding); speculative decoding (draft + verify).

Suggested SEO elements

Meta title (≤60 chars): Apple’s Multi‑Token LLM: 2–5× Faster, Same Quality
Meta description (≤160 chars): Apple’s new multi‑token prediction speeds LLMs by 2–5× using mask tokens, gated‑LoRA, and verification—without quality loss.
URL slug: /apple-multi-token-prediction-llm-speedup
Primary keywords: multi‑token prediction, gated LoRA, Apple LLM, speculative decoding, Medusa, Tulu3‑8B, quadratic decoding

Author

Thorsten Meyer
