Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open models: lightweight draft models for speculative decoding that can deliver up to a 3x speedup in tokens per second without degrading output quality or reasoning accuracy. The drafters are available today under the same Apache 2.0 license as Gemma 4, with model weights on Hugging Face and Kaggle and support in transformers, MLX, vLLM, SGLang, and Ollama.
Overview
Standard LLM inference is memory-bandwidth bound: the processor spends most of its time moving billions of parameters from VRAM to compute units just to generate a single token. This creates a latency bottleneck, especially on consumer-grade hardware. Speculative decoding decouples token generation from verification by pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model). The drafter predicts several future tokens at once in less time than the target model takes to process one token; the target model then verifies all suggested tokens in parallel.
If the target model agrees with the draft, it accepts the entire sequence in a single forward pass and generates an additional token of its own in the process. An application can therefore emit the full drafted sequence plus one extra token in roughly the time it normally takes to generate a single token.
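To make the verify-and-accept step concrete, here is a minimal greedy-decoding sketch of the logic described above. It is an illustration only: the helper names are made up, and a real implementation checks every drafted position in one batched forward pass rather than in a Python loop.

```python
def speculative_step(target_next_token, draft_tokens, context):
    """One greedy verify-and-accept step of speculative decoding.

    target_next_token(context) -> the target model's next token for a context
    draft_tokens               -> tokens proposed by the lightweight drafter
    Returns the tokens emitted by this step.
    """
    accepted = []
    for tok in draft_tokens:
        # In practice the target scores every drafted position in a single
        # batched forward pass; the loop here is only for readability.
        expected = target_next_token(context + accepted)
        if tok == expected:
            accepted.append(tok)        # draft matches the target: keep it
        else:
            accepted.append(expected)   # first mismatch: emit the target's token and stop
            return accepted
    # Every draft was accepted, and the same verification pass also yields
    # one extra token from the target model.
    return accepted + [target_next_token(context + accepted)]


# Toy demo: the "target" continues the sequence 0, 1, 2, ...; the drafter
# guesses the first two tokens correctly, then diverges.
target = lambda ctx: ctx[-1] + 1 if ctx else 0
print(speculative_step(target, draft_tokens=[1, 2, 9], context=[0]))  # -> [1, 2, 3]
```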
What the MTP drafters do
The MTP drafters are specialized speculative decoding models designed for the Gemma 4 family, which includes the 26B mixture-of-experts (MoE) model, the 31B dense model, and the E2B and E4B edge models. Key architectural enhancements include:
- KV cache sharing: The draft models reuse the target model's activations and share its KV cache, avoiding recomputation of context the larger model has already processed.
- Efficient embedder clustering: For the E2B and E4B edge models, where the final logit calculation is a major bottleneck, Google implemented an efficient clustering technique in the embedder to further accelerate generation.
- Hardware-specific optimizations: For the 26B MoE model on Apple Silicon, processing multiple requests simultaneously (batch sizes of 4 to 8) unlocks up to a ~2.2x speedup locally. Similar gains appear on Nvidia A100 GPUs as batch size increases.
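For developers who want to try the drafters through the frameworks listed in the introduction, a minimal sketch using transformers' generic assisted-generation interface might look like the following. The checkpoint names are hypothetical placeholders, and the dedicated MTP integration (which shares the target's KV cache) may expose a different entry point than generic assisted generation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "google/gemma-4-31b-it"      # hypothetical target checkpoint name
DRAFTER_ID = "google/gemma-4-mtp-draft"  # hypothetical MTP drafter checkpoint name

tokenizer = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(
    TARGET_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(
    DRAFTER_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one paragraph.", return_tensors="pt"
).to(target.device)

# assistant_model switches generate() to speculative (assisted) decoding:
# the drafter proposes tokens, the target verifies them, and the output is
# the same as generating with the target alone.
output = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```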
Tradeoffs
- No quality degradation: Because the primary Gemma 4 model retains final verification, output quality and reasoning accuracy remain identical to standard inference.
- Batch-size dependency: The speedup varies by hardware and batch size. Single-request inference on some architectures (e.g., the 26B MoE model on Apple Silicon) sees smaller gains than the batched workloads noted above; a rough back-of-the-envelope model of where the speedup comes from follows.
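As a rough illustration (not a published Google benchmark), the realized speedup depends mainly on how many drafted tokens the target accepts per cycle and how cheap the drafter is relative to the target. The toy model below ignores batching and hardware utilization, which is one reason real-world numbers vary.

```python
def expected_speedup(k: int, acceptance: float, draft_cost: float) -> float:
    """Rough speculative-decoding speedup per verification cycle.

    k          -> number of tokens proposed by the drafter each cycle
    acceptance -> average fraction of those k drafts the target accepts
    draft_cost -> drafter latency per token, relative to one target forward pass
    """
    tokens_per_cycle = acceptance * k + 1   # accepted drafts + the target's bonus token
    cycle_cost = 1 + k * draft_cost         # one target pass + k cheap drafter passes
    return tokens_per_cycle / cycle_cost


# Illustrative numbers only: 4 drafts per cycle, 70% accepted on average,
# drafter costing ~5% of a target forward pass per token.
print(round(expected_speedup(k=4, acceptance=0.7, draft_cost=0.05), 2))  # ~3.17
```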