Accelerating Gemma 4: faster inference with multi-token prediction drafters

Google's Gemma 4 inference stack gains a significant speed boost from multi-token prediction (MTP) drafters, lightweight draft models that speculate several tokens ahead of the main model. By amortizing the overhead of token-by-token generation, Gemma 4 achieves up to 3x faster inference on complex tasks. This optimization is poised to further democratize access to large language models in resource-constrained environments.

Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open models, a speculative decoding technique that can deliver up to a 3x speedup in tokens-per-second without degrading output quality or reasoning logic. The drafters are available today under the same Apache 2.0 license as Gemma 4, with model weights on Hugging Face, Kaggle, and support in transformers, MLX, vLLM, SGLang, and Ollama.

Overview

Standard LLM inference is memory-bandwidth bound: the processor spends most of its time moving billions of parameters from VRAM to compute units just to generate a single token. This creates a latency bottleneck, especially on consumer-grade hardware. Speculative decoding decouples token generation from verification by pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model). The drafter predicts several future tokens at once in less time than the target model takes to process one token; the target model then verifies all suggested tokens in parallel.

If the target model agrees with the draft, it accepts the entire sequence in a single forward pass — and generates an additional token of its own in the process. This means an application can output the full drafted sequence plus one token in the time it usually takes to generate a single one.
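The draft-and-verify loop described above can be sketched in a few lines of Python. This is a toy, greedy-decoding illustration with hypothetical `draft_model` and `target_model` callables (each mapping a token sequence to its next-token prediction), not the Gemma implementation:

```python
def speculative_step(target_model, draft_model, tokens, k=4):
    """Draft k tokens cheaply, then verify them with one target pass.

    Returns the tokens emitted this step: the accepted prefix of the
    draft plus one "bonus" token from the target model itself.
    """
    # 1. Drafter proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Target "verifies" all draft positions. In a real engine this
    #    is a single batched forward pass; here we just ask the target
    #    what it would have generated after each drafted prefix.
    verified = [target_model(list(tokens) + draft[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix where drafter and target agree,
    #    then append the target's own next token (the bonus token).
    accepted = []
    for i in range(k):
        if draft[i] != verified[i]:
            break
        accepted.append(draft[i])
    accepted.append(verified[len(accepted)])
    return accepted
```

With a drafter that matches the target perfectly, a single target pass yields k + 1 tokens; with a drafter that always disagrees, it degrades gracefully to ordinary one-token-per-pass decoding.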

What the MTP drafters do

The MTP drafters are specialized speculative decoding models designed for the Gemma 4 family, which includes the 26B mixture-of-experts (MoE) model, the 31B dense model, and the E2B and E4B edge models. Key architectural enhancements include:

  • KV cache sharing: The draft models reuse the target model's activations and share its KV cache, avoiding recomputation of context the larger model has already processed.
  • Efficient embedder clustering: For the E2B and E4B edge models, where the final logit calculation is a major bottleneck, Google implemented an efficient clustering technique in the embedder to further accelerate generation.
  • Hardware-specific optimizations: For the 26B MoE model on Apple Silicon, processing multiple requests simultaneously (batch sizes of 4 to 8) unlocks up to a ~2.2x speedup locally. Similar gains are seen with Nvidia A100 when increasing batch size.
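The value of the KV-cache-sharing bullet can be seen with a toy accounting exercise. Without sharing, a drafter must re-encode the entire prompt before it can speculate; with sharing, it starts from the target's cache and only processes the new draft positions. The token counts below are illustrative assumptions, not Gemma measurements:

```python
def drafter_prefill_cost(prompt_len, draft_len, shared_cache):
    """Positions the drafter must actually process in one draft step."""
    # Without a shared cache, the drafter re-encodes the whole prompt.
    prefill = 0 if shared_cache else prompt_len
    return prefill + draft_len

# For a 2048-token prompt and a 4-token draft:
#   no sharing  -> 2052 positions per draft step
#   with sharing ->    4 positions per draft step
```

The longer the context, the more lopsided this ratio becomes, which is why cache sharing matters most for long-context workloads.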

Tradeoffs

  • No quality degradation: Because the primary Gemma 4 model retains final verification, output quality and reasoning accuracy remain identical to standard inference.
  • Batch-size dependency: The speedup varies by hardware and batch size. Single-request inference on some architectures (e.g., the 26B MoE model on Apple Silicon) sees smaller gains, with the largest speedups arriving at batch sizes of 4 to 8.
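A back-of-the-envelope way to see where the headline speedup comes from (my own arithmetic, not figures from the announcement): if the target model agrees with each drafted token independently with probability p, a draft of length k emits on average 1 + p + ... + p^k tokens per target forward pass:

```python
def tokens_per_pass(p, k):
    """Expected tokens emitted per target-model forward pass, assuming
    each drafted token is accepted independently with probability p.
    For p < 1 this equals the closed form (1 - p**(k + 1)) / (1 - p)."""
    return sum(p**i for i in range(k + 1))

# e.g. p = 0.8, k = 4 -> about 3.36 tokens per pass, the regime where
# ~3x tokens-per-second gains become plausible.
```

Since each pass also costs a little drafter compute, real speedups depend on how well the drafter anticipates the target, which is why gains vary across models, hardware, and batch sizes.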