Gemma 4 becomes 3x faster thanks to MTP

May 20, 2025·Matteo Flora

Inference speed is one of the most underrated bottlenecks in the practical adoption of AI models. When quality is discussed, attention almost always goes to reasoning benchmarks, the quality of generated code, or answer precision. But in production, latency matters as much as absolute quality, if not more. An excellent model that takes 10 seconds to respond is less useful than a very good model that responds in 2. Google knows this, and the Gemma 4 update with MTP support is concrete proof of it.

MTP stands for Multi-Token Prediction, an inference technique that breaks with the classic paradigm of autoregressive generation. In traditional language models, tokens are generated one at a time, in sequence: the model produces a token, feeds it back as input, generates the next one, and so on. The process is inherently serial, and that seriality is the fundamental limit on generation speed.
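The serial loop described above can be sketched in a few lines of Python. This is only an illustration: `toy_next_token` is a hypothetical stand-in for a real model's forward pass, and nothing here reflects Gemma's actual code. The point is that every new token costs one full pass, and no pass can start before the previous one has finished.

```python
from typing import Callable, List, Tuple


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass (not a real model)."""
    return (context[-1] * 3 + len(context)) % 11


def generate_autoregressive(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    num_new_tokens: int,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time, counting forward passes."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(num_new_tokens):
        # Each step depends on every token produced so far,
        # so the steps cannot run in parallel.
        tok = next_token(tokens)
        forward_passes += 1
        tokens.append(tok)
    return tokens, forward_passes


if __name__ == "__main__":
    tokens, passes = generate_autoregressive(toy_next_token, [1, 2], 5)
    print(tokens, passes)
```

In this loop the number of forward passes always equals the number of generated tokens, which is exactly the cost that multi-token approaches try to reduce.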

MTP addresses this limit directly: instead of predicting one token at a time, the model learns to predict several upcoming tokens in a single step. This is not random speculation: the model uses its understanding of the context to propose the next few tokens, verifies them, and accepts them in blocks when they are correct. Because proposed tokens are kept only when verification confirms them, throughput rises sharply without measurable degradation in output quality. Applied to Gemma 4, the declared speedup is threefold compared to the previous version. In practice, this means moving from latencies suited mainly to batch or offline tasks to latencies compatible with real-time interactive applications.
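The propose, verify, and accept cycle can be illustrated with a generic draft-and-verify sketch. It is a toy under stated assumptions, not Gemma 4's implementation: `target_next_token` and `draft_tokens` are hypothetical stand-ins, and the draft is made deliberately wrong now and then so that rejection is visible. In a real system the verification of all drafted positions would happen in one batched forward pass, which is where the speedup would come from. Here the target is called position by position for clarity, and each verification round is counted as one pass.

```python
from typing import List, Tuple


def target_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for the full model's greedy next token."""
    return (context[-1] * 3 + len(context)) % 11


def draft_tokens(context: List[int], k: int) -> List[int]:
    """Hypothetical cheap draft: usually right, wrong when the true token is 7."""
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = target_next_token(ctx)
        if tok == 7:  # deliberately injected draft error
            tok = 0
        draft.append(tok)
        ctx.append(tok)
    return draft


def generate_plain(prompt: List[int], num_new_tokens: int) -> List[int]:
    """Reference: ordinary one-token-at-a-time greedy decoding."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next_token(tokens))
    return tokens


def generate_draft_and_verify(
    prompt: List[int], num_new_tokens: int, k: int = 4
) -> Tuple[List[int], int]:
    """Accept the longest verified prefix of each k-token draft."""
    tokens = list(prompt)
    target_len = len(prompt) + num_new_tokens
    verify_passes = 0
    while len(tokens) < target_len:
        draft = draft_tokens(tokens, k)
        verify_passes += 1  # assumed: one batched pass scores all k positions
        accepted: List[int] = []
        for tok in draft:
            expected = target_next_token(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the verified token, drop the rest.
                accepted.append(expected)
                break
        tokens.extend(accepted)
    return tokens[:target_len], verify_passes


if __name__ == "__main__":
    out, passes = generate_draft_and_verify([1, 2], 12, k=4)
    print(out == generate_plain([1, 2], 12), passes)
```

Two properties are worth noting. The output is identical to plain greedy decoding, because every kept token is one the target itself would have chosen. And each round yields at least one token, so the number of verification rounds never exceeds the number of generated tokens. How far below that it falls depends on how often the drafts are right.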

Gemma is Google's family of open-weight models: freely available, deployable locally or on any cloud infrastructure, without dependency on Google APIs. This makes them particularly attractive for companies with privacy requirements, teams with limited budgets for external APIs, and developers who want complete control over the infrastructure. Gemma's historical limit compared to flagship proprietary models was precisely speed: great quality for the model size, but slower inference than acceptable in certain application contexts. With MTP, this gap is significantly narrowed.

The applications that benefit most from this update are those where latency is a critical parameter: real-time conversational assistants, text completion systems, AI agents that must process and respond quickly to sequential inputs, and processing pipelines that handle large volumes of documents in batches. In all these scenarios, tripling the inference speed is not an incremental improvement: it is a category shift.

The update should also be read within a broader strategic context. Google is increasingly investing in the open-weight distribution of its models, and Gemma is the primary vehicle for this strategy. Making Gemma 4 significantly faster increases its competitiveness against alternatives like Llama, Mistral, and GLM, all of which compete for the same adoption space among developers and companies that want control over their infrastructure. The more teams adopt Gemma as the foundation for their applications, the more the ecosystem of tools, integrations, and expertise around Google models grows. Open-weight distribution, in this sense, is not technological philanthropy: it is a distribution strategy that builds ecosystem dependency over time.

MTP is not a technique exclusive to Google: other labs are exploring similar approaches to accelerate inference without increasing the computational cost of training. What makes the implementation on Gemma 4 relevant is the scale of adoption it could reach: a fast, high-quality, and freely deployable open-weight model has all the characteristics to become a new reference standard for AI applications in production. For those evaluating which model to adopt as a basis for new projects, Gemma 4 with MTP now concretely enters the list of options to test, not just for quality but because its inference speed is finally up to real-world application needs.