DiffusionGemma: Google launches the first open source diffusion AI model that generates 700 tokens per second

June 13, 2026·Davide Stigliani

In generative artificial intelligence, inference speed has always been one of the most critical bottlenecks. Large language models, however powerful, have historically suffered from high latency that limited their use in real-time, embedded, or high-volume contexts. Google has decided to break this pattern with DiffusionGemma, the first open source diffusion model for text generation, capable of producing up to 700 tokens per second of output.

This is not an incremental improvement. It is an architectural paradigm shift that could redefine how language models are designed, distributed, and used, especially in areas where latency is a critical factor.

To understand why DiffusionGemma represents a qualitative leap, it is necessary to understand what it means to apply diffusion architecture to text generation and why this choice is so different from the classic Large Language Model approach.

Traditional autoregressive models, such as GPT, Claude, or Gemini in their classic form, generate text one token at a time, in sequence. Each token is conditioned on all previous tokens, which makes the process intrinsically serial and difficult to parallelize. Generation time therefore grows roughly linearly with the length of the response.
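The serial nature of this loop can be illustrated with a minimal sketch. The function `toy_next_token` is a hypothetical stand-in for a model's forward pass, not any real API; the point is only that each new token costs one pass that depends on everything generated before it.

```python
import random

VOCAB = ["the", "model", "writes", "text", "quickly"]

def toy_next_token(prefix, rng):
    """Hypothetical stand-in for one forward pass of an autoregressive model.

    A real model would compute a probability distribution conditioned on
    every token in `prefix`; here we simply pick a random word.
    """
    return rng.choice(VOCAB)

def generate_autoregressive(num_tokens, seed=0):
    """Generate tokens one at a time and count the forward passes used."""
    rng = random.Random(seed)
    tokens = []
    forward_passes = 0
    for _ in range(num_tokens):
        # Token i cannot be computed until tokens 0..i-1 exist.
        tokens.append(toy_next_token(tokens, rng))
        forward_passes += 1
    return tokens, forward_passes

tokens, passes = generate_autoregressive(num_tokens=10)
print(len(tokens), "tokens required", passes, "sequential forward passes")
```

In this toy setup the number of sequential passes always equals the number of tokens, which is the linear relationship described above.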

Diffusion models for text adopt a radically different approach, inspired by the diffusion models already well known in the image domain, such as Stable Diffusion or DALL-E. Generation starts from a sequence of completely noisy tokens, and the model iteratively applies a denoising process until the sequence converges toward coherent, meaningful text. This process can be massively parallelized because the model works on the entire sequence simultaneously, not token by token.
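As a rough illustration, the sketch below mimics one family of text diffusion (iterative unmasking, where "noise" is a sequence of mask tokens). It is an assumption chosen for clarity; the article does not describe DiffusionGemma's actual procedure. `toy_denoiser` and `TARGET` are hypothetical stand-ins for a trained model.

```python
import random

MASK = "<mask>"
TARGET = ["diffusion", "models", "refine", "the", "whole", "sequence", "at", "once"]

def toy_denoiser(sequence, rng):
    """Hypothetical stand-in for ONE parallel forward pass.

    Returns a (token, confidence) guess for every position at once.
    """
    return [(TARGET[i % len(TARGET)], rng.random()) for i in range(len(sequence))]

def generate_by_denoising(length, num_steps, seed=0):
    """Start fully masked and commit the most confident guesses each step."""
    rng = random.Random(seed)
    sequence = [MASK] * length
    forward_passes = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(sequence) if tok == MASK]
        if not masked:
            break
        guesses = toy_denoiser(sequence, rng)
        forward_passes += 1
        # Spread the remaining masked positions over the remaining steps.
        remaining_steps = num_steps - step
        quota = -(-len(masked) // remaining_steps)  # ceiling division
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:quota]:
            sequence[i] = guesses[i][0]
    return sequence, forward_passes

sequence, passes = generate_by_denoising(length=8, num_steps=4)
print(" ".join(sequence), "| forward passes:", passes)
```

Here eight tokens are produced in four passes, and the pass count is set by `num_steps` rather than by the sequence length. How many steps a real model needs for acceptable quality is a separate question that this toy does not address.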

The result is a structurally higher inference speed than autoregressive models, especially for long outputs. The reference benchmark positions DiffusionGemma at approximately 700 tokens per second on high-end consumer hardware, compared to the 30-80 tokens per second typical of autoregressive models of comparable size on the same hardware. This is a speed advantage of roughly 9-23x over traditional approaches.
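The implied speedup follows directly from dividing the quoted throughputs. The short calculation below uses only the figures given in the text; the 1,000-token response length is an illustrative assumption.

```python
def speedup_range(fast_tps, slow_tps_low, slow_tps_high):
    """Return (min, max) speedup of fast_tps over a slower throughput range."""
    return fast_tps / slow_tps_high, fast_tps / slow_tps_low

def seconds_for(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a constant throughput."""
    return num_tokens / tokens_per_second

low, high = speedup_range(fast_tps=700, slow_tps_low=30, slow_tps_high=80)
print(f"speedup: {low:.2f}x to {high:.2f}x")

# Illustrative assumption: a 1,000-token response.
for tps in (30, 80, 700):
    print(f"{tps:>3} tok/s -> {seconds_for(1000, tps):.1f} s for 1,000 tokens")
```

Against the fastest autoregressive figure (80 tokens per second) the ratio is 8.75x; against the slowest (30 tokens per second) it is about 23.3x.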

Unlike previous experiments that adapted pre-existing diffusion architectures to text in a hybrid way, DiffusionGemma was natively designed as a textual diffusion model, with specific optimizations for the linguistic domain. Google has also made the model weights available on platforms like Hugging Face, with a license that allows for commercial use, positioning DiffusionGemma as a concrete alternative to proprietary models for companies and developers.

One of the weaknesses historically associated with textual diffusion models was output quality, which was lower than that of autoregressive models. Google claims that DiffusionGemma achieves quality competitive with models of the same parameter class, while maintaining the structural speed advantage.

The launch of DiffusionGemma marks an important moment not only from a technical standpoint but also strategically for the open source artificial intelligence ecosystem. Until today, diffusion models for text had remained predominantly in the domain of academic research, interesting as proof-of-concepts but never mature or accessible enough to become practical tools for developers and companies.

With DiffusionGemma, Google brings this technology out of the research labs and delivers it directly to the open source community. Three consequences follow. Applications that previously required expensive enterprise hardware to operate in real time can now run on consumer GPUs. The availability of the weights allows the community to study and improve the architecture. And the release sets a benchmark that commercial players will find difficult to ignore.

The speed of DiffusionGemma is not just a number for its own sake. It concretely opens up application scenarios that were previously impractical with traditional language models. AI agents that must make decisions quickly in industrial automation environments, assisted algorithmic trading, or high-volume customer service benefit directly from reduced latency and can produce elaborate analyses and responses in fractions of a second.

High-volume batch processing also changes completely. Companies that need to process millions of texts for classification, summarization, translation, or data extraction can drastically reduce computational costs and processing times, making previously cost-prohibitive pipelines sustainable.
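A back-of-the-envelope estimate shows the scale involved. The throughput figures come from the benchmark quoted earlier in the article; the workload (one million documents, 200 output tokens each) and the single-stream, no-batching setup are assumptions made purely for illustration.

```python
def gpu_hours(num_documents, output_tokens_per_doc, tokens_per_second):
    """Hours needed to generate all outputs on one device, one request at a time."""
    total_tokens = num_documents * output_tokens_per_doc
    return total_tokens / tokens_per_second / 3600

# Hypothetical workload: 1,000,000 documents, 200 output tokens each.
for label, tps in [("autoregressive, 30 tok/s", 30),
                   ("autoregressive, 80 tok/s", 80),
                   ("diffusion, 700 tok/s", 700)]:
    hours = gpu_hours(1_000_000, 200, tps)
    print(f"{label:<26} {hours:8.1f} GPU-hours")
```

Under these assumptions the job drops from roughly 690 to 1,850 GPU-hours to about 79. Real serving stacks batch requests, so absolute numbers will differ in practice.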

The structural speed of diffusion models potentially makes them more suitable for execution on limited hardware, paving the way for language models capable of running directly on edge devices, smartphones, or embedded systems without depending on the cloud. Likewise, generating dialogues, environment descriptions, or NPC behaviors in real-time finally becomes practicable for gaming and interactive simulations.

On the software development front, programming assistance tools like GitHub Copilot or Cursor could benefit enormously from an underlying engine capable of suggesting completions and generating code blocks with almost zero perceived latency.

It would be incorrect to present DiffusionGemma as a solution without limitations. Textual diffusion models still present open challenges. Autoregressive models, by generating tokens in sequence, offer natural control over the output structure: it is possible to stop generation at any time, apply token-by-token constraints, and use techniques like beam search. Diffusion models, by working on the entire sequence in parallel, make this type of granular control more complex to implement.
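The kind of control that sequential generation makes easy can be sketched in a few lines. In this toy greedy decoder, `toy_scores` and the digit-only constraint are hypothetical; what matters is that the constraint and the stop check are applied at every single step, before the next token is committed.

```python
def constrained_greedy_decode(score_fn, allowed_fn, stop_token, max_tokens):
    """Greedy decoding with a per-step token constraint and early stopping."""
    tokens = []
    for _ in range(max_tokens):
        scores = score_fn(tokens)       # token -> score, given the prefix
        allowed = allowed_fn(tokens)    # tokens permitted at this step
        candidates = {t: s for t, s in scores.items() if t in allowed}
        if not candidates:
            break
        best = max(candidates, key=candidates.get)
        if best == stop_token:          # generation can halt at any step
            break
        tokens.append(best)
    return tokens

def toy_scores(prefix):
    """Hypothetical model scores that always prefer the word 'about'."""
    n = len(prefix)
    return {"about": 9.0, "4": 3.0 - 2 * n, "2": 2.0 - 0.5 * n, "<eos>": 0.75 * n}

def digits_only(prefix):
    """Constraint: only digits or end-of-sequence may be emitted."""
    return {"4", "2", "<eos>"}

print(constrained_greedy_decode(toy_scores, digits_only, "<eos>", max_tokens=5))
# -> ['4', '2']
```

Without the constraint, the same scores would emit "about" at every step until `max_tokens` is reached. A diffusion model has no equivalent per-token checkpoint, because all positions are refined together.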

Scenarios requiring very long and articulated reasoning, such as multi-step mathematical problems or detailed legal analyses, remain an area where autoregressive models with explicit chain-of-thought tend to perform better. Stochastic sampling controls like temperature, top-k, and top-p must also be rethought for diffusion models, where the denoising process works differently. And interpretability remains more difficult, which complicates error analysis in production.
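For reference, this is how those three controls are commonly applied in the autoregressive setting, where they act on the distribution over a single next token. The sketch is a plain-Python illustration with made-up logits, not code from any particular library.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token from `logits` (token -> float) using common filters."""
    # Temperature rescales logits: below 1 sharpens, above 1 flattens.
    scaled = {t: v / temperature for t, v in logits.items()}
    peak = max(scaled.values())
    exps = {t: math.exp(v - peak) for t, v in scaled.items()}  # stable softmax
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # top-k keeps only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # top-p keeps the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, prob in ranked:
            kept.append((token, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    choices = [t for t, _ in ranked]
    weights = [p for _, p in ranked]
    return rng.choices(choices, weights=weights, k=1)[0]

logits = {"cat": 2.0, "dog": 1.0, "fish": 0.1}  # made-up values
print(sample_token(logits, temperature=0.7, top_k=2, rng=random.Random(0)))
```

Each filter assumes there is exactly one "next token" distribution to reshape. In a diffusion model every position has its own distribution at every denoising step, which is why these controls do not carry over unchanged.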

The launch of DiffusionGemma should be read in the broader context of the competition in open source AI in 2026. Google and Meta are the two players investing the most in high-quality open source models, in a dynamic that some analysts define as a true infrastructure war to win the hearts of global developers.

Meta has focused on the Llama family, high-quality autoregressive models with permissive licenses. Google responds with the Gemma family, of which DiffusionGemma is now the most architecturally innovative member. Google's choice to open up this technology as well, rather than simply publishing an academic paper, suggests a precise strategy: to build an ecosystem of developers loyal to the Google platform, using open source as a tool for distribution and adoption.

For developers interested in experimenting, primary resources include model weights on Hugging Face in the official Google DeepMind repository, technical documentation published alongside the release, example Colab notebooks, and direct integration with Google AI Studio for initial no-code testing. The advice is to start with pure inference tasks, where speed benefits are immediately perceptible, before exploring more complex use cases requiring fine-tuning.

DiffusionGemma is not simply a faster model. It is the practical demonstration that the autoregressive paradigm, dominant in text generation since GPT-2 redefined the field in 2019, is not the only possible path and perhaps not even the optimal one for all applications. With 700 tokens per second, an open source release, and competitive quality, Google has launched a concrete challenge to the dominant architecture. One thing is already clear: the future of language models will not be monolithic, and DiffusionGemma has just opened a door that will not close again.
