Google AI Releases Multi-Token Prediction (MTP) Drafters for Gemma 4: Delivering Up to 3x Faster Inference Without Quality Loss

The story

Google introduces MTP drafters for the Gemma 4 family, using speculative decoding to achieve up to a 3x speedup. The post first appeared on MarkTechPost.
From the source
Large language models are getting incredibly powerful, but, let's be honest, their inference speed is still a massive headache for anyone trying to use them in production. Google just launched Multi-Token Prediction (MTP) drafters for the Gemma 4 family.
Today's large language models operate autoregressively: they produce exactly one token at a time, sequentially. Every single token generation requires loading billions of model parameters from VRAM (video RAM) into the compute units. This process is memory-bandwidth bound: the bottleneck is not the raw compute power of the GPU, but the speed at which data can be transferred from memory to the compute units.
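The memory-bandwidth bound can be made concrete with a back-of-the-envelope calculation: if every decoded token must stream all model weights from VRAM, peak throughput is roughly bandwidth divided by model size in bytes. The parameter count, precision, and bandwidth below are illustrative assumptions, not Gemma 4 specifications.

```python
def max_tokens_per_sec(param_count: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when weight loading dominates.

    Assumes every token requires reading every parameter once from
    VRAM, which is the regime standard autoregressive decoding sits in.
    """
    model_bytes = param_count * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# A hypothetical 27B-parameter model in bf16 (2 bytes/param) on a GPU
# with ~3.3 TB/s of HBM bandwidth:
rate = max_tokens_per_sec(27e9, 2, 3300)
print(f"~{rate:.0f} tokens/sec ceiling")  # ~61 tokens/sec
```

The point of the sketch is that this ceiling is independent of how much arithmetic the GPU can do; drafting several tokens and verifying them in one pass amortizes the weight reads across multiple tokens.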
The consequence is a significant latency bottleneck: compute sits underutilized while the system is busy just moving data around. What makes this especially inefficient is that the model applies the same amount of computation to a trivially predictable token (say, completing "Actions speak louder than…") as it does to a complex logical inference. There's no mechanism in standard autoregressive decoding to exploit how easy or hard the next token is to predict.
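Speculative decoding is the mechanism that exploits exactly this asymmetry: a cheap drafter proposes several tokens, and the large target model verifies them, accepting the longest matching prefix. Because the draft tokens are all checked against the target's own choices, output quality is unchanged. The sketch below is a toy with stub models and serial verification for clarity (in practice the target scores all draft positions in one batched forward pass); it is not the actual Gemma 4 MTP drafter API.

```python
def drafter(prefix: list[int], k: int) -> list[int]:
    # Stand-in cheap model: proposes the next k tokens.
    # (Stub rule: counts up from the last token.)
    return [prefix[-1] + i + 1 for i in range(k)]

def target_argmax(prefix: list[int]) -> int:
    # Stand-in expensive model's greedy next-token choice (stub rule).
    return prefix[-1] + 1

def speculative_step(prefix: list[int], k: int = 4) -> list[int]:
    """One round of draft-then-verify with greedy acceptance."""
    draft = drafter(prefix, k)
    accepted: list[int] = []
    ctx = list(prefix)
    for tok in draft:
        if target_argmax(ctx) == tok:
            # Target agrees with the draft: accept the token for free.
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: take the target's token instead and stop.
            accepted.append(target_argmax(ctx))
            ctx.append(accepted[-1])
            break
    else:
        # All k drafts accepted; the target's pass also yields one bonus token.
        accepted.append(target_argmax(ctx))
    return accepted

print(speculative_step([0]))  # [1, 2, 3, 4, 5]: five tokens per target pass
```

When the drafter agrees with the target often (as it does for easy, predictable stretches of text), several tokens are emitted per expensive forward pass, which is where the reported up-to-3x speedup comes from.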
Who and what
Key names and topics in this story: Google AI, Multi-Token Prediction (MTP), drafters, Gemma 4.
Where to follow next
- Read the full piece at www.marktechpost.com
- More from our AI & prompts coverage

Related stories

Google's Gemma 4 open AI models use "speculative decoding" to get up to 3x faster
Up to 3x the speed with no loss of quality—is it too good to be true?

Apple to pay $250M to settle lawsuit over Siri's delayed AI features
Apple has agreed to pay $250 million to settle a class action lawsuit for overpromising the arrival of Siri's AI features.

At TechCrunch Disrupt 2026, all your M&A questions will be answered
Leaders from Coinbase, M13, and Mignano Law Group talk about how M&A is an early-stage strategy at TechCrunch Disrupt 2026. Register to hear this live.

Google updates AI search to include expert advice from Reddit and other web forums
While citing web forums and discussion boards can help users find answers to more niche queries, this design choice could also prove chaotic.