Google AI Releases Multi-Token Prediction (MTP) Drafters for Gemma 4: Delivering Up to 3x Faster Inference Without Quality Loss

By Topline Newsroom
2 min read · Source: www.marktechpost.com

The story

Google has introduced Multi-Token Prediction (MTP) drafters for the Gemma 4 family, using speculative decoding to deliver up to a 3x inference speedup without quality loss.

From the source

Large language models are getting incredibly powerful, but let's be honest: their inference speed is still a massive headache for anyone trying to use them in production. Google just launched Multi-Token Prediction (MTP) drafters for the Gemma 4 family.

Today's large language models operate autoregressively: they produce exactly one token at a time, sequentially. Every single token generation requires loading billions of model parameters from VRAM (video RAM) into the compute units. This makes the process memory-bandwidth bound: the bottleneck is not the raw compute power of the GPU, but the speed at which weights can be streamed from memory to the compute units.
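The memory-bandwidth bound can be sketched with back-of-envelope arithmetic: at batch size 1, every decode step must stream the full weight set from memory. The figures below are illustrative assumptions (a hypothetical 9B-parameter model in bf16 on a GPU with ~1 TB/s of memory bandwidth), not measured Gemma numbers:

```python
# Back-of-envelope: why single-token decoding is memory-bandwidth bound.
# All figures are illustrative assumptions, not measured Gemma 4 numbers.

params = 9e9             # hypothetical 9B-parameter model
bytes_per_param = 2      # bf16 weights
bandwidth = 1.0e12       # ~1 TB/s HBM bandwidth (assumed)

bytes_per_token = params * bytes_per_param        # weights streamed per decode step
max_tokens_per_s = bandwidth / bytes_per_token    # ceiling set by memory, not FLOPs

print(f"{bytes_per_token / 1e9:.0f} GB moved per token")
print(f"<= {max_tokens_per_s:.0f} tokens/s at batch size 1")
```

Under these assumptions the GPU moves 18 GB of weights for every single token, capping throughput at roughly 56 tokens/s regardless of how much compute sits idle.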

The consequence is a significant latency bottleneck: compute sits underutilized while the system is busy just moving data around. What makes this especially inefficient is that the model applies the same amount of computation to a trivially predictable token (say, completing "Actions speak louder than…") as it does to generating a complex logical inference. There's no mechanism in standard autoregressive decoding to exploit how easy or hard the next token is to predict.
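Speculative decoding exploits exactly this asymmetry: a small drafter proposes several tokens cheaply, and the large target model verifies them in a single parallel pass, accepting the longest agreeing prefix. A minimal greedy-verification sketch, where `draft_next` and `target_next` are hypothetical stand-ins for the drafter and target (each maps a token sequence to its greedy next token):

```python
# Toy sketch of speculative decoding with greedy verification.
# `draft_next` / `target_next` are hypothetical stand-ins for a small
# drafter and a large target model.

def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft k tokens cheaply, then accept the longest prefix the target agrees with."""
    # 1) Drafter proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Target checks every drafted position; in a real system this is
    #    one batched forward pass, not k sequential calls.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)   # target's correction ends the run
            break
    else:
        accepted.append(target_next(ctx))  # bonus token when all drafts match
    return accepted

# Tiny deterministic demo models (toy, not real LMs):
target = lambda ctx: len(ctx) % 7
out_all = speculative_step([0], target, target, k=4)   # perfect drafter
```

With a perfect drafter, one verification pass yields k + 1 tokens (here 5) instead of 1; when the drafter diverges, the target still emits its own correct token, so output quality matches plain autoregressive decoding exactly.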

Who and what

Key names and topics in this story: Google AI, Multi-Token Prediction (MTP), drafters, Gemma 4.

Where to follow next

#ai #google #multi-token-prediction #drafters #gemma
