Abstract

On May 5, 2026, Google released the "Multi-Token Prediction (MTP) Drafter," an auxiliary model for its open-weight LLM "Gemma 4" that accelerates inference by up to 3x, under the Apache 2.0 license. Just as browser Ajax transformed UX through prefetching, MTP overturns the premise of "generating tokens one by one" and dramatically changes responsiveness by speculatively grabbing future tokens in batches. Silicon Valley VCs are positioning this as a symbolic move that validates the "investment thesis for the inference layer," and massive capital continues to flow into inference-optimization startups such as Inferact, Together AI, and Fireworks AI.


The Big Picture: On May 5, Google promoted "predictive prefetching" to a standard feature

On May 5, 2026, Google DeepMind released the "Multi-Token Prediction (MTP) Drafter" for the Gemma 4 family through its official blog post "Accelerating Gemma 4: faster inference with multi-token prediction drafters." Gemma 4, which was unveiled on April 2 of the same year on the Google Open Source Blog as "Gemma 4: Expanding the Gemmaverse with Apache 2.0," surpassed 60 million downloads within a few weeks of launch, making it the open-weight LLM with the most momentum right now. MTP is its "next move": it accelerates an already-deployed Gemma 4 by up to 3x without retraining the main model and without additional hardware.

The released auxiliary model suite supports all four sizes of Gemma 4 (E2B for mobile, E4B for edge, the 26B A4B Mixture-of-Experts for consumer GPUs, and the 31B Dense for workstations). Distribution has begun on Hugging Face and Kaggle, and major inference runtimes — Hugging Face Transformers, MLX, vLLM, SGLang, Ollama, and LiteRT-LM in the Google AI Edge Gallery — already have "Day 0" support. In response to the official Google release, vLLM announced on its official X account, "🚀 Day-0 MTP support for Gemma4 now available at vLLM," and simultaneously released dedicated Docker images for Hopper and Blackwell (vllm/vllm-openai:gemma4-0505-cu129 / cu130).

As for the numbers, while Google emphasizes "up to 3x," overseas media outlets that did their own reporting are communicating a more realistic range. Outlets such as Decrypt, MarkTechPost, Eastern Herald, The Decoder, and claypier report that the up-to-3x figure is a "best case" obtained by running the 26B MoE on an NVIDIA RTX PRO 6000 with optimal batch size on conversational tasks, that on consumer GPUs (RTX 4090-class) it settles at 1.8–2.5x, and that on Apple Silicon (M3 Max / M4 Max-class) it lands at 1.6–2.2x. These figures are more modest but still practical.

Why we call it the "LLM-era Ajax": flipping the timeline through prefetching and validation

Let me first explain the technical essence at one level of abstraction. Why did I call it "LLM-era Ajax" in the title? Ajax (Asynchronous JavaScript and XML) transformed UX by asynchronously prefetching and partially updating the parts of a page a user was likely to request, instead of making the browser wait for a full page reload. The essential change that MTP brings to LLM inference is similar: a lightweight model runs ahead and proposes several tokens before the heavy main model has determined which tokens actually come next.

Standard Transformer inference is autoregressive: producing each single token requires reading billions to hundreds of billions of parameters from memory. Even though the GPU's compute units have plenty of headroom, memory bandwidth becomes the bottleneck and the compute units sit idle. The paper "Fast Inference from Transformers via Speculative Decoding" (accepted at ICML 2023), published in 2022 by Yaniv Leviathan, Matan Kalman, and Yossi Matias of Google Research, starts from exactly this observation. The paper showed that by having a small 60M-parameter T5 draft for T5-XXL (11B), one could achieve a 2–3× speedup "without changing the output distribution at all," and the technique has since taken root as the industry-standard acceleration layer.
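To make the bandwidth bottleneck concrete, here is a back-of-the-envelope sketch. It assumes that each decoded token requires streaming every weight once from memory and ignores the KV cache and compute time. The 31B parameter count comes from the Gemma 4 lineup described earlier, while the 2 bytes per parameter (bf16) and the 1 TB/s bandwidth are illustrative assumptions, not measured values.

```python
def max_decode_tokens_per_sec(param_count: float,
                              bytes_per_param: float,
                              bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = param_count * bytes_per_param
    return bandwidth_bytes_per_sec / bytes_per_token


# Illustrative assumptions: 31B dense weights in bf16, 1 TB/s memory bandwidth.
ceiling = max_decode_tokens_per_sec(31e9, 2, 1e12)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/sec")
```

Under these assumptions the ceiling applies per forward pass rather than per token, which is why checking several drafted tokens in one pass can raise throughput without reading the weights more often.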

MTP is the latest form in this lineage. The MTP drafter in Gemma 4 is a lightweight 4-layer model built on "Q-only attention," with a major refinement: it shares the KV cache of the target model (the main body). The concrete mechanism works as follows. First, the drafter predicts N future tokens in succession (typically 4–8), reusing the main model's final-layer activations and input embedding table. Those N tokens are then verified in parallel by the main Gemma 4 in a single forward pass. Tokens that match the main model's own prediction are accepted wholesale; at the first point of divergence, the rest of the draft is discarded and the main model itself emits one correct token (so every verification pass yields at least one token and is never wasted). The drafter then resumes from the corrected position, and the cycle repeats.

It is easier to picture with a concrete example. Given the prompt "The weather in Tokyo is," the drafter prefetches four tokens like "sunny," ", and tomorrow," "cloudy," "later rainy." The main model, which would normally have to run four forward passes, evaluates these four candidates all at once in a single pass. If three tokens match, then 3 tokens + 1 correction token from the main model itself = 4 tokens total are finalized in essentially one step. This is what Google's official blog means by "the target model accepts the entire sequence in a single forward pass — and even generates an additional token of its own in the process."
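The draft-and-verify cycle can be sketched with toy models. This is a minimal greedy-decoding illustration, not Gemma 4's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins that map a token sequence to the next token, and the verification loop calls the target once per position for readability, whereas a real runtime obtains all of those predictions from one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_target(context: Sequence[int]) -> int:
    """Hypothetical 'main model': counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: Sequence[int]) -> int:
    """Hypothetical drafter: agrees with the target except right after token 3."""
    return 9 if context[-1] == 3 else (context[-1] + 1) % 10


def speculative_step(context: Sequence[int], draft_next: NextToken,
                     target_next: NextToken, num_draft: int) -> List[int]:
    """One cycle: draft num_draft tokens, keep the agreeing prefix, add one target token."""
    draft: List[int] = []
    for _ in range(num_draft):
        draft.append(draft_next(list(context) + draft))

    emitted: List[int] = []
    for token in draft:
        expected = target_next(list(context) + emitted)
        if token != expected:
            emitted.append(expected)  # correction token from the target
            return emitted
        emitted.append(token)
    # Every drafted token matched, so the target contributes one bonus token.
    emitted.append(target_next(list(context) + emitted))
    return emitted


def generate(prompt: Sequence[int], draft_next: NextToken, target_next: NextToken,
             num_draft: int, max_new_tokens: int) -> Tuple[List[int], int]:
    """Repeat speculative steps; return the tokens and the number of verify passes."""
    output = list(prompt)
    passes = 0
    while len(output) - len(prompt) < max_new_tokens:
        output.extend(speculative_step(output, draft_next, target_next, num_draft))
        passes += 1
    return output[:len(prompt) + max_new_tokens], passes


tokens, passes = generate([0], toy_draft, toy_target, num_draft=4, max_new_tokens=8)
print(tokens, "in", passes, "verification passes")
```

In the first cycle the toy drafter gets three tokens right and the fourth wrong, so the step emits three accepted tokens plus one correction, mirroring the "3 + 1" case in the weather example.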

What deserves attention is that this is not "speedup at the cost of accuracy." Because the main model always performs the final verification, the output distribution is mathematically identical to the case without MTP. Hugging Face's official blog "Welcome Gemma 4" states this plainly: "Same outputs as target model with no quality loss and no changes to reasoning behavior." Being a "lossless" acceleration layer is the decisive point that distinguishes MTP from quantization or distillation.
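Why verification keeps the distribution unchanged can be checked numerically. The sketch below uses the acceptance rule from the speculative decoding paper cited above, stated here as an assumption about how sampling-based verification works in general rather than as Gemma 4's exact code: a drafted token x is accepted with probability min(1, p(x)/q(x)), and on rejection a token is resampled from the normalized positive part of p - q. Here `p` is the target distribution, `q` is the drafter distribution, and both are made-up three-token examples.

```python
from fractions import Fraction
from typing import List


def speculative_output_distribution(p: List[Fraction], q: List[Fraction]) -> List[Fraction]:
    """Exact distribution of the emitted token under accept-or-resample verification."""
    # Probability of drafting x and accepting it: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1 - sum(accepted)
    # On rejection, resample from the positive part of (p - q), renormalized.
    residual = [max(Fraction(0), pi - qi) for pi, qi in zip(p, q)]
    residual_total = sum(residual)
    if residual_total == 0:  # drafter already matches the target exactly
        return accepted
    return [a + reject_mass * r / residual_total for a, r in zip(accepted, residual)]


target = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # p, hypothetical
drafter = [Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)]  # q, hypothetical
print(speculative_output_distribution(target, drafter) == target)
```

Exact fractions are used so that the equality with the target distribution is a true identity rather than a floating-point approximation.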

Breaking down "What is the drafter looking at?" a bit more

What beginners tend to find hard is the intuition: why can a small drafter produce "near-correct" guesses that track the main model's probability distribution so closely? There are two implementation-level keys to this.

The first is "sharing of the embedding table." The drafter references the same input embedding table as the Gemma 4 main model. Because tokens such as "dog," "猫" (cat), and "東京" (Tokyo) are handled in exactly the same vector space as the main model, lexical discrepancies cannot occur in principle. The second is "utilization of target activations." The drafter receives, as input, the activation vector output by the final layer of the main model, and uses a lightweight 4-layer Transformer to produce predictions for N future tokens. In other words, the main model already holds quite strong clues about "what comes next," and the drafter performs look-ahead by inheriting those clues, so it is unlikely to stray contextually.

In the case of Gemma 4, especially for the edge-oriented E2B (effective 2.3B) / E4B (effective 4.5B) models, an additional technique called "embedder clustering" is incorporated, which narrows down the 256K vocabulary to 4K contextually "likely" clusters. As a result, even on devices with limited memory and computation such as smartphones, the drafter's logit computation does not become a bottleneck. In Google AI for Developers' document "Speed-up Gemma 4 with Multi-Token Prediction," the drafter is described as one in which "the model groups similar tokens into clusters."

The token acceptance rate is also an important metric. According to verification by buildfastwithai, the Gemma 4 MTP drafter shows 70–90% on conversational tasks, and somewhat lower values on code generation tasks. While code has lower randomness, it contains many long-distance dependencies (closures and syntax dozens of tokens ahead), so there are more situations where the drafter alone cannot fully predict. In fact, when running Gemma 4 MTP on vLLM, developer blogs such as dasroot and kaitchup introduce an operational practice of setting the recommended parameter "num_assistant_tokens" to 3–4 for code, 5–8 for conversation, and 10–15 for long-form prose, and dynamically adjusting it according to the acceptance rate via the "heuristic" schedule.
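The trade-off between acceptance rate and draft length can be sketched numerically. The first function uses the expected-tokens formula from the speculative decoding literature, which assumes each drafted token is accepted independently with the same probability; real acceptance is context-dependent, so treat the output as a rough guide. The second function, `next_draft_length`, is a hypothetical adjustment rule written for illustration (grow after a fully accepted draft, shrink otherwise) and is not claimed to be what any particular runtime implements.

```python
def expected_tokens_per_pass(acceptance_rate: float, num_draft: int) -> float:
    """Expected tokens emitted per verification pass, including the target's own token."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(num_draft + 1)
    return (1.0 - acceptance_rate ** (num_draft + 1)) / (1.0 - acceptance_rate)


def next_draft_length(current: int, num_accepted: int,
                      minimum: int = 1, maximum: int = 15) -> int:
    """Hypothetical rule: +2 when the whole draft was accepted, otherwise -1."""
    if num_accepted >= current:
        return min(maximum, current + 2)
    return max(minimum, current - 1)


for rate in (0.7, 0.8, 0.9):
    row = [round(expected_tokens_per_pass(rate, n), 2) for n in (4, 8)]
    print(f"acceptance={rate}: tokens/pass for draft lengths 4 and 8 -> {row}")
```

The formula shows diminishing returns: at lower acceptance rates, lengthening the draft adds little, which is consistent with the shorter draft lengths recommended for code.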

The Lineage of DeepSeek, Meta, and EAGLE: MTP Is "The Next Main Battleground"

As stated on Google's official blog, the MTP-style approach is not a sudden breakthrough but the latest step in an accumulated lineage of research. In April 2024, Meta released "Better & Faster Large Language Models via Multi-token Prediction" (arXiv:2404.19737) by Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve, showing that by having the model predict "the next N tokens" via independent output heads during training, a 13B model solved 12% more problems on HumanEval and 17% more on MBPP than comparable next-token prediction models, and that a model predicting 4 tokens simultaneously achieved up to 3x faster inference. DeepSeek adopted MTP in its V3, pre-training on 14.8 trillion tokens with a multi-token prediction depth of 1 (each position also predicts one additional future token), and notes in its arXiv technical report that the MTP1 acceptance rate exceeds 80% at inference time, yielding approximately a 1.8x improvement in generation throughput.

The DeepSeek type, which incorporates MTP into the objective function during training, and the Google type, which only adds an auxiliary drafter at inference time, are called by similar names but take different approaches. In the case of Google Gemma 4, the main body's training itself is completed with standard next-token prediction, and a lightweight drafter is separately trained and attached afterward. This affords significant operational flexibility, allowing acceleration to be retrofitted to already-trained 31B Dense and 26B MoE models without additional retraining.

Other related technologies include Tianle Cai et al.'s "MEDUSA" (an approach that grows multiple prediction heads directly on the main body), Yuhui Li et al.'s "EAGLE-3" (an external draft head that fuses three layers of early, middle, and late features), and "Lookahead Decoding" (parallel n-gram generation with a 2D window). According to SyncSoft.AI's blending explanation, EAGLE-3 maintains an acceptance rate of 0.75 to 0.85 in chat-style settings, earning an additional 1.7–2.1x speedup over MEDUSA and 1.5–1.6x over Lookahead. In fact, even for Gemma 4, the community had already trained EAGLE-3 drafters ahead of the official MTP release, publishing them as thoughtworks/Gemma-4-31B-Eagle3 and RedHatAI/gemma-4-31B-it-speculator.eagle3. Articles from Eastern Herald and claypier also point out that Google's current official release is positioned as "finally returning to the community, in official form, the MTP heads that had been removed when the initial weights of Gemma 4 were released."

Reading the Benchmarks: Where Does the 3x Come From, and What's the Real-World Multiplier?

What media outlets unanimously focused on was the validity of the "up to 3x" figure that Google touts. On this point, cross-referencing multiple sources has provided a relatively clear picture.

In high-end workstation environments, the numbers look good. Measurements on NVIDIA DGX Spark / GB10 posted to the NVIDIA Developer Forum recorded 108.78 tokens/sec for a single request (2.66x against the 40.85 tokens/sec baseline without MTP) when combining Gemma 4 26B A4B-it (FP8 quantized) with γ=4 MTP. With 8 concurrent requests, aggregate throughput reached 674 tokens/sec, scaling to 16.5x for the server as a whole while maintaining roughly 2x against conventional performance from an individual user's perspective. Validation data from vLLM-side PR #41745 (filed by Luciano Martins, merged on May 6, 2026) also reports large throughput improvements on H100: 130% for E2B, 178% for E4B, and 319% for the 31B Dense.
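The ratios in the DGX Spark measurement can be reproduced from the quoted throughput figures. The only assumption added here is that aggregate throughput is split evenly across the 8 concurrent requests when estimating the per-user rate.

```python
# Figures quoted from the NVIDIA Developer Forum measurement above.
baseline_tps = 40.85        # single request, no MTP
mtp_single_tps = 108.78     # single request, MTP with gamma = 4
mtp_aggregate_tps = 674.0   # 8 concurrent requests, MTP
concurrent_requests = 8

single_speedup = mtp_single_tps / baseline_tps
aggregate_ratio = mtp_aggregate_tps / baseline_tps
# Assumption: throughput is shared evenly between the concurrent requests.
per_user_tps = mtp_aggregate_tps / concurrent_requests
per_user_speedup = per_user_tps / baseline_tps

print(f"single-request speedup: {single_speedup:.2f}x")
print(f"aggregate vs. baseline: {aggregate_ratio:.1f}x")
print(f"per-user speedup at 8 concurrent: {per_user_speedup:.2f}x")
```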

On the other hand, the perceived speedup on laptop-class machines and MacBooks is somewhat more modest. As Decrypt notes, on Apple Silicon with batch size 1 (that is, an individual user's chat use case), Gemma 4 26B MoE stays at around 1.5–1.7x. This is because the MoE (Mixture-of-Experts) architecture activates different experts per token, so different expert weights must be loaded for each position in the drafted token sequence, which diminishes the memory-bandwidth savings. Raising the batch size to 4–8 and bundling parallel requests brings it back up to about 2.2x. The 31B Dense model has no such routing constraint, so the consensus shared by the Hugging Face blog and the MLX community is that it more reliably delivers around 2x even on Apple Silicon.

Another point noted by both Google's official blog and MarkTechPost is that "the premise is the instruction-tuned (-it) model, not the base model." AI-Muninn's hands-on testing reports that attaching a drafter to the base model actually slows generation to 0.61x of the baseline, a caveat that Google's official announcement does not particularly emphasize.

Silicon Valley VCs' Outlook: Conviction That the "Inference Layer" Is the Next Main Battleground

The Silicon Valley VC community is reading this Google move not as a standalone product update, but as a sign that a new market category called the "inference layer" is maturing. The report "Welcome to LLMflation — LLM inference cost is going down fast," released by Guido Appenzeller of Andreessen Horowitz (a16z), demonstrates with numbers that the cost of LLM inference at equivalent performance is falling at a pace of 10x per year — from $60 per 1M tokens for GPT-3-class models in November 2021, down to $0.06 per 1M tokens for Llama 3.2 3B as of 2025 (a 1000x drop in three years) — and lists "reducing computation and memory bandwidth requirements through software optimization" as one of six pillars driving that decline. MTP is precisely the flagship example of that "bandwidth improvement through software optimization."

Backing up this thesis with capital, in January 2026 Inferact — founded by the lead vLLM maintainers (Simon Mo, Woosuk Kwon, Kaichao You, Roger Wang) — closed a $150 million (approx. ¥22.5 billion) seed round co-led by a16z and Lightspeed Venture Partners, launching at an $800 million (approx. ¥120 billion) valuation. Sequoia Capital, Altimeter Capital, Redpoint Ventures, and Databricks Ventures also joined the round. According to TechCrunch's reporting, a16z stated its investment thesis explicitly: "Just buying more H100s can't break past the 30–40% GPU utilization wall. It is the software layer that will unlock the remaining 70% of surplus compute." The vLLM that Inferact is seeking to commercialize is the very engine that just implemented Day 0 support for Gemma 4 MTP — a perfect alignment between thesis and real-world product.

Drawing equally intense investor attention are the inference clouds Together AI and Fireworks AI. In February 2025, Together AI raised a $305 million (approx. ¥45.75 billion) Series B co-led by General Catalyst and Prosperity7, vaulting its valuation to $3.3 billion (approx. ¥495 billion). The company officially explains that "we deliver performance by combining speculative decoding, quantization, and FP8 kernels," and is well-positioned to rapidly integrate MTP-style drafters into its own inference platform. Fireworks AI completed a $250 million (approx. ¥37.5 billion) Series C at a $4 billion (approx. ¥600 billion) valuation in October 2025. According to analysis by Sacra, the company's ARR reached $315 million (approx. ¥47.25 billion) as of February 2026, an explosive 416% year-over-year growth rate.

In Y Combinator's "Summer 2026 Requests for Startups," General Partner Diana Hu has explicitly solicited "chips dedicated to agent loops." She states, "Today's GPUs only achieve 30–40% utilization on agent workloads (loops, tool calls, branches, backtracking, long-context retention). We want chips designed for fast context switching between models, native speculative decoding, and KV caches that span the entire execution graph" — a clear hardware-side echo of the trend. MTP is the core technology underpinning that "native speculative decoding."

In April 2026, Sequoia Capital announced a $7 billion (approx. ¥1.05 trillion) expansion fund for AI / late-stage investment. In its reports "AI in 2026: A Tale of Two AIs" and "2026: This is AGI," it cited IDC's forecast that inference demand in the agent era will balloon 1000x by 2027 and declared that "structural decline in inference cost and demand explosion are unfolding in parallel." Taken together, reporting from Bloomberg and finsmes indicates that Sequoia is aggressively picking up, at stages ranging from seed to Series B, not only the inference-optimization-focused Inferact and Fireworks AI but also startups (such as Pipeshift) that sell the speculative-decoding technology on which MTP is built.

The impact on the enterprise is also beginning to show in the numbers. The AICC report finds that "as of April 2026, the effective enterprise token unit price (blended) has fallen to $6.07 per 1M tokens, down 67% from $18.40 a year earlier." Fortune Business Insights projects that the AI inference market will grow from $103.73 billion (approx. ¥15.6 trillion) in 2025 to $117.8 billion (approx. ¥17.7 trillion) in 2026, reaching $312.64 billion (approx. ¥46.9 trillion) by 2034. For the edge AI market, Grand View Research forecasts $24.91 billion (approx. ¥3.7 trillion) in 2025 → $29.98 billion (approx. ¥4.5 trillion) in 2026 → $118.69 billion (approx. ¥17.8 trillion) by 2033 (CAGR 21.7%), and this release — in which the edge-targeted E2B / E4B run on a lightweight MTP — lands as a major tailwind right in the middle of that curve.

Reporting Tone: The Origin of "Lossless 3x" and a Sober Analysis

There are subtle differences in tone across how each media outlet reports the story. Eastern Herald, MarkTechPost, AIToolly, Pulse2.0, and Neuronad have broadly adopted a tone that straightforwardly echoes Google's official message of "3x faster with no quality degradation." In contrast, more technically-oriented outlets such as The Decoder (part of the Heise group), Decrypt, claypier, and buildfastwithai emphasize that the 3x figure is merely a ceiling under "specific hardware, specific batch sizes, and specific workloads," and that in real-world environments 1.7–2.2x should be considered the "expected baseline." In the Hacker News thread (item 48024540), veteran developers contributed numerous on-point explanations such as "this is essentially the same as self-batching against your own predicted future path" and "it's a mechanism for filling idle compute units on GPUs where memory bandwidth is the bottleneck," with voices praising Gemma 4's token efficiency standing alongside more measured assessments noting that it falls short of Claude or GPT in code generation and complex tool calling.

The community reaction on Reddit's r/LocalLLaMA is also noteworthy. According to Startup Fortune, on the day of the May 5 release, the subreddit accumulated 463 upvotes and 128 comments within three hours, and successful operation reports on llama.cpp, Ollama, vLLM, and LM Studio came in one after another that same day. The dominant assessment was that "this is the biggest impact on local inference on the same hardware since training-time MTP was introduced in DeepSeek V3" and "more than just a new model release, this is a move that will be a tipping point for the practical adoption of local inference."

Coverage in the Japanese-speaking sphere is still limited, but major tech media outlets have begun picking it up via translations of Google's official blog, with explanations increasingly mindful of the "practical realization of on-device agents on Pixel TPU and Apple Silicon," particularly in the context of edge and on-premise deployment. In "Bring state-of-the-art agentic skills to the edge with Gemma 4," released simultaneously by the Google Developers Blog, operational examples are presented in which Gemma 4 E2B/E4B runs multi-step autonomous agents fully offline in combination with a new feature called Agent Skills, and Tris Warkentin (Google DeepMind's Product Lead) posted on X (formerly Twitter) that "the local AI experience truly begins from here."

The Reach of Impact: Chat, Agents, and On-Device AI

From a technical standpoint, MTP is fundamentally effective in situations where "memory bandwidth is the bottleneck and the compute units are sitting idle." This directly impacts three particular use cases.

The first is long-form generation: chat tasks that produce lengthy outputs in one go, such as summarization and translation. In cases like having an AI write an entire blog post, format meeting minutes, or generate a long presentation draft, the perceived speed can more than double. The second is voice interfaces. Where response text generation by the LLM had been the latency-critical path in the speech synthesis pipeline, the time to first response feels 30% to 50% shorter. The release notes for Google AI Edge Gallery and the LiteRT-LM documentation cite concrete figures showing decoding more than twice as fast on mobile GPUs, so voice and conversational apps on Pixel and other Android devices may advance rapidly.

The third is "agent workloads," which Silicon Valley VCs have positioned as the single biggest theme of 2026. As symbolized by Sequoia's declaration that "2026 is the year of long-horizon agents" and Y Combinator's Diana Hu soliciting "chips dedicated to agent loops," LLM call latency accumulates in loops that span dozens of steps and include tool calls, branching, and backtracking. If every call becomes twice as fast, the LLM share of a 10-step agent's latency is halved, which saves the equivalent of five full calls per run; the end-to-end gain stays at or below 2x because tool-execution time is unaffected, but the absolute time saved grows with every step added to the loop. Furthermore, if the KV cache can be shared between the drafter and the main model and across steps, context reloading can be suppressed. Set alongside the May 2026 news that Anthropic's "Claude Opus 4.6 Fast Mode" is delivering 2.5x throughput and that OpenAI's GPT-5.3-Codex has been sped up by 25%, this shows that the entire industry is simultaneously converging on "dedicated engineering methods for delivering the same intelligence faster and cheaper."
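As a sanity check on loop-level latency, here is a minimal sketch based on Amdahl's law: only the fraction of wall-clock time spent inside LLM calls benefits from faster decoding. The 3 seconds per LLM call and 2 seconds per tool call are hypothetical numbers chosen for illustration.

```python
def end_to_end_speedup(llm_fraction: float, llm_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the LLM share of latency gets faster."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be in [0, 1]")
    if llm_speedup <= 0:
        raise ValueError("llm_speedup must be positive")
    return 1.0 / ((1.0 - llm_fraction) + llm_fraction / llm_speedup)


def loop_latency(steps: int, llm_seconds: float, tool_seconds: float,
                 llm_speedup: float = 1.0) -> float:
    """Total latency of a sequential agent loop with one LLM call and one tool call per step."""
    return steps * (llm_seconds / llm_speedup + tool_seconds)


# Hypothetical 10-step agent: 3 s per LLM call, 2 s per tool call.
before = loop_latency(10, 3.0, 2.0)
after = loop_latency(10, 3.0, 2.0, llm_speedup=2.0)
print(f"before: {before:.0f}s, after: {after:.0f}s, speedup: {before / after:.2f}x")
print(f"Amdahl check: {end_to_end_speedup(0.6, 2.0):.2f}x")
```

The absolute time saved scales with the number of steps, while the ratio is capped by the per-call speedup, so the benefit is largest for loops dominated by generation rather than tool execution.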

Risks and Caveats from a VC Perspective: Not Everyone Can Reap the 3x Benefit

From a Silicon Valley VC perspective, three unresolved issues have been identified regarding the adoption of MTP.

First, the unevenness of hardware dependency. Because the effectiveness of MTP is strongly dependent on the ratio between memory bandwidth and compute density, while there are substantial benefits on high-end machines such as NVIDIA H100 / RTX PRO 6000 and the upper tiers of Apple Silicon, the effect is limited on true low-end devices such as the Raspberry Pi 5, as well as on microcontrollers with shallow memory hierarchies. According to the LiteRT-LM documentation, Gemma 4 E2B decoding on the Raspberry Pi 5 runs at 7.6 tokens/sec on CPU, and rises to 31 tokens/sec on the NPU of the Qualcomm Dragonwing IQ8. To be honest, how well MTP works on NPUs still depends on the implementation of each SoC vendor. When investors evaluate "On-Device AI" startups, they need to be aware that hardware selection and MTP compatibility have a significant impact on the numbers.

Second, the reduced speedup on code generation workloads. In the verifications by AI-Muninn and kaitchup, drafter acceptance rates drop on code generation tasks and wasted speculative computation increases, so the realized speedup falls well short of the best-case 3x (output quality is unaffected, since the main model still verifies every token). Code-assist products such as Anthropic Claude Code, GitHub Copilot, Cursor, and Replit Agent may not enjoy the benefits of MTP as straightforwardly as conversational systems. When VCs conduct due diligence in this area, it has become increasingly important to check whether the benchmarks are centered on chat use cases.

Third, the competition over ecosystem standardization. Multiple schools are evolving in parallel — Google's official "Gemma 4 MTP Drafter," together with community-driven approaches such as EAGLE-3, MEDUSA, Lookahead, and the DeepSeek-style training-time MTP — and the balance of power could shift depending on which the inference runtime side (vLLM, SGLang, MLX, llama.cpp, TensorRT-LLM) favors as a "first-class citizen." The fact that vLLM gave preferential treatment to Google's drafter on Day 0 hints at the existence of a Google × vLLM × Inferact alliance, which is also an interesting development for decoding a16z's portfolio strategy.

When What Happens: Roadmap for the Next 6–18 Months

As for the most recent developments, first, around May–June 2026, the major release of the vLLM v0.20.x series is expected to incorporate Gemma 4 MTP into its stable version, and discussions in GitHub Issue #42005 and PR #41745 indicate that it has reached the stage where official Docker images will be provided for both Hopper and Blackwell. By the end of the year, MTP is also expected to reach production quality in MLX and llama.cpp, with kaitchup previewing on their blog that "MTP in llama.cpp will move from beta to GA."

In the medium term, as Sequoia Capital described 2026 as "a year of delays," delays in data center expansion will collide with delays in the AGI timeline, and the importance of reducing inference costs will grow even further toward 2027. Given IDC's forecast that "inference demand will grow 1000x by 2027," methods like MTP that "handle more with the same hardware" carry strong significance as a structural answer to GPU supply constraints. Gartner goes further, predicting that by 2030, the inference cost of trillion-parameter LLMs will drop by more than 90% compared to 2025 for GenAI providers.

As for the long term, for all the frontier model candidates (DeepSeek V4, a rumored next-generation model in late 2026 said to feature three-dimensional attention across space, time, and modality; Meta Llama 5; xAI Grok 5; and the next version of Mistral Large), it is becoming the established course to "build in MTP or its evolved forms from the design stage." NVIDIA has unveiled "DeepSeek V4 with NVIDIA Blackwell" on its official technical blog, indicating a trend of optimizing the Blackwell-generation tensor cores for speculative decoding. If the "chips dedicated to agent loops" startups currently being recruited by Y Combinator come to market, the benefits of MTP will be amplified from both the hardware and software sides.

For Silicon Valley VCs, this Google MTP release is seen less as "an additional commitment to Gemma 4 itself" and more as a powerful endorsement from Google of the "inference optimization layer" thesis they have been betting on since 2024. a16z's LLMflation report, the ¥22.5 billion seed in Inferact, the massive additional investments in Together AI and Fireworks AI, and Sequoia Capital's new ¥1 trillion-scale fund all stand on the logic that "the flashy winners of model training and the unglamorous but massive winners of inference implementation are separate things." MTP is precisely the symbol of that "unglamorous but effective method," and the fact that anyone can now verify it on Gemma 4, an accessible open-weight model, has in one stroke made the existence of the inference-layer market visible. That is the summary as of May 2026.


Sources