Between April 22 and 24 at Google Cloud Next 2026, Google announced three interconnected things.
Silicon: The eighth-generation TPU split into two distinct products for the first time. 8t (codename Sunfish, physically designed by Broadcom) is for training; 8i (codename Zebrafish, physically designed by MediaTek) is for inference. Both target TSMC 2nm and ramp in late 2027. The real capacity gain on 8i is in HBM, going from 192GB to 288GB with bandwidth at 8.6 TB/s, actually higher than 8t’s 6.5 TB/s, with the full KV cache still living in HBM. On-chip SRAM also tripled to 384MB (8t has 128MB), but 384MB cannot hold a full long-context KV cache (Llama 3.1-70B at 128K context is around 10GB even at INT4). Its job is to hold the active working set being processed by attention computation, reducing the back-and-forth between HBM and compute. The Cloud blog line “hosting massive KV Caches entirely on silicon” is marketing language; the technical deep dive page actually says “host a larger KV Cache entirely on silicon, significantly reducing the idle time of the cores during long-context decoding,” which is the more accurate phrasing. The Boardfly topology is purpose-built for high-concurrency reasoning inference (Hyperframe Research’s architecture analysis is more measured than the official version).
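A quick sanity check on that KV-cache arithmetic, using the Llama 3.1-70B architecture numbers (80 layers, 8 KV heads under GQA, head dim 128); a minimal sketch, not tied to any TPU API:

```python
# KV-cache footprint: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Llama 3.1-70B at 128K context, INT4 (0.5 bytes/element):
print(kv_cache_gb(80, 8, 128, 128 * 1024, 0.5))  # ~10.7 GB -- roughly 28x the 8i's 384MB SRAM
```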
Software: TorchTPU was officially announced on April 24, letting PyTorch workloads run directly on TPU without the PyTorch/XLA bridge layer that has accumulated years of technical debt. The underlying OpenXLA / StableHLO / PJRT / libtpu stack is fully reused; what's new is mainly the front-end translation layer and a roadmap of "avoid SPMD constraints + reduce recompilation + ship a precompiled kernel library." It's the first time Google has staffed PyTorch on TPU as a first-class citizen. Reuters had reported the previous December that Meta is collaborating closely with Google on this.
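For context on what the bridge being removed looks like, here is the classic lazy-tensor pattern that today's PyTorch/XLA path requires; a minimal sketch of the existing torch_xla API, not of TorchTPU (whose interface isn't public):

```python
import torch
import torch_xla.core.xla_model as xm  # the bridge layer TorchTPU is meant to retire

device = xm.xla_device()               # lazy XLA device, unlike the eager 'cuda' device
x = torch.randn(4, 4, device=device)   # ops are recorded into a graph, not executed
y = (x @ x).relu()
xm.mark_step()                         # only now: trace -> StableHLO -> PJRT -> libtpu
print(y)                               # materializing the result forces execution
```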
Demand: Announced April 24, Google's investment agreement with Anthropic. The structure: $10B in immediately, up to $40B total, valuation locked at $350B, and 5GW of TPU capacity. This is incremental on top of the 3.5GW Broadcom announced on April 6. Anthropic also signed a $100B / 10-year / 5GW deal with Amazon and a $30B Azure compute / 1GW Vera Rubin deal with Microsoft+NVIDIA.
The three questions below determine how much weight these three announcements actually carry.
The real ambition behind this push is not to plug TPU’s gaps but to rebuild a compute stack that runs in parallel to CUDA. Whether that goal is credible depends on what NVIDIA’s moat actually consists of and whether TPU has a viable answer at every layer.
NVIDIA’s moat has four layers: CUDA software plus developer habit, the de facto framework standards (PyTorch / vLLM / TensorRT), rack-scale system integration (NVL72 / NVL576), and upstream supply chain (TSMC advanced nodes + CoWoS + HBM3E locked up for years). For the past decade the market has shorthanded all of this as “the CUDA moat,” but Jensen himself has quietly switched the framing. The week before Cloud Next, on Dwarkesh Patel’s podcast and at the April GTC keynote, what he kept emphasizing was the supply chain moat and rack-scale system, not CUDA. This is a tone shift worth recording. NVIDIA itself has moved on from a single-point moat narrative to a composite one, which amounts to publicly admitting that CUDA alone is no longer sufficient.
TPU now has an answer at every layer, but every answer carries its own uncertainty.
At the software layer, TorchTPU is first-class and the technical direction is right, but there is no public head-to-head benchmark. Google itself only talks in perf/$ and cost-per-token, never raw head-to-head throughput. The best third-party data we can cite is Trelis Research's May 2025 comparison on Gemma 3 27B: 8x TPU v6e vs 1x H200, with TTFT slightly faster on TPU and throughput slightly worse, but on cost per million tokens NVIDIA came out 4-5x cheaper. Amin Vahdat himself claims the next-generation vLLM TPU backend is 2-5x faster than the prototype, which would compress the gap to 1-2x, but that is still not independently verified. Any sentence saying "TPU performance matches GPU" today must be tagged as a claim, not a verified fact.
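For readers who want to reproduce this kind of cost comparison from published numbers, the arithmetic is just hourly price over token rate. The prices and throughputs below are illustrative placeholders, not the Trelis measurements:

```python
def usd_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    # hourly price / tokens generated per hour, scaled to one million tokens
    return hourly_usd / (tokens_per_second * 3600) * 1e6

# Hypothetical figures, for the shape of the comparison only:
print(usd_per_million_tokens(10.0, 1200))  # 1x H200-class box: ~$2.3 / M tokens
print(usd_per_million_tokens(22.0, 900))   # 8x v6e-class slice: ~$6.8 / M tokens (~3x)
```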
At the system layer, Google has Boardfly + ICI + the Collectives Acceleration Engine, and rack-scale has always been Google’s strength. SemiAnalysis in InferenceX v2 flatly states that Trainium, TPU, and NVIDIA are the only three vendors with real rack-scale system deployments today, and AMD’s MI455X UALoE72 won’t ramp until 2027 Q2.
At the supply chain layer, Google + Broadcom landed N3 a generation ahead of NVIDIA + AWS, TPUv7 was already in production in 2025, and v8 continues on N3/2nm (SemiAnalysis "The Great AI Silicon Shortage" has the full timeline). This is the first time in a decade that TPU has held a process-node lead over NVIDIA. But this is also the layer where NVIDIA can push back most easily: Jensen keeps repeating that advanced-node capacity, CoWoS, and HBM3E have been locked up for years, and the fact that the TPU 8i ramp won't reach volume until 2027 is consistent with that.
At the business model layer, Google for the first time treats TPU as hardware that can be purchased outright. SemiAnalysis reported a detail that hasn't gotten enough attention: roughly 400,000 Ironwood (v7) chips are being sold directly to Anthropic via Broadcom, with Fluidstack handling on-site setup and the hardware physically located in Anthropic's own datacenters. From v1 through v6, TPU was rent-only; every customer had to come in through GCP. Once that rule has one exception, the probability that Meta, xAI, and SSI follow rises sharply.
Putting these four layers together: TPU really might become another NVIDIA, but not soon. Three uncertainties of different weight stack on top of each other: missing performance data, a lagging supply-chain ramp, and a business model that has only just opened. My judgment is that within 18-24 months TPU will not replace NVIDIA, but it will cause NVIDIA to lose pricing power in the frontier-inference niche.
For Google to clone the NVIDIA stack, it must solve the migration cost of moving developers off CUDA. Seeing the lever clearly matters because the PyTorch-layer story is easy to misread.
PyTorch backend neutrality has been talked about for five years and has barely moved. The reason is that training-stack migration cost is too high: FSDP / DDP / checkpoint formats / memory layout are all deeply coupled to CUDA. In its TPUv7 long-form, SemiAnalysis gave a very specific diagnosis: Google historically only provided first-class support for the JAX/XLA:TPU stack, while PyTorch on TPU was a second-class citizen relying on lazy-tensor graph capture, with no support for PyTorch's native distributed APIs (no DTensor, FSDP2, or DDP). The Hacker News engineer complaint that "PyTorch/XLA is a swamp of undocumented behavior and bugs that silently hangs after 8 hours of training" is typical. HuggingFace's own optimum-tpu project entered maintenance mode in early 2026, its README directly redirecting users to vllm-project/tpu-inference or HF Accelerate, kicking the ball back to Google. A third-party community bridge withdrawing while the chip vendor steps in directly is the most important difference between this round of PyTorch-on-TPU and every previous attempt over the past decade.
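To make "native distributed APIs" concrete, this is the FSDP2 pattern (PyTorch >= 2.6) that the old lazy-tensor TPU path could not run; a minimal sketch with a toy model, and the process-group backend is illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2, PyTorch >= 2.6

# Launch with torchrun; real runs use the accelerator's collective backend, not gloo.
dist.init_process_group("gloo")

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
)
for layer in model:
    if any(p.requires_grad for p in layer.parameters()):
        fully_shard(layer)  # shard each parameterized submodule across ranks
fully_shard(model)          # root wrap schedules per-submodule all-gather/reshard
```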
But the migration of the PyTorch training stack will take 18-24 months to propagate to small and mid-sized application companies. The real short-term lever is vLLM.
vLLM became the de facto standard for multi-backend inference in 2025-2026, covering NVIDIA / AMD / Intel / Trainium / TPU, with 17k+ stars and SGLang right behind it. This means lock-in at the inference layer has already loosened technically; what remains is just commercial terms and SLAs. Google's attack on the inference market (TPU 8i + Anthropic's claude.ai traffic) is actually riding the vLLM open-source lever. That lever lives in the open-source community, not inside NVIDIA, and NVIDIA can't dismantle it on its own. NVIDIA's response is also telling: spending $20B to acquire Groq's LPU team (including former CEO Jonathan Ross) and turning it into the LPX rack amounts to admitting that the dedicated-inference-chip niche needs a dedicated answer, not just GB300 NVL72 system integration.
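The backend neutrality is visible in the code itself: the same vLLM serving script runs unchanged whether the installed build targets CUDA, ROCm, or TPU, because the backend is selected at install time rather than in application code. The model name below is just an example:

```python
from vllm import LLM, SamplingParams

# Backend (CUDA / ROCm / TPU) is determined by which vLLM build is installed,
# so this script is identical across hardware targets.
llm = LLM(model="google/gemma-3-27b-it")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```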
So the lever ranking is: vLLM is affecting deployment today (the inference-layer crack has already opened) → TorchTPU will affect training-stack choice in 18-24 months (developer migration curve just starting) → Pallas (the TPU-native kernel language) determines the long-term ceiling for deep optimization (still mostly Google internal + Anthropic-style deep partners).
NVIDIA’s counterplay maps onto the same ranking: rack-scale system integration for inference (GB300 NVL72: 50× tokens/W, 35× lower cost-per-token) + Groq absorbed into LPX rack to grab dedicated inference ASICs + supply chain locking capacity. But none of these three counterplays directly removes the vLLM lever.
Pulling the camera back from Google alone to the entire non-NVIDIA camp, what’s holding up this whole narrative is essentially Anthropic, one lab.
AWS Project Rainier is 100% Anthropic; the 1M Trainium2 cluster has no second external production customer. The largest external TPU customer is also Anthropic. SemiAnalysis’s optimistic TPU pieces talk about “the more TPU Meta/SSI/xAI/OAI/Anthropic buy” as a list of names, but Anthropic is always at the front. Bank of America’s October note used a more direct phrasing: “skepticism about Trainium outside of Anthropic.” Cerebras has 86% of revenue concentrated in two UAE entities, which is sovereign-driven rather than hyperscaler-driven growth. Groq has been absorbed by NVIDIA into the LPX rack. AMD landed the Meta 6GW MI450 design win, but rack-scale won’t volume-ramp until 2027 Q2.
So the more accurate description of this Cloud Next 2026 three-layer move is “the three-layer attack of the Google + Anthropic alliance.” Anthropic is simultaneously the technical enabler and the commercial cover. SemiAnalysis emphasizes that Anthropic has “strong engineering resources and ex-Google compiler experts” and can invest in custom kernels to bring TPU MFU close to NVIDIA levels. That solves Google’s chicken-and-egg problem of “no external customer knows how to use TPU well” by proving a non-Google team can squeeze performance out of TPU. At the same time the $40B+ commitment gives the entire Broadcom / TSMC / MediaTek chain a capex signal. Without an Anthropic-sized buyer, TorchTPU, the 8t/8i split, and the multi-year Broadcom contracts all fall apart.
The real cash-landing schedule is more fragile than the announcement headlines suggest. The disclaimer in Broadcom's 8-K is explicit: "consumption depends on Anthropic's continued commercial success," and they're still in discussions with "operational and financial partners." That amounts to Broadcom telling the SEC publicly that the 3.5GW is conditional: it requires Anthropic to keep raising capital, plus third-party compute financing (likely a CoreWeave-style SPV structure), to cash out. Add that $30B of Google's $40B is milestone-conditional, and the entire "$40B + 5GW" commitment translates to only $10B of real cash landing within 2026, with the rest contingent.
The hardest counterargument to refute: the compute Google prizes most is not left for customers. Chosun reported that Google holds about 23% of global AI compute (5M H100-equivalents), of which 3.8M is its own TPUs, but the vast majority is captive to Search/Ads/Gemini. The 3.5GW Anthropic gets won't come online until 2027, and TPU 8i's own ramp also drags into 2027. Meanwhile, Google at Cloud Next 2026 also announced an NVIDIA Vera Rubin A5X deployment on GCP of 80,000 GPUs per site and 960,000 across sites, so Google itself is also a big NVIDIA buyer. This is a real contradiction. My judgment is that the goal of Google's three-layer move is not to take all of NVIDIA's market, but to lock down Anthropic plus a few sovereigns plus one or two hyperscale labs, making NVIDIA lose pricing power in the frontier-inference long tail, while Google Cloud itself becomes a neutral venue where both TPU and NVIDIA collect rent from enterprise agents. That framing is closer to the actual intent than "Google wants to push NVIDIA out."
The remaining big players who could really add votes to the non-NVIDIA camp fall into two categories. First, the second-tier frontier labs (Mistral, xAI, SSI, the ByteDance ecosystem): whether a second TPU customer at Anthropic-scale appears within 2026 determines the follow-through speed of this 5GW commitment. Second, sovereigns (Saudi PIF, UAE G42, parts of Europe), whose dependence on NVIDIA gives them geopolitical motivation to actively diversify, but whose engineering capacity and application scale are not yet enough to support a new ecosystem.
Once the three questions above are answered, the historical review finally carries real weight. Google has been pushing TPU for nearly a decade, and seven generations have not really shaken NVIDIA. Why this round should be different determines the credibility of the three judgments above.
Why the past decade didn’t succeed has three layers. First, technically, PyTorch on TPU was hard to use, with a second-class software stack. Second, commercially, TPU was rent-only and enterprise customers didn’t want to be locked into a single cloud. Third, organizationally, TPU teams only cared about internal KPIs, external sales were chronically deprioritized, and on HN there are Google sales reps admitting “internal demand is so big we have no bandwidth to push externally.” The three things reinforce each other into a self-perpetuating cold-start loop: few users → little optimization → fewer users.
This round all three changed at once. First, technically, TorchTPU is PyTorch first-class, and HuggingFace exiting maintenance forced the chip vendor to step in directly. Second, commercially, the Anthropic deal opened the door on “physical delivery to customer datacenter.” Third, organizationally, SemiAnalysis explicitly wrote that Google has “revised their software strategy for externally-facing customers and has already made major changes to their TPU team’s KPIs,” investing major engineering effort into native PyTorch / vLLM / SGLang TPU support. Three changes happening simultaneously is a first in a decade.
There's also a forgotten antecedent in TPU v4i. The Jouppi team's internally deployed v4i in 2020 was already a rehearsal of the training/inference split, and the 2021 ISCA paper stated explicitly that an inference DSA needs multi-tenancy, air cooling, and large SRAM to hold down P99 latency. The 8i in v8 is not a new invention; it productizes externally a paradigm that had been validated internally for five years. This helps answer the "why now": technically Google was ready long ago; what it was waiting for was agentic inference becoming explicit external demand. Judging by Anthropic's Claude API traffic curve, the Cursor negative-margin discussion, and the spread of SaaStr's "Inference is the new S&M," that external demand only became real in Q1 2026.
The process-node lead is also a first in a decade. SemiAnalysis's "The Great AI Silicon Shortage" confirms Google + Broadcom's TPU landed N3 a generation ahead of NVIDIA + AWS, with TPUv7 already in production and v8 continuing on N3/2nm. NVIDIA is still moving from 4NP Blackwell to N3P Rubin. Specialization (the training/inference split) + process lead + software surrender (Google finally meeting developers on PyTorch's terms): these three happened simultaneously for the first time.
CUDA lock-in for the first time shows quantifiable loosening; the loosening is mostly in inference, not training; the lever is vLLM, not PyTorch. The real threat of Google’s three-layer move is not market share (NVIDIA datacenter AI revenue will likely still be +50% YoY through 2026), it’s making NVIDIA lose pricing power in frontier inference, forcing NVIDIA to shift its moat narrative from CUDA to supply chain and system. The entire non-NVIDIA camp is propped up by the Google + Anthropic alliance, so the realization of this narrative depends on whether Anthropic’s commercial curve continues to rise sharply through 2027. If Anthropic’s growth disappoints, the 5GW commitment, Broadcom’s capex, and TorchTPU’s ecosystem investment will all be under pressure simultaneously, and by then NVIDIA will have tightened its supply chain moat by another notch.
For builders working on LLM apps, agents, or API wrappers, three things are worth doing in the next 6-12 months. First, don't migrate to TPU now; instead abstract the inference layer into a prefill/decode-separable, model-pluggable architecture (a sketch follows this paragraph). The lever for small and mid-sized builders is the abstraction layer; direct TPU migration only has positive ROI for players spending over $500K/month on inference, see the Midjourney case, where monthly inference cost dropped from $2.1M to $700K. Second, restart the long-context features that were cut for cost reasons over the past 12 months. TPU 8i's combination of 288GB HBM, 8.6 TB/s bandwidth, and the 384MB SRAM buffer makes long-context decode's unit-token cost decline noticeably steeper than short-context's, and a16z's LLMflation curve of 10x annual decline in equal-capability inference cost has held for the past two years, so full-codebase context, persistent agent memory, and multi-document deep research will cross the viability threshold between 2026 H2 and 2027 H1. Third, shift the pricing model from per-seat to outcome-based, and reclassify inference spend from COGS to R&D or growth (Cursor's negative margin from $100M ARR all the way to $2B ARR shows that per-seat pricing plus uncontrollable token costs is not sustainable).
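One possible shape for that abstraction layer; all names here (InferenceBackend, PrefillHandle) are illustrative, not a standard API:

```python
from dataclasses import dataclass
from typing import Iterator, Protocol

@dataclass
class PrefillHandle:
    """Opaque reference to a prompt whose KV cache lives on some backend."""
    backend_id: str
    request_id: str

class InferenceBackend(Protocol):
    def prefill(self, prompt: str) -> PrefillHandle: ...
    def decode(self, handle: PrefillHandle, max_tokens: int) -> Iterator[str]: ...

def generate(backend: InferenceBackend, prompt: str, max_tokens: int = 512) -> str:
    # Application code touches only this interface; swapping GPU serving for TPU,
    # or splitting prefill and decode across separate pools, happens behind it.
    handle = backend.prefill(prompt)
    return "".join(backend.decode(handle, max_tokens))
```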
Three indicators worth tracking monthly: GPU/TPU configuration changes at Stanford CS / CMU / MIT-tier schools, the merge speed of TPU backend PRs in vLLM, and whether Meta formally deploys TorchTPU in production. A qualitative shift in any one of these three variables means the 18-24 month developer migration curve has finished its first leg.
This piece synthesizes material from five parallel research lines; key sources are linked inline. Full evidence chain and original citations are in the 01-05 research notes in the same directory, plus appendix 00 (user-provided TorchTPU internal information and group discussion).
Main secondary sources: SemiAnalysis, The Information, Stratechery, Bloomberg, Reuters, CNBC, TechCrunch, Hyperframe Research, a16z LLMflation, Epoch AI, Google Cloud blog, developers.googleblog.com, Anthropic news, Broadcom 8-K, multiple relevant Hacker News threads, Dwarkesh Patel podcast 2026-04-15.
Main data gaps: (1) TorchTPU at production scale vs CUDA head-to-head benchmark; (2) the actual 8t/8i ratio within 5GW and the year-by-year ramp curve; (3) whether Meta / OpenAI / xAI show migration signals after 2026-04-24. These three variables are the most direct observable indicators of how fast this narrative will materialize.