For the past three years, the loudest constraint on frontier AI development has not been algorithmic cleverness or data availability. It has been raw compute cost. Training and serving large models is punishingly expensive, and that expense has acted as a moat - keeping serious AI development confined to a handful of well-capitalized labs and hyperscalers. NVIDIA's Vera Rubin platform, confirmed to be in full production at CES 2026, is about to tear down that moat.[1]
The headline number is 10x: Rubin delivers AI tokens at roughly one-tenth the cost of the Blackwell platform it replaces.[1] That figure deserves to be treated seriously, not discounted as marketing hyperbole. NVIDIA arrived at it through codesign - engineering the Rubin GPU, Vera CPU, NVLink 6 switch fabric, ConnectX-9 SuperNICs, BlueField-4 DPUs, and Spectrum-X Ethernet Photonics networking as a single integrated system rather than a collection of discrete components engineered in isolation.[1] When every layer of the stack is optimized against every other layer, compound efficiency gains of this magnitude are not only plausible - they are expected.
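A back-of-envelope sketch makes the compounding argument concrete. The per-layer factors below are hypothetical illustrations, not NVIDIA-published figures; the point is only that modest gains at each layer of a codesigned stack multiply into an order-of-magnitude result.

```python
# Back-of-envelope sketch of how per-layer efficiency gains compound.
# The factors below are hypothetical illustrations, not NVIDIA-published figures.

layer_gains = {
    "GPU compute (NVFP4 throughput)":     2.0,
    "Memory bandwidth (HBM4)":            1.5,
    "Scale-up fabric (NVLink 6)":         1.4,
    "Networking (SuperNICs, Spectrum-X)": 1.3,
    "KV-cache / storage tier":            1.3,
    "Power and cooling":                  1.4,
}

combined = 1.0
for layer, factor in layer_gains.items():
    combined *= factor
    print(f"{layer:<37} x{factor:.1f}  (cumulative x{combined:.1f})")

# With these illustrative factors, cost per token lands near one-tenth of baseline.
print(f"Implied cost per token: ~{1 / combined:.2f}x the Blackwell-era baseline")
```

No single factor has to be heroic; six unremarkable improvements multiplied together are enough to reach the neighborhood of 10x.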
The Rubin GPU itself is a dual-die chip delivering 50 petaflops of NVFP4 inference performance, paired with 288GB of next-generation HBM4 memory at 22 TB/s bandwidth.[2] Against Blackwell, NVIDIA claims 5x inference performance per GPU and 3.5x training throughput.[3] The flagship rack configuration, the Vera Rubin NVL72, pairs 72 Rubin GPUs with 36 Vera CPUs - each carrying 88 custom Arm cores - in a fully liquid-cooled, cable-free modular design.[3]
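Multiplying the cited per-GPU figures out to the rack gives a sense of the scale involved. The sketch below is simple arithmetic on the numbers above, not an NVIDIA-published NVL72 spec sheet.

```python
# Rack-level aggregates implied by the cited per-GPU figures.
# Simple arithmetic, not an NVIDIA-published NVL72 spec sheet.

GPUS_PER_RACK = 72
CPUS_PER_RACK = 36
ARM_CORES_PER_CPU = 88

NVFP4_PFLOPS_PER_GPU = 50   # petaflops of NVFP4 inference compute
HBM4_GB_PER_GPU = 288       # GB of HBM4
HBM4_TBPS_PER_GPU = 22      # TB/s of HBM4 bandwidth

print(f"NVFP4 inference compute: {GPUS_PER_RACK * NVFP4_PFLOPS_PER_GPU / 1000:.1f} exaflops per rack")
print(f"HBM4 capacity:           {GPUS_PER_RACK * HBM4_GB_PER_GPU / 1000:.1f} TB per rack")
print(f"HBM4 bandwidth:          {GPUS_PER_RACK * HBM4_TBPS_PER_GPU / 1000:.2f} PB/s per rack")
print(f"Vera Arm cores:          {CPUS_PER_RACK * ARM_CORES_PER_CPU} per rack")
```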
Scale up to a full DGX SuperPOD and the figures become almost abstract: eight NVL72 racks delivering 576 GPUs, 288 CPUs, approximately 600 TB of fast memory, and 260 TB/s of aggregate NVLink throughput.[2] These are not research curiosities. They are production systems for enterprises and governments building the next generation of AI infrastructure.
Later in 2026, NVIDIA will extend the platform with Rubin CPX, a monolithic die purpose-built for long-context inference. CPX packs 30 petaflops of NVFP4 compute alongside 128GB of GDDR7 memory, and integrates video encode and decode directly on-chip.[4] The NVL144 CPX rack configuration combines Rubin GPUs with CPX accelerators for 8 exaflops of AI compute - 7.5x more than the GB300 NVL72 - with 100 TB of fast memory and 1.7 PB/s of memory bandwidth in a single rack.[4] Jensen Huang was direct about the intent: Rubin CPX is the first GPU built specifically for models that reason across millions of tokens at once.[4]
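To see why a compute-dense part aimed at the context phase matters, the sketch below estimates ideal prefill time for a million-token prompt using the common FLOPs ≈ 2 × parameters × tokens approximation for a dense transformer. The model size and utilization are assumptions, not anything NVIDIA has published, and the result is a floor because quadratic attention cost is excluded.

```python
# Ideal prefill-time estimate for a million-token prompt on one Rubin CPX,
# using the common FLOPs ~= 2 * parameters * tokens approximation for a dense
# transformer. Model size and utilization are hypothetical assumptions; attention
# FLOPs (quadratic in context length) are excluded, so this is a lower bound.

PARAMS = 70e9            # hypothetical 70B-parameter dense model
CONTEXT_TOKENS = 1e6     # million-token prompt
CPX_PFLOPS = 30          # Rubin CPX NVFP4 compute, per the cited figure
UTILIZATION = 0.5        # assumed fraction of peak actually sustained

prefill_flops = 2 * PARAMS * CONTEXT_TOKENS
seconds = prefill_flops / (CPX_PFLOPS * 1e15 * UTILIZATION)

print(f"Prefill compute (matmul only): {prefill_flops:.2e} FLOPs")
print(f"Ideal prefill time on one CPX: ~{seconds:.1f} s at {UTILIZATION:.0%} of peak")
```

Because attention cost grows quadratically with context length, real prefill times run well beyond this floor, which is exactly the workload a prefill-focused accelerator like CPX is aimed at.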
The phrase "extreme codesign" appears throughout NVIDIA's CES materials, and it is worth taking at face value.[1] The efficiency gains Rubin delivers are not attributable to any single component. They emerge from the tight integration of all six platform chips, the NVLink 6 switch fabric running at 3.6 TB/s per GPU, and the new AI-native KV-cache storage tier - NVIDIA's Inference Context Memory Storage Platform - which NVIDIA claims delivers 5x higher tokens per second, 5x better performance per TCO dollar, and 5x better power efficiency for long-context inference.[1]
This kind of vertical integration is NVIDIA's deepest competitive advantage, and it is one that AMD, Intel, and custom silicon efforts at Google, Amazon, and Microsoft cannot replicate on a short timeline. Building competitive chips is one thing. Building a competitive six-chip codesigned platform with mature software ecosystems, established hyperscaler partnerships, and nearly two decades of CUDA lock-in is another entirely.
NVIDIA's GPU Technology Conference runs March 16 to 19, just days away.[5] Jensen Huang told Korean media that announcements will include "a chip that will surprise the world" - and credible reports suggest NVIDIA will use the event not only to provide a Rubin production update, but to unveil the Feynman architecture, the generation after Rubin.[6] If accurate, this is NVIDIA sending a deliberate message to hyperscalers evaluating alternative silicon: the roadmap runs further ahead than any competitor can see.
The timing of the Rubin production ramp matters here. With the chips confirmed to be in full production at CES in January and systems shipping in the second half of 2026, hyperscaler deployments will begin in earnest before year-end.[3] The cost economics of inference will start shifting this year, not in some theoretical future. Organizations still building infrastructure assumptions around Blackwell-era token costs will find those assumptions outdated faster than they expect.
The conventional wisdom holds that cheaper compute democratizes AI. That is true as far as it goes. But the more immediate effect of a 10x inference cost reduction is that it makes already-powerful incumbents more powerful. OpenAI, Google DeepMind, Anthropic, and Meta will absorb Rubin capacity first, allowing them to serve larger models at lower cost - extending their lead over challengers who lack the capital to deploy Rubin at scale.
The deeper disruption will come at the application layer. When inference costs fall by an order of magnitude, entire categories of AI application that were previously uneconomical become viable: always-on AI agents running complex multi-step reasoning, real-time video analysis at scale, enterprise software that reasons over entire codebases rather than snippets. That is the market NVIDIA is building for - and it is a substantially larger market than the one Blackwell served.
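A concrete, if hypothetical, workload shows the shift. The prices and token volumes below are illustrative placeholders; only the 10x ratio comes from NVIDIA's claim.

```python
# Illustrative agent-workload economics. The baseline price and token volumes
# are hypothetical placeholders; only the 10x cost ratio comes from the claim above.

BASELINE_USD_PER_M_TOKENS = 10.0                          # hypothetical Blackwell-era cost
RUBIN_USD_PER_M_TOKENS = BASELINE_USD_PER_M_TOKENS / 10   # the claimed 10x reduction

AGENTS = 1_000                          # hypothetical fleet of always-on agents
TOKENS_PER_AGENT_PER_DAY = 5_000_000    # hypothetical multi-step reasoning volume

def daily_cost(usd_per_m_tokens: float) -> float:
    return AGENTS * TOKENS_PER_AGENT_PER_DAY / 1e6 * usd_per_m_tokens

print(f"Blackwell-era daily cost: ${daily_cost(BASELINE_USD_PER_M_TOKENS):>9,.0f}")
print(f"Rubin-era daily cost:     ${daily_cost(RUBIN_USD_PER_M_TOKENS):>9,.0f}")
```

A workload that was a line item worth scrutinizing becomes a rounding error, which is how previously uneconomical categories turn viable.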
"The faster you train AI models, the faster you can get the next frontier out to the world. This is your time to market. This is technology leadership." - Jensen Huang, CES 2026[1]
Huang's framing is characteristically bold, but the underlying argument is sound. Rubin is not a faster version of what came before. It is a platform designed for a qualitatively different scale of AI deployment. Whether that deployment fulfills its promise will depend on software, on model architecture, and on whether the organizations building on Rubin can translate raw compute into genuine capability. But the hardware case is already made. This is the most consequential AI infrastructure announcement in years, and the industry should be treating it that way.