

Cerebras Systems and Amazon Web Services have announced a partnership to deploy Cerebras CS-3 systems inside AWS data centers, making wafer-scale inference available through Amazon Bedrock for the first time.[1] The deal pairs two radically different hardware philosophies: AWS's Trainium chip, designed for compute-dense prefill workloads, with Cerebras's Wafer-Scale Engine, which stores entire model weights in on-chip SRAM for ultra-fast decode. The resulting disaggregated architecture delivers five times more high-speed token capacity in the same hardware footprint.
The impetus for the partnership reflects a shift in how AI is used. Conversational chat generates modest token volumes; agentic coding generates approximately 15 times more tokens per query and demands low-latency output to keep developers and agents productive.[1] On a GPU cluster, inference runs both the prefill phase (processing the input prompt) and the decode phase (generating the output tokens) on the same hardware. Prefill is compute-bound; decode is memory-bandwidth-bound, requiring the full model weights to be fetched from memory for every token generated. Running both phases on the same chip forces a suboptimal compromise.
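The memory-bandwidth bound on decode can be made concrete with back-of-envelope arithmetic: each generated token requires streaming the full model weights from memory, so bandwidth divided by weight size gives a ceiling on single-stream decode speed. The sketch below uses illustrative figures (a hypothetical 70B-parameter model at one byte per parameter, and rough HBM-class versus wafer-scale-SRAM-class bandwidths), not vendor specifications; real deployments batch requests across many chips, which changes observed per-query numbers.

```python
# Why decode is memory-bandwidth-bound: every output token requires one
# full pass over the model weights, so peak single-stream decode speed
# is roughly (memory bandwidth) / (weight bytes).
# All figures below are illustrative assumptions, not vendor specs.

def max_decode_tokens_per_sec(weight_bytes: float,
                              bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    return bandwidth_bytes_per_sec / weight_bytes

weights = 70e9  # hypothetical 70B-parameter model, 1 byte/param (FP8)

hbm_bound = max_decode_tokens_per_sec(weights, 3.35e12)  # ~HBM-class bandwidth
sram_bound = max_decode_tokens_per_sec(weights, 21e15)   # ~on-chip SRAM class

print(f"HBM-bound decode ceiling:  {hbm_bound:,.0f} tokens/s")
print(f"SRAM-bound decode ceiling: {sram_bound:,.0f} tokens/s")
```

The gap between the two ceilings, several orders of magnitude, is the arithmetic behind pairing a bandwidth-rich decode engine with a compute-rich prefill engine.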
Disaggregated inference solves this by assigning each phase to the hardware it suits. Trainium, with its dense compute cores, handles prefill, computes the KV cache, and transmits results via Amazon's Elastic Fabric Adapter (EFA) high-speed interconnect. The Cerebras WSE takes the KV cache and performs decode exclusively, using its on-chip SRAM, which offers thousands of times greater memory bandwidth than GPU HBM, to generate thousands of output tokens per second versus the hundreds typical of GPU-based decode.[1]
Cerebras already serves OpenAI, Cognition, and Meta, reaching up to 3,000 tokens per second on supported models.[1] The AWS partnership extends that speed to AWS's global customer base at managed-cloud scale, a level of access previously unavailable except through direct engagement with Cerebras.
AWS Vice President of Compute and ML Services David Brown described the combination as yielding inference "an order of magnitude faster and higher performance than what's available today" for demanding workloads, though that characterization applies specifically to the disaggregated decode phase rather than all inference tasks.[1] Both companies acknowledge that the disaggregated configuration is optimal for large, stable workloads with predictable prefill/decode ratios. Customers with mixed workloads will likely use both aggregated and disaggregated configurations, routing traffic to whichever is appropriate. AWS plans to support both.
The announcement positions Cerebras squarely against NVIDIA in the one dimension where GPU dominance is most vulnerable: raw decode speed. NVIDIA's H100 and B200 systems run inference at hundreds of tokens per second per query; Cerebras's wafer-scale approach achieves multiples of that by eliminating the memory-bandwidth bottleneck that constrains conventional accelerators. For applications where speed is the product, including real-time coding assistants, interactive agents, and low-latency customer applications, the performance gap translates directly into user experience and developer productivity.
The service is expected to become available through Bedrock in the coming months. Its arrival will give AWS customers a disaggregated inference option that no other major hyperscaler currently offers, and it will test whether the market for ultra-fast inference at cloud scale is large enough to justify the deep technical undertaking both companies are committing to.