

Cerebras Systems and Amazon Web Services have announced a partnership to deploy Cerebras CS-3 systems inside AWS data centers, making wafer-scale inference available through Amazon Bedrock for the first time.[1] The deal pairs two radically different hardware philosophies: AWS's Trainium chip, designed for compute-dense prefill workloads, with Cerebras's Wafer-Scale Engine, which stores entire model weights in on-chip SRAM for ultra-fast decode. The resulting disaggregated architecture delivers five times more high-speed token capacity in the same hardware footprint.
The impetus for the partnership reflects a shift in how AI is used. Conversational chat generates modest token volumes; agentic coding generates approximately 15 times more tokens per query and demands low-latency output to keep developers and agents productive.[1] On a GPU cluster, inference runs both the prefill phase (processing the input prompt) and the decode phase (generating the output tokens) on the same hardware. Prefill is compute-bound; decode is memory-bandwidth-bound, requiring the full model weights to be fetched from memory for every token generated. Running both phases on the same chip forces a suboptimal compromise.
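The memory-bandwidth bound on decode can be made concrete with back-of-envelope arithmetic: each generated token requires streaming the full model weights from memory, so bandwidth divided by weight size gives a ceiling on single-stream decode speed. The sketch below uses illustrative figures (a hypothetical 70B-parameter model at one byte per parameter, and rough HBM-class versus wafer-scale-SRAM-class bandwidths), not vendor specifications; real deployments batch requests across many chips, which changes observed per-query numbers.

```python
# Why decode is memory-bandwidth-bound: every output token requires one
# full pass over the model weights, so peak single-stream decode speed
# is roughly (memory bandwidth) / (weight bytes).
# All figures below are illustrative assumptions, not vendor specs.

def max_decode_tokens_per_sec(weight_bytes: float,
                              bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on single-stream decode: one full weight read per token."""
    return bandwidth_bytes_per_sec / weight_bytes

weights = 70e9  # hypothetical 70B-parameter model, 1 byte/param (FP8)

hbm_bound = max_decode_tokens_per_sec(weights, 3.35e12)  # ~HBM-class bandwidth
sram_bound = max_decode_tokens_per_sec(weights, 21e15)   # ~on-chip SRAM class

print(f"HBM-bound decode ceiling:  {hbm_bound:,.0f} tokens/s")
print(f"SRAM-bound decode ceiling: {sram_bound:,.0f} tokens/s")
```

The gap between the two ceilings, several orders of magnitude, is the arithmetic behind pairing a bandwidth-rich decode engine with a compute-rich prefill engine.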
Disaggregated inference solves this by assigning each phase to the hardware it suits. Trainium, with its dense compute cores, handles prefill, computes the KV cache, and transmits results via Amazon's Elastic Fabric Adapter (EFA) high-speed interconnect. The Cerebras WSE takes the KV cache and performs decode exclusively, using its on-chip SRAM, which offers thousands of times greater memory bandwidth than GPU HBM, to generate thousands of output tokens per second versus the hundreds typical of GPU-based decode.[1]
Cerebras already serves OpenAI, Cognition, and Meta, reaching up to 3,000 tokens per second on supported models.[1] The AWS partnership extends that speed to AWS's global customer base at managed-cloud scale, a level of access previously unavailable except through direct engagement with Cerebras.
AWS Vice President of Compute and ML Services David Brown described the combination as yielding inference "an order of magnitude faster and higher performance than what's available today" for demanding workloads, though that characterization applies specifically to the disaggregated decode phase rather than all inference tasks.[1] Both companies acknowledge that the disaggregated configuration is optimal for large, stable workloads with predictable prefill/decode ratios. Customers with mixed workloads will likely use both aggregated and disaggregated configurations, routing traffic to whichever is appropriate. AWS plans to support both.
The announcement positions Cerebras squarely against NVIDIA in the one dimension where GPU dominance is most vulnerable: raw decode speed. NVIDIA's H100 and B200 systems run inference at hundreds of tokens per second per query; Cerebras's wafer-scale approach achieves multiples of that by eliminating the memory-bandwidth bottleneck that constrains conventional accelerators. For applications where speed is the product, including real-time coding assistants, interactive agents, and low-latency customer applications, the performance gap translates directly into user experience and developer productivity.
The service is expected to become available through Bedrock in the coming months. Its arrival will give AWS customers a disaggregated inference option that no other major hyperscaler currently offers, and it will test whether the market for ultra-fast inference at cloud scale is large enough to justify the deep technical undertaking both companies are committing to.