
Omniscient

AI intelligence briefings, analysis, and commentary — delivered in broadsheet form.

By Noah Ogbi



Omniscient Media — made by ForeverBuilt, LLC.
© 2026 ForeverBuilt, LLC. All rights reserved.


AI Research

Vol. 1 · Saturday, March 21, 2026

Moonshot AI's Attention Residuals Challenge a Core Assumption of Modern LLMs


Noah Ogbi · Updated Mar 22, 2026
AttnRes · transformers · scaling




Since the original ResNet paper in 2015, residual connections have been one of the most durable assumptions in deep learning: each layer adds its transformation to the running total of all previous layers, creating a "gradient highway" that enables stable training at depth. Moonshot AI's Kimi team is now challenging whether that assumption was ever the right one. Their new paper, Attention Residuals, proposes replacing fixed residual accumulation with something far more selective.[1]

The Problem With Fixed Weights

Standard residual connections are, at their core, a uniform sum. Every layer receives an equally weighted aggregate of all prior layer outputs, with no mechanism to selectively emphasize or suppress what came before. As models grow deeper, this causes what the paper calls "PreNorm dilution": hidden-state magnitudes grow as O(L) with depth, progressively burying the contribution of individual layers under an ever-growing accumulated sum.[1] The empirical consequence is concrete: a significant fraction of layers in standard LLMs can be pruned with minimal loss in performance, which implies they were contributing relatively little to begin with.
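The dilution effect is easy to see in a toy numpy sketch. Treating each layer's output as an i.i.d. random vector is an assumption made here purely for illustration (real activations are correlated, so real growth can behave differently), but the qualitative picture holds: the accumulated stream keeps growing while any one layer's relative share shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 512, 48  # hidden width, depth (illustrative values, not from the paper)

h = rng.standard_normal(d)            # stand-in for the embedding
norms, fracs = [], []
for layer in range(L):
    out = rng.standard_normal(d)      # stand-in for one layer's output
    h = h + out                       # unit-weight residual add, PreNorm style
    norms.append(np.linalg.norm(h))
    fracs.append(np.linalg.norm(out) / np.linalg.norm(h))

print(norms[0], norms[-1])  # accumulated magnitude grows with depth
print(fracs[0], fracs[-1])  # a single layer's relative share shrinks
```

With unit weights there is no mechanism by which a deep layer can recover a specific earlier contribution from this ever-larger sum.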

The Kimi team draws an explicit analogy to the problem that attention mechanisms solved for sequences. Before transformers, recurrent neural networks compressed all prior information into a single hidden state, which was then passed forward - a lossy summary that couldn't be selectively queried. Attention replaced that fixed compression with dynamic, input-dependent retrieval. AttnRes applies the same logic along the depth dimension rather than the sequence dimension.

How AttnRes Works

Rather than summing all previous layer outputs with unit weights, each layer in AttnRes computes a softmax attention score over all preceding layers and aggregates them with learned, input-dependent weights.[1] The query is a single lightweight learnable vector per layer - not the full representation - which keeps the computational overhead minimal. Attention and MLP layers can now receive different effective weightings of earlier representations, which the paper argues is more natural given that different layer types may benefit from different historical contexts.
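A minimal sketch of that aggregation step, in numpy: each layer owns one learnable query vector, scores every preceding layer's output per token, and takes a softmax-weighted sum over the depth axis. The exact parameterization in the paper (any key projections, scaling, or normalization) is not reproduced here; the dot-product scoring is an assumption for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_res(prior_outputs, q):
    """Aggregate prior layer outputs with input-dependent softmax weights.

    prior_outputs: (L_prev, T, d) stack of earlier layer outputs
    q:             (d,) learnable per-layer query vector (the only new parameter)
    """
    scores = np.einsum('ltd,d->lt', prior_outputs, q)  # one score per layer, per token
    w = softmax(scores, axis=0)                        # normalize over the depth axis
    return np.einsum('lt,ltd->td', w, prior_outputs)   # weighted sum -> (T, d)

rng = np.random.default_rng(0)
L_prev, T, d = 6, 4, 64
outs = rng.standard_normal((L_prev, T, d))
q = rng.standard_normal(d)
h = attn_res(outs, q)
print(h.shape)  # (4, 64)
```

Because the weights depend on the token's actual representations, different tokens at the same layer can draw on different depths of the network, which is the "dynamic retrieval" the RNN-to-attention analogy points at.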

At training scale, the challenge is memory: storing all prior layer outputs for attention increases memory footprint as O(Ld). The team's solution is Block AttnRes, which groups layers into blocks, reduces each block to a single representation via summation, and applies attention only over the N block-level summaries rather than all L individual layers. This brings memory and cross-pipeline communication down to O(Nd), making the mechanism practical for large-scale distributed training with negligible additional overhead. The paper reports less than 2% latency increase at inference time.[1]
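Extending the same sketch to the blocked variant: layers are grouped, each group is summed down to one representation, and attention runs over the N summaries only. The grouping-by-reshape below is an assumption about how blocks are formed; the point is the memory shape, O(Nd) of retained state instead of O(Ld).

```python
import numpy as np

def block_attn_res(prior_outputs, q, block_size):
    """Block AttnRes sketch: attend over N block summaries instead of L layers.

    prior_outputs: (L, T, d) earlier layer outputs
    q:             (d,) learnable per-layer query vector
    """
    L, T, d = prior_outputs.shape
    assert L % block_size == 0, "illustrative sketch assumes L divides evenly"
    N = L // block_size
    # reduce each block of layers to a single summary by summation: (N, T, d)
    blocks = prior_outputs.reshape(N, block_size, T, d).sum(axis=1)
    scores = np.einsum('ntd,d->nt', blocks, q)        # N scores, not L
    scores -= scores.max(axis=0, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=0, keepdims=True)
    return np.einsum('nt,ntd->td', w, blocks)         # (T, d), from O(N*d) state

rng = np.random.default_rng(1)
L, T, d = 12, 4, 64
outs = rng.standard_normal((L, T, d))
q = rng.standard_normal(d)
h = block_attn_res(outs, q, block_size=4)
print(h.shape)  # (4, 64)
```

In a pipeline-parallel setup, only the N block summaries need to cross stage boundaries, which is where the communication savings come from.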

What the Numbers Show

Scaling law experiments across model sizes confirm the improvement is consistent: Block AttnRes matches the performance of a baseline trained with 1.25 times more compute - the same result at 80% of the cost, effectively.[1] The team also integrated AttnRes into their Kimi Linear architecture (48B total parameters, 3B activated in a mixture-of-experts configuration) and pre-trained it on 1.4 trillion tokens. AttnRes improved performance across all downstream benchmarks evaluated, while producing more uniform gradient distributions and bounded hidden-state magnitudes across depth - direct evidence that PreNorm dilution was being mitigated.[1]

Significance

Architecture changes that apply cleanly across model sizes, train as drop-in replacements for existing components, and show consistent scaling-law improvements are rare. Most proposed improvements to transformer architecture either don't scale, require significant reengineering of training infrastructure, or fail to replicate across labs. AttnRes has none of those obvious failure modes: it is designed as a literal drop-in for residual connections, its overhead is marginal, and its gains are confirmed by scaling experiments.

The paper is perhaps most interesting as a conceptual move. Attention was already recognized as having fundamentally solved the problem of long-range dependence over the sequence dimension. AttnRes argues that an analogous problem - long-range dependence over the depth dimension - was hiding in plain sight, and that the same family of solutions applies. Whether the result is, as some have suggested, a meaningful architectural inflection point, or a useful but incremental improvement, will depend on how the approach replicates at the largest scales and across different architectures. The paper is already drawing attention from researchers and engineers for whom transformer efficiency is not academic.


Sources

  1. Kimi Team (Moonshot AI): Attention Residuals. arXiv:2603.15031