

Since the original ResNet paper in 2015, residual connections have been one of the most durable assumptions in deep learning: each layer adds its transformation to the running total of all previous layers, creating a "gradient highway" that enables stable training at depth. Moonshot AI's Kimi team is now challenging whether that assumption was ever the right one. Their new paper, Attention Residuals, proposes replacing fixed residual accumulation with something far more selective.[1]
Standard residual connections are, at their core, a uniform sum. Every layer receives an equally weighted aggregate of all prior layer outputs, with no mechanism to selectively emphasize or suppress what came before. As models grow deeper, this causes what the paper calls "PreNorm dilution": hidden-state magnitudes grow as O(L) with depth, progressively burying the contribution of individual layers under an ever-growing accumulated sum.[1] The empirical consequence is concrete: a significant fraction of layers in standard LLMs can be pruned with minimal loss in performance, which implies they were contributing relatively little to begin with.
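The dilution effect is easy to see in a toy simulation - a sketch under simplifying assumptions, not the paper's setup: treat each layer's output as a random vector and accumulate it into the residual stream with a fixed unit weight.

```python
import numpy as np

# Toy residual stream: each "layer output" is a random vector added to the
# running sum with a fixed unit weight, as in standard residual connections.
rng = np.random.default_rng(0)
d, L = 512, 64                       # hidden size, depth

x = np.zeros(d)
rel_contrib = []                     # |F| / |x| right after each add
for _ in range(L):
    f_x = rng.standard_normal(d)     # stand-in for one layer's transformation
    x = x + f_x                      # unit-weight residual accumulation
    rel_contrib.append(np.linalg.norm(f_x) / np.linalg.norm(x))

# With independent outputs the stream's norm grows like sqrt(L); correlated
# outputs in trained networks push growth toward the linear regime the paper
# describes. Either way, one layer's relative share shrinks with depth:
print(f"layer 1: {rel_contrib[0]:.2f}, layer 8: {rel_contrib[7]:.2f}, "
      f"layer 64: {rel_contrib[63]:.2f}")
```

Each successive layer's output is buried deeper under the accumulated sum, which is exactly the dilution that the pruning results point to.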
The Kimi team draws an explicit analogy to the problem that attention mechanisms solved for sequences. Before transformers, recurrent neural networks compressed all prior information into a single hidden state, which was then passed forward - a lossy summary that couldn't be selectively queried. Attention replaced that fixed compression with dynamic, input-dependent retrieval. AttnRes applies the same logic along the depth dimension rather than the sequence dimension.
Rather than summing all previous layer outputs with unit weights, each layer in AttnRes computes softmax attention weights over all preceding layer outputs and aggregates them under those learned, input-dependent weights.[1] The query is a single lightweight learnable vector per layer - not the full representation - which keeps the computational overhead minimal. Attention and MLP layers can now receive different effective weightings of earlier representations, which the paper argues is more natural given that different layer types may benefit from different historical contexts.
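A minimal sketch of that aggregation step, with assumed names and a scaled dot-product score that the paper may parameterize differently:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
d, L = 64, 6
# Outputs of the L preceding layers, stored along the depth dimension.
outputs = [rng.standard_normal(d) for _ in range(L)]
# A single lightweight learnable query vector for the current layer
# (randomly initialized here; learned during training in the real model).
q = rng.standard_normal(d)

scores = np.array([q @ h / np.sqrt(d) for h in outputs])
weights = softmax(scores)            # input-dependent weights over depth
aggregated = sum(w * h for w, h in zip(weights, outputs))
```

Standard residual connections correspond to fixed, uniform weights over this history; AttnRes lets each layer reweight it per input instead.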
At training scale, the challenge is memory: storing all prior layer outputs for attention increases memory footprint as O(Ld). The team's solution is Block AttnRes, which groups layers into blocks, reduces each block to a single representation via summation, and applies attention only over the N block-level summaries rather than all L individual layers. This brings memory and cross-pipeline communication down to O(Nd), making the mechanism practical for large-scale distributed training with negligible additional overhead. The paper reports less than 2% latency increase at inference time.[1]
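The same idea at block granularity, again as a hypothetical sketch: reduce each block of layers to one summary by summation, then attend over only the N summaries.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
d, L, block = 64, 12, 4              # 12 layers in blocks of 4 -> N = 3
outputs = [rng.standard_normal(d) for _ in range(L)]

# Each block is reduced to a single representation by summation, so only
# N block summaries need to be stored and communicated: O(N*d), not O(L*d).
summaries = [np.sum(outputs[i:i + block], axis=0)
             for i in range(0, L, block)]

q = rng.standard_normal(d)           # per-layer learnable query, as before
weights = softmax(np.array([q @ s / np.sqrt(d) for s in summaries]))
aggregated = sum(w * s for w, s in zip(weights, summaries))
```

The attention step now runs over N block summaries regardless of total depth, which is what keeps the memory and cross-pipeline communication costs bounded.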
Scaling law experiments across model sizes confirm the improvement is consistent: Block AttnRes matches the performance of a baseline trained with 1.25 times more compute - effectively the same result at 80% of the cost.[1] The team also integrated AttnRes into their Kimi Linear architecture (48B total parameters, 3B activated in a mixture-of-experts configuration) and pre-trained it on 1.4 trillion tokens. AttnRes improved performance across all downstream benchmarks evaluated, while producing more uniform gradient distributions and bounded hidden-state magnitudes across depth - direct evidence that PreNorm dilution was being mitigated.[1]
Architecture changes that apply cleanly across model sizes, train as drop-in replacements for existing components, and show consistent scaling-law improvements are rare. Most proposed improvements to transformer architecture either don't scale, require significant reengineering of training infrastructure, or fail to replicate across labs. AttnRes has none of those obvious failure modes: it is designed as a literal drop-in for residual connections, its overhead is marginal, and its gains are confirmed by scaling experiments.
The paper is perhaps most interesting as a conceptual move. Attention was already recognized as having fundamentally solved the problem of long-range dependence over the sequence dimension. AttnRes argues that an analogous problem - long-range dependence over the depth dimension - was hiding in plain sight, and that the same family of solutions applies. Whether the result is, as some have suggested, a meaningful architectural inflection point, or a useful but incremental improvement, will depend on how the approach replicates at the largest scales and across different architectures. The paper is already drawing attention from researchers and engineers for whom transformer efficiency is not academic.