
DeepSeek's mHC: A New Architecture Rewiring LLMs for 2026


10xTeam November 23, 2025 6 min read

Just a year after revolutionizing the AI industry with DeepSeek-R1, DeepSeek is setting a powerful precedent for 2026. A fascinating new paper, “mHC: Manifold-Constrained Hyper-Connections,” is generating significant hype as a potential catalyst for the next major AI breakthrough. This work builds upon a previous paper from ByteDance on Hyper-Connections, but no prior knowledge is needed to understand the concepts in this article.

To fully grasp the motivation behind this innovation, we must first revisit the concept of residual connections.

The Foundation: Understanding Residual Connections

First introduced with ResNet in 2015, the standard residual connection is a fundamental building block of modern neural networks. A residual block is not a full model in itself; it illustrates a core idea. An input, let’s call it x_l, arrives from the model’s previous layer. From here, the signal splits into two distinct paths.

  1. The Processing Path: On one side, the input is processed by a module, F. This module can be anything from a feed-forward network to a self-attention block.
  2. The Residual Stream: On the other side, the original input x_l is passed forward completely unchanged. This is the “shortcut” or “identity” connection.

These two streams are then merged through an element-wise sum, producing the block’s output. The formula is elegantly simple:

output = F(x_l) + x_l

When these blocks are stacked, layer after layer, residual connections allow the original input signal to propagate deep into the network. This preserves information and is crucial for stabilizing the training of very deep networks. A key reason for their success is their ability to mitigate the “vanishing gradients” problem: the identity shortcut contributes a constant term of 1 to each block’s gradient, ensuring that the learning signal doesn’t shrink to nothing as it travels backward through the network.
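The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model: the linear choice of F, the dimension, and the weight scale are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden dimension (illustrative)

def residual_block(x, W):
    """One residual block: output = F(x_l) + x_l, with F a simple linear map."""
    return W @ x + x  # processing path plus the identity shortcut

# Stack several blocks; the identity path carries the input forward unchanged.
x = rng.normal(size=d)
out = x.copy()
for _ in range(4):
    W = 0.01 * rng.normal(size=(d, d))  # small weights: F contributes little
    out = residual_block(out, W)

# With a near-zero F, the stacked output stays close to the original input:
# the shortcut, not the processing path, dominates signal propagation.
print(np.allclose(out, x, atol=0.5))
```

Because the shortcut is hard-wired into the architecture, the block passes information through even before any learning happens, which is exactly the guarantee the rest of this article revolves around.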

For the last decade, innovation has focused heavily on the F module—new attention mechanisms, mixture-of-experts, and more. The residual connection itself, however, has remained almost entirely unchanged. This is the gap that Hyper-Connections were designed to fill.

The Next Step: Hyper-Connections

Introduced in a 2025 paper, Hyper-Connections generalize the concept of residual connections. The core idea is to widen the residual stream itself. Instead of a single residual vector, the input is expanded into multiple components that are mixed together at every layer using learned mappings.

Imagine the input x_l being duplicated multiple times, say four, to form an expanded residual stream. As this wider stream flows through the layers, its components are mixed together by a learnable residual mapping matrix. This gives the model the flexibility to learn how information should be mixed and propagated, rather than being restricted to a fixed identity path.

Crucially, this added flexibility doesn’t come with a massive computational cost. The expansion rate is typically small (e.g., 4). Before being processed by the expensive F module, the expanded input is projected back down to the model’s original dimension. This means the costly attention and feed-forward layers don’t operate on the expanded representation. After processing, the output of F is expanded again via another learnable matrix and combined with the residual stream.
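One layer of this scheme can be sketched as follows. The parameter names (A, w_down, w_up) and the use of plain vectors for the projections are illustrative simplifications, not the paper’s exact notation or parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4  # model dimension and expansion rate (n = 4, as in the text)

def F(x):
    """Stand-in for the expensive module (attention or feed-forward)."""
    return np.tanh(x)

# Widened residual stream: n copies of the input, stacked as rows.
x = rng.normal(size=d)
H = np.tile(x, (n, 1))          # shape (n, d)

# Illustrative learned parameters for one layer:
A = rng.normal(size=(n, n))     # residual mixing matrix (unconstrained here)
w_down = rng.normal(size=n)     # projects the n-wide stream down to one vector
w_up = rng.normal(size=n)       # expands F's output back across n components

f_in = w_down @ H                    # shape (d,): F never sees the widened stream
H = A @ H + np.outer(w_up, F(f_in))  # mix the residuals, add F's output back

print(H.shape)
```

Note that F operates only on the d-dimensional down-projection, so the expensive computation is unchanged; only the cheap mixing and projection steps touch the widened (n, d) stream.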

The Problem with Unchecked Flexibility

There is no doubt that this design grants the model more expressive power. The network gains immense flexibility in how information flows across its layers. But this flexibility comes at a price.

Standard residual connections guarantee an identity mapping by their very architecture. This guarantee is vital for stabilizing the training of large-scale networks. DeepSeek’s key observation is that Hyper-Connections break this promise. They rely on unconstrained, learned mixing matrices. As a result, the residual stream can drift, causing signal magnitudes to either explode or vanish during forward and backward passes. This phenomenon undermines the fundamental premise of residual learning and leads to training instability, especially in deeper models.

DeepSeek’s Solution: Manifold-Constrained Hyper-Connections (mHC)

This is precisely the problem that DeepSeek addresses with Manifold-Constrained Hyper-Connections (mHC). The approach is not to remove the flexibility of Hyper-Connections but to tame it. The core idea is to preserve their full expressive power while restoring the identity guarantee that made residual connections so effective.

Structurally, an mHC block is nearly identical to a Hyper-Connections block. The crucial difference lies in the residual mixing matrix. It is no longer unconstrained. Instead, it is subjected to two strict structural constraints:

  1. Non-Negativity: All entries in the matrix must be non-negative.
  2. Unit Row and Column Sums: Each row and each column of the matrix must sum to one.

Matrices with these properties are known as “doubly stochastic.” In practice, these constraints are enforced using the classic Sinkhorn–Knopp algorithm from 1967.

Intuitively, this means two things. First, every output residual receives the same total amount of input signal. Second, every input residual contributes the same total amount to the outputs. This ensures the widened residual stream preserves an “identity-like” property at a global level, even while information is freely mixed across multiple paths.
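A minimal sketch of how such a constraint can be enforced, via alternating row and column normalization in the style of Sinkhorn–Knopp. The parameterization (exponentiating raw values) and the convergence tolerance are illustrative assumptions, not DeepSeek’s exact recipe:

```python
import numpy as np

def sinkhorn_knopp(M, tol=1e-8, max_iters=10_000):
    """Drive a strictly positive matrix toward doubly stochastic form by
    alternately normalizing its rows and its columns (Sinkhorn-Knopp)."""
    M = np.asarray(M, dtype=float)
    for _ in range(max_iters):
        M = M / M.sum(axis=1, keepdims=True)  # make every row sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # make every column sum to 1
        if np.allclose(M.sum(axis=1), 1.0, atol=tol):
            break  # row sums survived the column step: converged
    return M

rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 4))
A = sinkhorn_knopp(np.exp(raw))  # exp of raw parameters enforces non-negativity

# Every entry is non-negative, and every row and column sums to one.
print(A.min() > 0, np.allclose(A.sum(axis=0), 1), np.allclose(A.sum(axis=1), 1))
```

Each row summing to one means every output residual receives a unit total of input signal; each column summing to one means every input residual contributes a unit total to the outputs.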

Additionally, DeepSeek enforces non-negativity on the pre- and post-projection matrices using a sigmoid function. The intuition here is that allowing both positive and negative coefficients can lead to signal cancellation, further destabilizing training at scale.
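The cancellation intuition can be seen with a toy example. The coefficient values below are made up for illustration; in the actual architecture the sigmoid is applied to the learned projection parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.ones(4)  # four identical residual components

# Unconstrained coefficients with mixed signs can cancel the signal outright.
free = np.array([1.0, -1.0, 1.0, -1.0])
print(free @ x)         # 0.0: complete cancellation

# Passing the same raw parameters through a sigmoid keeps every coefficient
# in (0, 1), so contributions can only add, never cancel.
constrained = sigmoid(free)
print(constrained @ x)  # ~2.0, since sigmoid(1) + sigmoid(-1) = 1
```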

Putting mHC to the Test: The Results

DeepSeek evaluated mHC by pre-training three large language models: a baseline, a model with standard Hyper-Connections, and a model with mHC. All models used mixture-of-experts architectures and an expansion rate of 4.

The results were compelling. In a comparison of 27-billion-parameter models, both Hyper-Connection variants outperformed the baseline, confirming that widening the residual stream improves performance. More importantly, the mHC model consistently achieved the strongest results across multiple downstream benchmarks. This indicates that mHC successfully preserves the benefits of Hyper-Connections while delivering broadly improved performance.

Training stability data tells an even clearer story. The standard Hyper-Connections model showed signs of instability around the 12,000th training iteration, with its loss diverging. The mHC model, in contrast, mitigated this issue. When looking at gradient norms, the standard Hyper-Connections model was clearly unstable, whereas the mHC model’s gradients remained smooth and well-behaved, closely tracking the baseline.

This confirms that Manifold-Constrained Hyper-Connections successfully restore the training stability that is essential for building the next generation of powerful AI models.


