Engram - DeepSeek's New Axis of Sparsity Separating Memory from Reasoning

DeepSeek's Engram introduces conditional memory as a new axis of sparsity for LLMs. A modernized N-gram lookup table runs on the CPU, relieves early transformer layers from static recall, and unlocks big gains in reasoning, math, and long-context tasks.

doc-vision.com


What if one of the biggest inefficiencies in modern LLMs is that they are forced to compute things they should simply remember? That is exactly the question DeepSeek's latest paper tries to answer - and the answer reshapes how we think about scaling models.

The field of artificial intelligence is increasingly focused on separating reasoning capabilities from factual memory. Andrej Karpathy has discussed exactly this idea in a recent talk. The core intuition is that if we can build a model that does not store factual knowledge but instead specializes in reasoning, we can attach a retrieval system to it and obtain a small, efficient model with strong capabilities - and an almost unbounded memory that is limited mainly by retrieval.

This week, DeepSeek-AI made a concrete move in that direction with a paper called Engram: Conditional Memory via Scalable Lookup - A New Axis of Sparsity for Large Language Models (arXiv:2601.07372, GitHub).

The Hidden Inefficiency in Transformers

Modern LLMs are remarkably good at reasoning, but they are also remarkably wasteful at remembering simple things. When a transformer sees a phrase like "Diana, Princess of Wales," it does not retrieve that entity. It reconstructs its meaning, layer by layer, through attention and feed-forward networks.

That reconstruction consumes depth, attention capacity, and compute - even though the information itself is static and was seen thousands of times in training.

The reason is simple: transformers have no native lookup operation. They only know how to transform vectors. So they are forced to simulate memory using computation, and that simulation runs on every token, in every layer, for every query.

Even Mixture-of-Experts (MoE) does not fix this. MoE scales computation via conditional routing, but it still asks the network to reconstruct static facts through activations.

Two Workloads, One Architecture

A key insight in the paper is that language modeling is not one monolithic task. It is actually two very different workloads glued together:

  • Dynamic reasoning - logical composition, multi-step inference, math, code generation. Genuinely hard, genuinely adaptive.
  • Static pattern recall - named entities, idioms, grammatical templates, common phrases, short code snippets. Local, repetitive, context-invariant.

Transformers treat both the same way. Engram argues they should not:

MoE handles thinking. Engram handles remembering.

This is the new axis of sparsity the paper introduces. Conditional computation (MoE) and conditional memory (Engram) are two independent tools, and the paper shows that using both together is strictly better than using only one.

What Engram Actually Does

Engram is a module embedded inside the transformer. It is parametric, trained end-to-end, and fully differentiable. Conceptually, it behaves like a modernized N-gram memory table:

  1. At each token position, the model looks at the local suffix context - typically 2-grams and 3-grams. These short windows are ideal for capturing entities, phrases, and syntactic templates.
  2. Tokens are first normalized through a compression step that collapses superficial differences (casing, whitespace, Unicode variants). This gives a denser, more efficient memory space with fewer wasted slots.
  3. The normalized N-grams are passed through multiple hash functions, each deterministically mapping the N-gram to a slot in a large embedding table. Multiple hash heads reduce collision noise.
  4. The retrieved vectors are concatenated into a memory vector, lightly processed by a small causal convolution, and added back to the hidden state via a residual connection.
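The four steps above can be sketched in a few dozen lines. Everything here is illustrative: the table sizes, the 2-gram window, the blake2b hash, and the random embeddings are stand-ins for the paper's trained, much larger tables (the causal convolution and residual add are omitted for brevity).

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)

TABLE_SIZE = 2 ** 16   # slots per hash head (toy scale)
DIM = 64               # embedding width per head
NUM_HEADS = 2          # multiple hash heads reduce collision noise

# One embedding table per hash head (trained end-to-end in the real
# model; random here purely for illustration).
tables = [rng.normal(size=(TABLE_SIZE, DIM)).astype(np.float32)
          for _ in range(NUM_HEADS)]

def normalize(token: str) -> str:
    # Step 2: collapse superficial differences (casing, whitespace).
    return token.strip().lower()

def slot(ngram: tuple, head: int) -> int:
    # Step 3: deterministic hash of the normalized n-gram into a slot.
    key = f"{head}|" + "\x1f".join(ngram)
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % TABLE_SIZE

def engram_lookup(tokens: list) -> np.ndarray:
    """Return one memory vector per position from the 2-gram suffix."""
    norm = [normalize(t) for t in tokens]
    out = np.zeros((len(tokens), DIM * NUM_HEADS), dtype=np.float32)
    for i in range(1, len(norm)):
        ngram = (norm[i - 1], norm[i])   # Step 1: local suffix 2-gram
        vecs = [tables[h][slot(ngram, h)] for h in range(NUM_HEADS)]
        out[i] = np.concatenate(vecs)    # Step 4: concat into a memory vector
    return out

mem = engram_lookup(["Diana", ",", "Princess", "of", "Wales"])
```

Note that the address of every slot depends only on the tokens, never on activations - the property the CPU-offloading section below relies on.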

What is important to note:

The memory is query-independent and deterministically addressed.

Unlike attention, the retrieval is:

  • Constant time - O(1) per token
  • Independent of sequence length
  • Deterministic - same N-gram always hits the same slots

And unlike RAG, this is not external retrieval over documents. Engram is part of the model itself, learned alongside the rest of the weights.

Context-Aware Gating: The Safety Valve

Raw static memory is dangerous. A lookup can be ambiguous, irrelevant in the current context, or just wrong. A naive implementation would inject noise into the hidden state and hurt performance.

Engram solves this with a context-aware gate, which plays a role closely analogous to cross-attention. The current hidden state acts as a query that decides how much of the retrieved memory should be trusted:

  • If the memory aligns with the current context, the gate opens and memory contributes strongly.
  • If the memory conflicts with what the model is reasoning about, the gate suppresses it almost completely.

So relevance is not determined at retrieval time. It is determined by this fusion module after retrieval, which is why a static, deterministic lookup table can still behave in a context-sensitive way.
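A minimal sketch of that fusion step, assuming a simple sigmoid gate computed from the concatenated hidden state and memory vector (the exact gate parameterization in the paper may differ - this only illustrates the query-decides-trust mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# Toy gate weights (learned jointly with the rest of the model in
# the real architecture; random here for illustration).
W_g = rng.normal(scale=0.1, size=(2 * DIM, DIM)).astype(np.float32)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(hidden: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """The hidden state acts as a query deciding how much of the
    retrieved memory to trust, per channel, before the residual add."""
    gate = sigmoid(np.concatenate([hidden, memory]) @ W_g)  # in (0, 1)
    return hidden + gate * memory
```

Because the gate lies in (0, 1) per channel, a retrieved vector that conflicts with the context can be suppressed almost entirely, while an aligned one passes through nearly unchanged.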

This single design choice is what makes Engram robust enough to help not just factual recall, but reasoning-heavy tasks too.

A New Scaling Law: Memory vs. Computation

One of the most important contributions of the paper is what DeepSeek calls the Sparsity Allocation Law.

Given a fixed parameter budget and fixed FLOPs, how should you split sparse capacity between:

  • MoE experts (computation), and
  • Engram embeddings (memory)?

The result is a clear U-shaped curve. Pure MoE is not optimal. Pure memory is obviously not optimal either. The best configurations allocate roughly 20–25% of sparse parameters to memory and the remaining 75–80% to MoE.
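Concretely, under a fixed sparse budget the split is a single ratio. A trivial worked example, using the paper's reported 20-25% optimum (the 100B figure below is illustrative, not a configuration from the paper):

```python
def split_sparse_budget(total_sparse_params: float, memory_ratio: float):
    """Split a fixed sparse-parameter budget between Engram memory
    slots and MoE expert parameters."""
    assert 0.0 <= memory_ratio <= 1.0
    memory = total_sparse_params * memory_ratio
    moe = total_sparse_params - memory
    return memory, moe

# Near the bottom of the U-shaped curve (~22% memory share),
# a hypothetical 100B sparse budget would split roughly as:
mem_params, moe_params = split_sparse_budget(100e9, 0.22)
# → 22B parameters for memory, 78B for MoE experts
```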

That finding is more interesting than it sounds. It means memory is not a nice-to-have add-on. It is a first-class scaling dimension, sitting alongside computation, and you leave performance on the table if you ignore it.

Why Reasoning, Math, and Long Context All Improve

You might expect Engram to help only factual benchmarks like MMLU. And it does - the 27B Engram model beats an iso-parameter, iso-FLOPs MoE baseline by +3.4 on MMLU and +4.0 on CMMLU.

But the bigger surprises are elsewhere:

  • BBH: +5.0
  • ARC-Challenge: +3.7
  • HumanEval: +3.0
  • MATH: +2.4
  • Multi-Query Needle-in-a-Haystack: 84.2 → 97.0

Why would a memory module help reasoning and math?

The mechanistic analysis in the paper gives a beautifully simple answer. By offloading local, static reconstruction to the memory table, the early transformer layers finish their "easy work" much faster. Early layers in Engram models start behaving like much deeper layers in MoE-only models. The model reaches prediction-ready representations sooner, which effectively deepens the usable network for complex reasoning.

And because local dependencies are handled by lookups, attention is freed to focus on global structure. That is why long-context retrieval jumps so dramatically.

Engram does not make models smarter by stuffing them with more facts. It makes them smarter by freeing compute.

CPU-Scalable Memory: Breaking the GPU Ceiling

There is one more advantage that may be the most important for the future of large models. Because Engram's memory indices depend only on input tokens - not on activations - lookups are deterministic and predictable ahead of time.

That enables:

  • Asynchronous prefetching from host memory or SSD
  • Communication–computation overlap with the GPU's forward pass
  • Offloading massive memory tables to CPU without sacrificing throughput
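The key enabler is that the lookup indices can be computed from token ids alone, before the forward pass starts. A minimal sketch of the overlap pattern, using a background thread and a host-memory table (the index function here is a stand-in hash, not the paper's scheme):

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
# A large embedding table living in host RAM rather than GPU HBM.
TABLE = rng.normal(size=(1_000_000, 64)).astype(np.float32)

def prefetch(indices: np.ndarray, out: dict):
    # Gather the needed rows on a background thread while the GPU
    # runs the dense forward pass.
    out["rows"] = TABLE[indices]

def forward_step(token_ids: np.ndarray) -> np.ndarray:
    # Indices depend only on input tokens, so they are known before
    # any activation exists (stand-in multiplicative hash below).
    indices = (token_ids * 2654435761) % TABLE.shape[0]

    result = {}
    t = threading.Thread(target=prefetch, args=(indices, result))
    t.start()

    # ... the dense transformer computation would overlap here ...

    t.join()  # memory rows are ready by the time the layer needs them
    return result["rows"]

rows = forward_step(np.arange(8))
```

Attention could never be prefetched this way, because its access pattern depends on activations that do not exist yet; Engram's token-determined addressing is exactly what makes the overlap possible.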

DeepSeek shows that even a 100B-parameter memory table can be offloaded to host memory with less than 3% inference overhead.

That changes the economics of scale. GPU HBM is the single most expensive resource in modern AI infrastructure. If you can grow capacity by adding cheap CPU memory - and keep the GPU busy with reasoning rather than retrieval - the cost curve for "smarter" models flattens significantly.

Takeaway

Has DeepSeek actually solved the separation of reasoning and memory? It is too early to say, and only real-world deployments - and whatever ships in DeepSeek V4 - will really answer that.

But the direction is unmistakable. Engram introduces conditional memory as a first-class architectural primitive, alongside conditional computation. Together they complete the sparsity story:

  • Think with computation.
  • Remember with memory.
  • Scale both independently.

Transformers have been forced to compute when they should have remembered. Engram finally gives them memory, and in doing so it makes reasoning cheaper, deeper, and more scalable.


References: arXiv paper, DeepSeek Engram on GitHub, r/LLM discussion.
