DeepSeek's Engram introduces conditional memory as a new axis of sparsity for LLMs. A modernized N-gram lookup table runs on the CPU, relieves early transformer layers of static recall, and unlocks big gains in reasoning, math, and long-context tasks.

What if one of the biggest inefficiencies in modern LLMs is that they are forced to compute things they should simply remember? That is exactly the question DeepSeek's latest paper tries to answer - and the answer reshapes how we think about scaling models.
The field of artificial intelligence is increasingly focused on separating reasoning capabilities from factual memory. Karpathy has discussed exactly this idea in a talk. The core intuition is that if we can build a model that does not store factual knowledge but instead specializes in reasoning, we can attach a retrieval system to it and obtain a small, efficient model with strong capabilities - and an almost unbounded memory that is limited mainly by retrieval.
This week, DeepSeek-AI made a concrete move in that direction with a paper called Engram: Conditional Memory via Scalable Lookup - A New Axis of Sparsity for Large Language Models (arXiv:2601.07372, GitHub).
Modern LLMs are remarkably good at reasoning, but they are also remarkably wasteful at remembering simple things. When a transformer sees a phrase like "Diana, Princess of Wales," it does not retrieve that entity. It reconstructs its meaning, layer by layer, through attention and feed-forward networks.
That reconstruction consumes depth, attention capacity, and compute - even though the information itself is static and was seen thousands of times in training.
The reason is simple: transformers have no native lookup operation. They only know how to transform vectors. So they are forced to simulate memory using computation, and that simulation runs on every token, in every layer, for every query.
Even Mixture-of-Experts (MoE) does not fix this. MoE scales computation via conditional routing, but it still asks the network to reconstruct static facts through activations.
A key insight in the paper is that language modeling is not one monolithic task. It is actually two very different workloads glued together: compositional reasoning, which genuinely benefits from deep, layered computation, and static recall of facts and common phrases, which does not.
Transformers treat both workloads the same way. Engram argues they should not: MoE handles thinking. Engram handles remembering.
This is the new axis of sparsity the paper introduces. Conditional computation (MoE) and conditional memory (Engram) are two independent tools, and the paper shows that using both together is strictly better than using only one.
Engram is a module embedded inside the transformer. It is parametric, trained end-to-end, and fully differentiable. Conceptually, it behaves like a modernized N-gram memory table: the most recent token ids deterministically address slots in a large learned embedding table, and the retrieved vectors are injected back into the hidden state.
What is important to note:
The memory is query-independent and deterministically addressed.
Unlike attention, the retrieval is constant-time, independent of the current activations, and fully predictable before the forward pass begins.
And unlike RAG, this is not external retrieval over documents. Engram is part of the model itself, learned alongside the rest of the weights.
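The lookup mechanism described above can be sketched in a few lines. Everything here (the class name, the hashing scheme, the n-gram width) is an illustrative stand-in rather than DeepSeek's implementation; the point it demonstrates is that addressing depends only on token ids, never on activations:

```python
import numpy as np

class EngramSketch:
    """Toy sketch of a deterministic n-gram memory table (illustrative,
    not DeepSeek's actual implementation)."""

    def __init__(self, table_size: int, d_model: int, ngram: int = 2, seed: int = 0):
        rng = np.random.default_rng(seed)
        # In the real module this table is learned end-to-end; here it is random.
        self.table = rng.standard_normal((table_size, d_model)).astype(np.float32)
        self.table_size = table_size
        self.ngram = ngram

    def index(self, tokens: list[int], pos: int) -> int:
        # Hash the last `ngram` token ids ending at `pos` into a table slot.
        # Deterministic: the same token window always maps to the same slot.
        window = tuple(tokens[max(0, pos - self.ngram + 1): pos + 1])
        return hash(window) % self.table_size

    def lookup(self, tokens: list[int]) -> np.ndarray:
        # One O(1) retrieval per position; no query vector involved.
        idx = [self.index(tokens, p) for p in range(len(tokens))]
        return self.table[idx]
```

Because the addressing is purely token-based, two positions that end the same n-gram retrieve exactly the same memory row, which is what makes the lookup both cacheable and precomputable.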
Raw static memory is dangerous. A lookup can be ambiguous, ridiculous in context, or just wrong. A naive implementation would inject noise into the hidden state and hurt performance.
Engram solves this with a context-aware gate, which plays a role closely analogous to cross-attention. The current hidden state acts as a query that decides how much of the retrieved memory should be trusted.
So relevance is not determined at retrieval time. It is determined by this fusion module after retrieval, which is why a static, deterministic lookup table can still behave in a context-sensitive way.
This single design choice is what makes Engram robust enough to help not just factual recall, but reasoning-heavy tasks too.
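A minimal sketch of such a gate, with made-up names and shapes (the paper's actual fusion module is certainly richer): the hidden state scores the retrieved vector, and a sigmoid decides how much of it enters the residual stream.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(hidden: np.ndarray, mem: np.ndarray,
               w_gate: np.ndarray, b_gate: float = 0.0) -> np.ndarray:
    """Fuse a retrieved memory vector into the hidden state (sketch).

    The gate is computed FROM the hidden state (the "query"), so a static,
    deterministically addressed lookup still gets context-sensitive weighting.
    Shapes: hidden, mem are (d,); w_gate is (2*d,) scoring the concatenated pair.
    """
    score = np.concatenate([hidden, mem]) @ w_gate + b_gate
    g = sigmoid(score)        # scalar in (0, 1): how much to trust the memory
    return hidden + g * mem   # residual injection, scaled by trust
```

With a strongly negative gate score the retrieved vector is suppressed entirely, which is exactly how an ambiguous or out-of-context lookup is prevented from injecting noise.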
One of the most important contributions of the paper is what DeepSeek calls the Sparsity Allocation Law.
Given a fixed parameter budget and fixed FLOPs, how should you split sparse capacity between conditional computation (MoE experts) and conditional memory (the Engram table)?
The result is a clear U-shaped curve. Pure MoE is not optimal. Pure memory is obviously not optimal either. The best configurations allocate roughly 20–25% of sparse parameters to memory and the remaining 75–80% to MoE.
That finding is more interesting than it sounds. It means memory is not a nice-to-have add-on. It is a first-class scaling dimension, sitting alongside computation, and you leave performance on the table if you ignore it.
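As a back-of-the-envelope illustration of what that allocation means in practice (the 100B budget below is invented for the example; only the 20-25% fraction comes from the paper):

```python
def allocate_sparse_budget(total_sparse_params: float, memory_frac: float = 0.25):
    """Split a sparse-parameter budget between memory and MoE experts.

    memory_frac ~ 0.20-0.25 is the range the paper reports as optimal;
    the absolute budget passed in is illustrative, not from the paper.
    """
    memory_params = total_sparse_params * memory_frac
    moe_params = total_sparse_params - memory_params
    return memory_params, moe_params

# Hypothetical 100B sparse budget at the paper's reported optimum:
mem_params, moe_params = allocate_sparse_budget(100e9, memory_frac=0.25)
```

The U-shape means this split is a genuine optimum: pushing the fraction toward 0 (pure MoE) or toward 1 (pure memory) both degrade quality at the same cost.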
You might expect Engram to help only factual benchmarks like MMLU. And it does - the 27B Engram model beats an iso-parameter, iso-FLOPs MoE baseline by +3.4 on MMLU and +4.0 on CMMLU.
But the bigger surprises are elsewhere: the paper reports clear gains on reasoning, math, and long-context benchmarks as well.
Why would a memory module help reasoning and math?
The mechanistic analysis in the paper gives a beautifully simple answer. By offloading local, static reconstruction to the memory table, the early transformer layers finish their "easy work" much faster. Early layers in Engram models start behaving like much deeper layers in MoE-only models. The model reaches prediction-ready representations sooner, which effectively deepens the usable network for complex reasoning.
And because local dependencies are handled by lookups, attention is freed to focus on global structure. That is why long-context retrieval jumps so dramatically.
Engram does not make models smarter by stuffing them with more facts. It makes them smarter by freeing compute.
There is one more advantage that may be the most important for the future of large models. Because Engram's memory indices depend only on input tokens - not on activations - lookups are deterministic and predictable ahead of time.
That enables two things: lookups can be prefetched ahead of the forward pass, and the memory table can live outside expensive GPU memory.
DeepSeek shows that even a 100B-parameter memory table can be offloaded to host memory with less than 3% inference overhead.
That changes the economics of scale. GPU HBM is the single most expensive resource in modern AI infrastructure. If you can grow capacity by adding cheap CPU memory - and keep the GPU busy with reasoning rather than retrieval - the cost curve for "smarter" models flattens significantly.
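A sketch of why determinism matters operationally, with hypothetical names and a toy hashing scheme: because the indices are computable from token ids alone, the host-memory gather can be launched before any dense compute starts and overlapped with it.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def ngram_indices(tokens: list[int], table_size: int, ngram: int = 2) -> list[int]:
    # Indices depend only on token ids, so they are fully known
    # before the forward pass begins (toy hashing scheme).
    return [hash(tuple(tokens[max(0, p - ngram + 1): p + 1])) % table_size
            for p in range(len(tokens))]

def forward_with_prefetch(tokens: list[int], host_table: np.ndarray) -> np.ndarray:
    """Overlap host-memory lookups with accelerator compute (sketch).

    `host_table` stands in for a large embedding table kept in CPU RAM;
    the gather runs in a background thread while dense layers would run.
    """
    idx = ngram_indices(tokens, host_table.shape[0])  # known ahead of time
    with ThreadPoolExecutor(max_workers=1) as pool:
        fetch = pool.submit(lambda: host_table[idx])  # async host-side gather
        # ... dense transformer compute would execute here, hiding the fetch ...
        mem_vectors = fetch.result()                  # ready when the layer needs it
    return mem_vectors

host_table = np.zeros((1 << 16, 8), dtype=np.float32)  # toy stand-in for a huge table
vecs = forward_with_prefetch([1, 2, 3], host_table)
```

This is the property activation-dependent routing (MoE) can never have: an expert choice is only known mid-forward-pass, while an Engram index is known the moment the tokens arrive.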
Has DeepSeek actually solved the separation of reasoning and memory? It is too early to say, and only real-world deployments - and whatever ships in DeepSeek V4 - will really answer that.
But the direction is unmistakable. Engram introduces conditional memory as a first-class architectural primitive, alongside conditional computation. Together they complete the sparsity story: MoE decides what to compute, and Engram decides what to remember.
Transformers have been forced to compute when they should have remembered. Engram finally gives them memory, and in doing so it makes reasoning cheaper, deeper, and more scalable.
References: arXiv paper, DeepSeek Engram on GitHub, r/LLM discussion.