Date: March 20, 2026
Subject: EverMind’s MSA (Memory Sparse Attention) and Related Directions
Positioning: External-facing, high-level technical intuition report
MSA is worth watching, not because it has solved long-term memory, but because it signals a clear shift: long-term memory is moving from a purely external system capability to a collaborative division of labor between internal model mechanisms and external systems.
For the past two years, the industry has followed a stable division: models handle reasoning, while external systems handle memory. This memory takes many forms: RAG, vector databases, knowledge graphs, agent memory layers, or simple document indices. MSA proposes moving part of this access capability inside the model, teaching it to perform sparse selection and multi-hop access across massive histories.
This matters because it changes our understanding of AI system boundaries. The question is no longer just how large a model’s window is, but what capabilities should remain internal and what should stay external.
Currently, this work is in its early stages. Public materials are limited to a README and a paper. Code and models are not fully open, and third-party replications have yet to appear. It is best viewed as a significant node on the technical map rather than a ready-to-implement architectural manual.
MSA is not a sudden breakthrough. It is the natural convergence of several directions that have been maturing and are now intersecting.
The first is the continuous expansion of long-context capabilities. The industry has proven that context windows can reach 1M tokens or more, but benchmarks also show that capacity and access quality are different things. The truly scarce capability is reliably locating relevant information within extreme contexts and converting it into reasoning quality. We saw this gap in Long Context Benchmarks: All Three Hit 1M — Now What?, where capacity scales fast but access reliability improves much more slowly.
The second is the maturation of RAG. Today’s RAG is no longer a lightweight process of embedding, top-k retrieval, and prompt stitching. It is evolving into a fuller context engine responsible for state, provenance, permissions, planning, and multi-hop retrieval, a shift traced in RAGFlow’s review of the move from RAG to a context engine. MSA’s value lies in pulling part of this capability back into the model.
The third is sparsity becoming a mainstream engineering direction. Whether through sparse attention, KV cache eviction, or DeepSeek’s recent memory-related work, the constraint is the same: the cost of full retention and full access is too high. Systems need to preserve only a small number of high-value connections within a massive information space, a constraint that Ben Dickson’s analysis of sparse attention also highlights.
Taken together, these trends make MSA’s timing easier to understand. It addresses an old question that is becoming increasingly urgent: as context keeps growing, how should models actually access history?
The most effective way to understand MSA is not to start with the paper’s details, but to see where it sits on the broader technical map.
Today, work on long-term memory roughly falls into five routes.
The first route is simply expanding the context window. This is the most direct path. Product changes are minimal, and the default usage cost is relatively low. The problem is equally clear: as context grows, noise, localization cost, and attention decay all rise with it. This route works well as a baseline capability layer for medium-scale tasks, but it is not a complete answer to long-term memory.
The second route keeps memory outside the model, using RAG or external memory systems. This is the most mature route today. It supports updates, auditing, permission control, and structured filtering, and it integrates naturally with product data layers. Its cost is that the system boundary becomes more complex, so multi-hop reasoning and state consistency require additional design.
The third route compresses history into recurrent or fixed-size memory. The advantage here is a clearer computational boundary and more controllable cost. The downside is also obvious: once critical information is lost during compression, later steps cannot recover it. This route is closer to an internal summary memory. It works for continuous state maintenance, but it is a weak fit for tasks that require factual traceability.
The fourth route turns memory into an internal sparse retrieval layer. This is where MSA sits. The goal is for the model to learn how to selectively access historical information, reducing the distance between retrieval and reasoning. The advantage is that multi-hop chains have a better chance of remaining continuous. The unresolved issues are memory updates, provenance, verifiability, and system cost.
The fifth route is a hybrid architecture: internal sparse memory plus an external context engine. From a product perspective, this is currently the most reasonable route. Internal memory is well suited for ultra-long, multi-hop tasks that require continuity of reasoning. The external context engine handles dynamic updates, permissions, auditing, provenance, and structured queries.
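The division of labor in this fifth route can be made concrete with a small sketch. Everything here is illustrative and invented for the example (the index layout, the `external_retrieve` function, and the permission model are assumptions, not any vendor's API): the external engine owns permissions and freshness, and hands the model-side layer only a pre-filtered candidate set for sparse selection.

```python
# Illustrative sketch of the hybrid route's division of labor (invented example,
# not EverMind's code): an external "context engine" enforces permissions and
# similarity ranking, so the model-side memory layer never sees unauthorized data.

def external_retrieve(index, query_vec, user_perms, k=2):
    """External engine: permission filtering, then similarity top-k."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Permission check happens outside the model, at the query layer.
    visible = [(doc_id, vec) for doc_id, (vec, acl) in index.items()
               if acl & user_perms]
    # Rank the permitted documents by similarity to the query.
    ranked = sorted(visible, key=lambda d: dot(d[1], query_vec), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy index: doc_id -> (embedding, set of roles allowed to read it)
index = {
    "policy_v2":   ([1.0, 0.0], {"legal"}),
    "policy_v1":   ([0.9, 0.1], {"legal"}),
    "hr_handbook": ([0.0, 1.0], {"hr"}),
}
hits = external_retrieve(index, [1.0, 0.0], {"legal"})
# hits contains only the permitted, best-matching documents
```

The model-side sparse memory would then operate only over `hits`, which is why updates, auditing, and access control can stay external even if fine-grained selection moves inward.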
That is also my current judgment about this family of work. Long-term memory is unlikely to converge into a single route. A more plausible outcome is that different layers of memory capability will live in internal and external systems and carry different responsibilities.
Once the map is in place, MSA’s role becomes much easier to see.
MSA tries to encode large-scale historical content into routable latent memory, then use sparse selection to retrieve only the parts highly relevant to the current query and combine them with the current context for generation (see the MSA GitHub).
From a system perspective, it is closer to an internal sparse access layer. What it advances is the fusion of long-context access and retrieval augmentation, rather than a complete long-term memory system. This distinction matters. Long-context access, retrieval augmentation, and long-term memory writing and maintenance correspond to different levels of the problem. MSA is primarily advancing the first two.
If we keep only the minimum amount of technical detail, its structure can be summarized in four points. First, history is encoded into retrievable latent states for top-k selection, rather than applying dense attention to all historical tokens. Second, it uses document-wise RoPE and related positional design choices to reduce drift between training length and inference length. Third, it places the KV cache into tiered storage, with part of it on GPU and part on CPU/DRAM, using systems engineering to support extremely long memory. Fourth, it introduces interleaved retrieval and generation rounds to improve context expansion in multi-hop tasks. All four points are drawn from the MSA GitHub README.
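The first of these points, the read path, can be sketched in a few lines. This is a toy illustration with assumed mechanics, not MSA's actual implementation: score latent memory slots against the current query, keep only the top-k, and then attend densely over just that small subset.

```python
import math

# Toy sketch of top-k sparse memory reading (assumed mechanics, not MSA's code):
# instead of dense attention over every historical token, the query selects a
# handful of latent memory slots and attends only over those.

def topk_sparse_read(memory, query, k=2):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # 1. Score every latent memory slot against the query.
    scores = [dot(m, query) for m in memory]
    # 2. Sparse selection: keep only the k highest-scoring slots.
    top = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)[:k]
    # 3. Dense attention over the selected subset only: softmax-weight
    #    the surviving slots and blend them into one read vector.
    exps = [math.exp(scores[i]) for i in top]
    z = sum(exps)
    dim = len(query)
    return [sum(exps[j] / z * memory[top[j]][d] for j in range(k))
            for d in range(dim)]

# Four latent memory slots; the query is close to slots 0 and 2.
memory = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.0, 0.9]]
out = topk_sparse_read(memory, [1.0, 0.0], k=2)
```

The point of the sketch is the cost structure: attention is computed over k slots rather than the full history, which is what makes extremely long memory affordable in principle.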
Taken together, these design choices define its role: a solution that handles retrieval, sparse access, and systems scaling within the same layer.
Historically, MSA is best understood as combinational progress.
Its closest predecessors are Memorizing Transformers from 2022 and LongMem from 2023. Both were already exploring similar ideas: cache historical representations, retrieve relevant content when needed, and fuse it with local context. MSA differs mainly in end-to-end training style, retrieval granularity, and system scale.
From the perspective of pretraining systems, RETRO from 2022 is also important. RETRO had already shown that a model can interact sparsely with external blocks during generation without compressing all knowledge into parameters. MSA differs mostly in architecture and engineering implementation.
From the compressed-memory route, Infini-attention from 2024 and the earlier Recurrent Memory Transformer represent another solution. They compress long-distance information into finite states through recurrent or compressive memory. MSA sits closer to retrieval-based memory, but both families are addressing the same core problem: how to process very long history under finite computation.
At the systems engineering level, work such as H2O and StreamingLLM had already shown that KV cache needs sparse management. MSA continues in that direction and further organizes retained content into a routable memory tier.
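The intuition behind that line of work fits in a few lines. The following is a simplified "heavy hitter" eviction policy in the spirit of H2O, not its actual implementation: always keep a recency window, keep the cached positions with the highest accumulated attention mass, and evict everything else.

```python
# Simplified KV cache eviction in the spirit of H2O (illustrative policy only):
# retain recent tokens plus the "heavy hitters" that have absorbed the most
# attention so far, and drop the rest to stay within a fixed cache budget.

def evict_kv(acc_attention, recent_window=2, budget=4):
    """Return the sorted cache positions to keep under a fixed budget."""
    n = len(acc_attention)
    # Always keep the most recent positions (local context).
    recent = set(range(max(0, n - recent_window), n))
    # Spend the remaining budget on the highest accumulated-attention positions.
    heavy_budget = budget - len(recent)
    candidates = [i for i in range(n) if i not in recent]
    heavy = sorted(candidates, key=lambda i: acc_attention[i],
                   reverse=True)[:heavy_budget]
    return sorted(recent | set(heavy))

# Accumulated attention mass for 6 cached positions:
keep = evict_kv([0.9, 0.1, 0.05, 0.7, 0.2, 0.3], recent_window=2, budget=4)
# positions 4 and 5 survive as recent; 0 and 3 survive as heavy hitters
```

MSA's move, on this reading, is to go one step further: instead of discarding the evicted content, organize retained history into a routable memory tier that can be selectively read back.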
So MSA’s value comes mainly from regrouping existing ideas and pushing them to larger scale, not from a completely independent new principle. That does not reduce its importance. Many influential systems papers matter precisely because they redraw boundaries between existing directions and turn scattered capabilities into a usable architecture.
MSA-style architectures are most likely to change product design in four categories of tasks.
The first is multi-hop reasoning inside a single massive corpus, such as chaining evidence across regulations, contracts, or research papers. The key difficulty here is maintaining continuity across multiple pieces of evidence, not simply retrieving one local answer. MSA’s value, as the MSA README frames it, lies in moving that chain more tightly into the model’s access process.
The second is long-horizon agent memory, such as long-term project collaboration, customer success systems, or cross-session research assistants. The core issue here is how to reorganize events scattered across long time spans into the context needed for the current reasoning step. EverMind’s continued work on memory benchmarks, notably EverMemBench, also suggests that they see MSA within a broader long-term agent memory frame.
The third is very large but slow-changing knowledge bases. If part of the knowledge is relatively static and called frequently, moving part of the access capability into the model can indeed change the cost structure. That is also why DeepSeek Engram is often discussed alongside it.
The fourth is long-range agent workflows. Coding agents and research agents generate large amounts of intermediate state within a single task. In those cases, the bottleneck often comes from storing and accessing historical state, which is where sparse attention and internal memory mechanisms become more directly useful, a point Sebastian Raschka’s review of the DeepSeek stack also makes.
These scenarios share one pattern: the challenge is not just that there is a lot of information, but that the information contains scattered, multi-hop, and cross-temporal dependencies.
The limitations are equally clear.
The first category is dynamic data. When the knowledge base changes continuously, external RAG still has a clear advantage because it can update the index directly. MSA’s currently available materials provide almost no explanation of memory writing or hot-update mechanisms.
The second category is systems with clear auditing and accountability requirements. Legal, financial, medical, and enterprise knowledge systems typically require explicit citation trails. External retrieval systems support provenance naturally; internal memory mechanisms still lack mature interfaces for this, a trade-off Redis’s comparison of RAG versus large context windows also describes.
The third category is multi-tenant and permission-controlled environments. Enterprise data access usually depends on user identity, organizational boundaries, and policy rules. RAG can implement permission control at the query layer more naturally. Internal memory solutions still lack a clear design here.
The fourth category is simple tasks. When the corpus is small, update frequency is low, and provenance requirements are weak, existing long-context models or simple RAG setups usually offer better deployment efficiency and lower debugging cost.
Taken together, these scenarios support a broader judgment: this kind of architecture is better at solving access quality than at replacing updates, auditing, and permissions.
What MSA currently lacks is not more promotion, but stronger validation.
First, production data is missing. Public materials are still limited to a README and a paper. Without fully open code, models, or third-party deployment data, we still do not have latency curves, throughput numbers, cost curves, or failure modes.
Second, a memory write mechanism is missing. Many papers prove that a system can read history more effectively. Real systems also depend on writing, updating, versioning, and forgetting. A memory architecture with only a read path cannot cover the full requirements of long-term memory.
Third, product-like benchmarks are missing. Needle-in-a-haystack, long-document QA, and RULER are valuable, but they are still far from real-world conditions that involve permissions, temporal updates, provenance auditing, and cross-system queries.
Fourth, independent replication is missing. As long as the key conclusions come primarily from the authors themselves, this work is better treated as a strong signal than as stable consensus.
These gaps determine how we should use this paper today. It is better suited to updating architectural intuition than to replacing existing designs directly.
What MSA really points to is not that one model has finally acquired long-term memory. It points to the fact that long-term memory itself is entering a new phase of division of labor.
Over the next few years, the more likely outcome is not convergence onto a single route, but a clearer hierarchy of responsibilities. Internal sparse memory will handle ultra-long, multi-hop access that requires continuity of reasoning. External context engines will handle dynamic updates, permissions, auditing, provenance, and structured queries. Documents, indices, RAG, memory layers, and internal model mechanisms will increasingly sit inside the same overall architecture while carrying different responsibilities.
For AI builders, the more durable question is this: which memories need to remain external, auditable, and updatable, and which memories are worth internalizing so they become part of the model’s reasoning process? That question will outlast MSA itself and will continue to shape the way agent architectures are designed.