Fountas et al. · 2026

Why the Brain Consolidates: Predictive Forgetting for Optimal Generalisation

We show that memory consolidation implements conditional information compression — the iterative elimination of sensory information that does not predict future outcomes — provably tightening generalisation bounds.

Zafeirios Fountas¹*, Adnan Oomerjee¹,², Haitham Bou-Ammar¹,², Jun Wang², Neil Burgess³,⁴
¹Huawei Noah's Ark Lab   ²UCL Computer Science   ³UCL Institute of Cognitive Neuroscience   ⁴UCL Queen Square Institute of Neurology


Forgetting is not failure — it is the mechanism of generalisation

During encoding, the brain captures every detail: lighting, texture, background. But most of this information is noise for future decisions. We show that consolidation progressively strips away sensory details that do not predict outcomes, leaving behind the semantic core that transfers to novel situations. This is predictive forgetting.

Generalisation Bound
$$\Delta \;\leq\; \tilde{\mathcal{O}}\!\left(\sqrt{\frac{I(X;\,Z \mid Y)\;+\;C}{n}}\right)$$

The generalisation gap $\Delta$ is controlled by $I(X; Z \mid Y)$, the information your memory $Z$ retains about the input $X$ that doesn't predict the outcome $Y$. Consolidation minimises this term.
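To get a feel for the scaling only, the sketch below evaluates the leading term $\sqrt{(I(X; Z \mid Y) + C)/n}$ and ignores the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$. The function name `bound_leading_term` and the example values are assumptions made for illustration; they are not figures from the paper.

```python
import math

def bound_leading_term(cmi: float, c: float, n: int) -> float:
    """Leading term sqrt((I(X;Z|Y) + C) / n); hidden constants and logs dropped."""
    if n <= 0:
        raise ValueError("n must be positive")
    if cmi + c < 0:
        raise ValueError("I(X;Z|Y) + C must be non-negative")
    return math.sqrt((cmi + c) / n)

# Hypothetical values: same n and C, with the retained nuisance information
# I(X;Z|Y) reduced from 8 to 0 (units are arbitrary in this illustration).
before = bound_leading_term(cmi=8.0, c=1.0, n=100)
after = bound_leading_term(cmi=0.0, c=1.0, n=100)
print(before, after)
```

With these made-up numbers, removing the nuisance term shrinks the leading term by a factor of three at a fixed $n$.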

Figure 1: Consolidation as predictive forgetting

Figure 1 — Consolidation progressively discards noise (green) while sharpening category-predictive features (red, blue).

Why can't the brain just learn compressed representations in a single pass?

Rapid perception demands high fidelity — you must capture every detail of a novel scene for immediate survival. But optimal generalisation demands compression. These objectives are mathematically incompatible in a single encoding step. Temporal separation resolves this: encode richly during wake, then compress iteratively during sleep.

Wake: online encoding · Sleep: offline consolidation
Wake · Online

Maximise $I(X; Z)$ — capture everything for immediate inference and one-shot learning.

Sleep · Offline

Minimise $I(X; Z \mid Y)$ — discard sensory noise, retain only what predicts outcomes.

Validated across diverse substrates

The same principle — reduce $I(X; Z \mid Y)$, preserve $I(Y; Z)$ — operates in cortical circuits, predictive coding networks, and large language models.
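A minimal discrete example can make the two quantities concrete. The sketch assumes a toy world that is not taken from the paper: the outcome $Y$ is a fair bit, the input $X = (Y, N)$ adds an independent fair noise bit $N$, a "wake" code copies $X$, and a "sleep" code keeps only $Y$. The helper names are hypothetical.

```python
import numpy as np

def mutual_information(p_ab: np.ndarray) -> float:
    """I(A;B) in bits from a joint probability table p_ab[a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))

def conditional_mutual_information(p_xzy: np.ndarray) -> float:
    """I(X;Z|Y) in bits from a joint table p_xzy[x, z, y]."""
    total = 0.0
    for y in range(p_xzy.shape[2]):
        p_y = p_xzy[:, :, y].sum()
        if p_y > 0:
            total += p_y * mutual_information(p_xzy[:, :, y] / p_y)
    return total

def joint_table(code) -> np.ndarray:
    """Build p(x, z, y) for X = (Y, N) with fair, independent bits Y and N."""
    p = np.zeros((4, 4, 2))
    for y in (0, 1):
        for n in (0, 1):
            x = 2 * y + n          # pack the two bits into one symbol
            p[x, code(x), y] += 0.25
    return p

wake = joint_table(lambda x: x)        # toy wake code: Z copies X
sleep = joint_table(lambda x: x // 2)  # toy sleep code: Z keeps only Y

for name, p in (("wake", wake), ("sleep", sleep)):
    cmi = conditional_mutual_information(p)
    relevant = mutual_information(p.sum(axis=0))  # marginal p(z, y) gives I(Y;Z)
    print(f"{name}: I(X;Z|Y) = {cmi:.1f} bits, I(Y;Z) = {relevant:.1f} bits")
```

Under these toy assumptions the wake code retains 1 bit of outcome-irrelevant information and the sleep code retains 0 bits, while $I(Y; Z)$ stays at 1 bit for both.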

Figure 2: Iterative refinement tightens generalisation bounds
Experiment I

Cortical Latent Code Refinement

A frozen encoder captures high-fidelity codes. A lightweight refiner iteratively compresses them offline — without re-accessing sensory input. Across five benchmarks, this shrinks the generalisation gap while increasing accuracy.

Autoencoder · 5 datasets
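The idea of refining stored codes offline, without touching the input again, can be illustrated with a deliberately simple linear toy. This is not the refiner used in the paper: the random orthogonal "encoder", the least-squares readout direction, the shrinkage rule in `refine`, and all constants are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one label-relevant dimension plus high-variance nuisance dimensions.
n, d = 200, 10
y = rng.choice([-1.0, 1.0], size=n)
x = rng.normal(size=(n, d)) * 3.0
x[:, 0] = y + 0.1 * rng.normal(size=n)

# Stand-in for a frozen encoder: a fixed random orthogonal map, applied once.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
z = x @ q  # stored high-fidelity codes; x is not used again below

# Outcome-predictive direction estimated from the stored codes and labels only.
w, *_ = np.linalg.lstsq(z, y, rcond=None)
u = w / np.linalg.norm(w)

def refine(codes: np.ndarray, direction: np.ndarray, rate: float = 0.3) -> np.ndarray:
    """One offline step: shrink the part of each code orthogonal to `direction`."""
    along = np.outer(codes @ direction, direction)
    return along + (1.0 - rate) * (codes - along)

def nuisance_energy(codes: np.ndarray, direction: np.ndarray) -> float:
    """Mean squared norm of the part of each code orthogonal to `direction`."""
    residual = codes - np.outer(codes @ direction, direction)
    return float(np.mean(np.sum(residual ** 2, axis=1)))

z_refined = z.copy()
energies = [nuisance_energy(z_refined, u)]
for _ in range(10):  # iterative refinement on the codes alone
    z_refined = refine(z_refined, u)
    energies.append(nuisance_energy(z_refined, u))

print("nuisance energy, first vs last:", energies[0], energies[-1])
print("readout unchanged:", np.allclose(z @ w, z_refined @ w))
```

In this toy the linear readout `z @ w` is preserved by construction, because only components orthogonal to `w` are shrunk; a learned, non-linear refiner would not come with that guarantee.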
Figure 3: Bidirectional predictive coding wake-sleep consolidation
Experiment II

Bidirectional Predictive Coding

A biologically plausible circuit exploits its generative model to "dream" denoised versions of stored memories. Stronger homeostatic priors during sleep compress representations — and the benefit scales with network capacity.

Predictive Coding · Wake-Sleep
Figure 5: Hierarchical cache refinement in LLMs
Experiment III

LLM Cache Consolidation

A Cache Refiner iteratively rewrites the Key-Value store of a frozen Llama-3-8B. This closes the generalisation gap on reasoning tasks and reveals a coarse-to-fine hierarchy: global renormalisation in early layers, selective editing in deep layers.

Transformer · Llama-3-8B

High-capacity systems need consolidation

In low-capacity networks, architectural bottlenecks naturally enforce compression — replay is redundant. But in high-capacity regimes characteristic of mammalian neocortex, the system memorises sensory noise, causing catastrophic overfitting. Offline consolidation resolves this, allowing the system to scale capacity without sacrificing generalisation.

Normative explanation: The mammalian neocortex, which possesses immense capacity, requires prolonged offline consolidation to generalise well — because single-pass encoding retains outcome-irrelevant detail.
Figure 4: Consolidation resolves the capacity-generalisation trade-off

Empirical signatures of predictive forgetting

The framework generates quantitative predictions testable with current neuroimaging methods.

Mechanism | Neural Signature | Methods | Status
Temporal Compression | Manifold radius/dimension reduction; increased within-category similarity | fMRI RSA, Manifold Capacity Analysis | Grounding

If consolidation implements predictive forgetting, neural representations must become progressively more compressed over time. Geometrically, this corresponds to a reduction in the radius and intrinsic dimension of the neural manifold. Pattern similarity within task-relevant categories should increase, whilst between-category distinctions are sharpened. This reinterprets the "semanticisation" observed in longitudinal fMRI — not as abstract knowledge accumulation, but as the active pruning of manifold variance orthogonal to the task.

Chung et al. (2018) Phys Rev X; Schapiro et al. (2017) Sci Rep
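As a rough illustration of what "radius" and "intrinsic dimension" reduction could look like in data, the sketch below uses two simple proxies: root-mean-square distance from the centroid, and the participation ratio of the covariance eigenvalues. These are assumptions chosen for brevity and are not the manifold capacity analysis cited above; the synthetic "early" and "late" patterns are hypothetical.

```python
import numpy as np

def manifold_radius(points: np.ndarray) -> float:
    """Root-mean-square distance of the rows of `points` from their centroid."""
    centred = points - points.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum(centred ** 2, axis=1))))

def participation_ratio(points: np.ndarray) -> float:
    """Effective dimension (sum l)^2 / sum(l^2) over covariance eigenvalues l."""
    eigvals = np.linalg.eigvalsh(np.cov(points, rowvar=False))
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))

rng = np.random.default_rng(0)

# Hypothetical response patterns for one category: 500 trials x 20 features.
early = rng.normal(size=(500, 20))

# Hypothetical consolidation: 18 nuisance dimensions shrunk, 2 kept intact.
scales = np.concatenate([np.ones(2), 0.1 * np.ones(18)])
late = early * scales

print("radius:   ", manifold_radius(early), "->", manifold_radius(late))
print("dimension:", participation_ratio(early), "->", participation_ratio(late))
```

Both proxies fall for the compressed patterns, which is the direction of change the prediction describes for real neural data.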
Sleep Acceleration | $I(X; Z)$ compression accelerates during sleep vs. wake | Sleep manipulation, fMRI reactivation | Grounding

If offline consolidation is computationally necessary, sleep should accelerate representational change relative to equivalent wake intervals. Measuring representational geometry immediately post-encoding vs. post-sleep should reveal greater compression (manifold contraction) following sleep. This effect has been behaviourally indexed as "gist extraction" and "relational integration", but our framework offers a precise neural definition: sleep replay selectively samples traces to minimise description length.

Igloi et al. (2015) Nat Commun; Ellenbogen et al. (2007) Curr Biol
Generalisation Link | Neural compression predicts out-of-distribution performance | Behavioural transfer tests | Grounding

Across individuals and time points, greater neural compression should predict better generalisation to novel exemplars. This provides the critical link between our information-theoretic framework and behavioural outcomes: if reducing $I(X; Z \mid Y)$ tightens generalisation bounds, then participants exhibiting stronger compression should show superior out-of-distribution performance. Importantly, this relationship is selective: compression of information orthogonal to task demands improves generalisation, whereas compression of task-relevant structure impairs it.

Tompary & Davachi (2017) J Neurosci; Wimmer & Shohamy (2012) Nat Neurosci
Readout Sharpening | Increased SNR in memory traces; synaptic down-selection | Longitudinal 7T fMRI, electrophysiology | Grounding

Consolidation optimises the decision boundaries of downstream readouts. This mechanism explains "trace sharpening" recently observed in high-field fMRI, where the signal-to-noise ratio of memory representations increases over weeks. Our model interprets this not as passive decay, but as an active process driven by synaptic down-selection: synapses encoding non-predictive nuisance variables are pruned during sleep-dependent renormalisation, effectively implementing the information bottleneck.

Vanasse et al. (2022) Neuroimage; de Vivo et al. (2017) Science
Predictive Codes | Shift from retrospective to prospective successor representations | Sequential learning tasks | Unification

In sequential domains, minimising $I(X; Z \mid Y)$ naturally leads to representations that encode expected future state occupancy rather than immediate sensory features. SR-like codes in hippocampal CA1 may reflect the earliest stages of this compression, while the eigenvectors of the SR correspond to grid-like structural codes in entorhinal and prefrontal cortices. We predict that consolidation drives a qualitative shift from environment-specific, place-like representations toward structural, environment-general codes over days to weeks.

Stachenfeld et al. (2017) Nat Neurosci; Whittington et al. (2020) Cell; Bennett et al. (2025) bioRxiv
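For readers unfamiliar with the successor representation (SR), the sketch below computes it in closed form, $M = (I - \gamma T)^{-1}$, for a small environment and extracts its eigenvectors. The ring-shaped random walk, the discount $\gamma = 0.9$, and the function name are assumptions for illustration only.

```python
import numpy as np

def successor_representation(transition: np.ndarray, gamma: float) -> np.ndarray:
    """M = sum_t gamma^t T^t = (I - gamma T)^(-1) for a row-stochastic T."""
    n = transition.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * transition)

# Hypothetical environment: unbiased random walk on a ring of 8 states.
n_states, gamma = 8, 0.9
T = np.zeros((n_states, n_states))
for s in range(n_states):
    T[s, (s - 1) % n_states] = 0.5
    T[s, (s + 1) % n_states] = 0.5

# Row s of M holds the expected discounted future occupancy of every state
# when starting from state s.
M = successor_representation(T, gamma)

# T is symmetric here, so M is too and eigh applies (ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(M)

print("row sums:", M.sum(axis=1))          # each equals 1 / (1 - gamma)
print("largest eigenvalue:", eigvals[-1])
```

On the ring the eigenvectors are periodic functions of position at different spatial frequencies, which is the sense in which SR eigenvectors are compared with grid-like codes in the text above.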
Hierarchical Gradients | Compression magnitude scales with cortical hierarchy | Multi-region fMRI | Prediction

Consolidation-related compression should vary systematically across the cortical hierarchy. Early sensory regions, which must preserve input fidelity for immediate perceptual demands, should exhibit minimal compression. Higher associative cortices (prefrontal, parietal, entorhinal) that extract abstract task structure should show maximal compression. This predicts a spatial gradient in the magnitude of $I(X; Z \mid Y)$ reduction, consistent with initial evidence of "task-tailored" geometry in prefrontal cortex.

Bhandari et al. (2025) bioRxiv; Krenz et al. (2023) PNAS
Reconsolidation | Retrieval induces further compression of the trace | Retrieval-practice paradigms | Prediction

Each memory retrieval event should trigger further compression. Our framework predicts that the inverse loop (recall followed by re-encoding) implements an additional consolidation step. Thus, retrieved memories should show greater compression than non-retrieved memories matched for age. This is context-dependent: when task demands require preservation of episodic detail, retrieval-induced compression is suppressed; when abstraction is required, retrieval accelerates the collapse of nuisance dimensions.

Hahamy et al. (2023) Science; Bridge & Voss (2014) J Neurosci
Structure / Content Split | Temporal keys stable, sensory values compressed | LLM analysis, HC subfield recording | Prediction

Based on our LLM results (Fig. 5c–d), we predict a physiological dissociation between addressing and content. Neural populations encoding temporal or contextual indices (Keys) — potentially in CA3 or superficial cortical layers — should exhibit high stability during consolidation to preserve structural scaffolding. In contrast, populations encoding sensory content (Values) — such as those in CA1 or deep cortical outputs — should undergo more aggressive compression.

Gershman et al. (2025) Trends Cogn Sci; see Fig. 5c–d in this paper


Cite this work

@misc{fountas2026predictiveforgetting,
  title         = {Why the Brain Consolidates: Predictive Forgetting
                   for Optimal Generalisation},
  author        = {Zafeirios Fountas and Adnan Oomerjee and
                   Haitham Bou-Ammar and Jun Wang and Neil Burgess},
  year          = {2026},
  eprint        = {2603.04688},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2603.04688}
}