We show that memory consolidation implements conditional information compression — the iterative elimination of sensory information that does not predict future outcomes — provably tightening generalisation bounds.
During encoding, the brain captures every detail: lighting, texture, background. But most of this information is noise for future decisions. We show that consolidation progressively strips away sensory details that do not predict outcomes, leaving behind the semantic core that transfers to novel situations. This is predictive forgetting.
The generalisation gap is bounded by $I(X; Z \mid Y)$ — the information your memory retains about the input that doesn't predict the outcome. Consolidation minimises this term.
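To make the bounded term concrete, the standard chain rule for mutual information splits what a memory stores into a predictive part and a non-predictive part. The sketch below assumes that the code $Z$ is computed from the input $X$ alone, so that $Y \to X \to Z$ is a Markov chain and $I(Y; Z \mid X) = 0$; that assumption is ours and is not stated on this page.

$$
\begin{aligned}
I(X, Y; Z) &= I(X; Z) + I(Y; Z \mid X) \\
           &= I(Y; Z) + I(X; Z \mid Y), \\
I(X; Z)    &= I(Y; Z) + I(X; Z \mid Y) \quad \text{when } I(Y; Z \mid X) = 0.
\end{aligned}
$$

Under this assumption, lowering $I(X; Z \mid Y)$ while holding $I(Y; Z)$ fixed is the same as lowering the total stored information $I(X; Z)$ without losing anything that predicts the outcome.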
Figure 1 — Consolidation progressively discards noise (green) while sharpening category-predictive features (red, blue).
Rapid perception demands high fidelity — you must capture every detail of a novel scene for immediate survival. But optimal generalisation demands compression. These objectives are mathematically incompatible in a single encoding step. Temporal separation resolves this: encode richly during wake, then compress iteratively during sleep.
Wake: maximise $I(X; Z)$ to capture everything for immediate inference and one-shot learning.
Sleep: minimise $I(X; Z \mid Y)$ to discard sensory noise and retain only what predicts outcomes.
The same principle — reduce $I(X; Z \mid Y)$, preserve $I(Y; Z)$ — operates in cortical circuits, predictive coding networks, and large language models.
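The principle can be checked by hand on a small discrete example. The sketch below is a toy illustration, not the paper's experiment: the "world" (one signal bit that determines the outcome, plus a four-valued nuisance variable), the names `Z_rich` and `Z_compact`, and the helper functions are all hypothetical and chosen only so that the quantities can be computed exactly.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(samples):
    """Shannon entropy in bits, treating each listed sample as equally likely."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mutual_info(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def cond_mutual_info(a, b, c):
    """I(A; B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)."""
    return (entropy(list(zip(a, c))) + entropy(list(zip(b, c)))
            - entropy(list(zip(a, b, c))) - entropy(c))

# Hypothetical toy world: X = (signal bit, nuisance value), all 8 states equally likely.
X = list(product([0, 1], [0, 1, 2, 3]))
Y = [signal for signal, _ in X]          # the outcome depends on the signal bit only

Z_rich = X                               # "wake" code: stores the full input
Z_compact = [signal for signal, _ in X]  # "consolidated" code: stores the signal bit only

rich_nuisance = cond_mutual_info(X, Z_rich, Y)        # I(X; Z | Y) for the rich code
compact_nuisance = cond_mutual_info(X, Z_compact, Y)  # I(X; Z | Y) for the compact code
rich_predictive = mutual_info(Y, Z_rich)              # I(Y; Z) for the rich code
compact_predictive = mutual_info(Y, Z_compact)        # I(Y; Z) for the compact code
```

In this toy case the rich code carries 2 bits of outcome-irrelevant information and the compact code carries 0 bits, while both keep the full 1 bit about the outcome.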
A frozen encoder captures high-fidelity codes. A lightweight refiner iteratively compresses them offline — without re-accessing sensory input. Across five benchmarks, this shrinks the generalisation gap while increasing accuracy.
Autoencoder · 5 datasets
A biologically plausible circuit exploits its generative model to "dream" denoised versions of stored memories. Stronger homeostatic priors during sleep compress representations, and the benefit scales with network capacity.
Predictive Coding · Wake-Sleep
A Cache Refiner iteratively rewrites the Key-Value store of a frozen Llama-3-8B. This closes the generalisation gap on reasoning tasks and reveals a coarse-to-fine hierarchy: global renormalisation in early layers, selective editing in deep layers.
Transformer · Llama-3-8B
In low-capacity networks, architectural bottlenecks naturally enforce compression, so replay is redundant. But in the high-capacity regimes characteristic of mammalian neocortex, the system memorises sensory noise, causing catastrophic overfitting. Offline consolidation resolves this, allowing the system to scale capacity without sacrificing generalisation.
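The idea of refining stored codes offline, without going back to the sensory input, can be illustrated with a deliberately simple stand-in. The sketch below is not the paper's refiner, whose architecture and training objective are not described on this page. It assumes that each stored code comes with its outcome label and uses a hypothetical update rule (`refine_step`) that shrinks each code toward the mean of its outcome class, so outcome-irrelevant spread falls while the class means stay put.

```python
import random

def class_means(codes, labels):
    """Mean code vector for each outcome label."""
    sums, counts = {}, {}
    for z, y in zip(codes, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, v in enumerate(z):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def within_class_spread(codes, labels):
    """Average squared distance of a code from its class mean (a crude nuisance proxy)."""
    means = class_means(codes, labels)
    total = sum((v - m) ** 2
                for z, y in zip(codes, labels)
                for v, m in zip(z, means[y]))
    return total / len(codes)

def refine_step(codes, labels, rate=0.3):
    """Hypothetical offline update: move each stored code a fraction `rate`
    toward its class mean. Uses stored codes and labels only, never the input."""
    means = class_means(codes, labels)
    return [[v - rate * (v - m) for v, m in zip(z, means[y])]
            for z, y in zip(codes, labels)]

rng = random.Random(0)
labels = [i % 2 for i in range(200)]
# Toy "frozen encoder" output: dimension 0 carries the class, every dimension carries noise.
codes = [[(2.0 if y else -2.0) + rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)]
         for y in labels]

means_before = class_means(codes, labels)
spread = [within_class_spread(codes, labels)]
for _ in range(5):                      # five offline refinement passes
    codes = refine_step(codes, labels)
    spread.append(within_class_spread(codes, labels))
means_after = class_means(codes, labels)
```

Each pass scales the deviation from the class mean by `1 - rate`, so the spread falls geometrically while the class means, which carry the outcome information in this toy, are unchanged. A learned refiner would not have this closed form; the sketch only shows the shape of the loop.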
The framework generates quantitative predictions testable with current neuroimaging methods.
| Mechanism | Neural Signature | Methods | Status |
|---|---|---|---|
| Temporal Compression | Manifold radius/dimension reduction; increased within-category similarity | fMRI RSA, Manifold Capacity Analysis | Grounding |
| Sleep Acceleration | $I(X; Z)$ compression accelerates during sleep vs. wake | Sleep manipulation, fMRI reactivation | Grounding |
| Generalisation Link | Neural compression predicts out-of-distribution performance | Behavioural transfer tests | Grounding |
| Readout Sharpening | Increased SNR in memory traces; synaptic down-selection | Longitudinal 7T fMRI, electrophysiology | Grounding |
| Predictive Codes | Shift from retrospective to prospective successor representations | Sequential learning tasks | Unification |
| Hierarchical Gradients | Compression magnitude scales with cortical hierarchy | Multi-region fMRI | Prediction |
| Reconsolidation | Retrieval induces further compression of the trace | Retrieval-practice paradigms | Prediction |
| Structure / Content Split | Temporal keys stable, sensory values compressed | LLM analysis, HC subfield recording | Prediction |

Temporal Compression: If consolidation implements predictive forgetting, neural representations must become progressively more compressed over time. Geometrically, this corresponds to a reduction in the radius and intrinsic dimension of the neural manifold. Pattern similarity within task-relevant categories should increase, whilst between-category distinctions are sharpened. This reinterprets the "semanticisation" observed in longitudinal fMRI: not as abstract knowledge accumulation, but as the active pruning of manifold variance orthogonal to the task. References: Chung et al. (2018) Phys Rev X; Schapiro et al. (2017) Sci Rep.
Sleep Acceleration: If offline consolidation is computationally necessary, sleep should accelerate representational change relative to equivalent wake intervals. Measuring representational geometry immediately post-encoding vs. post-sleep should reveal greater compression (manifold contraction) following sleep. This effect has been behaviourally indexed as "gist extraction" and "relational integration", but our framework offers a precise neural definition: sleep replay selectively samples traces to minimise description length. References: Igloi et al. (2015) Nat Commun; Ellenbogen et al. (2007) Curr Biol.
Generalisation Link: Across individuals and time points, greater neural compression should predict better generalisation to novel exemplars. This provides the critical link between our information-theoretic framework and behavioural outcomes: if reducing $I(X; Z \mid Y)$ tightens generalisation bounds, then participants exhibiting stronger compression should show superior out-of-distribution performance. Importantly, this relationship is selective: compression of information orthogonal to task demands improves generalisation, whereas compression of task-relevant structure impairs it. References: Tompary & Davachi (2017) J Neurosci; Wimmer & Shohamy (2012) Nat Neurosci.
Readout Sharpening: Consolidation optimises the decision boundaries of downstream readouts. This mechanism explains "trace sharpening" recently observed in high-field fMRI, where the signal-to-noise ratio of memory representations increases over weeks. Our model interprets this not as passive decay, but as an active process driven by synaptic down-selection: synapses encoding non-predictive nuisance variables are pruned during sleep-dependent renormalisation, effectively implementing the information bottleneck. References: Vanasse et al. (2022) Neuroimage; de Vivo et al. (2017) Science.
Predictive Codes: In sequential domains, minimising $I(X; Z \mid Y)$ naturally leads to representations that encode expected future state occupancy rather than immediate sensory features. SR-like codes in hippocampal CA1 may reflect the earliest stages of this compression, while the eigenvectors of the SR correspond to grid-like structural codes in entorhinal and prefrontal cortices. We predict that consolidation drives a qualitative shift from environment-specific, place-like representations toward structural, environment-general codes over days to weeks. References: Stachenfeld et al. (2017) Nat Neurosci; Whittington et al. (2020) Cell; Bennett et al. (2025) bioRxiv.
Hierarchical Gradients: Consolidation-related compression should vary systematically across the cortical hierarchy. Early sensory regions, which must preserve input fidelity for immediate perceptual demands, should exhibit minimal compression. Higher associative cortices (prefrontal, parietal, entorhinal) that extract abstract task structure should show maximal compression. This predicts a spatial gradient in the magnitude of $I(X; Z \mid Y)$ reduction, consistent with initial evidence of "task-tailored" geometry in prefrontal cortex. References: Bhandari et al. (2025) bioRxiv; Krenz et al. (2023) PNAS.
Reconsolidation: Each memory retrieval event should trigger further compression. Our framework predicts that the inverse loop (recall followed by re-encoding) implements an additional consolidation step. Thus, retrieved memories should show greater compression than non-retrieved memories matched for age. This is context-dependent: when task demands require preservation of episodic detail, retrieval-induced compression is suppressed; when abstraction is required, retrieval accelerates the collapse of nuisance dimensions. References: Hahamy et al. (2023) Science; Bridge & Voss (2014) J Neurosci.
Structure / Content Split: Based on our LLM results (Fig. 5c–d), we predict a physiological dissociation between addressing and content. Neural populations encoding temporal or contextual indices (Keys), potentially in CA3 or superficial cortical layers, should exhibit high stability during consolidation to preserve structural scaffolding. In contrast, populations encoding sensory content (Values), such as those in CA1 or deep cortical outputs, should undergo more aggressive compression. References: Gershman et al. (2025) Trends Cogn Sci; see Fig. 5c–d in this paper.
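Several predictions in the table above are stated in terms of manifold radius and dimension. As a rough illustration of how such quantities could be computed from a set of activity patterns, the sketch below uses two common proxies: the root-mean-square distance from the centroid as the radius, and the participation ratio of the covariance spectrum as the effective dimension. These are our stand-ins and not necessarily the Manifold Capacity Analysis named in the table; the data are synthetic, with dimension 0 playing the role of a hypothetical task-relevant axis.

```python
import random

def centre(points):
    """Subtract the centroid from every point."""
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return [[p[i] - mean[i] for i in range(dim)] for p in points]

def manifold_radius(points):
    """Root-mean-square distance of the points from their centroid."""
    c = centre(points)
    return (sum(v * v for p in c for v in p) / len(c)) ** 0.5

def participation_ratio(points):
    """(sum of covariance eigenvalues)^2 / (sum of squared eigenvalues).
    Computed as trace(C)^2 / trace(C @ C), which needs no eigendecomposition."""
    c = centre(points)
    n, dim = len(c), len(c[0])
    cov = [[sum(p[i] * p[j] for p in c) / n for j in range(dim)] for i in range(dim)]
    trace = sum(cov[i][i] for i in range(dim))
    trace_sq = sum(cov[i][j] ** 2 for i in range(dim) for j in range(dim))
    return trace ** 2 / trace_sq

rng = random.Random(1)
# Synthetic "post-encoding" patterns: equal variance along 5 dimensions.
before = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(300)]
# Synthetic "post-consolidation" patterns: the 4 nuisance dimensions are shrunk,
# the hypothetical task-relevant dimension 0 is left untouched.
after = [[p[0]] + [0.2 * v for v in p[1:]] for p in before]

radius_before, radius_after = manifold_radius(before), manifold_radius(after)
dim_before, dim_after = participation_ratio(before), participation_ratio(after)
```

With real data, `before` and `after` would be the same stimuli measured at two time points. The prediction is that both numbers fall over time, and fall faster across sleep than across an equivalent wake interval.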
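The successor representation (SR) named under Predictive Codes has a standard definition: for a transition matrix $T$ and discount $\gamma$, $M = \sum_{t \ge 0} \gamma^t T^t$, so that $M_{ij}$ is the expected discounted number of future visits to state $j$ when starting from state $i$. The sketch below computes it for a hypothetical eight-state ring with an unbiased random walk. The environment, the discount of 0.9, and the function name are illustrative assumptions and are not taken from the paper.

```python
def successor_representation(T, gamma=0.9, iters=500):
    """Approximate M = sum_t gamma^t T^t by iterating M <- I + gamma * T @ M."""
    n = len(T)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        M = [[(1.0 if i == j else 0.0)
              + gamma * sum(T[i][k] * M[k][j] for k in range(n))
              for j in range(n)]
             for i in range(n)]
    return M

# Hypothetical environment: 8 states on a ring, step left or right with equal probability.
n_states = 8
T = [[0.0] * n_states for _ in range(n_states)]
for s in range(n_states):
    T[s][(s - 1) % n_states] = 0.5
    T[s][(s + 1) % n_states] = 0.5

M = successor_representation(T)
row_from_state_0 = M[0]   # expected discounted future occupancy of each state
```

Each row of `M` describes where the agent is expected to be in the future rather than what it currently senses, which is the sense in which the text calls these codes prospective. Nearby states receive more weight than distant ones, and each row sums to $1 / (1 - \gamma)$.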
@misc{fountas2026predictiveforgetting,
title = {Why the Brain Consolidates: Predictive Forgetting
for Optimal Generalisation},
author = {Zafeirios Fountas and Adnan Oomerjee and
Haitham Bou-Ammar and Jun Wang and Neil Burgess},
year = {2026},
eprint = {2603.04688},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2603.04688}
}