Why the Brain Consolidates — Predictive Forgetting for Optimal Generalisation

Three Implementations

Validated across diverse substrates

The same principle — reduce $I(X; Z \mid Y)$, preserve $I(Y; Z)$ — operates in cortical circuits, predictive coding networks, and large language models.
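The two quantities can be made concrete on a toy discrete problem. The sketch below assumes a Markov chain $Y \to X \to Z$ (the code depends on the label only through the input), under which the chain rule gives $I(X; Z) = I(Y; Z) + I(X; Z \mid Y)$. The toy distribution, the function names and the two example encoders are illustrative assumptions, not taken from the paper.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): probability}."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)

def information_terms(p_xy, encoder):
    """Return (I(X;Z), I(Y;Z), I(X;Z|Y)) for a deterministic code z = encoder(x)."""
    joint_xz, joint_yz = defaultdict(float), defaultdict(float)
    per_label = defaultdict(lambda: defaultdict(float))
    for (x, y), p in p_xy.items():
        z = encoder(x)
        joint_xz[(x, z)] += p
        joint_yz[(y, z)] += p
        per_label[y][(x, z)] += p
    # I(X;Z|Y) = sum_y p(y) * I(X;Z | Y=y), computed directly rather than by subtraction.
    i_xz_given_y = 0.0
    for joint in per_label.values():
        p_y = sum(joint.values())
        conditional = {pair: p / p_y for pair, p in joint.items()}
        i_xz_given_y += p_y * mutual_information(conditional)
    return mutual_information(joint_xz), mutual_information(joint_yz), i_xz_given_y

# Toy input x = (signal bit, nuisance symbol); the label y is the signal bit.
p_xy = {((s, n), s): 1 / 8 for s in (0, 1) for n in range(4)}

raw = information_terms(p_xy, encoder=lambda x: x)         # keeps everything
refined = information_terms(p_xy, encoder=lambda x: x[0])  # keeps only the signal bit

for name, (i_xz, i_yz, i_cond) in [("raw", raw), ("refined", refined)]:
    print(f"{name}: I(X;Z)={i_xz:.2f}  I(Y;Z)={i_yz:.2f}  I(X;Z|Y)={i_cond:.2f}")
```

In this toy case the raw code carries 3 bits about the input, of which 1 bit is label-relevant and 2 bits are nuisance. The refined code discards the 2 nuisance bits while keeping the full 1 bit about the label.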

Figure 2: Iterative refinement tightens generalisation bounds


Experiment I

Cortical Latent Code Refinement

A frozen encoder captures high-fidelity codes. A lightweight refiner iteratively compresses them offline — without re-accessing sensory input. Across five benchmarks, this shrinks the generalisation gap while increasing accuracy.
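The idea of refining stored codes without revisiting the input can be illustrated with a deliberately simple stand-in. The sketch below is not the paper's refiner: it assumes two-dimensional codes, known labels, and a hypothetical update rule that moves each stored code a fixed fraction toward its class centroid. All names and parameters are illustrative.

```python
import random

def class_centroids(codes, labels):
    """Mean code per label, computed from stored codes only."""
    sums, counts = {}, {}
    for z, y in zip(codes, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, value in enumerate(z):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def within_class_variance(codes, labels):
    """Average squared distance of each code from its class centroid."""
    mu = class_centroids(codes, labels)
    total = sum((v - m) ** 2
                for z, y in zip(codes, labels)
                for v, m in zip(z, mu[y]))
    return total / len(codes)

def refine_step(codes, labels, rate=0.2):
    """One offline pass: move every stored code a fraction `rate` toward its centroid."""
    mu = class_centroids(codes, labels)
    return [[v + rate * (m - v) for v, m in zip(z, mu[y])]
            for z, y in zip(codes, labels)]

def make_codes(n_per_class=50, seed=0):
    """Stand-in for frozen-encoder output: axis 0 separates classes, axis 1 is nuisance."""
    rng = random.Random(seed)
    codes, labels = [], []
    for label, centre in enumerate([-1.0, 1.0]):
        for _ in range(n_per_class):
            codes.append([centre + rng.gauss(0.0, 0.3), rng.gauss(0.0, 1.0)])
            labels.append(label)
    return codes, labels

codes, labels = make_codes()
initial_centroids = class_centroids(codes, labels)
history = [within_class_variance(codes, labels)]
for _ in range(5):  # five offline passes; the raw inputs are never touched
    codes = refine_step(codes, labels)
    history.append(within_class_variance(codes, labels))

print([round(v, 4) for v in history])
```

Under this update rule each pass scales the within-class variance by $(1 - \text{rate})^2$ and leaves the class centroids unchanged, so label-irrelevant spread shrinks while class separation is preserved. A learned refiner would replace the fixed shrinkage with a trained map.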

Autoencoder · 5 datasets

Figure 3: Bidirectional predictive coding wake-sleep consolidation


Experiment II

Bidirectional Predictive Coding

A biologically plausible circuit exploits its generative model to "dream" denoised versions of stored memories. Stronger homeostatic priors during sleep compress representations — and the benefit scales with network capacity.

Predictive Coding · Wake-Sleep

Figure 5: Hierarchical cache refinement in LLMs


Experiment III

LLM Cache Consolidation

A Cache Refiner iteratively rewrites the Key-Value store of a frozen Llama-3-8B. This closes the generalisation gap on reasoning tasks and reveals a coarse-to-fine hierarchy: global renormalisation in early layers, selective editing in deep layers.

Transformer · Llama-3-8B

Testable Predictions

Empirical signatures of predictive forgetting

The framework generates quantitative predictions testable with current neuroimaging methods.

Mechanism	Neural Signature	Methods	Status
Temporal Compression	Manifold radius/dimension reduction; increased within-category similarity	fMRI RSA, Manifold Capacity Analysis	Grounding
If consolidation implements predictive forgetting, neural representations must become progressively more compressed over time. Geometrically, this corresponds to a reduction in the radius and intrinsic dimension of the neural manifold. Pattern similarity within task-relevant categories should increase, whilst between-category distinctions are sharpened. This reinterprets the "semanticisation" observed in longitudinal fMRI not as abstract knowledge accumulation, but as the active pruning of manifold variance orthogonal to the task.
References: Chung et al. (2018) Phys Rev X; Schapiro et al. (2017) Sci Rep
Sleep Acceleration	$I(X; Z)$ compression accelerates during sleep vs. wake	Sleep manipulation, fMRI reactivation	Grounding
If offline consolidation is computationally necessary, sleep should accelerate representational change relative to equivalent wake intervals. Measuring representational geometry immediately post-encoding vs. post-sleep should reveal greater compression (manifold contraction) following sleep. This effect has been behaviourally indexed as "gist extraction" and "relational integration", but our framework offers a precise neural definition: sleep replay selectively samples traces to minimise description length.
References: Igloi et al. (2015) Nat Commun; Ellenbogen et al. (2007) Curr Biol
Generalisation Link	Neural compression predicts out-of-distribution performance	Behavioural transfer tests	Grounding
Across individuals and time points, greater neural compression should predict better generalisation to novel exemplars. This provides the critical link between our information-theoretic framework and behavioural outcomes: if reducing $I(X; Z \mid Y)$ tightens generalisation bounds, then participants exhibiting stronger compression should show superior out-of-distribution performance. Importantly, this relationship is selective: compression of information orthogonal to task demands improves generalisation, whereas compression of task-relevant structure impairs it.
References: Tompary & Davachi (2017) J Neurosci; Wimmer & Shohamy (2012) Nat Neurosci
Readout Sharpening	Increased SNR in memory traces; synaptic down-selection	Longitudinal 7T fMRI, electrophysiology	Grounding
Consolidation optimises the decision boundaries of downstream readouts. This mechanism explains "trace sharpening" recently observed in high-field fMRI, where the signal-to-noise ratio of memory representations increases over weeks. Our model interprets this not as passive decay, but as an active process driven by synaptic down-selection: synapses encoding non-predictive nuisance variables are pruned during sleep-dependent renormalisation, effectively implementing the information bottleneck.
References: Vanasse et al. (2022) Neuroimage; de Vivo et al. (2017) Science
Predictive Codes	Shift from retrospective to prospective successor representations	Sequential learning tasks	Unification
In sequential domains, minimising $I(X; Z \mid Y)$ naturally leads to representations that encode expected future state occupancy rather than immediate sensory features. SR-like codes in hippocampal CA1 may reflect the earliest stages of this compression, while the eigenvectors of the SR correspond to grid-like structural codes in entorhinal and prefrontal cortices. We predict that consolidation drives a qualitative shift from environment-specific, place-like representations toward structural, environment-general codes over days to weeks.
References: Stachenfeld et al. (2017) Nat Neurosci; Whittington et al. (2020) Cell; Bennett et al. (2025) bioRxiv
Hierarchical Gradients	Compression magnitude scales with cortical hierarchy	Multi-region fMRI	Prediction
Consolidation-related compression should vary systematically across the cortical hierarchy. Early sensory regions, which must preserve input fidelity for immediate perceptual demands, should exhibit minimal compression. Higher associative cortices (prefrontal, parietal, entorhinal) that extract abstract task structure should show maximal compression. This predicts a spatial gradient in the magnitude of $I(X; Z \mid Y)$ reduction, consistent with initial evidence of "task-tailored" geometry in prefrontal cortex.
References: Bhandari et al. (2025) bioRxiv; Krenz et al. (2023) PNAS
Reconsolidation	Retrieval induces further compression of the trace	Retrieval-practice paradigms	Prediction
Each memory retrieval event should trigger further compression. Our framework predicts that the inverse loop (recall followed by re-encoding) implements an additional consolidation step. Thus, retrieved memories should show greater compression than non-retrieved memories matched for age. This is context-dependent: when task demands require preservation of episodic detail, retrieval-induced compression is suppressed; when abstraction is required, retrieval accelerates the collapse of nuisance dimensions.
References: Hahamy et al. (2023) Science; Bridge & Voss (2014) J Neurosci
Structure / Content Split	Temporal keys stable, sensory values compressed	LLM analysis, HC subfield recording	Prediction
Based on our LLM results (Fig. 5c–d), we predict a physiological dissociation between addressing and content. Neural populations encoding temporal or contextual indices (Keys), potentially in CA3 or superficial cortical layers, should exhibit high stability during consolidation to preserve structural scaffolding. In contrast, populations encoding sensory content (Values), such as those in CA1 or deep cortical outputs, should undergo more aggressive compression.
References: Gershman et al. (2025) Trends Cogn Sci; see Fig. 5c–d in this paper
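The geometric signature named in the Temporal Compression row (reduced manifold radius and dimension) can be illustrated with two simple proxies. The sketch below assumes RMS distance from the centroid as the radius and the participation ratio $(\operatorname{tr} C)^2 / \operatorname{tr}(C^2)$ of the covariance $C$ as the effective dimension. These are stand-ins chosen for brevity, not the measures used in Manifold Capacity Analysis, and the point clouds are hypothetical.

```python
import math

def manifold_stats(points):
    """Return (RMS radius, participation ratio) of a point cloud."""
    n, d = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    centred = [[p[i] - mean[i] for i in range(d)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centred) / n for j in range(d)]
           for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))
    # For a symmetric matrix, tr(C^2) equals the sum of its squared entries.
    trace_of_square = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return math.sqrt(trace), trace ** 2 / trace_of_square

# Hypothetical "early" representation: variance spread evenly over two axes.
before = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
# Hypothetical "consolidated" representation: the second (nuisance) axis shrunk tenfold.
after = [(x, 0.1 * y) for x, y in before]

radius_before, dim_before = manifold_stats(before)
radius_after, dim_after = manifold_stats(after)
print(f"before: radius={radius_before:.3f}, dimension={dim_before:.3f}")
print(f"after:  radius={radius_after:.3f}, dimension={dim_after:.3f}")
```

Shrinking the nuisance axis lowers both numbers: the cloud contracts and its effective dimension falls from 2 toward 1. Applied to real data, the same function would take one row per trial or stimulus and one column per voxel or unit.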

