Nagesh Singh Chauhan
- Feb 18

Catastrophic Forgetting in Neural Networks

The article explores Catastrophic Forgetting in Neural Networks and different ways to counter it.

Introduction

Forgetting is a fundamental aspect of being human. We all experience moments where we can’t recall where we left our keys, struggle to remember a familiar name, or fail to remember what we ate for dinner a few nights prior. However, these apparent memory lapses are not necessarily indicative of failure. Instead, they underscore a complex cognitive process that allows our brains to prioritize, filter, and manage a flood of information. Paradoxically, the act of forgetting attests to our capacity to learn and remember.

Machine learning models, including Large Language Models, forget too. These models learn by adjusting internal parameters in response to data. If new data conflicts with what the model has previously learned, the resulting updates can suppress or overwrite the old information. Even data that is consistent with earlier learning can shift weights that were already well tuned. This phenomenon, known as ‘catastrophic forgetting,’ poses a significant hurdle in the development of stable and adaptable artificial intelligence systems.

What is Catastrophic Forgetting?

Catastrophic forgetting refers to a phenomenon observed in machine learning models, particularly in neural networks, where the model loses previously learned knowledge or experiences significant degradation in performance on previously learned tasks when it is trained on new tasks or datasets. This issue is especially prevalent in sequential learning scenarios where the model is trained continuously over time.

The primary cause of catastrophic forgetting is the lack of mechanisms for preserving previously learned knowledge while adapting to new information. When a neural network is trained on new data, the weights and parameters of the network are updated to minimize the loss function for the new task. However, these updates often result in significant changes to the model's internal representations, leading to the loss of information relevant to previous tasks.

For example, suppose you have trained a model on dog pictures and it recognizes dogs 99% of the time. You then start another training session, this time on bird pictures only. The model readjusts its weights to recognize birds and, in doing so, can lose much of its ability to identify dogs. This effect is called catastrophic forgetting or catastrophic interference.

Let me give you an example to showcase catastrophic forgetting using a simple neural network trained sequentially on two tasks:
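
Here is a minimal sketch of such an experiment. It is an illustration rather than a reference implementation, and it makes several assumptions: the "network" is a single-layer model (logistic regression) written in NumPy, and the two toy tasks, the helper names (make_task, train, accuracy), and the hyperparameters are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(direction, n=400):
    """Toy binary task (assumed): label is 1 when x lies on the positive side of `direction`."""
    x = rng.normal(size=(n, 2))
    y = (x @ np.asarray(direction, dtype=float) > 0).astype(float)
    return x, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(w, x, y, steps=300, lr=0.5):
    """Full-batch gradient descent on the logistic loss."""
    for _ in range(steps):
        grad = x.T @ (sigmoid(x @ w) - y) / len(y)
        w = w - lr * grad
    return w

def accuracy(w, x, y):
    return float(np.mean((x @ w > 0) == (y == 1)))

task_a = make_task([1.0, 1.0])    # task A: sign of x0 + x1
task_b = make_task([1.0, -1.0])   # task B: sign of x0 - x1

w = train(np.zeros(2), *task_a)
acc_a_before = accuracy(w, *task_a)

w = train(w, *task_b)             # keep training, but on task B only
acc_a_after = accuracy(w, *task_a)
acc_b_after = accuracy(w, *task_b)

print(f"task A, after training on A: {acc_a_before:.2f}")
print(f"task A, after training on B: {acc_a_after:.2f}")
print(f"task B, after training on B: {acc_b_after:.2f}")
```

The two tasks compete for the same two weights, so learning task B pulls the decision boundary away from the one task A needs.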

Catastrophic forgetting is observed when the model's performance on the first task deteriorates significantly after it has been trained on the second task, indicating that the model has forgotten some of the knowledge learned during the initial training phase.

Strategies to Counter Catastrophic Forgetting

Elastic Weight Consolidation (EWC)

Elastic Weight Consolidation (EWC) is a method for sequentially training a single artificial neural network on multiple tasks. It operates under the assumption that certain weights of the trained neural network hold greater importance for previously learned tasks. During training on a new task, adjustments to the network's weights are constrained based on their importance, with changes less likely for more critical weights. EWC estimates weight importance using probabilistic mechanisms, such as the Fisher information matrix. The technique regularizes training by introducing a penalty term to the loss function, discouraging significant changes to important parameters identified based on their role in previous tasks.


Here's how Elastic Weight Consolidation works:

Training on Previous Task: After training the neural network on a specific task, EWC computes the importance of each parameter for that task using the Fisher Information matrix. The Fisher Information matrix captures how sensitive the loss function is to small changes in each parameter.
Computing Importance Estimates: The importance of each parameter is computed as a function of the Fisher Information matrix; in practice, usually only its diagonal is used. Parameters to which the loss is highly sensitive receive higher importance estimates.
Regularizing Training on New Task: When training the network on a new task, EWC modifies the loss function to include a regularization term that penalizes changes to important parameters. The regularization term ensures that the network retains knowledge of the previous task by discouraging significant updates to critical parameters.
Trade-off Parameter: EWC introduces a hyperparameter called the trade-off parameter (λ) that controls the balance between minimizing the loss on the new task and preserving knowledge of previous tasks. A higher value of λ places more emphasis on preserving previous knowledge, while a lower value allows more flexibility for adapting to the new task.
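
Written out, the regularized loss described in these steps usually takes the following form. The notation is an assumption made for this sketch: task A is the previous task, task B the new one, \theta^{*}_{A} the weights reached after training on task A, and F_i the i-th diagonal entry of the Fisher Information matrix.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{B}(\theta) \;+\; \sum_{i} \frac{\lambda}{2}\, F_{i}\, \bigl(\theta_{i} - \theta^{*}_{A,i}\bigr)^{2}
```

The penalty is zero when the weights stay at \theta^{*}_{A} and grows fastest along the parameters with the largest F_i.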

By incorporating the EWC regularization term into the training process, neural networks can mitigate catastrophic forgetting by selectively protecting the parameters that mattered most for previous tasks. This allows the network to adapt to new tasks while retaining knowledge of earlier experiences, enabling continual learning in sequential or lifelong learning scenarios.
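
To make the mechanics concrete, here is a small NumPy sketch of EWC on a toy problem. Everything about the setup is an assumption for illustration: the model is linear regression, for which the diagonal Fisher Information under unit-variance Gaussian noise is simply the mean of each squared input feature, and the two tasks, the value of lam, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(scales, true_w, n=500):
    """Toy regression task (assumed); `scales` sets how much each feature varies."""
    x = rng.normal(size=(n, 2)) * np.asarray(scales)
    return x, x @ np.asarray(true_w)

def loss(w, x, y):
    return 0.5 * float(np.mean((x @ w - y) ** 2))

def train(w, x, y, steps=2000, lr=0.5, penalty_grad=None):
    """Gradient descent; `penalty_grad` optionally adds the gradient of a regularizer."""
    for _ in range(steps):
        g = x.T @ (x @ w - y) / len(y)
        if penalty_grad is not None:
            g = g + penalty_grad(w)
        w = w - lr * g
    return w

task_a = make_task([1.0, 0.1], [2.0, 0.0])    # task A relies mostly on feature 0
task_b = make_task([0.1, 1.0], [0.0, -3.0])   # task B relies mostly on feature 1

# Step 1: train on task A and estimate per-parameter importance.
w_a = train(np.zeros(2), *task_a)
fisher = np.mean(task_a[0] ** 2, axis=0)      # diagonal Fisher for this linear model

# Step 2: train on task B, with and without the EWC penalty.
lam = 1.0                                     # trade-off parameter (lambda)

def ewc_grad(w):
    # Gradient of (lam / 2) * sum_i fisher_i * (w_i - w_a_i)^2
    return lam * fisher * (w - w_a)

w_plain = train(w_a.copy(), *task_b)
w_ewc = train(w_a.copy(), *task_b, penalty_grad=ewc_grad)

loss_a_plain, loss_a_ewc = loss(w_plain, *task_a), loss(w_ewc, *task_a)
loss_b_ewc = loss(w_ewc, *task_b)
print(f"task A loss after plain training on B: {loss_a_plain:.3f}")
print(f"task A loss after EWC training on B:   {loss_a_ewc:.3f}")
print(f"task B loss after EWC training on B:   {loss_b_ewc:.3f}")
```

In this toy case the weight that task A depends on has a large Fisher value and is held in place, while the weight that task A barely uses is free to move for task B.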

Progressive Neural Networks

Progressive Neural Networks (PNNs) are a class of neural network architectures designed to address catastrophic forgetting in sequential learning tasks. Instead of training a single neural network on multiple tasks sequentially, PNNs maintain a separate neural network for each task encountered, called "experts" here and "columns" in the original paper.

When a new task is introduced, a new expert network is added to the architecture. Its weights are initialized randomly, and it receives lateral connections from the hidden layers of the existing experts. During training on the new task, the parameters of the new expert network, including its lateral connections, are updated, while the parameters of the existing experts are kept frozen. Because the earlier experts never change, performance on previously learned tasks is preserved while the new expert specializes in the new task.

PNNs reuse features from existing experts to speed up learning on new tasks, and freezing those experts removes the risk of overwriting them. The price is growth: every new task adds another expert, so the total parameter count increases with the number of tasks.

A Block-based ProgNet Model, https://arxiv.org/abs/1606.04671

Imagine Progressive Neural Networks as a constellation of separate processing units, each having the ability to discern and harness the most pertinent inputs for the tasks they are assigned. Let’s consider an example from Figure 3, where output₃ not only interacts with its directly connected hidden layer, h₂, but also interfaces with the h₂ layers of prior columns, modifying their outputs through its unique lateral parameters. This output₃ unit scans and evaluates the available data, strategically omitting inputs that are unnecessary. For instance, if h₂¹ encapsulates all the needed information, output₃ may choose to neglect the rest. On the other hand, if both h₂² and h₂³ carry valuable information, output₃ could preferentially focus on these while ignoring h₂¹. These side connections empower the network to effectively manage the flow of information across tasks while also enabling it to exclude irrelevant data.
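
The lateral connections can be sketched in a few lines of NumPy. This is a simplified illustration, not the architecture from the paper: it assumes two columns with a single hidden layer each, a hypothetical regression task for the second column, and an untrained first column standing in for one that was already trained and frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

class Column:
    """One column: input -> hidden -> output, plus lateral inputs from earlier columns."""

    def __init__(self, n_in, n_hidden, n_out, n_prev=0):
        self.w1 = rng.normal(scale=0.3, size=(n_in, n_hidden))
        self.w2 = rng.normal(scale=0.3, size=(n_hidden, n_out))
        # One lateral matrix per earlier column, reading that column's hidden layer.
        self.lateral = [rng.normal(scale=0.3, size=(n_hidden, n_out)) for _ in range(n_prev)]

    def hidden(self, x):
        return relu(x @ self.w1)

    def output(self, x, prev_hidden=()):
        out = self.hidden(x) @ self.w2
        for u, h in zip(self.lateral, prev_hidden):
            out = out + h @ u
        return out

def train_new_column(new, old, x, y, steps=300, lr=0.05):
    """Gradient descent on squared error; only `new` is updated, `old` stays frozen."""
    h_old = old.hidden(x)                       # frozen features, computed once
    for _ in range(steps):
        h_new = new.hidden(x)
        err = new.output(x, [h_old]) - y
        grad_w1 = x.T @ ((err @ new.w2.T) * (h_new > 0)) / len(x)
        new.w2 -= lr * (h_new.T @ err / len(x))
        new.lateral[0] -= lr * (h_old.T @ err / len(x))
        new.w1 -= lr * grad_w1

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

x = rng.normal(size=(200, 3))
y_new_task = np.sin(x[:, :1])                   # hypothetical second task

col1 = Column(3, 8, 1)                          # stands in for a trained, frozen column
col2 = Column(3, 8, 1, n_prev=1)                # new column with one lateral input

col1_out_before = col1.output(x).copy()
loss_before = mse(col2.output(x, [col1.hidden(x)]), y_new_task)
train_new_column(col2, col1, x, y_new_task)
loss_after = mse(col2.output(x, [col1.hidden(x)]), y_new_task)
col1_out_after = col1.output(x)

print(f"new-task loss: {loss_before:.3f} -> {loss_after:.3f}")
print("old column unchanged:", np.array_equal(col1_out_before, col1_out_after))
```

The old column's outputs are identical before and after training because none of its parameters appear in the update step.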

Rehearsal or Replay

The primary idea behind rehearsal is to store and replay samples from past tasks during the training of the network on new tasks. By periodically revisiting and retraining on previous data, the network can retain knowledge about those tasks while learning new ones.

Here's a detailed explanation of how rehearsal works and its key components:

Memory Buffer: A memory buffer or replay buffer is used to store a subset of past training samples. This buffer typically contains input-output pairs (e.g., images and labels) from previous tasks that the network has already been trained on. The buffer has a finite capacity and can be implemented using data structures like lists or queues.
Rehearsal Strategy: During training on a new task, the network alternates between training on the current task's data and replaying samples from the memory buffer. The rehearsal strategy determines how often and which samples are replayed from the buffer. Common strategies include replaying random samples, replaying samples with high uncertainty, or replaying samples from tasks with similar characteristics to the current task.
Regularization: Replaying past samples effectively regularizes the learning process by providing additional training data. This regularization helps prevent catastrophic forgetting by encouraging the network to maintain its performance on previous tasks while adapting to the new task. It helps stabilize the network's parameters and reduces the risk of overfitting to the current task.
Compatibility with Learning Algorithms: Rehearsal can be incorporated into various learning algorithms, including gradient-based optimization methods like stochastic gradient descent (SGD) or variants like Adam. When combined with these algorithms, rehearsal typically involves interleaving batches of current task data with batches of replayed data from the memory buffer during training iterations.
Trade-offs and Challenges: While rehearsal is effective in mitigating catastrophic forgetting, it comes with trade-offs and challenges. Managing the memory buffer's size and selecting appropriate replay strategies can impact the efficiency and effectiveness of the approach. Additionally, storing and replaying large amounts of data may incur computational overhead and memory requirements.
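
The components above can be sketched with the Python standard library alone. The class and function names (ReplayBuffer, mixed_batches) are hypothetical, and reservoir sampling is an assumption chosen here as one simple way to keep a fixed-size, uniformly random subset of everything seen so far; other selection strategies fit the same interface.

```python
import random

class ReplayBuffer:
    """Fixed-capacity memory filled by reservoir sampling (an assumed strategy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # Keep the new sample with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def sample(self, k):
        return self.rng.sample(self.samples, min(k, len(self.samples)))

def mixed_batches(new_task_data, buffer, batch_size, replay_size):
    """Yield batches of current-task data with replayed samples appended."""
    for start in range(0, len(new_task_data), batch_size):
        current = new_task_data[start:start + batch_size]
        yield current + buffer.sample(replay_size)

# Task A: every sample passes through the buffer as it is seen.
buffer = ReplayBuffer(capacity=50)
for i in range(1000):
    buffer.add(("task_a", i))

# Task B: each training batch interleaves new data with replayed task-A samples.
task_b_data = [("task_b", i) for i in range(64)]
batches = list(mixed_batches(task_b_data, buffer, batch_size=16, replay_size=4))
print(len(batches), "batches; first batch size:", len(batches[0]))
```

Each batch would then be passed to an ordinary optimizer step, so the gradient reflects both the new task and a reminder of the old one.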

Optimized Fixed Expansion Layers (OFELs)

OFELs address catastrophic forgetting by introducing additional layers into the neural network architecture that are specifically designed to preserve knowledge learned from previous tasks while allowing the network to adapt to new tasks. These additional layers are known as "expansion layers" and are strategically placed within the network to act as memory modules.

Here's how OFELs work to counter catastrophic forgetting:

Fixed Expansion Layers: The expansion layers are fixed in size and are not updated during the training process. This means that their parameters remain constant throughout training, serving as memory banks that store important features and representations learned from previous tasks.
Optimized Adaptation Layers: Alongside the fixed expansion layers, OFELs include adaptation layers that are trainable and are responsible for adapting the network's parameters to the current task. These layers are placed after the expansion layers and are responsible for learning task-specific features while leveraging the preserved knowledge stored in the expansion layers.
Preserving Knowledge: By keeping the expansion layers fixed, OFELs ensure that the representations learned from previous tasks remain intact and are not overwritten during the training of new tasks. This helps in preserving the knowledge acquired from earlier training phases, thereby mitigating catastrophic forgetting.
Adaptive Learning: The trainable adaptation layers allow the network to adjust its parameters to perform well on the current task while minimizing interference with the representations stored in the expansion layers. This adaptive learning mechanism enables the network to effectively leverage both old and new knowledge without significant degradation in performance on earlier tasks.

Overall, OFELs provide a structured approach to address catastrophic forgetting by incorporating fixed expansion layers to preserve knowledge from previous tasks and trainable adaptation layers to adapt to new tasks. This balance between stability and adaptability helps neural networks maintain performance across multiple learning scenarios without suffering from catastrophic forgetting.
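
The split between frozen and trainable layers can be illustrated with a short NumPy sketch. It only shows the mechanism described above, a fixed layer followed by a trainable one, and is not a full OFEL implementation: the random choice of fixed weights, the layer sizes, and the toy regression target are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_expand, n_out = 4, 32, 1

# Fixed expansion layer: set once and never updated (random weights are an assumption).
w_fixed = rng.normal(scale=0.5, size=(n_in, n_expand))
w_fixed.setflags(write=False)            # guard against accidental in-place updates
w_fixed_snapshot = w_fixed.copy()

# Trainable adaptation layer placed after the expansion layer.
w_adapt = np.zeros((n_expand, n_out))

def expand(x):
    return np.maximum(x @ w_fixed, 0.0)

def half_mse(pred, y):
    return 0.5 * float(np.mean((pred - y) ** 2))

x = rng.normal(size=(200, n_in))
y = np.sin(x[:, :1])                     # hypothetical current task

loss_before = half_mse(expand(x) @ w_adapt, y)
lr = 0.05
for _ in range(500):
    h = expand(x)
    err = h @ w_adapt - y
    w_adapt -= lr * (h.T @ err / len(x))   # only the adaptation layer is updated
loss_after = half_mse(expand(x) @ w_adapt, y)

print(f"loss: {loss_before:.3f} -> {loss_after:.3f}")
print("expansion layer unchanged:", np.array_equal(w_fixed, w_fixed_snapshot))
```

Only w_adapt receives gradient updates, so whatever representation the expansion layer provides is identical before and after training on the current task.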

Meta-learning Approaches

Meta-learning approaches offer a promising strategy to counter catastrophic forgetting in machine learning models, including neural networks. These methods involve training models to learn how to learn, enabling them to quickly adapt to new tasks without significant degradation in performance on previously learned tasks.

Here's a detailed explanation of meta-learning approaches:

Model-Agnostic Meta-Learning (MAML):

MAML is a popular meta-learning algorithm that aims to learn a good initialization of model parameters such that fine-tuning on a new task with a small amount of data leads to rapid convergence and good performance.
It works in two nested loops: an inner loop adapts a copy of the initial parameters to each sampled task with a few gradient steps on a handful of examples, and an outer loop updates the initial parameters so that the adapted copies perform well on their tasks.
By learning a set of initial parameters that generalize well across tasks, MAML enables efficient adaptation to new tasks, which helps limit the forgetting of previously learned knowledge.
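
In symbols, the two loops are commonly written as follows. The notation is assumed for this sketch: \theta is the shared initialization, \mathcal{T}_i a task sampled from the task distribution p(\mathcal{T}), f_\theta the model, and \alpha and \beta the inner and outer learning rates.

```latex
\theta_i' = \theta - \alpha \, \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})
\qquad \text{(inner loop, one step shown)}

\theta \leftarrow \theta - \beta \, \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\bigl(f_{\theta_i'}\bigr)
\qquad \text{(outer loop)}
```

The outer gradient is taken with respect to the initialization \theta, not the adapted weights \theta_i', which is why full MAML involves second-order derivatives.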

Reptile:

Reptile is another meta-learning algorithm that follows a similar principle to MAML but with a simpler optimization procedure.
Instead of differentiating through the inner-loop updates as MAML does, Reptile runs a few ordinary gradient steps on a sampled task and then moves the initial parameters a small step towards the resulting task-adapted parameters, repeating this over many tasks.
This approach encourages the model to find a more robust set of parameters that generalize well across tasks, thereby mitigating catastrophic forgetting.
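
A toy version of the Reptile loop fits in a few lines of NumPy. The task family (lines with slope near 2 and bias near 1), the two-parameter model, and all learning rates are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Hypothetical task family: fit y = slope * x + bias for a randomly drawn line."""
    slope = rng.uniform(1.5, 2.5)
    bias = rng.uniform(0.5, 1.5)
    x = rng.uniform(-1.0, 1.0, size=20)
    return x, slope * x + bias

def grad(w, x, y):
    """Gradient of the half mean squared error for the model w[0] * x + w[1]."""
    err = w[0] * x + w[1] - y
    return np.array([np.mean(err * x), np.mean(err)])

def adapt(w, x, y, steps=5, lr=0.1):
    """Inner loop: a few plain gradient steps on one task."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad(w, x, y)
    return w

meta_w = np.zeros(2)       # the initialization being meta-learned
meta_lr = 0.1
for _ in range(1000):
    x, y = sample_task()
    task_w = adapt(meta_w, x, y)
    # Reptile update: move the initialization towards the task-adapted weights.
    meta_w += meta_lr * (task_w - meta_w)

print("meta-learned initialization:", np.round(meta_w, 2))
```

The initialization settles near the center of the task family, so a handful of gradient steps is enough to reach any individual task from there.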

Memory-Augmented Meta-Learning:

Memory-augmented meta-learning methods incorporate external memory modules into neural network architectures to store and retrieve task-specific information.
These memory modules allow the model to retain important knowledge from previous tasks and selectively update parameters to adapt to new tasks while preserving previously learned knowledge.
By effectively leveraging memory, these approaches prevent catastrophic forgetting by maintaining a balance between adaptation to new tasks and retention of old knowledge.
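
As a rough illustration of the external memory component alone, here is a key-value store with similarity-based reads in NumPy. It leaves out the surrounding network and any learning of what to write; the class name, the cosine-similarity addressing, and the sharpness parameter are assumptions for this sketch.

```python
import numpy as np

class KeyValueMemory:
    """External memory: write (key, value) pairs, read by soft similarity lookup."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, key, value):
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(np.asarray(value, dtype=float))

    def read(self, query, sharpness=10.0):
        keys = np.stack(self.keys)
        query = np.asarray(query, dtype=float)
        # Cosine similarity between the query and every stored key.
        sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
        # Softmax over similarities gives the read weights.
        weights = np.exp(sharpness * sims)
        weights /= weights.sum()
        return weights @ np.stack(self.values)

memory = KeyValueMemory()
memory.write(key=[1.0, 0.0], value=[1.0, 0.0, 0.0])   # e.g. information from task 1
memory.write(key=[0.0, 1.0], value=[0.0, 1.0, 0.0])   # e.g. information from task 2

retrieved = memory.read([0.9, 0.1])                   # query resembling the first key
print(np.round(retrieved, 3))
```

Because stored entries are never overwritten by gradient updates, whatever was written for an earlier task remains retrievable while the rest of the model adapts.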

Gradient-Based Meta-Learning with Hypernetworks:

Gradient-based meta-learning techniques with hypernetworks involve learning a meta-network, known as a hypernetwork, that generates the parameters of task-specific networks.
By conditioning the hypernetwork on both the input data and task identity, these methods enable rapid adaptation to new tasks while preserving task-agnostic knowledge encoded in the hypernetwork.
This approach effectively separates task-specific and task-agnostic knowledge, mitigating catastrophic forgetting by updating only the task-specific parameters during adaptation.
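
The basic wiring can be sketched in NumPy as a forward pass. This simplified version conditions the hypernetwork on a task embedding only, uses a single linear layer as the hypernetwork, and omits training; the sizes, names, and random embeddings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, emb_dim = 3, 2, 4

# Shared hypernetwork parameters (task-agnostic).
hyper_w = rng.normal(scale=0.1, size=(emb_dim, n_in * n_out))

# One small embedding per task (task-specific).
task_embeddings = {
    "task_a": rng.normal(size=emb_dim),
    "task_b": rng.normal(size=emb_dim),
}

def generate_weights(task):
    """The hypernetwork maps a task embedding to the target network's weight matrix."""
    return (task_embeddings[task] @ hyper_w).reshape(n_in, n_out)

def predict(task, x):
    """The target network is a linear layer whose weights come from the hypernetwork."""
    return x @ generate_weights(task)

x = rng.normal(size=(5, n_in))
out_a = predict("task_a", x)
out_b = predict("task_b", x)
print(out_a.shape, np.allclose(out_a, out_b))
```

Adapting to a new task in this setup means learning a new embedding, while the embeddings of earlier tasks and the weights they generate can be left untouched.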

Conclusion

Forgetting is a complex phenomenon observed in both artificial intelligence and human cognition. It is a double-edged sword: it impedes continual learning, yet it also supports efficient information processing and prioritization. While strategies like Elastic Weight Consolidation, Progressive Neural Networks, and Optimized Fixed Expansion Layers offer promising avenues to mitigate catastrophic forgetting in neural networks, including Large Language Models (LLMs), it's important to recognize the interdisciplinary nature of this challenge.

Rehearsal or replay, covered earlier, draws directly on cognitive psychology: it periodically revisits past experiences or data during training to reinforce learning and prevent forgetting. These methods leverage the concept of memory consolidation observed in human memory systems.

Moreover, advancements in neuro-inspired computing have led to the development of neuromorphic architectures that mimic the brain's ability to retain and recall information efficiently. These architectures integrate principles of synaptic plasticity and memory consolidation, offering potential solutions to the problem of catastrophic forgetting in AI systems.

Overall, the quest to address catastrophic forgetting requires a holistic approach that combines insights from artificial intelligence, cognitive science, neuroscience, and computational modeling. By leveraging interdisciplinary knowledge and innovative methodologies, researchers can strive towards developing AI systems that exhibit robust and adaptive learning capabilities, akin to human cognition.
