Nagesh Singh Chauhan
- Jan 19
- 9 min read

Fine-tuning Large Language Models (LLMs) using PEFT

The article contains an overview of fine tuning approches using PEFT and its implementation using pytorch, transformers and unsloth.

Credits

Introduction

Fine-tuning large language models (LLMs) emerges as a crucial technique in the field of natural language processing, allowing professionals to tailor advanced pre-trained models to their specific needs. This exploration delves into the details of this process, offering insights into how we can refine models like GPT-3, Llama 2 and Mixtral.

Throughout this article, we'll navigate the steps involved in fine-tuning LLMs, uncovering the nuances of adapting pre-trained models to diverse applications. From sentiment analysis to named entity recognition and language translation, we'll unveil the potential of customizing models for specific domains.

As we traverse the fine-tuning landscape, we'll discuss the subtleties of selecting the right pre-trained model, setting task-specific goals, and crafting datasets that enable models to grasp the intricacies of targeted domains. The iterative nature of fine-tuning, coupled with the need for precise hyper-parameter tuning, highlights the blend of art and science in this process.

Whether you're an experienced practitioner or an enthusiast looking to comprehend the mechanics of adapting language models, this exploration aims to demystify fine-tuning, opening doors to more effective and specialized natural language processing solutions.

What is Fine-tuning?

Fine-tuning a Large Language Model (LLM) involves adjusting the parameters or weights of a pre-trained language model to adapt it to a new and specific task or dataset. In the context of natural language processing, LLMs are often trained on vast amounts of general language data. Fine-tuning allows practitioners to take advantage of this pre-existing knowledge and customize the model for more specialized applications.

The process typically begins with a pre-trained LLM, such as GPT-3 or BERT. Instead of starting from scratch, which can be computationally expensive and time-consuming, fine-tuning involves updating the model based on a smaller, task-specific dataset. This dataset is carefully curated to align with the targeted application, whether it's sentiment analysis, question answering, language translation, or any other natural language processing task.

Fine Tuning OpenAI GPT-3.5-Turbo. Credits

Fine-tuning is crucial when there is a need for domain-specific expertise or when working with limited data for a particular task. It enables the model to leverage its pre-existing linguistic knowledge while adapting to the nuances and intricacies of the new task or domain. The fine-tuned LLM retains the general language understanding acquired during pre-training but becomes more specialized and optimized for the specific requirements of the desired application.

Fine-tuning . Credits

3 Approaches to Model Fine-tuning

Fine-tuning a model can be achieved through three distinct avenues: self-supervised learning, supervised learning, and reinforcement learning. These approaches are not mutually exclusive, allowing for a combination tailored to specific requirements during the fine-tuning process.

Self-supervised Learning: In self-supervised learning, a model is trained based on the inherent structure of the training data. For language models, this often involves predicting the next word or token in a given sequence. Beyond initial model development, self-supervised learning can be applied to fine-tune models, such as creating a model to emulate a specific writing style based on example texts.

Supervised Learning: Supervised learning stands out as a popular method for model fine-tuning. It entails training a model on input-output pairs specific to a task. For instance, instruction tuning aims to enhance the model's ability to answer questions or respond to user prompts.

Reinforcement Learning: The third approach involves using reinforcement learning (RL) for model fine-tuning. RL leverages a reward model to guide the training of the base model, aligning language model completions with human labelers' preferences. By combining the reward model with an RL algorithm like Proximal Policy Optimization (PPO), the pre-trained model undergoes effective fine-tuning.

3 Options for Parameter Training

Fine-tuning a model with a substantial number of parameters (~100M-100B) necessitates consideration of computational costs. The pivotal question revolves around the selection of parameters for (re)training. Three distinct options offer flexibility in this regard.

1. Retrain all parameters: This straightforward approach involves training all internal model parameters (full parameter tuning). While conceptually simple, it proves to be the most computationally intensive, and it may encounter the challenge of catastrophic forgetting, where valuable initial knowledge is lost.

2. Transfer Learning :Transfer learning (TL) aims to retain useful model representations/features from past training when adapting the model to a new task. This involves replacing the "head" of a neural network with new layers, mitigating the computational cost of training an LLM. However, it might not fully address the issue of catastrophic forgetting.

3: Parameter Efficient Fine-tuning (PEFT): In this article we'll discuss PEFt in details.

Parameter Efficient Fine-tuning(PEFT)

Parametric efficient fine-tuning (PEFT) is a methodology used in transfer learning to efficiently fine-tune large pre-trained models without modifying most of their original parameters. PEFT aims to minimize the storage requirements and computation cost associated with traditional fine-tuning approaches, making them feasible for deployment on resource-constrained devices.

PEFT strategies involve adjusting only a limited number of additional model parameters while keeping the majority of the pretrained Language Model (LLM) parameters fixed. This results in a significant reduction in computational and storage requirements. Notably, PEFT addresses the challenges associated with catastrophic forgetting, a phenomenon observed when fully fine-tuning LLMs. Additionally, PEFT methods have demonstrated superiority over traditional fine-tuning, particularly in scenarios with limited data, and exhibit enhanced generalization capabilities in out-of-domain situations.

In short, PEFT approaches enable you to get performance comparable to full fine-tuning while only having a small number of trainable parameters.

Parameter Efficient Fine-tuning(PEFT) techniques

Presently, only the following PEFT methods are employed. Nevertheless, ongoing research is underway to explore and develop new methods.

Adapters

Adapters are special submodules added to pre-trained language models, modify hidden representations during fine-tuning. Positioned after specific layers in the transformer architecture, adapters enable updating their parameters while keeping the rest of the model frozen. This straightforward adoption involves inserting adapters into each transformer layer and adding a classifier layer atop the pre-trained model.

Adapters in PEFT. Credits

Updating adapter and classifier head parameters enhances task-specific performance without modifying the entire model, saving time and computational resources.

LoRA

LoRA (Low-Rank Adaptation) is a fine-tuning approach for large language models, akin to adapters. It introduces a small trainable submodule into the transformer architecture, freezing pre-trained model weights, and incorporating trainable rank decomposition matrices in each layer. This significantly reduces trainable parameters for downstream tasks, cutting down the count by up to 10,000 times and GPU memory requirements by 3 times. Despite this reduction, LoRA maintains or surpasses fine-tuning model quality across tasks, ensuring efficient task-switching with lowered hardware barriers and no additional inference latency.

LoRA. Credits

LoRA represents a smart balance in model fine-tuning, preserving the core strengths of large pre-trained models while adapting them efficiently for specific tasks or datasets. It’s a technique that redefines efficiency in the world of massive language models.

QLoRA

QLoRA (Quantized Low-Rank Adaptation) is an extension of the Parameter Efficient Finetuning (PEFT) approach for adapting large pretrained language models like BERT. In QLoRA, instead of adding new task-specific layers on top of the frozen pretrained model, existing higher layers are adapted.These layers are made more efficient by quantizing and decomposing their weight matrices into low-rank approximations.

For example, the weight matrix may be quantized to 8-bits and then decomposed into two smaller matrices using singular value decomposition. This allows efficiently adapting a large number of weights in the original layers using much fewer trainable parameters. Only these quantized, low-rank factorized matrices are trained on the downstream task. The rest of the model is frozen. This provides greater adaptation capacity compared to only training a new output layer, but with minimal compute and memory overhead. The low-rank adaptations are efficient to train while avoiding forgetting the original knowledge in the pretrained layers.

In the QLoRA approach, it is the original model’s weights that are quantized to 4-bit precision. The newly added Low-rank Adapter (LoRA) weights are not quantized; they remain at a higher precision and are fine-tuned during the training process. This strategy allows for efficient memory use while maintaining the performance of large language models during finetuning. Credits

Prefix tuning

Prefix-tuning is a simpler way to train big language models for tasks like writing. Instead of adjusting all the model parts, which can be costly, prefix-tuning focuses on a small task-specific part called the prefix. This prefix helps guide the model to write in a specific way for a task.

Fine-tuning (top) updates all Transformer parameters (the red Transformer box) and requires storing a full model copy for each task. They propose prefix-tuning (bottom), which freezes the Transformer parameters and only optimizes the prefix (the red prefix blocks).

Consequently, prefix-tuning only need to store the prefix for each task, making prefix-tuning modular and space-efficient. Note that each vertical block denote transformer activations at one time step. Credits

By changing only a tiny portion of the model, prefix-tuning performs as well as full fine-tuning in regular scenarios, works better with less data, and handles new topics well. Like other PEFT techniques, prefix tuning aims to reach a specific result, using prefixes to change how the model generates text. Only the prefixes are updated, not the rest of the model layers.

Prompt tuning

Prompt tuning, a PEFT method, adapts pre-trained language models for specific tasks differently. Unlike model tuning, where all parameters are adjusted, prompt tuning involves learning flexible prompts through backpropagation. These prompts, fine-tuned with labeled examples, outperform GPT-3's few-shot learning, especially with larger models. Prompt tuning enhances robustness for domain transfer and allows efficient prompt ensembling. It only requires storing a small task-specific prompt per task, making it simpler to reuse a single frozen model across various tasks compared to model tuning, which needs a task-specific model copy for each task.

P-tuning

P-tuning enhances GPT-like language models in Natural Language Understanding (NLU) tasks, surpassing traditional fine-tuning methods. It utilizes trainable continuous prompt embeddings, showing substantial improvements in precision and world knowledge recovery on benchmarks like LAMA and SuperGLUE. P-tuning minimizes the need for prompt engineering and excels in few-shot SuperGLUE scenarios compared to current approaches. This technique proves versatile, boosting pre-trained language models for tasks like sentence classification and predicting a country’s capital by adjusting input embeddings with differential output embeddings generated through a prompt. Continuous prompts optimization, aided by a downstream loss function and prompt encoder, addresses challenges of discreteness and association.

IA3

IA3, or Infused Adapter by Inhibiting and Amplifying Inner Activations, is a parameter-efficient fine-tuning technique designed to improve upon the LoRA technique by reducing trainable parameters and maintaining model performance. Similar to LoRA, IA3 is valuable for adapting large pre-trained models to specific tasks while minimizing computational demands, and its ability to merge adapter weights without adding inference latency enhances its versatility for real-time applications and various downstream tasks.

Benefits of PEFT

Here we will discuss the benefits of PEFT in relation to traditional fine-tuning. So, let us understand why parameter-efficient fine-tuning is more beneficial than fine-tuning.

Decreased computational and storage costs: PEFT involves fine-tuning only a small number of extra model parameters while freezing most parameters of the pre-trained LLMs, thereby reducing computational and storage costs significantly.
Overcoming catastrophic forgetting: During full fine-tuning of LLMs, catastrophic forgetting can occur where the model forgets the knowledge it learned during pretraining. PEFT stands to overcome this issue by only updating a few parameters.
Better performance in low-data regimes: PEFT approaches have been shown to perform better than full fine-tuning in low-data regimes and generalize better to out-of-domain scenarios.
Portability: PEFT methods enable users to obtain tiny checkpoints worth a few MBs compared to the large checkpoints of full fine-tuning. This makes the trained weights from PEFT approaches easy to deploy and use for multiple tasks without replacing the entire model.
Performance comparable to full fine-tuning: PEFT enables achieving comparable performance to full fine-tuning with only small number of trainable parameters.

Comparison of Popular PEFT Methods

Fine Tuning Mistral 7b using PEFT and Unsloth

Unsloth is an open-source platform for efficient fine-tuning of popular open-source LLMs like Llama-2, Mistral, and other derivatives. Unsloth implements optimized Triton kernels, manual autograds, etc, to speed up training. It is almost twice as fast as Huggingface and Flash Attention implementation.

Credits

Start in Google Colab, switch the runtime as T4 GPU and install unsloth and transformer.

You can find the code here.

Now, we download the 4-bit Mistral 7b model to our runtime through Unsloth’s FastLanguageModel class.

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

Data Preparation:

We will use the Yahma version of the Alpaca 52k dataset. This is a cleaned version of the original Alpaca dataset from Stanford. The dataset has data in instruction-output format. Here is an example

Model Training Process: Having loaded a 4-bit quantized Mistral-7b model, configured LoRA, and prepared data, the next step is training. Two methods are available for this: SFT (Supervised Fine Tuning) and DPO (Direct Preference Optimization).

SFT involves a labeled dataset, like the Alpaca dataset, with instructions and expected answers. Models fine-tuned with SFT learn patterns and nuances associated with questions.

On the other hand, DPO (Direct Preference Optimization) treats the task as a classification problem. It uses a dataset with instructions, an accepted answer, and a rejected answer. During fine-tuning, the aim is for the trained model to assign higher probabilities to accepted responses than a reference model, and lower probabilities for rejected answers.

To align model behavior with preferences efficiently, the model is rewarded for preferred responses and penalized for rejected ones.

For our model training, we'll employ the Supervised Fine Tuning (SFT) method using the TRL library's SFTTrainer for LoRA adapters on the Alpaca dataset. Additionally, there's a DPOTrainer class for DPO fine-tuning.

Next, start the training,

This may take sometime. Once the training is finished, we can use the fine-tuned model for inferencing.

Inferencing

Let's run the model! You can change the instruction and input - leave the output blank!

You can also use a TextStreamer for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

Output:

We can save the LoRA adapters to the local directory with the following code.

We can load the saved LoRA adapters and inference as well.

Define a function to format and print the model output.

output:

Another example:

Output:

Conclusion

In conclusion, Fine-tuning Large Language Models (LLMs) using Parameter-Efficient Fine-Tuning (PEFT) emerges as a pivotal approach in enhancing model performance while mitigating computational costs. Techniques like LoRA, IA3, and various others discussed signify the evolution towards efficient adaptation of pre-trained models to specific tasks. Whether through adapter modules, prompt tuning, or direct preference optimization, PEFT methods showcase versatility and effectiveness, offering a nuanced balance between model customization and resource efficiency. As the field advances, the continual refinement of PEFT methodologies promises to play a crucial role in maximizing the potential of large language models for a diverse array of applications.