LLMs - Parameter Efficient Fine-tuning

Deep dive into parameter efficient fine-tuning for LLMs

In the preceding articles, I walked through the fundamentals of language models, the main types of model architectures, and their primary objectives. I also explored the importance of crafting effective prompts and configuring the models accordingly, and covered techniques for fine-tuning models on single-task and multi-task objectives along with the metrics used to evaluate their performance.

Fine-tuning LLMs for specific tasks is not without hurdles. A primary challenge is “catastrophic forgetting,” where adapting the model to a new task erases what it learned for earlier ones. Full fine-tuning is also computationally demanding and memory-intensive, which makes it costly, and keeping a separate fine-tuned copy of the model for every task creates deployment difficulties. Adding to the complexity, fine-tuning large-scale pre-trained language models (PLMs) carries a prohibitive price tag, restricting accessibility. Finally, fine-tuning relies on task-specific datasets: if these datasets are limited or biased, the fine-tuned model’s performance may be compromised and fail to generalize to new, unseen data.

This article takes a more detailed plunge into a potential solution, Parameter Efficient Fine-Tuning (PEFT), and aims to show how it addresses the challenges of fine-tuning. Before we dive into the individual techniques, let’s first establish what PEFT actually is.

Parameter Efficient Fine-Tuning (PEFT)

Parameter efficient fine-tuning (PEFT) encompasses a range of approaches that enable the streamlined adaptation of pre-trained language models for diverse downstream tasks, all without the need to fine-tune the entirety of the model’s parameters. Instead, PEFT focuses exclusively on fine-tuning a limited subset of additional model parameters. This strategic choice leads to substantial reductions in both computational requirements and storage overhead. Cutting-edge PEFT methods currently attain results on par with those achieved through complete fine-tuning, marking a significant advancement in efficiency and performance.
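To make this concrete, here is a minimal sketch using the Hugging Face peft library, which wraps a pre-trained model so that only a small set of injected parameters is trainable. The model name and hyperparameter values below are illustrative choices, not recommendations from any particular paper.

```python
# Minimal sketch: wrap a pre-trained model with a LoRA config from the
# Hugging Face `peft` library so that only the injected adapter parameters
# are trainable. Model name and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM works here

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # rank of the low-rank update matrices
    lora_alpha=16,     # scaling applied to the update
    lora_dropout=0.05,
)

model = get_peft_model(base_model, peft_config)

# The base weights stay frozen; typically well under 1% of parameters are trainable.
model.print_trainable_parameters()
```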

PEFT methods fall into several categories. Here we will discuss three major ones: Selection, Reparameterization, and Additive.

Selection

In the selection method, a subset of the base LLM is selected to work with, rather than the entire model. This family includes techniques such as

  • Pruning: Identify and remove less essential model parameters based on criteria like weight magnitude or gradient sensitivity.

  • Quantization: Reduce the precision of model weights to lower bit-widths (e.g., from 32-bit floats to 8-bit integers) without a significant loss in performance (a toy sketch follows this list).

  • Knowledge Distillation: Transfer knowledge from a larger, pre-trained model (teacher) to a smaller model (student) during fine-tuning.
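To give a feel for the quantization idea, below is a toy sketch of symmetric 8-bit quantization of a single weight matrix in PyTorch. It is purely illustrative; real quantization schemes (per-channel scales, calibration, NF4, and so on) are more involved.

```python
# Toy sketch of symmetric 8-bit weight quantization (illustration only).
import torch

def quantize_int8(weight: torch.Tensor):
    """Map float32 weights to int8 plus a per-tensor scale factor."""
    scale = weight.abs().max() / 127.0                      # largest magnitude maps to 127
    q = torch.clamp((weight / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 weight matrix."""
    return q.to(torch.float32) * scale

w = torch.randn(512, 64)                 # a stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())  # small relative to the weight scale
```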

Reparameterization

  • Low-Rank Adaptation (LoRA): Approximate weight updates with lower-rank factorized matrices to reduce the number of trainable parameters while preserving expressiveness. In LoRA, the parameter update for a weight matrix is decomposed into a product of two low-rank matrices.

  • Quantized LoRA: QLoRA is an extension of LoRA that further introduces quantization to enhance parameter efficiency during fine-tuning. It builds on the principles of LoRA while introducing 4-bit NormalFloat (NF4) quantization and Double Quantization techniques (a configuration sketch follows this list).
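As a rough sketch of what a QLoRA-style setup looks like with Hugging Face transformers, bitsandbytes, and peft: the base model is loaded in 4-bit NF4 with double quantization, and LoRA adapters are attached on top. The model id and target modules below are placeholders and depend on the architecture you actually fine-tune.

```python
# Sketch of a QLoRA-style setup: load the base model in 4-bit NF4 with double
# quantization, then attach LoRA adapters. Model id and target_modules are
# placeholders that depend on the architecture used.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat (NF4) data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
)

base_model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-base-model",             # placeholder model id
    quantization_config=bnb_config,
)
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # architecture-dependent placeholder
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # trainable LoRA on a frozen 4-bit base
```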

Additive

  • Adapters: Adapters add small domain-specific layers between the existing modules of the network to adapt the model efficiently, typically a fully connected bottleneck inserted after the attention and feed-forward blocks of the LLM. Adapters usually have a smaller hidden dimension than their input (a minimal adapter sketch follows this list).

  • Soft prompts: Prompt tuning keeps the LLM weights frozen, prepends a learnable soft prompt of roughly 20 to 100 tokens to the original prompt, and fine-tunes only the embeddings of those prompt tokens.
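Here is a minimal bottleneck adapter sketch in PyTorch to make the adapter idea concrete: a down-projection to a small hidden dimension, a non-linearity, an up-projection back to the model dimension, and a residual connection. The dimensions are arbitrary illustrative choices.

```python
# Minimal bottleneck adapter sketch (illustrative dimensions).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small residual bottleneck inserted after an attention or feed-forward block."""

    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project to a small hidden size
        self.up = nn.Linear(bottleneck, d_model)    # project back to the model size
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen model's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# During fine-tuning, only the adapter parameters would be updated;
# the surrounding transformer weights stay frozen.
adapter = Adapter()
x = torch.randn(2, 16, 768)        # (batch, sequence, d_model)
print(adapter(x).shape)            # torch.Size([2, 16, 768])
```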

Low-Rank Adaptation

LoRA is a reparameterization method that approximates weight updates with lower-rank factorized matrices, decreasing the trainable parameter count while maintaining expressive capability. In LoRA, the update to a weight matrix is decomposed into the product of two lower-rank matrices. Before looking at how LoRA works, let’s go over some of its benefits.

  1. Parameter Efficiency: LoRA reduces the number of model parameters.

  2. Memory and Compute Savings: Smaller models consume less memory and compute resources.

  3. Faster Inference: LoRA models typically have quicker inference times.

  4. Regularization: It can act as a form of regularization.

  5. Scalability: You can adjust the trade-off between model size and performance.

  6. Expressiveness: LoRA maintains model performance despite fewer parameters.

  7. Energy Efficiency: Smaller models are more energy-efficient.

How does LoRA work?

Let’s break down the process of how LoRA operates into three distinct steps:

Step 1: Freeze the majority of the original Large Language Model’s weights, including those of the MLP (Multi-Layer Perceptron) blocks and the self-attention mechanism (the query, key, and value matrices).

Step 2: Introduce two rank-decomposition matrices alongside the frozen weight matrix. The rank can be chosen as needed, with typical values such as 2, 4, 8, or 16.

Step 3: Train only the weights of these smaller matrices. During inference, multiply the two small matrices together and add the result to the original (frozen) weight matrix to obtain the final output.
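These three steps fit in a few lines of PyTorch. The sketch below wraps a frozen linear layer with trainable low-rank matrices A and B; the initialization and the alpha/r scaling follow the common LoRA recipe, but the exact values are illustrative.

```python
# Minimal LoRA sketch in PyTorch: freeze a linear layer, add a trainable
# low-rank update B @ A, and sum the two paths at inference time.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # Step 1: freeze the original weights
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Step 2: rank-r matrix A (r x k)
        self.B = nn.Parameter(torch.zeros(d, r))         #         rank-r matrix B (d x r)
        self.scale = alpha / r                           # common LoRA scaling convention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step 3: add the low-rank update to the frozen projection.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```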

Let’s illustrate this concept with an example based on the base transformer. In accordance with the base transformer paper (Attention Is All You Need), the dimensions of a transformer weight matrix are d x k = 512 x 64, resulting in a total of 32,768 trainable parameters.

Now, let’s apply LoRA with a rank of r = 8. We define two matrices as follows: Matrix A with dimensions r x k = 8 x 64, totaling 512 parameters. Matrix B with dimensions d x r = 512 x 8, resulting in 4,096 parameters.

Consequently, the overall count of trainable parameters becomes A + B, which equals 512 + 4,096, resulting in 4,608 parameters. This represents an 86% reduction in the number of trainable parameters compared to the original transformer configuration.
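Plugging these dimensions into the LoRALinear sketch from above reproduces the same count:

```python
# Reusing the LoRALinear sketch from above with d x k = 512 x 64 and rank r = 8.
layer = LoRALinear(nn.Linear(64, 512, bias=False), r=8)

full = 512 * 64                                                       # 32,768 original weights
lora = sum(p.numel() for p in layer.parameters() if p.requires_grad)  # 8*64 + 512*8
print(lora)                                   # 4608
print(f"reduction: {1 - lora / full:.0%}")    # ~86% fewer trainable parameters
```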

Here are the results of the E2E NLG Challenge from the original LoRA paper, where LoRA came out on top on almost every scoring metric.

Results from the original paper

Prompt Tuning

Prompt tuning should not be confused with prompt engineering. In prompt tuning, so-called “soft prompts” are introduced into the input: the original Large Language Model (LLM) weights stay frozen, a soft prompt of roughly 20 to 100 trainable tokens is prepended to the original prompt, and only the embeddings of those soft prompt tokens are fine-tuned. The objective of this fine-tuning isn’t to make the model aware of specific hand-written tokens, but rather to encourage the formation of semantic groupings based on similarities in meaning.

Full finetuning vs Prompt Tuning

During inference, task modification is achievable by switching the soft prompt tokens.
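A simple way to picture this is as a small trainable matrix of virtual-token embeddings that gets prepended to the embedded input before it enters the frozen model. The sketch below is a hand-rolled illustration with arbitrary sizes, not the exact mechanism of any particular library.

```python
# Toy sketch of prompt tuning: trainable "soft prompt" vectors prepended to
# the embedded input of a frozen model. Sizes are illustrative.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, num_virtual_tokens: int = 20, d_model: int = 768):
        super().__init__()
        # The only trainable parameters: one embedding per virtual token.
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # prepend the soft prompt tokens

soft_prompt = SoftPrompt()
input_embeds = torch.randn(2, 10, 768)   # frozen embedding layer output (batch, seq, d_model)
print(soft_prompt(input_embeds).shape)   # torch.Size([2, 30, 768])
# Swapping in a different trained SoftPrompt switches the task at inference time.
```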

Conclusion

In conclusion, Parameter Efficient Fine-tuning (PEFT) streamlines the adaptation of pre-trained language models to a variety of tasks by fine-tuning only a small number of parameters. PEFT includes Selection, Reparameterization, and Additive techniques.

Reparameterization methods like LoRA reduce parameters while preserving model capacity. LoRA involves freezing most weights, introducing rank-decomposed matrices, and training smaller matrices independently.

Prompt tuning introduces “soft prompts” for semantic grouping without altering model weights. It enables task modification during inference by changing soft prompts.

In upcoming articles, I will be exploring LoRA and Prompt Tuning in more detail with implementation.
