Fine-Tuning with Finesse: Parameter Efficient Fine-Tuning (PEFT)

Pratima Rathore
8 min read · Aug 16, 2023


Transformer-based Large Language Models (LLMs) such as GPT, T5, and BERT have demonstrated exceptional performance across a wide array of Natural Language Processing (NLP) tasks, setting new benchmarks.

Many applications in natural language processing rely on adapting one large-scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model.

When these pre-trained LLMs are fine-tuned on task-specific datasets, they achieve significant performance improvements compared to using the original pre-trained models directly (for instance, in zero-shot inference scenarios).

Example — Regular 16-bit finetuning of a LLaMA 65B parameter model requires more than 780 GB of GPU memory. While recent quantization methods can reduce the memory footprint of LLMs, such techniques only work for inference and break down during training.
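To see roughly where that number comes from, here is a back-of-the-envelope estimate, assuming 16-bit weights and gradients plus fp32 Adam optimizer states (activations and framework overhead come on top of this):

```python
# Rough memory estimate for full 16-bit finetuning of a 65B-parameter model with Adam.
# This is a sketch under stated assumptions, not an exact accounting.
params = 65e9                       # LLaMA 65B parameters
weights = params * 2                # fp16/bf16 weights: 2 bytes per parameter
gradients = params * 2              # fp16/bf16 gradients: 2 bytes per parameter
optimizer = params * 8              # Adam keeps two fp32 moments: 8 bytes per parameter
total_gb = (weights + gradients + optimizer) / 1e9
print(f"~{total_gb:.0f} GB")        # ~780 GB, before activations
```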

Challenges with Standard Finetuning

  • With the increase in model size, full finetuning of LLMs becomes impractical on standard consumer hardware.
  • Performing complete fine-tuning on LLMs can also lead to catastrophic forgetting. This occurs when the model, after being trained on a second task, loses its ability to effectively execute the first task it was trained on. This is a recognized concern in the realm of neural networks.
  • Storing and serving a separately fine-tuned model for each specific task becomes costly, since each fine-tuned copy is as large as the original pretrained model.

ICL — an alternate for traditional fine-tuning

Attention then turned to Few-shot In-Context Learning (ICL), wherein the model is prompted with a handful of task examples and produces the desired output directly. ICL enables the model to undertake novel tasks through prompted examples alone, with no gradient-based training at all. Nonetheless, ICL comes with significant costs in computation, memory, and storage: the prompt must be processed for every single prediction, and performance is often subpar in comparison to fine-tuning, which has made this approach less appealing.
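As a concrete, hypothetical illustration, an ICL prompt simply packs labeled examples into the model input:

```python
# A hypothetical few-shot (in-context learning) prompt: the task is "taught"
# entirely through examples in the input; no model parameters are updated.
prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n\n"
    "Review: The battery lasts all day.\nSentiment: Positive\n\n"
    "Review: The screen cracked within a week.\nSentiment: Negative\n\n"
    "Review: Setup was effortless and the app just works.\nSentiment:"
)
# The full prompt (instructions plus examples) must be re-processed on every
# prediction, which is the computational cost ICL pays instead of training.
```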

This is precisely where Parameter-efficient Fine-tuning (PEFT) emerges as an alternative framework to prompting.

PEFT — an efficient alternate to standard fine-tuning

An additional paradigm for enabling a model to perform a new task with minimal updates is parameter-efficient fine-tuning (PEFT), where a pre-trained model is adapted by updating only a small number of its existing parameters or a newly introduced set of parameters.

ICL vs PEFT

ICL improves a pre-trained language model's few-shot performance by supplying task examples and additional context, such as sentences or paragraphs, directly in the input at inference time; this helps the model generalize to new tasks despite limited data, without any parameter updates. Parameter-efficient fine-tuning, on the other hand, freezes most of the model's parameters and updates only a small subset for the downstream task. This preserves the pre-trained knowledge and improves performance with limited data, and it typically outperforms ICL in accuracy while requiring significantly fewer computational resources per prediction. Explore the article ICL vs PEFT for an in-depth comparison.

PEFT methods can be differentiated by their underlying approach or conceptual framework: does the method introduce new parameters to the model, or does it fine-tune a small subset of the existing parameters? Alternatively, they can be categorized by their primary objective: does the method aim to minimize the memory footprint, or only storage efficiency?

Additive methods enhance the pre-trained model with new parameters or layers and train only those new parameters. Despite introducing extra parameters, these methods significantly improve training efficiency by shrinking the gradients and optimizer states that must be kept in memory. In practical terms, training requires far more GPU memory than the model weights alone, roughly 12 to 20 times as much. Additive PEFT methods save memory on optimizer states and gradients, and they allow the frozen parameters to be quantized. This enables fine-tuning of larger networks with larger microbatch sizes, which boosts GPU training throughput; in distributed setups, optimizing fewer parameters also significantly reduces communication volume. Within this category are Adapter-like methods, Soft Prompts, LeTS (Learn-to-Share), LST (Ladder Side-Tuning), and (IA)³ (Infused Adapter by Inhibiting and Amplifying Inner Activations).

  • Adapter-like methods — Adapters are additive parameter-efficient fine-tuning methods that insert small trainable feed-forward networks between the layers of the frozen pre-trained model. Variations alter the adapter placement, apply pruning, or use reparametrization to further decrease the number of trainable parameters (a minimal sketch follows this list).
  • Soft Prompts — To counter the limitations of discrete language-model prompting, "soft" or "continuous" prompts were introduced. This approach fine-tunes a portion of the model's input embeddings using gradient descent, turning the problem of selecting prompts in a discrete space into a continuous optimization task. Soft prompts can be trained for the input layer only or applied across all layers. Recent work explores pre-training soft prompts, or reusing prompts from other tasks, to reduce the computational cost of fine-tuning a soft prompt for a new task. Examples include Prompt Tuning, Prefix-Tuning, and Intrinsic Prompt Tuning (IPT).
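As a concrete illustration of the adapter idea, here is a minimal, hypothetical bottleneck adapter in PyTorch (the bottleneck size and placement are illustrative assumptions); the surrounding pre-trained layers stay frozen and only these small projections receive gradients:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Minimal bottleneck adapter sketch inserted between frozen transformer layers."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down to a small bottleneck
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up to the hidden size
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the frozen model's representation intact;
        # only the small down/up projections are trained.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```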

Selective parameter-efficient fine-tuning methods fine-tune a subset of the existing parameters of the model. The selection can be based on layer depth, layer type, or even individual parameters. The earliest selective PEFT approaches fine-tune only the top layers of the network. More recent approaches select by layer type or internal structure, for example tuning only the biases (BitFit fine-tunes only the network's biases) or specific rows. Extreme selective methods use sparse updates that can ignore the structure of the model entirely and select parameters individually, as in DiffPruning, Freeze and Reconfigure (FAR), and FishMask.
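A BitFit-style selection can be sketched in a few lines of PyTorch; the small model below is only a stand-in for a pre-trained network, to keep the example self-contained:

```python
import torch.nn as nn

# Stand-in for any pre-trained network (e.g. a transformer); illustrative only.
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))

# BitFit-style selective finetuning: train bias terms only, freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable / total:.2%} of all parameters")
```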

Reparametrization-based parameter-efficient fine-tuning methods leverage low-rank representations to minimize the number of trainable parameters. Fine-tuning can be performed effectively in low-rank subspaces, and findings indicate that larger models, or models pretrained for longer, require smaller adaptation subspaces. Prominent reparametrization approaches in this category are Intrinsic SAID, Low-Rank Adaptation (LoRA), and KronA.

  • LoRA — The premise behind LoRA is that although weight updates are nominally full-rank (every row and column can be linearly independent), they can be represented in a lower-dimensional space while retaining most of their structure, i.e. the update is effectively low-rank.

LoRA expresses the weight update ΔW as the product of two low-rank matrices, B and A, and only these B and A matrices are trainable. The rank r is a hyperparameter that must be tuned. In a transformer network, LoRA is typically applied only to the attention weights.
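The saving is easy to quantify: for a weight matrix of size d × k, a full update trains d·k values, whereas LoRA trains only r·(d + k) values, split across B (d × r) and A (r × k). With illustrative numbers:

```python
# Rough parameter-count comparison for one attention weight matrix
# (illustrative numbers: d = k = 4096, rank r = 8).
d, k, r = 4096, 4096, 8
full_update = d * k          # ~16.8M trainable values for a full delta-W
lora_update = r * (d + k)    # ~65K trainable values for B (d x r) and A (r x k)
print(full_update, lora_update, full_update // lora_update)  # ~256x fewer
```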

LoRA is also a small trainable submodule that can be inserted into the transformer architecture. It involves freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the transformer architecture, greatly diminishing the number of trainable parameters for downstream tasks.

Fine-tuning changes the hidden representation h calculated by the original transformer model. In this case, h is the hidden representation computed by the feed-forward up-projection layer of the original transformer, while the vector computed by LoRA is the incremental change Δh used to modify the original h. The sum of the original representation and the incremental change is the updated hidden representation h' = h + Δh.
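Putting this together, a minimal, hypothetical LoRA linear layer might look roughly like the sketch below (the class name, initializations, and hyperparameters are illustrative, not a reference implementation); the frozen weight produces h, the B·A path produces Δh, and their sum is h':

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a linear layer with a frozen weight and a trainable low-rank update."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        # Frozen pre-trained weight W0 (in practice copied from the original layer).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # trainable, r x in
        self.B = nn.Parameter(torch.zeros(out_features, r))        # trainable, out x r; zero init so delta_h starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        h = x @ self.weight.T                  # original hidden representation h
        delta_h = (x @ self.A.T) @ self.B.T    # incremental change delta_h from the low-rank path
        return h + self.scaling * delta_h      # updated representation h' = h + delta_h

layer = LoRALinear(768, 768)
out = layer(torch.randn(2, 768))               # usage: same interface as a normal linear layer
```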

  • QLoRA — Quantized Low-Rank Adapters takes LoRA to the next level by incorporating three advancements to reduce memory use without sacrificing performance:
  1. 4-bit quantization using the 4-bit NormalFloat (NF4) data type, an information-theoretically optimal quantization data type for normally distributed data that yields better empirical results than 4-bit integers and 4-bit floats.
  2. Double Quantization, which reduces the average memory footprint by quantizing the quantization constants themselves.
  3. Paged Optimizers using NVIDIA unified memory to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length.

QLoRA uses a novel high-precision technique to quantize a pretrained model to 4-bit, then adds a small set of learnable low-rank adapter weights that are tuned by backpropagating gradients through the quantized weights.

QLoRA operates under the premise that a substantial portion of a LLM’s information resides within its weights, and that other details can be approximated without significant accuracy loss. By quantizing LLM weights to 4 bits, QLoRA reduces memory usage by a factor of 8. The quantized LLM undergoes further training via QLoRA’s utilization of Low Rank Adapters (LoRA), allowing the refined model to maintain most of the original LLM’s accuracy with considerably smaller size and faster speed.
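In practice this workflow is available through the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below is illustrative (the model name and hyperparameters are placeholders), but it wires together the pieces described above: 4-bit NF4 quantization, double quantization, and LoRA adapters on top of the quantized base model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config: NF4 data type plus double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the frozen, quantized base model (model name is a placeholder).
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
base_model = prepare_model_for_kbit_training(base_model)

# Attach trainable LoRA adapters to the attention projections.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```

A paged optimizer (for example, setting optim="paged_adamw_8bit" in TrainingArguments) can then be used during training to absorb the memory spikes mentioned above.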

QLoRA’s advancements are considered progress towards making large language model fine-tuning accessible to a broader audience. Through memory reduction and improved efficiency, QLoRA paves the way for smaller research teams with limited resources to engage in fine-tuning large language models.

Hybrid methods in parameter-efficient fine-tuning merge various techniques to enhance performance and cut computational expenses. They blend different approaches synergistically, capitalizing on strengths and minimizing weaknesses to achieve better efficiency and performance. MAM Adapter combines Adapters and Prompt tuning, while UniPELT is a gated combination of LoRA, Prefix-Tuning, and Adapters. Compacter and KronA reparametrize adapters to decrease the parameter count. An automated algorithm, S4, combines all PEFT classes to optimize accuracy with a minimal increase in parameters (0.5%).

Conclusion

In this blog post, we looked into a few parameter-efficient finetuning techniques for large language models. PEFT extends beyond efficiency to unleash potential. It empowers smaller research teams with resource constraints to fine-tune large language models and propels the development of more sophisticated models capable of advanced human language understanding and generation.

If you liked the article, show your support by clapping for it. Follow me and let's unravel the mysteries and unleash the potential of AI. Feel free to connect with me on LinkedIn as well!

Happy Parameter-Efficient Fine-Tuning!

References

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning

QLORA: Efficient Finetuning of Quantized LLMs

https://huggingface.co/blog/peft
