Fine-tuning LLMs Made Easy with LoRA and Generative AI-Stable Diffusion LoRA

xiao sean
11 min read · Mar 11, 2023


Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," in ICLR 2022.

Paper link: https://arxiv.org/abs/2106.09685

Github(Pytorch): https://github.com/microsoft/LoRA

This article will cover:

  • Microsoft’s LoRA technique, which can significantly reduce the number of parameters required for fine-tuning large language models
  • A comparison of memory usage between using LoRA and not using it
  • An introduction to LoRA technology
  • The combination of LoRA and Stable Diffusion — Stable-Diffusion-LoRA
  • How Stable-Diffusion-LoRA enables users to share their models with others online simply by sharing LoRA's weight file, realizing the idea of patches/plugins

Introduction

In recent years, Large Language Models (LLMs), also known as Foundation Models, have been trained on large datasets with massive numbers of parameters, such as GPT-3 (175B parameters). The emergence of ChatGPT also shows how well LLMs generalize, as they perform well on common problems. However, for specific domains, although in-context learning can work with only a few examples (few-shot), fine-tuning the model yields better results.

As models have grown ever larger, directly fine-tuning all of their parameters incurs significant costs. In recent years, researchers have therefore focused on efficient fine-tuning, known as Parameter-Efficient Fine-Tuning (PEFT). This article introduces Low-Rank Adaptation (LoRA), proposed by a Microsoft team, which freezes the weights of the pre-trained model (e.g., GPT-3) and fine-tunes a small add-on network, achieving excellent fine-tuning results in a way similar to the Adapter concept. The idea is to insert small LoRA networks into specific layers so that the model can adapt to different tasks.

In addition, Cheng-Han Chiang, Yung-Sung Chuang, and Hung-yi Lee provide a detailed explanation in their tutorial "AACL_2022_tutorial_PLMs" (2022), as shown in the following figure.

In the future, instead of fine-tuning all the parameters of a large neural network, the approach may shift toward training a small model or set of weights and combining it with the weights of specific layers of the original LLM. Compared to fine-tuning the GPT-3 model, this method requires about 10,000 times fewer trainable parameters and only about one third of the GPU memory. This technique is not only applied to LLMs but is also widely used to train high-resolution image-generating AI, such as the Stable Diffusion generative model.

Comparing memory usage with and without LoRA:

As shown in the figure below, the LoRA models require only a small number of trainable parameters (0.5M and 11M), far fewer than the original LLM (here GPT-2 Medium, with 345M parameters). Moreover, at batch size 1, LoRA's inference latency is lower than that of the earlier Adapter technique.

Why does the number of parameters (the size) of a model matter so much? First, it helps to understand how much GPU memory is used during model training. For details, please refer to Jacob Stern's comprehensive guide to memory usage in PyTorch.

The peak memory usage during training (without considering mixed precision) can be estimated as follows:

  • Formula: memory for the model weights + memory for the forward pass (flexible; see below) + memory for gradients (proportional to the number of trainable parameters) + memory for optimizer state (a multiple of the trainable-parameter memory; usually the largest term, since Adam keeps roughly two extra copies of the trainable parameters). A small estimator sketch follows this list.
  • The forward-pass memory depends on the batch size, the size of the inputs, and whether mixed precision is used. This part can be reduced with PyTorch's gradient checkpointing (torch.utils.checkpoint), which trades extra computation for memory and is flexible to adjust.
  • The memory required for optimizer state depends on the optimizer (SGD: 0, RMSProp: 1, Adam: 2 extra copies per trainable parameter). The commonly used Adam optimizer keeps exponential moving averages of the gradients and of their squares (momentum and variance), so it stores about twice the size of the trainable parameters!
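As a rough illustration of the formula above, here is a back-of-envelope estimator (a minimal sketch that assumes fp32 training, ignores activation memory, and uses the optimizer multipliers listed above; the function name and example numbers are only for illustration):

def estimate_training_memory_gb(total_params, trainable_params, optimizer_slots=2):
    """Rough fp32 training-memory estimate in GB, ignoring activations.

    optimizer_slots: extra copies of the trainable parameters kept as optimizer
    state (SGD: 0, RMSProp: 1, Adam: 2), as listed above.
    """
    bytes_per_param = 4  # fp32
    weights = total_params * bytes_per_param
    gradients = trainable_params * bytes_per_param
    optimizer_state = optimizer_slots * trainable_params * bytes_per_param
    return (weights + gradients + optimizer_state) / 1024**3

# Full fine-tuning of GPT-2 Medium (~345M parameters) with Adam
print(estimate_training_memory_gb(345e6, 345e6))  # ~5.1 GB before activations
# Same model, but with only ~0.5M trainable LoRA parameters
print(estimate_training_memory_gb(345e6, 0.5e6))  # ~1.3 GB before activations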

The following is an estimate using the Adam optimizer and no mixed precision (corrections are welcome if there are any errors!):

Suppose the weights of a large model occupy 4 slots of memory. To train the entire model, the gradients require another 4 slots (one for each trainable parameter), and the Adam optimizer requires twice the model size, another 8 slots. This does not yet include the forward pass, which here is assumed to need a further 4 model-sizes of memory. For an extremely large model like GPT-3, with 175B parameters (roughly 700 GB just to store the fp32 weights), the required memory is enormous!

Although in practice mixed precision and other tricks are used, the memory required for training still scales with the number of trainable parameters. In fact, when fine-tuning large models we usually train only part of the parameters through adapters rather than the whole model. By freezing the LLM weights, we only need to train a small extension of the model, so the memory needed for the optimizer state and the backward pass shrinks immediately. If we train only the Θ parameters required by the LoRA modules (typically < 0.1% of the LLM's parameters), memory usage becomes what is shown in the figure below. In reality it is even less, because the added trainable parameters are very few and the extra computational cost is very low.
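To make the freezing step concrete, here is a minimal sketch (the nn.Transformer below is only a stand-in for a real pre-trained LLM):

import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that will receive gradients and optimizer state."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# Freeze every pre-trained weight; only parameters added afterwards (e.g. LoRA
# matrices) remain trainable, so gradient and optimizer memory shrink with them.
model = nn.Transformer(d_model=512, nhead=8)  # stand-in for a large LLM
for p in model.parameters():
    p.requires_grad_(False)
print(trainable_fraction(model))  # 0.0 until trainable LoRA weights are added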

Method — LoRA (Low-Rank Adaptation)

In the past, to make an LLM or Foundation Model (such as the GPT series) applicable to various downstream tasks, the goal of training the model (Φ) was to ensure that it performs well across multiple different tasks (Z).
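In the paper's notation, full fine-tuning maximizes the conditional language-modeling likelihood over the task data Z by updating all of the pre-trained weights Φ:

\max_{\Phi} \sum_{(x,y)\in Z} \sum_{t=1}^{|y|} \log\big(P_{\Phi}(y_t \mid x, y_{<t})\big)

so the learned update ΔΦ has the same dimension as the full weight set Φ₀, which is exactly what LoRA avoids.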

The following figure shows the downstream tasks used for GPT-1, which include common NLP tasks such as classification, textual entailment, similarity comparison, and multiple-choice questions. The model is trained by providing different prompts as input.

Radford et al., "Improving Language Understanding by Generative Pre-Training," 2018.

In the past, there have been two Parameter-Efficient Fine-Tuning approaches for different downstream tasks:

  1. Adapter: add a small number of extra modules to the model architecture, freeze the LLM parameters, and train only the added modules.
  2. Prefixing: add tokens to the beginning of the prompt so that the model performs better on specific tasks.

The LoRA introduced in this article belongs to the Adapter type. The idea behind LoRA is that since an LLM is applicable to different tasks, the model contains different neurons/features for handling different tasks. If we can find the features relevant to the downstream task among the many features and amplify them, we can achieve better results on that specific task. Therefore, by combining the LLM model Φ with an additional set of trainable weights Θ (rank-decomposition matrices), the downstream task results can be optimized.

The orange module on the right represents the LoRA weights to be trained. By restricting the intermediate rank r to a small value, the number of trainable parameters is greatly reduced, since the feature dimensionality is bottlenecked to r << d. The total number of trainable parameters can be expressed as |Θ| = 2 × L_LoRA × d_model × r, where L_LoRA is the number of weight matrices to which LoRA modules are applied; in the paper, the LoRA modules were inserted into the attention layers of the Transformer architecture. The best value of r varies with the task, but in the experiments a value of 2~4 already yields good results. Ultimately, we optimize the downstream tasks through the LoRA modules, as shown in the formula below.
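Reconstructing that formula from the paper: LoRA keeps the pre-trained weights Φ₀ frozen and optimizes only the much smaller parameter set Θ that defines the update ΔΦ(Θ):

\max_{\Theta} \sum_{(x,y)\in Z} \sum_{t=1}^{|y|} \log\big(p_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})\big)

where each adapted weight matrix is updated as \Delta W = BA, with B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k}, and r \ll \min(d, k).

As a concrete illustration, a minimal PyTorch sketch of such a layer might look like this (this is not the loralib implementation; the class and attribute names are made up for the example):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update BA."""
    def __init__(self, in_features, out_features, r=4, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)  # stands in for the pre-trained W0
        self.base.weight.requires_grad_(False)  # freeze W0
        self.base.bias.requires_grad_(False)
        # Rank-decomposition matrices: only these 2 * d * r values are trained
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at the start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

With d_model = 1024 and r = 4, each adapted matrix adds only 2 × 1024 × 4 = 8,192 trainable parameters, which is where the |Θ| = 2 × L_LoRA × d_model × r count above comes from.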

Experiments

LoRA achieved better results than full fine-tuning while requiring far fewer trainable parameters.

Compared to other efficient Fine-tuning methods, LoRA achieved the best accuracy.

The experiments evaluated adding LoRA modules only to the attention block, and compared which projection matrices (Wq, Wk, Wv, or Wo) give the best results while keeping the total number of trainable parameters fixed.

The choice of the rank r was also investigated.

LoRA Quick Start Guide

For more details, please refer to the Github repository: https://github.com/microsoft/LoRA

Although the implementation details are non-trivial, the package is well encapsulated and easy to use.

LoRA has now been integrated into Hugging Face's Parameter-Efficient Fine-Tuning (PEFT) library.
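With the peft library, the wrapping looks roughly like this (a sketch: the target_modules names depend on the model architecture, and the exact API may differ slightly between peft versions):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM from the Hub
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"])  # GPT-2's fused attention projection
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports that only a tiny fraction of parameters is trainable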

To replace specific layers of a model with LoRA layers using loralib directly, the model definition needs to be adjusted, but the call is simple:

# ===== Before =====
# layer = nn.Linear(in_features, out_features)
# ===== After ======
import loralib as lora
# Add a pair of low-rank adaptation matrices with rank r=16
layer = lora.Linear(in_features, out_features, r=16)

Before training, freeze the original LLM model and set only the LoRA parameters to be trainable.

import loralib as lora
model = BigModel()
# This sets requires_grad to False for all parameters without the string "lora_" in their names
lora.mark_only_lora_as_trainable(model)
# Training loop: only the LoRA parameters receive gradients and optimizer updates
for batch in dataloader:
    ...

When saving the model, you can save only the weights trained by LoRA, which makes it easy to share your weights/patches.

# ===== Before =====
# torch.save(model.state_dict(), checkpoint_path)
# ===== After =====
torch.save(lora.lora_state_dict(model), checkpoint_path)

When loading either the pre-trained LLM weights or the LoRA weights, set strict=False, since each checkpoint contains only part of the model's parameters.

# Load the pretrained checkpoint first
model.load_state_dict(torch.load('ckpt_pretrained.pt'), strict=False)
# Then load the LoRA checkpoint
model.load_state_dict(torch.load('ckpt_lora.pt'), strict=False)

Summary of LoRA Paper

  1. Sharing one large LLM is the future trend. To adapt it to a specific task, training only the LoRA module is sufficient, which brings convenience and replaceability: one can save the LoRA weights for each task and quickly share them or switch between tasks.
  2. By significantly reducing the number of trainable parameters, the hardware requirements for training drop considerably, and compared with a fully fine-tuned model, the added inference latency is minimal.
  3. Drawbacks: since LoRA optimizes a separate module for each task, if a single batch contains inputs from different tasks, the code must be adapted to route each input through its corresponding LoRA module.

In addition to the LoRA method of Parameter-Efficient Fine-Tuning (PEFT) introduced in this article, there are other research directions for training or serving large models with fewer devices. For example, Microsoft's DeepSpeed integrates ZeRO-Offload (Democratizing Billion-Scale Model Training), and another popular technique is FlexGen, which can run inference for the GPT-3 175B model on a single GPU. These are all recent technologies worth exploring if you are interested.

Stable-Diffusion-LoRA (Low-Rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning)

In recent years, generative AI models such as DALL-E and Stable Diffusion have demonstrated the ability to generate high-quality, high-resolution images. However, these models require significant computing resources to train because of the high-resolution training data. Even though Stable Diffusion replaces pixel-level diffusion with a Latent Diffusion Model, which reduces memory requirements, fine-tuning it still cannot be done on a single 11 GB GPU. The integration of LoRA into Stable Diffusion, released as Stable-Diffusion-LoRA, changes this paradigm.

Integrating LoRA into Stable-diffusion brings the following benefits:

  • Much faster training
  • Can train directly on an 11GB VRAM GPU
  • LoRA weights only take up 3MB~200MB, making them easy to share

Hugging Face has also published a LoRA tutorial for those interested in using the technique.
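As an illustration of how such a shared LoRA weight file is consumed, a minimal sketch with the diffusers library might look like this (assuming a recent diffusers version; the checkpoint path is a placeholder):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a small LoRA weight file (a few MB to ~200 MB) on top of the frozen base model
pipe.load_lora_weights("path/to/lora_weights.safetensors")  # placeholder path

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")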

This technological breakthrough has expanded the Stable Diffusion model community: models can now be uploaded to the CivitAI website, and many of them were trained with LoRA.

There are also many resources online for generating/training models using Colab or personal computers. Recently, the Stable-diffusion community has open-sourced many projects and provided GUI interfaces, allowing even non-programmers to train high-quality generative AI.

Stable-diffusion-webui-colab

Kohya's GUI, which supports Windows

For those who just want to try Stable-diffusion, it is recommended to use the WebUI. Not only can you use the officially released models, but it is also directly linked to CivitAI, allowing you to download other people’s generative models.

Other Resources:

Stable-diffusion-webui Online Services

GitHub — AUTOMATIC1111/stable-diffusion-webui

GitHub — camenduru/stable-diffusion-webui-colab: stable diffusion webui colab

Off Topic Zone

Recently, many AI tools have become available, and how effectively we use them will determine our productivity in the future. Here I recommend a website called "Explainpaper," which can help you read research papers more effectively: highlight the text you don't understand, and the AI will explain it based on the whole article. I only use it occasionally, and not every explanation is good, but it is a trend. There are so many papers that it's faster to let AI read them…

I hope my small contribution of knowledge can help you before I am replaced by AI :)

References:

Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." International Conference on Learning Representations (ICLR), 2022.

Cheng-Han Chiang, Yung-Sung Chuang, and Hung-yi Lee. "AACL_2022_tutorial_PLMs," 2022.

Hugging Face: LoRA Tutorial

Hugging Face: PEFT

Published on Xiaosean website.


I am a Ph.D. candidate in Computer Science at the National Taiwan University of Science and Technology.