February 18, 2023

What I Learned About Fine-tuning Stable Diffusion

LoRA (Low-Rank Adaptation of Large Language Models) support was added to Diffusers a few weeks ago, which enables fine-tuning a Stable Diffusion (SD) model with much lower GPU requirements, so I can finally try it on my old RTX 2080 Ti (I use a Tesla V100 most of the time now). In addition, LoRA fine-tuning is much faster and the trained weights are much smaller, e.g., ~3MB vs. ~5GB for full fine-tuning (LoRA models found on civitai.com are often ~100MB-200MB because they use a larger rank value such as 128; the default is 4, as explained here).
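To see why the rank drives the file size, here is a rough back-of-the-envelope sketch; the layer dimensions below are illustrative, not the exact Stable Diffusion attention shapes:

# LoRA adds two low-rank matrices (d x r and r x k) per adapted weight,
# so the saved adapter grows roughly linearly with the rank r.
def lora_params(d, k, r):
    return d * r + r * k

d = k = 768  # hypothetical attention projection size
print(lora_params(d, k, 128) / lora_params(d, k, 4))  # 32.0, i.e., rank 128 stores ~32x more than rank 4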

There are many tutorials on fine-tuning Stable Diffusion using Colab [1] and UI tools [2][3][4]. But I did not find a good “self-contained” repo with environment setup, simple sample datasets, training scripts, and instructions so that people can just clone, customize, and run.

In this tutorial, I want to share my experience in fine-tuning Stable Diffusion using the HuggingFace training scripts with a few sample datasets. I am still learning the tips and tricks and will report more findings as I go along.

The data and code can be accessed at this repo and this tutorial is based on these references.

Some of my lessons learned are:

The key topics are:

Prepare Custom Datasets

Stable Diffusion can be fine-tuned in different ways, such as:

One thing to know when preparing the dataset is how the images are preprocessed before being fed to the model for fine-tuning (see my sample code and examples):

The following shows the results of the default resizing and cropping (left column) and my custom resizing and cropping (right column) - note that the custom-cropped images focus more on the subject, which is better for fine-tuning:
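For reference, here is a minimal sketch of the resize-and-center-crop idea behind the default preprocessing (512x512 assumed; the repo's sample code may differ, and my custom version crops around the subject rather than the center):

# Resize the shorter side to 512, then center-crop to 512x512 (default-style preprocessing).
from PIL import Image

def resize_and_center_crop(path, size=512):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)  # make the shorter side equal to `size`
    img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

resize_and_center_crop("cat.jpg").save("cat_512.jpg")  # file names are just examples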

In addition, the images should include variations of the subject, such as:

birme.net is a great website for bulk image resizing, renaming, and format conversion.

Run Training and Use the Fine-tuned Models

To try the examples, you can simply clone the repo and set up the environment as follows (note that the Python version matters: the code was tested with Python 3.9.11; 3.10.x and 3.8.x may have issues like [1]):

git clone https://github.com/harrywang/finetune-sd.git
cd finetune-sd
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
accelerate config default

Then, you need to log in to HuggingFace with your token and WandB (optional) with your API key:

NOTE: on Ubuntu, if you run into a "huggingface-cli: command not found" error, you may need to add the location of the huggingface_hub package to $PATH:

pip show huggingface_hub | grep Location
/home/hjwang/.local/lib/python3.8/site-packages

Then, edit ~/.profile (e.g., vim ~/.profile if you use the bash shell) and add the following to put that directory on $PATH:

if [ -d "/home/hjwang/.local/lib/python3.8/site-packages" ] ; then
    PATH="/home/hjwang/.local/lib/python3.8/site-packages:$PATH"
fi

Run source ~/.profile (or restart the terminal) to proceed.

huggingface-cli login
wandb login

Then, you can run the training script on my cat dataset as follows - NOTE: you should change the argument values to fit your needs, such as the learning rate, training steps, checkpoint steps, validation prompt, etc.
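A sketch of what such a run can look like, assuming the repo wraps the standard Diffusers DreamBooth LoRA script (the script name, data folder, and argument values below are illustrative; see the repo for the exact command):

accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./data/cat" \
  --instance_prompt="a photo of sks cat" \
  --validation_prompt="A photo of a sks cat in a bucket" \
  --output_dir="./models/dreambooth-lora/miles" \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=1e-4 \
  --max_train_steps=800 \
  --checkpointing_steps=100 \
  --report_to="wandb"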

If you use W&B, the validation results (4 generated images from the prompt “A photo of a sks cat in a bucket”) are automatically tracked:

In this case, choosing a checkpoint close to step 684 may generate better results.

Once the training is finished, the fine-tuned LoRA weights are stored in the output folder, which is ./models/dreambooth-lora/miles for my cat example above. The folder includes the final weights and intermediate checkpoint weights.

Use generate-lora.py to generate images using the fine-tuned LoRA weights:

python generate-lora.py --prompt "a sks cat standing on the great wall" --model_path "./models/dreambooth-lora/miles" --output_folder "./outputs" --steps 50
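Under the hood, a script like generate-lora.py can load the base model with Diffusers and attach the LoRA attention weights; a minimal sketch (the base model name and output file name are assumptions, not necessarily what the repo uses):

import torch
from diffusers import StableDiffusionPipeline

# load the base Stable Diffusion model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# attach the fine-tuned LoRA attention weights from the output folder
pipe.unet.load_attn_procs("./models/dreambooth-lora/miles")

image = pipe("a sks cat standing on the great wall", num_inference_steps=50).images[0]
image.save("./outputs/sks-cat.png")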

You can find other examples of LoRA fine-tuning with other datasets here, as well as examples of running DreamBooth without LoRA.

I also have not experimented with the different settings for learning rates, prior preservation, schedulers, and text encoder training found here, which seem to be quite effective for face fine-tuning.

Convert Diffusers LoRA Weights for Automatic1111 WebUI

If you download LoRA models from civitai.com, you can follow this tutorial to use them with the Automatic1111 SD WebUI.

However, the LoRA weights trained using Diffusers are saved in .bin or .pkl format, which must be converted first in order to be used in Automatic1111 WebUI (see here for detailed discussions).

As seen below, the trained LoRA weights are stored in custom_checkpoint_0.pkl or pytorch_model.bin:

convert-to-safetensors.py can be used to convert .bin or .pkl files into the .safetensors format, which can be used in the WebUI (just put the converted file in the WebUI's models/Lora folder). The script is adapted from the one written by ignacfetser.

Simply put this script in the same folder as the .bin or .pkl file and run python convert-to-safetensors.py --file checkpoint_file
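The core of such a conversion is just re-serializing the checkpoint tensors with safetensors; a minimal sketch (this assumes a flat dict of tensors and skips any key filtering or renaming the actual script may do):

import argparse
import torch
from safetensors.torch import save_file

parser = argparse.ArgumentParser()
parser.add_argument("--file", required=True, help="path to the .bin or .pkl checkpoint")
args = parser.parse_args()

state_dict = torch.load(args.file, map_location="cpu")
state_dict = {k: v.contiguous() for k, v in state_dict.items()}  # safetensors needs contiguous tensors
save_file(state_dict, args.file.rsplit(".", 1)[0] + ".safetensors")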

PS: if you want to convert LoRA models from civitai.com to the Diffusers format so that you can use them in code, please check out this PR.

Fine-tune using WebUI

You need to install the Automatic1111 WebUI and two extensions, d8ahazard/sd_dreambooth_extension and kohya-ss/sd-webui-additional-networks - check out my installation instructions and then follow the YouTube tutorial to train.

I ran into many issues but finally trained a model successfully on my 11GB 2080 Ti:

I have recorded the issues and solutions below in case you need them:

Merge Models

Another way to get a custom model is via merging existing models, which can be easily done using Automatic1111 WebUI by following the tutorial here.

I use an example to show why and how to merge models. You can find tons of models and model-merging recipes on civitai.com.

The following XY Plot shows the generated images using the prompt “cat” and seed values from 1 to 5 from three models (top to bottom):

By repeating the model merging steps, you can create models with targeted effects; many popular models are merged models, such as DreamShaper mentioned above, Photogen, PastelMix, and many NSFW models.
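For the curious, the weighted-sum merge that the WebUI's Checkpoint Merger tab performs boils down to interpolating the two state dicts; a minimal sketch (file names and the multiplier are illustrative):

import torch

alpha = 0.3  # weight given to model B
a = torch.load("modelA.ckpt", map_location="cpu")["state_dict"]
b = torch.load("modelB.ckpt", map_location="cpu")["state_dict"]

# weighted sum: merged = A * (1 - alpha) + B * alpha, for keys both models share
merged = {k: (1 - alpha) * v + alpha * b[k] for k, v in a.items() if k in b}
torch.save({"state_dict": merged}, "merged.ckpt")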

References

PS. The first image for this post was generated via Midjourney using the prompt “experiment cooking with thousands of different receipts and flasks flying in the universe”.