August 23, 2022

Deploy Stable Diffusion for AI Image Generation

Setup Stable Diffusion WebUI

I ran into many issues trying to set up the Stable Diffusion WebUI on my MacBook Pro M1 before finally getting it to work.

The most important lesson learned: the Python version matters!

I figured this out from the inline comment here after running into many issues with Python 3.8.0 and 3.9.7. It would be helpful if the author could highlight this in the README file.

I use pyenv to manage my Python versions, and I used the following commands to install Python 3.10.6 first:

pyenv versions
pyenv install 3.10.6
pyenv global 3.10.6
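
You can quickly verify that the correct interpreter is active before running the WebUI:

python --version
Python 3.10.6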

If the Python version is correct, the setup should be as simple as the few steps above plus the usual clone-and-run steps, roughly sketched below.
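
A minimal sketch of those steps, assuming the AUTOMATIC1111 stable-diffusion-webui repository (webui.sh creates its own virtual environment on first run):

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui
./webui.sh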

“Stable Diffusion 2.0 and 2.1 require both a model and a configuration file, and image width & height will need to be set to 768 or higher when generating images”

To use Stable Diffusion v2.0, follow the instructions to download the checkpoint and YAML config files.

My /stable-diffusion-webui/models/Stable-diffusion/ folder looks like the following:
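
(Roughly, it should contain the v2.0 checkpoint plus its config file renamed to the same base name; the file names below are illustrative.)

768-v-ema.ckpt
768-v-ema.yaml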

Note: for v2.0, you may need to run ./webui.sh --no-half or restart the WebUI to make it work.

For the SD v2.0 768-v-ema.ckpt checkpoint, you have to use a resolution of at least 768x768, e.g., 768x1024; otherwise, you get garbage images like the ones shown below:

I want to record the issues I ran into below in case I need them later.

M1 Deployment

I just followed the instructions here.

Tested on my 2020 MacBook Pro M1 with 16 GB RAM and Torch 1.13.0.

Run the following to generate the Core ML models in the coreml-sd folder:

git clone https://github.com/apple/ml-stable-diffusion.git
cd ml-stable-diffusion
conda create -n coreml_stable_diffusion python=3.8 -y
conda activate coreml_stable_diffusion
pip install -e .
huggingface-cli login
python -m python_coreml_stable_diffusion.torch2coreml --convert-unet --convert-text-encoder --convert-vae-decoder --convert-safety-checker -o coreml-sd

Generate an image with Python, writing the output to the image-outputs folder:

python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i coreml-sd -o image-outputs --compute-unit ALL --seed 93

The method above loads the model every time, which is quite slow (2-3 minutes). To speed up model loading, use Swift instead by first bundling the resources for the Swift CLI:

python -m python_coreml_stable_diffusion.torch2coreml --convert-unet --convert-text-encoder --convert-vae-decoder --convert-safety-checker --bundle-resources-for-swift-cli -o coreml-sd 

Then, generate an image with Swift, writing the output to the image-outputs folder:

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path coreml-sd/Resources/ --seed 93 --output-path image-outputs

Ubuntu Deployment

In the past few months, I have tried almost all the popular text-to-image AI generation models/products, such as DALL-E 2, Midjourney, Disco Diffusion, Stable Diffusion, etc. The Stable Diffusion checkpoint was released just a few days ago. I deployed it on my old GPU server and am recording my notes here for people who may also want to try it. Machine creativity is quite an interesting research area for IS scholars, and I have jotted down some potential research topics at the end of this post as well.

I first spent a few hours trying to set up Stable Diffusion on my Mac M1 and failed: I could not install the packages properly (version not found, dependency issues, etc.). I found some successful attempts here but have not had time to try them yet.

I ended up setting up Stable Diffusion on my old GPU server running Ubuntu and here are my notes.

lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic
nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
| 30%   27C    P8    20W / 250W |      1MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:68:00.0 Off |                  N/A |
| 30%   26C    P8    19W / 250W |     73MiB / 11016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

I also installed the links text-mode browser, which is handy for downloading files from the terminal on a headless server:

sudo apt install links
links www.google.com

Rename the checkpoint file to model.ckpt and put it in the following folder (create it if it does not exist):

mkdir -p models/ldm/stable-diffusion-v1/
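
For example, assuming the downloaded v1.4 checkpoint is named sd-v1-4.ckpt (the file name is illustrative):

mv sd-v1-4.ckpt models/ldm/stable-diffusion-v1/model.ckpt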

A side note on the estimated training cost, based on the reported GPU usage and the AWS pricing I found:

Price of a p4d.24xlarge instance with 8 A100 GPUs (40 GB VRAM each):

The training would cost between 225,000 USD and 600,000 USD.
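
A rough sketch of that math, assuming the ~150,000 A100 GPU-hours reported for training and AWS rates of roughly 32.77 USD/hour on-demand and about 12 USD/hour effective on a 3-year reserved plan (both rates are my assumptions):

150,000 GPU-hours / 8 GPUs per p4d.24xlarge = 18,750 instance-hours
18,750 hours × ~12 USD/hour (3-year reserved) ≈ 225,000 USD
18,750 hours × ~32.77 USD/hour (on-demand) ≈ 614,000 USD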

The key dependencies from environment.yaml (trimmed):

name: ldm
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.8.5
  - pip=20.3
  - cudatoolkit=10.2
  - pytorch=1.11.0
...

Create and activate the conda environment:

conda env create -f environment.yaml
conda activate ldm
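
Optionally, a quick sanity check that PyTorch inside the environment can see the GPUs:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

This should print True 2 for the two GPUs shown by nvidia-smi above.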

Now Stable Diffusion is ready to go; let's see what the AI will create from the following text prompt:

A car in 2050 designed by Antoni Gaudi

python optimizedSD/optimized_txt2img.py --prompt "a car in 2050 designed by Antoni Gaudi" --H 512 --W 512 --seed 27 --n_iter 2 --n_samples 10 --ddim_steps 50

This whole area is relatively new, and there are many potentially interesting research topics, e.g.,

Anyway, out of the 20 generated images from the prompt above (2 iterations × 10 samples), the following are my top 3:

PS. The first image for this post is generated via Midjourney using the prompt “A car in 2050 designed by Antoni Gaudi”.