Executing the Stable Diffusion text-to-image model on an AMD Ryzen 5 5600G integrated GPU (iGPU).
The Stable Diffusion implementations (either the original version or the one using the diffusers package) are currently expected to run on NVIDIA GPUs (using CUDA). In this post, I wanted to see how efficiently it could run on the integrated GPU (iGPU) of a recent AMD Ryzen CPU (AMD Ryzen 5 5600G).
|Implementation||Backend||Precision||Device||Time|
|diffusers (simple)||CPU||single||CPU||7 min.|
|diffusers (simple)||HIP (ROCm)||single||GPU||2 min. 36 s|
|diffusers (simple)||HIP (ROCm)||mixed||GPU||2 min. 28 s|
|diffusers (simple)||HIP (ROCm)||half||GPU||2 min. 30 s|
|stable_diffusion.openvino||CPU plugin||single||CPU||3 min. 46 s|
- AMD Ryzen 5 5600G with Radeon Graphics;
- 16 GiB RAM / 32 GiB of RAM;
- Debian testing;
- Linux kernel 5.18.16-1;
- Mesa 22.2.0-rc3~1.
Stable Diffusion runs on the Ryzen 5 5600G iGPU, provided enough RAM is allocated to the iGPU, using:
sudo apt install rocminfo rocm-smi rocm-device-libs
sudo setfacl -m u:$USER:rw /dev/kfd
python3 -m venv diffusers.venv
cd diffusers.venv
. bin/activate
pip install --upgrade diffusers transformers scipy ftfy
pip uninstall torch
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1
huggingface-cli login
HSA_OVERRIDE_GFX_VERSION=9.0.0 DEVICE=cuda python3 generate.py "a gothic medieval cyberpunk knight, standing in front of a dark bridge, detailed" "test.jpg"
Table of Contents
- Execution on the CPU
- Using the iGPU (ROCm backend)
- Bonus, extra examples
- Bonus, botched output
- Appendix, NSFW filter
- Appendix, parameters
Execution on the CPU
I am first trying to execute on the CPU.
python3 -m venv diffusers.venv
cd diffusers.venv
. bin/activate
pip install --upgrade diffusers transformers scipy ftfy
huggingface-cli login
An account on Hugging Face is needed for the software to download the weights of the model.
My initial code was:
from sys import argv

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

# Command-line arguments: the prompt, then the output file name.
prompt = argv[1]
output_file = argv[2]

model_id = "CompVis/stable-diffusion-v1-4"
device = "cpu"

pipe = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float32, use_auth_token=True)
pipe = pipe.to(device)

# autocast() is what triggers the error shown below on this CPU.
with autocast(device):
    image = pipe(prompt, guidance_scale=7.5)["sample"][0]

image.save(output_file)
python3 generate.py "a gothic medieval cyberpunk knight, standing in front of a dark bridge, detailed" "test1.jpg"
This led to the following error:
RuntimeError: expected scalar type BFloat16 but found Float
Removing half-precision computations
This error is caused by autocast(), which automatically uses half-precision (16-bit) floating-point numbers instead of single-precision (32-bit) ones for some computations, for efficiency reasons. However, half-precision computations are not available on this CPU. We need to disable the use of half-precision floats by removing the autocast() block:
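from os import environ
from sys import argv

import torch
from diffusers import StableDiffusionPipeline

# Command-line arguments: the prompt, then the output file name.
prompt = argv[1]
output_file = argv[2]

model_id = "CompVis/stable-diffusion-v1-4"
# The target device can be overridden with the DEVICE environment variable.
device = environ.get("DEVICE", "cpu")

pipe = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float32, use_auth_token=True)
pipe = pipe.to(device)

# No autocast(): everything is computed in single precision.
image = pipe(prompt, guidance_scale=7.5)["sample"][0]

image.save(output_file)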
time python3 generate.py "a gothic medieval cyberpunk knight, standing in front of a dark bridge, detailed" "test1.jpg"
Rendering a single image (512×512, 50 iterations) takes about 7 min 30 s on the CPU, which is not that bad:
real 7m29,143s
user 35m40,751s
sys 5m54,596s
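As a rough sanity check on these timings, dividing the CPU time (user + sys) by the wall-clock time (real) gives the average number of busy hardware threads during the run. The sketch below is only an illustration: the helper name parse_time is hypothetical, and it assumes the comma decimal separator shown in the output above.

```python
import re


def parse_time(value: str) -> float:
    """Convert a `time` duration such as '7m29,143s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", value)
    if match is None:
        raise ValueError(f"unexpected duration: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


real = parse_time("7m29,143s")
user = parse_time("35m40,751s")
sys_time = parse_time("5m54,596s")

# CPU time divided by wall-clock time: average number of busy threads.
busy_threads = (user + sys_time) / real
print(f"average busy threads: {busy_threads:.1f}")  # average busy threads: 5.6
```

With 6 cores and 12 hardware threads on the Ryzen 5 5600G, this suggests the CPU run kept roughly half of the hardware threads busy on average.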
Interestingly, two of the four attempts with this query were censored by the Not Safe For Work (NSFW) filter included in Stable Diffusion (see appendix).
Using the iGPU (ROCm backend)
In this section, I am trying to use the ROCm / HIP backend of PyTorch in order to exploit the AMD iGPU. However, this GPU is not officially supported by ROCm.
We need to install some ROCm runtime and the ROCm-enabled version of PyTorch:
sudo apt install rocminfo rocm-smi rocm-device-libs
pip uninstall torch
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1
The current user needs to be in the render group in order to use the /dev/kfd device. As a quick workaround, we can just give this user access to the device:
sudo setfacl -m u:$USER:rw /dev/kfd
sudo chown $USER /dev/kfd
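To check that the permission change took effect, a small script can test whether the current user may read and write the device node. This is a minimal sketch using only the Python standard library; the function name can_use_kfd is hypothetical.

```python
import os


def can_use_kfd(path: str = "/dev/kfd") -> bool:
    """Return True if the current user can read and write the KFD device node."""
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


if can_use_kfd():
    print("/dev/kfd is accessible")
else:
    print("/dev/kfd is missing or not accessible for this user")
```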
We can now get some information about our iGPU seen from ROCm:
Using the hip backend
The commands in this section are not correct. As explained below, you should use DEVICE=cuda when targeting an AMD GPU using ROCm/HIP.
We can try to use the ROCm/HIP backend in order to exploit the AMD iGPU:
DEVICE=hip python3 generate.py "a gothic medieval cyberpunk knight, standing in front of a dark bridge, detailed" "test.jpg"
We get the following error:
Unable to find code object for all current devices!
We can try to override the detected GPU version:
HSA_OVERRIDE_GFX_VERSION=9.0.0 DEVICE=hip python3 generate.py "a gothic medieval cyberpunk knight, standing in front of a dark bridge, detailed" "test.jpg"
However, I then get another error:
NotImplementedError: Could not run 'aten::empty_strided' with arguments from the 'HIP' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions.
Using the (fake) CUDA backend
The previous attempt failed because you are not supposed to specify the HIP backend (DEVICE=hip). HIP masquerades as CUDA: in order to use ROCm/HIP, you are expected to specify the CUDA backend (DEVICE=cuda):
HSA_OVERRIDE_GFX_VERSION=9.0.0 DEVICE=cuda python3 generate.py "a gothic medieval cyberpunk knight, standing in front of a dark bridge, detailed" "test.jpg"
When doing this, the image generation fails with an out-of-memory error. Moreover, I get these kernel errors (and a GPU reset):
amdgpu: qcm fence wait loop timeout expired
amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
amdgpu: Failed to evict process queues
amdgpu: Failed to quiesce KFD
amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
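Before launching a full generation, it can help to confirm that the ROCm build of PyTorch exposes the iGPU through the CUDA device type. The following is a hedged sketch: the function name describe_backend is hypothetical, and it only relies on torch.cuda.is_available(), torch.cuda.get_device_name() and torch.version.hip. It may still need HSA_OVERRIDE_GFX_VERSION=9.0.0 in the environment to see this particular GPU.

```python
def describe_backend() -> str:
    """Summarize which GPU backend the installed PyTorch build exposes."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    # torch.version.hip is None on builds without ROCm/HIP support.
    hip_version = getattr(torch.version, "hip", None)
    if not torch.cuda.is_available():
        return f"no usable GPU (HIP version: {hip_version})"
    # On a ROCm build, the AMD GPU is reached through the "cuda" device type.
    name = torch.cuda.get_device_name(0)
    return f"device 'cuda' is {name} (HIP version: {hip_version})"


print(describe_backend())
```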
Allocating more VRAM to the iGPU
By adding 16 GiB of extra RAM (32 GiB in total) and dedicating 16 GiB of RAM to the GPU VRAM (in the BIOS), I managed to run the computation on the GPU. Execution took 2 min 36 s. This is significantly faster than the same code running on the CPU.
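To put a number on "significantly faster", the two timings reported in this post can be compared directly. This is plain arithmetic on the figures above (7 min 29.143 s on the CPU, 2 min 36 s on the iGPU), not a new measurement.

```python
cpu_seconds = 7 * 60 + 29.143  # "real" time of the CPU run
gpu_seconds = 2 * 60 + 36      # ROCm run with 16 GiB of VRAM

speedup = cpu_seconds / gpu_seconds
print(f"speedup: {speedup:.2f}x")  # speedup: 2.88x
```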
Warning: 'qcm fence wait loop timeout expired' error and GPU reset
The program still generates the “qcm fence wait loop timeout expired” error and a GPU reset at the end of the program, after the image has been generated successfully (after/while releasing the GPU memory).
The GPU reset can be quite annoying if you are running a desktop session because it freezes it. In many cases, you can recover from it by restarting the desktop session (in my case sudo systemctl restart lightdm from a Linux console would do the trick) but in some cases, the GPU (or the Linux driver) ends up in a weird state which can only be fixed by rebooting.
In many cases, the GPU reset does not happen anymore after a first one (but not always).
Note: update concerning the 'qcm fence wait loop timeout expired' error and GPU reset (2022-12-02)
The problem appears to have been fixed in some update (using Linux 6.0.10-1 and Mesa 22.2.4-1).
Mixed precision computations
Using mixed precision, I had a similar computation time (2 min 28 s).
with autocast("cuda"):
    image = pipe(text, width=width, height=height, guidance_scale=guidance, num_inference_steps=steps)["sample"]
Half precision model
Using half precision, I had a similar computation time as well (2 min 30 s) and a lower VRAM usage (38 % of 16 GiB).
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, revision="fp16", torch_dtype=torch.float16, use_auth_token=True)
### ...
with autocast("cuda"):
    image = pipe(text, width=width, height=height, guidance_scale=guidance, num_inference_steps=steps)["sample"]
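The lower memory usage is what one would expect from storing each value on 2 bytes instead of 4. The sketch below only illustrates the arithmetic, using the standard struct module for the sizes and the 38 % figure reported above; it does not measure anything on the GPU.

```python
import struct

bytes_fp32 = struct.calcsize("f")  # single precision: 4 bytes per value
bytes_fp16 = struct.calcsize("e")  # half precision: 2 bytes per value

vram_total_gib = 16
vram_used_gib = 0.38 * vram_total_gib  # 38 % usage reported above

print(f"fp16 values take {bytes_fp16 / bytes_fp32:.0%} of the fp32 size")
print(f"VRAM in use: {vram_used_gib:.2f} GiB")
```

At about 6 GiB, the half-precision run stays under 8 GiB of VRAM.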
Bonus, extra examples
Bonus, botched output
Here is what I got on the first try, when I was passing the output filename (testN.jpg) instead of the actual prompt as input to the neural network, which gave completely random output. 🤪
Appendix, NSFW filter
The NSFW filter runs after the main processing: an additional neural network is run on the output image to detect NSFW content. If such an image is detected, the model replaces it with a completely black image, as the following excerpt of the safety checker code shows:
pooled_output = self.vision_model(clip_input)[1]  # pooled_output
image_embeds = self.visual_projection(pooled_output)
special_cos_dist = cosine_distance(image_embeds, self.special_care_embeds).cpu().numpy()
cos_dist = cosine_distance(image_embeds, self.concept_embeds).cpu().numpy()
...
has_nsfw_concepts = [len(res["bad_concepts"]) > 0 for res in result]
for idx, has_nsfw_concept in enumerate(has_nsfw_concepts):
    if has_nsfw_concept:
        images[idx] = np.zeros(images[idx].shape)  # black image
...
return images, has_nsfw_concepts
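The elided part of the filter code is not reproduced in this post. As an illustration only, assuming a concept is flagged when the cosine similarity between the image embedding and the concept embedding exceeds a per-concept threshold, the decision could be sketched in pure Python as follows. The function names, the 2-D embeddings and the thresholds are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flagged_concepts(image_embed, concept_embeds, thresholds):
    """Return the indices of concepts whose similarity exceeds their threshold."""
    return [
        idx
        for idx, (concept, threshold) in enumerate(zip(concept_embeds, thresholds))
        if cosine_similarity(image_embed, concept) > threshold
    ]


# Hypothetical 2-D embeddings and thresholds, for illustration only.
image_embed = [1.0, 0.0]
concept_embeds = [[1.0, 0.1], [0.0, 1.0]]
thresholds = [0.9, 0.9]

bad_concepts = flagged_concepts(image_embed, concept_embeds, thresholds)
has_nsfw_concept = len(bad_concepts) > 0
print(bad_concepts, has_nsfw_concept)  # [0] True
```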
We can disable the NSFW filter by replacing pipe.safety_checker with a no-op.
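A no-op replacement could look like the sketch below. It mirrors the return value of the filter (the images and a list of per-image flags) but never flags anything. The function name is hypothetical, and the assumption that the pipeline calls the checker with images and clip_input keyword arguments may not hold for every version of diffusers.

```python
def no_op_safety_checker(images, clip_input=None, **kwargs):
    """Return the images unchanged and report that none of them is NSFW."""
    return images, [False] * len(images)


# Hypothetical usage, assuming `pipe` is the StableDiffusionPipeline created earlier:
# pipe.safety_checker = no_op_safety_checker

# Stand-alone demonstration with placeholder "images":
images, has_nsfw_concepts = no_op_safety_checker(["img0", "img1"])
print(has_nsfw_concepts)  # [False, False]
```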
Appendix, parameters
We can pass the following parameters when calling the pipeline:
- prompt: Union[str, List[str]]
- height: Optional[int] = 512
- width: Optional[int] = 512
- num_inference_steps: Optional[int] = 50
- guidance_scale: Optional[float] = 7.5, discussed in "Stable Diffusion with 🧨 Diffusers"
- eta: Optional[float] = 0.0
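One way to keep these parameters together is a small dataclass whose fields carry the defaults listed above and which can be expanded into keyword arguments. This is only a sketch: the class name GenerationParams is hypothetical, and the commented-out call assumes pipe is the pipeline created earlier.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerationParams:
    """Keyword arguments for the pipeline call, with the defaults listed above."""
    prompt: str
    height: int = 512
    width: int = 512
    num_inference_steps: int = 50
    guidance_scale: float = 7.5
    eta: float = 0.0


# Override only what differs from the defaults.
params = GenerationParams(
    prompt="a gothic medieval cyberpunk knight", num_inference_steps=30)
kwargs = asdict(params)
print(kwargs)

# Hypothetical usage, assuming `pipe` is the pipeline created earlier:
# result = pipe(**kwargs)
```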
- High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion Paper)
- Stable Diffusion Public Release
- Stable Diffusion v1-4 Model Card
- Stable Diffusion GitHub repository
- Stable Diffusion prompting cheatsheet
- Stable Diffusion on AMD RDNA™ 3 Architecture
- Running Stable Diffusion on Windows with an AMD GPU
- Implementation of Text-To-Image generation using Stable Diffusion on Intel CPU
- Run Stable Diffusion on your M1 Mac’s GPU
- Running Stable Diffusion on your GPU with less than 10Gb of VRAM on Windows
Stable Diffusion artwork
Articles about diffusers
- ROCm Hardware and Software Support Reference Guide
- PyTorch for AMD ROCm™ Platform now available as Python package
- Issue: Getting “hipErrorNoBinaryForGPU: Unable to find code object for all current devices!” when trying to run PyTorch
- GitHub Issue: ROCm 5.xx ever planning to include gfx90c GPUs?
The two CPU executions worked fine with 16 GiB of RAM. In order to run with ROCm, I had to add 16 GiB of RAM and allocate 16 GiB for the GPU VRAM. The original 16 GiB of RAM would probably have been enough by allocating 8 GiB of GPU VRAM. ↩︎
FP16 is supported on some CPUs. For the Intel architecture, this is the AVX-512_FP16 extension. ↩︎
radeontop reports 48 % of GPU usage. Allocating 8 GiB of VRAM from the 16 GiB of RAM might have been enough. ↩︎