In the four months since I installed Automatic1111 to explore Stable Diffusion 1.5, the world has moved on and produced better UIs like Forge and better models like FLUX. So I upgraded my armory.
Milestone models of AI image generation
I found the video below particularly useful for understanding the history of AI image generation. It highlights the milestone models (SD 1.5, SDXL, Pony, FLUX), how they relate to each other and how they differ. It offers a very concise overview of the models available out there. I finally learnt which models are popular and why. Kudos to the author.
To summarize, the most popular models as of now are SD 1.5, SDXL and FLUX.
Base Model | Release Date | Recommended Image Size | Ecosystem | Hardware Requirement |
SD 1.5 | Oct. 2022 | 512 * 512 | Large | Low |
SDXL | Jul. 2023 | 1024 * 1024 | Large | Medium |
FLUX | Aug. 2024 | < 2 megapixels | Small | High |
The FLUX family consists of multiple models. FLUX-schnell is lighter and faster, while FLUX-dev produces better images. They are also released under different licenses (Apache 2.0 for schnell, a non-commercial license for dev).
SD 1.5 and SDXL are still widely used and liked because both have large ecosystems, where the user communities have experimented with and contributed all sorts of fine-tuned models, LoRAs, etc. They also run faster than FLUX and are less demanding on GPU and memory.
Installing Forge UI
In short, Forge is based on Automatic1111 but runs faster, and it supports FLUX in addition to Stable Diffusion. The Forge GitHub page offers a zip file as a ready-to-run package for users not familiar with git or Python. I already had the Python environment set up from before (see my old post about Automatic1111), so I just cloned the Forge repository and ran the shell script.
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git
cd stable-diffusion-webui-forge
./webui.sh
The shell script installs all dependencies before launching the web UI in the browser.
Adding models
All models are available on Hugging Face. Download each model file and save it under the stable-diffusion-webui-forge/models/ folder (checkpoints normally go in its Stable-diffusion subfolder).
- SD 1.5, v1-5-pruned.safetensors
- SDXL, sd_xl_base_1.0.safetensors
- FLUX-schnell, flux1-schnell.safetensors (see the Corrections section below)
I ran into a “You do not have clip state dict” error while testing FLUX-schnell. I found the solution in this GitHub thread, in a reply by hosseinrashidi75, quoted below. Essentially, I needed to download the missing clip and t5 files and put them in the correct paths. (The t5 file named below is the wrong one. See the Corrections section below.)
Download the ae.safetensors file and place it in the webui_forge\webui\models\VAE folder. Next, download the following two files:
t5xxl_fp8_e4m3fn_scaled.safetensors
clip_l.safetensors
and place them in the \webui_forge\webui\models\text_encoder folder
When using Flux in the Forge UI, in the VAE / Text Encoder drop-down tab, select all three items and then click the Generate button.
Download links for t5xxl_fp8_e4m3fn_scaled.safetensors and clip_l.safetensors:
https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main
Download link ae.safetensors:
https://huggingface.co/black-forest-labs/FLUX.1-schnell/tree/main
Download VAE for SDXL:
https://huggingface.co/stabilityai/sdxl-vae/tree/main
Location: webui_forge\webui\models\VAE
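The file layout described in the quoted instructions can be sanity-checked with a short script. Below is a minimal sketch in Python, assuming the folder names from the quote (VAE and text_encoder under the Forge models directory) and Hugging Face's usual resolve/main download URL pattern. The flux_files and missing_flux_files helpers are names made up for this illustration; they are not part of Forge. The T5 file name is a parameter because, per the Corrections section below, the non-scaled file is the one that ended up working for me.

```python
from pathlib import Path

HF_BASE = "https://huggingface.co"


def flux_files(t5_name="t5xxl_fp8_e4m3fn.safetensors"):
    """Map each required file to (subfolder under models/, Hugging Face repo).

    Hypothetical helper: the folder names and repos are the ones quoted in
    the post, not something read from Forge itself.
    """
    return {
        "ae.safetensors": ("VAE", "black-forest-labs/FLUX.1-schnell"),
        "clip_l.safetensors": ("text_encoder", "comfyanonymous/flux_text_encoders"),
        t5_name: ("text_encoder", "comfyanonymous/flux_text_encoders"),
    }


def missing_flux_files(models_dir, t5_name="t5xxl_fp8_e4m3fn.safetensors"):
    """Return (expected_path, download_url) for every file not yet in place."""
    missing = []
    for filename, (subfolder, repo) in flux_files(t5_name).items():
        target = Path(models_dir) / subfolder / filename
        if not target.is_file():
            # Assumed URL pattern for direct downloads from a Hugging Face repo.
            url = f"{HF_BASE}/{repo}/resolve/main/{filename}"
            missing.append((target, url))
    return missing


if __name__ == "__main__":
    # Adjust the path to wherever Forge is installed on your machine.
    for target, url in missing_flux_files("stable-diffusion-webui-forge/models"):
        print(f"missing: {target}")
        print(f"  download from: {url}")
```

If the script prints nothing, all three files are where the quoted instructions expect them; it does not check that the files are complete or that they are the right variants.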
Below is a test image generated with FLUX-schnell from the prompt “a dog running in the park”. The dog has five legs, the background is very funny, and the text in the image makes no sense. I need to study prompting more.
Corrections about FLUX models
I did some more testing with FLUX and generated more absurd images besides the one above. For example, I tried “a woman in a black dress” and got a labyrinth. While looking for solutions, I found tutorials that download different versions of FLUX, which confused me, since Black Forest Labs only released one FLUX Schnell and one FLUX Dev (see their Hugging Face page https://huggingface.co/black-forest-labs).
So I did a thorough search on these issues.
Question 1. What’s wrong with the text2img above?
The problem was the t5 file. I used t5xxl_fp8_e4m3fn_scaled.safetensors. However, the correct one to use should be t5xxl_fp8_e4m3fn.safetensors (https://huggingface.co/comfyanonymous/flux_text_encoders/tree/main).
My understanding is that this file holds the T5 text encoder, which turns the text prompt into the embedding the model works from; it tells the model how to interpret the prompt. The scaled version stores its weights in a rescaled form (that’s what scaling means), and if the loader does not account for that, the prompt is misunderstood.
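The effect of a loader ignoring a scale factor can be illustrated with a toy example. This is only a sketch: it uses integer rounding as a stand-in for fp8, the helper names are made up, and it is my assumption (not something I verified) that this is what happened with the scaled t5 file in Forge.

```python
def quantize_scaled(weights, levels=127):
    """Store weights as small integers plus one shared scale factor.

    Toy stand-in for a "scaled" low-precision format, not the real fp8 layout.
    """
    # Fall back to 1.0 so an all-zero input does not divide by zero.
    scale = (max(abs(w) for w in weights) / levels) or 1.0
    stored = [round(w / scale) for w in weights]
    return stored, scale


def dequantize(stored, scale=1.0):
    """Recover approximate weights; leaving scale at 1.0 mimics ignoring it."""
    return [s * scale for s in stored]


weights = [0.02, -0.5, 0.13]
stored, scale = quantize_scaled(weights)

correct = dequantize(stored, scale)  # loader applies the scale
wrong = dequantize(stored)           # loader ignores the scale

print("original:", weights)
print("correct: ", correct)
print("wrong:   ", wrong)
```

With the scale applied, the recovered values stay close to the originals; without it, they are off by orders of magnitude, which is the kind of error that would make a text encoder produce nonsense.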
Question 2. There are many FLUX versions, like Q4, fp8, etc. What do they mean, what’s the difference and which is the correct one to use?
These models can be found in the pages below:
- FLUX.1 Dev:
- FLUX.1 Schnell:
All these models are simplified versions of the original models. They are simplified in the sense that they store the weights in data types of lower precision than those of the original models. For example, the original FLUX Schnell and Dev both use 16-bit floating point numbers, while their fp8 counterparts use 8-bit floating point numbers. Going from fp16 to fp8 halves the VRAM needed for the weights and can potentially speed up generation. Similarly, Q4, Q6 and Q8 mean the weights are quantized to roughly 4, 6 or 8 bits; the larger the number, the closer the model is to the original and the more VRAM it needs.
The simplified models may create images of lower quality compared with the original FLUX models. However, they use less computing resources and are good choices for systems with small VRAM or old GPUs. Which one to choose depends on the hardware of the system.
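The arithmetic behind this trade-off is easy to sketch. The snippet below is a rough estimate in Python, assuming the 12 billion parameters that Black Forest Labs states for the FLUX.1 transformer and nominal bit widths for each variant. Real quantized files carry some overhead, and the text encoders, the VAE and the activations need memory too, so the headroom_gib default of 4 is an arbitrary guess of mine, and pick_variant is a made-up helper, not a Forge feature.

```python
FLUX_PARAMS = 12e9  # parameter count stated for the FLUX.1 transformer

# Nominal bits per weight (assumption: ignores per-block quantization overhead).
BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "Q8": 8, "Q6": 6, "Q4": 4}


def weight_gib(bits, num_params=FLUX_PARAMS):
    """Approximate size of the weights alone, in GiB."""
    return num_params * bits / 8 / 1024**3


def pick_variant(vram_gib, headroom_gib=4.0):
    """Return the highest-precision variant whose weights fit, or None.

    headroom_gib is a guessed allowance for everything besides the weights.
    """
    budget = vram_gib - headroom_gib
    by_precision = sorted(BITS_PER_WEIGHT.items(), key=lambda kv: -kv[1])
    for name, bits in by_precision:
        if weight_gib(bits) <= budget:
            return name
    return None


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: about {weight_gib(bits):.1f} GiB of weights")
    for vram in (8, 12, 24):
        print(f"{vram} GiB VRAM -> {pick_variant(vram)}")
```

A result of None only means the weights do not fit in VRAM by this crude rule; UIs that offload part of the model to system RAM can still run it, just more slowly.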
After fixing the issue with the t5 file and switching to the models better suited to my hardware (fp8 models for 24G 3060), below are the images created by Schnell and Dev for the text prompt “a dog running in the park”. They are much better than the funky one above.