The Gory Details of Finetuning SDXL and Wasting $16k

Details on how the big diffusion model finetunes are trained is scarce, so just like with version 1, and version 2 of my model bigASP, I'm sharing all the details here to help the community. However, unlike those versions, this version is an experimental side project. And a tumultuous one at that. I’ve kept this article long, even if that may make it somewhat boring, so that I can dump as much of the hard earned knowledge for others to sift through. I hope it helps someone out there.

To start, the rough outline: Both v1 and v2 were large scale SDXL finetunes. They used millions of images, and were trained for 30m and 40m samples respectively. A little less than a week’s worth of 8xH100s. I shared both models publicly, for free, and did my best to document the process of training them and share their training code.

Two months ago I was finishing up the latest release of my other project, JoyCaption, which meant it was time to begin preparing for the next version of bigASP. I was very excited to get back to the old girl, but there was a mountain of work ahead for v3. It was going to be my first time breaking into the more modern architectures like Flux. Unable to contain my excitement for training I figured why not have something easy training in the background? Slap something together using the old, well trodden v2 code and give SDXL one last hurrah.

TL;DR

If you just want the quick recap of everything, here it is. Otherwise, continue on to “A Farewell to SDXL.”

I took SDXL and slapped on the Flow Matching objective from Flux.
The dataset was more than doubled to 13M images
Frozen text encoders
Trained nearly 4x longer (150m samples) than the last version, in the ballpark of PonyXL training
Trained for about 6 days on a rented four node cluster (Nebius) for a total of 32 H100 SXM5 GPUs; 300 samples/s training speed
4096 batch size, 1e-4 lr, 0.1 weight decay, fp32 params, bf16 amp
Training code and config: Github
Training run: Wandb
Model: HuggingFace
Total cost including wasted compute on mistakes: $16k

A Farewell to SDXL

The goal for this experiment was to keep things simple but try a few tweaks, so that I could stand up the run quickly and let it spin, hands off. The tweaks were targeted to help me test and learn things for v3: