Fine-Tuning Stable Diffusion


Large pre-trained diffusion models have become by far the preferred choice for image generation, as they can render high-fidelity images that align closely with textual prompts. Stable Diffusion is one such model, notable also for being open source with freely usable weights. However, given a specific object, such as an unpublished piece of artwork, a diffusion model is unlikely to generate a matching image, since no textual prompt could convey the necessary level of detail. Two other students and I investigated several methods for fine-tuning diffusion image generators to produce such specific images by introducing new concepts into the models' knowledge.

Specifically, we evaluated the efficacy of three fine-tuning methods (a minimal sketch of the LoRA idea follows the list):

- Textual inversion: keeps the model weights frozen and learns a new token embedding in the text encoder's vocabulary to represent the target concept.
- DreamBooth: fine-tunes the diffusion model itself on images of the target, binding the concept to a rare identifier token so it can be invoked in prompts.
- Low-rank adaptation (LoRA): freezes the pretrained weights and trains small low-rank update matrices injected alongside them, drastically reducing the number of trainable parameters.
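To make the LoRA idea concrete, here is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer. This is an illustration of the technique, not the training code from our project; the class name and the rank and scale hyperparameters are my own choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A of shape (r, in) and B of shape (out, r)."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.scale = alpha / r
        # A starts as small noise and B as zeros, so the wrapped layer
        # initially behaves exactly like the pretrained one.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

In Stable Diffusion, adapters like this are typically attached to the attention projection matrices of the U-Net, so only the small A and B matrices need to be trained and stored per concept.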

We evaluated the results of these fine-tuning methods using CLIP cosine similarity and Fréchet Inception Distance (FID), and found that while DreamBooth was the most accurate at simply reproducing the target images, LoRA was best at enabling the model to place the target object in more complex settings.
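Both metrics can be computed with off-the-shelf libraries. The sketch below uses Hugging Face transformers for CLIP similarity and torchmetrics for FID; the library choice and checkpoint name are illustrative assumptions rather than our exact evaluation code, and `real_images` / `generated_images` are placeholder variables.

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from torchmetrics.image.fid import FrechetInceptionDistance

# CLIP cosine similarity between a text prompt and a generated image.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()

# FID between reference photos and generated samples.
# Both batches are placeholder uint8 tensors of shape (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)        # the target object's photos
fid.update(generated_images, real=False)  # fine-tuned model outputs
print(fid.compute())
```

Higher CLIP similarity indicates better prompt alignment, while lower FID indicates that the generated images are statistically closer to the reference photos.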
