Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models

¹CMU, ²Adobe Research, ³UIUC

Generated images and object attention maps across successive finetuning steps with our proposed SepEn compositional finetuning.


Visual comparisons between Stable Diffusion and our method. Left: compositional finetuning on individual concepts; after finetuning, the model generates high-quality images that align more closely with the text input. Right: joint compositional finetuning on a large collection of concepts; after finetuning, the model retains strong compositional capability on unseen novel concepts.

Abstract

Despite the significant strides recently achieved by diffusion-based Text-to-Image (T2I) models, current systems still struggle to produce compositional generations that align well with text prompts, particularly for multi-object generation.

This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively.
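To make the two objectives concrete, here is a minimal PyTorch sketch. The function names, the overlap measure (probability mass shared by two attention maps), and the peak-activation term are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def separate_loss(attn_a: torch.Tensor, attn_b: torch.Tensor) -> torch.Tensor:
    # Penalize the probability mass shared by two objects' cross-attention
    # maps (shape (H, W)); minimizing it drives the masks apart.
    a = attn_a / (attn_a.sum() + 1e-8)
    b = attn_b / (attn_b.sum() + 1e-8)
    return torch.minimum(a, b).sum()

def enhance_loss(attn_maps: list[torch.Tensor]) -> torch.Tensor:
    # Encourage a strong peak activation for every target object:
    # minimizing (1 - max) maximizes each object's strongest response.
    return torch.stack([1.0 - a.max() for a in attn_maps]).mean()
```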

Our method diverges from conventional test-time-adaptation techniques by finetuning only critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacity and broader applicability.
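Since the method finetunes parameters rather than adapting at test time, one plausible setup with the diffusers library is sketched below, under the assumption that the "critical parameters" are the UNet's cross-attention key/value projections; the actual subset chosen by the paper may differ.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

trainable = []
for name, param in pipe.unet.named_parameters():
    # In diffusers' UNet, "attn2" is the text-to-image cross-attention;
    # to_k / to_v project text embeddings into attention keys / values.
    if "attn2.to_k" in name or "attn2.to_v" in name:
        param.requires_grad_(True)
        trainable.append(param)
    else:
        param.requires_grad_(False)

optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```

Freezing everything else keeps the finetune lightweight, which is what makes joint finetuning over a large collection of concepts practical.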

Single-concept comparisons


Large-scale experimental results


Motivations

(Top) Failure cases of Stable Diffusion and (Bottom) their underlying causes. Even state-of-the-art T2I models struggle to render multiple objects with varying attributes. Two primary factors, demonstrated by the bottom two examples respectively, are: (1) low attention activation scores for certain objects and (2) overlapping attention masks.
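Both failure modes can be quantified directly from the cross-attention maps; the diagnostic below (peak activation and thresholded mask IoU) is an illustrative sketch, not the paper's measurement protocol.

```python
import torch

def diagnose(attn_a: torch.Tensor, attn_b: torch.Tensor, thresh: float = 0.5):
    # attn_a, attn_b: per-object cross-attention maps (H, W), scaled to [0, 1].
    # Failure (1): a low peak activation means the object tends to vanish.
    peaks = (attn_a.max().item(), attn_b.max().item())
    # Failure (2): a high mask IoU means the two objects blend together.
    mask_a, mask_b = attn_a > thresh, attn_b > thresh
    inter = (mask_a & mask_b).float().sum()
    union = (mask_a | mask_b).float().sum().clamp(min=1.0)
    return {"peaks": peaks, "mask_iou": (inter / union).item()}
```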

Our Solutions

The Separate loss (left) splits the attention masks of multiple objects, while the Enhance loss (right) amplifies the attention activation scores of each target object.
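Continuing the earlier sketch, the two terms would be combined into one objective per update; the equal loss weighting and the attention-map variables (e.g. attn_cat, attn_dog for the two object tokens) are assumed for illustration.

```python
# One finetuning step for a two-object prompt such as "a cat and a dog".
total_loss = separate_loss(attn_cat, attn_dog) + enhance_loss([attn_cat, attn_dog])
total_loss.backward()
optimizer.step()
optimizer.zero_grad()
```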

Extension to more than two concepts

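A natural extension of the Separate loss to more than two concepts is to average the pairwise overlap penalty over all concept pairs; the sketch below encodes that reading as an assumption and is not necessarily the paper's exact formulation.

```python
import torch
from itertools import combinations

def separate_loss_multi(attn_maps: list[torch.Tensor]) -> torch.Tensor:
    # Average the pairwise overlap penalty over all pairs of concepts.
    def overlap(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        a = a / (a.sum() + 1e-8)
        b = b / (b.sum() + 1e-8)
        return torch.minimum(a, b).sum()
    pairs = list(combinations(attn_maps, 2))
    return torch.stack([overlap(a, b) for a, b in pairs]).mean()
```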