Image Editing with Stable Diffusion
Many have become familiar with the concept of image generation through technologies like Stable Diffusion. However, a fascinating question arises: what happens when you wish to edit these generated images, or even real photos, using text prompts or segmentation maps? This exploration delves into the innovative world of image editing, harnessing the capabilities of diffusion models.
Here I provide an overview of some of the latest methods for editing images using Stable Diffusion.
Text Prompt Methods
These approaches drive the edit purely through text prompts. A typical recipe optimizes a text embedding to match the input image, fine-tunes the model for better reconstruction, and then linearly interpolates between the target text embedding and the optimized one. Notable techniques in this category include Imagic, P+, DiffusionCLIP, and Blended Diffusion.
Imagic
Imagic stands out for its ability to perform substantial edits, such as altering object poses or compositions, while preserving the character of the original image. Built on a pre-trained text-to-image diffusion model, it optimizes a text embedding to match both the input image and the target text, fine-tunes the model around that embedding, and generates the edit from an interpolation between the optimized and target embeddings. The result is high-quality, versatile semantic editing of objects, though the method is less effective for global edits across the entire image.
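As a concrete reference, here is a minimal sketch of that three-stage recipe. It assumes Stable Diffusion v1.5 loaded via the diffusers library (the paper itself uses Imagen), a placeholder latent in place of a real VAE-encoded input image, and illustrative hyperparameters; the fine-tuning stage is only indicated as a comment, and the checkpoint identifier may need adjusting.

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
tokenizer, text_encoder, unet, scheduler = pipe.tokenizer, pipe.text_encoder, pipe.unet, pipe.scheduler

@torch.no_grad()
def encode_text(prompt):
    ids = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids.to(device)
    return text_encoder(ids)[0]

# Placeholder: in practice this is the VAE latent of the real input image.
image_latents = torch.randn(1, 4, 64, 64, device=device)

# Stage A: optimize the text embedding so that it reconstructs the input image.
e_tgt = encode_text("a photo of a sitting dog")          # target text
e_opt = e_tgt.clone().requires_grad_(True)
opt = torch.optim.Adam([e_opt], lr=1e-3)
for _ in range(100):
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=device)
    noise = torch.randn_like(image_latents)
    noisy = scheduler.add_noise(image_latents, noise, t)
    loss = torch.nn.functional.mse_loss(unet(noisy, t, encoder_hidden_states=e_opt).sample, noise)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage B (omitted): fine-tune the diffusion model itself with e_opt frozen,
# which further improves reconstruction of the input image.

# Stage C: interpolate between the optimized and target embeddings and generate the edit.
eta = 0.7                                                # edit strength
e_edit = eta * e_tgt + (1 - eta) * e_opt.detach()
edited = pipe(prompt_embeds=e_edit).images[0]
```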
P+
P+ introduces Extended Textual Conditioning, which involves partitioning the cross-attention layers of a denoising U-net into subsets with different resolutions and injecting different textual prompts into these layers. Extended Textual Inversion is employed, where input images are inverted into a set of token embeddings, enhancing the model’s ability to represent subjects with greater fidelity and mix different object styles effectively. However, it’s important to note that this method is less suitable for global image editing.
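As a rough, toy illustration (not the paper's code), the sketch below conditions each cross-attention "resolution" on its own prompt embedding instead of a single shared one, which is the essence of extended textual conditioning; module names, dimensions, and the resolution split are made up for the example.

```python
import torch
import torch.nn as nn

class ToyCrossAttention(nn.Module):
    """Single-head cross-attention between image features and a text embedding."""
    def __init__(self, dim, txt_dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(txt_dim, dim), nn.Linear(txt_dim, dim)
    def forward(self, x, txt):
        attn = torch.softmax(self.q(x) @ self.k(txt).transpose(-1, -2) / x.shape[-1] ** 0.5, dim=-1)
        return attn @ self.v(txt)

# One prompt embedding per U-Net resolution instead of a single shared embedding.
prompts_per_res = {
    64: torch.randn(1, 77, 768),   # coarse layers: layout / structure prompt
    32: torch.randn(1, 77, 768),
    16: torch.randn(1, 77, 768),   # fine layers: appearance / style prompt
}
layers = {res: ToyCrossAttention(320, 768) for res in prompts_per_res}
feats = {res: torch.randn(1, res * res, 320) for res in prompts_per_res}

# Each cross-attention layer attends to the embedding assigned to its resolution.
outputs = {res: layers[res](feats[res], prompts_per_res[res]) for res in prompts_per_res}
```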
DiffusionCLIP
DiffusionCLIP uses a DDIM inversion to convert input images to latent noises, which are then reversed with a fine-tuned score function guided by CLIP loss based on text prompts. This method demonstrates strong performance in object editing.
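The guidance signal can be illustrated with a directional CLIP loss of the kind DiffusionCLIP uses: the change between the source and edited image in CLIP space should align with the change between the source and target text. The sketch below uses Hugging Face's CLIP and skips the proper image preprocessing as well as the actual DDIM fine-tuning loop.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_feat(s):
    return clip.get_text_features(**tok(s, return_tensors="pt"))

def directional_clip_loss(x_src, x_edit, src_text, tgt_text):
    # Delta-image (how the image moved in CLIP space) should align with Delta-text.
    d_img = clip.get_image_features(pixel_values=x_edit) - clip.get_image_features(pixel_values=x_src)
    d_txt = text_feat(tgt_text) - text_feat(src_text)
    return 1 - torch.nn.functional.cosine_similarity(d_img, d_txt).mean()

# Toy usage: random 224x224 tensors stand in for properly preprocessed images.
x_src = torch.rand(1, 3, 224, 224)
x_edit = torch.rand(1, 3, 224, 224, requires_grad=True)
loss = directional_clip_loss(x_src, x_edit, "a photo of a dog", "a photo of a fox")
loss.backward()   # gradients w.r.t. the edited image drive the fine-tuning of the score function
```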
Blended Diffusion
Blended Diffusion is a method for local, region-based editing of natural images using natural language descriptions and an ROI mask. The approach combines a DDPM with a CLIP model. Unlike naive combinations of DDPM and CLIP, which often fail to preserve the background, this method blends the CLIP-guided diffusion latents with noised versions of the input image at each diffusion step. This results in natural-looking edits that are coherent with the unaltered parts of the image.
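The key blending step can be sketched in a few lines: at every diffusion step the CLIP-guided latent is kept only inside the ROI mask, while everything outside the mask is overwritten with a correspondingly noised copy of the input. The sketch below uses placeholder tensors and a toy noising function rather than a real scheduler.

```python
import torch

def blend_step(x_guided_t, x0_input, mask, noise_fn, t):
    """One Blended-Diffusion-style blend at timestep t.

    x_guided_t: latent after the CLIP-guided denoising step
    x0_input:   the original (unedited) input image/latent
    mask:       ROI mask, 1 inside the region to edit, 0 outside
    noise_fn:   function that noises x0_input to the noise level of step t
    """
    x_input_t = noise_fn(x0_input, t)                 # noised version of the input at level t
    return mask * x_guided_t + (1 - mask) * x_input_t # edit inside, original outside

# Toy usage with random tensors standing in for real latents.
x_guided = torch.randn(1, 3, 256, 256)
x0 = torch.randn(1, 3, 256, 256)
mask = torch.zeros(1, 1, 256, 256); mask[..., 64:192, 64:192] = 1.0
blended = blend_step(x_guided, x0, mask, lambda x, t: x + 0.1 * t * torch.randn_like(x), t=5)
```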
Mask-based methods
These methods utilize masks, either provided by the user or generated automatically, for precise object editing, ensuring the overall structure of the image remains intact. Techniques such as GLIDE, DiffEdit, and SpaText are notable in this category. Additionally, methods that generate an edit layer over the input, such as Text2LIVE, extend this idea to global image editing.
SpaText
In SpaText, the user provides a global text prompt describing the entire scene together with a segmentation map whose regions are annotated with free-form text. This spatial free-form text gives nuanced, user-specified control over the generation process: the user dictates the appearance, position, and shape of specific elements, while the rest of the scene, left unspecified, is filled in automatically by the model so that the complete image aligns with the textual descriptions.
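Conceptually, the conditioning is a spatio-textual representation: each user-drawn segment is embedded with CLIP and that embedding is painted into the segment's pixels, giving the model a dense, spatially localized description. The sketch below is a simplified toy version (SpaText additionally maps text embeddings into CLIP's image-embedding space with a prior model); shapes and prompts are illustrative.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def spatio_textual_map(segments, h=64, w=64):
    """segments: list of (binary_mask[h, w], local_prompt) pairs."""
    d = clip.config.projection_dim
    canvas = torch.zeros(d, h, w)                   # unspecified pixels stay zero
    for mask, prompt in segments:
        emb = clip.get_text_features(**tok(prompt, return_tensors="pt"))[0]
        canvas[:, mask.bool()] = emb[:, None]       # paint the embedding into the segment
    return canvas

m1 = torch.zeros(64, 64); m1[10:30, 10:30] = 1     # "a red brick wall" region
m2 = torch.zeros(64, 64); m2[35:60, 20:55] = 1     # "a sleeping cat" region
cond = spatio_textual_map([(m1, "a red brick wall"), (m2, "a sleeping cat")])
```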
DiffEdit
DiffEdit automatically generates a mask highlighting the regions of the input image that need editing by contrasting the denoiser's noise predictions conditioned on the original caption and on the edit query. This mask then guides the diffusion process, ensuring edits are confined to the areas dictated by the text query while the background is preserved.
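A toy sketch of that mask-estimation idea: noise the input, obtain noise predictions under the two conditionings, and threshold their normalized, averaged difference. Here `predict_noise` is a stand-in for a real conditioned U-Net call, and the noise levels and threshold are illustrative.

```python
import torch

def estimate_edit_mask(x0, emb_src, emb_query, predict_noise, n_trials=10, threshold=0.5):
    """DiffEdit-style mask: regions where the two conditionings disagree the most."""
    diffs = []
    for _ in range(n_trials):
        t = torch.rand(())                                   # toy noise level in [0, 1)
        xt = x0 + t * torch.randn_like(x0)                   # noised input
        eps_src = predict_noise(xt, t, emb_src)              # noise prediction, source caption
        eps_qry = predict_noise(xt, t, emb_query)            # noise prediction, edit query
        diffs.append((eps_src - eps_qry).abs().mean(dim=1, keepdim=True))
    diff = torch.stack(diffs).mean(dim=0)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)   # normalize to [0, 1]
    return (diff > threshold).float()                        # binary edit mask

# Toy usage: a fake "U-Net" that reacts differently to the two embeddings.
fake_unet = lambda x, t, emb: x * emb.mean()
mask = estimate_edit_mask(torch.randn(1, 4, 64, 64), torch.tensor([1.0]), torch.tensor([2.0]), fake_unet)
```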
GLIDE
GLIDE is fine-tuned for image inpainting, enabling text-driven editing of a chosen region in real images so that it matches the style and lighting of its context. It can insert new objects, shadows, and reflections realistically, suggesting its potential for detailed, text-prompt-based image editing. The paper also shows that classifier-free guidance works better than CLIP guidance.
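The classifier-free guidance rule itself is a one-liner: extrapolate from the unconditional noise prediction toward the text-conditioned one. A minimal sketch with placeholder tensors standing in for real U-Net outputs:

```python
import torch

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0):
    # eps_hat = eps_uncond + s * (eps_cond - eps_uncond); s > 1 strengthens the text condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond, eps_cond = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
eps_hat = classifier_free_guidance(eps_uncond, eps_cond)
```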
Text2LIVE
The technique involves generating an RGBA edit layer, which is overlaid on the original input to make localized, semantic edits. It leverages a pre-trained CLIP model and does not require user-provided masks or pre-trained generators. The method is capable of realistic texture synthesis and complex semi-transparent effects in a semantically meaningful manner, maintaining high fidelity to the original input.
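The compositing step is plain alpha blending: the generated edit layer contributes RGB content weighted by its opacity map, and the input shows through everywhere else. A minimal sketch with placeholder tensors (in Text2LIVE the RGBA layer comes from a CLIP-supervised generator):

```python
import torch

def composite_edit_layer(image, edit_rgba):
    """image: (3, H, W) input; edit_rgba: (4, H, W) generated edit layer (RGB + alpha)."""
    rgb, alpha = edit_rgba[:3], edit_rgba[3:4]
    return alpha * rgb + (1 - alpha) * image          # localized, semi-transparent edit

image = torch.rand(3, 256, 256)
edit_layer = torch.rand(4, 256, 256)                  # placeholder for the generated layer
edited = composite_edit_layer(image, edit_layer)
```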
Attention-based methods
In this innovative approach, the process begins by inverting the source images, if real, to extract cross-attention maps. Following this, image editing is executed with a focus on attention control. This technique is exemplified in methods like Prompt-to-Prompt, Plug-and-Play, MasaCtrl, StyleDiffusion, and pix2pix-zero. Each of these methods harnesses attention mechanisms to guide and refine the editing process, ensuring more targeted and effective modifications.
Prompt-to-Prompt
Prompt-to-Prompt (P2P) allows controlling the synthesis process solely through edits to the text prompt. The paper considers three primary cases of editing (a toy sketch of the underlying attention control follows the list):
- Word Swap Control: This editing mode involves swapping one word in the original prompt with another. The cross-attention maps from the original image are maintained to preserve the scene’s composition. This allows the image to adapt to the new word while keeping the overall structure and layout consistent with the original image.
- Prompt Refinement Control: In this mode, new words are added to the prompt. The method freezes the attention to previous tokens while allowing new attention to flow to the new tokens. This technique enables global editing or specific object modification within the image. It effectively allows users to expand or refine the original concept of the image while maintaining a coherent structure.
- Attention Re-weighting Control: Here, the user can increase or decrease the attention weights of specific tokens in the prompt. This results in either amplifying or attenuating the semantic effect of those tokens on the generated image. For instance, making a particular aspect of an image more or less pronounced based on how much emphasis is placed on the corresponding word in the prompt.
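As a rough illustration of what these modes have in common, the toy sketch below shows a cross-attention call where the maps computed for the source prompt can be re-injected (word swap, refinement) or re-weighted per token (re-weighting). Shapes, names, and weights are illustrative, not the paper's code.

```python
import torch

def cross_attention(q, k, v, saved_maps=None, token_weights=None):
    """q: (N, d) image queries; k, v: (T, d) text keys/values for T prompt tokens."""
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (N, T) attention maps
    if saved_maps is not None:
        attn = saved_maps                 # word swap / refinement: reuse the source maps
    if token_weights is not None:
        attn = attn * token_weights       # re-weighting: scale chosen tokens' influence
    return attn @ v, attn

q, k_src, v_src = torch.randn(4096, 64), torch.randn(8, 64), torch.randn(8, 64)
_, src_maps = cross_attention(q, k_src, v_src)                   # source pass: save the maps

k_tgt, v_tgt = torch.randn(8, 64), torch.randn(8, 64)            # edited prompt ("cat" -> "dog")
out_swap, _ = cross_attention(q, k_tgt, v_tgt, saved_maps=src_maps)   # keep layout, change content

weights = torch.ones(8); weights[3] = 2.5                        # emphasize token #3 of the prompt
out_reweight, _ = cross_attention(q, k_tgt, v_tgt, token_weights=weights)
```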
pix2pix-zero
The paper introduces a method based on DDIM image inversion and cross-attention guidance:
- DDIM inversion with a regularization that keeps the inverted noise close to Gaussian white noise.
- Automatic editing-direction discovery, which is more robust because the direction is computed from many sentences rather than a single word pair (a sketch of this step follows the list).
- Content preservation via cross-attention guidance: the cross-attention maps of the input image are maintained throughout the diffusion process to preserve the original structure.
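As an illustration of the direction-discovery step, the sketch below computes a "cat to dog" edit direction as the difference of mean CLIP text embeddings over two small sentence banks. The model ID is the CLIP text encoder used by Stable Diffusion; the paper builds much larger sentence banks with an off-the-shelf language model, and the tiny lists here are placeholders.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def mean_embedding(sentences):
    ids = tok(sentences, padding="max_length", max_length=tok.model_max_length,
              truncation=True, return_tensors="pt")
    return enc(ids.input_ids).last_hidden_state.mean(dim=0)      # average over sentences

# Small illustrative sentence banks (in practice a large bank of generated sentences per concept).
cat_bank = ["a photo of a cat", "a cat sitting on a sofa", "a cute cat outdoors"]
dog_bank = ["a photo of a dog", "a dog sitting on a sofa", "a cute dog outdoors"]

edit_direction = mean_embedding(dog_bank) - mean_embedding(cat_bank)   # (77, 768), cat -> dog
# During sampling this direction is added to the source prompt's embedding to steer the edit.
```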
Side-network/adapter-based methods
These techniques retain the original weights of Stable Diffusion while still allowing significant fine-tuning and customization. The core principle is to pair the existing Stable Diffusion model with an additional network or adapter. This auxiliary component learns the new features or styles and guides the main model toward more targeted outputs. Leading papers: ControlNet, CycleNet, T2I-Adapter.
ControlNet
ControlNet transforms images according to a target prompt while being guided by a conditioning signal such as a reference image, segmentation map, edge map, and more. At its core, it keeps the weights of Stable Diffusion frozen and trains a copy of the U-Net encoder blocks, connected to the frozen model through zero-initialized convolutions. Training uses triples consisting of an image, a corresponding prompt, and a conditioning input (Canny edges, segmentation map, human pose, depth map, etc.).
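As a usage sketch, a ControlNet checkpoint can be loaded with the diffusers library and plugged into a Stable Diffusion pipeline, here with a Canny edge map computed by OpenCV as the conditioning. The file name, prompt, and checkpoint identifiers are illustrative and may need adjusting to what is available locally.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Build the conditioning: a Canny edge map of the image we want to re-render.
img = np.array(Image.open("input.jpg"))                     # placeholder input image
edges = Image.fromarray(np.stack([cv2.Canny(img, 100, 200)] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The frozen SD backbone generates; the ControlNet branch steers it with the edge map.
result = pipe("a bronze statue in a museum", image=edges, num_inference_steps=30).images[0]
result.save("edited.png")
```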
CycleNet
CycleNet introduces cycle consistency to regularize image manipulation: a transformation from one domain to another (and back) should preserve the original content. In I2I translation, this means an image converted from one style or format to another should revert to its original state with minimal loss of information or quality. CycleNet remains robust even with limited training data (around 2k images) and modest compute (a single GPU).
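The cycle-consistency idea can be written down as a loss: translate an image to the target domain, translate the result back, and penalize the difference from the original. In the sketch below, `translate` is a stand-in for one prompt-conditioned translation pass, not CycleNet's actual implementation.

```python
import torch

def cycle_consistency_loss(x, translate, src_prompt, tgt_prompt):
    """||backward(forward(x)) - x||_1 for a prompt-conditioned translator."""
    x_tgt = translate(x, prompt=tgt_prompt)        # e.g. "summer" -> "winter"
    x_back = translate(x_tgt, prompt=src_prompt)   # "winter" -> "summer"
    return torch.nn.functional.l1_loss(x_back, x)

# Toy usage: a fake translator that just shifts pixel values per prompt.
fake_translate = lambda x, prompt: x + (0.1 if prompt == "winter" else -0.1)
loss = cycle_consistency_loss(torch.rand(1, 3, 64, 64), fake_translate, "summer", "winter")
```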
Inversion-based methods
- Optimization-based inversion, with DDIM [Prompt Tuning Inversion, StyleDiffusion, pix2pix-zero, Null-text Inversion]. These methods "correct" the forward latents guided by the source prompt (the source branch) by aligning them with the DDIM/DDPM inversion trajectory.
- Dual-branch inversion, with DDIM [Direct Inversion] or DDPM [CycleDiffusion]. These methods separate the source and target branches of the editing process: the source branch is reverted directly back to z_0, while the trajectory of the target branch is calibrated iteratively. At each step t, they compute the distance between the source branch and the inversion branch (or a sample drawn via q-sampling in CycleDiffusion) and use it to calibrate the target branch.
- Virtual inversion, with DDCM [Inversion-Free]. This method also separates the source and target branches, but the forward process can start from any random noise and supports multi-step consistency sampling. It guarantees exact consistency between the original and reconstructed images, since each step z_{t-1} on the forward branch depends only on the ground-truth z_0 rather than on the previous step z_t.
- Direct inversion [Direct Inversion, PnP Inversion].
- Other approaches, with DDIM [EDICT] or with DDPM [LEDITS].
A minimal DDIM-inversion sketch, the shared building block behind these approaches, follows.
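The sketch assumes a generic noise predictor standing in for a prompt-conditioned U-Net and a standard linear beta schedule; real implementations operate on VAE latents and condition on the source prompt.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, predict_noise, alphas_cumprod, num_steps=50):
    """Deterministic DDIM inversion: run the DDIM update "backwards" from x_0 toward x_T.

    predict_noise(x, t) is the (prompt-conditioned) noise predictor, e.g. a U-Net.
    alphas_cumprod is the cumulative product of the diffusion schedule.
    """
    T = len(alphas_cumprod)
    timesteps = torch.linspace(0, T - 1, num_steps).long()
    x = x0
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = predict_noise(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # step toward higher noise
    return x  # approximately the noise latent that regenerates x0 under DDIM sampling

# Toy usage with a fake noise predictor and a linear beta schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
z_T = ddim_invert(torch.randn(1, 4, 64, 64), lambda x, t: torch.randn_like(x), alphas_cumprod)
```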
Null-text inversion
The method starts from a DDIM inversion of the input image conditioned on the source prompt, which provides a pivot trajectory. It then performs pivotal tuning: at each diffusion step, the unconditional ("null-text") embedding used by classifier-free guidance is optimized so that the guided sampling latent stays close to the corresponding latent of the DDIM-inversion trajectory. Minimizing this gap prevents reconstruction errors from accumulating across steps, so generating from the inverted state closely reproduces the original image and provides a faithful starting point for high-quality edits.
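A toy sketch of the per-step null-text optimization, where the DDIM-inversion latents serve as pivots and `guided_step` is a stand-in for one real classifier-free-guided DDIM sampling step (names and tensor shapes are illustrative):

```python
import torch

def null_text_inversion(pivot_latents, cond_emb, uncond_emb_init, guided_step, n_inner=10, lr=1e-2):
    """pivot_latents: list [z_T, ..., z_0] from DDIM inversion (the pivot trajectory).
    Returns one optimized null-text embedding per timestep."""
    null_embs, z = [], pivot_latents[0]
    for t in range(len(pivot_latents) - 1):
        uncond = uncond_emb_init.clone().detach().requires_grad_(True)
        opt = torch.optim.Adam([uncond], lr=lr)
        for _ in range(n_inner):
            z_prev = guided_step(z, t, cond_emb, uncond)          # one CFG-guided DDIM step
            loss = torch.nn.functional.mse_loss(z_prev, pivot_latents[t + 1])
            opt.zero_grad(); loss.backward(); opt.step()
        null_embs.append(uncond.detach())
        with torch.no_grad():
            z = guided_step(z, t, cond_emb, uncond)               # continue from the tuned step
    return null_embs

# Toy usage: random tensors stand in for real latents/embeddings, plus a fake guided step.
pivots = [torch.randn(1, 4, 8, 8) for _ in range(5)]
fake_step = lambda z, t, c, u: z * 0.9 + 0.1 * u.mean()
nulls = null_text_inversion(pivots, torch.randn(1, 77, 768), torch.zeros(1, 77, 768), fake_step)
```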
PnP Inversion
Plug-and-Play (PnP) Inversion (ICLR 2024) disentangles the source and target branches so that each can excel in its designated role, preservation or editing:
- Source branch adjustment: the method adds the difference (z^Ip_t - z^Fsg_t) back to z^Fsg_t, closing the gap between the initial source image (z^src_0) and the processed image (z^Fsg_0). This simple correction fixes deviations in the image's trajectory and is easy to integrate with various editing methods.
- Target branch unchanged: the target branch is left unaltered to fully exploit the diffusion model's capacity for generating the target image, so the edits align closely with the target prompt.
The method requires no optimization, which significantly reduces editing time, and it does not perturb the distribution of the diffusion model's input or the target latent.
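The source-branch correction amounts to one line per timestep: add the gap back so the source branch lands exactly on its inversion latent, with no optimization, while the target branch is stepped as usual. A toy sketch with placeholder tensors and hypothetical names (not the paper's notation):

```python
import torch

def corrected_source_step(z_inv_t, z_src_t):
    # Add back the difference so the source branch matches the inversion branch exactly.
    return z_src_t + (z_inv_t - z_src_t)   # equals z_inv_t; no optimization involved

z_inv_t, z_src_t = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
z_src_corrected = corrected_source_step(z_inv_t, z_src_t)   # the target branch is left untouched
```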
Summary
The realm of editing both generated and real images using Stable Diffusion is burgeoning with novel and fascinating works. While there isn’t yet a universal method suitable for all types of edits, certain techniques excel in specific areas. Some are adept at object-specific editing, while others effectively handle global image transformations, such as converting summer scenes to winter or day to night. The runtime for these approaches also varies significantly. This indicates that there is ample opportunity for further advancements and improvements in this exciting field.