Prompt-to-Prompt Image Editing with Cross Attention Control, developed by a research team from Google Research in the United States and Tel Aviv University in Israel, is a technique that lets users edit images generated from text. By modifying part of the input sentence, only the corresponding part of the image changes with pinpoint accuracy, without significantly altering the rest of the image.
For example, if you rewrite "a cat riding a bicycle" as "a cat riding a car", only the bicycle is replaced with a car while the scenery and the cat's position stay the same. A wide range of edits is possible, such as redrawing a landscape painting in a child's drawing style, or swapping the gummies decorating a whole cake for gummies from a specific manufacturer.
Large-scale language-image (LLI) models such as Imagen, DALL-E 2, and Parti have shown stunning results and received unprecedented attention from both the research community and the public; Midjourney has also become a hot topic recently.
These LLI models are trained on large language-image datasets and use state-of-the-art image generation models, including autoregressive and diffusion models. They excel at generating images from scratch, but their drawback is that they offer no simple way to edit an image that has already been generated.
To work around this, existing LLI-based editing methods have the user mask part of the image; the masked region is regenerated as the edit while the rest is kept consistent with the original background. Although this approach yields good results, it is procedurally cumbersome and undermines the key advantage of language-image models: being quick and intuitive.
The proposed method, "Prompt-to-Prompt", edits a previously generated image by modifying only part of the text used as input, changing only the corresponding part of the image.
The method performs local image editing by modifying the pixel-text interaction that occurs in the cross-attention layers. Specifically, it injects the cross-attention maps from the original prompt during the diffusion steps of the edited prompt, controlling which pixels attend to which tokens at which diffusion step.
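The injection idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the dimensions and function names are hypothetical, and the sketch only shows the core trick: compute the cross-attention map for the source prompt, then reuse that map when generating with the edited prompt so the spatial layout is preserved.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(pixel_queries, token_keys, token_values, injected_map=None):
    """One cross-attention step: image pixels (queries) attend to prompt tokens
    (keys/values). If injected_map is given, it replaces the freshly computed
    attention map -- the core idea behind Prompt-to-Prompt editing."""
    d = token_keys.shape[-1]
    attn = softmax(pixel_queries @ token_keys.T / np.sqrt(d))  # (pixels, tokens)
    if injected_map is not None:
        attn = injected_map  # reuse the source prompt's map to keep the layout
    return attn @ token_values, attn

# Toy example with made-up sizes: 4 "pixels", 3 prompt tokens, feature dim 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))                                  # pixel queries
K_src, V_src = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))    # source prompt
K_edit, V_edit = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))  # edited prompt

# Pass 1: run with the source prompt and record its attention map
_, src_map = cross_attention(Q, K_src, V_src)

# Pass 2: run with the edited prompt, but inject the recorded map so the
# structure of the original image carries over; only the values change
out_edit, used_map = cross_attention(Q, K_edit, V_edit, injected_map=src_map)
```

In the real method this injection happens inside a diffusion model's U-Net across many denoising steps, and is applied only for a chosen range of steps to balance structure preservation against edit fidelity.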
The image produced by this method retains most of the structure of the original, with only the edited part modified according to the new prompt. This makes it easy to meet needs such as keeping the overall composition of a generated image you like while changing just one element.