US20240386627
2024-11-21
Physics
G06T11/001
The described image transformation system uses a generator network to edit images based on text prompts. Given an input image and a text prompt, multiple layers within the generator network each perform a respective edit: each layer modifies specific attributes of the image, guided by masks that define local edit regions. This allows precise, localized modifications in accordance with the text prompt, producing an edited image that reflects the requested changes.
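A minimal sketch of the top-level flow may help fix ideas. The function and module names below (image_encoder, prompt_encoder, generator) are assumptions for illustration, not terms from the publication; the generator is assumed to perform the per-layer, mask-guided blending described above.

```python
import torch

def edit_image(input_image: torch.Tensor, text_prompt: str,
               image_encoder, prompt_encoder, generator) -> torch.Tensor:
    # Hypothetical top-level flow: obtain latent codes for the input image,
    # derive an edit signal from the text prompt, and regenerate the image
    # with per-layer, mask-guided edits.
    w = image_encoder(input_image)        # latent representation of the input image
    edit = prompt_encoder(text_prompt)    # edit signal implied by the text prompt
    return generator(w, edit)             # layers blend edited and unedited features
```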
Traditional image editing systems often require manual selection of specific layers within a generator network to apply changes based on text prompts. This process is error-prone and can introduce undesirable artifacts if the wrong layers are chosen. Moreover, these conventional methods may fail to capture edits that span multiple layers. The new technique overcomes these limitations by automating layer selection and applying edits across multiple layers as needed.
The system comprises a mapping network and a generator network with several convolutional layers. Each layer outputs an unedited feature representing the input image and an edited feature that reflects changes based on the text prompt. The system generates masks for each layer to identify local edit regions, blending unedited and edited features to create a cohesive final image. This method ensures that only relevant parts of the image are altered according to the text prompt.
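The following sketch illustrates one way a single generator layer could realize this mask-guided blending. The class name, mask head, and style-scaling scheme are assumptions (a real system would typically use StyleGAN-style modulated convolutions); the key point is that the mask confines the edit to a local region while the unedited feature is kept elsewhere.

```python
import torch
import torch.nn as nn

class MaskedBlendLayer(nn.Module):
    """Illustrative generator layer: computes an unedited feature, an edited
    feature, and a mask marking the local edit region, then blends the two
    features so only the masked region changes."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, prev_feat, style, edited_style):
        # Style vectors scale the feature maps here for simplicity; a real
        # implementation would modulate the convolution weights instead.
        unedited = self.conv(prev_feat * style[:, :, None, None])
        edited = self.conv(prev_feat * edited_style[:, :, None, None])
        mask = self.mask_head(edited)                  # values in [0, 1]: where to edit
        return mask * edited + (1.0 - mask) * unedited # blend only inside the mask
```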
An example involves editing an image of a human subject with the text prompt "beard." The input image is transformed through a series of operations involving latent vectors and affine transformations at each generator layer. Each layer receives a latent style vector and outputs blended features, progressively incorporating changes specified by the text prompt, such as adding a beard.
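A layer-by-layer walk of this kind might look like the sketch below. The per-layer affine transforms map a latent vector to a layer-specific style vector, and each layer is assumed to behave like the blending layer sketched earlier; names and dimensions are placeholders, not the publication's.

```python
import torch
import torch.nn as nn

class StyleAffine(nn.Module):
    """Per-layer affine transform mapping a latent vector w to a layer
    style vector, in the manner of StyleGAN-style generators (a sketch)."""
    def __init__(self, w_dim: int, channels: int):
        super().__init__()
        self.fc = nn.Linear(w_dim, channels)

    def forward(self, w):
        return self.fc(w)

def generate_edited(layers, affines, w, w_edited, const_input):
    # Walk the generator layer by layer: each layer receives its own style
    # vectors via an affine transform of the unedited and edited latents and
    # emits a blended feature, so the edit (e.g. "beard") accumulates gradually.
    feat = const_input
    for layer, affine in zip(layers, affines):
        style = affine(w)                        # style for the unedited path
        edited_style = affine(w_edited)          # style for the edited path
        feat = layer(feat, style, edited_style)  # blended output feature
    return feat
```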
The system utilizes latent edit vectors determined through either a global direction module or a latent mapper module. The global direction module produces input-independent vectors usable across various images, while the latent mapper module generates input-dependent vectors tailored to specific images. These vectors are combined with transformed latent vectors to produce edited latent style vectors for each layer, ensuring precise attribute modifications aligned with the text prompt.
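One way to express this choice between the two modules, and the subsequent combination, is sketched below. The mapper architecture, the dictionary lookup for global directions, and the `strength` scaling are illustrative assumptions; combining the edit vector with the latent code before the per-layer affines is a simplification of the described per-layer combination.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Sketch of an input-dependent mapper: predicts a latent edit vector
    tailored to a specific image's latent code (names are illustrative)."""
    def __init__(self, w_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(w_dim, w_dim), nn.LeakyReLU(0.2), nn.Linear(w_dim, w_dim))

    def forward(self, w):
        return self.net(w)

def latent_edit_vector(w, text_prompt, mode, global_directions=None, mapper=None):
    # Input-independent: look up a precomputed direction reusable across images.
    # Input-dependent: run the mapper on this image's own latent code.
    if mode == "global":
        return global_directions[text_prompt]
    return mapper(w)

def edited_style_per_layer(w, delta, affines, strength=1.0):
    # Combine the edit vector with the latent code, then pass both the
    # original and edited codes through each layer's affine transform to
    # obtain that layer's unedited and edited latent style vectors.
    w_edited = w + strength * delta
    return [(affine(w), affine(w_edited)) for affine in affines]
```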