US20250005812
2025-01-02
Physics
G06T11/001
The described system addresses the limitations of conventional human reposing techniques by utilizing multiple input views to generate digital images. It processes input data that includes digital images of a person, keypoints representing poses in these images, and keypoints for a target pose. By leveraging a machine learning model, the system creates selection masks that indicate spatial correspondence likelihoods between pixels of an output image and portions of the input images. This approach enables the generation of an output image depicting the person in a desired target pose with reduced artifacts and increased realism.
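As a concrete illustration of how selection masks can combine multiple candidate views, the sketch below softmax-normalizes per-source correspondence scores into masks that sum to one at every pixel and takes a mask-weighted sum. This is a hypothetical, simplified formulation; the function and variable names are illustrative and not drawn from the patent:

```python
import numpy as np

def fuse_with_selection_masks(candidates, scores):
    """Fuse per-source candidate images into one output image.

    candidates: (N, H, W, 3) array of candidate images, one per input view
    scores:     (N, H, W) array of unnormalized correspondence scores

    A softmax across the N sources turns the scores into selection masks
    that sum to 1 at every pixel; the output is the mask-weighted sum.
    """
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    masks = np.exp(scores)
    masks /= masks.sum(axis=0, keepdims=True)            # (N, H, W)
    return (masks[..., None] * candidates).sum(axis=0)   # (H, W, 3)
```

A pixel whose score strongly favors one source draws almost entirely from that source, which is how missing detail in one view can be filled from another.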
Traditional systems, which rely on a single source image, often struggle to generate accurate images when the source and target poses differ significantly, producing visual artifacts around occluded regions. The new system overcomes these challenges by drawing on multiple input images, so that details missing or occluded in one view can be sourced from another, yielding a more accurate and realistic depiction of the target pose.
The system employs three distinct machine learning models. The first includes a convolutional neural network that generates visibility segment maps and predicted images; the maps identify which parts of the person are visible or occluded in each input image. The second uses a transformer and a feature pyramid network to create the selection masks, capturing inter-channel relationships and performing per-pixel segmentation. A third model then processes the outputs of the first two to produce the final output image.
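The three-stage data flow described above can be sketched as a simple composition. The function signatures and names here are illustrative assumptions, not the patent's actual interfaces; toy stand-ins replace the trained networks so the skeleton runs end to end:

```python
import numpy as np

def repose(views, target_kp, model1, model2, model3):
    """Sketch of the three-stage reposing pipeline.

    views:     list of (image, keypoints) pairs, one per input view
    target_kp: keypoints of the desired target pose
    model1:    per-view CNN -> (visibility segment map, predicted image)
    model2:    transformer + FPN -> selection masks over the views
    model3:    fusion network -> final output image
    """
    vis_maps, predicted = zip(*(model1(img, kp, target_kp)
                                for img, kp in views))
    masks = model2(list(predicted), list(vis_maps), target_kp)
    return model3(list(predicted), masks, target_kp)
```

A quick check with averaging stand-ins in place of the learned models confirms the shapes flow through:

```python
H, W = 8, 8
views = [(np.random.rand(H, W, 3), np.random.rand(17, 2)) for _ in range(2)]
m1 = lambda img, kp, tkp: (np.ones((H, W)), img)
m2 = lambda preds, vis, tkp: np.full((len(preds), H, W), 1.0 / len(preds))
m3 = lambda preds, masks, tkp: sum(m[..., None] * p for m, p in zip(masks, preds))
out = repose(views, np.random.rand(17, 2), m1, m2, m3)
print(out.shape)  # (8, 8, 3)
```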
The reposing system generates flow-field pyramids at multiple resolutions for the visible and invisible parts of the target pose in each input image. These pyramids are combined via gated aggregation into composite flows, which are then upsampled to produce the predicted images. Selection masks are generated with attention computed within shifted windows, merging information across scales. The final output image is created by fusing pose features and texture features processed by the encoders and decoders of the third machine learning model.
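A minimal sketch of the gated-aggregation and upsampling steps, assuming a per-pixel sigmoid gate (in the actual system the gate would be predicted by the network; the names below are illustrative):

```python
import numpy as np

def gated_aggregate(flow_visible, flow_invisible, gate_logits):
    """Blend the visible-part and invisible-part flow fields with a
    per-pixel sigmoid gate, producing one composite flow per pyramid level."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))      # (H, W, 1), values in [0, 1]
    return g * flow_visible + (1.0 - g) * flow_invisible

def upsample_flow(flow, factor=2):
    """Nearest-neighbor upsample a flow field, scaling the displacement
    values so they remain valid at the higher resolution."""
    up = flow.repeat(factor, axis=0).repeat(factor, axis=1)
    return up * factor
```

Scaling the displacements by the upsampling factor matters: a one-pixel offset at a coarse level corresponds to a `factor`-pixel offset after upsampling.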
The described system can be implemented on various computing devices connected to a network, ranging from full-resource devices like personal computers to low-resource mobile devices. It is also adaptable for cloud-based operations using multiple servers. This flexibility allows the system to be utilized in diverse environments for generating high-quality digital images depicting humans in desired poses from multiple viewpoints.