US20250005812
2025-01-02
Physics
G06T11/001
The described system addresses the limitations of conventional human reposing techniques by utilizing multiple input views to generate digital images. It processes input data that includes digital images of a person, keypoints representing poses in these images, and keypoints for a target pose. By leveraging a machine learning model, the system creates selection masks that indicate spatial correspondence likelihoods between pixels of an output image and portions of the input images. This approach enables the generation of an output image depicting the person in a desired target pose with reduced artifacts and increased realism.
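As a concrete illustration of how selection masks can combine multiple candidate views, the sketch below softmax-normalizes per-source correspondence scores into masks that sum to one at every pixel and takes a mask-weighted sum. This is a hypothetical, simplified formulation; the function and variable names are illustrative and not drawn from the patent:

```python
import numpy as np

def fuse_with_selection_masks(candidates, scores):
    """Fuse per-source candidate images into one output image.

    candidates: (N, H, W, 3) array of candidate images, one per input view
    scores:     (N, H, W) array of unnormalized correspondence scores

    A softmax across the N sources turns the scores into selection masks
    that sum to 1 at every pixel; the output is the mask-weighted sum.
    """
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    masks = np.exp(scores)
    masks /= masks.sum(axis=0, keepdims=True)            # (N, H, W)
    return (masks[..., None] * candidates).sum(axis=0)   # (H, W, 3)
```

A pixel whose score strongly favors one source draws almost entirely from that source, which is how missing detail in one view can be filled from another.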
Traditional systems, which rely on a single source image, often struggle to generate accurate images when the source and target poses differ significantly, producing visual artifacts around occluded regions. The new system overcomes these challenges by drawing on multiple input images, so that details missing or occluded in one view can be sourced from another, yielding a more accurate and realistic depiction of the target pose.
The system employs three distinct machine learning models. The first includes a convolutional neural network that generates visibility segment maps and predicted images; the maps identify which parts of the person are visible or occluded in each input image. The second uses a transformer and a feature pyramid network to create the selection masks, capturing inter-channel relationships and performing per-pixel segmentation. A third model then processes the outputs of the first two to produce the final output image.
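The three-stage data flow described above can be sketched as a simple composition. The function signatures and names here are illustrative assumptions, not the patent's actual interfaces; toy stand-ins replace the trained networks so the skeleton runs end to end:

```python
import numpy as np

def repose(views, target_kp, model1, model2, model3):
    """Sketch of the three-stage reposing pipeline.

    views:     list of (image, keypoints) pairs, one per input view
    target_kp: keypoints of the desired target pose
    model1:    per-view CNN -> (visibility segment map, predicted image)
    model2:    transformer + FPN -> selection masks over the views
    model3:    fusion network -> final output image
    """
    vis_maps, predicted = zip(*(model1(img, kp, target_kp)
                                for img, kp in views))
    masks = model2(list(predicted), list(vis_maps), target_kp)
    return model3(list(predicted), masks, target_kp)
```

A quick check with averaging stand-ins in place of the learned models confirms the shapes flow through:

```python
H, W = 8, 8
views = [(np.random.rand(H, W, 3), np.random.rand(17, 2)) for _ in range(2)]
m1 = lambda img, kp, tkp: (np.ones((H, W)), img)
m2 = lambda preds, vis, tkp: np.full((len(preds), H, W), 1.0 / len(preds))
m3 = lambda preds, masks, tkp: sum(m[..., None] * p for m, p in zip(masks, preds))
out = repose(views, np.random.rand(17, 2), m1, m2, m3)
print(out.shape)  # (8, 8, 3)
```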
The reposing system generates flow-field pyramids at multiple resolutions for the visible and invisible parts of the target pose in each input image. These pyramids are combined via gated aggregation into composite flows, which are then upsampled to produce the predicted images. Selection masks are generated with attention computed within shifted windows, merging information across scales. The final output image is created by fusing pose features and texture features processed by the encoders and decoders of the third machine learning model.
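A minimal sketch of the gated-aggregation and upsampling steps, assuming a per-pixel sigmoid gate (in the actual system the gate would be predicted by the network; the names below are illustrative):

```python
import numpy as np

def gated_aggregate(flow_visible, flow_invisible, gate_logits):
    """Blend the visible-part and invisible-part flow fields with a
    per-pixel sigmoid gate, producing one composite flow per pyramid level."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))      # (H, W, 1), values in [0, 1]
    return g * flow_visible + (1.0 - g) * flow_invisible

def upsample_flow(flow, factor=2):
    """Nearest-neighbor upsample a flow field, scaling the displacement
    values so they remain valid at the higher resolution."""
    up = flow.repeat(factor, axis=0).repeat(factor, axis=1)
    return up * factor
```

Scaling the displacements by the upsampling factor matters: a one-pixel offset at a coarse level corresponds to a `factor`-pixel offset after upsampling.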
The described system can be implemented on various computing devices connected to a network, ranging from full-resource devices like personal computers to low-resource mobile devices. It is also adaptable for cloud-based operations using multiple servers. This flexibility allows the system to be utilized in diverse environments for generating high-quality digital images depicting humans in desired poses from multiple viewpoints.