US20260038126
2026-02-05
Physics
G06T7/194
A processing device generates mask data and foreground feature data from frames of a video depicting a subject. The mask data separates the subject from its original environment, while the foreground feature data describes the subject's visual characteristics. Upon receiving a condition frame depicting a different environment, a machine-learning model aligns the subject's movements with that environment to produce a composite video, which is then presented through a user interface.
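For illustration only, the sketch below shows one hypothetical way per-frame mask data and foreground feature data might be derived with a generic segmentation backbone; the `SegmentationBackbone` module, its dimensions, and the masked pooling step are assumptions, not the architecture disclosed here.

```python
import torch
import torch.nn as nn

class SegmentationBackbone(nn.Module):
    """Hypothetical stand-in for any per-frame segmentation network."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, frame: torch.Tensor):
        feats = torch.relu(self.encoder(frame))      # (B, C, H, W) feature map
        mask = torch.sigmoid(self.mask_head(feats))  # (B, 1, H, W) soft foreground mask
        return mask, feats

def extract_subject_data(frames: torch.Tensor, backbone: SegmentationBackbone):
    """Produce mask data and foreground feature data for each frame of a subject video.

    frames: (T, 3, H, W) video tensor. Returns (T, 1, H, W) masks and
    (T, C) foreground features pooled over the masked region.
    """
    masks, features = [], []
    for frame in frames:
        mask, feats = backbone(frame.unsqueeze(0))
        # Pool the feature map over the foreground (masked) region only.
        weighted = (feats * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
        masks.append(mask.squeeze(0))
        features.append(weighted.squeeze(0))
    return torch.stack(masks), torch.stack(features)

if __name__ == "__main__":
    video = torch.rand(8, 3, 64, 64)  # toy 8-frame clip
    mask_data, fg_features = extract_subject_data(video, SegmentationBackbone())
    print(mask_data.shape, fg_features.shape)  # (8, 1, 64, 64), (8, 64)
```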
Video compositing traditionally combines elements from multiple digital content sources into a unified video output, a common example being the replacement of a subject's background. Conventional methods suffer from high computational demands, extensive manual intervention, and limited flexibility. These challenges restrict them to narrow scenarios and make the process inefficient and cumbersome.
The described techniques leverage a processing device to receive an input video and a condition frame, which depict the subject and a desired environment, respectively. Mask data and subject data are generated to isolate and describe the subject, enabling a machine-learning model to create a composite video. This model aligns the subject's movements with the new environment, offering a more efficient and flexible solution compared to traditional methods.
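Read literally, this summary describes a three-stage flow: derive mask data and subject data from the input video, then condition a learned model on both together with the condition frame. Below is a minimal orchestration sketch under those assumptions; the `generate_subject_data` and `compositing_model` callables are placeholders, not the claimed implementation.

```python
from typing import Callable, Tuple
import torch

def composite(
    input_video: torch.Tensor,        # (T, 3, H, W) frames depicting the subject
    condition_frame: torch.Tensor,    # (3, H, W) frame depicting the desired environment
    generate_subject_data: Callable[[torch.Tensor], Tuple[torch.Tensor, torch.Tensor]],
    compositing_model: Callable[..., torch.Tensor],
) -> torch.Tensor:
    """Hypothetical end-to-end flow: isolate the subject, then let a learned
    model place it into the environment shown in the condition frame."""
    # Stage 1: mask data isolates the subject; subject data describes it.
    mask_data, subject_data = generate_subject_data(input_video)
    # Stage 2: the machine-learning model aligns the subject's motion with
    # the new environment and emits composite frames of the same length.
    composite_video = compositing_model(
        video=input_video,
        masks=mask_data,
        subject=subject_data,
        condition=condition_frame,
    )
    return composite_video  # (T, 3, H, W), ready for display in a user interface
```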
The system employs a generative model trained on extensive video datasets to automate background synthesis. By using cross-attention layers in a denoising network, the model focuses on environmental details from the condition frame, ensuring realistic foreground-background interactions. This approach significantly reduces the manual effort and computational resources typically required, allowing for rapid iteration to meet creative goals.
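The cross-attention conditioning described here can be pictured as queries drawn from the denoising network's video latents attending to keys and values computed from condition-frame features. A minimal sketch using standard scaled-dot-product attention and invented dimensions; this is not the disclosed network, only an illustration of the mechanism.

```python
import torch
import torch.nn as nn

class ConditionCrossAttention(nn.Module):
    """Cross-attention block: video latents (queries) attend to
    condition-frame features (keys/values)."""
    def __init__(self, latent_dim: int = 320, cond_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=heads,
            kdim=cond_dim, vdim=cond_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents: torch.Tensor, cond_tokens: torch.Tensor) -> torch.Tensor:
        # latents:     (B, N_latent, latent_dim) noisy video tokens inside the denoiser
        # cond_tokens: (B, N_cond, cond_dim) encoded features of the condition frame
        attended, _ = self.attn(query=self.norm(latents), key=cond_tokens, value=cond_tokens)
        return latents + attended  # residual connection keeps the denoising path intact

if __name__ == "__main__":
    block = ConditionCrossAttention()
    video_tokens = torch.rand(2, 1024, 320)   # e.g. flattened spatio-temporal latents
    env_tokens = torch.rand(2, 256, 768)      # e.g. encoded condition frame
    print(block(video_tokens, env_tokens).shape)  # torch.Size([2, 1024, 320])
```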
In practice, this technology applies to industries such as film and visual effects, where integrating subjects into different environments is common. The system uses a machine-learning model, such as a diffusion-based model, to generate realistic composite videos. It aligns the subject's movements and viewpoint changes with the rendered environment, synthesizing new views and details of that environment as needed. This capability allows subjects to be integrated creatively into varied scenes, improving the realism and flexibility of video compositing.
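For intuition, a diffusion-based compositor of the kind named above would typically start from noise and iteratively denoise video latents while conditioning on the subject data and the condition frame. The schematic DDPM-style sampling loop below uses a toy noise schedule, and the `denoiser` interface is an assumption rather than the model described here.

```python
import torch

@torch.no_grad()
def sample_composite_latents(
    denoiser,                 # callable: (latents, t, subject, condition) -> predicted noise
    subject_data: torch.Tensor,
    condition_frame: torch.Tensor,
    shape=(1, 8, 4, 32, 32),  # (batch, frames, channels, H, W) latent video
    steps: int = 50,
) -> torch.Tensor:
    """Schematic DDPM-style reverse process conditioned on subject and environment."""
    betas = torch.linspace(1e-4, 0.02, steps)   # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    latents = torch.randn(shape)                # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(latents, t, subject_data, condition_frame)  # predict the noise
        # Standard DDPM mean update toward the denoised latent video.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        latents = (latents - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            latents = latents + torch.sqrt(betas[t]) * torch.randn_like(latents)
    return latents  # decoded afterwards into composite video frames

if __name__ == "__main__":
    # Toy denoiser that ignores its conditioning; a real model would not.
    toy = lambda x, t, s, c: torch.zeros_like(x)
    out = sample_composite_latents(toy, torch.zeros(1, 64), torch.zeros(3, 32, 32))
    print(out.shape)  # torch.Size([1, 8, 4, 32, 32])
```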