Invention Title:

DIGITAL VIDEO EDITING BASED ON A TARGET DIGITAL IMAGE

Publication number:

US20250265752

Publication date:

Section:

Physics

Class:

G06T11/60

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application describes digital video editing techniques guided by a target digital image. The methods receive a target text prompt, a target digital image depicting a target object, and a source digital video whose frames depict a source object. Machine-learning models, specifically diffusion models, identify regions-of-interest within the source video frames based on the target text prompt and the target image. New video frames featuring the target object are then generated within those regions, ensuring visual and temporal coherence.
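
As a rough illustration, the overall flow can be sketched in Python as follows. Every name and signature below is a hypothetical stand-in rather than an API from the patent, and frame loading, mask estimation, and frame generation are stubbed out:

    # Hypothetical sketch of the described pipeline; none of these
    # names come from the patent itself.
    from dataclasses import dataclass, field

    @dataclass
    class EditRequest:
        target_text_prompt: str    # e.g. "a golden retriever"
        target_image: bytes        # target digital image depicting the target object
        source_frames: list = field(default_factory=list)  # frames depicting the source object

    def find_regions_of_interest(request):
        # Placeholder: the described method derives one mask per frame
        # from the target prompt and target image (see Implementation
        # Details below).
        return [None] * len(request.source_frames)

    def generate_target_frames(request, masks):
        # Placeholder: a generative model would redraw each masked
        # region so the target object replaces the source object while
        # the frames stay visually and temporally coherent.
        return list(request.source_frames)

    def edit_video(request):
        masks = find_regions_of_interest(request)
        return generate_target_frames(request, masks)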

Challenges with Conventional Techniques

Traditional digital video editing methods based on machine-learning models face significant challenges, particularly in maintaining accuracy and consistency across video frames. These methods typically rely on text prompts alone, which limits their expressive power and their ability to handle edits involving objects of varying sizes and shapes. As a result, conventional techniques suffer from visual inaccuracies and inefficiencies, especially when a source object is replaced by a target object of different dimensions, often producing noticeable artifacts.

Innovative Approach

The described techniques address these limitations by incorporating a target digital image as a visual guide. The image enhances the expressiveness of the editing process, enabling accurate object replacement even when the source and target objects differ in shape and size. The method also enforces temporal consistency between frames, which is crucial for producing visually coherent videos. Using generative machine-learning models, these techniques transform a source object into a target object within a digital video while preserving the source object's movement patterns.

Implementation Details

The process begins by identifying regions-of-interest in the source video frames using diffusion models, which generate a mask for each frame. These models include two branches: one processes the source text prompt, and the other processes the target text prompt together with the target digital image. The source video frames are perturbed with randomized latent noise, which each branch then denoises. The differences between the branches' denoising outputs are analyzed to form the masks defining the regions-of-interest, which are in turn used to generate the final target video.
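
A compact numpy sketch of this mask-derivation step follows. The denoise function is a stand-in for a pretrained, conditioned diffusion branch, and the threshold value is illustrative; only the overall structure (noising, two-branch denoising, differencing, thresholding) mirrors the description above:

    import numpy as np

    def add_latent_noise(latents, seed=0):
        # Perturb the per-frame latents with randomized Gaussian noise.
        rng = np.random.default_rng(seed)
        return latents + rng.standard_normal(latents.shape).astype(latents.dtype)

    def denoise(noisy, conditioning):
        # Stand-in for one model branch. A real system would run a
        # pretrained diffusion denoiser conditioned on the given prompt
        # (and, for the target branch, the target image as well).
        rng = np.random.default_rng(sum(conditioning.encode()) % 2**32)
        return 0.5 * noisy + 0.1 * rng.standard_normal(noisy.shape).astype(noisy.dtype)

    def masks_from_branch_difference(latents, source_prompt, target_prompt,
                                     target_image_id, thresh=0.5):
        noisy = add_latent_noise(latents)
        source_pred = denoise(noisy, source_prompt)                          # source branch
        target_pred = denoise(noisy, target_prompt + "|" + target_image_id)  # target branch
        # The branches disagree most where the edit should apply; the
        # difference, averaged over channels and thresholded, yields a
        # binary region-of-interest mask for each frame.
        diff = np.abs(source_pred - target_pred).mean(axis=-1)
        return diff > thresh * diff.max()

    # Example: 8 frames of 64x64 latents with 4 channels each.
    latents = np.zeros((8, 64, 64, 4), dtype=np.float32)
    masks = masks_from_branch_difference(latents, "a cat", "a golden retriever", "img_001")
    print(masks.shape)  # (8, 64, 64): one binary mask per frame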

Advantages Over Conventional Methods

This approach offers several advantages over traditional text-only editing techniques. By incorporating visual information from a target digital image, it achieves greater accuracy and expressiveness in edits. The method supports transformations involving objects of different sizes and shapes while maintaining temporal consistency across frames, yielding more efficient and visually coherent video outputs.