US20250252627
2025-08-07
Physics
G06T11/60
The method focuses on text-based video editing with a generative AI model that uses semantic guidance to maintain temporal consistency. It involves receiving a sequence of video frames and a corresponding text prompt as input. Features are extracted from the input video to form a latent representation, which is then modified by injecting noise conditioned on the original video. This noise-injected latent is processed by an artificial neural network (ANN) model to edit the video according to the text prompt.
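To make that pipeline concrete, the following is a minimal PyTorch sketch under stated assumptions: VideoEncoder, DenoisingANN, and the interpolation-style noise injection are illustrative stand-ins, not the architecture or claimed method of the publication.

```python
# A minimal sketch of the described pipeline: encode frames to latents,
# inject noise conditioned on the original video, then denoise under text
# guidance. All module names and the blending scheme are assumptions.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Hypothetical feature extractor mapping frames to a latent sequence."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)
    def forward(self, frames):           # frames: (T, 3, H, W)
        return self.conv(frames)         # latents: (T, latent_dim, H/8, W/8)

class DenoisingANN(nn.Module):
    """Hypothetical denoiser conditioned on a text-prompt embedding."""
    def __init__(self, latent_dim=64, text_dim=32):
        super().__init__()
        self.proj = nn.Linear(text_dim, latent_dim)
        self.net = nn.Conv2d(latent_dim, latent_dim, 3, padding=1)
    def forward(self, noisy_latents, text_emb):
        cond = self.proj(text_emb)[None, :, None, None]  # broadcast over T, H, W
        return self.net(noisy_latents + cond)

def edit_video(frames, text_emb, noise_scale=0.6):
    """Noise injection conditioned on the original video: the clean latent
    anchors the noisy sample, preserving scene structure across frames.
    Modules here are untrained placeholders for illustration only."""
    encoder, denoiser = VideoEncoder(), DenoisingANN()
    latents = encoder(frames)
    noisy = (1 - noise_scale) * latents + noise_scale * torch.randn_like(latents)
    return denoiser(noisy, text_emb)

frames = torch.rand(8, 3, 64, 64)        # an 8-frame clip
text_emb = torch.randn(32)               # embedding of the edit prompt
edited_latents = edit_video(frames, text_emb)
print(edited_latents.shape)              # torch.Size([8, 64, 8, 8])
```

Conditioning the noise on the original latents, rather than sampling it independently per frame, is what keeps the edit anchored to the input video; inference then requires only a forward pass, with no backpropagation.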
Artificial neural networks, particularly convolutional neural networks (CNNs), are pivotal in technological applications such as image recognition and autonomous driving. Text-to-video models typically edit videos from text prompts, sometimes with additional guidance such as poses or edges. Conventional methods often require substantial computational resources for fine-tuning, which is challenging for edge devices that support only inference, with no on-device learning or backpropagation.
Existing methods that adapt image-based models to video may suffer from inefficiency and semantic inconsistency due to high computational demands and a lack of structural guidance. The disclosed method addresses these issues by using semantic information to guide text-based video editing, enhancing temporal consistency while reducing computational complexity and latency.
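As one way such semantic guidance could be wired in, the sketch below concatenates per-frame semantic feature maps to the noisy latents as extra conditioning channels; SemanticExtractor, GuidedDenoiser, and the concatenation scheme are assumptions for illustration, not the disclosed design.

```python
# A minimal sketch of semantic guidance: a shared semantic map per frame is
# fed to the denoiser alongside the noisy latent, so edits stay anchored to
# the same scene layout in every frame. Module names are hypothetical.
import torch
import torch.nn as nn

class SemanticExtractor(nn.Module):
    """Hypothetical module producing coarse semantic features per frame."""
    def __init__(self, sem_dim=16):
        super().__init__()
        self.conv = nn.Conv2d(3, sem_dim, kernel_size=8, stride=8)
    def forward(self, frames):                    # (T, 3, H, W)
        return torch.sigmoid(self.conv(frames))   # (T, sem_dim, H/8, W/8)

class GuidedDenoiser(nn.Module):
    """Denoiser that sees both the noisy latent and the semantic map."""
    def __init__(self, latent_dim=64, sem_dim=16):
        super().__init__()
        self.net = nn.Conv2d(latent_dim + sem_dim, latent_dim, 3, padding=1)
    def forward(self, noisy_latents, sem_maps):
        return self.net(torch.cat([noisy_latents, sem_maps], dim=1))

frames = torch.rand(8, 3, 64, 64)
noisy_latents = torch.randn(8, 64, 8, 8)
sem_maps = SemanticExtractor()(frames)            # fixed guidance per frame
denoised = GuidedDenoiser()(noisy_latents, sem_maps)
print(denoised.shape)                             # torch.Size([8, 64, 8, 8])
```

Because the semantic maps come from the unedited frames and are held fixed during denoising, every frame is steered toward a consistent scene structure, which is one plausible route to the temporal consistency described above.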
The described techniques have potential applications in areas such as autonomous vehicles and advanced driver assistance systems (ADAS), extended reality (XR), and image processing. By generating diverse training videos from text prompts, the method allows perception systems to be trained more effectively for a variety of real-world scenarios, improving their ability to respond to different conditions.
The approach offers several advantages, including increased temporal consistency in edited videos and reduced computational demands. It enables the generation of diverse video data for training perception systems, facilitating better generalization across environments and conditions. This capability is crucial for developing robust autonomous systems that can handle a wide range of real-world situations.