US20250252627
2025-08-07
Physics
G06T11/60
The method focuses on text-based video editing with a generative AI model that uses semantic guidance to maintain temporal consistency. It involves receiving a sequence of video frames and a corresponding text prompt as input. Features are extracted from the input video to form a latent representation, which is then modified by injecting noise conditioned on the original video. This noise-injected latent is processed by an artificial neural network (ANN) model to edit the video according to the text prompt.
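To make that pipeline concrete, the following is a minimal PyTorch sketch under stated assumptions: VideoEncoder, DenoisingANN, and the interpolation-style noise injection are illustrative stand-ins, not the architecture or claimed method of the publication.

```python
# A minimal sketch of the described pipeline: encode frames to latents,
# inject noise conditioned on the original video, then denoise under text
# guidance. All module names and the blending scheme are assumptions.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Hypothetical feature extractor mapping frames to a latent sequence."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)
    def forward(self, frames):           # frames: (T, 3, H, W)
        return self.conv(frames)         # latents: (T, latent_dim, H/8, W/8)

class DenoisingANN(nn.Module):
    """Hypothetical denoiser conditioned on a text-prompt embedding."""
    def __init__(self, latent_dim=64, text_dim=32):
        super().__init__()
        self.proj = nn.Linear(text_dim, latent_dim)
        self.net = nn.Conv2d(latent_dim, latent_dim, 3, padding=1)
    def forward(self, noisy_latents, text_emb):
        cond = self.proj(text_emb)[None, :, None, None]  # broadcast over T, H, W
        return self.net(noisy_latents + cond)

def edit_video(frames, text_emb, noise_scale=0.6):
    """Noise injection conditioned on the original video: the clean latent
    anchors the noisy sample, preserving scene structure across frames.
    Modules here are untrained placeholders for illustration only."""
    encoder, denoiser = VideoEncoder(), DenoisingANN()
    latents = encoder(frames)
    noisy = (1 - noise_scale) * latents + noise_scale * torch.randn_like(latents)
    return denoiser(noisy, text_emb)

frames = torch.rand(8, 3, 64, 64)        # an 8-frame clip
text_emb = torch.randn(32)               # embedding of the edit prompt
edited_latents = edit_video(frames, text_emb)
print(edited_latents.shape)              # torch.Size([8, 64, 8, 8])
```

Conditioning the noise on the original latents, rather than sampling it independently per frame, is what keeps the edit anchored to the input video; inference then requires only a forward pass, with no backpropagation.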
Artificial neural networks, particularly convolutional neural networks (CNNs), are pivotal in technological applications such as image recognition and autonomous driving. Text-to-video models typically edit videos from text prompts, sometimes with additional guidance such as poses or edges. Conventional methods often require substantial computational resources for fine-tuning, which is challenging for edge devices that support only inference, with no on-device learning or backpropagation.
Existing methods that adapt image-based models to video may suffer from inefficiency and semantic inconsistency due to high computational demands and a lack of structural guidance. The disclosed method addresses these issues by using semantic information to guide text-based video editing, enhancing temporal consistency while reducing computational complexity and latency.
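As one way such semantic guidance could be wired in, the sketch below concatenates per-frame semantic feature maps to the noisy latents as extra conditioning channels; SemanticExtractor, GuidedDenoiser, and the concatenation scheme are assumptions for illustration, not the disclosed design.

```python
# A minimal sketch of semantic guidance: a shared semantic map per frame is
# fed to the denoiser alongside the noisy latent, so edits stay anchored to
# the same scene layout in every frame. Module names are hypothetical.
import torch
import torch.nn as nn

class SemanticExtractor(nn.Module):
    """Hypothetical module producing coarse semantic features per frame."""
    def __init__(self, sem_dim=16):
        super().__init__()
        self.conv = nn.Conv2d(3, sem_dim, kernel_size=8, stride=8)
    def forward(self, frames):                    # (T, 3, H, W)
        return torch.sigmoid(self.conv(frames))   # (T, sem_dim, H/8, W/8)

class GuidedDenoiser(nn.Module):
    """Denoiser that sees both the noisy latent and the semantic map."""
    def __init__(self, latent_dim=64, sem_dim=16):
        super().__init__()
        self.net = nn.Conv2d(latent_dim + sem_dim, latent_dim, 3, padding=1)
    def forward(self, noisy_latents, sem_maps):
        return self.net(torch.cat([noisy_latents, sem_maps], dim=1))

frames = torch.rand(8, 3, 64, 64)
noisy_latents = torch.randn(8, 64, 8, 8)
sem_maps = SemanticExtractor()(frames)            # fixed guidance per frame
denoised = GuidedDenoiser()(noisy_latents, sem_maps)
print(denoised.shape)                             # torch.Size([8, 64, 8, 8])
```

Because the semantic maps come from the unedited frames and are held fixed during denoising, every frame is steered toward a consistent scene structure, which is one plausible route to the temporal consistency described above.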
The described techniques have potential applications in areas such as autonomous vehicles and advanced driver assistance systems (ADAS), extended reality (XR), and image processing. By generating diverse training videos from text prompts, the method allows perception systems to be trained more effectively for a variety of real-world scenarios, improving their ability to respond to different conditions.
The approach offers several advantages, including increased temporal consistency in edited videos and reduced computational demands. It enables the generation of diverse video data for training perception systems, facilitating better generalization across environments and conditions. This capability is crucial for developing robust autonomous systems that can handle a wide range of real-world situations.