Invention Title:

Video Generation Method and Apparatus, and Storage Medium

Publication number:

US20250363706

Publication date:

Section:

Physics

Class:

G06T13/40

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application describes a method for generating a video of a target person from a single static image, a driving video, and target audio. Dynamic facial expressions and motions from the driving video are transferred onto the target person's image to create a dynamic video; the lips are then synchronized with the target audio, and both the facial dynamics and the image quality are enhanced to produce a lifelike video representation of the target person.

Background

With advancements in artificial intelligence, virtual digital human technology aims to simulate realistic human appearances and behaviors, a capability central to concepts such as the metaverse. Traditional methods require extensive video footage of a real person to train person-specific models, which is resource-intensive and limits the achievable video quality. This patent addresses these challenges by reducing the need for large custom datasets while improving image quality and realism.

Methodology

The method begins by obtaining a static image of the target person, target audio, and a driving video with dynamic facial features. It transfers these dynamic features to the static image, creating a dynamic video where the target person's face mimics expressions and motions from the driving video. Lip synchronization is achieved by aligning the dynamic video with the target audio, followed by enhancing both facial dynamics and image quality.
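The staged pipeline above (motion transfer, then lip synchronization, then enhancement) can be sketched as follows. The patent does not disclose concrete data structures or function names, so everything here, including `Frame`, `transfer_motion`, `sync_lips`, and `enhance`, is a hypothetical stand-in that only illustrates the ordering of the stages:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container; real systems would hold pixel data and
# dense motion features, not labels.
@dataclass
class Frame:
    expression: str  # stand-in for facial expression/motion features
    mouth: str       # stand-in for the lip shape

def transfer_motion(static_image: str, driving_frames: List[Frame]) -> List[Frame]:
    """Animate the static image: each output frame adopts the driving
    frame's expression while keeping the target person's identity."""
    return [Frame(expression=f.expression, mouth=f.mouth) for f in driving_frames]

def sync_lips(frames: List[Frame], phonemes: List[str]) -> List[Frame]:
    """Replace mouth shapes so they follow the target audio's phonemes
    instead of the driving actor's speech."""
    return [Frame(expression=f.expression, mouth=p)
            for f, p in zip(frames, phonemes)]

def enhance(frames: List[Frame]) -> List[Frame]:
    """Placeholder for the final quality-enhancement stage."""
    return frames

driving = [Frame("smile", "closed"), Frame("blink", "open")]
video = enhance(sync_lips(transfer_motion("target.png", driving), ["ah", "oh"]))
print([f.mouth for f in video])  # lips now follow the audio, not the driver
```

The point of the ordering is that lip synchronization overwrites only the mouth region produced by motion transfer, so the driving actor's expressions survive while the speech content comes from the target audio.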

Technical Implementation

Key-point detection is used to map facial features from the driving video onto the target person's image. This involves detecting facial key points in the driving video and corresponding key points in the target person's image, then establishing a mapping relationship between the two sets to ensure accurate feature migration. The process allows for high-fidelity replication of expressions and motions, resulting in a dynamic video that closely reproduces the driving video's characteristics.
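One simple way to realize such a "mapping relationship" between two key-point sets is a least-squares affine transform. The patent does not specify the mapping model, so this affine fit is only an illustrative assumption, and the landmark coordinates below are made up:

```python
import numpy as np

# Toy 2D facial key points: one set detected in a driving frame,
# the corresponding set in the target person's static image.
driving_pts = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 80.0]])
target_pts  = np.array([[32.0, 44.0], [74.0, 42.0], [54.0, 86.0]])

def fit_affine(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares affine map A (3x2) such that dst ~= [src, 1] @ A.
    Stands in for the patent's mapping relationship between driving
    and target key points; the actual model is not disclosed."""
    src_h = np.hstack([src, np.ones((len(src), 1))])  # homogeneous coords
    A, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return A

def migrate(points: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Carry driving-frame key points into target-image coordinates."""
    return np.hstack([points, np.ones((len(points), 1))]) @ A

A = fit_affine(driving_pts, target_pts)
moved = migrate(driving_pts, A)
print(np.allclose(moved, target_pts, atol=1e-6))  # exact for 3 point pairs
```

With three non-collinear point pairs the affine fit is exact (six constraints, six parameters); with the dozens of landmarks a real face detector returns, the least-squares solution smooths detection noise instead.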

Enhancements

To further improve realism, the method enhances the dynamic facial features and the overall image quality after lip synchronization. This involves refining facial expressions in the video and remapping them onto the target person's image using a second set of key-point mappings. The result is a high-definition, lifelike video with precise lip synchronization that does not require extensive custom training data.
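The patent claims an image-quality enhancement stage but does not name a technique. As one generic stand-in, a classical unsharp mask emphasizes fine facial detail; any real implementation would likely use a learned super-resolution or restoration model instead:

```python
import numpy as np

def unsharp_mask(img: np.ndarray, amount: float = 1.0) -> np.ndarray:
    """Sharpen by adding back the difference between the image and a
    3x3 box blur. Purely illustrative; the patent does not specify
    the enhancement method."""
    padded = np.pad(img, 1, mode="edge")  # replicate edges for the blur
    h, w = img.shape
    blurred = sum(padded[i:i + h, j:j + w]
                  for i in range(3) for j in range(3)) / 9.0
    return np.clip(img + amount * (img - blurred), 0.0, 1.0)

# A tiny grayscale "frame" with one bright detail in the center.
frame = np.array([[0.2, 0.2, 0.2],
                  [0.2, 0.8, 0.2],
                  [0.2, 0.2, 0.2]])
sharpened = unsharp_mask(frame)
print(sharpened[1, 1] >= frame[1, 1])  # the bright detail is emphasized
```

Applying such a pass after lip synchronization, rather than before, avoids sharpening artifacts that the mouth-region rewrite would otherwise overwrite.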