Invention Title:

SPEECH RECOGNITION-BASED IMAGE GENERATION SYSTEM AND METHOD

Publication number:

US20260044997

Publication date:

2026-02-12

Section:

Physics

Class:

G06T11/00

Inventors:

Sungbo Yang 🇰🇷 Hwaseong-si, South Korea

Yicheng FAN 🇨🇳 Shandong, China

Yao YAO 🇨🇳 Shandong, China

Assignees:

Hyundai Motor Company 🇰🇷 Seoul, South Korea

Kia Corporation 🇰🇷 Seoul, South Korea

Applicants:

Hyundai Motor Company 🇰🇷 Seoul, South Korea

Kia Corporation 🇰🇷 Seoul, South Korea

Smart overview of the Invention

The invention describes a system and method for generating images from spoken input by combining speech recognition with diffusion-based image generation. The system comprises four interconnected components: a speech recognition apparatus, a language understanding apparatus, a cloud image generation apparatus, and a display apparatus. Together they convert spoken user input into visual output by extracting semantic information and applying a Stable Diffusion algorithm to create the image.
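The data flow between the four components can be sketched as a simple chain of stages. This is a minimal illustration only: the class name, the stub functions, and the dictionary layout of the semantic information are hypothetical and are not taken from the publication, which does not specify interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpeechToImagePipeline:
    """Hypothetical wiring of the four apparatuses named in the overview."""
    recognize: Callable[[bytes], str]      # speech recognition: audio -> text
    understand: Callable[[str], Dict]      # language understanding: text -> semantics
    generate: Callable[[Dict], bytes]      # cloud image generation: semantics -> image
    display: Callable[[bytes], None]       # display apparatus: shows the image

    def run(self, audio: bytes) -> bytes:
        text = self.recognize(audio)
        semantics = self.understand(text)
        image = self.generate(semantics)
        self.display(image)
        return image


# Stand-in stages (assumptions for illustration, not real models).
STOP_WORDS = {"a", "an", "the", "of", "please", "draw", "me", "show"}


def stub_recognize(audio: bytes) -> str:
    # Pretend the audio payload is already a transcript.
    return audio.decode("utf-8")


def stub_understand(text: str) -> Dict:
    # Keep only content words as a crude form of semantic extraction.
    keywords = [w for w in text.lower().split() if w not in STOP_WORDS]
    return {"intent": "generate_image", "keywords": keywords}


def stub_generate(semantics: Dict) -> bytes:
    # A real system would call a diffusion model here.
    return ("IMAGE(" + ", ".join(semantics["keywords"]) + ")").encode("utf-8")


shown: List[bytes] = []
pipeline = SpeechToImagePipeline(stub_recognize, stub_understand,
                                 stub_generate, shown.append)
result = pipeline.run(b"Draw me a red car")
```

Keeping each apparatus behind a plain callable means any stage, for example the cloud generator, could be swapped for a remote service without changing the others.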

Components and Functionality

The speech recognition apparatus captures and processes user speech, converting it into digital signals and extracting relevant features. It includes modules for noise reduction, filtering, enhancement, and spectrum analysis, ultimately outputting text-based user requirements. The language understanding apparatus further processes this text to extract semantic information, utilizing vocabulary, grammar, and semantic analysis to understand the user's intent.
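The feature extraction and spectrum analysis steps can be illustrated with a small standard-library sketch. The specific techniques used here (pre-emphasis filtering, Hamming windowing, and a direct discrete Fourier transform) are common choices assumed for illustration; the publication does not state which filters or transforms the apparatus uses, and all function names and parameter values below are hypothetical. Noise reduction and the text-decoding stage are not shown.

```python
import cmath
import math


def pre_emphasis(samples, coeff=0.97):
    """Simple high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    return [samples[0]] + [samples[i] - coeff * samples[i - 1]
                           for i in range(1, len(samples))]


def frame_signal(samples, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


def hamming(frame):
    """Apply a Hamming window to reduce spectral leakage."""
    n = len(frame)
    if n == 1:
        return list(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..n/2 (direct O(n^2) DFT, fine for a sketch)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def extract_features(samples, frame_len=64, hop=32):
    """Filter, frame, window, and analyze the spectrum of a digitized signal."""
    emphasized = pre_emphasis(samples)
    return [magnitude_spectrum(hamming(f))
            for f in frame_signal(emphasized, frame_len, hop)]


# Usage: a 1 kHz tone sampled at 8 kHz falls in DFT bin 1000 * 64 / 8000 = 8.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]
features = extract_features(tone)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in features]
```

In a full recognizer these per-frame spectra would feed an acoustic model that outputs the text handed to the language understanding apparatus.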

Image Generation Process

The cloud image generation apparatus creates images from the semantic information extracted by the language understanding apparatus. It applies a Stable Diffusion algorithm, using models such as CLIP for text encoding and U-Net for denoising, to turn text into a visual representation. The process runs a series of diffusion steps followed by decoding to produce an image that reflects the user's input.
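The diffusion steps can be sketched numerically on a toy latent vector. The sketch below assumes a linear noise schedule and a deterministic DDIM-style update, neither of which is stated in the publication, and it replaces the text-conditioned U-Net with an oracle that returns the true noise so the loop can be checked end to end. All names are hypothetical; text encoding and the final image decoder are omitted.

```python
import math
import random


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars


def add_noise(x0, eps, abar):
    """Forward diffusion: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    return [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
            for x, e in zip(x0, eps)]


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic reverse step from noise level abar_t to abar_prev."""
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps_pred)]
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]


def denoise(x_T, predict_noise, bars, condition):
    """Iteratively denoise a latent; predict_noise stands in for the U-Net."""
    x = x_T
    for t in range(len(bars) - 1, -1, -1):
        abar_prev = bars[t - 1] if t > 0 else 1.0
        x = ddim_step(x, predict_noise(x, t, condition), bars[t], abar_prev)
    return x


# Usage: noise a known latent, then recover it with an oracle noise predictor.
rng = random.Random(0)
clean_latent = [0.5, -1.0, 0.25, 2.0]
true_noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]
bars = alpha_bars(50)
noisy_latent = add_noise(clean_latent, true_noise, bars[-1])
recovered = denoise(noisy_latent, lambda x, t, cond: true_noise, bars,
                    condition="red car")  # condition is unused by the oracle
```

In an actual latent diffusion model the oracle would be a trained network conditioned on the text embedding, sampling would start from pure noise, and a decoder would map the final latent to pixels.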

Advanced Features

In addition to image generation, the system can create video content. The video generation module uses similar Stable Diffusion techniques, applying latent code sampling and motion field calculations to produce sequential image frames that form a coherent video. This extends the system to dynamic visual content generated from speech input.
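One way to picture latent code sampling combined with a motion field is to sample a single base latent and shift it by a growing displacement for each frame, so that consecutive frames stay related. The global translation with wraparound used below is an assumption chosen for simplicity; the publication does not describe how its motion field is computed, and the function names are hypothetical. Each warped latent would still need to be denoised and decoded to become an image frame.

```python
import random


def sample_latent(height, width, seed=0):
    """Sample a base latent code as a 2D grid of Gaussian values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def motion_field(num_frames, step=(1, 0)):
    """Global translation per frame: frame k is displaced by k * (dx, dy)."""
    return [(k * step[0], k * step[1]) for k in range(num_frames)]


def warp(latent, dx, dy):
    """Shift the latent grid by (dx, dy) cells, wrapping at the borders."""
    h, w = len(latent), len(latent[0])
    return [[latent[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]


def frame_latents(base, num_frames, step=(1, 0)):
    """One warped latent per video frame, all derived from the same base."""
    return [warp(base, dx, dy) for dx, dy in motion_field(num_frames, step)]


# Usage: four frames drifting one cell to the right per frame.
base = sample_latent(4, 4, seed=42)
frames = frame_latents(base, num_frames=4)
```

Deriving every frame from one shared latent is what keeps the content consistent across the sequence; only the displacement changes from frame to frame.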

Technical Considerations

The system is intended to address several limitations of existing technologies, including limited speech recognition accuracy, difficulty acquiring data, and inconsistent output quality. By integrating machine learning models and algorithms, it aims to improve the accuracy and quality of generated images and videos. The stated goal is a more reliable and user-friendly experience that mitigates issues related to data quality, copyright, and user privacy.