Invention Title:

TEXT AND IMAGE GENERATION FOR CREATION OF IMAGERY FROM AUDIBLE INPUT

Publication number:

US20240338860

Publication date:

2024-10-10

Section:

Physics

Class:

G06T11/00

Inventor:

Alexander Ian Pfister TRZYNA Seattle, WA, United States

Assignee:

Microsoft Technology Licensing, LLC Redmond, WA, United States

Applicant:

Microsoft Technology Licensing, LLC Redmond, WA, United States

Smart overview of the Invention

Systems and methods are introduced for generating images in real time from audio input. Using artificial intelligence (AI) models, the system transcribes a live audio stream, such as a conversation, speech, or lecture, into text with speech-to-text technology. A segment of the transcript is extracted and passed to a first language model (LM), which produces a summary. The summary is then turned into a prompt for a second model, which generates an image that is displayed while the audio continues.
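The segment-extraction step can be illustrated with a small rolling buffer. This is a minimal sketch only: the class name `TranscriptBuffer`, the word-based window, and the default sizes are assumptions made for illustration, and the publication summary does not say how segments are actually delimited.

```python
from collections import deque


class TranscriptBuffer:
    """Hypothetical rolling transcript that returns its most recent words as a segment."""

    def __init__(self, max_words=200):
        # A deque with maxlen silently discards the oldest words once it is full.
        self._words = deque(maxlen=max_words)

    def append(self, text):
        """Add newly transcribed text (for example, one speech-to-text result)."""
        self._words.extend(text.split())

    def latest_segment(self, n_words):
        """Return up to n_words of the most recent transcript text."""
        if n_words <= 0:
            return ""
        words = list(self._words)
        return " ".join(words[-n_words:])
```

A caller would append each speech-to-text result as it arrives and periodically request `latest_segment(...)` to hand to the summarizing model. A time-based or sentence-based window would be an equally plausible choice.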

Enhancing Communication with Visual Aids

Integrating images with spoken information can significantly improve communication effectiveness. Visual aids make content more engaging and easier to understand, particularly for those who benefit from visual learning. Providing relevant imagery alongside verbal communication makes listeners more likely to recall information, improving both comprehension and retention.

Real-Time Image Generation Process

The described system captures live audio and converts it into a continuous text transcript. Each segment extracted from the transcript is summarized by a large language model (LLM), and the summary is then passed to a text-to-image model to generate an image. The resulting visuals are displayed on screens in real time, allowing for dynamic and interactive presentations.
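The steps above can be sketched as a simple generator pipeline. This is an illustration under stated assumptions, not the method claimed in the publication: `run_pipeline`, `Frame`, the fixed word count per segment, and the prompt template are all hypothetical, and the two models are passed in as plain callables so that no real model API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Frame:
    """One generated image together with the text it was derived from."""
    segment: str
    summary: str
    prompt: str
    image: bytes


def run_pipeline(
    transcript_chunks: Iterable[str],
    summarize: Callable[[str], str],         # stand-in for the first model (LLM)
    generate_image: Callable[[str], bytes],  # stand-in for the text-to-image model
    segment_words: int = 50,                 # assumed segment size, in words
) -> Iterator[Frame]:
    """Yield a Frame each time enough transcript text has accumulated."""
    words = []
    for chunk in transcript_chunks:
        words.extend(chunk.split())
        while len(words) >= segment_words:
            # Extract one segment and remove it from the pending words.
            segment = " ".join(words[:segment_words])
            del words[:segment_words]
            summary = summarize(segment)
            # Hypothetical prompt template; the source does not specify one.
            prompt = f"An illustration of: {summary}"
            yield Frame(segment, summary, prompt, generate_image(prompt))
```

In this sketch each stage runs sequentially. A system that keeps displaying images while the audio continues would more likely run transcription and image generation concurrently, for example on separate threads or tasks.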

Benefits for Diverse Learners

Visual aids play a crucial role in supporting individuals with various learning needs, including neurodivergent individuals. The generated imagery can provide clarity and context that improve understanding of spoken content. For example, individuals with autism spectrum disorder may find visual representations less anxiety-inducing and easier to process than auditory information alone, making this technology particularly valuable in educational and professional settings.

System Architecture and Functionality

The system comprises various computing devices capable of processing audio input and generating images through integrated applications. These devices can range from personal computers to virtual reality platforms. The architecture is flexible about input methods and typically uses microphones to capture the audio stream. The image generator can run either as part of an existing application or as a standalone module, so it can be integrated into a range of environments.
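One way to picture the standalone-or-embedded arrangement is to keep the image generator behind a small interface that a host application can compose with. The sketch below is purely illustrative: `AudioSource`, `InMemoryAudioSource`, `ImageGeneratorModule`, and `HostApplication` are hypothetical names, and the whole audio-to-image path is collapsed into a single callable.

```python
from typing import Callable, Iterator, Protocol


class AudioSource(Protocol):
    """Anything that can supply audio chunks, such as a microphone wrapper."""

    def chunks(self) -> Iterator[bytes]: ...


class InMemoryAudioSource:
    """Stand-in for a microphone: replays pre-recorded byte chunks."""

    def __init__(self, data):
        self._data = list(data)

    def chunks(self):
        return iter(self._data)


class ImageGeneratorModule:
    """Usable on its own, or embedded in a host application."""

    def __init__(self, audio_to_image: Callable[[bytes], bytes]) -> None:
        # audio_to_image abstracts transcription, summarization, and image generation.
        self._audio_to_image = audio_to_image

    def run(self, source: AudioSource) -> Iterator[bytes]:
        for chunk in source.chunks():
            yield self._audio_to_image(chunk)


class HostApplication:
    """Hypothetical host (for example, a presentation tool) embedding the module."""

    def __init__(self, generator: ImageGeneratorModule) -> None:
        self._generator = generator
        self.displayed = []

    def present(self, source: AudioSource) -> None:
        for image in self._generator.run(source):
            self.displayed.append(image)  # a real host would render the image here
```

Standalone use calls `ImageGeneratorModule.run(...)` directly, while embedded use goes through the host. Because the module depends only on the `AudioSource` interface, the same code could serve a desktop microphone or a headset input.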