US20250181138
2025-06-05
Physics
G06F3/011
The patent application describes a platform for interactive systems built around multimodal human-machine interactions. The platform hosts interactive agents, such as avatars or robots, that execute flows dictating the logic and sequencing of interactions across multiple modalities, including speech, gestures, and visual elements, with actions running simultaneously or in sequence. By supporting multiple channels for both bot and user actions, the system aims to make interactions more natural and dynamic.
Conversational AI has evolved to incorporate multimodal interactions, blending text, speech, gestures, and visual elements to create more lifelike exchanges. However, designing such systems is complex due to the nuanced nature of human communication. The transition from simple text-based chatbots to fully interactive avatars involves overcoming challenges like non-sequential interaction handling and the uncanny valley effect. Current systems often struggle with integrating multiple modalities seamlessly, limiting their ability to provide rich user experiences.
The proposed system introduces an interaction modeling language and API that standardize how interactions are categorized and support multimodal interactions. It employs an event-driven architecture to manage interaction flows and uses large language models for sensory processing and action execution. Designers can customize interaction patterns using a standardized schema that classifies actions by modality and type. This framework aims to simplify the development of complex interaction patterns while ensuring compatibility across different technologies.
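The patent does not publish the schema itself, but the idea of an event-driven architecture with actions classified by modality and type can be sketched roughly as follows. All names here (`ActionEvent`, `Modality`, `UtteranceBotAction`, the `Started` event) are illustrative assumptions, not terms confirmed by the source:

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class Modality(Enum):
    # Hypothetical modality categories matching those named in the summary
    SPEECH = "speech"
    GESTURE = "gesture"
    VISUAL = "visual"


@dataclass
class ActionEvent:
    """A single event describing a state change of a multimodal action.

    In an event-driven design, every action (bot or user) is represented
    by a stream of such events rather than a direct function call.
    """
    action_type: str   # classifies the action, e.g. "UtteranceBotAction"
    event_name: str    # lifecycle stage, e.g. "Started" or "Finished"
    modality: Modality
    payload: dict = field(default_factory=dict)
    action_id: str = field(default_factory=lambda: str(uuid.uuid4()))


# Example: a bot utterance beginning on the speech modality
ev = ActionEvent("UtteranceBotAction", "Started", Modality.SPEECH,
                 {"text": "Hello!"})
print(ev.modality.value)  # speech
```

Classifying every event by modality and lifecycle stage is what lets a single dispatcher route speech, gesture, and visual actions through the same machinery.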
Interactive agents within this platform can execute a variety of flows involving multimodal actions. For instance, an avatar can perform speech and gestures simultaneously or in sequence, engaging users through multiple interaction channels. The system supports diverse actions from both bots and users, enhancing the flexibility of interactions. By incorporating backchanneling techniques like posture mirroring and vocal feedback, the system seeks to create more natural conversations between users and interactive agents.
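A flow that mixes simultaneous and sequential multimodal actions, as described above, can be sketched with ordinary async concurrency. This is a minimal illustration under assumed names (`say`, `gesture`, `greeting_flow`), not the patent's actual flow syntax:

```python
import asyncio

# Event log standing in for commands sent to an avatar's renderer
log = []


async def say(text: str) -> None:
    log.append(("speech", text))
    await asyncio.sleep(0.01)  # stand-in for audio playback time


async def gesture(name: str) -> None:
    log.append(("gesture", name))
    await asyncio.sleep(0.01)  # stand-in for animation time


async def greeting_flow() -> None:
    # Simultaneous: speak and wave on two modalities at once
    await asyncio.gather(say("Welcome!"), gesture("wave"))
    # Sequential: nod only after the follow-up question finishes
    await say("How can I help?")
    await gesture("nod")


asyncio.run(greeting_flow())
```

The `gather` call models the "simultaneous" case from the summary, while the awaited calls afterwards model the "in sequence" case.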
The platform employs a standardized API to represent human-machine interactions uniformly across components of the interactive system. This standardization facilitates resolving conflicts between simultaneous actions and ensures consistency in representing states of multimodal actions. By using a common protocol for all activities, the system allows seamless integration and communication between different technologies, enhancing the adaptability of interactive systems to new advancements in AI models and frameworks.
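One way to picture how a uniform action representation helps resolve conflicts between simultaneous actions is a per-modality policy: because every action carries its modality in a common format, a dispatcher can decide, say, that a new speech action replaces a running one while gestures may overlap. The policy names and `resolve` helper below are hypothetical illustrations, not part of the patent text:

```python
# Hypothetical per-modality conflict policies
POLICIES = {"speech": "replace", "gesture": "parallel"}


def resolve(running: list[dict], incoming: dict) -> list[dict]:
    """Return the actions still active after `incoming` starts.

    Because all actions share one representation, the same rule
    applies regardless of which component produced them.
    """
    modality = incoming["modality"]
    if POLICIES.get(modality) == "replace":
        # Stop any running action that occupies the same modality
        kept = [a for a in running if a["modality"] != modality]
    else:
        kept = list(running)
    return kept + [incoming]


running = [{"id": "a1", "modality": "speech"},
           {"id": "a2", "modality": "gesture"}]
active = resolve(running, {"id": "a3", "modality": "speech"})
print([a["id"] for a in active])  # ['a2', 'a3']
```

Here the new speech action `a3` displaces the running one, while the gesture continues unaffected, which is the kind of consistent conflict handling a shared protocol makes possible.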