US20250299061
2025-09-25
Physics
G06N3/092
Generating high-quality images of logic-rich three-dimensional (3D) scenes from natural language text prompts requires complex reasoning and spatial understanding. A reinforcement learning framework trains a policy network that refines the text prompts used to create 3D scenes, which are then rendered into images: an action agent modifies the text prompts, a generation agent produces the images, and a reward agent evaluates them. This approach optimizes both visual accuracy and semantic alignment between the generated images and the text prompts.
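As a rough Python illustration of this loop (not part of the specification), the sketch below shows how the three agents might interact over one refinement episode; the agent objects, their method names, and the step budget are all hypothetical placeholders.

```python
# Hypothetical sketch of the three-agent refinement loop described above.
# action_agent, generation_agent, and reward_agent stand in for the action,
# generation, and reward agents; none of these interfaces come from the text.

def refine_prompt_episode(prompt, action_agent, generation_agent,
                          reward_agent, ground_truth, max_steps=5):
    """Iteratively refine a text prompt and return the best-scoring image."""
    best_image, best_reward = None, float("-inf")
    for _ in range(max_steps):
        # Action agent proposes a modified prompt (the policy's action).
        prompt = action_agent.modify(prompt)
        # Generation agent builds a 3D scene and renders it to an image.
        image = generation_agent.render(prompt)
        # Reward agent scores the image against ground-truth annotations.
        reward = reward_agent.evaluate(image, ground_truth)
        if reward > best_reward:
            best_image, best_reward = image, reward
        # Feed the reward back so the action agent's policy can be updated.
        action_agent.observe(reward)
    return best_image, best_reward
```

Each iteration returns the reward to the action agent, which is how the policy network discussed next receives its training signal.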
Text-to-3D scene generation combines artificial intelligence and computer graphics to create virtual environments from written descriptions. It has transformative potential across industries such as virtual reality, gaming, architecture, education, and simulation, where converting natural language into detailed 3D scenes enhances user engagement and enables new approaches to design and planning.
A policy network is central to improving the alignment between generated images and the intended meaning of the text. Trained via reinforcement learning with multimodal feedback, it refines input text prompts so that they better capture complex interactions between scene elements. The policy network is a neural network of interconnected layers whose parameters are optimized against the feedback signal to improve both rendered image quality and semantic accuracy.
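The specification does not name a particular training algorithm, so the following PyTorch sketch is only an assumption: a minimal REINFORCE-style update in which a toy categorical action space stands in for the set of candidate prompt edits, and reward_fn stands in for the multimodal feedback signal.

```python
import torch
import torch.nn as nn

class PromptPolicy(nn.Module):
    """Toy policy network: maps a prompt embedding to a distribution over
    a fixed vocabulary of candidate prompt edits (an assumed setup)."""
    def __init__(self, embed_dim=512, num_edits=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, num_edits),
        )

    def forward(self, prompt_embedding):
        return torch.distributions.Categorical(
            logits=self.layers(prompt_embedding))

policy = PromptPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reinforce_step(prompt_embedding, reward_fn):
    """Sample a prompt edit and update the policy with REINFORCE."""
    dist = policy(prompt_embedding)
    edit = dist.sample()                  # choose a prompt modification
    reward = reward_fn(edit)              # scalar multimodal feedback
    loss = -dist.log_prob(edit) * reward  # policy gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return edit, reward
```

In practice the reward would come from the reward agent's evaluation of the rendered image, closing the loop sketched earlier.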
The framework assigns distinct tasks to separate agents: an action agent iteratively modifies text prompts, a generation agent converts the modified prompts into 3D scenes, and a reward agent evaluates the rendered images against ground truth data. The reward agent scores metrics such as object presence and visual quality, and these scores guide the policy network in refining text prompts for improved image generation.
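The specification names object presence and visual quality as example metrics but gives no formulas, so the sketch below assumes a simple weighted combination; detect_objects and quality_score are hypothetical stand-ins for an object detector and an image-quality model.

```python
def object_presence_score(detected, expected):
    """Fraction of ground-truth objects found in the rendered image."""
    if not expected:
        return 1.0
    return len(set(detected) & set(expected)) / len(set(expected))

def reward(image, expected_objects, detect_objects, quality_score, w=0.5):
    """Weighted mix of a semantic term (object presence) and a visual
    quality term; detect_objects and quality_score are assumed external
    models, and w is an assumed mixing weight."""
    presence = object_presence_score(detect_objects(image), expected_objects)
    quality = quality_score(image)  # assumed to return a value in [0, 1]
    return w * presence + (1.0 - w) * quality
```

A weighted sum keeps the reward a single scalar, which is the form a policy-gradient update such as the one above expects.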
The system significantly enhances image generation from natural language descriptions by addressing challenges in spatial reasoning and subject-object relationships. Applications include virtual reality, autonomous systems, AI-driven design, content creation, and interactive environments. Experiments show improved object presence matching and scene coherence compared with other methods.