US20250299061
2025-09-25
Physics
G06N3/092
Generating high-quality images of logic-rich three-dimensional (3D) scenes from natural language text prompts requires complex reasoning and spatial understanding. A reinforcement learning framework trains a policy network that refines the text prompts used to create 3D scenes, which are then rendered into images: an action agent modifies the text prompts, a generation agent produces the images, and a reward agent evaluates them. This approach optimizes both visual accuracy and semantic alignment between the generated images and the text prompts.
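As a rough Python illustration of this loop (not part of the specification), the sketch below shows how the three agents might interact over one refinement episode; the agent objects, their method names, and the step budget are all hypothetical placeholders.

```python
# Hypothetical sketch of the three-agent refinement loop described above.
# action_agent, generation_agent, and reward_agent stand in for the action,
# generation, and reward agents; none of these interfaces come from the text.

def refine_prompt_episode(prompt, action_agent, generation_agent,
                          reward_agent, ground_truth, max_steps=5):
    """Iteratively refine a text prompt and return the best-scoring image."""
    best_image, best_reward = None, float("-inf")
    for _ in range(max_steps):
        # Action agent proposes a modified prompt (the policy's action).
        prompt = action_agent.modify(prompt)
        # Generation agent builds a 3D scene and renders it to an image.
        image = generation_agent.render(prompt)
        # Reward agent scores the image against ground-truth annotations.
        reward = reward_agent.evaluate(image, ground_truth)
        if reward > best_reward:
            best_image, best_reward = image, reward
        # Feed the reward back so the action agent's policy can be updated.
        action_agent.observe(reward)
    return best_image, best_reward
```

Each iteration returns the reward to the action agent, which is how the policy network discussed next receives its training signal.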
Text-to-3D scene generation combines artificial intelligence and computer graphics to create virtual environments from written descriptions. It has transformative potential across industries such as virtual reality, gaming, architecture, education, and simulation, where converting natural language into detailed 3D scenes enhances user engagement and enables new approaches to design and planning.
A policy network is central to improving the alignment between generated images and the intended meaning of the text. Trained via reinforcement learning with multimodal feedback, it refines input text prompts so that they better capture complex interactions between scene elements. The policy network is a neural network of interconnected layers whose parameters are optimized against the feedback signal to improve both rendered image quality and semantic accuracy.
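The specification does not name a particular training algorithm, so the following PyTorch sketch is only an assumption: a minimal REINFORCE-style update in which a toy categorical action space stands in for the set of candidate prompt edits, and reward_fn stands in for the multimodal feedback signal.

```python
import torch
import torch.nn as nn

class PromptPolicy(nn.Module):
    """Toy policy network: maps a prompt embedding to a distribution over
    a fixed vocabulary of candidate prompt edits (an assumed setup)."""
    def __init__(self, embed_dim=512, num_edits=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, num_edits),
        )

    def forward(self, prompt_embedding):
        return torch.distributions.Categorical(
            logits=self.layers(prompt_embedding))

policy = PromptPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reinforce_step(prompt_embedding, reward_fn):
    """Sample a prompt edit and update the policy with REINFORCE."""
    dist = policy(prompt_embedding)
    edit = dist.sample()                  # choose a prompt modification
    reward = reward_fn(edit)              # scalar multimodal feedback
    loss = -dist.log_prob(edit) * reward  # policy gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return edit, reward
```

In practice the reward would come from the reward agent's evaluation of the rendered image, closing the loop sketched earlier.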
The framework assigns distinct tasks to separate agents: an action agent iteratively modifies text prompts, a generation agent converts the modified prompts into 3D scenes, and a reward agent evaluates the rendered images against ground truth data. The reward agent scores metrics such as object presence and visual quality, and these scores guide the policy network in refining text prompts for improved image generation.
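The specification names object presence and visual quality as example metrics but gives no formulas, so the sketch below assumes a simple weighted combination; detect_objects and quality_score are hypothetical stand-ins for an object detector and an image-quality model.

```python
def object_presence_score(detected, expected):
    """Fraction of ground-truth objects found in the rendered image."""
    if not expected:
        return 1.0
    return len(set(detected) & set(expected)) / len(set(expected))

def reward(image, expected_objects, detect_objects, quality_score, w=0.5):
    """Weighted mix of a semantic term (object presence) and a visual
    quality term; detect_objects and quality_score are assumed external
    models, and w is an assumed mixing weight."""
    presence = object_presence_score(detect_objects(image), expected_objects)
    quality = quality_score(image)  # assumed to return a value in [0, 1]
    return w * presence + (1.0 - w) * quality
```

A weighted sum keeps the reward a single scalar, which is the form a policy-gradient update such as the one above expects.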
The system significantly enhances image generation from natural language descriptions by addressing challenges in spatial reasoning and subject-object relationships. Applications include virtual reality, autonomous systems, AI-driven design, content creation, and interactive environments. Experiments show improved object presence matching and scene coherence compared with other methods.