Invention Title:

END-TO-END MOBILE USER INTERFACE NAVIGATION WITH VISION LANGUAGE ACTION MODELS

Publication number:

US20260127015

Publication date:

2026-05-07

Section:

Physics

Class:

G06F9/453

Inventors:

Jeffrey W. Nichols 🇺🇸 San Diego, CA, United States

Alexander Toshev 🇺🇸 San Francisco, CA, United States

Yinfei Yang 🇺🇸 Sunnyvale, CA, United States

Zhen Yang 🇺🇸 Santa Clara, CA, United States

Di FENG 🇺🇸 Palo Alto, CA, United States

Keen YOU 🇺🇸 Cupertino, CA, United States

Anuj MAHAJAN 🇺🇸 San Jose, CA, United States

Harsh AGRAWAL 🇺🇸 New York, NY, United States

Meng-Ta CHOU 🇺🇸 Los Gatos, CA, United States

Andres ROMERO MIER Y TERAN 🇺🇸 San Jose, CA, United States

Adolfo LOPEZ MENDEZ 🇺🇸 Sunnyvale, CA, United States

Kenneth JUNG 🇺🇸 Sunnyvale, CA, United States

Abhishek SUNDARARAJAN 🇺🇸 Campbell, CA, United States

Pengfei DOU 🇺🇸 San Jose, CA, United States

Haotian ZHANG 🇺🇸 Sunnyvale, CA, United States

Zifeng HUANG 🇺🇸 Redwood City, CA, United States

Eldon K. SCHOOP 🇺🇸 Seattle, WA, United States

Zhe GAN 🇺🇸 Kirkland, WA, United States

Mohana Prasad SATHYA MOORTHY 🇺🇸 San Jose, CA, United States

Applicant:

Apple Inc. 🇺🇸 Cupertino, CA, United States

Smart overview of the Invention

The described technology automates mobile user interface (UI) navigation using vision language action models. The system receives a language instruction and visual input from a mobile device's UI, tokenizes both, and processes the tokens with a multi-modal large language model. The model's action output is then converted into executable commands that carry out the navigation task on the device.
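The final step described above, converting a model's action output into an executable command, can be illustrated with a minimal Python sketch. The summary does not specify the action format, so the `name(key=value, ...)` grammar, the `Command` type, and the `parse_action` function are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class Command:
    """A device-level command. The fields are illustrative, not from the source."""
    kind: str
    args: dict


# Hypothetical action grammar: name(key=value, key=value)
_ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_action(text: str) -> Command:
    """Turn a textual action output into a structured Command.

    Limitation of this sketch: argument values must not contain commas.
    """
    match = _ACTION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized action output: {text!r}")
    name, arg_text = match.groups()
    args = {}
    for part in filter(None, (p.strip() for p in arg_text.split(","))):
        key, _, value = part.partition("=")
        value = value.strip().strip('"')
        # Treat plain (optionally negative) digit strings as integers.
        args[key.strip()] = int(value) if value.lstrip("-").isdigit() else value
    return Command(kind=name, args=args)
```

Under these assumptions, `parse_action("tap(x=120, y=340)")` yields a `Command` with kind `tap` and integer coordinates, which a platform-specific layer could then dispatch to the device.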

Technical Background

Machine learning models, particularly large language models, require substantial computational resources because of their size and complexity. With millions or billions of parameters, these models enable autonomous agents to perform tasks traditionally done by humans. Integrating them into mobile UI navigation aims to automate on-device interactions, with the stated goal of improving productivity and safety during daily activities.

Challenges and Solutions

Deploying autonomous UI agents raises several challenges, including complex modular designs, limited online evaluation capabilities, and a lack of high-quality navigation datasets. To address these, the technology uses an end-to-end vision language action (VLA) model, which simplifies the workflow, and improves model performance through synthetic data generation and exploration mechanisms. The approach is intended to improve navigation accuracy and adaptability across different mobile platforms.

Functionality and Features

The autonomous UI agent supports both multi-step and single-step navigation, facilitated by a user intent prediction component. This component lets the agent summarize and infer user goals, which makes it usable across a wider range of interactions. The technology also uses synthetic data generation to correct navigation errors and chain-of-thought reasoning to improve decision-making accuracy in complex scenarios.
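The difference between single-step and multi-step navigation can be sketched as a simple control loop. This is a minimal sketch under stated assumptions, not the patented method: the `run_task` function, the `policy`, `capture`, and `execute` callables, and the `"done"` stop signal are hypothetical names introduced here for illustration.

```python
from typing import Callable, List

# Hypothetical policy signature: (instruction, screenshot, history) -> action string.
Policy = Callable[[str, bytes, List[str]], str]


def run_task(
    instruction: str,
    policy: Policy,
    capture: Callable[[], bytes],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[str]:
    """Repeatedly observe the screen, ask the policy for an action, and execute it.

    Single-step navigation is the special case max_steps=1.
    """
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(instruction, capture(), history)
        history.append(action)
        if action == "done":  # assumed stop signal, not from the source
            break
        execute(action)
    return history
```

In this sketch the action history is passed back to the policy on every step, so a model could condition each decision on what it has already done. The step cap is a design choice that keeps a misbehaving policy from looping indefinitely.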

Implementation and Benefits

Combining multi-modal language-vision processing with scalable data generation yields the architecture used for mobile navigation. This configuration is described as enhancing the computing functionality of electronic devices through efficient use of processing and memory resources. The network environment includes various electronic devices and servers, which allows flexible deployment of the autonomous agent.