US20260127015
2026-05-07
Physics
G06F9/453
The described technology focuses on enhancing mobile user interface (UI) navigation through the use of vision language action models. It involves receiving language instructions and visual inputs from a mobile device's UI, which are then tokenized and processed using a multi-modal large language model. This process generates action outputs that are converted into executable commands to perform navigation tasks on the device.
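The last stage of this pipeline, turning a model's action output into an executable command, can be illustrated with a minimal sketch. The source does not specify the action format or the command target, so everything here is an assumption: the JSON action schema, the `UIAction`, `parse_action`, and `to_command` names are hypothetical, and the argv-style commands are modeled on Android's `input` shell tool purely as an example target.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class UIAction:
    """A parsed action. Field names are illustrative, not taken from the source."""
    kind: str
    args: dict


def parse_action(model_output: str) -> UIAction:
    """Parse a JSON action string such as '{"action": "tap", "x": 10, "y": 20}'.

    The JSON schema is an assumed output format for the model.
    """
    payload = json.loads(model_output)
    kind = payload.pop("action")
    return UIAction(kind=kind, args=payload)


def to_command(action: UIAction) -> list[str]:
    """Convert a parsed action into an argv-style command (hypothetical mapping)."""
    if action.kind == "tap":
        return ["input", "tap", str(action.args["x"]), str(action.args["y"])]
    if action.kind == "type":
        return ["input", "text", action.args["text"]]
    if action.kind == "back":
        return ["input", "keyevent", "KEYCODE_BACK"]
    # Reject anything outside the small assumed action vocabulary.
    raise ValueError(f"unsupported action: {action.kind}")
```

Keeping parsing and command conversion separate means a malformed model output fails before anything reaches the device, and the action vocabulary can be retargeted to another platform by replacing `to_command` alone.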
Machine learning models, particularly large language models, require substantial computational resources because of their size and complexity. With millions or billions of parameters, these models are central to enabling autonomous agents to perform tasks traditionally done by humans. Integrating them into mobile UI navigation aims to automate interactions, improving productivity and safety during daily activities.
Several challenges exist in deploying autonomous UI agents, including complex modular designs, limited online evaluation capabilities, and a lack of high-quality navigation datasets. To address these, the technology utilizes an end-to-end vision language action (VLA) model, which simplifies workflows and enhances model performance through synthetic data generation and exploration mechanisms. This approach allows for improved navigation accuracy and adaptability across different mobile platforms.
The autonomous UI agent supports both multi-step and single-step navigation, facilitated by a user intent prediction component. This enables the agent to summarize and infer user goals, thus improving interaction versatility. The technology also incorporates synthetic data generation to correct navigation errors and employs chain-of-thought reasoning to enhance decision-making accuracy in complex scenarios.
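The difference between multi-step and single-step navigation can be sketched as one observe-decide-act loop with a step budget. This is a toy illustration under stated assumptions, not the method in the source: `navigate`, `ToyDevice`, and `toy_policy` are hypothetical names, screens are plain strings rather than images, and `toy_policy` is a trivial stand-in for the VLA model.

```python
from typing import Callable, List

# (instruction, current screen, action history) -> next action name.
# This signature is an assumption made for the sketch.
Policy = Callable[[str, str, List[str]], str]


def navigate(instruction: str,
             observe: Callable[[], str],
             act: Callable[[str], None],
             policy: Policy,
             max_steps: int = 10) -> List[str]:
    """Run the observe-decide-act loop; max_steps=1 gives single-step navigation."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(instruction, observe(), history)
        if action == "done":
            break
        act(action)
        history.append(action)
    return history


class ToyDevice:
    """Stand-in for a device: a linear chain of named screens."""

    def __init__(self, screens: List[str]) -> None:
        self.screens = screens
        self.index = 0

    def observe(self) -> str:
        return self.screens[self.index]

    def act(self, action: str) -> None:
        # Only one action exists in this toy: move to the next screen.
        if action == "next" and self.index < len(self.screens) - 1:
            self.index += 1


def toy_policy(instruction: str, screen: str, history: List[str]) -> str:
    """Placeholder for the model: stop once the screen name appears in the instruction."""
    return "done" if screen in instruction else "next"
```

For example, calling `navigate("open wifi", device.observe, device.act, toy_policy)` on `ToyDevice(["home", "settings", "wifi"])` runs the multi-step case, while passing `max_steps=1` restricts the same loop to a single action.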
Together, multi-modal language-vision processing and scalable data generation form the architecture for mobile navigation. This configuration is described as improving the computing functionality of electronic devices through efficient use of processing and memory resources. The network environment includes various electronic devices and servers, which allows flexible deployment and supports efficient, autonomous mobile interactions.