US20260044427
2026-02-12
Physics
G06F11/3433
The patent outlines a method for optimizing the execution of transformer models on device clusters through parallel processing. The approach generates multiple candidate execution plans, each representing a different way to partition the devices for parallel execution. The system evaluates these plans by simulating execution on the cluster and selects the plan with the lowest estimated resource consumption. This method addresses the growing complexity and computational demands of large language models in artificial intelligence.
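The generate-evaluate-select loop described above can be sketched as follows. The names (`ExecutionPlan`, `candidate_plans`, `estimate_cost`, `select_plan`) and the toy cost model are illustrative assumptions, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ExecutionPlan:
    """A hypothetical candidate plan: how many model replicas and
    how many sequential stages the device cluster is split into."""
    replicas: int
    stages: int

def candidate_plans(num_devices: int) -> List[ExecutionPlan]:
    """Enumerate (replicas, stages) factorizations of the cluster size."""
    plans = []
    for replicas in range(1, num_devices + 1):
        if num_devices % replicas == 0:
            plans.append(ExecutionPlan(replicas, num_devices // replicas))
    return plans

def estimate_cost(plan: ExecutionPlan, num_blocks: int) -> float:
    """Toy resource estimate: per-stage memory load plus a small
    communication overhead that grows with the replica count."""
    memory_per_stage = num_blocks / plan.stages
    comm_overhead = 0.1 * plan.replicas
    return memory_per_stage + comm_overhead

def select_plan(num_devices: int, num_blocks: int) -> ExecutionPlan:
    """Pick the candidate plan that minimizes the estimated cost."""
    return min(candidate_plans(num_devices),
               key=lambda p: estimate_cost(p, num_blocks))
```

A real evaluator would replace `estimate_cost` with the simulated execution the patent describes; the selection structure stays the same.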
Generative AI models, particularly large language models, require significant computational power due to their size and complexity. Distributing the workload effectively across multiple devices while maintaining synchronization and managing dependencies is challenging. The dynamic nature of computational resources further complicates this process, necessitating a more adaptive approach to device partitioning and workload management.
The proposed method receives internal representations of a transformer model, a device cluster, and a workload. It generates several candidate execution plans, each specifying a distinct parallel schedule for partitioning the device cluster. The system evaluates the plans' resource usage, in part by simulating the model's execution on the device cluster, and selects the plan that minimizes resource consumption.
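In a minimal form, the simulation step might estimate a pipeline's total runtime from per-stage latencies: the first microbatch pays the full fill time through every stage, and each subsequent microbatch is gated by the slowest stage. Both `simulate_pipeline` and this fill-plus-bottleneck model are assumptions for illustration, not the patent's simulator.

```python
from typing import List

def simulate_pipeline(stage_times: List[float], num_microbatches: int) -> float:
    """Toy makespan model for a synchronous pipeline where each
    microbatch flows through every stage in order."""
    fill = sum(stage_times)          # first microbatch traverses all stages
    bottleneck = max(stage_times)    # slowest stage paces the rest
    return fill + (num_microbatches - 1) * bottleneck
```

Comparing this estimate across candidate plans (different stage splits yield different `stage_times`) is one way an evaluator could rank them.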
The system architecture comprises a memory, a processor system, and storage media containing instructions for executing the method. The processor system performs operations like generating parallel schedules, executing the transformer model, and evaluating resource usage. The transformer model is represented by a chain of cells, each containing specific tasks, and the parallel schedule divides these cells into sequential stages for execution across the device cluster.
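Dividing a chain of cells into sequential stages could look like the following contiguous, near-equal split; `split_cells_into_stages` is a hypothetical helper, not the patent's implementation.

```python
from typing import List, Sequence

def split_cells_into_stages(cells: Sequence, num_stages: int) -> List[list]:
    """Partition a chain of cells into contiguous stages whose sizes
    differ by at most one cell."""
    base, extra = divmod(len(cells), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)  # earlier stages absorb the remainder
        stages.append(list(cells[start:start + size]))
        start += size
    return stages
```

Keeping each stage contiguous preserves the chain's execution order, so only adjacent stages need to exchange activations.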
Implementation involves searching for parallel schedules that partition the device cluster into model replicas and stages. Each model replica is a complete copy of the transformer model, and each stage covers a contiguous portion of the model's repeating blocks. The system also determines the number of cell replicas within each stage, mapping tasks to devices accordingly. This structured approach enables scalable, efficient execution of transformer models on distributed hardware.
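The final mapping step, assigning each cell replica within a stage to a device, could be sketched as a round-robin assignment. The function name and the round-robin policy are assumptions; the patent does not specify the placement strategy.

```python
from typing import Dict, List, Tuple

def map_stage_to_devices(stage_cells: List[str],
                         devices: List[str],
                         cell_replicas: int) -> Dict[Tuple[str, int], str]:
    """Assign each (cell, replica) pair in a stage to one of the
    stage's devices, cycling through the devices round-robin."""
    mapping = {}
    slot = 0
    for cell in stage_cells:
        for r in range(cell_replicas):
            mapping[(cell, r)] = devices[slot % len(devices)]
            slot += 1
    return mapping
```

For example, two cells with two replicas each over two devices yields an alternating assignment, spreading replicas of the same cell across devices.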