US20260044427
2026-02-12
Physics
G06F11/3433
The patent outlines a method for optimizing the execution of transformer models on device clusters through parallel processing. The approach generates multiple candidate execution plans, each representing a different way to partition the devices for parallel execution. The system evaluates these plans by simulating execution on the cluster and selects the plan with the lowest estimated resource consumption. This method addresses the growing complexity and computational demands of large language models in artificial intelligence.
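The generate-evaluate-select loop described above can be sketched as follows. The names (`ExecutionPlan`, `candidate_plans`, `estimate_cost`, `select_plan`) and the toy cost model are illustrative assumptions, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ExecutionPlan:
    """A hypothetical candidate plan: how many model replicas and
    how many sequential stages the device cluster is split into."""
    replicas: int
    stages: int

def candidate_plans(num_devices: int) -> List[ExecutionPlan]:
    """Enumerate (replicas, stages) factorizations of the cluster size."""
    plans = []
    for replicas in range(1, num_devices + 1):
        if num_devices % replicas == 0:
            plans.append(ExecutionPlan(replicas, num_devices // replicas))
    return plans

def estimate_cost(plan: ExecutionPlan, num_blocks: int) -> float:
    """Toy resource estimate: per-stage memory load plus a small
    communication overhead that grows with the replica count."""
    memory_per_stage = num_blocks / plan.stages
    comm_overhead = 0.1 * plan.replicas
    return memory_per_stage + comm_overhead

def select_plan(num_devices: int, num_blocks: int) -> ExecutionPlan:
    """Pick the candidate plan that minimizes the estimated cost."""
    return min(candidate_plans(num_devices),
               key=lambda p: estimate_cost(p, num_blocks))
```

A real evaluator would replace `estimate_cost` with the simulated execution the patent describes; the selection structure stays the same.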
Generative AI models, particularly large language models, require significant computational power due to their size and complexity. Distributing the workload effectively across multiple devices while maintaining synchronization and managing dependencies is challenging. The dynamic nature of computational resources further complicates this process, necessitating a more adaptive approach to device partitioning and workload management.
The proposed method receives internal representations of a transformer model, a device cluster, and a workload. It generates several candidate execution plans, each specifying a distinct parallel schedule for partitioning the device cluster. The system evaluates the plans' resource usage, in part by simulating the model's execution on the device cluster, and selects the plan that minimizes resource consumption.
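In a minimal form, the simulation step might estimate a pipeline's total runtime from per-stage latencies: the first microbatch pays the full fill time through every stage, and each subsequent microbatch is gated by the slowest stage. Both `simulate_pipeline` and this fill-plus-bottleneck model are assumptions for illustration, not the patent's simulator.

```python
from typing import List

def simulate_pipeline(stage_times: List[float], num_microbatches: int) -> float:
    """Toy makespan model for a synchronous pipeline where each
    microbatch flows through every stage in order."""
    fill = sum(stage_times)          # first microbatch traverses all stages
    bottleneck = max(stage_times)    # slowest stage paces the rest
    return fill + (num_microbatches - 1) * bottleneck
```

Comparing this estimate across candidate plans (different stage splits yield different `stage_times`) is one way an evaluator could rank them.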
The system architecture comprises a memory, a processor system, and storage media containing instructions for executing the method. The processor system performs operations like generating parallel schedules, executing the transformer model, and evaluating resource usage. The transformer model is represented by a chain of cells, each containing specific tasks, and the parallel schedule divides these cells into sequential stages for execution across the device cluster.
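Dividing a chain of cells into sequential stages could look like the following contiguous, near-equal split; `split_cells_into_stages` is a hypothetical helper, not the patent's implementation.

```python
from typing import List, Sequence

def split_cells_into_stages(cells: Sequence, num_stages: int) -> List[list]:
    """Partition a chain of cells into contiguous stages whose sizes
    differ by at most one cell."""
    base, extra = divmod(len(cells), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)  # earlier stages absorb the remainder
        stages.append(list(cells[start:start + size]))
        start += size
    return stages
```

Keeping each stage contiguous preserves the chain's execution order, so only adjacent stages need to exchange activations.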
Implementation involves searching for parallel schedules that partition the device cluster into model replicas and stages. Each model replica is a complete copy of the transformer model, and each stage covers a contiguous portion of the model's repeating blocks. The system also determines the number of cell replicas within each stage, mapping tasks to devices accordingly. This structured approach enables scalable, efficient execution of transformer models on distributed hardware.
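The final mapping step, assigning each cell replica within a stage to a device, could be sketched as a round-robin assignment. The function name and the round-robin policy are assumptions; the patent does not specify the placement strategy.

```python
from typing import Dict, List, Tuple

def map_stage_to_devices(stage_cells: List[str],
                         devices: List[str],
                         cell_replicas: int) -> Dict[Tuple[str, int], str]:
    """Assign each (cell, replica) pair in a stage to one of the
    stage's devices, cycling through the devices round-robin."""
    mapping = {}
    slot = 0
    for cell in stage_cells:
        for r in range(cell_replicas):
            mapping[(cell, r)] = devices[slot % len(devices)]
            slot += 1
    return mapping
```

For example, two cells with two replicas each over two devices yields an alternating assignment, spreading replicas of the same cell across devices.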