Invention Title:

MULTI-INSTANCE GPU AWARE AUTOSCALING IN AI MODEL SERVICE

Publication number:

US20260073467

Publication date:

2026-03-12

Section:

Physics

Class:

G06T1/20

Inventors:

Abhishek Malvankar 🇺🇸 White Plains, NY, United States

Eun Kyung Lee 🇺🇸 Bedford Corners, NY, United States

Chen Wang 🇺🇸 Chappaqua, NY, United States

Rina Inoue 🇯🇵 Sumida-ku, Japan

Yue Zhu 🇺🇸 West Harrison, NY, United States

Assignee:

INTERNATIONAL BUSINESS MACHINES CORPORATION 🇺🇸 ARMONK, NY, United States

Applicant:

INTERNATIONAL BUSINESS MACHINES CORPORATION 🇺🇸 Armonk, NY, United States

Smart overview of the Invention

The patent application describes a method for optimizing the use of Multi-Instance GPUs (MIGs) in AI model services, specifically for serving inference requests to Large Language Models (LLMs). The system analyzes each request to determine the computing resources needed to process it. A MIG partitions a single GPU into smaller, isolated instances, each with dedicated resources, so multiple workloads can run simultaneously on one GPU and raise its utilization.

Resource Allocation

The system computes the amount of computing resources required to generate intermediate results during an inference request. It does so by executing the LLM on a set of MIGs, where each MIG is composed of slices of a corresponding GPU. The system then sends instructions to a controller associated with the MIG to adjust the resources available to a specific MIG slice. This dynamic allocation gives each inference request the computing resources it needs and no more, balancing performance against operational cost.
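
As a rough illustration of the sizing step above, the following sketch estimates the memory an inference request needs and picks the smallest MIG profile that fits. It is a minimal sketch under stated assumptions, not the method claimed in the application: it treats the key/value cache as the "intermediate results", uses the standard KV-cache size formula, and borrows the profile names of a 40 GB A100-class GPU. The names ModelConfig, required_memory_gib, pick_profile, and build_resize_instruction, and the shape of the controller message, are hypothetical.

```python
from dataclasses import dataclass

# Illustrative MIG profiles: (name, compute slices, nominal memory).
# Modeled on a 40 GB A100-class GPU; other GPUs expose different profiles.
MIG_PROFILES = [
    ("1g.5gb", 1, 5),
    ("2g.10gb", 2, 10),
    ("3g.20gb", 3, 20),
    ("4g.20gb", 4, 20),
    ("7g.40gb", 7, 40),
]


@dataclass
class ModelConfig:
    """Hypothetical description of an LLM, used only for sizing."""
    weight_gib: float          # memory taken by the model weights
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2   # fp16 / bf16


def kv_cache_gib(model: ModelConfig, tokens: int, batch_size: int = 1) -> float:
    """KV-cache size: one key and one value vector per layer, head, and token."""
    total_bytes = (2 * model.num_layers * model.num_kv_heads * model.head_dim
                   * model.bytes_per_value * tokens * batch_size)
    return total_bytes / 2**30


def required_memory_gib(model: ModelConfig, prompt_tokens: int,
                        max_new_tokens: int, batch_size: int = 1,
                        overhead: float = 1.1) -> float:
    """Weights plus KV cache, padded by an assumed 10% runtime overhead."""
    tokens = prompt_tokens + max_new_tokens
    return (model.weight_gib + kv_cache_gib(model, tokens, batch_size)) * overhead


def pick_profile(required_gib: float):
    """Return the smallest profile whose nominal memory covers the request.

    Nominal profile sizes are compared directly against the estimate,
    which is an approximation. Returns None if nothing fits.
    """
    for name, _slices, memory in MIG_PROFILES:
        if memory >= required_gib:
            return name
    return None


def build_resize_instruction(gpu_id: str, instance_id: int, profile: str) -> dict:
    """Hypothetical message for a MIG controller; the real interface is not specified."""
    return {"gpu": gpu_id, "instance": instance_id, "target_profile": profile}
```

A caller would estimate the requirement for a request, call pick_profile, and forward the result of build_resize_instruction to whatever controller manages the GPU. Memory is only one dimension; a fuller sizing step would also weigh compute slices against latency targets.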

Efficiency and Cost Management

Efficient resource allocation is crucial to maintaining optimal model performance and minimizing latency. Overallocating GPUs can lead to underutilization, increased operational costs, and longer wait times for other workloads due to resource unavailability. Additionally, excessive allocation can cause unnecessary wear and tear on GPUs, leading to premature failure and wasted energy. The system addresses these challenges by ensuring resources are allocated based on the specific needs of each workload, optimizing both cost and performance.
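
The cost of overallocation described above can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the idea of a per-slice hourly price are assumptions, and any numbers passed in are hypothetical rather than taken from the application.

```python
def idle_fraction(allocated_slices: int, used_slices: int) -> float:
    """Share of the allocated GPU slices that sit idle."""
    if allocated_slices <= 0:
        raise ValueError("allocated_slices must be positive")
    used = min(used_slices, allocated_slices)
    return 1 - used / allocated_slices


def wasted_cost(allocated_slices: int, used_slices: int,
                cost_per_slice_hour: float, hours: float) -> float:
    """Money spent on slices the workload never used (assumed linear pricing)."""
    idle = max(allocated_slices - used_slices, 0)
    return idle * cost_per_slice_hour * hours
```

For example, a workload that needs two slices but holds all seven slices of a GPU leaves five of seven idle, and those five slices are also unavailable to other workloads for the whole period.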

Adaptive Utilization

The system recognizes that different workloads require varied computing resources, and optimal GPU configurations can differ across models. Factors such as inference request types, arrival patterns, and model configurations influence GPU utilization. By adapting to these variables, the system ensures that resources are used efficiently, accommodating the unique requirements of different models and optimizing their operation within defined cost and performance parameters.
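
One way to picture adaptation to arrival patterns is a scaling rule that converts an observed request rate into a number of MIG slices. This is a minimal sketch, assuming equally sized slices, a per-slice throughput measured for the model in question, and a target utilization that leaves headroom for bursts. The function names and the 0.75 default are hypothetical, and the application does not state that it uses this rule.

```python
import math

# MIG-capable GPUs such as the A100 expose at most seven instances.
MAX_SLICES_PER_GPU = 7


def slices_needed(arrival_rate_rps: float, per_slice_throughput_rps: float,
                  target_utilization: float = 0.75) -> int:
    """Slices required to serve the arrival rate at the target utilization."""
    if per_slice_throughput_rps <= 0:
        raise ValueError("per_slice_throughput_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    effective = per_slice_throughput_rps * target_utilization
    return math.ceil(arrival_rate_rps / effective)


def scaling_decision(current_slices: int, arrival_rate_rps: float,
                     per_slice_throughput_rps: float,
                     target_utilization: float = 0.75) -> dict:
    """Compare the desired slice count with the current one.

    Always keeps at least one slice so the model stays reachable.
    A positive delta means scale up, a negative delta means scale down.
    """
    desired = max(1, slices_needed(arrival_rate_rps, per_slice_throughput_rps,
                                   target_utilization))
    return {
        "desired_slices": desired,
        "delta": desired - current_slices,
        "gpus": math.ceil(desired / MAX_SLICES_PER_GPU),
    }
```

Because per-slice throughput depends on the model and on the mix of request types, the same arrival rate can yield different slice counts for different models, which is the variation the paragraph above describes.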

Technological Advancements

Multi-Instance GPU (MIG) technology underpins this approach: it lets a single GPU be partitioned among multiple users or processes, which gives AI model services the flexibility to run diverse workloads concurrently and to scale in units smaller than a whole GPU. GPU-aware autoscaling builds on that partitioning to balance resource allocation against performance and cost.