US20260072741
2026-03-12
Physics
G06F9/5016
An efficient method for allocating multi-instance GPUs in AI model services is described. The system analyzes an inference request to determine execution parameters and assesses the computing environment of a Large Language Model (LLM) to extract environmental parameters. Using a database of profiles, it selects a configuration for executing the inference request that improves energy and resource efficiency.
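The two analysis steps described above can be sketched in outline. This is a minimal illustration, not the disclosed implementation: the field names, the whitespace token estimate, and the dataclass shapes are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ExecutionParams:
    """Parameters derived from the inference request itself (illustrative)."""
    model_name: str
    prompt_tokens: int
    max_output_tokens: int
    batch_size: int

@dataclass
class EnvironmentParams:
    """Parameters describing the LLM's computing environment (illustrative)."""
    gpu_model: str
    free_slices: int
    queue_depth: int

def analyze_request(request: dict) -> ExecutionParams:
    """Derive execution parameters from an incoming inference request."""
    return ExecutionParams(
        model_name=request["model"],
        # Crude whitespace-based token estimate, stands in for a real tokenizer.
        prompt_tokens=len(request["prompt"].split()),
        max_output_tokens=request.get("max_tokens", 256),
        batch_size=request.get("batch_size", 1),
    )
```

In a real deployment the environmental parameters would be read from the GPU driver and scheduler rather than constructed by hand; the sketch only shows the shape of the data the profile lookup would consume.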
The process involves selecting a profile from a database using parameters derived from both the execution request and the environment. Instructions are then sent to a controller associated with a multi-instance GPU (MIG), which modifies the computing resources available to a specific MIG slice based on the profile's performance specifications. This ensures that inference requests are executed efficiently using the adjusted computing resources.
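The profile lookup and controller dispatch might look like the following sketch. The profile table is a stand-in: the slice names follow NVIDIA's published MIG profile naming (e.g. `1g.10gb` on an A100 80GB), but the token thresholds, the `sm_share` field, and the `configure_slice` controller method are assumptions for illustration.

```python
# Hypothetical profile database mapping workload size to a MIG slice
# configuration and its share of streaming multiprocessors (SMs).
PROFILES = [
    {"max_tokens": 512,  "slice": "1g.10gb", "sm_share": 1 / 7},
    {"max_tokens": 2048, "slice": "3g.40gb", "sm_share": 3 / 7},
    {"max_tokens": 8192, "slice": "7g.80gb", "sm_share": 1.0},
]

def select_profile(total_tokens: int) -> dict:
    """Pick the smallest profile whose capacity covers the request."""
    for profile in PROFILES:
        if total_tokens <= profile["max_tokens"]:
            return profile
    return PROFILES[-1]  # fall back to the largest slice

def dispatch(profile: dict, controller) -> None:
    """Instruct the MIG controller to resize the target slice.

    `controller.configure_slice` is a hypothetical interface standing in
    for whatever reconfiguration API the GPU controller exposes.
    """
    controller.configure_slice(profile["slice"])
```

Choosing the smallest sufficient slice is what realizes the claimed efficiency: a short request never occupies a full-GPU profile that a larger queued workload could use.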
This approach addresses the challenge of balancing performance and cost. By preventing overallocation of resources, the system avoids unnecessary cost and wasted capacity while minimizing latency for queued workloads. Efficient allocation also helps prolong GPU lifespan by reducing premature wear.
Beyond the primary goal of optimizing GPU usage, this system is applicable to various AI-driven industries where LLMs are deployed. Different workloads and models require distinct GPU configurations, and this method ensures optimal operation across diverse scenarios. It can manage resources for multiple models, adjusting configurations to meet specific performance and cost parameters.
The system recognizes that GPU utilization is influenced by multiple factors, including the type and timing of inference requests, software configurations, and model-specific requirements. By accounting for these variables, the system can dynamically adjust resources to maintain efficient operation across changing conditions and demand patterns.
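The dynamic adjustment described above could reduce to a simple control decision, re-evaluated periodically. The thresholds and the grow/shrink/hold vocabulary below are illustrative assumptions, not values from the disclosure.

```python
def decide_action(util: float, queue_depth: int,
                  high: float = 0.9, low: float = 0.3) -> str:
    """Map observed utilization and queue depth to a MIG resize decision.

    util        -- fractional GPU utilization in [0.0, 1.0]
    queue_depth -- number of inference requests waiting for a slice
    """
    if util > high or queue_depth > 0:
        return "grow"    # under-provisioned: move to a larger MIG profile
    if util < low:
        return "shrink"  # over-provisioned: release resources to the pool
    return "hold"        # within the target band: leave the slice as-is
```

A scheduler loop would call this on each monitoring interval and translate "grow" or "shrink" into a controller instruction of the kind described earlier.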