US20260072741
2026-03-12
Physics
G06F9/5016
An efficient method for allocating multi-instance GPUs in AI model services is described. The system analyzes an inference request to determine execution parameters and assesses the computing environment of a Large Language Model (LLM) to extract environmental parameters. Using a database of profiles, it selects a configuration for executing the inference request that improves energy and resource efficiency.
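The two analysis steps described above can be sketched in outline. This is a minimal illustration, not the disclosed implementation: the field names, the whitespace token estimate, and the dataclass shapes are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ExecutionParams:
    """Parameters derived from the inference request itself (illustrative)."""
    model_name: str
    prompt_tokens: int
    max_output_tokens: int
    batch_size: int

@dataclass
class EnvironmentParams:
    """Parameters describing the LLM's computing environment (illustrative)."""
    gpu_model: str
    free_slices: int
    queue_depth: int

def analyze_request(request: dict) -> ExecutionParams:
    """Derive execution parameters from an incoming inference request."""
    return ExecutionParams(
        model_name=request["model"],
        # Crude whitespace-based token estimate, stands in for a real tokenizer.
        prompt_tokens=len(request["prompt"].split()),
        max_output_tokens=request.get("max_tokens", 256),
        batch_size=request.get("batch_size", 1),
    )
```

In a real deployment the environmental parameters would be read from the GPU driver and scheduler rather than constructed by hand; the sketch only shows the shape of the data the profile lookup would consume.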
The process involves selecting a profile from a database using parameters derived from both the execution request and the environment. Instructions are then sent to a controller associated with a multi-instance GPU (MIG), which modifies the computing resources available to a specific MIG slice based on the profile's performance specifications. This ensures that inference requests are executed efficiently using the adjusted computing resources.
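The profile lookup and controller dispatch might look like the following sketch. The profile table is a stand-in: the slice names follow NVIDIA's published MIG profile naming (e.g. `1g.10gb` on an A100 80GB), but the token thresholds, the `sm_share` field, and the `configure_slice` controller method are assumptions for illustration.

```python
# Hypothetical profile database mapping workload size to a MIG slice
# configuration and its share of streaming multiprocessors (SMs).
PROFILES = [
    {"max_tokens": 512,  "slice": "1g.10gb", "sm_share": 1 / 7},
    {"max_tokens": 2048, "slice": "3g.40gb", "sm_share": 3 / 7},
    {"max_tokens": 8192, "slice": "7g.80gb", "sm_share": 1.0},
]

def select_profile(total_tokens: int) -> dict:
    """Pick the smallest profile whose capacity covers the request."""
    for profile in PROFILES:
        if total_tokens <= profile["max_tokens"]:
            return profile
    return PROFILES[-1]  # fall back to the largest slice

def dispatch(profile: dict, controller) -> None:
    """Instruct the MIG controller to resize the target slice.

    `controller.configure_slice` is a hypothetical interface standing in
    for whatever reconfiguration API the GPU controller exposes.
    """
    controller.configure_slice(profile["slice"])
```

Choosing the smallest sufficient slice is what realizes the claimed efficiency: a short request never occupies a full-GPU profile that a larger queued workload could use.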
This approach addresses the challenge of balancing performance and cost. By preventing overallocation of resources, the system avoids unnecessary cost and wasted capacity while minimizing latency for queued workloads. Efficient allocation also helps prolong GPU lifespan by reducing premature wear.
Beyond the primary goal of optimizing GPU usage, this system is applicable to various AI-driven industries where LLMs are deployed. Different workloads and models require distinct GPU configurations, and this method ensures optimal operation across diverse scenarios. It can manage resources for multiple models, adjusting configurations to meet specific performance and cost parameters.
The system recognizes that GPU utilization is influenced by multiple factors, including the type and timing of inference requests, software configurations, and model-specific requirements. By accounting for these variables, the system can dynamically adjust resources to maintain efficient operation across changing conditions and demand patterns.
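The dynamic adjustment described above could reduce to a simple control decision, re-evaluated periodically. The thresholds and the grow/shrink/hold vocabulary below are illustrative assumptions, not values from the disclosure.

```python
def decide_action(util: float, queue_depth: int,
                  high: float = 0.9, low: float = 0.3) -> str:
    """Map observed utilization and queue depth to a MIG resize decision.

    util        -- fractional GPU utilization in [0.0, 1.0]
    queue_depth -- number of inference requests waiting for a slice
    """
    if util > high or queue_depth > 0:
        return "grow"    # under-provisioned: move to a larger MIG profile
    if util < low:
        return "shrink"  # over-provisioned: release resources to the pool
    return "hold"        # within the target band: leave the slice as-is
```

A scheduler loop would call this on each monitoring interval and translate "grow" or "shrink" into a controller instruction of the kind described earlier.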