Invention Title:

SYSTEMS AND METHODS FOR ADAPTIVE ALLOCATION AND MANAGEMENT OF PROCESSING RESOURCES IN DATA CENTERS USING DYNAMIC ATTENTION-BASED GRAPH NEURAL NETWORKS

Publication number:

US20260023642

Publication date:

2026-01-22

Section:

Physics

Class:

G06F11/0793

Inventors:

Vibhor AGRAWAL 🇺🇸 Fremont, CA, United States

Vadim Gechman 🇮🇱 Hulda, Israel

Assignee:

NVIDIA Corporation 🇺🇸 Santa Clara, CA, United States

Applicant:

NVIDIA Corporation 🇺🇸 Santa Clara, CA, United States

Smart overview of the Invention

The patent application details a system designed for the adaptive allocation and management of processing resources within data centers. It leverages dynamic attention-based graph neural networks to optimize resource management by analyzing telemetry and task assignment data. The system aims to enhance performance and mitigate potential failures by evaluating the performance state of nodes, each representing one or more processing resources.

Data Collection and Processing

The system collects telemetry data and task assignment data from multiple processing resources, such as GPUs. This data is used to generate feature vectors that feed into a machine learning model employing a graph attention network (GAT). The GAT comprises multiple nodes, each corresponding to one or more processing resources, and its output lets the system determine the performance state of each node.
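The data flow described above can be illustrated with a minimal sketch of a single-head graph attention layer in plain Python. It is not the implementation claimed in the application: the telemetry field names (temp_c, utilization, mem_used_frac), the scaling constants, the node names, the graph edges, and the untrained weights W and a are all assumptions made for illustration.

```python
import math

def build_feature_vector(telemetry, task_count):
    # Hypothetical telemetry fields and scaling; the application does not name them.
    return [
        telemetry["temp_c"] / 100.0,
        telemetry["utilization"],
        telemetry["mem_used_frac"],
        task_count / 10.0,
    ]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def gat_layer(features, neighbors, W, a):
    """Single-head graph attention layer.

    features:  dict node -> input feature vector
    neighbors: dict node -> list of adjacent nodes (a self-loop is added here)
    W:         weight matrix, out_dim rows by in_dim columns
    a:         attention vector of length 2 * out_dim
    Returns (new_features, attention_weights).
    """
    z = {n: matvec(W, h) for n, h in features.items()}
    out, attn = {}, {}
    for i in features:
        nbrs = [i] + [j for j in neighbors.get(i, []) if j != i]
        # Unnormalized score: LeakyReLU(a . [W h_i || W h_j])
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j])))
            for j in nbrs
        ]
        # Numerically stable softmax over the neighborhood of node i
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attn[i] = dict(zip(nbrs, alphas))
        # New representation: attention-weighted sum of transformed neighbors
        out[i] = [
            sum(al * z[j][k] for al, j in zip(alphas, nbrs))
            for k in range(len(z[i]))
        ]
    return out, attn

# Illustrative inputs only; none of these values come from the application.
telemetry = {
    "gpu0": {"temp_c": 62.0, "utilization": 0.55, "mem_used_frac": 0.40},
    "gpu1": {"temp_c": 91.0, "utilization": 0.98, "mem_used_frac": 0.95},
    "gpu2": {"temp_c": 58.0, "utilization": 0.30, "mem_used_frac": 0.25},
}
task_counts = {"gpu0": 3, "gpu1": 8, "gpu2": 1}
neighbors = {"gpu0": ["gpu1"], "gpu1": ["gpu0", "gpu2"], "gpu2": ["gpu1"]}

features = {
    n: build_feature_vector(telemetry[n], task_counts[n]) for n in telemetry
}
W = [[0.5, 0.1, 0.0, 0.2], [0.0, 0.4, 0.3, 0.1]]  # untrained, 2 x 4
a = [0.3, -0.2, 0.1, 0.4]                          # untrained, length 2 * 2
embeddings, attention = gat_layer(features, neighbors, W, a)
```

In a real deployment the weights would be learned, and a classification head on top of the node embeddings would produce the performance state. The sketch only shows how each node's representation becomes an attention-weighted mix of its own features and those of its neighbors.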

Performance Optimization

Based on the performance state determined by the GAT, the system can execute various actions to optimize processing resource performance. These actions may include redistributing workloads, scheduling maintenance, adjusting cooling or power systems, and implementing load balancing strategies. The goal is to enhance resource efficiency and prevent node failures.
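One of the listed actions, workload redistribution, might look like the following sketch. It assumes a hypothetical set of state labels in which "healthy" marks a node that can accept work, and it uses a simple greedy least-loaded policy. The application does not specify either the labels or the policy.

```python
def redistribute(assignments, states):
    """Move tasks off non-healthy nodes onto the least-loaded healthy node.

    assignments: dict node -> list of task ids
    states:      dict node -> state label ("healthy" is an assumed label)
    Returns a new assignment dict; the inputs are not modified.
    """
    new = {n: list(assignments.get(n, [])) for n in states}
    healthy = [n for n, s in states.items() if s == "healthy"]
    if not healthy:
        return new  # nowhere to move work; leave assignments unchanged
    for node, state in states.items():
        if state == "healthy":
            continue
        while new[node]:
            task = new[node].pop()
            # Greedy choice: the healthy node with the fewest tasks right now
            target = min(healthy, key=lambda n: len(new[n]))
            new[target].append(task)
    return new

# Illustrative data only.
assignments = {"gpu0": ["t1", "t2", "t3"], "gpu1": ["t4"], "gpu2": []}
states = {"gpu0": "degraded", "gpu1": "healthy", "gpu2": "healthy"}
rebalanced = redistribute(assignments, states)
```

The other actions named in the text (scheduling maintenance, adjusting cooling or power) would hook into facility-specific interfaces and are not sketched here.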

Predictive Capabilities

The system can also predict potential future failures of nodes. By analyzing performance states, it can identify failure patterns and design redundant systems that proactively take over workloads from failing nodes. This predictive approach helps maintain uninterrupted data center operations.
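As a rough illustration of predicting failures from a history of performance states, the sketch below fits a straight line to a node's recent health scores and extrapolates to a failure threshold. This is a deliberately simple stand-in, not the prediction method of the application: the numeric health score, the threshold of 0.2, and the linear trend are all assumptions.

```python
def steps_until_failure(scores, threshold=0.2):
    """Estimate how many sampling steps remain before a score hits threshold.

    scores: chronological health scores for one node (assumed in [0, 1])
    Returns 0.0 if already at or below threshold, None if there are fewer
    than two samples or no downward trend, otherwise a linear extrapolation.
    """
    n = len(scores)
    if n < 2:
        return None
    if scores[-1] <= threshold:
        return 0.0
    # Ordinary least-squares slope of score against time step
    mean_t = (n - 1) / 2
    mean_s = sum(scores) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in enumerate(scores))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    if slope >= 0:
        return None
    return (scores[-1] - threshold) / -slope

# Illustrative histories only.
declining = steps_until_failure([1.0, 0.9, 0.8, 0.7])
stable = steps_until_failure([0.9, 0.9, 0.9])
```

A node with a short estimated time to failure could then be drained ahead of time, for example with a redistribution policy that moves its tasks to healthy nodes.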

Applications and Implementations

The described system can be implemented in various contexts, including data centers, cloud computing environments, and edge devices. It supports applications in fields like AI, big data analytics, virtual reality, and more. The flexibility of the system allows it to be used in simulations, digital twins, and generative AI operations, among other advanced computing tasks.