Invention Title:

EMBEDDING A STATE SPACE MODEL ON MODELS-ON-SILICON HARDWARE ARCHITECTURE

Publication number:

US20260010782

Publication date:

Section:

Physics

Class:

G06N3/063

Inventors:

Assignee:

Applicant:

Smart overview of the Invention

The patent application describes embedding a state space model, specifically a Mamba-based block, directly onto a silicon chip architecture. The integration relies on specialized hardware modules, including an optimized selective scan unit and a 1D convolution unit, to improve throughput and power efficiency. By organizing model parameters in sequential read memories according to a predetermined timing sequence, the architecture is well suited to AI inference on resource-constrained devices.
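To make the computation concrete, the sketch below shows, in plain NumPy, the two operations a Mamba-style block relies on and which the dedicated hardware units described above would accelerate: a causal depthwise 1D convolution over the sequence and a selective scan recurrence. The function names, tensor shapes, and the simplified per-channel discretization are illustrative assumptions, not details taken from the patent.

```python
# Minimal software sketch (assumed shapes and names) of a Mamba-style block's
# two core operations: depthwise 1D convolution and a selective scan.
import numpy as np

def depthwise_conv1d(x, kernel):
    """Causal depthwise 1D convolution: x is (seq_len, channels), kernel is (k, channels)."""
    seq_len, channels = x.shape
    k = kernel.shape[0]
    padded = np.vstack([np.zeros((k - 1, channels)), x])  # left-pad so output at t sees only past inputs
    out = np.empty_like(x)
    for t in range(seq_len):
        out[t] = np.sum(padded[t:t + k] * kernel, axis=0)
    return out

def selective_scan(x, delta, A, B, C):
    """Per-channel selective scan: one hidden state per channel,
    with input-dependent (selective) B, C and step size delta."""
    seq_len, channels = x.shape
    h = np.zeros(channels)
    y = np.empty_like(x)
    for t in range(seq_len):
        a_bar = np.exp(delta[t] * A)             # discretized state transition
        h = a_bar * h + delta[t] * B[t] * x[t]   # state update
        y[t] = C[t] * h                          # readout
    return y

# Toy usage with random data
rng = np.random.default_rng(0)
seq_len, channels, k = 16, 8, 4
x = rng.standard_normal((seq_len, channels))
kernel = rng.standard_normal((k, channels)) * 0.1
A = -np.abs(rng.standard_normal(channels))               # stable (negative) poles
delta = np.abs(rng.standard_normal((seq_len, channels))) * 0.1
B = rng.standard_normal((seq_len, channels))
C = rng.standard_normal((seq_len, channels))

y = selective_scan(depthwise_conv1d(x, kernel), delta, A, B, C)
print(y.shape)  # (16, 8)
```

In the patent's setting, both loops would be realized as fixed-function circuits rather than software, but the data flow they implement is the same.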

Background

Deep neural networks (DNNs), including large language models (LLMs), are widely used in AI applications like computer vision and natural language processing due to their accuracy. However, their high computational demands make them challenging to implement on edge devices with limited resources. The patent addresses the need for a cost-effective solution for AI inference tasks, which are typically resource-intensive and require real-time performance.

Technical Challenges

Current solutions such as GPUs, TPUs, and CPUs are not optimized for dedicated AI inference tasks, leading to inefficiencies in both power consumption and performance. GPUs, for instance, repeatedly load model weights from memory, which costs significant power and time. FPGA solutions are customizable but require extensive programming effort and remain less power-efficient than dedicated hardware. CPUs, in turn, are poorly suited to the large-scale matrix multiplications at the core of machine learning inference.

Models-on-Silicon Architecture

The proposed models-on-silicon architecture is a chip design that embeds the LLM weights and the inference pipeline directly in hardware. Because the weights are fixed on the chip, they do not need to be reloaded repeatedly from external memory, which improves both performance and power efficiency. The architecture uses sequential read-only memory to store weights and key-value caches, simplifying data access and maximizing throughput, while custom-built circuits perform the logic operations with reduced power consumption and silicon area.
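The sketch below illustrates, in software, the sequential-read idea behind this design: weight blocks are laid out once in a read-only buffer in the exact order the inference pipeline consumes them, so each layer simply reads the next block instead of issuing repeated loads from main memory. The class name, method names, and the toy layer loop are illustrative assumptions, not identifiers from the patent.

```python
# Minimal sketch (assumed API) of weights stored in a sequential read-only
# buffer and consumed in a predetermined order by the inference pipeline.
import numpy as np

class SequentialWeightROM:
    def __init__(self, layer_shapes, seed=0):
        rng = np.random.default_rng(seed)
        # Weight blocks are frozen up front, standing in for values baked into silicon.
        self._blocks = [rng.standard_normal(shape).astype(np.float32)
                        for shape in layer_shapes]
        self._cursor = 0

    def next_block(self):
        """Return the next weight block in the predetermined timing sequence."""
        block = self._blocks[self._cursor]
        self._cursor = (self._cursor + 1) % len(self._blocks)  # wrap around for the next token
        return block

# Toy inference loop: each layer consumes its weights strictly in order, with no reloads.
rom = SequentialWeightROM([(8, 8), (8, 8), (8, 4)])
x = np.ones(8, dtype=np.float32)
for _ in range(3):
    x = np.maximum(rom.next_block().T @ x, 0.0)  # matmul followed by ReLU per layer
print(x.shape)  # (4,)
```

On the actual chip the "cursor" is the timing sequence itself: the memory streams out values in a fixed order, so no address computation or random access is needed.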

Implementation Details

The architecture includes custom circuits such as a read-only memory holding precomputed values and a multiplier circuit that combines embedding values with weight values. These circuits are tailored to the specific floating-point operations used during inference, improving efficiency for AI workloads. By placing the memories close to the logic that consumes them, the architecture minimizes round trips to main memory, further improving performance and power efficiency.
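As a rough software analogue of these two circuit ideas, the sketch below keeps a small table of precomputed function values (standing in for a ROM placed next to the logic) and a helper that multiplies an embedding vector by a stored weight column. The table contents, its resolution, and the float format are illustrative assumptions; the patent does not specify them here.

```python
# Minimal sketch (assumed values and names) of a precomputed-value ROM plus
# an embedding-by-weight multiplier.
import numpy as np

# Precompute a coarse lookup table for exp(x) over a fixed input range,
# standing in for values a hardware ROM would hold near the logic.
TABLE_MIN, TABLE_MAX, TABLE_SIZE = -8.0, 0.0, 256
_inputs = np.linspace(TABLE_MIN, TABLE_MAX, TABLE_SIZE, dtype=np.float32)
EXP_ROM = np.exp(_inputs)

def exp_from_rom(x):
    """Approximate exp(x) with the nearest precomputed ROM entry."""
    idx = np.clip(((x - TABLE_MIN) / (TABLE_MAX - TABLE_MIN)) * (TABLE_SIZE - 1),
                  0, TABLE_SIZE - 1).astype(int)
    return EXP_ROM[idx]

def multiply_embedding(embedding, weight_column):
    """Elementwise multiply of an embedding vector with a stored weight column,
    the kind of operation a dedicated multiplier circuit would perform in one step."""
    return embedding.astype(np.float32) * weight_column.astype(np.float32)

emb = np.float32([0.5, -1.0, 2.0])
w = np.float32([0.25, 0.75, -0.5])
print(multiply_embedding(emb, w))
print(exp_from_rom(np.float32([-1.0, -4.0])))
```

The point of the lookup is the same as on the chip: a value that would otherwise be computed (or fetched from distant memory) every time is read from a small local store instead.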