# System Architecture and Software Stack for GDDR6-AiM



AiM: Accelerator-in-Memory

### **AiM Concept**

The Accelerator-in-Memory (AiM) is a GDDR6-based Processing-in-Memory device designed to accelerate memory-intensive Machine Learning applications in memory.

### **Conventional System vs. AiM System**





### **AiM Subsystem and Software Stack**



### AiM FPGA Platform (w/ CPU Host)



# CONTENTS

- I • GDDR6-AiM Overview
- II • AiM Subsystem
- III • AiM Software Stack
- IV • Performance Evaluation

## I • GDDR6-AiM Overview

### **CACM'20: Domain Specific Accelerators**



#### DOI:10.1145/3361682

DSAs gain efficiency from specialization and performance from parallelism.

BY WILLIAM J. DALLY, YATISH TURAKHIA, AND SONG HAN

# Domain-Specific Hardware Accelerators

look to alternative architectures with lower overhead, such as domain-specific accelerators, to continue scaling of performance and efficiency. There are several ways to realize domain-specific accelerators as discussed in the sidebar on accelerator options.


A domain-specific accelerator is a hardware computing engine that is specialized for a particular domain of applications. Accelerators have been designed for graphics,<sup>26</sup> deep learning,<sup>16</sup> simulation,<sup>2</sup> bioinformatics,<sup>49</sup> image processing,<sup>38</sup> and many other tasks. Accelerators can offer orders of magnitude improvements in performance/cost and performance/W compared to general-purpose computers. For example, our bioinformatics accelerator, Darwin,<sup>49</sup> is up to 15,000× faster than a CPU at reference-based, long-read assembly. The performance and efficiency of accelerators is due to a combination of specialized operations,

### AiM as a Domain Specific Memory




Accelerator-in-Memory


**Efficiency:** specialization for **a particular domain** of applications

Target domain: memory-bound DNN applications

Main goals:

- 1) Performance: high degrees of bank-level parallelism
- 2) Power: significantly reduce data movement
- 3) Cost: commodity DRAM-based (GDDR6)

### **GDDR6-AiM Key Operation: Matrix × Vector**



Multiply-And-Accumulate (MAC)

- Performs a MAC operation on sixteen BF16 weight-matrix elements and sixteen BF16 vector elements (sixteen 2-byte elements correspond to a single DRAM column access, i.e. 32 B).
- Computation results are stored in a dedicated MAC\_REG set and can be accessed later by the user.
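As a rough illustration of a single MAC step, the sketch below multiplies sixteen BF16 weights by sixteen BF16 vector elements and adds the sum to an accumulator that stands in for MAC\_REG. The helper names (`to_bf16`, `mac_step`) are hypothetical, and the accumulator precision is an assumption, since the slides do not state the internal format.

```python
import struct

ELEMS_PER_COLUMN = 16   # sixteen BF16 elements per MAC operation (from the slide)
BYTES_PER_BF16 = 2      # BF16 is a 16-bit format


def to_bf16(x: float) -> float:
    """Round a float to the nearest BF16 value (round-to-nearest-even).

    NaN and overflow handling are omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias applied to the low 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mac_step(weights, vector, acc=0.0):
    """Add the dot product of sixteen element pairs to the accumulator.

    Assumption: the accumulator is wider than BF16 (a Python float here).
    """
    if len(weights) != ELEMS_PER_COLUMN or len(vector) != ELEMS_PER_COLUMN:
        raise ValueError("a MAC step consumes exactly sixteen element pairs")
    products = (to_bf16(w) * to_bf16(v) for w, v in zip(weights, vector))
    return acc + sum(products)


column_bytes = ELEMS_PER_COLUMN * BYTES_PER_BF16   # 32 bytes per column access
mac_reg = mac_step([1.0] * 16, [0.5] * 16)
print(column_bytes, mac_reg)
```

Calling `mac_step` repeatedly with the previous result as `acc` models accumulation across successive column accesses of the same row.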







**Activation Function Module** 

- Performs Activation Function (AF) computation by linearly interpolating pre-stored AF template data using MAC calculation results.
- Interpolation results are stored in a dedicated **AF\_REG** set and can be later accessed by the user.
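The interpolation idea can be sketched as follows: a function is sampled into a table ahead of time, and each MAC result is mapped to an output by interpolating linearly between the two nearest table entries. The table size, the input range, and the clamping at the table ends are assumptions made for illustration; the slides do not give the template layout.

```python
import math
from bisect import bisect_right


def build_lut(fn, lo, hi, entries):
    """Sample fn at evenly spaced points; plays the role of the AF template data."""
    step = (hi - lo) / (entries - 1)
    xs = [lo + i * step for i in range(entries)]
    return xs, [fn(x) for x in xs]


def interpolate(lut, x):
    """Linearly interpolate the table at x, clamping outside the sampled range."""
    xs, ys = lut
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1              # index of the segment containing x
    t = (x - xs[i]) / (xs[i + 1] - xs[i])    # position inside the segment, 0..1
    return ys[i] + t * (ys[i + 1] - ys[i])


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical template: 65 samples of sigmoid over [-8, 8].
sigmoid_lut = build_lut(sigmoid, -8.0, 8.0, 65)
mac_result = 0.3
af_reg = interpolate(sigmoid_lut, mac_result)
print(af_reg, sigmoid(mac_result))
```

Swapping `sigmoid` for another function when building the table is what lets one interpolation unit serve several activation functions, at an accuracy that depends on the table resolution.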



- MAC and Activation Function operations can be performed in all banks in parallel.
- Weight matrix data is sourced from Banks; Vector data is sourced from the Global Buffer.
- MAC results are stored in latches collectively referred to as MAC\_REG.
- Activation Function results are stored in latches collectively referred to as AF\_REG.
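A functional model of the data flow described above might look like the following sketch: weights come from the banks, the vector comes from a shared Global Buffer, and each bank accumulates into its own MAC\_REG entry. The row-to-bank mapping (`row % 16`) and the function name `gemv_on_banks` are assumptions for illustration, not the device's actual data layout.

```python
NUM_BANKS = 16       # banks that can compute in parallel
ELEMS_PER_MAC = 16   # elements consumed by one MAC operation


def gemv_on_banks(matrix, vector):
    """Compute matrix x vector, sixteen rows (one per bank) at a time."""
    if len(vector) % ELEMS_PER_MAC:
        raise ValueError("vector length must be a multiple of 16 in this sketch")
    global_buffer = list(vector)              # vector data shared by all banks
    result = [0.0] * len(matrix)
    for first_row in range(0, len(matrix), NUM_BANKS):
        rows = range(first_row, min(first_row + NUM_BANKS, len(matrix)))
        mac_reg = {r % NUM_BANKS: 0.0 for r in rows}   # one accumulator per bank
        for col in range(0, len(global_buffer), ELEMS_PER_MAC):
            chunk = global_buffer[col:col + ELEMS_PER_MAC]
            for r in rows:                    # in hardware these banks run in parallel
                weights = matrix[r][col:col + ELEMS_PER_MAC]
                mac_reg[r % NUM_BANKS] += sum(w * v for w, v in zip(weights, chunk))
        for r in rows:                        # read the accumulated results back
            result[r] = mac_reg[r % NUM_BANKS]
    return result


matrix = [[float((r + c) % 5) for c in range(32)] for r in range(20)]
vector = [float(c % 3) for c in range(32)]
out = gemv_on_banks(matrix, vector)
print(out[:4])
```

The inner loop over `rows` is sequential here only because this is a software model; the point of the design is that all sixteen banks consume the same vector chunk at the same time.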




### **GDDR6-AiM Feature Summary**

SK hynix's very first GDDR6-based processing-in-memory (PIM) product sample, called Accelerator-in-Memory (AiM).

| Item                          | GDDR6-AiM\*                           |
|-------------------------------|---------------------------------------|
| DRAM Type                     | GDDR6                                 |
| Process Technology            | 1y                                    |
| Memory Density                | 8Gb (4Gb DDP)                         |
| Organization                  | 2CH/Chip, x16 mode only               |
| IO Data rate                  | 16 Gb/s/pin (@1.25V)                  |
| Bandwidth                     | 64 GB/s                               |
| Processing Unit (PU)          | 16 PU/die, 32 PU/Chip                 |
| Operating Speed               | 1 GHz                                 |
| Compute Throughput            | 1 TFLOPS/Chip                         |
| Numeric Precision             | Brain Floating Point 16 (BF16)        |
| Activation Function support\*\* | Sigmoid, tanh, GELU, ReLU, Leaky ReLU |
| Targets                       | Memory-bound DNN applications         |

[GDDR6-AiM Floorplan]

\* S. Lee, et al., "A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications", 2022 IEEE International Solid-State Circuits Conference (ISSCC), 2022.

\*\* Using an internal lookup table and linear interpolation unit. Any customized function may be applied, with limited accuracy.
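The headline numbers in the feature table can be cross-checked with simple arithmetic. The sketch below assumes that each PU completes one sixteen-element MAC operation per 1 GHz cycle and that a MAC counts as two floating-point operations; both are assumptions on our part, although they reproduce the quoted 1 TFLOPS and 64 GB/s figures.

```python
# Values taken from the feature table.
pus_per_chip = 32
clock_hz = 1e9
channels_per_chip = 2
pins_per_channel = 16        # x16 mode
gbit_per_s_per_pin = 16

# Assumptions (not stated on the slide).
macs_per_pu_per_cycle = 16   # one sixteen-element MAC operation per cycle
flops_per_mac = 2            # one multiply plus one add

peak_flops = pus_per_chip * macs_per_pu_per_cycle * flops_per_mac * clock_hz
io_bandwidth_gbyte = channels_per_chip * pins_per_channel * gbit_per_s_per_pin / 8

print(f"peak compute: {peak_flops / 1e12:.3f} TFLOPS/chip")
print(f"external I/O bandwidth: {io_bandwidth_gbyte:.0f} GB/s")
```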




### New Commands introduced in AiM

| Category         | Command             | Description                                                                       |
|------------------|---------------------|-----------------------------------------------------------------------------------|
| Bank Activation  | ACT4, ACT16         | Activate four/sixteen banks in parallel                                           |
| Bank Activation  | ACTAF4, ACTAF16     | Activate rows storing Activation Function LUTs in four/sixteen banks in parallel  |
| Compute Commands | MACSB, MAC4B, MACAB | Perform MAC in one/four/sixteen banks in parallel                                 |
| Compute Commands | AF                  | Compute Activation Function in all banks                                          |
| Compute Commands | EWMUL               | Perform element-wise multiplication                                               |
| Data Commands    | RDCP                | Copy data from a bank to the Global Buffer                                        |
| Data Commands    | WRCP                | Copy data from the Global Buffer to a bank                                        |
| Data Commands    | WRGB\*              | Write to Global Buffer (often Activation vector data)                             |
| Data Commands    | RDMAC\*             | Read from MAC result register                                                     |
| Data Commands    | RDAF\*              | Read from Activation Function result register                                     |
| Data Commands    | WRMAC\*             | Write to MAC result register (also called WRBIAS, since bias data is often written) |
| Data Commands    | WRBK                | Write to all activated banks in parallel                                          |

## II • AiM Subsystem

### AiM Subsystem: High-Performance Scalable Reference Design

**AiM Subsystem** is a hardware bridge between the host and the AiM devices. It is designed to 1) maximize compute throughput for a set of AiM devices and to 2) minimize software stack overhead.






#### **1** AiM Command/Data DMA engine

Decodes AiM instructions from software and provides direct memory access for the host.

#### **2** AiM Multicasting Interconnect

Enables efficient workload distribution through flexible instruction parallelism. Supports unicast, multicast, and broadcast modes.
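The three delivery modes can be pictured with the small sketch below, which returns the set of channels that would receive an instruction. The bitmask encoding for multicast and the function name `targets` are hypothetical; the slides name the modes but not how targets are encoded.

```python
def targets(mode, num_channels, mask=0, channel=0):
    """Return the indices of the channels that receive an instruction."""
    if mode == "broadcast":
        return list(range(num_channels))           # every channel
    if mode == "multicast":
        # Assumed encoding: bit c of the mask selects channel c.
        return [c for c in range(num_channels) if (mask >> c) & 1]
    if mode == "unicast":
        if not 0 <= channel < num_channels:
            raise ValueError("channel out of range")
        return [channel]                           # exactly one channel
    raise ValueError(f"unknown mode: {mode}")


print(targets("unicast", 8, channel=3))
print(targets("multicast", 8, mask=0b00100101))
print(targets("broadcast", 4))
```

Broadcast suits data that every device needs, such as a shared input vector, while unicast and multicast suit data that differs per device or per group of devices.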










#### **3** AiM Controller

Generates and schedules low-level AiM and typical DRAM commands.

## **AiM Control Flow from Host Processor**



**AiM Command** 

- New commands, named **AiM commands**, are introduced to control computing operations in AiM.
- The AiM software stack generates **AiM Host (Macro) Instructions**; the Command Generator and AiM Controller convert each **Macro Instruction** into multiple corresponding **AiM (Micro) commands**.
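To make the macro-to-micro relationship concrete, the sketch below expands one hypothetical macro instruction into a list of micro commands. Only the command names (WRGB, ACT16, MACAB, RDMAC) come from the command table; the macro's fields, the ordering, and the omission of precharge and refresh commands are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroCommand:
    name: str
    operand: int = 0


def expand_gemv(row, num_columns):
    """Expand a hypothetical 'multiply one row by the vector' macro instruction."""
    cmds = [MicroCommand("WRGB", col) for col in range(num_columns)]    # load vector chunks
    cmds.append(MicroCommand("ACT16", row))                             # open the row in sixteen banks
    cmds += [MicroCommand("MACAB", col) for col in range(num_columns)]  # MAC in all banks, per column
    cmds.append(MicroCommand("RDMAC"))                                  # read the accumulated results
    return cmds


micro = expand_gemv(row=3, num_columns=2)
print([c.name for c in micro])
```

One macro instruction thus stands for many micro commands, which is why generating the expansion in hardware reduces the work left to the software stack.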



#### **ISR Instructions**

## III • AiM Software Stack

### Motivation



- To increase the value of new memory solutions such as AiM, expanding the ecosystem is of paramount importance.
- To achieve this goal, we have developed an SDK whose operator-level APIs allow any DNN model to be adapted easily.
- The proposed API abstracts the functionality of PIM operations and makes AiM easier to program.



### **AiM Execution Sequence for Fully Connected Layer**





### **AiM Software Stack**



- SK hynix has implemented a full software stack (device drivers, runtime library, frameworks, and applications).
- The stack includes an AiM software emulator that allows users to develop AI applications without an evaluation board.



### **AiM Operation Kernels**

- The AiM Runtime Library provides a number of AiM OP kernels for Deep Learning Frameworks and AI Applications by exposing operator-level APIs.

[Figure: software stack with the ONNX AiM Execution Provider and the PyTorch AiM Extension]

### **AiM Integration on PyTorch**

- The framework provides abstraction functions for various PIM operations and easy programmability for developers.
- When users apply the "to\_aim" API to a PyTorch operation, the operation is converted to the corresponding operator-level AiM API defined in the runtime library.







### **AiM Integration on ONNX Runtime**

- To run ONNX Runtime applications on AiM, developers add "AiMExecutionProvider" to the "EP\_List". This makes the AiM execution provider the default provider, and some nodes in the ONNX graph are offloaded to AiM.
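In ONNX Runtime's Python API, execution providers are passed to `InferenceSession` as a priority-ordered list, so the step described above might look like the sketch below. It assumes an ONNX Runtime build that registers "AiMExecutionProvider" (a stock build does not), and the helper names `build_ep_list` and `create_session` are hypothetical.

```python
def build_ep_list(available):
    """Put the AiM provider first and keep the CPU provider as a fallback."""
    preferred = ["AiMExecutionProvider", "CPUExecutionProvider"]
    return [ep for ep in preferred if ep in available]


def create_session(model_path):
    """Create a session whose highest-priority provider is AiM, if present."""
    import onnxruntime as ort   # imported lazily so build_ep_list works without it
    ep_list = build_ep_list(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=ep_list)


# Provider names as they would be reported by an AiM-enabled build (assumed).
print(build_ep_list(["CPUExecutionProvider", "AiMExecutionProvider"]))
```

Nodes that the first provider in the list does not claim fall through to the next one, which matches the statement that only some nodes of the graph are offloaded to AiM.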



[Figure: AiM software stack, showing Application (GPT-2, LSTM, ...), Deep Learning Framework (PyTorch AiM Extension, ONNX AiM Execution Provider), and AiM Runtime Library]

### **AiM Memory Allocator**

- The memory allocator allocates buffers from AiM and host DRAM, and manages the virtual address of each buffer for each process. It manages three types of memory (Host DRAM, AiM, and FPGA GPR) and has a hierarchical structure.





AiM Memory Allocator API

- AiMMalloc
- AiMMatrixMalloc
- AiMMemcpy
- AiMFree
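The slides list the allocator entry points but not their signatures, so the following is only a toy model of the bookkeeping such an allocator needs: one pool per memory type, with usage tracked per pool. The class `ToyAllocator`, its method names, and the pool capacities are invented for illustration and are not the SDK's API.

```python
class ToyAllocator:
    """Tracks allocations in separate pools (e.g. host DRAM, AiM, FPGA GPR)."""

    def __init__(self, capacities):
        self._capacity = dict(capacities)
        self._used = {kind: 0 for kind in capacities}
        self._buffers = {}          # handle -> (pool name, size in bytes)
        self._next_handle = 1

    def malloc(self, kind, size):
        """Reserve size bytes in the named pool and return an opaque handle."""
        if self._used[kind] + size > self._capacity[kind]:
            raise MemoryError(f"{kind} pool exhausted")
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = (kind, size)
        self._used[kind] += size
        return handle

    def free(self, handle):
        """Release a buffer and return its bytes to the pool it came from."""
        kind, size = self._buffers.pop(handle)
        self._used[kind] -= size

    def used(self, kind):
        return self._used[kind]


# Hypothetical pool sizes in bytes.
allocator = ToyAllocator({"host": 1024, "aim": 256, "gpr": 64})
weights = allocator.malloc("aim", 128)
print(allocator.used("aim"))
allocator.free(weights)
print(allocator.used("aim"))
```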




### **AiM Host (ISR) Instruction Generator**

- Generates the host instructions and dispatches them to the hardware.




### **AiM Software Emulator**

- With the AiM performance model, users can estimate not only the total execution time of AI applications but also the execution time of each individual host instruction.
- Another benefit of the software emulator is hardware flexibility.



[Figure: emulator output listing the estimated performance of each OP kernel from the analytical model, next to the software stack diagram (Application, Deep Learning Frameworks, AiM Runtime Library, AiM Device Driver)]

## IV • Performance Evaluation

### Performance Evaluation: GPT-2 and GPT-3

- Performance is evaluated on GPT-2 and GPT-3 (memory-intensive workloads).
- Higher gains are expected if AiM is deployed directly on memory channels running at 16 Gb/s/pin, as shown by "AiM Projected", which is obtained with our analytical performance model.
- The more memory-intensive the workload, the more effective AiM becomes.



Parameter Size : 1.5B

### **AiM Platform Distribution**

SK hynix can provide the following items for your research:

- AiM SDK (including AiM Software Emulator)
- [Optional] AiM FMC card ×2

Please contact us at SKhynix\_PIM@skhynix.com

### Circuit-level

- [ISSCC'22] A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications
  - ISSCC'22 demo (short video): https://www.youtube.com/watch?v=3LbvwrJFYoA
- [JSSC'23] A 1ynm 1.25V 8Gb 16Gb/s/Pin GDDR6-Based Accelerator-in-Memory Supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep Learning Application

### Architectures/Software stack

- [Hot Chips'22] System Architecture and Software Stack for GDDR6-AiM
- White papers (released with AI Hardware Summit'22):
  - (1) Accelerator-in-Memory: SK hynix's GDDR6-based Processing-in-Memory Device
  - (2) Accelerator-in-Memory Platform: Hardware Perspective
  - (3) Accelerator-in-Memory Platform: Software Perspective
  - (From https://product.skhynix.com)