

Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures

> Christina Giannoula Ivan Fernandez, Juan Gomez-Luna, Nectarios Koziris, Georgios Goumas, Onur Mutlu









UNIVERSIDAD DE MÁLAGA

### Our Work

### Efficient Algorithmic Designs

The first open-source Sparse Matrix Vector Multiplication (SpMV) software package, SparseP, for real Processing-In-Memory (PIM) systems

SparseP is Open-Source

SparseP: https://github.com/CMU-SAFARI/SparseP

### **Extensive Characterization**

The first comprehensive analysis of SpMV on the first real commercial PIM architecture

Recommendations for Architects and Programmers

Full Paper: https://arxiv.org/pdf/2201.05072.pdf

# Sparse Matrix Vector Multiplication

Sparse Matrix Vector Multiplication (SpMV):

- Widely-used kernel in graph processing, machine learning, scientific computing ...
- A highly memory-bound kernel

**Roofline Model** 



# Real Processing-In-Memory Systems

Real Near-Bank Processing-In-Memory (PIM) Systems:

- High levels of parallelism
- Low memory access latency
- Large aggregate memory bandwidth



# Real Processing-In-Memory Systems

Real Near-Bank Processing-In-Memory (PIM) Systems:

- High levels of parallelism
- Low memory access latency
- Large aggregate memory bandwidth



# SparseP: SpMV Library for Real PIMs

Our Contributions:

- 1. Design efficient SpMV kernels for current and future PIM systems
  - 25 SpMV kernels
    - 4 compressed matrix formats (CSR, COO, BCSR, BCOO)
    - 6 data types
    - 4 data partitioning techniques
    - Various load balancing schemes among PIM cores/threads
    - 3 synchronization approaches
- 2. Provide a comprehensive analysis of SpMV on the first commercially-available real PIM system **UP** 
  - 26 sparse matrices
  - Comparisons to state-of-the-art CPU and GPU systems
  - Recommendations for software, system and hardware designers

mem



### SpMV Kernels for Real PIM Systems

# Key Takeaways from Our Study

### Conclusion

### SpMV Execution on a PIM System



# Data Partitioning Techniques

SparseP supports two types of data partitioning techniques:

### **1D** Partitioning





# **1D Partitioning Technique**

Load-Balancing Approaches:

- CSR, COO:
  - Balance Rows
  - Balance NNZs \*
- BCSR, BCOO:
  - Balance Blocks ^
  - Balance NNZs ^
- \* row-granularity for CSR
- ^ block-row-granularity for BCSR

# **1D Partitioning Technique**

Load-Balancing of #NNZs:

• CSR (row-granularity), COO



#### row-order



#### nnz-order



# **1D Partitioning Technique**

Load-Balancing of #NNZs:

- CSR (row-granularity), COO
- BCSR (block-row-granularity), BCOO



#### block-row-order



#### block-order





### Load-Balance across Threads

#### Multithreaded PIM Cores:

#### **1D Partitioning**





### Various load-balance schemes across threads

### Load-Balance across Threads

#### Multithreaded PIM Cores:

#### **1D Partitioning**

### 2D Partitioning



### Various load-balance schemes across threads

### Load-Balance across Threads

### Multithreaded PIM Cores:

#### **1D Partitioning**





- Various load-balance schemes across threads
- Various synchronization approaches among threads

# Synchronization Approaches

Multithreaded PIM Core:





### Fine-Grained (lb-fg)





# SparseP Software Package

#### 25 SpMV kernels for PIM Systems $\rightarrow$

#### https://github.com/CMU-SAFARI/SparseP

| Partitioning                            | Matrix Format | Load-Balancing     |  |
|-----------------------------------------|---------------|--------------------|--|
| <b>9x</b><br>1D<br>Kernels              | CSR           | rows, nnzs *       |  |
|                                         | C00 🔺         | rows, nnzs *, nnzs |  |
|                                         | BCSR          | blocks ^, nnzs ^   |  |
|                                         | BCOO 🔺        | blocks, nnzs       |  |
| <b>4x</b><br>2D<br>Equally-Sized Tiles  | CSR           |                    |  |
|                                         | C00 🔺         |                    |  |
|                                         | BCSR          |                    |  |
|                                         | BCOO 🔺        |                    |  |
| <b>6x</b><br>2D<br>Equally-Wide Tiles   | CSR           | nnzs *             |  |
|                                         | C00 🔺         | nnzs               |  |
|                                         | BCSR          | blocks ^, nnzs ^   |  |
|                                         | BCOO 🔺        | blocks, nnzs       |  |
| <b>6x</b><br>2D<br>Variable-Sized Tiles | CSR           | nnzs *             |  |
|                                         | C00 🔺         | nnzs               |  |
|                                         | BCSR          | blocks ^, nnzs ^   |  |
|                                         | BCOO 🔺        | blocks, nnz        |  |

Load-balance across PIM cores/threads:

- \* row-granularity (CSR)
- ^ block-row-granularity (BCSR)

#### Synchronization

among threads of a PIM core:

Ib-cg, lb-fb, lf (COO, BCOO)

#### Data Types:

- 8-bit integer
- 16-bit integer
- 32-bit integer
- 64-bit integer
- 32-bit float
- 64-bit float



### SpMV Kernels for Real PIM Systems

# Key Takeaways from Our Study

### Conclusion

### **UPMEM-based PIM System**

- 20 UPMEM PIM DIMMs with 2560 PIM cores in total
- Each multithreaded PIM core supports 24 threads



# Sparse Matrix Data Set

### 26 sparse matrices\*:

- Diverse sparsity patterns
- Variability on irregular patterns
- Variability on block patterns



### Scale-Free Matrix



\* Suite Sparse Matrix Collection: <a href="https://sparse.tamu.edu/">https://sparse.tamu.edu/</a>

### Kernel Execution on One PIM Core



### Lock-Based Synchronization

16 threads, COO, 32-bit integer



Fine-Grained Locking: memory accesses to the local DRAM bank are serialized in the DMA engine of the UPEM PIM hardware.

DRAM Bank

### Lock-Based Synchronization

16 threads, COO, 32-bit integer

#### Key Takeaway 1

Fine-grained locking approaches cannot improve performance over coarse-grained locking, when the PIM hardware does not support concurrent accesses to the local DRAM bank.



### Recommendation 1

Provide low-cost synchronization support and hardware support to enable concurrent memory accesses to the local DRAM bank, and integrate multiple DRAM banks per PIM core to increase execution parallelism.



Load-balancing #NNZ typically provides high computation balance across threads of a compute-limited PIM core



Load-balancing #NNZs: one single thread performs a much higher #memory accesses and #synchronization operations than the rest

# Load-Balance within a PIM Core

### 16 threads, 32-bit integer

### Key Takeaway 2

High operation imbalance in computation, synchronization, or memory instructions executed by multiple threads of a PIM core can cause high performance overhead.



### Recommendation 2

Design algorithms that provide high load balance across threads of PIM core in terms of computations, synchronization points and memory accesses.

### Scalability within a PIM Core 32-bit integer



### Kernel Execution on Multiple PIM Cores



### **Comparison of Compressed Formats**





In scale-free matrices, COO + BCOO provide higher non-zero element balance across PIM cores than CSR + BCSR, respectively.

### **Comparison of Compressed Formats**



In scale-free matrices, COO + BCOO provide higher non-zero element balance across threads than CSR + BCSR, respectively.



COO + BCOO formats provide higher non-zero element balance across PIM cores + threads than CSR + BCSR, respectively.

# **Comparison of Compressed Formats**

2048 PIM Cores, 32-bit integer

1D

2D Equally-Sized

#### Key Takeaway 3

The compressed matrix format used to store the input matrix determines the data partitioning across DRAM banks of PIM-enabled memory. As a result, it affects the load-balance across PIM cores (and threads of a PIM core) with corresponding performance implications.

| matrices 2D Equally-Wide | matrices<br><b>2D Variable-Sized</b> |                  |            |
|--------------------------|--------------------------------------|------------------|------------|
| regular matrices         | scale-free                           | regular matrices | scale-free |

#### **Recommendation 3**

5

Speedup

Design compressed data structures that can be effectively partitioned across DRAM banks, with the goal of providing high computation balance across PIM cores (and threads of a PIM core).

regular matrices

scale-free matrices regular matrices

scale-free matric**3**s**3** 

### End-to-End Performance





<u>1D</u>: #bytes to load the input vector grows linearly to #PIM cores

# Scalability

COO format, 32-bit integer

### Key Takeaway 4

The 1D-partitioned kernels are severely **bottlenecked** by the high data transfer costs to **broadcast** the whole input vector **into DRAM banks** of all PIM cores, through the narrow off-chip memory bus.



### **Recommendation 4**

Optimize the broadcast collective collective in data transfers to PIM-enabled memory to efficiently copy the input data into DRAM banks in the PIM system.



<u>2D Equally-Sized: kernel</u> time is limited by only a few PIM cores assigned to the 2D tiles with the largest #NNZs



high amount of zero padding to gather the output vector  $\rightarrow$  parallel transfers supported at rank granularity = 64 PIM cores

## Scalability

COO format, 32-bit integer

### Key Takeaway 5

The 2D equally-wide and variable-sized kernels need fine-grained parallel data transfers at DRAM bank granularity (zero padding) to be supported by the PIM system to achieve high performance.



### Optimize the gather collective operation at DRAM bank granularity in data transfers from PIM-enabled memory to efficiently retrieve the output results to the host CPU.

## **Comparison of Sparse Matrices**



<u>1D:</u> **#PIM cores** that provides the best performance depends on the **sparsity pattern** of the input matrix



<u>2D</u>: **#vertical partitions** that provides the best performance depends on the **sparsity pattern** of the input matrix

## **Comparison of PIM Systems**



## **Comparison of PIM Systems**



<u>2D</u>: **#vertical partitions** that provides the best performance depends on the underlying hardware characteristics

## Various Matrices and PIM Systems





kernelmerge

### Key Takeaway 6

**1D** 

There is no one-size-fits-all parallelization approach for SpMV, since the performance of each scheme depends on the characteristics of the input matrix and the underlying PIM hardware.



### 1D vs 2D

### Up to 2528 PIM Cores, 32-bit float



Best-performing SpMV execution: trades off computation with lower data transfer costs

## 1D vs 2D

### Key Takeaway 7

Expensive data transfers to/from PIM-enabled memory performed via the narrow memory bus impose significant performance overhead to end-to-end SpMV execution. Thus, it is hard to fully exploit all available PIM cores of the system.

### 0.8 0.6 0.4

### Recommendation 7

Design high-speed communication channels and optimized libraries in data transfers to/from PIM-enabled memory, provide hardware support to effectively overlap computation with data transfers in the PIM system, and/or integrate PIM-enabled memory as the main memory of the system.

## SpMV Execution on Various Systems

### **GPU** System **CPU** System Execute the kernel Execute the kernel **GPU** Cores **B** Retrieve **1** Load the Main Memory the final vector input vector bus **1** Host DRAM DRAM **GPU** Global Main Memory CPU Bank Memory bus Host DRAM DRAM DRAM DRAM Brnk CPU SMX2 SMX2 Real PIM Retrieve the Load the Execute Merge the System the kernel partial results partial results input vector **PIM-Enabled Memory** Main Memory PIM PIM PIM Host CPU <u>Core</u> Core Core DRAM DRAM DRAM DRAM DRAM Bar Brik bus bus +

| System |                           | Peak Performance | Bandwidth              | TDP    |                    |
|--------|---------------------------|------------------|------------------------|--------|--------------------|
| CPU    | Intel Xeon<br>Silver 4110 | 660 GFlops       | 23.1 GB/s              | 2x85 W | Processor-         |
| GPU    | NVIDIA<br>Tesla V100      | 14.13 TFlops     | <mark>897 G</mark> B/s | 300 W  | Centric            |
| PIM    | UPMEM<br>1st Gen.         | 4.66 GFlops      | 1.77 TB/s              | 379 W  | Memory-<br>Centric |
|        |                           |                  |                        |        | 48                 |

### • Kernel-Only (COO, 32-bit float):

- CPU = 0.51% of Peak Perf.
- GPU = 0.21% of Peak Perf.
- PIM (1D) = 50.7% of Peak Perf.

| System |                           | Peak Performance | Bandwidth | TDP    |                    |
|--------|---------------------------|------------------|-----------|--------|--------------------|
| CPU    | Intel Xeon<br>Silver 4110 | 660 GFlops       | 23.1 GB/s | 2x85 W | Processor-         |
| GPU    | NVIDIA<br>Tesla V100      | 14.13 TFlops     | 897 GB/s  | 300 W  | Centric            |
| PIM    | UPMEM<br>1st Gen.         | 4.66 GFlops      | 1.77 TB/s | 379 W  | Memory-<br>Centric |
|        |                           |                  |           |        | 10                 |

49

- Kernel-Only (COO, 32-bit float):
  - CPU = 0.51% of Peak Perf.
  - GPU = 0.21% of Peak Perf.
  - PIM (1D) = 50.7% of Peak Perf.
- End-to-End (COO, 32-bit float):
  - CPU = **4.08 GFlop/s**
  - GPU = 1.92 GFlop/s
  - PIM (1D) = 0.11 GFlop/s

| S   | ystem                     | Peak Performance | Bandwidth | TDP    |                    |
|-----|---------------------------|------------------|-----------|--------|--------------------|
| CPU | Intel Xeon<br>Silver 4110 | 660 GFlops       | 23.1 GB/s | 2x85 W | Processor-         |
| GPU | NVIDIA<br>Tesla V100      | 14.13 TFlops     | 897 GB/s  | 300 W  | Centric            |
| PIM | UPMEM<br>1st Gen.         | 4.66 GFlops      | 1.77 TB/s | 379 W  | Memory-<br>Centric |
|     |                           |                  |           |        | F٥                 |

 $\mathbf{D}\mathbf{U}$ 

- Kernel-Energy (COO, 32-bit float):
  - CPU = 0.247 J
  - GPU = 0.051 J
  - PIM (1D) = 0.179 J

<u>PIM</u>: 1.38x higher energy efficiency over CPU

| S   | ystem                     | Peak Performance | Bandwidth | TDP    |                    |
|-----|---------------------------|------------------|-----------|--------|--------------------|
| CPU | Intel Xeon<br>Silver 4110 | 660 GFlops       | 23.1 GB/s | 2x85 W | Processor-         |
| GPU | NVIDIA<br>Tesla V100      | 14.13 TFlops     | 897 GB/s  | 300 W  | Centric            |
| PIM | UPMEM<br>1st Gen.         | 4.66 GFlops      | 1.77 TB/s | 379 W  | Memory-<br>Centric |
|     |                           |                  |           |        |                    |

- Kernel-Energy (COO, 32-bit float):
  - CPU = 0.247 J
  - GPU = 0.051 J
  - PIM (1D) = 0.179 J

| System     | Peak Performance | Bandwidth | TDP |  |
|------------|------------------|-----------|-----|--|
| Intel Veen |                  |           |     |  |

# Many more results in the full paper: <a href="https://arxiv.org/pdf/2201.05072.pdf">https://arxiv.org/pdf/2201.05072.pdf</a>

1st Gen.

Centric 52



## SpMV Kernels for Real PIM Systems

## Key Takeaways from Our Study

### Conclusion

## Conclusion

- SpMV is a fundamental linear algebra kernel for important applications (HPC, machine learning, graph analytics... )
- SpMV is a highly memory-bound kernel in processor-centric systems (e.g., CPU and GPU systems)
- Real near-bank PIM systems can tackle the data movement bottleneck (high parallelism, large aggregate memory bandwidth)
- Key Contributions:
  - *SparseP* : first open-source SpMV library for real PIM systems
  - Comprehensive characterization and analysis of SPMV on the first real PIM system
  - Recommendations to improve multiple aspects of future PIM hardware and software

### Our Work

SparseP: https://github.com/CMU-SAFARI/SparseP

Full Paper: https://arxiv.org/pdf/2201.05072.pdf

## **Ongoing Project**

Accelerating Graph Processing Applications on Real PIM Systems

- Motivation: Graph processing kernels are primarily bottlenecked by data movement between memory and processors
- Our Goals:
  - Design efficient linear and non-linear algebraic graph processing kernels tailored for real PIM systems
  - Provide an extensive characterization analysis of the most widely-used graph processing kernels on a real PIM architecture
- Project Status:
  - Analysis of SparseP kernels in the context of linear algebraic graph processing applications [done]
  - Implementation of various SpMSpV kernels (CSR, COO, CSC formats) [done]
  - Analysis of SpMSpV kernels in the context of linear algebraic graph processing applications [done]
  - Implementation of BFS, SSSP, and PR algorithms using the SpMV and SpMSpV kernels [todo]
  - Implementation of adaptive linear algebraic graph processing applications [todo]
  - Implementation of non-linear algebraic graph processing applications [todo]
  - Comparative analysis of linear and non-linear algebraic graph processing applications on the UpMEM System [todo] 55