Tutorial on Memory-Centric Computing: Processing-Using-Memory

> Geraldo F. Oliveira, Prof. Onur Mutlu
>
> ISCA 2024, 29 June 2024

ETH Zürich



- Introduction to Memory-Centric Computing Systems
- Invited Talk by Prof. Minsoo Rhu: "Memory-Centric Computing Systems – For AI and Beyond"
- Coffee Break
- Real-World Processing-Near-Memory Systems
- Processing-Using-Memory Architectures for Bulk Bitwise Operations
- Invited Talk by Prof. Saugata Ghose:
  "RACER and ReRAM PUM"
- PIM Programming & Infrastructure for PIM Research
- Closing Remarks

# Processing in Memory: Two Approaches

- Processing near Memory
- Processing using Memory

## Starting Simple: Data Copy and Initialization

memmove & memcpy: 5% cycles in Google's datacenter [Kanev+ ISCA'15]





- VM Cloning
- Deduplication
- Page Migration
- Many more

Today's Systems: Bulk Data Copy



Future Systems: In-Memory Copy



# Brief Review: Inside A DRAM Chip

## **Inside a DRAM Chip**



## **Inside a DRAM Chip: Another View**



## **DRAM Cell Operation**



## DRAM Cell Operation (1/3)



## **DRAM Cell Operation (2/3)**



## **DRAM Cell Operation (3/3)**




### RowClone: In-DRAM Row Copy



## RowClone: Intra-Subarray



## RowClone: Latency and Energy Savings



Seshadri et al., "RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization," MICRO 2013.

### More on RowClone

 Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry,
 "RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization"
 Proceedings of the <u>46th International Symposium on Microarchitecture</u> (*MICRO*), Davis, CA, December 2013.
 [<u>Slides (pptx) (pdf)</u>] [<u>Lightning Session Slides (pptx) (pdf)</u>] [<u>Poster (pptx) (pdf)</u>]

### RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri (vseshadr@cs.cmu.edu), Yoongu Kim (yoongukim@cmu.edu), Chris Fallin\* (cfallin@c1f.net), Donghyuk Lee (donghyuk1@cmu.edu), Rachata Ausavarungnirun (rachata@cmu.edu), Gennady Pekhimenko (gpekhime@cs.cmu.edu), Yixin Luo (yixinluo@andrew.cmu.edu), Onur Mutlu (onur@cmu.edu), Phillip B. Gibbons† (phillip.b.gibbons@intel.com), Michael A. Kozuch† (michael.a.kozuch@intel.com), Todd C. Mowry (tcm@cs.cmu.edu)

Carnegie Mellon University †Intel Pittsburgh

### RowClone Extensions and Follow-Up Work

- Can we do faster inter-subarray copy?
  - Yes, see LISA [Chang et al., HPCA 2016]
- Can we enable data movement at smaller granularities within a bank?
  - Yes, see FIGARO [Wang et al., MICRO 2020]
- Can we do better inter-bank copy?
  - Yes, see Network-on-Memory [CAL 2020]
- Can similar ideas and DRAM properties be used to perform computation on data?
  - Yes, see Ambit [Seshadri et al., CAL 2015, MICRO 2017]

## LISA: Increasing Connectivity in DRAM

 Kevin K. Chang, Prashant J. Nair, Saugata Ghose, Donghyuk Lee, Moinuddin K. Qureshi, and Onur Mutlu,
 "Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM"
 Proceedings of the <u>22nd International Symposium on High-</u> <u>Performance Computer Architecture</u> (HPCA), Barcelona, Spain, March 2016.
 [Slides (pptx) (pdf)]
 [Source Code]

### Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM

Kevin K. Chang<sup>†</sup>, Prashant J. Nair<sup>\*</sup>, Donghyuk Lee<sup>†</sup>, Saugata Ghose<sup>†</sup>, Moinuddin K. Qureshi<sup>\*</sup>, and Onur Mutlu<sup>†</sup> <sup>†</sup>Carnegie Mellon University <sup>\*</sup>Georgia Institute of Technology

## Moving Data Inside DRAM?



# Goal: Provide a new substrate to enable wide connectivity between subarrays

## **Key Idea and Applications**

- Low-cost Inter-linked subarrays (LISA)
  - Fast bulk data movement between subarrays
  - Wide datapath via isolation transistors: 0.8% DRAM chip area



LISA is a versatile substrate → new applications

- Fast bulk data copy: copy latency 1.363ms→0.148ms (9.2x) → 66% speedup, -55% DRAM energy
- In-DRAM caching: hot data access latency 48.7ns→21.5ns (2.2x) → 5% speedup
- Fast precharge: precharge latency 13.1ns→5.0ns (2.6x) → 8% speedup


## FIGARO: Fine-Grained In-DRAM Copy

 Yaohua Wang, Lois Orosa, Xiangjun Peng, Yang Guo, Saugata Ghose, Minesh Patel, Jeremie S. Kim, Juan Gómez Luna, Mohammad Sadrosadati, Nika Mansouri Ghiasi, and Onur Mutlu,
 "FIGARO: Improving System Performance via Fine-Grained In-DRAM Data Relocation and Caching"
 Proceedings of the <u>53rd International Symposium on</u> Microarchitecture (MICRO), Virtual, October 2020.

### FIGARO: Improving System Performance via Fine-Grained In-DRAM Data Relocation and Caching

Yaohua Wang<sup>\*</sup> Lois Orosa<sup>†</sup> Xiangjun Peng<sup>⊙\*</sup> Yang Guo<sup>\*</sup> Saugata Ghose<sup>⋄‡</sup> Minesh Patel<sup>†</sup> Jeremie S. Kim<sup>†</sup> Juan Gómez Luna<sup>†</sup> Mohammad Sadrosadati<sup>§</sup> Nika Mansouri Ghiasi<sup>†</sup> Onur Mutlu<sup>†‡</sup>

<sup>\*</sup>National University of Defense Technology <sup>†</sup>ETH Zürich <sup>⊙</sup>Chinese University of Hong Kong <sup>⋄</sup>University of Illinois at Urbana–Champaign <sup>‡</sup>Carnegie Mellon University <sup>§</sup>Institute of Research in Fundamental Sciences

### Network-On-Memory: Fast Inter-Bank Copy

 Seyyed Hossein SeyyedAghaei Rezaei, Mehdi Modarressi, Rachata Ausavarungnirun, Mohammad Sadrosadati, Onur Mutlu, and Masoud Daneshtalab,
 "NoM: Network-on-Memory for Inter-Bank Data Transfer in Highly-Banked Memories"
 IEEE Computer Architecture Letters (CAL), to appear in 2020.

#### NoM: NETWORK-ON-MEMORY FOR INTER-BANK DATA TRANSFER IN HIGHLY-BANKED MEMORIES

Seyyed Hossein SeyyedAghaei Rezaei<sup>1</sup> Mehdi Modarressi<sup>1,3</sup> Rachata Ausavarungnirun<sup>2</sup> Mohammad Sadrosadati<sup>3</sup> Onur Mutlu<sup>4</sup> Masoud Daneshtalab<sup>5</sup>

<sup>1</sup>University of Tehran <sup>2</sup>King Mongkut's University of Technology North Bangkok <sup>3</sup>Institute for Research in Fundamental Sciences <sup>4</sup>ETH Zürich <sup>5</sup>Mälardalens University

## (Truly) In-Memory Computation

- We can support in-DRAM AND, OR, NOT, MAJ
- At low cost
- Using analog computation capability of DRAM
  - Idea: activating multiple rows performs computation
- 30-60X performance and energy improvement
  - Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," MICRO 2017.

- New memory technologies enable even more opportunities
  - Memristors, resistive RAM, phase change mem, STT-MRAM, ...
  - Can operate on data with minimal movement

### In-DRAM AND/OR: Triple Row Activation



### In-DRAM Bulk Bitwise AND/OR Operation

### BULKAND A, B → C

- Semantics: Perform a bitwise AND of two rows A and B and store the result in row C
- R0: reserved zero row; R1: reserved one row
- D1, D2, D3: designated rows for triple activation

1. RowClone A into D1
2. RowClone B into D2
3. RowClone R0 into D3
4. ACTIVATE D1, D2, D3
5. RowClone Result into C
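
The five steps above can be modeled functionally in a few lines of Python. This is a sketch under stated assumptions, not the Ambit hardware: rows are modeled as Python integers used as bit vectors, `row_clone` and `triple_row_activate` are hypothetical helper names, the 8-bit row width is chosen only for readability, and triple-row activation is assumed to leave the bitwise majority of the three rows in all three of them.

```python
ROW_BITS = 8                      # assumed row width for the example only
MASK = (1 << ROW_BITS) - 1

def row_clone(rows, src, dst):
    """Model of RowClone: copy one whole row onto another."""
    rows[dst] = rows[src]

def triple_row_activate(rows, r1, r2, r3):
    """Model of triple-row activation: each bit position settles to the
    majority of the three cells, and all three rows end up holding it."""
    a, b, c = rows[r1], rows[r2], rows[r3]
    result = ((a & b) | (b & c) | (a & c)) & MASK
    rows[r1] = rows[r2] = rows[r3] = result

def bulk_and(rows, a, b, c):
    """BULKAND A, B -> C, following the five listed steps."""
    row_clone(rows, a, "D1")                      # 1. RowClone A into D1
    row_clone(rows, b, "D2")                      # 2. RowClone B into D2
    row_clone(rows, "R0", "D3")                   # 3. RowClone R0 (all zeros) into D3
    triple_row_activate(rows, "D1", "D2", "D3")   # 4. ACTIVATE D1, D2, D3
    row_clone(rows, "D1", c)                      # 5. RowClone result into C

rows = {"R0": 0, "R1": MASK, "D1": 0, "D2": 0, "D3": 0,
        "A": 0b11001010, "B": 0b10100110, "C": 0}
bulk_and(rows, "A", "B", "C")
print(f"{rows['C']:08b}")  # prints 10000010
```

Because the operands are first copied into the designated rows, the source rows A and B are left intact even though the activation overwrites D1, D2, and D3. In this model, cloning R1 instead of R0 in step 3 turns the majority into a bitwise OR.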

### More on In-DRAM Bulk AND/OR

 Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry,
 <u>"Fast Bulk Bitwise AND and OR in DRAM"</u> <u>IEEE Computer Architecture Letters</u> (CAL), April 2015.

## Fast Bulk Bitwise AND and OR in DRAM

Vivek Seshadri\*, Kevin Hsieh\*, Amirali Boroumand\*, Donghyuk Lee\*, Michael A. Kozuch<sup>†</sup>, Onur Mutlu\*, Phillip B. Gibbons<sup>†</sup>, Todd C. Mowry\* \*Carnegie Mellon University <sup>†</sup>Intel Pittsburgh

### In-DRAM NOT: Dual Contact Cell



Figure 5: A dual-contact cell connected to both ends of a sense amplifier

Idea: Feed the negated value in the sense amplifier into a special row

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.



## In-DRAM NOT Operation



Figure 5: Bitwise NOT using a dual contact capacitor

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.

### Performance: In-DRAM Bitwise Operations



Figure 9: Throughput of bitwise operations on various systems.

| Design        | not   | and/or | nand/nor | xor/xnor |
|---------------|-------|--------|----------|----------|
| DDR3          | 93.7  | 137.9  | 137.9    | 137.9    |
| Ambit         | 1.6   | 3.2    | 4.0      | 5.5      |
| Reduction (↓) | 59.5X | 43.9X  | 35.1X    | 25.1X    |

Table 3: DRAM & channel energy (nJ/KB) of bitwise operations. (↓) indicates energy reduction of Ambit over the traditional DDR3-based design.

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.

### Bulk Bitwise Operations in Workloads



[1] Li and Patel, BitWeaving, SIGMOD 2013

[2] Goodwin+, BitFunnel, SIGIR 2017

### In-DRAM Acceleration of Database Queries





Figure 11: Speedup offered by Ambit over baseline CPU with SIMD for BitWeaving

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.

### More on Ambit

 Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry,
 <u>"Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using</u> <u>Commodity DRAM Technology"</u> *Proceedings of the <u>50th International Symposium on</u> <u>Microarchitecture</u> (MICRO), Boston, MA, USA, October 2017.* 

[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)]

Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri<sup>1,5</sup> Donghyuk Lee<sup>2,5</sup> Thomas Mullins<sup>3,5</sup> Hasan Hassan<sup>4</sup> Amirali Boroumand<sup>5</sup> Jeremie Kim<sup>4,5</sup> Michael A. Kozuch<sup>3</sup> Onur Mutlu<sup>4,5</sup> Phillip B. Gibbons<sup>5</sup> Todd C. Mowry<sup>5</sup>

<sup>1</sup>Microsoft Research India <sup>2</sup>NVIDIA Research <sup>3</sup>Intel <sup>4</sup>ETH Zürich <sup>5</sup>Carnegie Mellon University

### In-DRAM Bulk Bitwise Execution

 Vivek Seshadri and Onur Mutlu,
 "In-DRAM Bulk Bitwise Execution Engine" Invited Book Chapter in Advances in Computers, to appear in 2020.
 [Preliminary arXiv version]

### In-DRAM Bulk Bitwise Execution Engine

Vivek Seshadri Microsoft Research India visesha@microsoft.com Onur Mutlu ETH Zürich onur.mutlu@inf.ethz.ch

### SIMDRAM Framework

 Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, Joao Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gomez-Luna, and Onur Mutlu, "SIMDRAM: An End-to-End Framework for Bit-Serial SIMD Computing in DRAM" Proceedings of the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Virtual, March-April 2021.
 [2-page Extended Abstract]
 [Short Talk Slides (pptx) (pdf)]
 [Talk Slides (pptx) (pdf)]
 [Short Talk Video (5 mins)]
 [Full Talk Video (27 mins)]

#### SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM

\*Nastaran Hajinazar<sup>1,2</sup> Nika Mansouri Ghiasi<sup>1</sup> Juan Gómez-Luna<sup>1</sup> Sven Gregorio<sup>1</sup> Mohammed Alser<sup>1</sup> Onur Mutlu<sup>1</sup> João Dinis Ferreira<sup>1</sup> Saugata Ghose<sup>3</sup>

<sup>1</sup>ETH Zürich

<sup>2</sup>Simon Fraser University

<sup>3</sup>University of Illinois at Urbana–Champaign

# **SIMDRAM Framework: Overview**





# **SIMDRAM Framework: Step 1**






### **Step 1: Naïve MAJ/NOT Implementation**



**Naïvely** converting AND/OR/NOT-implementation to MAJ/NOT-implementation leads to an unoptimized circuit

### **Step 1: Efficient MAJ/NOT Implementation**



### Step 1 generates an optimized MAJ/NOT-implementation of the desired operation

<sup>4</sup> L. Amarù et al, "Majority-Inverter Graph: A Novel Data-Structure and Algorithms for Efficient Logic Optimization", DAC, 2014.
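
The gap between a naive and an optimized MAJ/NOT circuit can be illustrated with a 1-bit full adder. The sketch below assumes the standard identities AND(a, b) = MAJ(a, b, 0) and OR(a, b) = MAJ(a, b, 1) for the naive conversion, and uses a well-known three-majority form of the full adder as the optimized version. It is not claimed to be the exact circuit that SIMDRAM's Step 1 produces, and the class and function names are made up for illustration.

```python
from itertools import product

class MajNotCircuit:
    """Evaluates 1-bit logic using only MAJ and NOT, counting MAJ gates."""
    def __init__(self):
        self.maj_gates = 0

    def maj(self, a, b, c):
        self.maj_gates += 1
        return (a & b) | (b & c) | (a & c)

    @staticmethod
    def inv(a):
        return a ^ 1

    # Naive gate-by-gate conversion of AND/OR into MAJ with a constant input.
    def and_(self, a, b):
        return self.maj(a, b, 0)

    def or_(self, a, b):
        return self.maj(a, b, 1)

    def xor_(self, a, b):
        return self.or_(self.and_(a, self.inv(b)), self.and_(self.inv(a), b))

def full_adder_naive(ckt, a, b, cin):
    """AND/OR/NOT full adder with every AND/OR replaced by one MAJ."""
    s = ckt.xor_(ckt.xor_(a, b), cin)
    cout = ckt.or_(ckt.and_(a, b), ckt.and_(cin, ckt.or_(a, b)))
    return s, cout

def full_adder_optimized(ckt, a, b, cin):
    """Majority-based full adder that uses three MAJ gates."""
    cout = ckt.maj(a, b, cin)
    s = ckt.maj(ckt.inv(cout), ckt.maj(a, b, ckt.inv(cin)), cin)
    return s, cout

def maj_gate_count(adder):
    ckt = MajNotCircuit()
    adder(ckt, 1, 0, 1)
    return ckt.maj_gates

# Both versions compute the same function on all eight input combinations.
for a, b, cin in product((0, 1), repeat=3):
    expected = ((a + b + cin) & 1, (a + b + cin) >> 1)
    assert full_adder_naive(MajNotCircuit(), a, b, cin) == expected
    assert full_adder_optimized(MajNotCircuit(), a, b, cin) == expected

naive_gates = maj_gate_count(full_adder_naive)
optimized_gates = maj_gate_count(full_adder_optimized)
print(naive_gates, optimized_gates)  # prints 10 3
```

Each MAJ corresponds to one multi-row activation in DRAM, so fewer MAJ gates translate directly into fewer in-DRAM operations.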



# **SIMDRAM Framework: Step 2**






# **Step 2: µProgram Generation**

- **µProgram:** A series of microarchitectural operations (e.g., ACT/PRE) that SIMDRAM uses to execute a SIMDRAM operation in DRAM
- **Goal of Step 2**: To generate the µProgram that executes the desired SIMDRAM operation in DRAM

**Task 1: Allocate DRAM rows to the operands** 

Task 2: Generate µProgram
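
The two tasks can be sketched as follows. This is a simplified illustration, not SIMDRAM's actual generator: the micro-op names `COPY` and `MAJ3` are hypothetical stand-ins for the ACT/PRE command sequences a real µProgram contains, and the row names (`ZERO`, `T0` to `T2`) are assumptions made for the example.

```python
def allocate_rows(operands, first_free_row=0):
    """Task 1: assign each named operand its own DRAM row index."""
    return {name: first_free_row + i for i, name in enumerate(operands)}

def generate_and_uprogram(rows, a, b, out):
    """Task 2: emit micro-ops computing out = a AND b as MAJ(a, b, 0).

    Each tuple is (micro-op name, row indices...). The names are
    hypothetical placeholders for sequences of ACT/PRE commands.
    """
    return [
        ("COPY", rows[a], rows["T0"]),                  # operand a -> compute row
        ("COPY", rows[b], rows["T1"]),                  # operand b -> compute row
        ("COPY", rows["ZERO"], rows["T2"]),             # constant 0 -> compute row
        ("MAJ3", rows["T0"], rows["T1"], rows["T2"]),   # majority of the three rows
        ("COPY", rows["T0"], rows[out]),                # result -> destination row
    ]

rows = allocate_rows(["ZERO", "T0", "T1", "T2", "A", "B", "C"])
uprogram = generate_and_uprogram(rows, "A", "B", "C")
for micro_op in uprogram:
    print(micro_op)
```

Separating row allocation from command emission keeps the generated sequence independent of where the operands happen to live in the subarray.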



# **SIMDRAM Framework: Step 3**





# **Step 3: µProgram Execution**

- **SIMDRAM control unit:** handles the execution of the µProgram at runtime
- Upon receiving a **bbop instruction**, the control unit:
  1. Loads the µProgram corresponding to the SIMDRAM operation
  2. Issues the sequence of DRAM commands (ACT/PRE) stored in the µProgram to SIMDRAM subarrays to perform the in-DRAM operation
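
The load-then-issue behavior can be sketched as a small interpreter. This is a functional model under assumptions, not the SIMDRAM control unit: the class name, the `bbop_and` instruction name, and the `COPY`/`MAJ3` micro-ops are hypothetical, and each micro-op stands in for a sequence of ACT/PRE commands applied to a simulated subarray whose rows are Python integers.

```python
class ControlUnit:
    """Toy model: maps a bbop name to a µProgram and runs it on a subarray."""

    def __init__(self, uprograms, num_rows):
        self.uprograms = uprograms      # bbop name -> list of micro-ops
        self.rows = [0] * num_rows      # simulated subarray, one int per row

    def execute(self, bbop):
        uprogram = self.uprograms[bbop]     # 1. load the µProgram for this operation
        for command in uprogram:            # 2. issue its commands in order
            self._issue(command)

    def _issue(self, command):
        kind, *args = command
        if kind == "COPY":
            src, dst = args
            self.rows[dst] = self.rows[src]
        elif kind == "MAJ3":
            a, b, c = (self.rows[r] for r in args)
            result = (a & b) | (b & c) | (a & c)
            for r in args:                  # all three activated rows hold the result
                self.rows[r] = result
        else:
            raise ValueError(f"unknown micro-op: {kind}")

# Assumed row layout for the example.
ZERO, T0, T1, T2, A, B, C = range(7)
uprograms = {
    "bbop_and": [
        ("COPY", A, T0), ("COPY", B, T1), ("COPY", ZERO, T2),
        ("MAJ3", T0, T1, T2), ("COPY", T0, C),
    ],
}

cu = ControlUnit(uprograms, num_rows=7)
cu.rows[A] = 0b1100
cu.rows[B] = 0b1010
cu.execute("bbop_and")
print(bin(cu.rows[C]))  # prints 0b1000
```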



# **More in the Paper**



- Handling limited subarray size
- **Security implications**
- Limitations of our framework



# **SIMDRAM Key Results**

### Evaluated on:

- 16 complex in-DRAM operations
- 7 commonly-used real-world applications

### **SIMDRAM provides:**

- 88× and 5.8× the throughput of a CPU and a high-end GPU, respectively, over 16 operations
- 257× and 31× the energy efficiency of a CPU and a high-end GPU, respectively, over 16 operations
- 21× and 2.1× the performance of a CPU and a high-end GPU, respectively, over seven real-world applications


### SIMDRAM: Follow-Ups

#### Limitations of current substrate?

- Computing granularity
- Data layout conversion
- High-latency bit-serial operations
- Assembly-like programming model
- Application scope
- ...

- We are working on even better processing-using-memory substrates
  - One step at a time!

### Limitations of PUD Systems: Overview

#### PUD systems suffer from three sources of inefficiency due to the large and rigid DRAM access granularity

#### **1** SIMD Underutilization

- due to data parallelism variation within and across applications
- leads to throughput and energy waste

#### **2** Limited Computation Support

- due to a lack of low-cost interconnects across columns
- limits PUD operations to only parallel map constructs

#### **3** Challenging Programming Model

- due to a lack of compiler support for PUD systems
- creates a burden on programmers, limiting PUD adoption

### **Problem & Goal**



# DRAM's hierarchical organization can enable <u>fine-grained access</u>



#### **Fine-Grained DRAM:**

#### segments the global wordline to access individual DRAM mats


#### Fine-grained DRAM for energy-efficient DRAM access:

[Cooper-Balis+, 2010]: Fine-Grained Activation for Power Reduction in DRAM
[Udipi+, 2010]: Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores
[Zhang+, 2014]: Half-DRAM
[Ha+, 2016]: Improving Energy Efficiency of DRAM by Exploiting Half Page Row Access
[O'Connor+, 2017]: Fine-Grained DRAM
[Olgun+, 2024]: Sectored DRAM



#### **Fine-grained DRAM for processing-using-DRAM:**

#### **1** Improves SIMD utilization

- for a single PUD operation, only access the DRAM mats with target data
- for multiple PUD operations, execute independent operations concurrently
   → multiple instruction, multiple data (MIMD) execution model

#### **2** Enables low-cost interconnects for vector reduction

- global and local data buses can be used for inter-/intra-mat communication

#### **3** Eases programmability

- SIMD parallelism in a DRAM mat is on par with vector ISAs' SIMD width

### MIMDRAM: Overview

MIMDRAM is a hardware/software co-designed PUD system that enables fine-grained PUD computation at low cost and programming effort

#### Main components of MIMDRAM:

#### **1** Hardware

- DRAM array modification to enable fine-grained PUD computation
- inter- and intra-mat interconnects to enable PUD vector reduction
- control unit design to orchestrate PUD execution

#### **2** Software

- compiler support to transparently generate PUD instructions
- system support to map and execute PUD instructions

### **MIMDRAM:** Modifications to DRAM Chip





### MIMDRAM: Control Unit Design

#### The control unit schedules and orchestrates the execution of multiple PUD operations transparently



### MIMDRAM: Compiler Support

#### Transparently: <u>extract</u> SIMD parallelism from an application, and <u>schedule</u> PUD instructions while maximizing <u>utilization</u>

#### **Three new LLVM-based passes targeting PUD execution**



### **Evaluation:**

#### **Single Application Analysis - Energy Efficiency**



MIMDRAM significantly improves energy efficiency compared to CPU (30.6x), GPU (6.8x), and SIMDRAM (14.3x)

### More on MIMDRAM

 Geraldo F. Oliveira, Ataberk Olgun, Abdullah Giray Yağlıkçı, F. Nisa Bostancı, Juan Gómez-Luna, Saugata Ghose, and Onur Mutlu

" MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Computing"

*Proceedings of the <u>30th International Symposium on High-Performance Computer Architecture</u> (HPCA)*, Edinburgh, Scotland, March 2024.

MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing

Geraldo F. Oliveira<sup>†</sup> Ataberk Olgun<sup>†</sup> Abdullah Giray Yağlıkçı<sup>†</sup> F. Nisa Bostancı<sup>†</sup> Juan Gómez-Luna<sup>†</sup> Saugata Ghose<sup>‡</sup> Onur Mutlu<sup>†</sup>

<sup>†</sup> ETH Zürich <sup>‡</sup> Univ. of Illinois Urbana-Champaign

https://arxiv.org/pdf/2402.19080.pdf

### In-DRAM Physical Unclonable Functions

 Jeremie S. Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu, "The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern DRAM <u>Devices"</u> Proceedings of the <u>24th International Symposium on High-Performance Computer</u> <u>Architecture</u> (HPCA), Vienna, Austria, February 2018.

 [Lightning Talk Video]
 [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]
 [Full Talk Lecture Video (28 minutes)]

#### The DRAM Latency PUF:

Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

> Jeremie S. Kim<sup>†§</sup> Minesh Patel<sup>§</sup> Hasan Hassan<sup>§</sup> Onur Mutlu<sup>§†</sup> <sup>†</sup>Carnegie Mellon University <sup>§</sup>ETH Zürich

### In-DRAM True Random Number Generation

 Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, and Onur Mutlu,
 "D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput"
 Proceedings of the <u>25th International Symposium on High-Performance Computer Architecture</u> (HPCA), Washington, DC, USA, February 2019.
 [Slides (pptx) (pdf)]
 [Full Talk Video (21 minutes)]
 [Full Talk Lecture Video (27 minutes)]
 Top Picks Honorable Mention by IEEE Micro.

### D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput

Jeremie S. Kim<sup>‡§</sup> Minesh Patel<sup>§</sup> Hasan Hassan<sup>§</sup> Lois Orosa<sup>§</sup> Onur Mutlu<sup>§‡</sup>

<sup>‡</sup>Carnegie Mellon University <sup>§</sup>ETH Zürich

### In-DRAM True Random Number Generation

 Ataberk Olgun, Minesh Patel, A. Giray Yaglikci, Haocong Luo, Jeremie S. Kim, F. Nisa Bostanci, Nandita Vijaykumar, Oguz Ergin, and Onur Mutlu,
 <u>"QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips"</u>
 *Proceedings of the <u>48th International Symposium on Computer Architecture</u> (ISCA)*, Virtual, June 2021.
 [Slides (pptx) (pdf)]
 [Short Talk Slides (pptx) (pdf)]
 [Talk Video (25 minutes)]
 [SAFARI Live Seminar Video (1 hr 26 mins)]

#### QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips

Ataberk Olgun, Minesh Patel, A. Giray Yağlıkçı, Haocong Luo, Jeremie S. Kim, F. Nisa Bostancı, Nandita Vijaykumar, Oğuz Ergin, Onur Mutlu

<sup>§</sup>ETH Zürich <sup>†</sup>TOBB University of Economics and Technology <sup>⊙</sup>University of Toronto

### In-DRAM True Random Number Generation

F. Nisa Bostanci, Ataberk Olgun, Lois Orosa, A. Giray Yaglikci, Jeremie S. Kim, Hasan Hassan, Oguz Ergin, and Onur Mutlu,
 "DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators"
 Proceedings of the <u>28th International Symposium on High-Performance Computer</u>
 <u>Architecture</u> (HPCA), Virtual, April 2022.
 [Slides (pptx) (pdf)]
 [Short Talk Slides (pptx) (pdf)]

### DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators

F. Nisa Bostanci<sup>†§</sup> Ataberk Olgun<sup>†§</sup> Lois Orosa<sup>§</sup> A. Giray Yağlıkçı<sup>§</sup>
 Jeremie S. Kim<sup>§</sup> Hasan Hassan<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup>

<sup>†</sup>TOBB University of Economics and Technology

<sup>§</sup>ETH Zürich

https://arxiv.org/pdf/2201.01385.pdf

### In-DRAM Lookup-Table Based Execution

 João Dinis Ferreira, Gabriel Falcao, Juan Gómez-Luna, Mohammed Alser, Lois Orosa, Mohammad Sadrosadati, Jeremie S. Kim, Geraldo F. Oliveira, Taha Shahroodi, Anant Nori, and Onur Mutlu,
 "pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables"
 *Proceedings of the <u>55th International Symposium on Microarchitecture</u> (MICRO)*, Chicago, IL, USA, October 2022.
 [Slides (pptx) (pdf)]
 [Longer Lecture Slides (pptx) (pdf)]
 [Lecture Video (26 minutes)]
 [arXiv version]
 [Source Code (Officially Artifact Evaluated with All Badges)]
 *Officially artifact evaluated as available, reusable and reproducible.*



#### pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

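João Dinis Ferreira<sup>§</sup> Gabriel Falcao<sup>†</sup> Juan Gómez-Luna<sup>§</sup> Mohammed Alser<sup>§</sup> Lois Orosa<sup>§∇</sup> Mohammad Sadrosadati<sup>§</sup> Jeremie S. Kim<sup>§</sup> Geraldo F. Oliveira<sup>§</sup> Taha Shahroodi<sup>‡</sup> Anant Nori<sup>\*</sup> Onur Mutlu<sup>§</sup>

<sup>§</sup>ETH Zürich <sup>†</sup>IT, University of Coimbra <sup>∇</sup>Galicia Supercomputing Center <sup>‡</sup>TU Delft <sup>\*</sup>Intel

https://arxiv.org/pdf/2104.07699.pdf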

### Limitations of Processing-using-DRAM

| Supported operations      | Prior work                                    |
|---------------------------|-----------------------------------------------|
| Data Movement             | RowClone, Seshadri+ 2013<br>LISA, Chang+ 2016 |
| <b>Bitwise Operations</b> | Ambit, Seshadri+ 2017                         |
| Bit Shifting              | DRISA, Li+ 2017                               |
| Arithmetic Operations     | SIMDRAM, Hajinazar & Oliveira+ 2021           |

# Existing Processing-using-DRAM architectures only support a limited range of operations

## The Goal of pLUTo

### **Extend** Processing-using-DRAM to support the execution of **arbitrarily complex operations**



### pLUTo: Key Idea

Replace computation with memory accesses → pLUTo LUT Query operation

[Figure: input (x) → output (f(x))]
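
The idea of trading computation for memory accesses can be shown in software. The sketch below is only an analogy for the pLUTo LUT Query, not its in-DRAM implementation: `build_lut` and `lut_query` are hypothetical helper names, and the 4-bit popcount function is an arbitrary example of f(x).

```python
def build_lut(f, input_bits):
    """Precompute f(x) for every possible input of the given bit width."""
    return [f(x) for x in range(1 << input_bits)]

def lut_query(lut, inputs):
    """Replace computing f(x) with one table access per input element."""
    return [lut[x] for x in inputs]

# Example function: number of set bits in a 4-bit value.
popcount_lut = build_lut(lambda x: bin(x).count("1"), input_bits=4)

inputs = [0, 3, 7, 15]
outputs = lut_query(popcount_lut, inputs)
print(outputs)  # prints [0, 2, 3, 4]
```

Once the table is populated, any function of the same input width costs the same single lookup per element, which is why the approach is not tied to a fixed set of operations. The table size grows exponentially with the input width, so the sketch is practical only for small operands.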





### **System Integration**



### Performance (normalized to area)

Average speedup normalized to area across 7 real-world workloads



pLUTo provides *substantially higher* performance per unit area than *both* the CPU and the GPU

### **Energy Consumption**

Average energy consumption across 7 real-world workloads



pLUTo *significantly reduces energy consumption* compared to processor-centric architectures for various workloads

### More Results in the Paper

- Comparison with FPGA
- Area Overhead Analysis
- Subarray-Level Parallelism
- LUT Loading Overhead
- Circuit-Level Reliability & Correctness
- Range of Supported Operations



#### pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

João Dinis Ferreira<sup>§</sup> Gabriel Falcao<sup>†</sup> Juan Gómez-Luna<sup>§</sup> Mohammed Alser<sup>§</sup> Lois Orosa<sup>§∇</sup> Mohammad Sadrosadati<sup>§</sup> Jeremie S. Kim<sup>§</sup> Geraldo F. Oliveira<sup>§</sup> Taha Shahroodi<sup>‡</sup> Anant Nori<sup>\*</sup> Onur Mutlu<sup>§</sup>

<sup>§</sup>ETH Zürich <sup>†</sup>IT, University of Coimbra <sup>∇</sup>Galicia Supercomputing Center <sup>‡</sup>TU Delft <sup>\*</sup>Intel

https://arxiv.org/pdf/2104.07699.pdf

### SRC TECHCON Presentation

#### Geraldo F. Oliveira

- pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables
- https://arxiv.org/pdf/2104.07699.pdf



pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables, SRC TECHCON 2023



https://youtu.be/9t1FJQ6nNw4?si=bhylWCLZde2DC7os

#### Bulk Bitwise Operations in Real DRAM Chips

 İsmail Emir Yüksel, Yahya Can Tuğrul, Ataberk Olgun, F. Nisa Bostancı, A. Giray Yağlıkçı, Geraldo F. Oliveira, Haocong Luo, Juan Gómez-Luna, Mohammad Sadrosadati, and Onur Mutlu, "Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis," *Proceedings of the <u>30th International Symposium on High-Performance Computer Architecture</u> (HPCA)*, Edinburgh, Scotland, March 2024.

#### **Functionally-Complete Boolean Logic in Real DRAM Chips:** Experimental Characterization and Analysis

İsmail Emir Yüksel Yahya Can Tuğrul Ataberk Olgun F. Nisa Bostancı A. Giray Yağlıkçı Geraldo F. Oliveira Haocong Luo Juan Gómez-Luna Mohammad Sadrosadati Onur Mutlu

ETH Zürich

https://arxiv.org/pdf/2402.18736

### The Capability of COTS DRAM Chips

We **demonstrate** that **COTS DRAM chips**:

1. Can simultaneously activate up to 48 rows in two neighboring subarrays
2. Can perform **NOT operation** with up to **32 output operands**
3. Can perform up to **16-input** AND, NAND, OR, and NOR operations

### **DRAM Testing Infrastructure**

- Developed from DRAM Bender [Olgun+, TCAD'23]\*
- Fine-grained control over DRAM commands, timings, and temperature



\*Olgun et al., "<u>DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips</u>," TCAD, 2023.

### **DRAM Chips Tested**

- 256 DDR4 chips from two major DRAM manufacturers
- Covers different die revisions and chip densities

| Chip Mfr. | #Modules (#Chips) | Die Rev. | Mfr. Date<sup>a</sup> | Chip Density | Chip Org. | Speed Rate |
|-----------|-------------------|----------|-----------------------|--------------|-----------|------------|
| SK Hynix  | 9 (72)            | M        | N/A                   | 4Gb          | x8        | 2666MT/s   |
|           | 5 (40)            | A        | N/A                   | 4Gb          | x8        | 2133MT/s   |
|           | 1 (16)            | A        | N/A                   | 8Gb          | x8        | 2666MT/s   |
|           | 1 (32)            | A        | 18-14                 | 4Gb          | x4        | 2400MT/s   |
|           | 1 (32)            | A        | 16-49                 | 8Gb          | x4        | 2400MT/s   |
|           | 1 (32)            | M        | 16-22                 | 8Gb          | x4        | 2666MT/s   |
| Samsung   | 1 (8)             | F        | 21-02                 | 4Gb          | x8        | 2666MT/s   |
|           | 2 (16)            | D        | 21-10                 | 8Gb          | x8        | 2133MT/s   |
|           | 1 (8)             | A        | 22-12                 | 8Gb          | x8        | 3200MT/s   |


### **Characterization Methodology**

- To understand which and how many rows are simultaneously activated
  - Sweep Row A and Row B addresses



### **Key Results**

COTS DRAM chips have **two distinct** sets of activation patterns in **neighboring subarrays** when two rows are activated with **violated timings**:

- **Exactly the same number** of rows in each subarray are activated
- **Twice as many** rows in one subarray **compared to its neighbor subarray** are activated

A total of **48 rows**


### **Characterization Methodology**

- Sweep Row A and Row B addresses
- Sweep **DRAM chip temperature**





#### **Key Takeaways from In-DRAM NOT Operation**

**Key Takeaway 1**

COTS DRAM chips can perform NOT operations with up to 32 destination rows

**Key Takeaway 2**

Temperature has a small effect on the reliability of NOT operations




### Key Idea

Manipulate the bitline voltage to express a wide variety of functions using multiple-row activation in neighboring subarrays



#### **Two-Input AND and NAND Operations**





[Figure: two-input AND and NAND operations. Figure labels: COM, REF, X; outputs: AND, NAND; convention: $V_{DD} = 1$, $GND = 0$]

#### **Key Takeaways from In-DRAM Operations**

**Key Takeaway 1** 

COTS DRAM chips can perform {2, 4, 8, 16}-input AND, NAND, OR, and NOR operations

**Key Takeaway 2** 

COTS DRAM chips can perform AND, NAND, OR, and NOR operations with very high reliability

**Key Takeaway 3** 

Data pattern slightly affects the reliability of AND, NAND, OR, and NOR operations
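The logical effect of the operations in Key Takeaway 1 can be sketched as a reduction over operand rows. This is a functional model under stated assumptions: rows are Python integers, the supported operand counts are taken from the slide, and the analog mechanism and its reliability are not modeled. The function name `bulk_bitwise` is hypothetical.

```python
from functools import reduce
from operator import and_, or_

SUPPORTED_INPUT_COUNTS = {2, 4, 8, 16}  # operand counts listed on the slide


def bulk_bitwise(op, operand_rows, row_bits=8):
    """Return the bitwise AND/NAND/OR/NOR of all operand rows (functional model)."""
    if len(operand_rows) not in SUPPORTED_INPUT_COUNTS:
        raise ValueError("operand count must be 2, 4, 8, or 16")
    mask = (1 << row_bits) - 1
    if op in ("AND", "NAND"):
        result = reduce(and_, operand_rows)
    elif op in ("OR", "NOR"):
        result = reduce(or_, operand_rows)
    else:
        raise ValueError(f"unsupported operation: {op}")
    if op in ("NAND", "NOR"):
        result = ~result            # the inverted variants complement the result
    return result & mask


operands = [0b11110000, 0b11001100, 0b10101010, 0b11111111]
results = {op: bulk_bitwise(op, operands) for op in ("AND", "NAND", "OR", "NOR")}
```

Each bit position is processed independently, so one call stands in for a row-wide operation across all bitlines.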

### Real Processing Using Memory Prototype

- End-to-end RowClone & TRNG using off-the-shelf DRAM chips
- Idea: Violate DRAM timing parameters to mimic RowClone

#### PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM

Ataberk Olgun<sup>§†</sup> Juan Gómez Luna<sup>§</sup> Konstantinos Kanellopoulos<sup>§</sup> Behzad Salami<sup>§\*</sup> Hasan Hassan<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup> <sup>§</sup>ETH Zürich <sup>†</sup>TOBB ETÜ <sup>\*</sup>BSC

<u>https://arxiv.org/pdf/2111.00082.pdf</u> <u>https://github.com/cmu-safari/pidram</u> <u>https://www.youtube.com/watch?v=qeukNs5XI3g&t=4192s</u>

### PiDRAM

Goal: Develop a flexible platform to explore end-to-end implementations of PuM techniques

- Enable rapid integration via key components

#### Hardware

1. Easy-to-extend Memory Controller
2. ISA-transparent PuM Controller

#### Software

2. Custom Supervisor Software


### PiDRAM Workflow



1. The user application interfaces with the OS via system calls
2. The OS uses the PuM Operations Library (pumolib) to convey operation-related information to the hardware, using
3. STORE instructions that target the memory-mapped registers of the PuM Operations Controller (POC)
4. The POC oversees the execution of a PuM operation (e.g., RowClone, bulk bitwise operations)
5. The scheduler arbitrates between regular (load, store) and PuM operations and issues DRAM commands with custom timings
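The five steps above can be sketched as a toy software model. Everything here is an assumption made for illustration: the register addresses (`POC_BASE`, `REG_SRC`, `REG_DST`, `REG_START`), the function names `sys_rowclone` and `pumolib_rowclone`, and the `Poc` class are hypothetical and are not PiDRAM's actual interface. The model copies a row in a Python dictionary instead of issuing DRAM commands.

```python
# Hypothetical register map; PiDRAM's real layout is not given on the slides.
POC_BASE = 0x4000_0000
REG_SRC, REG_DST, REG_START = POC_BASE, POC_BASE + 8, POC_BASE + 16


class Poc:
    """Toy PuM Operations Controller with memory-mapped registers."""

    def __init__(self, dram, command_log):
        self.dram = dram                    # row index -> list of words
        self.regs = {REG_SRC: 0, REG_DST: 0}
        self.command_log = command_log      # stands in for the scheduler's output

    def store(self, addr, value):
        """Steps 3-4: a STORE to a POC register; REG_START launches the operation."""
        if addr == REG_START:
            src, dst = self.regs[REG_SRC], self.regs[REG_DST]
            # Step 5: a real scheduler would issue DRAM commands with custom
            # timings here; this model only records the request and copies data.
            self.command_log.append(("PUM_ROWCLONE", src, dst))
            self.dram[dst] = list(self.dram[src])
        elif addr in self.regs:
            self.regs[addr] = value
        else:
            raise ValueError(f"unmapped POC register: {addr:#x}")


def pumolib_rowclone(poc, src_row, dst_row):
    """Step 2: library routine that conveys the operation via STOREs."""
    poc.store(REG_SRC, src_row)
    poc.store(REG_DST, dst_row)
    poc.store(REG_START, 1)


def sys_rowclone(poc, src_row, dst_row):
    """Step 1: system-call entry point invoked by the user application."""
    pumolib_rowclone(poc, src_row, dst_row)


dram = {0: [1, 2, 3, 4], 1: [0, 0, 0, 0]}
command_log = []
poc = Poc(dram, command_log)
sys_rowclone(poc, src_row=0, dst_row=1)
```

The point of the layering is that the application never touches the controller directly: only the library knows the register map, so adding a new PuM operation changes the library and the controller, not the application.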

### Real Processing Using Memory Prototype

#### README.md

#### Building a PiDRAM Prototype

To build PiDRAM's prototype on Xilinx ZC706 boards, developers need to use the two sub-projects in this directory. fpga-zynq is a repository branched off of UCB-BAR's fpga-zynq repository. We use fpga-zynq to generate rocket chip designs that support end-to-end DRAM PuM execution. controller-hardware is where we keep the main Vivado project and Verilog sources for PiDRAM's memory controller and the top level system design.

#### **Rebuilding Steps**

1. Navigate into fpga-zynq and read the README file to understand the overall workflow of the repository
   - Follow the readme in fpga-zynq/rocket-chip/riscv-tools to install dependencies
2. Create the Verilog source of the rocket chip design using the ZynqCopyFPGAConfig
   - Navigate into zc706, then run `make rocket CONFIG=ZynqCopyFPGAConfig -j<number of cores>`
3. Copy the generated Verilog file (should be under zc706/src) and overwrite the same file in controller-hardware/source/hdl/impl/rocket-chip
4. Open the Vivado project in controller-hardware/Vivado\_Project using Vivado 2016.2
5. Generate a bitstream
6. Copy the bitstream (system\_top.bit) to fpga-zynq/zc706
7. Use ./build\_script.sh to generate the new boot.bin under fpga-images-zc706; you can use this file to program the FPGA using the SD card
   - For details, follow the relevant instructions in fpga-zynq/README.md

You can run programs compiled with the RISC-V Toolchain supplied within the fpga-zynq repository. To install the toolchain, follow the instructions under fpga-zynq/rocket-chip/riscv-tools.

#### **Generating DDR3 Controller IP sources**

We cannot provide the sources for the Xilinx PHY IP we use in PiDRAM's memory controller due to licensing issues. We describe here how to regenerate them using Vivado 2016.2. First, you need to generate the IP RTL files:

1. Open IP Catalog
2. Find "Memory Interface Generator (MIG 7 Series)" IP and double click


#### **Microbenchmark Copy/Initialization Throughput**



In-DRAM copy and initialization improve throughput by 119x and 89x, respectively

### **PiDRAM is Open Source**

https://github.com/CMU-SAFARI/PiDRAM

Repository: CMU-SAFARI/PiDRAM (public). Top-level contents: controller-hardware, fpga-zynq, README.md.

About: PiDRAM is the first flexible end-to-end framework that enables system integration studies and evaluation of real Processing-using-Memory (PuM) techniques. Preprint: https://arxiv.org/abs/2111.00082

### **Extended Version on ArXiv**

https://arxiv.org/abs/2111.00082

**PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM**

Ataberk Olgun, Juan Gómez Luna, Konstantinos Kanellopoulos, Behzad Salami, Hasan Hassan, Oğuz Ergin, Onur Mutlu

Submitted on 29 Oct 2021 (v1), last revised 19 Dec 2021 (v3)

Processing-using-memory (PuM) techniques leverage the analog operation of memory cells to perform computation. Several recent works have demonstrated PuM techniques in off-the-shelf DRAM devices. Since DRAM is the dominant memory technology as main memory in current computing systems, these PuM techniques represent an opportunity for alleviating the data movement bottleneck at very low cost. However, system integration of PuM techniques imposes non-trivial challenges that are yet to be solved. Design space exploration of potential solutions to the PuM integration challenges requires appropriate tools to develop necessary hardware and software components. Unfortunately, current specialized DRAM-testing platforms, or system simulators do not provide the flexibility and/or the holistic system view that is necessary to deal with PuM integration challenges.

We design and develop PiDRAM, the first flexible end-to-end framework that enables system integration studies and evaluation of real PuM techniques. PiDRAM provides software and hardware components to rapidly integrate PuM techniques across the whole system software and hardware stack (e.g., necessary modifications in the operating system, memory controller). We implement PiDRAM on an FPGA-based platform along with an open-source RISC-V system. Using PiDRAM, we implement and evaluate two state-of-the-art PuM techniques: in-DRAM (i) copy and initialization, (ii) true random number generation. Our results show that the in-memory copy and initialization techniques can improve the performance of bulk copy operations by 12.6x and bulk initialization operations by 14.6x on a real system. Implementing the true random number generator requires only 190 lines of Verilog and 74 lines of C code using PiDRAM's software and hardware components.

Comments: 15 pages, 12 figures. Subjects: Hardware Architecture (cs.AR). Cite as: arXiv:2111.00082 [cs.AR], https://doi.org/10.48550/arXiv.2111.00082



### Long Talk + Tutorial on YouTube

https://youtu.be/s\_z\_S6FYpC8

### In-DRAM Physical Unclonable Functions

 Jeremie S. Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu, "The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern DRAM Devices" Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018. [Lightning Talk Video] [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Full Talk Lecture Video (28 minutes)]

#### The DRAM Latency PUF:

#### Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

Jeremie S. Kim<sup>†§</sup> Minesh Patel<sup>§</sup> Hasan Hassan<sup>§</sup> Onur Mutlu<sup>§†</sup> <sup>†</sup>Carnegie Mellon University <sup>§</sup>ETH Zürich

#### In-DRAM True Random Number Generation

 Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, and Onur Mutlu, "D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput" Proceedings of the <u>25th International Symposium on High-Performance Computer Architecture</u> (HPCA), Washington, DC, USA, February 2019. [Slides (pptx) (pdf)] [Full Talk Video (21 minutes)] [Full Talk Lecture Video (27 minutes)] Top Picks Honorable Mention by IEEE Micro.

#### D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput

Jeremie S. Kim<sup>‡§</sup> Minesh Patel<sup>§</sup> Hasan Hassan<sup>§</sup> Lois Orosa<sup>§</sup> Onur Mutlu<sup>§‡</sup> <sup>‡</sup>Carnegie Mellon University <sup>§</sup>ETH Zürich

#### In-DRAM True Random Number Generation

 Ataberk Olgun, Minesh Patel, A. Giray Yaglikci, Haocong Luo, Jeremie S. Kim, F. Nisa Bostanci, Nandita Vijaykumar, Oguz Ergin, and Onur Mutlu, <u>"QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips"</u> *Proceedings of the <u>48th International Symposium on Computer Architecture</u> (ISCA)*, Virtual, June 2021.
 [Slides (pptx) (pdf)]
 [Short Talk Slides (pptx) (pdf)]
 [Talk Video (25 minutes)]
 [SAFARI Live Seminar Video (1 hr 26 mins)]

#### QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips

Ataberk Olgun, Minesh Patel, A. Giray Yağlıkçı, Haocong Luo, Jeremie S. Kim, F. Nisa Bostancı, Nandita Vijaykumar, Oğuz Ergin, Onur Mutlu

ETH Zürich, TOBB University of Economics and Technology, University of Toronto

### In-DRAM True Random Number Generation

F. Nisa Bostanci, Ataberk Olgun, Lois Orosa, A. Giray Yaglikci, Jeremie S. Kim, Hasan Hassan, Oguz Ergin, and Onur Mutlu,
 "DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators"
 Proceedings of the <u>28th International Symposium on High-Performance Computer</u>
 <u>Architecture</u> (HPCA), Virtual, April 2022.
 [Slides (pptx) (pdf)]
 [Short Talk Slides (pptx) (pdf)]

#### DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators

F. Nisa Bostanci<sup>†§</sup> Ataberk Olgun<sup>†§</sup> Lois Orosa<sup>§</sup> A. Giray Yağlıkçı<sup>§</sup> Jeremie S. Kim<sup>§</sup> Hasan Hassan<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup>

<sup>†</sup>TOBB University of Economics and Technology <sup>§</sup>ETH Zürich


https://arxiv.org/pdf/2201.01385.pdf

#### Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories

Shuangchen Li<sup>1</sup>, Cong Xu<sup>2</sup>, Qiaosha Zou<sup>1,5</sup>, Jishen Zhao<sup>3</sup>, Yu Lu<sup>4</sup>, and Yuan Xie<sup>1</sup>

University of California, Santa Barbara<sup>1</sup>, Hewlett Packard Labs<sup>2</sup>, University of California, Santa Cruz<sup>3</sup>, Qualcomm Inc.<sup>4</sup>, Huawei Technologies Inc.<sup>5</sup>

{shuangchenli, yuanxie}@ece.ucsb.edu<sup>1</sup>

<u>https://cseweb.ucsd.edu/~jzhao/files/Pinatubo-dac2016.pdf</u>

#### Pinatubo: RowClone and Bitwise Ops in PCM



Figure 2: Overview: (a) Computing-centric approach, moving tons of data to CPU and write back. (b) The proposed Pinatubo architecture, performs *n*-row bitwise operations inside NVM in one step.

## In-Flash Bulk Bitwise Execution

Jisung Park, Roknoddin Azizi, Geraldo F. Oliveira, Mohammad Sadrosadati, Rakesh Nadig, David Novo, Juan Gómez-Luna, Myungsuk Kim, and <u>Onur Mutlu</u>,
 "Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory"
 *Proceedings of the 55th International Symposium on Microarchitecture (MICRO)*, Chicago, IL, USA, October 2022.
 [Slides (pptx) (pdf)]
 [Longer Lecture Slides (pptx) (pdf)]
 [Lecture Video (44 minutes)]
 [arXiv version]

### Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory

Jisung Park<sup>§∇</sup> Roknoddin Azizi<sup>§</sup> Geraldo F. Oliveira<sup>§</sup> Mohammad Sadrosadati<sup>§</sup> Rakesh Nadig<sup>§</sup> David Novo<sup>†</sup> Juan Gómez-Luna<sup>§</sup> Myungsuk Kim<sup>‡</sup> Onur Mutlu<sup>§</sup>

<sup>§</sup>ETH Zürich <sup>∇</sup>POSTECH <sup>†</sup>LIRMM, Univ. Montpellier, CNRS <sup>‡</sup>Kyungpook National University

#### https://arxiv.org/pdf/2209.05566.pdf


# Aside: In-Memory Crossbar Computation



(a) Multiply-Accumulate operation

(b) Vector-Matrix Multiplier

Fig. 1. (a) Using a bitline to perform an analog sum of products operation. (b) A memristor crossbar used as a vector-matrix multiplier.
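The caption's two panels can be sketched numerically. On one bitline, the accumulated current is the sum of products of input voltages and cell conductances, $I = \sum_i V_i G_i$; a crossbar applies the same inputs to every column, giving a vector-matrix product. The sketch below assumes an ideal crossbar (no wire resistance, noise, or DAC/ADC quantization), and the function names `bitline_current` and `crossbar_vmm` are hypothetical.

```python
def bitline_current(voltages, conductances):
    """Analog sum of products on one bitline: I = sum_i V_i * G_i."""
    if len(voltages) != len(conductances):
        raise ValueError("one conductance is needed per input voltage")
    return sum(v * g for v, g in zip(voltages, conductances))


def crossbar_vmm(voltages, conductance_matrix):
    """Ideal crossbar: one output current per bitline (matrix column)."""
    columns = zip(*conductance_matrix)   # rows are wordlines, columns are bitlines
    return [bitline_current(voltages, column) for column in columns]


# Three wordlines (inputs) and two bitlines (outputs), in arbitrary units.
V = [1.0, 0.5, 0.0]
G = [
    [2.0, 1.0],
    [4.0, 0.0],
    [8.0, 3.0],
]
currents = crossbar_vmm(V, G)
```

All bitlines accumulate their currents at the same time, which is why the whole vector-matrix product takes a single analog step in the idealized model.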


Shafiee+, "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars", ISCA 2016.

