Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis

#### İsmail Emir Yüksel

Yahya C. Tuğrul Ataberk Olgun F. Nisa Bostancı A. Giray Yağlıkçı Geraldo F. Oliveira Haocong Luo Juan Gómez-Luna Mohammad Sadrosadati Onur Mutlu

ETH Zürich



## **Executive Summary**

- <u>Motivation</u>: Processing-using-DRAM can alleviate the performance and energy bottlenecks caused by data movement
  - Prior works show that existing DRAM chips can perform three-input majority and two-input AND and OR operations
- <u>Problem</u>: Proof-of-concept demonstrations on commercial off-the-shelf (COTS) DRAM chips do not provide
  - functionally-complete operations (e.g., NAND or NOR)
  - NOT operation
  - AND and OR operations with more than two inputs
- <u>Experimental Study</u>: 256 DDR4 chips from two major manufacturers
- <u>Key Results</u>:
  - COTS DRAM chips can perform NOT and {2, 4, 8, 16}-input AND, NAND, OR, and NOR operations with very high reliability (>94% success rate)
  - Data pattern and temperature only slightly affect the reliability of these operations (<1.98% decrease in success rate)

### Outline

Background

**Goal & Overview** 

**Experimental Methodology** 

Multiple-Row Activation in Neighboring Subarrays

**NOT Operation** 

AND, NAND, OR, and NOR Operations

Conclusion


### **Data Movement Bottleneck**

- Today's computing systems are processor-centric
- All data is processed in the processor  $\rightarrow$  at great system cost



More than 60% of the total system energy is spent on data movement<sup>1</sup>

<sup>1</sup> A. Boroumand et al., "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks," ASPLOS, 2018

### **Processing-In-Memory (PIM)**

- Two main approaches for Processing-In-Memory:
  1. **Processing-<u>Near</u>-Memory**: PIM logic is added near the memory arrays or to the logic layer of 3D-stacked memory
  2. **Processing-<u>Using</u>-Memory**: uses the analog operational principles of memory cells to perform computation



### **DRAM Organization**

**DRAM Module** 



### **DRAM Open Bitline Architecture**








## **DRAM Operation**

#### **DRAM Subarray**



ACTIVATE (ACT): Fetch the row's content into the sense amplifiers

Column Access (RD/WR): Read/Write the target column and drive to I/O

PRECHARGE (PRE): Prepare the subarray for a new ACTIVATE
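The three commands above can be read as a small state machine. The following is a minimal sketch, not a model of any real DRAM controller: the `Subarray` class, its method names, and the rule that a second ACT without a PRE raises an error are all assumptions made for illustration.

```python
class Subarray:
    """Toy model of a DRAM subarray with one row of sense amplifiers."""

    def __init__(self, num_rows, num_cols):
        self.cells = [[0] * num_cols for _ in range(num_rows)]
        self.open_row = None    # index of the activated row, if any
        self.sense_amps = None  # latched copy of the activated row

    def act(self, row):
        # ACTIVATE: fetch the row's content into the sense amplifiers.
        if self.open_row is not None:
            raise RuntimeError("PRE required before a new ACT")
        self.open_row = row
        self.sense_amps = list(self.cells[row])

    def rd(self, col):
        # Column read: return the target column from the sense amplifiers.
        self._require_open_row()
        return self.sense_amps[col]

    def wr(self, col, bit):
        # Column write: update the sense amplifiers, which drive the open row.
        self._require_open_row()
        self.sense_amps[col] = bit
        self.cells[self.open_row][col] = bit

    def pre(self):
        # PRECHARGE: close the row and prepare for a new ACTIVATE.
        self.open_row = None
        self.sense_amps = None

    def _require_open_row(self):
        if self.open_row is None:
            raise RuntimeError("no row is activated")


subarray = Subarray(num_rows=4, num_cols=8)
subarray.act(2)
subarray.wr(3, 1)
subarray.pre()
subarray.act(2)
stored_bit = subarray.rd(3)
```

This toy model enforces the nominal command order. The experiments in this talk deliberately violate the timing between these commands, which the model does not capture.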


### **Our Goal**

Understand the **capability** of COTS DRAM chips **beyond just storing data**

Rigorously **characterize the reliability** of this capability



COTS: Commercial Off-The-Shelf

### The Capability of COTS DRAM Chips

We **demonstrate** that **COTS DRAM chips**:

1. Can simultaneously activate up to 48 rows in two neighboring subarrays
2. Can perform **NOT operation** with up to **32 output operands**
3. Can perform up to **16-input** AND, NAND, OR, and NOR operations


## **DRAM Testing Infrastructure**

- Developed from DRAM Bender [Olgun+, TCAD'23]\*
- Fine-grained control over DRAM commands, timings, and temperature



\*Olgun et al., "<u>DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips</u>," TCAD, 2023.

## **DRAM Chips Tested**

- 256 DDR4 chips from two major DRAM manufacturers
- Covers different die revisions and chip densities

| Chip Mfr. | #Modules (#Chips) | Die Rev. | Mfr. Date<sup>a</sup> | Chip Density | Chip Org. | Speed Rate |
|-----------|-------------------|----------|-----------------------|--------------|-----------|------------|
| SK Hynix  | 9 (72)            | M        | N/A                   | 4Gb          | x8        | 2666MT/s   |
| SK Hynix  | 5 (40)            | A        | N/A                   | 4Gb          | x8        | 2133MT/s   |
| SK Hynix  | 1 (16)            | A        | N/A                   | 8Gb          | x8        | 2666MT/s   |
| SK Hynix  | 1 (32)            | A        | 18-14                 | 4Gb          | x4        | 2400MT/s   |
| SK Hynix  | 1 (32)            | A        | 16-49                 | 8Gb          | x4        | 2400MT/s   |
| SK Hynix  | 1 (32)            | M        | 16-22                 | 8Gb          | x4        | 2666MT/s   |
| Samsung   | 1 (8)             | F        | 21-02                 | 4Gb          | x8        | 2666MT/s   |
| Samsung   | 2 (16)            | D        | 21-10                 | 8Gb          | x8        | 2133MT/s   |
| Samsung   | 1 (8)             | A        | 22-12                 | 8Gb          | x8        | 3200MT/s   |

## **Testing Methodology**

- Carefully sweep:
  - Row addresses: Row A and Row B
  - Timing parameters: Between ACT  $\rightarrow$  PRE and PRE  $\rightarrow$  ACT
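The sweep above can be sketched as a generator of command sequences. This is an illustrative sketch only: `build_test`, `sweep`, the `WAIT` pseudo-command, and the delay values are hypothetical names and numbers, not the DRAM Bender API or the timings used in the study.

```python
from itertools import product


def build_test(row_a, row_b, t_act_to_pre, t_pre_to_act):
    """Return one test as a list of (command, argument) tuples."""
    return [
        ("ACT", row_a),
        ("WAIT", t_act_to_pre),  # delay between ACT and PRE
        ("PRE", None),
        ("WAIT", t_pre_to_act),  # delay between PRE and the second ACT
        ("ACT", row_b),
    ]


def sweep(rows, act_to_pre_delays, pre_to_act_delays):
    """Yield a test for every ordered pair of distinct rows and every timing pair."""
    for row_a, row_b in product(rows, repeat=2):
        if row_a == row_b:
            continue
        for t1, t2 in product(act_to_pre_delays, pre_to_act_delays):
            yield build_test(row_a, row_b, t1, t2)


# Hypothetical sweep: 3 rows, delays expressed in arbitrary clock cycles.
tests = list(sweep(range(3), act_to_pre_delays=[1, 2], pre_to_act_delays=[1, 2, 3]))
```

Using a generator keeps memory use flat even when the number of row pairs is large.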






## **Key Observation**

Activating two rows in **quick succession** can **simultaneously** activate **multiple rows in neighboring subarrays** 



## **Characterization Methodology**

- To understand which and how many rows are simultaneously activated
  - Sweep Row A and Row B addresses



## **Key Results**

COTS DRAM chips have **two distinct** sets of activation patterns in **neighboring subarrays** when two rows are activated with **violated timings** 

- **Exactly the same number** of rows in each subarray are activated: up to **16 rows** in each of the two subarrays, for a total of **32 rows**
- **Twice as many** rows in one subarray **compared to its neighbor subarray** are activated: up to **16 rows** in Subarray X and up to **32 rows** in Subarray Y (the two subarrays share sense amplifiers), for a total of **48 rows**

## Key Takeaway

**COTS DRAM chips can simultaneously activate up to 48 rows in two neighboring subarrays**


(More results in the paper)


https://arxiv.org/pdf/2402.18736.pdf




### **Connect rows in neighboring subarrays** through **a NOT gate** by simultaneously activating rows
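One way to picture this idea is to treat the shared sense amplifier as driving two complementary terminals. This is a rough sketch under that assumption: the function name `not_operation` and the idealized, error-free behavior are illustrative and are not taken from the slides.

```python
def not_operation(src_row, num_dst_rows):
    """Idealized NOT across neighboring subarrays.

    Assumption: the sense amplifier drives its two terminals to complementary
    values, so rows attached to the opposite terminal (in the neighboring
    subarray) receive the inverted source data.
    """
    bitline = list(src_row)                     # terminal on the source side
    bitline_bar = [1 - bit for bit in bitline]  # complementary terminal
    # Every simultaneously activated destination row stores the same value.
    return [list(bitline_bar) for _ in range(num_dst_rows)]


destinations = not_operation([1, 0, 1, 1], num_dst_rows=32)
```

Real chips do not behave this ideally, which is why the success rate of each destination cell is characterized.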














### **Characterization Methodology**

- Sweep Row A and Row B addresses
- Sweep DRAM chip temperature





## **Reliability Metric**

Success Rate (for a DRAM cell)

**Percentage of trials** where the **correct output** of a tested operation is stored in the cell
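The metric can be computed per cell as follows. This is a minimal sketch: the function name `cell_success_rates` and the input layout, a list of (observed, expected) bit-vector pairs, are assumptions for illustration.

```python
def cell_success_rates(trials):
    """Return the success rate (in percent) of each cell.

    trials: list of (observed_bits, expected_bits) pairs, one pair per trial,
    where both lists have one entry per DRAM cell.
    """
    num_cells = len(trials[0][0])
    correct = [0] * num_cells
    for observed, expected in trials:
        for index, (seen, wanted) in enumerate(zip(observed, expected)):
            if seen == wanted:
                correct[index] += 1
    return [100.0 * count / len(trials) for count in correct]


# Two cells, four trials: cell 0 is always correct, cell 1 is correct twice.
example_trials = [
    ([1, 0], [1, 0]),
    ([1, 1], [1, 0]),
    ([0, 0], [0, 0]),
    ([0, 1], [0, 0]),
]
rates = cell_success_rates(example_trials)
```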



### **Key Takeaways from In-DRAM NOT Operation**

**Key Takeaway 1**

COTS DRAM chips can perform NOT operations with up to 32 destination rows

**Key Takeaway 2**

Temperature has a small effect on the reliability of NOT operations



### **Performing NOT in COTS DRAM Chips**



### COTS DRAM chips can perform NOT operations with up to 32 destination rows

## **Impact of Temperature**

We use destination cells that can perform the NOT operation with a >90% success rate at 50°C






# Key Idea

Manipulate the bitline voltage to express a wide variety of functions using multiple-row activation in neighboring subarrays







\*Gao et al., "FracDRAM: Fractional Values in Off-the-Shelf DRAM," in MICRO, 2022.





 $V_{DD} = 1 \& GND = 0$ 


















#### Many-Input AND, NAND, OR, and NOR Operations

We can express AND, NAND, OR, and NOR operations by carefully manipulating the **reference voltage**
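A simple way to see why the reference voltage selects the operation is an idealized charge-sharing model, sketched below. Everything here is an assumption for illustration: the bitline is taken to settle at the mean of the activated cells, the reference side is built from N-1 full-level cells plus one cell at $V_{DD}/2$, and the names `average`, `reference_voltage`, and `sense` are hypothetical. The paper describes the mechanism actually used.

```python
from itertools import product

VDD = 1.0  # logic-1 level; GND (logic-0) is 0.0


def average(levels):
    """Idealized charge sharing: the bitline settles at the mean cell voltage."""
    return sum(levels) / len(levels)


def reference_voltage(op, num_inputs):
    """Assumed reference rows: N-1 full-level cells plus one cell at VDD/2."""
    full_level = VDD if op == "AND" else 0.0
    return average([full_level] * (num_inputs - 1) + [VDD / 2])


def sense(op, inputs):
    """Return (result, complement), i.e., (AND, NAND) or (OR, NOR)."""
    compute = average([VDD if bit else 0.0 for bit in inputs])
    result = int(compute > reference_voltage(op, len(inputs)))
    return result, 1 - result


# Check the model against Python's all()/any() for every 4-input combination.
model_matches = all(
    sense("AND", bits) == (int(all(bits)), 1 - int(all(bits)))
    and sense("OR", bits) == (int(any(bits)), 1 - int(any(bits)))
    for bits in product([0, 1], repeat=4)
)
```

In this model, a reference just below $V_{DD}$ makes the comparison succeed only when every input is 1 (AND), and a reference just above GND makes it succeed when any input is 1 (OR). The complementary sense-amplifier terminal then holds NAND or NOR.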


(More details in the paper)

https://arxiv.org/pdf/2402.18736.pdf

# **Characterization Methodology**

• Sweep Row A and Row B addresses



### **Key Takeaways from In-DRAM Operations**

**Key Takeaway 1** 

COTS DRAM chips can perform {2, 4, 8, 16}-input AND, NAND, OR, and NOR operations

**Key Takeaway 2** 

COTS DRAM chips can perform AND, NAND, OR, and NOR operations with very high reliability

**Key Takeaway 3** 

Data pattern slightly affects the reliability of AND, NAND, OR, and NOR operations

### Performing AND, NAND, OR, and NOR



COTS DRAM chips can perform {2, 4, 8, 16}-input AND, NAND, OR, and NOR operations

## Performing AND, NAND, OR, and NOR



COTS DRAM chips can perform 16-input AND, NAND, OR, and NOR operations with very high success rate (>94%)

# **Impact of Data Pattern**



Data pattern slightly affects the reliability of AND, NAND, OR, and NOR operations

# **More in the Paper**

- Detailed hypotheses & key ideas to perform
  - NOT operation
  - Many-input AND, NAND, OR, and NOR operations
- How the reliability of bitwise operations is affected by
  - The location of activated rows
  - Temperature (for AND, NAND, OR, and NOR)
  - DRAM speed rate
  - Chip density and die revision
- Discussion on the limitations of COTS DRAM chips

# Available on arXiv

#### Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis

İsmail Emir Yüksel Yahya Can Tuğrul Ataberk Olgun F. Nisa Bostancı A. Giray Yağlıkçı Geraldo F. Oliveira Haocong Luo Juan Gómez-Luna Mohammad Sadrosadati Onur Mutlu

#### ETH Zürich

Processing-using-DRAM (PuD) is an emerging paradigm that leverages the analog operational properties of DRAM circuitry to enable massively parallel in-DRAM computation. PuD has the potential to significantly reduce or eliminate costly data movement between processing elements and main memory. A common approach for PuD architectures is to make use of bulk bitwise computation (e.g., AND, OR, NOT). Prior works experimentally demonstrate three-input MAJ (i.e., MAJ3) and two-input AND and OR operations in commercial off-the-shelf (COTS) DRAM chips. Yet, demonstrations on COTS DRAM chips do not provide a functionally complete set of operations (e.g., NAND or AND and NOT).

We experimentally demonstrate that COTS DRAM chips are capable of performing 1) functionally-complete Boolean operations: NOT, NAND, and NOR and 2) many-input (i.e., more than two-input) AND and OR operations. We present an extensive …

… systems and applications [12, 13]. Processing-using-DRAM (PuD) [29–32] is a promising paradigm that can alleviate the data movement bottleneck. PuD uses the analog operational properties of the DRAM circuitry to enable massively parallel in-DRAM computation. Many prior works [29–53] demonstrate that PuD can greatly reduce or eliminate data movement.

A widely used approach for PuD is to perform bulk bitwise operations, i.e., bitwise operations on large bit vectors. To perform bulk bitwise operations using DRAM, prior works propose modifications to the DRAM circuitry [29–31, 33, 35, 36, 43, 44, 46, 48–58]. Recent works [38, 41, 42, 45] experimentally demonstrate the feasibility of executing data copy & initialization [42, 45], i.e., the RowClone operation [49], and a subset of bitwise operations, i.e., three-input bitwise majority (MAJ3) and two-input AND and OR operations in unmodified commercial off-the-shelf (COTS) DRAM chips by operating beyond

### https://arxiv.org/pdf/2402.18736.pdf




# Conclusion

- We experimentally demonstrate that commercial off-the-shelf (COTS) DRAM chips can perform:
  - **Functionally-complete** Boolean operations: NOT, NAND, and NOR
  - Up to 16-input AND, NAND, OR, and NOR operations
- We characterize the success rate of these operations on 256 COTS DDR4 chips from two major manufacturers
- We highlight **two key results**:
  - We can perform NOT and {2, 4, 8, 16}-input AND, NAND, OR, and NOR operations on COTS DRAM chips with very high success rates (>94%)
  - Data pattern and temperature only slightly affect the reliability of these operations

We believe these empirical results demonstrate the promising potential of using DRAM as a computation substrate




Simultaneous Many-Row Activation in Off-the-Shelf DRAM Chips: Experimental Characterization and Analysis

#### İsmail Emir Yüksel

Yahya C. Tuğrul F. Nisa Bostancı Geraldo F. Oliveira A. Giray Yağlıkçı Ataberk Olgun Melina Soysal Haocong Luo Juan Gómez-Luna Mohammad Sadrosadati Onur Mutlu

ETH Zürich



# **Executive Summary**

#### **Motivation:**


- Processing-Using-DRAM (PUD) alleviates data movement bottlenecks
- Commercial off-the-shelf (COTS) DRAM chips can perform three-input majority (MAJ3) and in-DRAM copy operations

**Goal:** To experimentally analyze and understand

- The computational capability of COTS DRAM chips beyond that of prior works
- The robustness of such capability under various operating conditions

#### Experimental Study: 120 DDR4 chips from two major manufacturers

- COTS DRAM chips can perform MAJ5, MAJ7, and MAJ9 operations and copy one DRAM row to up to 31 different rows at once
- Storing multiple redundant copies of MAJ's input operands (i.e., input replication) drastically increases robustness (>30% higher success rate)
- **Operating conditions** (temperature, voltage, and data pattern) **affect** the robustness of in-DRAM operations (by up to 11.52% success rate)

#### https://github.com/CMU-SAFARI/SiMRA-DRAM

### **Leveraging Simultaneous Many-Row Activation**












### In-DRAM Multiple Row Copy (Multi-RowCopy)

Simultaneously activate many rows to copy **one row's content** to **multiple destination rows** 

RowClone [Seshadri+ MICRO'13] copies the source row's content to a single destination row; Multi-RowCopy copies it to multiple destination rows at once.

# **Robustness of Multi-RowCopy**



COTS DRAM chips can copy one row's content to up to 31 rows with a very high success rate

# Available on arXiv



levels. We extensively characterize 120 COTS DDR4 chips from two major manufacturers. We highlight four key results of our study. First, COTS DRAM chips are capable of 1) simultaneously activating up to 32 rows (i.e., simultaneous many-row activation), 2) executing a majority of X (MAJX) operation where X>3 (i.e., MAJ5, MAJ7, and MAJ9 operations), and 3) copying a DRAM row (concurrently) to up to 31 other DRAM rows, which we call Multi-RowCopy. Second, storing multiple copies of MAJX's input operands on all simultaneously activated rows drastically increases the success rate (i.e., the percentage of DRAM cells that correctly perform the computation) of the MAJX operation. For example, MAJ3 with 32-row activation (i.e.,

based arithmetic [64, 66, 69, 72, 91, 127, 130, 131], and lookup table based operations [82, 106, 107, 132]. We refer to DRAM-based PUM as Processing-Using-DRAM (PUD) and the computation performed using DRAM cells as PUD operations.

PUD benefits from the bulk data parallelism in DRAM devices to perform bulk bitwise PUD operations. Prior works show that bulk bitwise operations are used in a wide variety of important applications, including databases and web search [64, 67, 79, 130, 133-140], data analytics [64, 141-144], graph processing [56, 80, 94, 130, 145], genome analysis [60, 99, 146-149], cryptography [150, 151], set operations [56, 64], and hyperdimensional computing [152–154].

#### https://arxiv.org/pdf/2405.06081

### **Our Work is Open Source and Artifact Evaluated**

Artifact badges: Code Reproducible, Dataset Reproducible

Repository: SiMRA-DRAM (public)

Description: Source code & scripts for experimental characterization and demonstration of 1) simultaneous many-row activation, 2) up to nine-input majority operations, and 3) copying one row's content to up to 31 rows in real DDR4 DRAM chips. Described in our DSN'24 paper by Yuksel et al. at https://arxiv.org/abs/2405.06081

Contents: `DRAM-Bender`, `analysis`, `experimental_data`, `LICENSE`, `README.md`

#### https://github.com/CMU-SAFARI/SiMRA-DRAM




# Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis

### **Backup Slides**




# **Experimental Methodology**

- We test all banks in each DRAM chip
- We test three neighboring subarray pairs in each bank
- We test all possible combinations of activated rows



# **Performing NOT in COTS DRAM Chips**



As the number of destination rows increases, more DRAM cells produce incorrect results.

### The Coverage of Multiple-Row Activation



Figure 5: Coverage of each  $N_{RF}$ :  $N_{RL}$  activation type across tested  $R_F$  and  $R_L$  row pairs.

## **NOT vs. Activation Trend**



Figure 8: Success rate of the NOT operation vs.  $N_{RF}$ :  $N_{RL}$  activation type.

# Impact of Location in NOT Op.

• Categorize the distance between activated rows (source and destination rows) and the sense amplifiers into three regions: Far, Middle, and Close



The distance between activated rows and the sense amplifiers significantly affects the reliability
### The effect of DRAM Speed Rate on NOT



Figure 11: Success rate of the NOT operation for different DRAM speed rates.

### **Chip Density & Die Revision (NOT)**



Figure 12: Success rate of the NOT operation for different chip density and die revision combinations for two major manufacturers.

# Performing AND, NAND, OR, and NOR



The reliability distributions are very similar between 1) AND and NAND operations and 2) OR and NOR operations.

# **Impact of Temperature**



Temperature has a small effect on the reliability of AND, NAND, OR, and NOR operations

### **Boolean Operations vs. Number of 1s**



Figure 16: Success rates of AND and OR operations based on the number of logic-1s in the input operands.

### **The Effect of the Location**



### **DRAM Speed Rate vs. Bitwise Ops.**



Figure 20: Success rates of AND, NAND, OR, and NOR operations for three DRAM speed rates.

### **Chip Density & Die Revision vs. Bitwise Ops.**



# **DRAM Cell Operation**





# **DRAM Cell Operation - PRECHARGE**




# Outline

**Motivation & Background** 

Goal

**Experimental Methodology** 

**Simultaneous Many-Row Activation** 

**MAJX Operation** 

**Multi-RowCopy Operation** 

Conclusion






# **Data Movement Bottleneck**

- Today's computing systems are processor-centric
- All data is processed in the processor  $\rightarrow$  at great system cost



More than 60% of the total system energy is spent on data movement<sup>1</sup>

# **Processing-In-Memory (PIM)**

Two main approaches for Processing-In-Memory:

1. **Processing-<u>Near</u>-Memory:** PIM logic is added near the memory arrays or to the logic layer of 3D-stacked memory
2. **Processing-<u>Using</u>-Memory:** uses the analog operational principles of memory cells to perform computation






# **DRAM Organization**



# **DRAM Operation**



**ACTIVATE (ACT):** Fetch the row's content into the **sense amplifiers**

**Column Access (RD/WR):** Read/Write the target column and drive to I/O

**PRECHARGE (PRE):** Prepare the bank for a new ACTIVATE



### In-DRAM Row-Copy (RowClone)

Copying the source (src) row's content to the destination (dst) row





#### [Seshadri+ MICRO'13]

# In-DRAM Majority-of-Three (MAJ3)

### Performing a MAJ3 operation using three rows as input operands



MAJ3(a, b, b) = b

[Seshadri+ MICRO'17]

Activate three rows simultaneously






# **Our Goal**

**Experimentally** understand the **computational capability** of COTS DRAM chips

**Experimentally** analyze the **robustness** of such capability under various **operating conditions**






# **DRAM Testing Infrastructure (I)**

### DRAM Bender DDR3/4 Testing Infrastructure





#### https://github.com/CMU-SAFARI/DRAM-Bender

DRAM Bender is the first open source DRAM testing infrastructure that can be used to easily and comprehensively test state-of-the-art DDR4 modules of different form factors. Five prototypes are available on different FPGA boards.

# **DRAM Testing Infrastructure (II)**

Fine-grained control over DRAM commands, timings, temperature, and voltage



# **DRAM Chips Tested**

- 120 DDR4 chips from two major DRAM manufacturers
- Covers different die revisions and chip densities

| DRAM Mfr.         | #Modules | #Chips | Die Rev. | Density | Org. | Subarray Size |
|-------------------|----------|--------|----------|---------|------|---------------|
| SK Hynix (Mfr. H) | 7        | 56     | M        | 4Gb     | x8   | 512 or 640    |
| SK Hynix (Mfr. H) | 5        | 40     | A        | 4Gb     | x8   | 512           |
| Micron (Mfr. M)   | 4        | 16     | E        | 16Gb    | x16  | 1024          |
| Micron (Mfr. M)   | 2        | 8      | B        | 16Gb    | x16  | 1024          |

# **Testing Methodology (I)**

- Carefully sweep
  - Row addresses: Row A and Row B (>3M row pairs)
  - Timing parameters: Between ACT → PRE and PRE → ACT



# **Testing Methodology (II)**



### **Robustness Metric: Success Rate**

Percentage of DRAM cells that produce correct output of a tested operation in all test trials



Success rate for this example: 66.67% (2/3)
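The metric above counts a cell as successful only if it is correct in every trial. A minimal sketch of that computation follows, with a hypothetical function name and input layout (`results[c][t]` is `True` when cell `c` was correct in trial `t`):

```python
def success_rate(results):
    """Percentage of cells that produce the correct output in all trials."""
    passing_cells = sum(1 for cell_trials in results if all(cell_trials))
    return 100.0 * passing_cells / len(results)


# Three cells, three trials each; the second cell fails once.
example_results = [
    [True, True, True],
    [True, False, True],
    [True, True, True],
]
example_rate = round(success_rate(example_results), 2)
```

A single failing trial disqualifies a cell, so this metric is stricter than averaging correctness over trials.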




# **Key Observation**

Activating two rows in **quick succession** can **simultaneously** activate **many rows in a subarray** 




# **Hypothesis: Row Decoder Circuitry**

Simultaneous many-row activation is possible due to the **hierarchical DRAM row decoder design** 



# **Row Decoder: A Tree Example**

• We can visualize the hierarchical row decoder circuitry as a tree



# Activating a Single Row



# **Activating Many Rows: A Walkthrough**

Back-to-back ACT commands with violated timings assert many more signals in the row decoder
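The hypothesis can be illustrated with a toy decoder model. This sketch rests on assumptions that go beyond the slide: the row address is split into independently decoded fields, the outputs asserted for the first ACT stay latched when timings are violated, and a row is activated when all of its field outputs are asserted. `FIELD_BITS` and the function names are hypothetical.

```python
from itertools import product

FIELD_BITS = [2, 2, 2, 2, 1]  # hypothetical split of a 9-bit row address


def split_fields(row):
    """Split a row address into per-stage fields, least significant first."""
    fields = []
    for bits in FIELD_BITS:
        fields.append(row & ((1 << bits) - 1))
        row >>= bits
    return fields


def join_fields(fields):
    """Inverse of split_fields."""
    row, shift = 0, 0
    for value, bits in zip(fields, FIELD_BITS):
        row |= value << shift
        shift += bits
    return row


def activated_rows(row_a, row_b):
    """Rows selected when the decoder signals of both ACTs stay asserted."""
    asserted = [{a, b} for a, b in zip(split_fields(row_a), split_fields(row_b))]
    return sorted(join_fields(combo) for combo in product(*asserted))


rows_two_fields = activated_rows(0, 0b0101)     # addresses differ in two fields
rows_all_fields = activated_rows(0, 0b101010101)  # addresses differ in all five
```

In this model, two addresses that differ in k fields activate 2^k rows, which is consistent with the 2, 4, 8, 16, and 32 row counts reported in this talk.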






## Simultaneous Many-Row Activation in Off-the-Shelf DRAM Chips: Experimental Characterization and Analysis

İsmail Emir Yüksel<sup>1</sup> Yahya Can Tuğrul<sup>1,2</sup> F. Nisa Bostancı<sup>1</sup> Geraldo F. Oliveira<sup>1</sup> A. Giray Yağlıkçı<sup>1</sup> Ataberk Olgun<sup>1</sup> Melina Soysal<sup>1</sup> Haocong Luo<sup>1</sup> Juan Gómez-Luna<sup>1</sup> Mohammad Sadrosadati<sup>1</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>ETH Zürich <sup>2</sup>TOBB University of Economics and Technology

(More discussions & hypotheses in the paper)

https://arxiv.org/pdf/2405.06081



# **Characterization Methodology (I)**

If multiple rows are simultaneously activated, a WR command overwrites the content of all activated rows
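The detection idea can be sketched against a simulated subarray. `SimulatedSubarray` and its methods are hypothetical stand-ins written for this example; they are not the DRAM Bender API, and the activated row group is supplied by hand where a real chip would determine it internally.

```python
class SimulatedSubarray:
    """Hypothetical stand-in for a DRAM subarray; not a real testing API."""

    def __init__(self, num_rows, hidden_groups):
        self.rows = [0x00] * num_rows
        # Maps (row_a, row_b) to the rows the chip opens. On real hardware
        # this mapping is unknown and is exactly what the test discovers.
        self._hidden_groups = hidden_groups
        self._open_rows = set()

    def act_pre_act(self, row_a, row_b):
        """ACT A -> PRE -> ACT B with violated timings."""
        self._open_rows = set(self._hidden_groups.get((row_a, row_b), {row_b}))

    def write(self, pattern):
        """A WR command updates every row that is currently open."""
        for row in self._open_rows:
            self.rows[row] = pattern

    def precharge(self):
        self._open_rows = set()


def find_activated_rows(subarray, row_a, row_b, init=0x00, marker=0xFF):
    """Return the rows that were overwritten, i.e. simultaneously activated."""
    subarray.rows = [init] * len(subarray.rows)  # 1) initialize every row
    subarray.act_pre_act(row_a, row_b)           # 2) violated-timing sequence
    subarray.write(marker)                       # 3) WR hits all open rows
    subarray.precharge()
    # 4) read back every row and keep those holding the marker pattern
    return [r for r, value in enumerate(subarray.rows) if value == marker]


sub = SimulatedSubarray(16, {(0, 7): {0, 1, 6, 7}})
print(find_activated_rows(sub, 0, 7))  # [0, 1, 6, 7]
print(find_activated_rows(sub, 2, 3))  # [3]
```

The marker pattern must differ from the initialization pattern, otherwise overwritten and untouched rows cannot be told apart.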



# **Characterization Methodology (II)**

- Carefully sweep
  - Row addresses: Row A and Row B
  - Timing parameters: Between ACT → PRE and PRE → ACT
  - Temperature (°C): 50, 60, 70, 80, and 90
  - Wordline Voltage (V): 2.5, 2.4, 2.3, 2.2, and 2.1





## **Key Takeaways from Simultaneous Many-Row ACT**

Key Takeaway 1

COTS DRAM chips are capable of simultaneously activating 2, 4, 8, 16, and 32 rows

Key Takeaway 2

Simultaneous many-row activation is highly resilient to temperature and wordline voltage changes



## **Robustness of Simultaneous Many-Row Activation**



# COTS DRAM chips can simultaneously activate 2, 4, 8, 16, and 32 rows in the same subarray

## Also in the Paper: Impact of Temperature & Voltage



## **Leveraging Simultaneous Many-Row Activation**







# Outline

**Motivation & Background** 

Goal

**Experimental Methodology** 

**Simultaneous Many-Row Activation** 

**MAJX** Operation

**Multi-RowCopy Operation** 

Conclusion



## **Leveraging Simultaneous Many-Row Activation**





# In-DRAM Majority-of-X (MAJX)

Simultaneously activate many rows to perform MAJX (where X>3) operations

MAJ5(a, b, b, b, a) = b



MAJ7(a, b, b, a, a, b, b) = b



# MAJX in Real DRAM Chips

- For MAJX, we need to activate X rows simultaneously
- We can only simultaneously activate 2, 4, 8, 16, and 32 rows
- Question
  - How do we perform MAJX while simultaneously activating more than X rows?
- Answer
  - Making some rows neutral during the MAJX operation using the Frac operation\*
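The question-and-answer above can be sketched with an idealized charge-sharing model. Everything in it is an assumption made for illustration: cells hold a full (1.0) or empty (0.0) charge, a neutral row holds half a charge (our reading of what Frac leaves behind), all cells contribute equally, and the sense amplifier resolves to 1 when the shared charge exceeds the half-way point.

```python
SUPPORTED_GROUP_SIZES = (2, 4, 8, 16, 32)


def smallest_group(num_inputs):
    """Smallest supported simultaneous activation that fits num_inputs rows."""
    for size in SUPPORTED_GROUP_SIZES:
        if size >= num_inputs:
            return size
    raise ValueError("more inputs than rows that can be activated")


def majx_with_neutral_rows(inputs, num_activated=None):
    """Ideal model of MAJX when more than X rows are activated.

    inputs: list of 0/1 operand bits (X must be odd).
    Rows not holding an operand are neutral (assumed charge 0.5), so they
    do not push the result toward 0 or 1.
    """
    x = len(inputs)
    if x % 2 == 0:
        raise ValueError("MAJX needs an odd number of inputs")
    if num_activated is None:
        num_activated = smallest_group(x)
    if num_activated not in SUPPORTED_GROUP_SIZES or num_activated < x:
        raise ValueError("unsupported number of activated rows")
    charges = [float(bit) for bit in inputs] + [0.5] * (num_activated - x)
    return int(sum(charges) > num_activated / 2)


a, b = 0, 1
print(smallest_group(5))                        # 8
print(majx_with_neutral_rows([a, b, b, b, a]))  # 1, i.e. b
print(majx_with_neutral_rows([1, 0, 0, 0, 1]))  # 0
```

In this model, MAJ5 and MAJ7 both use 8-row activation (with three and one neutral rows, respectively), and MAJ9 uses 16-row activation with seven neutral rows.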





## **Leveraging Simultaneous Many-Row Activation**







## Improving the Robustness (Input Replication)

Storing **multiple copies** of MAJX input operands can **increase the robustness** of MAJX operations



MAJ6(a, b, b, a, b, b) = b
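A small sketch shows why replication can help under an idealized model. The assumptions are ours: as many full copies of the operand set as fit are stored in the activated rows, leftover rows are made neutral, and each cell contributes its charge equally. The "imbalance" is the net charge relative to all-neutral rows, in units of one full cell charge; for a fixed number of activated rows, a larger magnitude means a larger bitline deviation for the sense amplifier to resolve.

```python
def replication_plan(num_inputs, num_activated):
    """Return (copies, neutral_rows) when filling the activated rows.

    Assumption: store as many complete copies of the operands as fit and
    make any leftover rows neutral.
    """
    copies = num_activated // num_inputs
    if copies == 0:
        raise ValueError("not enough activated rows for one copy")
    return copies, num_activated - copies * num_inputs


def charge_imbalance(inputs, copies):
    """Net charge relative to neutral (half-charged) cells, ideal model."""
    return copies * sum(bit - 0.5 for bit in inputs)


a, b = 0, 1
print(replication_plan(3, 4))             # (1, 1)
print(replication_plan(3, 8))             # (2, 2): MAJ6(a, b, b, a, b, b)
print(charge_imbalance([a, b, b], 1))     # 0.5
print(charge_imbalance([a, b, b], 2))     # 1.0
```

The sign of the imbalance gives the majority value and replication scales its magnitude without changing the sign, so the logical result is unchanged.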



# **Characterization Methodology**

- Carefully sweep
  - Row addresses: Row A and Row B
  - Timing parameters: Between ACT → PRE and PRE → ACT
  - Temperature (°C): 50, 60, 70, 80, and 90
  - Wordline Voltage (V): 2.5, 2.4, 2.3, 2.2, and 2.1





# **Key Takeaways from MAJX Operation**

## **Key Takeaway 1**

COTS DRAM chips are capable of performing MAJ5, MAJ7, and MAJ9 operations

## **Key Takeaway 2**

Storing multiple copies of MAJX's input operands significantly increases the success rate of MAJX operations

## **Key Takeaway 3**

Voltage and temperature only slightly affect the success rate, whereas the data pattern affects it significantly



# **Robustness of MAJX Operations**



# COTS DRAM chips are capable of performing MAJ5, MAJ7, and MAJ9 operations

# **Impact of Input Replication**



Storing multiple copies of MAJ's input operands increases the success rate of MAJ3, MAJ5, MAJ7, and MAJ9 operations

# **Impact of Data Pattern**



11.52% decrease in success rate on average (up to 32.56%) across all tested MAJX operations

# Data pattern significantly affects the success rate of the MAJX operation

## Also in the Paper: Impact of Temperature & Voltage



# Outline

**Motivation & Background** 

Goal

**Experimental Methodology** 

**Simultaneous Many-Row Activation** 

**MAJX Operation** 

**Multi-RowCopy Operation** 

Conclusion



## **Leveraging Simultaneous Many-Row Activation**





## In-DRAM Multiple Row Copy (Multi-RowCopy)

Simultaneously activate many rows to copy **one row's content** to **multiple destination rows** 

RowClone [Seshadri+ MICRO'13]

Multi-RowCopy

# **Characterization Methodology (II)**

- Carefully sweep
  - Row addresses: Row A and Row B
  - Timing parameters: Between ACT → PRE and PRE → ACT
  - Temperature (°C): 50, 60, 70, 80, and 90
  - Wordline Voltage (V): 2.5, 2.4, 2.3, 2.2, and 2.1



# Key Takeaways from Multi-RowCopy

Key Takeaway 1

COTS DRAM chips are capable of copying one row's data to 1, 3, 7, 15, and 31 other rows at very high success rates

Key Takeaway 2

Multi-RowCopy in COTS DRAM chips is highly resilient to changes in data pattern, temperature, and wordline voltage
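The destination counts in Key Takeaway 1 follow from the group sizes that can be activated simultaneously: one row in the group is the source, and the rest are destinations. The sketch below is an idealized functional model, assuming the source row's content is already latched in the sense amplifiers when the rest of the group opens; `multi_row_copy` is a hypothetical helper, not part of any real tool.

```python
ACTIVATION_SIZES = (2, 4, 8, 16, 32)


def multi_row_copy(rows, source, activated_group):
    """Copy rows[source] into every other row of activated_group.

    Ideal model: the sense amplifiers hold the source content and
    overwrite all other simultaneously activated rows.
    Returns the list of destination rows.
    """
    if source not in activated_group:
        raise ValueError("source row must be part of the activated group")
    destinations = sorted(set(activated_group) - {source})
    for row in destinations:
        rows[row] = rows[source]
    return destinations


# One source row per group, so N activated rows give N - 1 destinations.
print([n - 1 for n in ACTIVATION_SIZES])  # [1, 3, 7, 15, 31]

rows = [0xA5, 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66]
print(multi_row_copy(rows, 0, {0, 1, 6, 7}))  # [1, 6, 7]
print([hex(v) for v in rows])
```

Rows outside the activated group are left untouched in this model.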



# **Robustness of Multi-RowCopy**



COTS DRAM chips can copy one row's content to up to 31 rows with a very high success rate

## **Impact of Data Pattern**



## Data pattern has a small effect on the success rate of the Multi-RowCopy operation

## Also in the Paper: Impact of Temperature & Voltage



# More in the Paper

- Detailed hypotheses and key ideas on
  - Hypothetical row decoder circuitry
  - Input Replication
- More characterization results
  - Power consumption of simultaneous many-row activation
  - Effect of timing delays between ACT-PRE and PRE-ACT commands
  - Effect of temperature and wordline voltage
- Circuit-level (SPICE) experiments for input replication
- Potential performance benefits of enabling new in-DRAM operations
  - Majority-based computation
  - Content destruction-based cold-boot attack prevention
- Discussions on the limitations of tested COTS DRAM chips

## Available on arXiv



ness of these capabilities under various timing delays between DRAM commands, data patterns, temperature, and voltage levels. We extensively characterize 120 COTS DDR4 chips from two major manufacturers. We highlight four key results of our study. First, COTS DRAM chips are capable of 1) simultaneously activating up to 32 rows (i.e., simultaneous many-row activation), 2) executing a majority of X (MAJX) operation where X>3 (i.e., MAJ5, MAJ7, and MAJ9 operations), and 3) copying a DRAM row (concurrently) to up to 31 other DRAM rows, which we call Multi-RowCopy. Second, storing multiple copies of MAJX's input operands on all simultaneously activated rows drastically increases the success rate (i.e., the percentage of DRAM cells that correctly perform the computation) of the MAJX operation. For example, MAJ3 with 32-row activation (i.e.,

based arithmetic [64, 66, 69, 72, 91, 127, 130, 131], and lookup-table-based operations [82, 106, 107, 132]. We refer to DRAM-based PUM as Processing-Using-DRAM (PUD) and the computation performed using DRAM cells as PUD operations.

PUD benefits from the bulk data parallelism in DRAM devices to perform bulk bitwise PUD operations. Prior works show that bulk bitwise operations are used in a wide variety of important applications, including databases and web search [64, 67, 79, 130, 133-140], data analytics [64, 141-144], graph processing [56, 80, 94, 130, 145], genome analysis [60, 99, 146-149], cryptography [150, 151], set operations [56, 64], and hyperdimensional computing [152–154].

## https://arxiv.org/pdf/2405.06081

## **Our Work is Open Source and Artifact Evaluated**

Artifact evaluation badges: Code Reproducible, Dataset Reproducible

SiMRA-DRAM repository: source code and scripts for experimental characterization and demonstration of 1) simultaneous many-row activation, 2) up to nine-input majority operations, and 3) copying one row's content to up to 31 rows in real DDR4 DRAM chips. Described in our DSN'24 paper by Yuksel et al. at https://arxiv.org/abs/2405.06081

Repository contents: DRAM-Bender, analysis, experimental_data, LICENSE, README.md

## https://github.com/CMU-SAFARI/SiMRA-DRAM

# Outline

**Motivation & Background** 

Goal

**Experimental Methodology** 

**Simultaneous Many-Row Activation** 

**MAJX Operation** 

**Multi-RowCopy Operation** 



# Conclusion

We experimentally demonstrate that COTS DRAM chips can

- simultaneously activate up to 32 DRAM rows
- perform MAJ3, MAJ5, MAJ7, and MAJ9 operations
- copy one row's content to up to 31 rows

We characterize 120 DDR4 chips and highlight three key results

- Storing multiple copies of MAJX's input operands (i.e., input replication) drastically increases the success rate of MAJX operations
- Voltage and temperature only slightly affect the success rate of MAJX operations, whereas the data pattern affects it significantly
- Multi-RowCopy is highly resilient to changes in data pattern, temperature, and wordline voltage

We believe these empirical results demonstrate the promising potential of using DRAM as a computation substrate

#### https://github.com/CMU-SAFARI/SiMRA-DRAM

Simultaneous Many-Row Activation in Off-the-Shelf DRAM Chips: Experimental Characterization and Analysis

İsmail Emir Yüksel

Yahya C. Tuğrul F. Nisa Bostancı Geraldo F. Oliveira A. Giray Yağlıkçı Ataberk Olgun Melina Soysal Haocong Luo Juan Gómez-Luna Mohammad Sadrosadati Onur Mutlu

**ETH** zürich



### **Backup Slides**

### **Power Consumption of Many-Row ACT**



32-row activation consumes 21.19% less power than the most power-consuming single DRAM operation (i.e., REF)

Simultaneous many-row activation power draw likely meets the power budget of DDR4 chips

### **Impact of Temperature in Many-Row ACT**



Increasing temperature up to 90°C has a small effect on the success rate



# **Impact of Voltage in Many-Row ACT**



Reducing the wordline voltage only slightly affects the success rate

## **Impact of Temperature in MAJX**



# **Impact of Voltage in MAJX Operations**



Wordline voltage slightly affects the success rate of the MAJX operation



### Impact of Timing Delays in Many-Row ACT



### Impact of Timing Delays in MAJX



### Impact of Timing Delays in Multi-RowCopy



### Impact of Temperature in Multi-RowCopy



Increasing temperature up to 90°C has a very small effect on the success rate



# Impact of Voltage in Multi-RowCopy



Reducing the wordline voltage only slightly affects the success rate



## **Majority-based Computation**



New MAJX operations provide 121.61% (46.54%) higher performance over using only MAJ3 in Mfr. M (Mfr. H) on average.

### **Cold Boot Attack Prevention**



Multi-RowCopy-based content destruction outperforms both RowClone-based and Frac-based content destruction by up to 20.87× and 7.55×, respectively.

# **Frac Operation**



## **Input Replication in Real Chips**



### **DRAM Chips Tested: Extended Table**

| Module Vendor | Chip Vendor | Module Identifier / Chip Identifier                    | #Modules (#Chips) | Freq (MT/s) | Mfr. Date (ww-yy) | Chip Den. | Die Rev. | Chip Org. | Subarray Size |
|---------------|-------------|--------------------------------------------------------|-------------------|-------------|-------------------|-----------|----------|-----------|---------------|
| TimeTec       | SK Hynix    | TLRD44G2666HC18F-SBK [240] / H5AN4G8NMFR-TFC [241]     | 7 (56)            | 2666        | Unknown           | 4Gb       | M        | ×8        | 512 or 640    |
| TeamGroup     | SK Hynix    | 76TT21NUS1R8-4G [242] / H5AN4G8NAFR-TFC [243]          | 5 (40)            | 2133        | Unknown           | 4Gb       | A        | ×8        | 512           |
| Micron        | Micron      | MTA4ATF1G64HZ-3G2E1 [244] / MT40A1G16KD-062E:E [245]   | 4 (16)            | 3200        | 46-20             | 16Gb      | E        | ×16       | 1024          |
| Micron        | Micron      | MTA4ATF1G64HZ-3G2B2 [246] / MT40A1G16RC-062E:B [247]   | 2 (8)             | 2666        | 26-21             | 16Gb      | B        | ×16       | 1024          |

## **Row Decoder Circuitry**



### **Effect of Input Replication on the Bitline Deviation**



### Limitations of Tested COTS DRAM Chips (I)

#### Some COTS DRAM chips do not support all in-DRAM operations

- We do not observe simultaneous many-row activation in the 64 Samsung chips we tested
- <u>Hypothesis</u>
  - Internal DRAM circuitry ignores the PRE command or the second ACT command when the timing parameters are greatly violated

If such a limitation were not imposed, we believe these DRAM chips would also be fundamentally capable of performing the operations we examine in this work



### Limitations of Tested COTS DRAM Chips (II)

- Tested COTS DRAM chips support only consecutive two-row activation and simultaneous activation of 2, 4, 8, 16, and 32 rows
  - <u>Hypothesis</u>
    - This is due to our current infrastructure limitations: we can issue DRAM commands only at intervals of 1.5 ns.
    - Finer-grained control over timing would allow us to deassert/assert the desired intermediate signals in the row decoder circuitry

### Limitations of Tested COTS DRAM Chips (III)

- Performing in-DRAM operations potentially has an effect on transient errors in DRAM chips
  - We perform each test (a single data point in the distribution) 10K times
  - We do not observe any errors in rows outside of the simultaneously activated row group

We believe that investigating all potential effects (e.g., on transient errors) requires a much more extensive exploration of various aspects



## **Open Research Questions**

1. Is it possible to **robustly** activate **more than four** DRAM rows simultaneously?
2. What other PUD operations can be realized in COTS DRAM chips?
3. How **robustly** can PUD operations be performed in COTS DRAM chips?
4. Can the robustness of PUD operations be improved?
5. What are **the effects of operating conditions** on the robustness of PUD operations?

# **DRAM Cell Operation**



# **DRAM Cell Operation - ACTIVATE**



# **DRAM Cell Operation - PRECHARGE**

