# HPCA 2023 Tutorial Real-world Processing-in-Memory Architectures

# Processing-Near-Memory Real PNM Architectures Programming General-purpose PIM

Dr. Juan Gómez Luna

Professor Onur Mutlu

#### Two PIM Approaches

#### 5.2. Two Approaches: Processing Using Memory (PUM) vs. Processing Near Memory (PNM)

Many recent works take advantage of the memory technology innovations that we discuss in Section 5.1 to enable and implement PIM. We find that these works generally take one of two approaches, which are categorized in Table 1: (1) processing using memory or (2) processing near memory. We briefly describe each approach here. Sections 6 and 7 will provide example approaches and more detail for both.

Table 1: Summary of enabling technologies for the two approaches to PIM used by recent works. Adapted from [341] and extended.

| Approach | Example Enabling Technologies |
|-------------------------|-----------------------------------------|
| Processing Using Memory | SRAM |
| | DRAM |
| | Phase-change memory (PCM) |
| | Magnetic RAM (MRAM) |
| | Resistive RAM (RRAM)/memristors |
| Processing Near Memory | Logic layers in 3D-stacked memory |
| | Silicon interposers |
| | Logic in memory controllers |
| | Logic in memory chips (e.g., near bank) |
| | Logic in memory modules |
| | Logic near caches |
| | Logic near/in storage devices |

Onur Mutlu, Saugata Ghose, Juan Gomez-Luna, and Rachata Ausavarungnirun, "A Modern Primer on Processing in Memory," invited book chapter in Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, Springer, to be published in 2021.
[Tutorial Video on "Memory-Centric Computing Systems" (1 hour 51 minutes)]

#### **PIM Becomes Real**

- UPMEM, founded in January 2015, announces the first real-world PIM architecture in 2016
- UPMEM's PIM-enabled DIMMs start getting commercialized in 2019
- In early 2021, Samsung announces FIMDRAM at the ISSCC conference
- Samsung's LPDDR5-based and DIMM-based PIM announced a few months later
- In early 2022, SK Hynix announces AiM and Alibaba announces HB-PNM at the ISSCC conference

#### Startup plans to embed processors in DRAM

October 13, 2016 // By Peter Clarke

Fabless chip company Upmem SAS (Grenoble, France), founded in January 2015, is developing a microprocessor for use in data-intensive applications in the datacenter that will sit embedded in DRAM to be close to the data. Placing hundreds or thousands of processing elements in DRAM able to perform work for a controlling server

#### **UPMEM PIM**

#### UPMEM Processing-in-DRAM Engine (2019)

- Processing in DRAM Engine
- Includes **standard DIMM modules**, with a **large** number of DPU processors combined with DRAM chips.
- Replaces **standard** DIMMs

#### **UPMEM DIMMS**

- E19: 8 chips/DIMM (1 rank). DPUs @ 267 MHz
- P21: 16 chips/DIMM (2 ranks). DPUs @ 350 MHz

#### 2,560-DPU Processing-in-Memory System

Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

- JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland
- IZZAT EL HAJJ, American University of Beirut, Lebanon
- IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain
- CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece
- GERALDO F. OLIVEIRA, ETH Zürich, Switzerland
- ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy.
A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip. This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (Processing-In-Memory benchmarks), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts.
Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

#### **Understanding a Modern PIM Architecture**

# Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System

JUAN GÓMEZ-LUNA<sup>1</sup>, IZZAT EL HAJJ<sup>2</sup>, IVAN FERNANDEZ<sup>1,3</sup>, CHRISTINA GIANNOULA<sup>1,4</sup>, GERALDO F. OLIVEIRA<sup>1</sup>, AND ONUR MUTLU<sup>1</sup>

<sup>1</sup>ETH Zürich, <sup>2</sup>American University of Beirut, <sup>3</sup>University of Malaga, <sup>4</sup>National Technical University of Athens

Corresponding author: Juan Gómez-Luna (e-mail: juang@ethz.ch).

https://arxiv.org/pdf/2105.03814.pdf

https://github.com/CMU-SAFARI/prim-benchmarks

#### **UPMEM Patent**

| Field | Value |
|-------|-------|
| Document | United States Patent, Devaux et al. |
| (10) Patent No. | US 10,324,870 B2 |
| (45) Date of Patent | Jun. 18, 2019 |
| (54) Title | MEMORY CIRCUIT WITH INTEGRATED PROCESSOR |
| (71) Applicant | UPMEM, Grenoble (FR) |
| (72) Inventors | Fabrice Devaux, La Conversion (CH); Jean-François Roy, Grenoble (FR) |
| (73) Assignee | UPMEM, Grenoble (FR) |
| (\*) Notice | Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days. |
| (21) Appl. No. | 15/551,418 |
| (22) PCT Filed | Feb. 12, 2016 |
| (56) U.S. PATENT DOCUMENTS | 5,666,485 A \* 9/1997 Suresh, G06F 13/1605, 710/113; 6,463,001 B1 10/2002 Williams; 7,349,277 B2 \* 3/2008 Kinsley, G11C 11/406; 8,438,358 B1 \* 5/2013 Kraipak, G11C 7/04, 711/167; (Continued) |
| (56) FOREIGN PATENT DOCUMENTS | EP 0780768 A1 6/1997; JP H03109661 A 5/1991; WO |

#### (57) ABSTRACT

A memory circuit having: a memory array including one or more memory banks; a first processor; and a processor control interface for receiving data processing commands directed to the first processor from a central processor, the processor control interface being adapted to indicate to the central processor when the first processor has finished accessing one or more of the memory banks of the memory array, these memory banks becoming accessible to the central processor.

# **UPMEM PIM System Organization (I)**

FIG. 1 schematically illustrates a computing system comprising DRAM circuits having integrated processors according to an example embodiment

# **UPMEM PIM System Organization (II)**

In a UPMEM-based PIM system, UPMEM DIMMs coexist with regular DDR4 DIMMs

# **UPMEM PIM System Organization (III)**

- A UPMEM DIMM contains 8 or 16 chips
  - Thus, 1 or 2 ranks of 8 chips each
- Inside each PIM chip there are:
  - 8 64MB banks per chip: Main RAM (MRAM) banks
  - 8 DRAM Processing Units (DPUs) in each chip, 64 DPUs per rank

# **DRAM Processing Unit (I)**

FIG. 4 schematically illustrates part of the computing system of FIG.
1 in more detail according to an example embodiment

# **DRAM Processing Unit (II)**

PIM Chip

#### **DPU Pipeline**

- In-order pipeline
  - Up to 425 MHz
- Fine-grain multithreaded
  - 24 hardware threads
- 14 pipeline stages
  - DISPATCH: Thread selection
  - FETCH: Instruction fetch
  - READOP: Register file
  - FORMAT: Operand formatting
  - ALU: Operation and WRAM
  - MERGE: Result formatting

# Fine-grained Multithreading

# Fine-Grained Multithreading (I)

- Idea: Hardware has multiple thread contexts (PC+registers). Each cycle, fetch engine fetches from a different thread
  - By the time the fetched branch/instruction resolves, no instruction is fetched from the same thread
  - Branch/instruction resolution latency overlapped with execution of other threads' instructions
- Advantage: No logic needed for handling control and data dependences within a thread
- Disadvantage: Single thread performance suffers
- Disadvantage: Extra logic for keeping thread contexts
- Disadvantage: Does not overlap latency if not enough threads to cover the whole pipeline

# Fine-Grained Multithreading (II)

- Idea: Switch to another thread every cycle such that no two instructions from a thread are in the pipeline concurrently
- Tolerates the control and data dependence latencies by overlapping the latency with useful work from other threads
- Improves pipeline utilization by taking advantage of multiple threads
- Thornton, "Parallel Operation in the Control Data 6600," AFIPS 1964
- Smith, "A pipelined, shared resource MIMD computer," ICPP 1978

#### Lecture on Fine-Grained Multithreading

#### **DPU Instruction Set Architecture**

- Specific 32-bit ISA
  - Aiming at scalar, in-order, and multithreaded implementation
  - Allowing compilation of 64-bit C code
  - LLVM/Clang compiler

https://sdk.upmem.com/2021.2.0/201\_IS.html#

#### Microbenchmark for INT32 ADD Throughput

```
// C-based code
#define SIZE 256
int* bufferA = mem_alloc(SIZE * sizeof(int));
for (int i = 0; i < SIZE; i++) {
    int temp = bufferA[i];
    temp += scalar;
    bufferA[i] = temp;
}

move r2, 0
.LBB0_1:                        // Loop header
lsl_add r3, r0, r2, 2           // Address calculation
                                // Load from WRAM
                                // Add
                                // Store to WRAM
                                // Index update
jneq r2, 256, .LBB0_1           // Conditional jump
```

#### More on the UPMEM PIM Architecture

#### **Understanding a Modern PIM Architecture**

# Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System

https://arxiv.org/pdf/2105.03814.pdf

https://github.com/CMU-SAFARI/prim-benchmarks

The throughput saturation point is as low as ¼ OP/B, i.e., 1 integer addition per every 32-bit element fetched

Plot axis: Operational Intensity (OP/B)

#### KEY TAKEAWAY 1

The UPMEM PIM architecture is fundamentally compute bound. As a result, the most suitable workloads are memory-bound.

### **CPU/GPU: Performance Comparison**

#### KEY OBSERVATION

The UPMEM-based PIM system can outperform a state-of-the-art GPU on workloads with three key characteristics:

- Streaming memory accesses
- No or little inter-DPU synchronization
- No or little use of integer multiplication, integer division, or floating point operations

These three key characteristics make a workload potentially suitable to the UPMEM PIM architecture.
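The ¼ OP/B saturation point quoted for the INT32 ADD microbenchmark can be checked by hand: the loop performs one integer addition for every 4-byte element it fetches, so its operational intensity is 1/4 operation per byte. The following is a minimal C sketch of that arithmetic. The helper names `operational_intensity` and `at_or_past_saturation` are hypothetical and exist only for this illustration; they are not part of the UPMEM SDK, and the saturation value passed in is simply the ¼ OP/B figure from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Operations per byte (OP/B) of a streaming kernel that performs
 * ops_per_element operations on every element of bytes_per_element bytes.
 * Hypothetical helper for illustration; not part of the UPMEM SDK. */
double operational_intensity(size_t ops_per_element, size_t bytes_per_element)
{
    assert(bytes_per_element > 0);
    return (double)ops_per_element / (double)bytes_per_element;
}

/* Returns 1 if the kernel's operational intensity is at or beyond the
 * given saturation point, 0 otherwise. */
int at_or_past_saturation(double intensity, double saturation)
{
    return intensity >= saturation;
}

#ifdef DEMO_MAIN
int main(void)
{
    /* One INT32 addition per 32-bit (4-byte) element, as in the slide. */
    double oi = operational_intensity(1, 4);
    printf("INT32 ADD microbenchmark: %.2f OP/B\n", oi);
    printf("at or past the 1/4 OP/B saturation point: %d\n",
           at_or_past_saturation(oi, 0.25));
    return 0;
}
#endif
```

Compiling with `-DDEMO_MAIN` builds the small demo. A kernel that does less work per byte, for example one addition per 8-byte element, falls below the saturation point in this model.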
#### **KEY TAKEAWAY 2**

The most well-suited workloads for the UPMEM PIM architecture use no arithmetic operations or use only simple operations (e.g., bitwise operations and integer addition/subtraction).

#### KEY TAKEAWAY 3

The most well-suited workloads for the UPMEM PIM architecture require little or no communication across DPUs (inter-DPU communication).

#### KEY TAKEAWAY 4

- UPMEM-based PIM systems **outperform state-of-the-art CPUs in terms of performance** (by 23.2× on 2,556 DPUs for 16 PrIM benchmarks) **and energy efficiency on most of PrIM benchmarks**.
- UPMEM-based PIM systems **outperform state-of-the-art GPUs on a majority of PrIM benchmarks** (by 2.54× on 2,556 DPUs for 10 PrIM benchmarks), and the outlook is even more positive for future PIM systems.
- UPMEM-based PIM systems are more energy-efficient than state-of-the-art CPUs and GPUs on workloads that they provide performance improvements over the CPUs and the GPUs.

#### **PrIM Repository**

- All microbenchmarks, benchmarks, and scripts
- https://github.com/CMU-SAFARI/prim-benchmarks

# Samsung FIMDRAM (aka HBM-PIM)

#### Samsung Function-in-Memory DRAM (2021)

Samsung Newsroom

#### Samsung Develops Industry's First High Bandwidth Memory with AI Processing Power

Korea on February 17, 2021

The new architecture will deliver over twice the system performance and reduce energy consumption by more than 70%

Samsung Electronics, the world leader in advanced memory technology, today announced that it has developed the industry's first High Bandwidth Memory (HBM) integrated with artificial intelligence (AI) processing power: the HBM-PIM.

The new processing-in-memory (PIM) architecture brings powerful AI computing capabilities inside high-performance memory, to accelerate large-scale processing in data centers, high performance computing (HPC) systems and AI-enabled mobile applications.
Kwangil Park, senior vice president of Memory Product Planning at Samsung Electronics stated, "Our groundbreaking HBM-PIM is the industry's first programmable PIM solution tailored for diverse AI-driven workloads such as HPC, training and inference. We plan to build upon this breakthrough by further collaborating with AI solution providers for even more advanced PIM-powered applications."

#### Samsung Function-in-Memory DRAM (2021)

#### FIMDRAM based on HBM2

[3D Chip Structure of HBM with FIMDRAM]

- 128DQ / 8CH / 16 banks / BL4
- 32 PCU blocks (1 FIM block/2 banks)
- 1.2 TFLOPS (4H)
- FP16 ADD / Multiply (MUL) / Multiply-Accumulate (MAC) / Multiply-and-Add (MAD)

#### ISSCC 2021 / SESSION 25 / DRAM / 25.4

25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications

Young-Cheon Kwon<sup>1</sup>, Suk Han Lee<sup>1</sup>, Jaehoon Lee<sup>1</sup>, Sang-Hyuk Kwon<sup>1</sup>, Je Min Ryu<sup>1</sup>, Jong-Pil Son<sup>1</sup>, Seongil O<sup>1</sup>, Hak-Soo Yu<sup>1</sup>, Haesuk Lee<sup>1</sup>, Soo Young Kim<sup>1</sup>, Youngmin Cho<sup>1</sup>, Jin Guk Kim<sup>1</sup>, Jongyoon Choi<sup>1</sup>, Hyun-Sung Shin<sup>1</sup>, Jin Kim<sup>1</sup>, BengSeng Phuah<sup>1</sup>, HyoungMin Kim<sup>1</sup>, Myeong Jun Song<sup>1</sup>, Ahn Choi<sup>1</sup>, Daeho Kim<sup>1</sup>, SooYoung Kim<sup>1</sup>, Eun-Bong Kim<sup>1</sup>, David Wang<sup>2</sup>, Shinhaeng Kang<sup>1</sup>, Yuhwan Ro<sup>3</sup>, Seungwoo Seo<sup>3</sup>, JoonHo Song<sup>3</sup>, Jaeyoun Youn<sup>1</sup>, Kyomin Sohn<sup>1</sup>, Nam Sung Kim<sup>1</sup>

<sup>1</sup>Samsung Electronics, Hwaseong, Korea; <sup>2</sup>Samsung Electronics, San Jose, CA; <sup>3</sup>Samsung Electronics, Suwon, Korea

#### Samsung Function-in-Memory DRAM (2021)

#### **Chip Implementation**

- Mixed design methodology to implement FIMDRAM
  - Full-custom + Digital RTL

[Digital RTL design for PCU block]

#### FIMDRAM: System Organization

- PIM units respond to standard DRAM column commands (RD or WR)
  - Compliant with unmodified JEDEC controllers
- They execute one wide-SIMD operation commanded by a PIM instruction with deterministic latency in a lock-step manner
- A PIM unit can get 16 16-bit operands from IOSAs, a register, and/or the result bus

#### Lecture on FIMDRAM/HBM-PIM

Processing-in-Memory Course: Lecture 4: Real-world PIM: Samsung HBM-PIM Architecture - Spring 2022

# Samsung AxDIMM

# Samsung AxDIMM (2021)

- DIMM-based PIM
  - DLRM recommendation system

### **AxDIMM System**

## **AxDIMM Design: Hardware Architecture**

DDR4 slave PHY receives DRAM commands and NMP instructions (via DQ pins) from the host side

## **AxDIMM Design: Execution Flow**

## Lecture on AxDIMM

# **SK Hynix AiM**

# SK Hynix Accelerator-in-Memory (2022)

SK hynix Newsroom

#### SK hynix Develops PIM, Next-Generation AI Accelerator

February 16, 2022

SK hynix (or "the Company", <u>www.skhynix.com</u>) announced on February 16 that it has developed PIM\*, a next-generation memory chip with computing capabilities.
\*PIM (Processing In Memory): A next-generation technology that provides a solution for data congestion issues for AI and big data by adding computational functions to semiconductor memory

It has been generally accepted that memory chips store data and CPU or GPU, like human brain, process data. SK hynix, following its challenge to such notion and efforts to pursue innovation in the next-generation smart memory, has found a breakthrough solution with the development of the latest technology.

SK hynix plans to showcase its PIM development at the world's most prestigious semiconductor conference, 2022 ISSCC\*, in San Francisco at the end of this month. The company expects continued efforts for innovation of this technology to bring the memory-centric computing, in which semiconductor memory plays a central role, a step closer to the reality in devices such as smartphones.

\*ISSCC: The International Solid-State Circuits Conference will be held virtually from Feb. 20 to Feb. 24 this year with a theme of "Intelligent Silicon for a Sustainable World"

For the first product that adopts the PIM technology, SK hynix has developed a sample of GDDR6-AiM (Accelerator\* in memory). The GDDR6-AiM adds computational functions to GDDR6\* memory chips, which process data at 16Gbps. A combination of GDDR6-AiM with CPU or GPU instead of a typical DRAM makes certain computation speed 16 times faster. GDDR6-AiM is widely expected to be adopted for machine learning, high-performance computing, and big data computation and storage.

11.1 A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications

Seongju Lee, SK hynix, Icheon, Korea

In Paper 11.1, SK Hynix describes a 1ynm, GDDR6-based accelerator-in-memory with a command set for deep-learning operation. The 8Gb design achieves a peak throughput of 1TFLOPS with 1GHz MAC operations and supports major activation functions to improve accuracy.
# SK Hynix Accelerator-in-Memory (2022)

- 4 Gb AiM die with 16 processing units (PUs)

#### AiM Die Photograph

#### 1 Process Unit (PU) Area

| Component | Area |
|--------------------------|---------------------|
| Total | 0.19 mm<sup>2</sup> |
| MAC | 0.11 mm<sup>2</sup> |
| Activation Function (AF) | 0.02 mm<sup>2</sup> |
| Reservoir Cap. | 0.05 mm<sup>2</sup> |
| Etc. | 0.01 mm<sup>2</sup> |

## SK Hynix AiM: System Organization (2022)

GDDR6-based AiM architecture

## Lecture on Accelerator-in-Memory

# Alibaba HB-PNM

## Alibaba HB-PNM: Overall Architecture (2022)

3D-stacked logic die and DRAM die vertically bonded by hybrid bonding (HB)

# Alibaba HB-PNM: Compute Engines

Match engine and neural engine for matching and ranking in a recommendation system

## Lecture on HB-PNM

# **More Real PIM**

## NeuroBlade

## NeuroBladers build a processing-in-memory analytics chip and server

By Chris Mellor - October 6, 2021

An Israeli startup called NeuroBlade has exited stealth mode, built a processing-in-memory (PIM) analytics chip combining DRAM and thousands of cores, put four of them in an analytics accelerating server appliance box, and taken in \$83 million in B-round funding.

The idea is to take a GPU approach to big data-style analytics and AI software by employing a massively parallel core design, but take it further by layering the cores on DRAM with a wide I/O bus architecture design linking the cores and memory to speed processing even more.

This design vastly reduces data movement between storage and memory and also accelerates data transfer between memory and processing cores.

## NeuroBlade Patent (I)

| Field | Value |
|-------|-------|
| (12) Document | United States Patent, Sity et al. |
| (10) Patent No. | US 10,762,034 B2 |
| (45) Date of Patent | Sep. 1, 2020 |
| (54) Title | MEMORY-BASED DISTRIBUTED PROCESSOR ARCHITECTURE |
| (71) Applicant | NeuroBlade, Ltd., Hod-Hashron (IL) |
| (72) Inventors | Elad Sity, Kfar Saba (IL); Eliad Hillel, Kfar Saba (IL) |
| (73) Assignee | NeuroBlade, Ltd., Hod-Hashron (IL) |
| (\*) Notice | Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days. |
| (21) Appl. No. | 16/512,590 |
| (22) Filed | Jul. 16, 2019 |
| (56) References Cited: U.S. PATENT DOCUMENTS | 4,837,747 A \* 6/1989 Dosaka, G11C 8/12; 5,155,729 A 10/1992 Rysko et al.; (Continued) |
| (56) References Cited: FOREIGN PATENT DOCUMENTS | CA 2 149 479 C 5/2001 |
| (56) References Cited: OTHER PUBLICATIONS | Ahn et al., "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing," ISCA '15 (Jun. 13-17, 2015), pp. 105-117. |

#### (57) ABSTRACT

Distributed processors and methods for compiling code for execution by distributed processors are disclosed. In one implementation, a distributed processor may include a substrate; a memory array disposed on the substrate; and a processing array disposed on the substrate. The memory array may include a plurality of discrete memory banks, and the processing array may include a plurality of processor subunits, each one of the processor subunits being associated with a corresponding, dedicated one of the plurality of discrete memory banks. The distributed processor may further include a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding, dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits.
## NeuroBlade Patent (II)

## NeuroBlade: Xiphos

- PIM XRAM chip
- IMPU (Intensive Memory Processing Unit)
- x86 CPU, 32 NVMe SSDs
- PCIe fabric: "Everything is connected on top of PCIe fabric."
- Wide I/O bus: multiple x16 PCIe buses

Xiphos appliance.

## Variety of Current Real PIM Architectures

### Differences

- Near-bank (UPMEM, FIMDRAM, AiM, HB-PNM) vs. near-chip (AxDIMM)
- General-purpose (UPMEM) vs. special-function (FIMDRAM, AiM, HB-PNM)
- FGMT (UPMEM) vs. SIMD (FIMDRAM, AiM, AxDIMM) vs. systolic array (HB-PNM)
- Natively integer (UPMEM, HB-PNM) vs. floating point (FIMDRAM)
- FP16 (FIMDRAM) vs. BF16 (AiM) vs. FP32 (AxDIMM)
- DDR4 (UPMEM, AxDIMM) vs. LPDDR4 (HB-PNM) vs. HBM2 (FIMDRAM) vs. GDDR6 (AiM)

## **Common Characteristics**

- These PIM systems have some common characteristics:
  1. There is a host processor (CPU or GPU) with access to (1) standard main memory, and (2) PIM-enabled memory
  2. PIM-enabled memory contains multiple PIM processing elements (PEs) with high bandwidth and low latency memory access
  3. PIM PEs run only at a few hundred MHz and have a small number of registers and small (or no) cache/scratchpad
  4. PEs may need to communicate via the host processor

# A State-of-the-Art PIM (PNM) System

- These PIM systems have some common characteristics:
  1. There is a host processor (CPU or GPU) with access to (1) standard main memory, and (2) PIM-enabled memory
  2. PIM-enabled memory contains multiple PIM processing elements (PEs) with high bandwidth and low latency memory access
  3. PIM PEs run only at a few hundred MHz and have a small number of registers and small (or no) cache/scratchpad
  4. PEs may need to communicate via the host processor

# Programming a General-purpose PIM System

## **Accelerator Model (I)**

UPMEM DIMMs coexist with conventional DIMMs

Integration of UPMEM DIMMs in a system follows an accelerator model

- UPMEM DIMMs can be seen as a loosely coupled accelerator
  - Explicit data movement between the main processor (host CPU) and the accelerator (UPMEM)
  - Explicit kernel launch onto the UPMEM processors
- This resembles GPU computing

# **GPU Computing**

- Computation is offloaded to the GPU
- Three steps
  - CPU-GPU data transfer (1)
  - GPU kernel execution (2)
  - GPU-CPU data transfer (3)

https://www.youtube.com/watch?v=y40-tY5WJ8A

https://safari.ethz.ch/digitaltechnik/spring2018/lib/exe/fetch.php?media=digitaldesign-2018-lecture22-gpuprogramming-afterlecture.pdf

## **Accelerator Model (II)**

FIG. 6 is a flow diagram representing operations in a method of delegating a processing task to a DRAM processor according to an example embodiment

# **System Organization**

FIG. 1 schematically illustrates a computing system comprising DRAM circuits having integrated processors according to an example embodiment

# First Programming Example: Vector Addition

## Observations, Recommendations, Takeaways

#### GENERAL PROGRAMMING RECOMMENDATIONS

1. Execute on the *DRAM Processing Units* (*DPUs*) **portions of parallel code** that are as long as possible.
2. Split the workload into **independent data blocks**, which the DPUs operate on independently.
3. Use **as many working DPUs** in the system as possible.
4. Launch at least **11** *tasklets* (i.e., software threads) per DPU.

#### PROGRAMMING RECOMMENDATION 1

For data movement between the DPU's MRAM bank and the WRAM, use large DMA transfer sizes when all the accessed data is going to be used.

#### **KEY OBSERVATION 7**

Larger CPU-DPU and DPU-CPU transfers between the host main memory and the DRAM Processing Unit's Main memory (MRAM) banks result in higher sustained bandwidth.
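General recommendations 2 and 4 above (independent data blocks, at least 11 tasklets per DPU) come down to index arithmetic on the host and on each DPU. The following is a minimal plain-C sketch of one way to compute those blocks. It does not call the UPMEM SDK: `range_t`, `elems_per_dpu`, and `block_of` are hypothetical names made up for this illustration. Rounding the per-DPU size up to a multiple of 8 bytes is an assumption, suggested by the variable name `input_size_dpu_8bytes` in the parallel-transfer code of this deck rather than stated on these slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Half-open index range [begin, end) into the input array. */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Elements assigned to each DPU: ceil(n / nr_dpus), rounded up so that the
 * per-DPU byte count is a multiple of 8.
 * ASSUMPTION: the 8-byte rounding is inferred from the name
 * input_size_dpu_8bytes; hypothetical helper, not part of the UPMEM SDK. */
size_t elems_per_dpu(size_t n, size_t nr_dpus, size_t elem_size)
{
    assert(nr_dpus > 0 && elem_size > 0 && 8 % elem_size == 0);
    size_t per_dpu = (n + nr_dpus - 1) / nr_dpus;      /* ceiling division */
    size_t elems_per_8_bytes = 8 / elem_size;
    return (per_dpu + elems_per_8_bytes - 1) / elems_per_8_bytes
           * elems_per_8_bytes;
}

/* Block number idx when an n-element array is cut into blocks of `chunk`
 * elements. Blocks that start past the end of the array come back empty. */
range_t block_of(size_t n, size_t chunk, size_t idx)
{
    range_t r;
    size_t start = idx * chunk;
    r.begin = start < n ? start : n;
    r.end = (n - r.begin > chunk) ? r.begin + chunk : n;
    return r;
}

#ifdef DEMO_MAIN
int main(void)
{
    const size_t n = 1000, nr_dpus = 4, nr_tasklets = 11;
    size_t chunk = elems_per_dpu(n, nr_dpus, sizeof(int));

    for (size_t d = 0; d < nr_dpus; d++) {
        /* First level: one independent block per DPU. */
        range_t dr = block_of(n, chunk, d);
        size_t len = dr.end - dr.begin;
        /* Second level: split the DPU's block across its tasklets. */
        size_t per_tasklet = (len + nr_tasklets - 1) / nr_tasklets;
        range_t t0 = block_of(len, per_tasklet, 0);
        printf("DPU %zu: [%zu, %zu), tasklet 0 handles local [%zu, %zu)\n",
               d, dr.begin, dr.end, t0.begin, t0.end);
    }
    return 0;
}
#endif
```

Compiling with `-DDEMO_MAIN` prints the per-DPU ranges. The same `block_of` helper is reused for both levels, so the blocks of different DPUs, and of different tasklets within a DPU, never overlap.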
#### KEY TAKEAWAY 1

The UPMEM PIM architecture is fundamentally compute bound. As a result, the most suitable workloads are memory-bound.

## **Vector Addition (VA)**

- Our first programming example
- We partition the input arrays across:
  - DPUs
  - Tasklets, i.e., software threads running on a DPU

## **UPMEM SDK Documentation**

#### **User Manual**

#### **Getting started**

- The UPMEM DPU toolchain
  - Notes before starting
  - The toolchain purpose
  - dpu-upmem-dpurte-clang
  - Limitations
  - The DPU Runtime Library
  - The Host Library
  - dpu-lldb
- Installing the UPMEM DPU toolchain
  - Dependencies
  - Python
  - Installation packages
  - Installation from tar.gz binary archive
  - Functional simulator
- Hello World! Example
  - Purpose
  - Writing and building the program

## **General Programming Recommendations**

From UPMEM programming guide\*, presentations\*, and white papers<sup>☆</sup>

#### GENERAL PROGRAMMING RECOMMENDATIONS

1. Execute on the *DRAM Processing Units* (*DPUs*) **portions of parallel code** that are as long as possible.
2. Split the workload into **independent data blocks**, which the DPUs operate on independently.
3. Use **as many working DPUs** in the system as possible.
4. Launch at least **11** *tasklets* (i.e., software threads) per DPU.

\* https://sdk.upmem.com/2021.1.1/index.html

\* F. Devaux. "The true Processing In Memory accelerator," HotChips 2019. doi: 10.1109/HOTCHIPS.2019.8875680

## **DPU Allocation**

- dpu\_alloc() allocates a number of DPUs
  - Creates a dpu\_set

```
struct dpu_set_t dpu_set, dpu;
uint32_t nr_of_dpus;

// Allocate DPUs
DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &dpu_set));
DPU_ASSERT(dpu_get_nr_dpus(dpu_set, &nr_of_dpus));
printf("Allocated %u DPU(s)\n", nr_of_dpus);
```

Can we allocate different DPU sets over the course of a program? Yes, we can.
We show an example next.

We deallocate a DPU set with dpu\_free()

## **DPU Allocation: Needleman-Wunsch (NW)**

In NW we change the number of DPUs in the DPU set as computation progresses

```
// Top-left computation on DPUs
for (unsigned int blk = 1; blk <= (max_cols-1)/BL; blk++) {
    // If nr_of_blocks are lower than max_dpus,
    // set nr_of_dpus to be equal with nr_of_blocks
    unsigned nr_of_blocks = blk;
    if (nr_of_blocks < max_dpus) {
        DPU_ASSERT(dpu_free(dpu_set));
        DPU_ASSERT(dpu_alloc(nr_of_blocks, NULL, &dpu_set));
        DPU_ASSERT(dpu_load(dpu_set, DPU_BINARY, NULL));
        DPU_ASSERT(dpu_get_nr_dpus(dpu_set, &nr_of_dpus));
    } else if (nr_of_dpus == max_dpus) {
        // Already using max_dpus DPUs: keep the current set
    } else {
        DPU_ASSERT(dpu_free(dpu_set));
        DPU_ASSERT(dpu_alloc(max_dpus, NULL, &dpu_set));
        DPU_ASSERT(dpu_load(dpu_set, DPU_BINARY, NULL));
        DPU_ASSERT(dpu_get_nr_dpus(dpu_set, &nr_of_dpus));
    }
    // ... (remainder of the loop body is not shown on the slide)
}
```

## **Load DPU Binary**

dpu\_load() loads a program in all DPUs of a dpu\_set

```
// Define the DPU Binary path as DPU_BINARY here
#ifndef DPU_BINARY
#define DPU_BINARY "./bin/dpu_code"
#endif

// Load binary
DPU_ASSERT(dpu_load(dpu_set, DPU_BINARY, NULL));
```

Is it possible to launch different kernels onto different DPUs? Yes, it is possible.
This enables:

- Workloads with task-level parallelism
- Different programs using different DPU sets

## CPU-DPU/DPU-CPU Data Transfers

- CPU-DPU and DPU-CPU transfers
  - Between host CPU's main memory and DPUs' MRAM banks
- Serial CPU-DPU/DPU-CPU transfers:
  - A single DPU (i.e., 1 MRAM bank)
- Parallel CPU-DPU/DPU-CPU transfers:
  - Multiple DPUs (i.e., many MRAM banks)
- Broadcast CPU-DPU transfers:
  - Multiple DPUs with a single buffer

## **Serial Transfers**

- dpu\_copy\_to();
- dpu\_copy\_from();
- We transfer (part of) a buffer to/from each DPU in the dpu\_set
- DPU\_MRAM\_HEAP\_POINTER\_NAME: Start of the MRAM range that can be freely accessed by applications
  - We do not allocate MRAM explicitly

```
DPU_FOREACH (dpu_set, dpu) {
    // The remaining arguments of dpu_copy_to() are cut off in the source
    DPU_ASSERT(dpu_copy_to(dpu, DPU_MRAM_HEAP_POINTER_NAME, ...));
}
```

## **Parallel Transfers**

- We push different buffers to/from a DPU set in one transfer
  - All buffers need to be of the same size
- First, prepare (dpu\_prepare\_xfer); then, push (dpu\_push\_xfer)
- Direction:
  - DPU\_XFER\_TO\_DPU
  - DPU\_XFER\_FROM\_DPU

```
// Array A: one buffer per DPU, written at offset 0 of the MRAM heap
DPU_FOREACH(dpu_set, dpu, i) {
    DPU_ASSERT(dpu_prepare_xfer(dpu, bufferA + input_size_dpu_8bytes * i));
}
DPU_ASSERT(dpu_push_xfer(dpu_set, DPU_XFER_TO_DPU, DPU_MRAM_HEAP_POINTER_NAME, 0,
                         input_size_dpu_8bytes * sizeof(T), DPU_XFER_DEFAULT));

// Array B: one buffer per DPU, written right after array A in the MRAM heap
DPU_FOREACH(dpu_set, dpu, i) {
    DPU_ASSERT(dpu_prepare_xfer(dpu, bufferB + input_size_dpu_8bytes * i));
}
DPU_ASSERT(dpu_push_xfer(dpu_set, DPU_XFER_TO_DPU, DPU_MRAM_HEAP_POINTER_NAME,
                         input_size_dpu_8bytes * sizeof(T),
                         input_size_dpu_8bytes * sizeof(T), DPU_XFER_DEFAULT));
```

## **Broadcast Transfers**

- dpu\_broadcast\_to();
  - Only CPU to DPU
- We transfer the same buffer to all DPUs in the dpu\_set

```
DPU_ASSERT(dpu_broadcast_to(dpu_set, DPU_MRAM_HEAP_POINTER_NAME, 0,
                            bufferA,                       // Pointer to main memory
                            input_size_dpu * sizeof(T),    // Transfer size
                            DPU_XFER_DEFAULT));
```

## Different Types of Transfers in a Program

- An example benchmark that uses both parallel and serial transfers
- Select (SEL)
  - Remove even values

#### Inter-DPU Communication

There is no direct communication channel between DPUs

- Inter-DPU communication takes place via the host CPU using CPU-DPU and DPU-CPU transfers
- Example communication patterns:
  - Merging of partial results to obtain the final result
    - Only DPU-CPU transfers
  - Redistribution of intermediate results for further computation
    - DPU-CPU transfers and CPU-DPU transfers

## How Fast are these Data Transfers?

- With a microbenchmark, we obtain the sustained bandwidth of all types of CPU-DPU and DPU-CPU transfers
- Two experiments:
  - 1 DPU: variable CPU-DPU and DPU-CPU transfer size (8 bytes to 32 MB)
  - 1 rank: 32 MB CPU-DPU and DPU-CPU transfers to/from a set of 1 to 64 MRAM banks within the same rank
- Preliminary experiments with more than one rank
  - Channel-level parallelism

DDR4 bandwidth bounds the maximum transfer bandwidth

The cost of the transfers can be amortized if enough computation is run on the DPUs

## CPU-DPU/DPU-CPU Transfers: 1 DPU

Data transfer size varies between 8 bytes and 32 MB

#### KEY OBSERVATION 7

**Larger CPU-DPU and DPU-CPU transfers** between the host main memory and the DRAM Processing Unit's Main memory (MRAM) banks **result in higher sustained bandwidth**.

# CPU-DPU/DPU-CPU Transfers: 1 Rank (I)

- CPU-DPU (serial/parallel/broadcast) and DPU-CPU (serial/parallel)
- The number of DPUs varies between 1 and 64

#### **KEY OBSERVATION 8**

The **sustained bandwidth of parallel CPU-DPU and DPU-CPU transfers** between the host main memory and the DRAM Processing Unit's Main memory (MRAM) banks **increases with the number of DRAM Processing Units inside a rank**.
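Parallel transfers require every per-DPU buffer to have the same size, so the host has to settle on a single chunk size before it prepares and pushes the transfer. The sizing step can be sketched as follows. This is plain C with no SDK calls; the helper name `elems_per_dpu_padded` is hypothetical, and padding the chunk to a multiple of 8 bytes is an assumption suggested by the variable name `input_size_dpu_8bytes` in the parallel-transfer example, not a rule stated on the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the UPMEM SDK).
 * Returns the number of elements assigned to each DPU such that
 *  - all DPUs receive the same number of elements, and
 *  - the per-DPU size in bytes is a multiple of 8 (assumed alignment). */
size_t elems_per_dpu_padded(size_t total_elems, size_t nr_dpus, size_t elem_size)
{
    assert(nr_dpus > 0 && elem_size > 0);

    /* Ceiling division: the last DPU may receive padding elements. */
    size_t elems = (total_elems + nr_dpus - 1) / nr_dpus;

    /* Grow the chunk until its byte size is a multiple of 8.
     * This takes at most 7 steps, since any multiple of 8 elements qualifies. */
    while ((elems * elem_size) % 8 != 0) {
        elems++;
    }
    return elems;
}
```

For 1,000 32-bit elements spread over 64 DPUs this gives 16 elements (64 bytes) per DPU, so the host would pad its input buffer to 64 × 16 elements before preparing the transfers.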
# CPU-DPU/DPU-CPU Transfers: 1 Rank (II)

- CPU-DPU (serial/parallel/broadcast) and DPU-CPU (serial/parallel)
- The number of DPUs varies between 1 and 64

#### **KEY OBSERVATION 9**

The sustained bandwidth of parallel CPU-DPU transfers is higher than the sustained bandwidth of parallel DPU-CPU transfers due to different implementations of CPU-DPU and DPU-CPU transfers in the UPMEM runtime library.

The sustained bandwidth of broadcast CPU-DPU transfers (i.e., the same buffer is copied to multiple MRAM banks) is higher than that of parallel CPU-DPU transfers (i.e., different buffers are copied to different MRAM banks) due to higher temporal locality in the CPU cache hierarchy.

# "Transposing" Library

## The library feeds DPUs with correct data

Copyright UPMEM® 2019, **HOT CHIPS 31**

## **Microbenchmark: CPU-DPU**

CPU-DPU (serial/parallel/broadcast) and DPU-CPU (serial/parallel)

## **DPU Kernel Launch**

- dpu\_launch() launches a kernel on a dpu\_set
  - DPU\_SYNCHRONOUS suspends the application until the kernel finishes
  - DPU\_ASYNCHRONOUS returns the control to the application
    - dpu\_sync or dpu\_status to check kernel completion

```
printf("Run program on DPU(s) \n");
// Run DPU kernel
DPU_ASSERT(dpu_launch(dpu_set, DPU_SYNCHRONOUS));
```

What does the asynchronous execution enable?

#### Some ideas:

- Task-level parallelism: concurrent execution of different kernels on different DPU sets
- Concurrent heterogeneous computation on CPU and DPUs

### How to Pass Parameters to the Kernel?
- We can use serial and parallel transfers
- We pass them directly to the scratchpad memory of the DPU
  - Working RAM (WRAM): 64KB per DPU
- This is useful for input parameters and some results

## **Recall: Vector Addition (VA)**

- Our first programming example
- We partition the input arrays across:
  - DPUs
  - Tasklets, i.e., software threads running on a DPU

# Programming a DPU Kernel (I)

Vector addition

```
int main_kernel1() {
    // Tasklet ID
    unsigned int tasklet_id = me();
    // Size of the vector tile processed by a DPU
    uint32_t input_size_dpu_bytes = DPU_INPUT_ARGUMENTS.size; // Input size per DPU in bytes
    uint32_t input_size_dpu_bytes_transfer = DPU_INPUT_ARGUMENTS.transfer_size; // Transfer input size per DPU in bytes

    // Address of the current processing block in MRAM
    uint32_t base_tasklet = tasklet_id << BLOCK_SIZE_LOG2;
    // MRAM addresses of arrays A and B
    uint32_t mram_base_addr_A = (uint32_t)DPU_MRAM_HEAP_POINTER;
    uint32_t mram_base_addr_B = (uint32_t)(DPU_MRAM_HEAP_POINTER + input_size_dpu_bytes_transfer);

    // Initialize a local cache to store the MRAM block (WRAM allocation)
    T *cache_A = (T *) mem_alloc(BLOCK_SIZE);
    T *cache_B = (T *) mem_alloc(BLOCK_SIZE);

    for (unsigned int byte_index = base_tasklet; byte_index < input_size_dpu_bytes; byte_index += BLOCK_SIZE * NR_TASKLETS) {
        uint32_t l_size_bytes = (byte_index + BLOCK_SIZE >= input_size_dpu_bytes) ? (input_size_dpu_bytes - byte_index) : BLOCK_SIZE;

        // Load cache with current MRAM block (MRAM-WRAM DMA transfers)
        mram_read((__mram_ptr void const*)(mram_base_addr_A + byte_index), cache_A, l_size_bytes);
        mram_read((__mram_ptr void const*)(mram_base_addr_B + byte_index), cache_B, l_size_bytes);

        // Vector addition (see next slide)
        vector_addition(cache_B, cache_A, l_size_bytes >> DIV);

        // Write cache to current MRAM block (WRAM-MRAM DMA transfer)
        mram_write(cache_B, (__mram_ptr void*)(mram_base_addr_B + byte_index), l_size_bytes);
    }
    return 0;
}
```

# Programming a DPU Kernel (II)

Vector addition

```
// vector_addition: Computes the vector addition of a cached block
static void vector_addition(T *bufferB, T *bufferA, unsigned int l_size) {
    for (unsigned int i = 0; i < l_size; i++) {
        bufferB[i] += bufferA[i];
    }
}
```

# Intra-DPU Synchronization

## **Synchronization Primitives**

- A tasklet is the software abstraction of a hardware thread
- Each tasklet can have its own memory space in WRAM
  - Tasklets can also share data in WRAM by sharing pointers
- Tasklets within the same DPU can synchronize
  - Mutual exclusion

```
mutex_lock(); mutex_unlock();
```

  - Handshakes

```
handshake_wait_for(); handshake_notify();
```

  - Barriers
    - barrier\_wait();
  - Semaphores
    - sem\_give(); sem\_take();

## Parallel Reduction (I)

Tasklets in a DPU can work together on a parallel reduction

# Parallel Reduction (II)

Each tasklet computes a local sum

## Parallel Reduction (III)

Each tasklet computes a local sum

```
for (unsigned int byte_index = base_tasklet; byte_index < input_size_dpu_bytes; byte_index += BLOCK_SIZE * NR_TASKLETS) {
    // Bound checking
    uint32_t l_size_bytes = (byte_index + BLOCK_SIZE >= input_size_dpu_bytes) ? (input_size_dpu_bytes - byte_index) : BLOCK_SIZE;

    // Load cache with current MRAM block
    mram_read((__mram_ptr void const*)(mram_base_addr_A + byte_index), cache_A, l_size_bytes);

    // Reduction in each tasklet: accumulate in a local sum
    l_count += reduction(cache_A, l_size_bytes >> DIV);
}

// Copy local count (the local sum) to the shared array in WRAM
message[tasklet_id] = l_count;
```

## **Final Reduction**

A single tasklet can perform the final reduction

```
// Single-thread reduction
// Barrier synchronization
barrier_wait(&my_barrier);
if (tasklet_id == 0) {
    #pragma unroll
    for (unsigned int each_tasklet = 1; each_tasklet < NR_TASKLETS; each_tasklet++) {
        message[0] += message[each_tasklet]; // Sequential accumulation
    }
    // Total count in this DPU
    result->t_count = message[0];
}
```

## **Vector Reduction: Naïve Mapping**

Slide credit: Hwu & Kirk

# **Using Barriers: Tree-Based Reduction**

- Multiple tasklets can perform a tree-based reduction
  - After every iteration, tasklets synchronize with a barrier
  - Half of the tasklets retire at the end of an iteration

```
// Barrier
barrier_wait(&my_barrier);
#pragma unroll
for (unsigned int offset = 1; offset < NR_TASKLETS; offset <<= 1) {
    if ((tasklet_id & (2*offset - 1)) == 0) {
        message[tasklet_id] += message[tasklet_id + offset];
    }
    // Barrier synchronization
    barrier_wait(&my_barrier);
}
```

A handshake-based
tree-based reduction is also possible. We can compare single-tasklet, barrier-based, and handshake-based versions\*

### Parallel Reduction on GPU

#### **UPMEM SDK Documentation**

User Manual: Getting started (the UPMEM DPU toolchain, installing the UPMEM DPU toolchain, Hello World! example)

# Microbenchmarking of UPMEM PIM

## **DPU Pipeline**

- In-order pipeline
  - Up to 425 MHz
- Fine-grain multithreaded
  - 24 hardware threads
- 14 pipeline stages
  - DISPATCH: Thread selection
  - FETCH: Instruction fetch
  - READOP: Register file
  - FORMAT: Operand formatting
  - ALU: Operation and WRAM
  - MERGE: Result formatting

## **Arithmetic Throughput: Microbenchmark**

#### Goal

Measure the maximum arithmetic throughput for different datatypes and operations

#### Microbenchmark

- We stream over an array in WRAM and perform read-modify-write operations
- Experiments on one DPU
- We vary the number of tasklets from 1 to 24
- Arithmetic operations: add, subtract, multiply, divide
- Datatypes: int32, int64, float, double
- We measure cycles with an accurate cycle counter that the SDK provides
- We include WRAM accesses (including address calculation) and the arithmetic operation

## Microbenchmark for INT32 ADD Throughput

```
// C-based code
#define SIZE 256
int* bufferA = mem_alloc(SIZE * sizeof(int));
for(int i = 0; i < SIZE; i++){
    int temp = bufferA[i];
    temp += scalar;
    bufferA[i] = temp;
}

// Compiled code in the UPMEM DPU ISA
    move r2, 0
.LBB0_1:                      // Loop header
    lsl_add r3, r0, r2, 2     // Address calculation
    lw r4, r3, 0              // Load from WRAM
    add r4, r4, r1            // Add
    sw r3, 0, r4              // Store to WRAM
    add r2, r2, 1             // Index update
    jneq r2, 256, .LBB0_1     // Conditional jump
```

## **Arithmetic Throughput: 11 Tasklets**

#### **KEY OBSERVATION 1**

The arithmetic throughput of a DRAM Processing Unit saturates at 11 or more tasklets. This observation is consistent for different datatypes (INT32, INT64, UINT32, UINT64, FLOAT, DOUBLE) and operations (ADD, SUB, MUL, DIV).

## **Arithmetic Throughput: ADD/SUB**

INT32 ADD/SUB are 17% faster than INT64 ADD/SUB

#### Can we explain the peak throughput?

Peak throughput at 11 tasklets. One instruction retires every cycle when the pipeline is full

$$Arithmetic\ Throughput\ (in\ OPS) = \frac{frequency_{DPU}}{\#instructions}$$

## **Arithmetic Throughput: #Instructions**

Compiler explorer: <a href="https://dpu.dev">https://dpu.dev</a>

```
// C source
#define BLOCK_SIZE 1024
typedef int T;
void Benchmark_32bits(T *cache_A, T scalar) {
    for (int i = 0; i < BLOCK_SIZE / sizeof(T); i++){
        ///// WRAM READ /////
        T temp = cache_A[i];
        temp += scalar; // ADD
        ///// WRAM WRITE /////
        cache_A[i] = temp;
    }
}

typedef long T_long;
void Benchmark_64bits(T_long *cache_A, T_long scalar) {
    for (int i = 0; i < BLOCK_SIZE / sizeof(T_long); i++){
        ///// WRAM READ /////
        T_long temp = cache_A[i];
        temp += scalar; // ADD
        ///// WRAM WRITE /////
        cache_A[i] = temp;
    }
}

// Compiler output (UPMEM DPU ISA)
Benchmark_32bits:
    move r2, 0
.LBB0_1:
    lsl_add r3, r0, r2, 2
    lw r4, r3, 0
    add r4, r4, r1
    sw r3, 0, r4
    add r2, r2, 1
    jneq r2, 256, .LBB0_1
    jump r23
Benchmark_64bits:
    move r1, 0
.LBB1_1:
    lsl_add r4, r0, r1, 3
    ld d6, r4, 0
    add r7, r7, r3
    addc r6, r6, r2
    sd r4, 0, d6
    add r1, r1, 1
    jneq r1, 128, .LBB1_1
    jump r23
```

- 6 instructions in the 32-bit ADD/SUB microbenchmark
- 7 instructions in the 64-bit ADD/SUB microbenchmark

# **Arithmetic Throughput: ADD/SUB**

64-bit ADD/SUB: 7 instructions $\rightarrow$ 50.00 MOPS at $frequency_{DPU}$ = 350 MHz

# **Arithmetic Throughput: MUL/DIV**

# **Arithmetic Throughput: Native Support**

#### **KEY OBSERVATION 2**

- DPUs provide native hardware support for 32- and 64-bit integer addition and subtraction, leading to high throughput for these operations.
- DPUs do not natively support 32- and 64-bit multiplication and division, or floating-point operations. These operations are emulated by the UPMEM runtime library, leading to much lower throughput.

## Microbenchmark: Arithmetic Throughput

Arithmetic throughput for different operations and datatypes

## **DPU: WRAM Bandwidth**

#### **WRAM Bandwidth: Microbenchmark**

- Goal
  - Measure the WRAM bandwidth for the STREAM benchmark
- Microbenchmark
  - We implement the four versions of STREAM: COPY, ADD, SCALE, and TRIAD
  - The operations performed in ADD, SCALE, and TRIAD are addition, multiplication, and addition+multiplication, respectively
  - We vary the number of tasklets from 1 to 16
  - We show results for 1 DPU
- We do not include accesses to MRAM

#### STREAM Benchmark in WRAM

```
// COPY: 8 bytes read, 8 bytes written, no arithmetic operations
for(int i = 0; i < SIZE; i++){
    bufferB[i] = bufferA[i];
}

// ADD: 16 bytes read, 8 bytes written, ADD
for(int i = 0; i < SIZE; i++){
    bufferC[i] = bufferA[i] + bufferB[i];
}

// SCALE: 8 bytes read, 8 bytes written, MUL
for(int i = 0; i < SIZE; i++){
    bufferB[i] = scalar * bufferA[i];
}

// TRIAD: 16 bytes read, 8 bytes written, MUL, ADD
for(int i = 0; i < SIZE; i++){
    bufferC[i] = bufferA[i] + scalar * bufferB[i];
}
```

## **WRAM Bandwidth: STREAM**

#### How can we estimate the bandwidth?
Assuming that the pipeline is full, and *Bytes* is the number of bytes read and written:

$$WRAM\ Bandwidth\ \left(in\ \frac{B}{s}\right) = \frac{Bytes \times frequency_{DPU}}{\#instructions}$$

#### **WRAM Bandwidth: COPY**

COPY executes 2 instructions (WRAM load and store). With 11 tasklets, 11 × 16 bytes in 22 cycles:

$$WRAM\ Bandwidth\ \left(in\ \frac{B}{s}\right) = 2,800\ \frac{MB}{s}\ at\ 350\ MHz$$

## **WRAM Bandwidth: ADD**

$$WRAM\ Bandwidth\ \left(in\ \frac{B}{s}\right) = \frac{Bytes \times frequency_{DPU}}{\#instructions}$$

ADD executes 5 instructions (2 ld, add, addc, sd). With 11 tasklets, 11 × 24 bytes in 55 cycles:

$$WRAM\ Bandwidth\ \left(in\ \frac{B}{s}\right) = 1,680\ \frac{MB}{s}\ at\ 350\ MHz$$

#### WRAM Bandwidth: Access Patterns

All 8-byte WRAM loads and stores take one cycle when the DPU pipeline is full

#### **KEY OBSERVATION 3**

The sustained bandwidth provided by the DPU's internal Working memory (WRAM) is **independent of the memory access pattern** (either streaming, strided, or random access pattern).

**All 8-byte WRAM loads and stores take one cycle**, when the DPU's pipeline is full (i.e., with 11 or more tasklets).
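The two pipeline-bound estimates used so far (arithmetic throughput and WRAM bandwidth) share the same structure: one instruction retires per cycle, so one loop iteration completes every `#instructions` cycles. A minimal sketch of both formulas in C follows; the function names are illustrative and not part of any SDK, and frequencies are passed in MHz so that the results come out in MOPS and MB/s.

```c
#include <assert.h>

/* Pipeline-bound arithmetic throughput in MOPS:
 * frequency_DPU / #instructions (one operation per loop iteration). */
double arithmetic_throughput_mops(double frequency_mhz, unsigned instructions)
{
    assert(instructions > 0);
    return frequency_mhz / (double)instructions;
}

/* Pipeline-bound WRAM bandwidth in MB/s:
 * Bytes x frequency_DPU / #instructions, where `bytes` counts the bytes
 * read and written by one loop iteration. */
double wram_bandwidth_mbs(double bytes, double frequency_mhz, unsigned instructions)
{
    assert(instructions > 0);
    return bytes * frequency_mhz / (double)instructions;
}
```

At 350 MHz, `wram_bandwidth_mbs(16, 350, 2)` reproduces the 2,800 MB/s estimate for COPY and `wram_bandwidth_mbs(24, 350, 5)` the 1,680 MB/s estimate for ADD. `arithmetic_throughput_mops(350, 7)` gives the 50 MOPS quoted for 64-bit ADD/SUB, and with the 6 instructions of the 32-bit loop the same formula yields about 58.3 MOPS, which is where the 17% gap between INT32 and INT64 comes from.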
```
// Microbenchmark:
c[a[i]] = b[a[i]];
// Unit-stride:
a[i] = a[i-1] + 1;
// Strided:
a[i] = a[i-1] + stride;
// Random:
a[i] = rand();
```

## Microbenchmark: STREAM and WRAM

STREAM benchmark and WRAM access patterns

## **DPU: MRAM Latency and Bandwidth**

#### **MRAM Bandwidth**

- Goal
  - Measure MRAM bandwidth for different access patterns
- Microbenchmarks
  - Latency of a single DMA transfer for different transfer sizes
    - mram\_read(); // MRAM-WRAM DMA transfer
    - mram\_write(); // WRAM-MRAM DMA transfer
  - STREAM benchmark
    - COPY, COPY-DMA
    - ADD, SCALE, TRIAD
  - Strided access pattern
    - Coarse-grain strided access
    - Fine-grain strided access
  - Random access pattern (GUPS)
- We do include accesses to MRAM

# MRAM Read and Write Latency (I)

$$MRAM\ Bandwidth\ \left(in\ \frac{B}{s}\right) = \frac{size \times frequency_{DPU}}{MRAM\ Latency}$$

We can model the MRAM latency with a linear expression:

$$MRAM\ Latency\ (in\ cycles) = \alpha + \beta \times size$$

In our measurements, $\beta$ equals 0.5 cycles/byte. Theoretical maximum MRAM bandwidth = 700 MB/s at 350 MHz

# MRAM Read and Write Latency (II)

#### **KEY OBSERVATION 4**

- The DPU's Main memory (MRAM) bank access latency increases linearly with the transfer size.
- The maximum theoretical MRAM bandwidth is 2 bytes per cycle.

# MRAM Read and Write Latency (III)

Read and write accesses to MRAM are symmetric

The sustained MRAM bandwidth increases with data transfer size

#### **PROGRAMMING RECOMMENDATION 1**

For data movement between the DPU's MRAM bank and the WRAM, use large DMA transfer sizes when all the accessed data is going to be used.

# MRAM Read and Write Latency (IV)

MRAM latency changes slowly between 8 and 128 bytes

For small transfers, the fixed cost $(\alpha)$ dominates the variable cost $(\beta \times size)$

#### PROGRAMMING RECOMMENDATION 2

For small transfers between the MRAM bank and the WRAM, **fetch more bytes than necessary within a 128-byte limit**.
Doing so increases the likelihood of finding data in WRAM for later accesses (i.e., the program can check whether the desired data is in WRAM before issuing a new MRAM access).

# MRAM Read and Write Latency (V)

2,048-byte transfers are only 4% faster than 1,024-byte transfers

Larger transfers require more WRAM, which may limit the number of tasklets

#### PROGRAMMING RECOMMENDATION 3

**Choose the data transfer size between the MRAM bank and the WRAM based on the program's WRAM usage**, as it imposes a tradeoff between the sustained MRAM bandwidth and the number of tasklets that can run in the DPU (which is dictated by the limited WRAM capacity).

#### STREAM Benchmark in MRAM

```
// COPY
// Load current MRAM block to WRAM
mram_read((__mram_ptr void const*)mram_address_A, bufferA, SIZE * sizeof(uint64_t));
for(int i = 0; i < SIZE; i++){
    bufferB[i] = bufferA[i];
}
// Write WRAM block to MRAM
mram_write(bufferB, (__mram_ptr void*)mram_address_B, SIZE * sizeof(uint64_t));

// COPY-DMA
// Load current MRAM block to WRAM
mram_read((__mram_ptr void const*)mram_address_A, bufferA, SIZE * sizeof(uint64_t));
// Write WRAM block to MRAM
mram_write(bufferA, (__mram_ptr void*)mram_address_B, SIZE * sizeof(uint64_t));
```

## **STREAM Benchmark: COPY-DMA**

The sustained bandwidth of **COPY-DMA** is close to the theoretical maximum (700 MB/s): ~1.6 TB/s for 2,556 DPUs

**COPY-DMA** saturates with two tasklets, even though the DMA engine can perform only one transfer at a time

Using two or more tasklets guarantees that there is always a DMA request enqueued to keep the DMA engine busy

## STREAM Benchmark: Bandwidth Saturation (I)

**COPY** and **ADD** saturate at 4 and 6 tasklets, respectively

**SCALE** and **TRIAD** saturate at 11 tasklets

The latency of MRAM accesses becomes longer than the pipeline latency after 4 and 6 tasklets for COPY and ADD, respectively

The pipeline latency of **SCALE** and **TRIAD** is longer than the MRAM latency for any number of tasklets (both use costly MUL)

## STREAM Benchmark: Bandwidth Saturation (II)

#### **KEY OBSERVATION 5**

- When the access latency to an MRAM bank for a streaming benchmark (COPY-DMA, COPY, ADD) is larger than the pipeline latency (i.e., execution latency of arithmetic operations and WRAM accesses), the performance of the DPU saturates at a number of tasklets smaller than 11. This is a memory-bound workload.
- When the pipeline latency for a streaming benchmark (SCALE, TRIAD) is larger than the MRAM access latency, the performance of a DPU saturates at 11 tasklets. This is a compute-bound workload.
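The linear latency model introduced earlier (latency = α + β × size) explains why sustained MRAM bandwidth grows with the transfer size and approaches, but never reaches, the 700 MB/s bound. A small sketch of the model in C follows. The function names are illustrative; β = 0.5 cycles/byte comes from the text, whereas the fixed cost α is not given here, so it is left as a parameter and any value passed for it is an assumption of the caller.

```c
#include <assert.h>

/* Linear MRAM latency model: latency (cycles) = alpha + beta * size.
 * alpha_cycles is the fixed cost per DMA transfer; its measured value is not
 * stated in the text, so it must be supplied by the caller. */
double mram_latency_cycles(double alpha_cycles, double beta_cycles_per_byte,
                           double size_bytes)
{
    assert(alpha_cycles >= 0.0 && beta_cycles_per_byte > 0.0 && size_bytes > 0.0);
    return alpha_cycles + beta_cycles_per_byte * size_bytes;
}

/* Sustained MRAM bandwidth in MB/s: size * frequency_DPU / MRAM latency,
 * with the DPU frequency given in MHz. */
double mram_bandwidth_mbs(double alpha_cycles, double beta_cycles_per_byte,
                          double size_bytes, double frequency_mhz)
{
    double latency = mram_latency_cycles(alpha_cycles, beta_cycles_per_byte, size_bytes);
    return size_bytes * frequency_mhz / latency;
}
```

With α set to 0 and β = 0.5, the model returns 700 MB/s at 350 MHz for any transfer size, which is the theoretical maximum of 2 bytes per cycle. With any positive α the result stays below that bound and increases with the transfer size, which is the behavior behind Programming Recommendation 1.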
#### Strided and Random Access to MRAM

```
// COARSE-GRAINED STRIDED ACCESS
// Load current MRAM block to WRAM
mram_read((__mram_ptr void const*)mram_address_A, bufferA, SIZE * sizeof(uint64_t));
mram_read((__mram_ptr void const*)mram_address_B, bufferB, SIZE * sizeof(uint64_t));
for(int i = 0; i < SIZE; i += stride){
    bufferB[i] = bufferA[i];
}
// Write WRAM block to MRAM
mram_write(bufferB, (__mram_ptr void*)mram_address_B, SIZE * sizeof(uint64_t));

// FINE-GRAINED STRIDED & RANDOM ACCESS
for(int i = 0; i < SIZE; i += stride){
    int index = i * sizeof(uint64_t);
    // Load current MRAM element to WRAM
    mram_read((__mram_ptr void const*)(mram_address_A + index), bufferA, sizeof(uint64_t));
    // Write WRAM element to MRAM
    mram_write(bufferA, (__mram_ptr void*)(mram_address_B + index), sizeof(uint64_t));
}
```

# Strided and Random Accesses (I)

Large difference in maximum sustained bandwidth between coarse-grained and fine-grained DMA

Coarse-grained DMA uses 1,024-byte transfers, while fine-grained DMA uses 8-byte transfers

Random access achieves very similar maximum sustained bandwidth to the fine-grained strided approach

# Strided and Random Accesses (II)

The sustained MRAM bandwidth of coarse-grained DMA decreases as the stride increases

The effective utilization of the transferred data decreases as the stride becomes larger (e.g., a stride of 4 means that only one fourth of the transferred data is used)

# Strided and Random Accesses (III)

For a stride of 16 or larger, the fine-grained DMA approach achieves higher bandwidth
With stride 16, only one sixteenth of the maximum sustained bandwidth (622.36 MB/s) of coarse-grained DMA is effectively used, which is lower than the bandwidth of fine-grained DMA (72.58 MB/s)

# Strided and Random Accesses (IV)

#### PROGRAMMING RECOMMENDATION 4

- For strided access patterns with a **stride smaller than 16 8-byte elements, fetch a large contiguous chunk** (e.g., 1,024 bytes) from a DPU's MRAM bank.
- For strided access patterns with **larger strides and random access patterns**, fetch **only the data elements that are needed** from an MRAM bank.

## Microbenchmark: Strided and Random

Strided and random accesses to MRAM

#### DPU: Arithmetic Throughput vs. Operational Intensity

#### Arithmetic Throughput vs. Operational Intensity (I)

- Goal
  - Characterize memory-bound regions and compute-bound regions for different datatypes and operations
- Microbenchmark
  - We load one chunk of an MRAM array into WRAM
  - Perform a variable number of operations on the data
  - Write back to MRAM
- The experiment is inspired by the Roofline model\*
- We define operational intensity (OI) as the number of arithmetic operations performed per byte accessed from MRAM (OP/B)
- The pipeline latency changes with the operational intensity, but the MRAM access latency is fixed

## Arithmetic Throughput vs. Operational Intensity (II)

```
// input_repeat greater than or equal to 1 indicates the (integer) number of
// repetitions per input element
int repetitions = input_repeat >= 1.0 ? (int)input_repeat : 1;
// input_repeat smaller than 1 indicates the fraction of elements that are updated
int stride = input_repeat >= 1.0 ? 1 : (int)(1 / input_repeat);

// Load current MRAM block to WRAM
mram_read((__mram_ptr void const*)mram_address_A, bufferA, SIZE * sizeof(T));

// Update
for(int r = 0; r < repetitions; r++){
    for(int i = 0; i < SIZE; i += stride){
#ifdef ADD
        bufferA[i] += scalar; // ADD
#elif SUB
        bufferA[i] -= scalar; // SUB
#elif MUL
        bufferA[i] *= scalar; // MUL
#elif DIV
        bufferA[i] /= scalar; // DIV
#endif
    }
}

// Write WRAM block to MRAM
mram_write(bufferA, (__mram_ptr void*)mram_address_B, SIZE * sizeof(T));
```

#### Arithmetic Throughput vs. Operational Intensity (III)

We show results of arithmetic throughput vs. operational intensity for (a) 32-bit integer ADD, (b) 32-bit integer MUL, (c) 32-bit floating-point ADD, and (d) 32-bit floating-point MUL (results for other datatypes and operations show similar trends)

#### Arithmetic Throughput vs. Operational Intensity (IV)

In the memory-bound region, the arithmetic throughput increases with the operational intensity

In the compute-bound region, the arithmetic throughput is flat at its maximum

The throughput saturation point is the operational intensity where the transition between the memory-bound region and the compute-bound region happens

The throughput saturation point is as low as ¼ OP/B, i.e., 1 integer addition per 32-bit element fetched

## Arithmetic Throughput vs. Operational Intensity (V)

#### **KEY OBSERVATION 6**

The arithmetic throughput of a DRAM Processing Unit (DPU) saturates at low or very low operational intensity (e.g., 1 integer addition per 32-bit element). Thus, the DPU is fundamentally a compute-bound processor.

We expect most real-world workloads to be compute-bound in the UPMEM PIM architecture.

#### Microbenchmark: Arithmetic Throughput vs. Operational Intensity

Arithmetic Throughput versus Operational Intensity

# Benchmarking and Workload Suitability

#### **PrIM Benchmarks**

- Goal
  - A common set of workloads that can be used to
    - evaluate the UPMEM PIM architecture,
    - compare software improvements and compilers,
    - compare future PIM architectures and hardware
- Two key selection criteria:
  - Selected workloads from different application domains
  - Memory-bound workloads on processor-centric architectures
- 14 different workloads, 16 different benchmarks\*

# **PrIM Benchmarks: Application Domains**

| Domain | Benchmark | Short name |
|-----------------------|-------------------------------|------------|
| Dense linear algebra | Vector Addition | VA |
| Dense linear algebra | Matrix-Vector Multiply | GEMV |
| Sparse linear algebra | Sparse Matrix-Vector Multiply | SpMV |
| Databases | Select | SEL |
| Databases | Unique | UNI |
| Data analytics | Binary Search | BS |
| Data analytics | Time Series Analysis | TS |
| Graph processing | Breadth-First Search | BFS |
| Neural networks | Multilayer Perceptron | MLP |
| Bioinformatics | Needleman-Wunsch | NW |
| Image processing | Image histogram (short) | HST-S |
| Image processing | Image histogram (long) | HST-L |
| Parallel primitives | Reduction | RED |
| Parallel primitives | Prefix sum (scan-scan-add) | SCAN-SSA |
| Parallel primitives | Prefix sum (reduce-scan-scan) | SCAN-RSS |
| Parallel primitives | Matrix transposition | TRNS |

#### **Roofline Model**

Intel Advisor on an Intel Xeon E3-1225 v6 CPU

All workloads fall in the memory-bound area of the Roofline

#### **PrIM Benchmarks: Diversity**

- PrIM benchmarks are diverse:
  - Memory access patterns
  - Operations and datatypes
  - Communication/synchronization

Column groups: memory access pattern (Sequential, Strided, Random), computation pattern (Operations, Datatype), and communication/synchronization (Intra-DPU, Inter-DPU).

| Domain | Benchmark | Short name | Sequential | Strided | Random | Operations | Datatype | Intra-DPU | Inter-DPU |
|-----------------------|-------------------------------|------------|-----|-----|-----|--------------------|----------|--------------------|-----|
| Dense linear algebra | Vector Addition | VA | Yes | | | add | int32_t | | |
| Dense linear algebra | Matrix-Vector Multiply | GEMV | Yes | | | add, mul | uint32_t | | |
| Sparse linear algebra | Sparse Matrix-Vector Multiply | SpMV | Yes | | Yes | add, mul | float | | |
| Databases | Select | SEL | Yes | | | add, compare | int64_t | handshake, barrier | Yes |
| Databases | Unique | UNI | Yes | | | add, compare | int64_t | handshake, barrier | Yes |
| Data analytics | Binary Search | BS | Yes | | Yes | compare | int64_t | | |
| Data analytics | Time Series Analysis | TS | Yes | | | add, sub, mul, div | int32_t | | |
| Graph processing | Breadth-First Search | BFS | Yes | | Yes | bitwise logic | uint64_t | barrier, mutex | Yes |
| Neural networks | Multilayer Perceptron | MLP | Yes | | | add, mul, compare | int32_t | | |
| Bioinformatics | Needleman-Wunsch | NW | Yes | Yes | | add, sub, compare | int32_t | barrier | Yes |
| Image processing | Image histogram (short) | HST-S | Yes | | Yes | add | uint32_t | barrier | Yes |
| Image processing | Image histogram (long) | HST-L | Yes | | Yes | add | uint32_t | barrier, mutex | Yes |
| Parallel primitives | Reduction | RED | Yes | Yes | | add | int64_t | barrier | Yes |
| Parallel primitives | Prefix sum (scan-scan-add) | SCAN-SSA | Yes | | | add | int64_t | handshake, barrier | Yes |
| Parallel primitives | Prefix sum (reduce-scan-scan) | SCAN-RSS | Yes | | | add | int64_t | handshake, barrier | Yes |
| Parallel primitives | Matrix transposition | TRNS | Yes | | Yes | add, sub, mul | int64_t | mutex | |

#### **PrIM Benchmarks: Inter-DPU Communication**

(This slide repeats the benchmark table above; see its Inter-DPU column for the benchmarks that communicate across DPUs.)
D Image histogram (short) histogr | Data analytics | | | Yes | | Yes | compare | int64_t | | | | Graph processing Breadth-First Search S BFS Yes Yes bitwise logic uint64_t barrier, mutex Yes Neural networks Multilayer Perceptron MLP Yes add, mul, compare int32_t add, sub, compare int32_t barrier Yes add, sub, compare int32_t barrier Yes add uint32_t handshake, barrier Yes handshake, barrier Yes Prefix sum (reduce-scan-scan) SCAN-RSS Yes add int64_t handshake, barrier Yes | Data analytics | Time Series Analysis | TS | Yes | | | add, sub, mul, div | int32_t | | | | Image histogram (short) | Graph processing | Breadth-First Search | • BFS | | | Yes | bitwise logic | uint64_t | barrier, mutex | Yes | | Image histogram (short) | Neural networks | Multilayer Perceptron | - c MLP c T | Yes | | | add, mul, compare | int32_t | | | | Parallel primitives Image histogram (long) - C Utsansterses Yes add uint32_t barrier, mutex Yes | Bioinformatics | Needleman, Wuhich , HS | -2'NM 2 I | L, KED | Yes | | add, sub, compare | int32_t | barrier | Yes | | Parallel primitives Reduction RED Yes Yes Add int64_t barrier Yes Prefix Sum (schr-4d) SCAN-FSA E Medical Company (schr-4d) SCAN-RSS Yes Add int64_t handshake, barrier Yes Add int64_t handshake, barrier Yes Yes Yes Prefix sum (reduce-scan-scan) SCAN-RSS Yes Add int64_t handshake, barrier Yes Yes | Image processing | Image histogram (short) | HST-S | | | Yes | add | uint32_t | barrier | Yes | | Parallel primitives Prefix sum (reduce-scan-scan) SCAN-RSS Yes add int64_t handshake, barrier Yes Add int64_t handshake, barrier Yes | image processing | Image histogram long - | J HSainST | ers <sub>es</sub> | | Yes | add | uint32_t | barrier, mutex | Yes | | Parallel primitives Prefix sum (reduce-scan-scan) SCAN-RSS Yes add int64_t handshake, barrier Yes | Parallel primitives R | Reduction | | | Yes | | add | int64_t | barrier | Yes | | Prefix sum (reduce-scan-scan) SCAN-RSS Yes add int64_t handshake, barrier Yes | | | OCAN-59A | ermed | 
iate | resu | ts: add | int64_t | | Yes | | • Marx Trensposition D NIM/ CROSAN C CANID Ces add, sub, mul int64 t mutex | | Prefix sum (reduce-scan-scan) | SCAN-RSS | | | | | int64_t | handshake, barrier | Yes | | | | Ma Bx Frenspletion P, NV | V. SPOSAN | -S&A S | CAN- | R&S | add, sub, mul | int64_t | mutex | | DPU-CPU and CPU-DPU transfers #### **PrIM Benchmarks** - 16 benchmarks and scripts for evaluation - https://github.com/CMU-SAFARI/prim-benchmarks #### **Outline** - Introduction - Accelerator Model - UPMEM-based PIM System Overview - UPMEM PIM Programming - Vector Addition - CPU-DPU Data Transfers - Inter-DPU Communication - CPU-DPU/DPU-CPU Transfer Bandwidth - DRAM Processing Unit - Arithmetic Throughput - WRAM and MRAM Bandwidth - PrIM Benchmarks - Roofline Model - Benchmark Diversity - Evaluation - Strong and Weak Scaling - Comparison to CPU and GPU - Key Takeaways #### **Evaluation Methodology** - We evaluate the 16 PrIM benchmarks on two UPMEMbased systems: - 2,556-DPU system - 640-DPU system - Strong and weak scaling experiments on the 2,556-DPU system - 1 DPU with different numbers of tasklets - 1 rank (strong and weak) - Up to 32 ranks Strong scaling refers to how the execution time of a program solving a particular problem varies with the number of processors for a fixed problem size Weak scaling refers to how the execution time of a program solving a particular problem varies with the number of processors for a fixed problem size per processor #### **Evaluation Methodology** - We evaluate the 16 PrIM benchmarks on two UPMEMbased systems: - 2,556-DPU system - 640-DPU system - Strong and weak scaling experiments on the 2,556-DPU system - 1 DPU with different numbers of tasklets - 1 rank (strong and weak) - Up to 32 ranks - Comparison of both UPMEM-based PIM systems to state-of-the-art CPU and GPU - Intel Xeon E3-1240 CPU - NVIDIA Titan V GPU #### 2,560-DPU System - UPMEM-based PIM system with 20 UPMEM DIMMs of 16 chips each (40 ranks) - P21 DIMMs - 
Dual x86 socket - UPMEM DIMMs coexist with regular DDR4 DIMMs - 2 memory controllers/socket (3 channels each) - 2 conventional DDR4 DIMMs on one channel of one controller #### 640-DPU System - UPMEM-based PIM system with 10 UPMEM DIMMs of 8 chips each (10 ranks) - E19 DIMMs - x86 socket - 2 memory controllers (3 channels each) - 2 conventional DDR4 DIMMs on one channel of one controller #### **Datasets** #### Strong and weak scaling experiments | Benchmark | Strong Scaling Dataset | Weak Scaling Dataset | MRAM-WRAM<br>Transfer Sizes | |-----------|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------|-----------------------------| | VA | 1 DPU-1 rank: 2.5M elem. (10 MB) 32 ranks: 160M elem. (640 MB) | 2.5M elem./DPU (10 MB) | 1024 bytes | | GEMV | 1 DPU-1 rank: $8192 \times 1024$ elem. (32 MB) 32 ranks: $163840 \times 4096$ elem. (2.56 GB) | 1024 × 2048 elem./DPU (8 MB) | 1024 bytes | | SpMV | bcsstk30 [253] (12 MB) | bcsstk30 [253] | 64 bytes | | SEL | 1 DPU-1 rank: 3.8M elem. (30 MB) 32 ranks: 240M elem. (1.9 GB) | 3.8M elem./DPU (30 MB) | 1024 bytes | | UNI | 1 DPU-1 rank: 3.8M elem. (30 MB) 32 ranks: 240M elem. (1.9 GB) | 3.8M elem./DPU (30 MB) | 1024 bytes | | BS | 2M elem. (16 MB). 1 DPU-1 rank: 256K queries. (2 MB) 32 ranks: 16M queries. (128 MB) | 2M elem. (16 MB). 256K queries./DPU (2 MB). | 8 bytes | | TS | 256 elem. query. 1 DPU-1 rank: 512K elem. (2 MB) 32 ranks: 32M elem. (128 MB) | 512K elem./DPU (2 MB) | 256 bytes | | BFS | loc-gowalla [254] (22 MB) | rMat [255] (≈100K vertices and 1.2M edges per DPU) | 8 bytes | | MLP | 3 fully-connected layers. 1 DPU-1 rank: 2K neurons (32 MB) 32 ranks: ≈160K neur. (2.56 GB) | 3 fully-connected layers. 
1K neur./DPU (4 MB) | 1024 bytes | | NW | 1 DPU-1 rank: 2560 bps (50 MB), large/small sub-block= $\frac{2560}{\#DPUs}$ /2 32 ranks: 64K bps (32 GB), l./s.=32/2 | 512 bps/DPU (2MB), l./s.=512/2 | 8, 16, 32, 40 bytes | | HST-S | 1 DPU-1 rank: $1536 \times 1024$ input image [256] (6 MB) 32 ranks: $64 \times$ input image | $1536 \times 1024$ input image [256]/DPU (6 MB) | 1024 bytes | | HST-L | 1 DPU-1 rank: $1536 \times 1024$ input image [256] (6 MB) 32 ranks: $64 \times$ input image | $1536 \times 1024$ input image [256]/DPU (6 MB) | 1024 bytes | | RED | 1 DPU-1 rank: 6.3M elem. (50 MB) 32 ranks: 400M elem. (3.1 GB) | 6.3M elem./DPU (50 MB) | 1024 bytes | | SCAN-SSA | 1 DPU-1 rank: 3.8M elem. (30 MB) 32 ranks: 240M elem. (1.9 GB) | 3.8M elem./DPU (30 MB) | 1024 bytes | | SCAN-RSS | 1 DPU-1 rank: 3.8M elem. (30 MB) 32 ranks: 240M elem. (1.9 GB) | 3.8M elem./DPU (30 MB) | 1024 bytes | | TRNS | 1 DPU-1 rank: $12288 \times 16 \times 64 \times 8$ (768 MB) 32 ranks: $12288 \times 16 \times 2048 \times 8$ (24 GB) | $12288 \times 16 \times 1 \times 8$ /DPU (12 MB) | 128, 1024 bytes | The PrIM benchmarks repository includes all datasets and scripts used in our evaluation ## Strong Scaling: 1 DPU (I) - Strong scaling experiments on 1 DPU - We set the number of tasklets to 1, 2, 4, 8, and 16 - We show the breakdown of execution time: - DPU: Execution time on the DPU - Inter-DPU: Time for inter-DPU communication via the host CPU - CPU-DPU: Time for CPU to DPU transfer of input data - DPU-CPU: Time for DPU to CPU transfer of final results - Speedup over 1 tasklet # Strong Scaling: 1 DPU (II) VA, GEMV, SpMV, SEL, UNI, TS, MLP, NW, HST-S, RED, SCAN-SSA (Scan kernel), SCAN-RSS (both kernels), and TRNS (Step 2 kernel), the best performing number of tasklets is 16 Speedups 1.5-2.0x as we double the number of tasklets from 1 to 8. 
Speedups are 1.2-1.5x from 8 to 16 tasklets, since the pipeline throughput saturates at 11 tasklets

#### **KEY OBSERVATION 10**

A number of tasklets greater than 11 is a good choice for most real-world workloads we tested (16 kernels out of 19 kernels from 16 benchmarks), as it fully utilizes the DPU's pipeline.

## Strong Scaling: 1 DPU (III)

VA, GEMV, SpMV, BS, TS, MLP, HST-S do not use intra-DPU synchronization primitives

In SEL, UNI, NW, RED, SCAN-SSA (Scan kernel), SCAN-RSS (both kernels), synchronization is lightweight

BFS, HST-L, TRNS (Step 3) use mutexes, which cause contention when accessing shared data structures

## Strong Scaling: 1 DPU (IV)

#### **KEY OBSERVATION 11**

Intensive use of intra-DPU synchronization across tasklets (e.g., mutexes, barriers, handshakes) may limit scalability, sometimes causing the best performing number of tasklets to be lower than 11.

## Strong Scaling: 1 DPU (V)

SCAN-SSA (Add kernel) is not compute-intensive. Thus, performance saturates with fewer than 11 tasklets (recall STREAM ADD). BS shows similar behavior

#### **KEY OBSERVATION 12**

Most real-world workloads are in the compute-bound region of the DPU (all kernels except SCAN-SSA (Add kernel) and BS), i.e., the pipeline latency dominates the MRAM access latency.

## Strong Scaling: 1 DPU (VI)

The amount of time spent on CPU-DPU and DPU-CPU transfers is low compared to the time spent on DPU execution

TRNS performs step 1 of the matrix transposition via the CPU-DPU transfer.
Using small transfers (8 elements) does not exploit full CPU-DPU bandwidth

#### **KEY OBSERVATION 13**

Transferring large data chunks from/to the host CPU is preferred for input data and output results due to higher sustained CPU-DPU/DPU-CPU bandwidths.

## Strong Scaling: 1 Rank (I)

- Strong scaling experiments on 1 rank
  - We set the number of tasklets to the best performing one
  - The number of DPUs is 1, 4, 16, 64
  - We show the breakdown of execution time:
    - DPU: Execution time on the DPU
    - Inter-DPU: Time for inter-DPU communication via the host CPU
    - CPU-DPU: Time for CPU to DPU transfer of input data
    - DPU-CPU: Time for DPU to CPU transfer of final results
  - Speedup over 1 DPU

## Strong Scaling: 1 Rank (II)

VA, GEMV, SpMV, SEL, UNI, BS, TS, MLP, HST-S, HST-L, RED, SCAN-SSA (both kernels), SCAN-RSS (both kernels), and TRNS (both kernels) scale linearly with the number of DPUs

Scaling is sublinear for BFS and NW

- BFS suffers load imbalance due to irregular graph topology
- NW computes a diagonal of a 2D matrix in each iteration. More DPUs do not mean more parallelization in shorter diagonals.

## Strong Scaling: 1 Rank (III)

VA, GEMV, SpMV, BS, TS, TRNS do not need inter-DPU synchronization

SEL, UNI, HST-S, HST-L, RED, SCAN-SSA, SCAN-RSS need inter-DPU synchronization but 64 DPUs still obtain the best performance

BFS, MLP, NW require heavy inter-DPU synchronization, involving DPU-CPU and CPU-DPU transfers

## Strong Scaling: 1 Rank (IV)

VA, GEMV, TS, MLP, HST-S, HST-L, RED, SCAN-SSA, SCAN-RSS, TRNS use parallel transfers. CPU-DPU and DPU-CPU transfer times decrease as we increase the number of DPUs

BS, NW use parallel transfers but do not reduce transfer times:

- BS transfers a complete array to all DPUs.
- NW does not use all DPUs in all iterations

SpMV, SEL, UNI, BFS cannot use parallel transfers, as the transfer size per DPU is not fixed

#### **PROGRAMMING RECOMMENDATION 5**

Parallel CPU-DPU/DPU-CPU transfers inside a rank of DPUs are recommended for real-world workloads when all transferred buffers are of the same size.

## Strong Scaling: 32 Ranks (I)

- Strong scaling experiments on 32 ranks
  - We set the number of tasklets to the best performing one
  - The number of DPUs is 256, 512, 1024, 2048
  - We show the breakdown of execution time:
    - DPU: Execution time on the DPU
    - Inter-DPU: Time for inter-DPU communication via the host CPU
    - We do not show CPU-DPU/DPU-CPU transfer times
  - Speedup over 256 DPUs

## Strong Scaling: 32 Ranks (II)

VA, GEMV, SEL, UNI, BS, TS, MLP, HST-S, HST-L, RED, SCAN-SSA (both kernels), SCAN-RSS (both kernels), and TRNS (both kernels) scale linearly with the number of DPUs

SpMV, BFS, NW do not scale linearly due to load imbalance

#### **KEY OBSERVATION 14**

Load balancing across DPUs ensures linear reduction of the execution time spent on the DPUs for a given problem size, when all available DPUs are used (as observed in strong scaling experiments).

## Strong Scaling: 32 Ranks (III)

SEL, UNI, HST-S, HST-L, RED only need to merge final results

#### **KEY OBSERVATION 15**

The overhead of merging partial results from DPUs in the host CPU is tolerable across all PrIM benchmarks that need it.

BFS, MLP, NW, SCAN-SSA, SCAN-RSS have more complex communication

#### **KEY OBSERVATION 16**

Complex synchronization across DPUs (i.e., inter-DPU synchronization involving two-way communication with the host CPU) imposes significant overhead, which limits scalability to more DPUs.
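
The link between load balance and strong scaling (Key Observation 14) can be illustrated with a toy model. This is a minimal sketch, not the methodology behind the measurements: it assumes that items are split into contiguous, equally sized chunks, that the DPU phase finishes when the most heavily loaded DPU finishes, and that inter-DPU and transfer times are ignored. The function names and the synthetic `uniform` and `skewed` workloads are hypothetical.

```python
def dpu_time(work_per_item, num_dpus):
    """Time of the slowest DPU under a static, contiguous partition (toy model)."""
    n = len(work_per_item)
    chunk = -(-n // num_dpus)  # ceiling division: items per DPU
    loads = [sum(work_per_item[i:i + chunk]) for i in range(0, n, chunk)]
    return max(loads)


def strong_scaling_speedup(work_per_item, base_dpus, num_dpus):
    """Speedup of num_dpus over base_dpus for a fixed problem size."""
    return dpu_time(work_per_item, base_dpus) / dpu_time(work_per_item, num_dpus)


# Assumed workloads: same work per item vs. a few very expensive items
# (a stand-in for irregular inputs such as skewed rows or vertices).
uniform = [1] * 4096
skewed = [64 if i < 64 else 1 for i in range(4096)]

for n in (1, 4, 16, 64):
    print(n,
          round(strong_scaling_speedup(uniform, 1, n), 2),
          round(strong_scaling_speedup(skewed, 1, n), 2))
```

In this model the balanced workload speeds up in proportion to the number of DPUs, while the skewed workload stays below 2x even on 64 DPUs, because one DPU always holds the expensive items.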
## Weak Scaling: 1 Rank

#### **KEY OBSERVATION 17**

Equally-sized problems assigned to different DPUs and little/no inter-DPU synchronization lead to linear weak scaling of the execution time spent on the DPUs (i.e., constant execution time when we increase the number of DPUs and the dataset size accordingly).

#### **KEY OBSERVATION 18**

Sustained bandwidth of parallel CPU-DPU/DPU-CPU transfers inside a rank of DPUs increases sublinearly with the number of DPUs.

## CPU/GPU: Evaluation Methodology

- Comparison of both UPMEM-based PIM systems to state-of-the-art CPU and GPU
  - Intel Xeon E3-1240 CPU
  - NVIDIA Titan V GPU
- We use state-of-the-art CPU and GPU counterparts of PrIM benchmarks
  - https://github.com/CMU-SAFARI/prim-benchmarks
- We use the largest dataset that we can fit in the GPU memory
- We show overall execution time, including DPU kernel time and inter-DPU communication

## CPU/GPU: Performance Comparison (I)

The 2,556-DPU and the 640-DPU systems outperform the CPU for all benchmarks except SpMV, BFS, and NW

The 2,556-DPU and the 640-DPU systems are, respectively, 93.0x and 27.9x faster than the CPU for 13 of the PrIM benchmarks

## CPU/GPU: Performance Comparison (II)

The 2,556-DPU system outperforms the GPU for 10 PrIM benchmarks, by 2.54x on average

The performance of the 640-DPU system is within 65% of the performance of the GPU for the same 10 PrIM benchmarks

## CPU/GPU: Performance Comparison (III)

#### **KEY OBSERVATION 19**

The UPMEM-based PIM system can outperform a state-of-the-art GPU on workloads with three key characteristics:

- Streaming memory accesses
- No or little inter-DPU synchronization
- No or little use of integer multiplication, integer division, or floating point operations

These three key characteristics make a workload potentially suitable to the UPMEM PIM architecture.
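
The three characteristics in Key Observation 19 can be turned into a rough screening check. The sketch below is only an illustration, not a predictor of measured performance. The `Workload` fields, the `SIMPLE_OPS` set (including the choice to treat `compare` as a simple operation), and the yes/no treatment of "no or little" are all assumptions made for this example. The two sample entries copy the VA and SpMV rows of the PrIM benchmark table.

```python
from dataclasses import dataclass

# Assumption: operations treated as "simple" for this check.
SIMPLE_OPS = {"add", "sub", "compare", "bitwise logic"}


@dataclass
class Workload:
    name: str
    random_access: bool     # any random (non-streaming) memory accesses?
    inter_dpu_sync: bool    # needs inter-DPU communication via the host CPU?
    operations: frozenset   # arithmetic operations used by the kernel
    datatype: str


def unmet_characteristics(w):
    """Return which of the three characteristics the workload does not meet."""
    issues = []
    if w.random_access:
        issues.append("non-streaming memory accesses")
    if w.inter_dpu_sync:
        issues.append("inter-DPU synchronization")
    if w.datatype in ("float", "double") or not w.operations <= SIMPLE_OPS:
        issues.append("multiplication, division, or floating point")
    return issues


va = Workload("VA", False, False, frozenset({"add"}), "int32_t")
spmv = Workload("SpMV", True, False, frozenset({"add", "mul"}), "float")

for w in (va, spmv):
    print(w.name, unmet_characteristics(w))
```

An empty list means the workload shows all three characteristics under these coarse assumptions. A non-empty list only flags where to look more closely, since the observation allows "little" use of each feature.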
## CPU/GPU: Energy Comparison (I)

The 640-DPU system consumes on average 1.64x less energy than the CPU for all 16 PrIM benchmarks

For 12 benchmarks, the 640-DPU system provides energy savings of 5.23x over the CPU

## CPU/GPU: Energy Comparison (II)

#### **KEY OBSERVATION 20**

The UPMEM-based PIM system provides large energy savings over a state-of-the-art CPU due to higher performance (thus, lower static energy) and less data movement between memory and processors.

The UPMEM-based PIM system provides energy savings over a state-of-the-art CPU/GPU on workloads where it outperforms the CPU/GPU. This is because the source of both performance improvement and energy savings is the same: the significant reduction in data movement between the memory and the processor cores, which the UPMEM-based PIM system can provide for PIM-suitable workloads.

The throughput saturation point is as low as ¼ OP/B, i.e., 1 integer addition per every 32-bit element fetched

[Figure: x-axis is Operational Intensity (OP/B)]

#### **KEY TAKEAWAY 1**

The UPMEM PIM architecture is fundamentally compute bound. As a result, the most suitable workloads are memory-bound.

#### **KEY TAKEAWAY 2**

The most well-suited workloads for the UPMEM PIM architecture use no arithmetic operations or use only simple operations (e.g., bitwise operations and integer addition/subtraction).

#### **KEY TAKEAWAY 3**

The most well-suited workloads for the UPMEM PIM architecture require little or no communication across DPUs (inter-DPU communication).
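
The saturation point behind Key Takeaway 1 (¼ OP/B, i.e., one integer addition per 4-byte element) can be read through a standard roofline bound. The sketch below uses the textbook formulation, attainable throughput = min(compute roof, operational intensity × memory bandwidth), which may differ from the exact model behind the slide's plot. `PEAK_OPS` and `MEM_BANDWIDTH` are hypothetical round numbers chosen only so that the two roofs meet at ¼ OP/B; they are not measured DPU values.

```python
def attainable_throughput(op_intensity, peak_ops, mem_bandwidth):
    """Roofline bound in operations/s: min(compute roof, memory roof)."""
    return min(peak_ops, op_intensity * mem_bandwidth)


def saturation_point(peak_ops, mem_bandwidth):
    """Operational intensity (OP/B) at which the two roofs meet."""
    return peak_ops / mem_bandwidth


# Hypothetical values (assumptions for illustration, not measurements).
PEAK_OPS = 50e6         # operations per second
MEM_BANDWIDTH = 200e6   # bytes per second

# One integer addition per 32-bit (4-byte) element fetched.
oi_int32_add = 1 / 4

print(saturation_point(PEAK_OPS, MEM_BANDWIDTH))
for oi in (1 / 8, oi_int32_add, 1.0):
    print(oi, attainable_throughput(oi, PEAK_OPS, MEM_BANDWIDTH))
```

With these assumed numbers, any kernel doing at least one operation per 4 bytes already sits on the compute roof, so adding more arithmetic per byte cannot raise throughput. That is the sense in which the architecture is compute bound.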
#### **KEY TAKEAWAY 4**

- UPMEM-based PIM systems **outperform state-of-the-art CPUs in terms of performance** (by 23.2× on 2,556 DPUs for 16 PrIM benchmarks) **and energy efficiency on most of the PrIM benchmarks**.
- UPMEM-based PIM systems **outperform state-of-the-art GPUs on a majority of PrIM benchmarks** (by 2.54× on 2,556 DPUs for 10 PrIM benchmarks), and the outlook is even more positive for future PIM systems.
- UPMEM-based PIM systems are more energy-efficient than state-of-the-art CPUs and GPUs on workloads for which they provide performance improvements over the CPUs and the GPUs.

#### **Understanding a Modern PIM Architecture**

## Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System

JUAN GÓMEZ-LUNA<sup>1</sup>, IZZAT EL HAJJ<sup>2</sup>, IVAN FERNANDEZ<sup>1,3</sup>, CHRISTINA GIANNOULA<sup>1,4</sup>, GERALDO F. OLIVEIRA<sup>1</sup>, AND ONUR MUTLU<sup>1</sup>

<sup>1</sup>ETH Zürich, <sup>2</sup>American University of Beirut, <sup>3</sup>University of Malaga, <sup>4</sup>National Technical University of Athens

Corresponding author: Juan Gómez-Luna (e-mail: juang@ethz.ch).

https://arxiv.org/pdf/2105.03814.pdf

#### **Short arXiv Version**

#### Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

ETH Zürich, American University of Beirut, University of Malaga, National Technical University of Athens

https://arxiv.org/pdf/2110.01709.pdf

#### Long arXiv Version

## Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

Juan Gómez-Luna<sup>1</sup>, Izzat El Hajj<sup>2</sup>, Ivan Fernandez<sup>1,3</sup>, Christina Giannoula<sup>1,4</sup>, Geraldo F. Oliveira<sup>1</sup>, Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich, <sup>2</sup>American University of Beirut, <sup>3</sup>University of Malaga, <sup>4</sup>National Technical University of Athens

https://arxiv.org/pdf/2105.03814.pdf

#### **PrIM Repository**

- All microbenchmarks, benchmarks, and scripts
- https://github.com/CMU-SAFARI/prim-benchmarks

# HPCA 2023 Tutorial Real-world Processing-in-Memory Architectures

# Processing-Near-Memory Real PNM Architectures Programming General-purpose PIM

Dr. Juan Gómez Luna

Professor Onur Mutlu