### 1<sup>st</sup> Workshop on Memory-Centric Computing: Research Challenges & Closing Remarks

Geraldo F. Oliveira

https://geraldofojunior.github.io

ASPLOS 2025

30 March 2025

### Eliminating the Adoption Barriers

# How to Enable Adoption of Processing in Memory

### Potential Barriers to Adoption of PIM

1. **Applications** & **software** for PIM
2. Ease of **programming** (interfaces and compiler/HW support)
3. **System** and **security** support: coherence, synchronization, virtual memory, isolation, communication interfaces, ...
4. **Runtime** and **compilation** systems for adaptive scheduling, data mapping, access/sharing control, ...
5. **Infrastructures** to assess benefits and feasibility

All can be solved with a change of mindset

#### We Need to Revisit the Entire Stack

We can get there step by step

### Adoption: How to Keep It Simple?

Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi, "PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture" Proceedings of the <u>42nd International Symposium on Computer Architecture</u> (**ISCA**), Portland, OR, June 2015. [Slides (pdf)] [Lightning Session Slides (pdf)]

#### PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture

Junwhan Ahn, Sungjoo Yoo, Onur Mutlu<sup>†</sup>, Kiyoung Choi

junwhan@snu.ac.kr, sungjoo.yoo@gmail.com, onur@cmu.edu, kchoi@snu.ac.kr

Seoul National University, <sup>†</sup>Carnegie Mellon University

### Adoption: How to Ease Programmability? (I)

Geraldo F. Oliveira, Alain Kohli, David Novo, Juan Gómez-Luna, Onur Mutlu, "DaPPA: A Data-Parallel Framework for Processing-in-Memory Architectures," in PACT SRC Student Competition, Vienna, Austria, October 2023.

#### DaPPA: A Data-Parallel Framework for Processing-in-Memory Architectures

Geraldo F. Oliveira\*, Alain Kohli\*, David Novo<sup>‡</sup>, Juan Gómez-Luna\*, Onur Mutlu\*

\*ETH Zürich, <sup>‡</sup>LIRMM, Univ. Montpellier, CNRS

### Adoption: How to Ease Programmability? (II)

Jinfan Chen, Juan Gómez-Luna, Izzat El Hajj, Yuxin Guo, and Onur Mutlu, "SimplePIM: A Software Framework for Productive and Efficient Processing in Memory" Proceedings of the <u>32nd International Conference on Parallel Architectures and Compilation Techniques</u> (**PACT**), Vienna, Austria, October 2023.

### SimplePIM: A Software Framework for Productive and Efficient Processing-in-Memory

Jinfan Chen<sup>1</sup>, Juan Gómez-Luna<sup>1</sup>, Izzat El Hajj<sup>2</sup>, Yuxin Guo<sup>1</sup>, Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich, <sup>2</sup>American University of Beirut

### The Programmability Barrier: Overview

#### Programming the UPMEM-based system requires:

1. Splitting input data and computation across PIM chips
2. Transferring input data from main memory to PIM chips
3. Manually handling caching in PIM's scratchpad memory
4. Transferring output data from PIM chips to main memory

#### **Programmer's Tasks:**

- Align data
- Collect parameters
- Distribute parameters
- Launch computation
- Collect results
- Manage scratchpad
- Orchestrate computation

Collect parameters (host code):

```
unsigned int kernel = 0;
dpu_arguments_t input_arguments[NR_DPUS];
// All DPUs except the last one receive the same amount of data
for(i=0; i<nr_of_dpus-1; i++) {
    input_arguments[i].size = input_size_dpu_8bytes * sizeof(T);
    input_arguments[i].transfer_size = input_size_dpu_8bytes * sizeof(T);
    input_arguments[i].kernel = kernel;
}
// The last DPU receives the remaining elements
input_arguments[nr_of_dpus-1].size = (input_size_8bytes - input_size_dpu_8bytes * (NR_DPUS-1)) * sizeof(T);
input_arguments[nr_of_dpus-1].transfer_size = input_size_dpu_8bytes * sizeof(T);
input_arguments[nr_of_dpus-1].kernel = kernel;
```

Launch computation (host code): `DPU_ASSERT(dpu_launch(dpu_set, DPU_SYNCHRONOUS));`

Orchestrate computation (PIM-side loop, including the scratchpad transfers):

```
for (int byte_index = base_tasklet; byte_index < input_size_dpu_bytes;
     byte_index += BLOCK_SIZE * NR_TASKLETS){
    // Boundary check: the last block may be smaller than BLOCK_SIZE
    uint32_t l_size_bytes = (byte_index + BLOCK_SIZE >= input_size_dpu_bytes) ?
        (input_size_dpu_bytes - byte_index) : BLOCK_SIZE;
    // Load both input blocks from MRAM into the scratchpad buffers
    mram_read((__mram_ptr void const*)(mram_base_addr_A + byte_index), cache_A, l_size_bytes);
    mram_read((__mram_ptr void const*)(mram_base_addr_B + byte_index), cache_B, l_size_bytes);
    vector_addition(cache_B, cache_A, l_size_bytes >> DIV);
    // Write the result block back to MRAM
    mram_write(cache_B, (__mram_ptr void*)(mram_base_addr_B + byte_index), l_size_bytes);
}
```

#### **Goal:** Just write my kernel

```
static void vector_addition(T *bufferB, T *bufferA, int l_size){
    for (unsigned int i = 0; i < l_size; i++){
        bufferB[i] += bufferA[i];
    }
}
```

# The Programmability Barrier: Summary

#### **Our Goal**

To ease programmability for the UPMEM system, allowing a programmer to write efficient PIM-friendly code without the need to explicitly manage hardware resources

### **Outline**

1. Introduction
2. The Programmability Barrier
3. SimplePIM Overview
   - Management, Communication & Processing Interfaces
   - Evaluation Results
4. DaPPA Overview
   - DaPPA Main Components
   - Evaluation Results
5. Conclusion

#### SimplePIM: A Software Framework for Productive and Efficient Processing in Memory

Jinfan Chen, Juan Gómez-Luna, Izzat El Hajj, Yuxin Guo, and Onur Mutlu, "SimplePIM: A Software Framework for Productive and
Efficient Processing in Memory" Proceedings of the <u>32nd International Conference on Parallel Architectures and Compilation Techniques</u> (**PACT**), Vienna, Austria, October 2023. [Slides (pptx) (pdf)] [SimplePIM Source Code]

### SimplePIM Programming Framework: Overview

SimplePIM provides standard abstractions to build and deploy applications on PIM systems

- Management interface → Metadata for PIM-resident arrays
- Communication interface → Abstractions for host-PIM and PIM-PIM communication
- Processing interface → Iterators (map, reduce, zip) to implement workloads

### SimplePIM Programming Framework: Management Interface

- Metadata for PIM-resident arrays
  - `array_meta_data_t` describes a PIM-resident array
  - `simple_pim_management_t` for managing PIM-resident arrays
- **lookup**: Retrieves all relevant information of an array

```
array_meta_data_t* simple_pim_array_lookup(const char* id,
    simple_pim_management_t* management);
```

- **register**: Registers the metadata of an array

```
void simple_pim_array_register(array_meta_data_t* meta_data,
    simple_pim_management_t* management);
```

- **free**: Removes the metadata of an array

```
void simple_pim_array_free(const char* id,
    simple_pim_management_t* management);
```

# SimplePIM Programming Framework: Communication Interface (I)

#### Host-to-PIM Communication: Broadcast

- Transfers a host array to all PIM cores in the system

```
void simple_pim_array_broadcast(char* const id, void* arr, uint64_t len,
    uint32_t type_size, simple_pim_management_t* management);
```

# SimplePIM Programming Framework: Communication Interface (II)

#### Host-to-PIM Communication: Scatter/Gather

- Scatter: Distributes an array to PIM DRAM banks

```
void simple_pim_array_scatter(char* const id, void* arr, uint64_t len,
    uint32_t type_size, simple_pim_management_t* management);
```

- Gather: Collects portions of an array from PIM DRAM banks

```
void* simple_pim_array_gather(char* const id,
    simple_pim_management_t* management);
```

# SimplePIM Programming Framework: Communication Interface (III)

#### PIM-to-PIM Communication: AllReduce

- Used for algorithm synchronization
- The programmer specifies an accumulative function

```
void simple_pim_array_allreduce(char* const id, handle_t* handle,
    simple_pim_management_t* management);
```

### SimplePIM Programming Framework: Communication Interface (IV)

#### PIM-to-PIM Communication: AllGather

- Combines array pieces and distributes the complete array to all PIM cores

```
void simple_pim_array_allgather(char* const id, char* new_id,
    simple_pim_management_t* management);
```

# SimplePIM Programming Framework: Processing Interface (I)

#### Array Map

- Applies `map_func` to every element of the data array

```
void simple_pim_array_map(const char* src_id, const char* dest_id,
    uint32_t output_type, handle_t* handle,
    simple_pim_management_t* management);
```

# SimplePIM Programming Framework: Processing Interface (II)

#### Array Reduction

- The `map_to_val_func` function transforms an input element to an output value and an output index
- The `acc_func` function accumulates the output values onto the output array

```
void simple_pim_array_red(const char* src_id, const char* dest_id,
    uint32_t output_type, uint32_t output_len, handle_t* handle,
    simple_pim_management_t* management);
```

### SimplePIM Programming Framework: Processing Interface (III)

#### Array Zip

- Takes two input arrays and combines their elements into an output array

```
void simple_pim_array_zip(const char* src1_id, const char* src2_id,
    const char* dest_id, simple_pim_management_t* management);
```

### SimplePIM Programming Framework: General Code Optimizations

- Strength reduction
- Loop unrolling
- Avoiding boundary checks
- Function inlining
- Adjustment of data transfer sizes

### **Evaluation Results:** Evaluation Methodology

#### Evaluated system

- UPMEM PIM system with 2,432 PIM cores and 159 GB of PIM DRAM

#### Real-world Benchmarks

- Vector addition
- Reduction
- Histogram
- K-Means
- Linear regression
- Logistic regression

Comparison to hand-optimized codes in terms of programming productivity and performance

### **Evaluation Results:** Productivity Improvement (I)

#### Example: Hand-optimized histogram with UPMEM SDK

```
... // Initialize global variables and functions for histogram
int main_kernel() {
    if (tasklet_id == 0)
        mem_reset(); // Reset the heap
    ... // Initialize variables and the histogram
    T *input_buff_A = (T*)mem_alloc(2048); // Allocate buffer in scratchpad memory
    for (unsigned int byte_index = base_tasklet; byte_index < input_size; byte_index += stride) {
        // Boundary checking
        uint32_t l_size_bytes = (byte_index + 2048 >= input_size) ? (input_size - byte_index) : 2048;
        // Load scratchpad with a DRAM block
        // Histogram calculation
        histogram(hist, bins, input_buff_A, l_size_bytes/sizeof(uint32_t));
    }
    barrier_wait(&my_barrier); // Barrier to synchronize PIM threads
    ... // Merging histograms from different tasklets into one histo_dpu
    // Write result from scratchpad to DRAM
    if (tasklet_id == 0) {
        if (bins * sizeof(uint32_t) <= 2048)
            mram_write(histo_dpu, (__mram_ptr void*)mram_base_addr_histo, bins * sizeof(uint32_t));
        else
            for (unsigned int offset = 0; offset < ((bins * sizeof(uint32_t)) >> 11); offset++) {
                mram_write(histo_dpu + (offset << 9),
                    (__mram_ptr void*)(mram_base_addr_histo + (offset << 11)), 2048);
            }
    }
    return 0;
}
```

### **Evaluation Results:** Productivity Improvement (II)

#### Example: SimplePIM histogram

```
// Programmer-defined functions in the file "histo_filepath"
void init_func (uint32_t size, void* ptr) {
    char* casted_value_ptr = (char*) ptr;
    for (int i = 0; i < size; i++)
        casted_value_ptr[i] = 0;
}

void acc_func (void* dest, void* src) {
    *(uint32_t*)dest += *(uint32_t*)src;
}

void map_to_val_func (void* input, void* output, uint32_t* key) {
    uint32_t d = *((uint32_t*)input);
    *(uint32_t*)output = 1;
    *key = d * bins >> 12;
}

// Host side handle creation and iterator call
handle_t* handle = simple_pim_create_handle("histo_filepath", REDUCE, NULL, 0);

// Transfer (scatter) data to PIM, register as "t1"
simple_pim_array_scatter("t1", src, bins, sizeof(T), management);

// Run histogram on "t1" and produce "t2"
simple_pim_array_red("t1", "t2", sizeof(T), bins, handle, management);
```

### **Evaluation Results:** Productivity Improvement (III)

#### Lines of code (LoC) reduction

| Benchmark           | SimplePIM (LoC) | Hand-optimized (LoC) | LoC Reduction |
|---------------------|-----------------|----------------------|---------------|
| Reduction           | 14              | 83                   | 5.93×         |
| Vector Addition     | 14              | 82                   | 5.86×         |
| Histogram           | 21              | 114                  | 5.43×         |
| Linear Regression   | 48              | 157                  | 3.27×         |
| Logistic Regression | 59              | 176                  | 2.98×         |
| K-Means             | 68              | 206                  | 3.03×         |

SimplePIM reduces the number of lines of effective code by a factor of 2.98× to 5.93×

### **Evaluation Results:** Weak Scaling Analysis

SimplePIM achieves comparable performance for reduction, histogram, and
linear regression

SimplePIM outperforms hand-optimized implementations for vector addition, logistic regression, and k-means by 10%-37%

# **Evaluation Results:** Strong Scaling Analysis

SimplePIM scales better than hand-optimized implementations for reduction, histogram, and linear regression

SimplePIM outperforms hand-optimized implementations for vector addition, logistic regression, and k-means by 15%-43%

### **Source Code**

https://github.com/CMU-SAFARI/SimplePIM

#### DaPPA: A Data-Parallel Framework for Processing-in-Memory Architectures

Geraldo F. Oliveira, Alain Kohli, David Novo, Juan Gómez-Luna, Onur Mutlu, "DaPPA: A Data-Parallel Framework for Processing-in-Memory Architectures," arXiv:2310.10168 [cs.AR]

2<sup>nd</sup> Place, ACM Student Research Competition at the <u>32nd International Conference on Parallel Architectures and Compilation Techniques</u> (**PACT**), Vienna, Austria, October 2023.

#### 1. Motivation & Problem

The increasing prevalence and growing size of data in modern applications have led to high costs for computation in traditional *processor-centric computing* systems. To mitigate these costs, the *processing-in-memory* (PIM) [1–6] paradigm moves computation closer to where the data resides, reducing the need to move data between memory and the processor. Even though the concept of PIM has been first proposed in the 1960s [7, 8], real-world PIM systems have only recently […] face [15, 16] that abstracts the hardware components of the UPMEM system. Using this key idea, DaPPA transforms a data-parallel pattern-based application code into the appropriate UPMEM-target code, including the required APIs for data management and code partition, which can then be compiled into a UPMEM-based binary *transparently* from the programmer. While generating UPMEM-target code, DaPPA implements several code optimizations to improve end-to-end performance.

## DaPPA: Key Idea & Overview

**Key Idea:** Leverage an intuitive data-parallel pattern-based interface for PIM programming

DaPPA, a <u>Data-Parallel PIM Architecture</u> that automatically <u>distributes</u> input and <u>gathers</u> output data, <u>handles</u> memory management, and <u>parallelizes</u> work across PIM cores

#### DaPPA is composed of three main components:

1. Data-Parallel Pattern APIs
2. Dataflow Programming Interface
3. Dynamic Template-Based Compilation

## DaPPA: Data-Parallel Pattern APIs

Pre-defined functions that implement high-level data-parallel pattern primitives

- Skeleton and pattern-based parallel programming is a common abstraction for parallel architectures
  - M.
Cole, "Bringing Skeletons Out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming," Parallel Computing, 2004
- DaPPA supports five primary data-parallel patterns

The user can <u>combine</u> all five data-parallel primitives to describe <u>complex data transformations</u>

### DaPPA: Dataflow Programming Interface

DaPPA exposes to the user a dataflow-based programming interface

#### DaPPA: Dynamic Template-Based Compilation

DaPPA uses dynamic template-based compilation to generate PIM code in two main steps

- **Templating**: DaPPA creates a base UPMEM code based on a **basic skeleton** of a UPMEM application
  - We use the Inja C++ templating engine
- **Optimizations**: DaPPA uses a series of transformations to
  - extract data required by the UPMEM code template
  - calculate the *memory offsets* for MRAMs and WRAMs
  - divide computation between CPU and PIM cores

DaPPA <u>compiles</u> and <u>executes</u> the stages of a Pipeline one at a time → allows for <u>runtime optimizations</u>

## Example of DaPPA's implementation of a vector dot product operation

Target computation (`reduce` pattern): $C = A_0B_0 + A_1B_1 + A_2B_2 + A_3B_3$

# **Evaluation:** Methodology Overview

#### Evaluation Setup

- Host CPU: 2-socket Intel® Xeon Silver 4110 CPU
- PIM Cores: 20 UPMEM PIM DIMMs (160 GB PIM memory)
  - **2560 DPUs** in total
- Workloads: 6 workloads from the PrIM benchmark suite
  - Vector addition (VA); Select (SEL); Unique (UNI); Reduce (RED); General matrix-vector multiply (GEMV); Histogram small (HST-S)

#### Metrics

- End-to-end execution time (average of 10 runs)
- Programming complexity (in lines of code)

# **Evaluation:** Performance Analysis

DaPPA <u>significantly improves</u> end-to-end performance compared to hand-tuned implementations

# **Evaluation:** Programming Complexity Analysis

DaPPA <u>significantly reduces</u> programming complexity by abstracting hardware components

# **Evaluation:** Comparison to State-of-the-Art

#### SimplePIM [Chen+, PACT'23]:

a framework that uses (1) iterator functions and (2) primitives for communication to aid PIM programmability

#### Compared to SimplePIM, DaPPA provides three key benefits

1. **Higher abstraction level** → The programmer does not need to manually specify communication patterns used during computation
2. **Support for more parallel patterns** → DaPPA supports two more parallel primitives (window and group), and allows the mixing of parallel patterns
3. **Further execution optimizations** → DaPPA allows using idle host resources for collaborative execution

DaPPA <u>improves on</u> state-of-the-art frameworks for PIM programmability

## Adoption: How to Maintain Coherence? (I)

Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu, "LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory" IEEE Computer Architecture Letters (CAL), June 2016.

#### LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory

Amirali Boroumand<sup>†</sup>, Saugata Ghose<sup>†</sup>, Minesh Patel<sup>†</sup>, Hasan Hassan<sup>†</sup>, Brandon Lucia<sup>†</sup>, Kevin Hsieh<sup>†</sup>, Krishna T. Malladi<sup>\*</sup>, Hongzhong Zheng<sup>\*</sup>, and Onur Mutlu<sup>‡†</sup>

<sup>†</sup>Carnegie Mellon University, <sup>\*</sup>Samsung Semiconductor, Inc., <sup>§</sup>TOBB ETÜ, <sup>‡</sup>ETH Zürich

### Challenge: Coherence for Hybrid CPU-PIM Apps

## Adoption: How to Maintain Coherence? (II)

Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu, "CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators" Proceedings of the <u>46th International Symposium on Computer Architecture</u> (**ISCA**), Phoenix, AZ, USA, June 2019.
#### CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators

Amirali Boroumand<sup>†</sup>, Saugata Ghose<sup>†</sup>, Minesh Patel<sup>⋆</sup>, Hasan Hassan<sup>⋆</sup>, Brandon Lucia<sup>†</sup>, Rachata Ausavarungnirun<sup>†‡</sup>, Kevin Hsieh<sup>†</sup>, Nastaran Hajinazar<sup>⋄†</sup>, Krishna T. Malladi<sup>§</sup>, Hongzhong Zheng<sup>§</sup>, Onur Mutlu<sup>⋆†</sup>

<sup>†</sup>Carnegie Mellon University, <sup>⋆</sup>ETH Zürich, <sup>⋄</sup>Simon Fraser University, <sup>‡</sup>KMUTNB, <sup>§</sup>Samsung Semiconductor, Inc.

## Adoption: How to Support Synchronization?

Christina Giannoula, Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Ivan Fernandez, Juan Gómez-Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, Onur Mutlu, "SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures" Proceedings of the <u>27th International Symposium on High-Performance Computer Architecture</u> (**HPCA**), Virtual, February-March 2021. [Slides (pptx) (pdf)] [Short Talk Slides (pptx) (pdf)] [Talk Video (21 minutes)] [Short Talk Video (7 minutes)]

## SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

Christina Giannoula<sup>†‡</sup>, Nandita Vijaykumar<sup>\*‡</sup>, Nikela Papadopoulou<sup>†</sup>, Vasileios Karakostas<sup>†</sup>, Ivan Fernandez<sup>§‡</sup>, Juan Gómez-Luna<sup>‡</sup>, Lois Orosa<sup>‡</sup>, Nectarios Koziris<sup>†</sup>, Georgios Goumas<sup>†</sup>, Onur Mutlu<sup>‡</sup>

<sup>†</sup>National Technical University of Athens, <sup>‡</sup>ETH Zürich, <sup>\*</sup>University of Toronto, <sup>§</sup>University of Malaga

## Adoption: How to Support Virtual Memory?

Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu, "Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation" Proceedings of the <u>34th IEEE International Conference on Computer Design</u> (**ICCD**), Phoenix, AZ, USA, October 2016.
# Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation

Kevin Hsieh<sup>†</sup>, Samira Khan<sup>‡</sup>, Nandita Vijaykumar<sup>†</sup>, Kevin K. Chang<sup>†</sup>, Amirali Boroumand<sup>†</sup>, Saugata Ghose<sup>†</sup>, Onur Mutlu<sup>§†</sup>

<sup>†</sup>Carnegie Mellon University, <sup>‡</sup>University of Virginia, <sup>§</sup>ETH Zürich

## Adoption: Code and Data Mapping

Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W. Keckler, "Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems" Proceedings of the <u>43rd International Symposium on Computer Architecture</u> (**ISCA**), Seoul, South Korea, June 2016. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

#### Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems

Kevin Hsieh<sup>‡</sup>, Eiman Ebrahimi<sup>†</sup>, Gwangsun Kim\*, Niladrish Chatterjee<sup>†</sup>, Mike O'Connor<sup>†</sup>, Nandita Vijaykumar<sup>‡</sup>, Onur Mutlu<sup>§‡</sup>, Stephen W. Keckler<sup>†</sup>

<sup>‡</sup>Carnegie Mellon University, <sup>†</sup>NVIDIA, \*KAIST, <sup>§</sup>ETH Zürich

### DAMOV Analysis Methodology & Workloads

### DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

- GERALDO F. OLIVEIRA, ETH Zürich, Switzerland
- JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland
- LOIS OROSA, ETH Zürich, Switzerland
- SAUGATA GHOSE, University of Illinois at Urbana-Champaign, USA
- NANDITA VIJAYKUMAR, University of Toronto, Canada
- IVAN FERNANDEZ, University of Malaga, Spain & ETH Zürich, Switzerland
- MOHAMMAD SADROSADATI, Institute for Research in Fundamental Sciences (IPM), Iran & ETH Zürich, Switzerland
- ONUR MUTLU, ETH Zürich, Switzerland

Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Prior NDP works investigate the root causes of data movement bottlenecks using different profiling methodologies and tools. However, there is still a lack of understanding about the key metrics that can identify different data movement bottlenecks and their relation to traditional and emerging data movement mitigation mechanisms. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques (e.g., caching and prefetching) to more memory-centric techniques (e.g., NDP), thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.

https://arxiv.org/pdf/2105.03725.pdf

### **Identifying Memory Bottlenecks**

- Multiple approaches to identify applications that:
  - suffer from data movement bottlenecks
  - take advantage of NDP
- Existing approaches are not comprehensive enough

Roofline model → identifies when an application is bound by compute or memory units

Roofline model does not accurately account for the **NDP suitability** of memory-bound applications

Application with a last-level cache MPKI > 10 → memory intensive and benefits from NDP

LLC MPKI does not accurately account for the **NDP suitability** of memory-bound applications

### The Problem

- Multiple approaches to identify applications that:
  - suffer from data movement bottlenecks
  - take advantage of NDP

No available methodology can comprehensively:

- identify data movement bottlenecks
- correlate them with the most suitable data movement mitigation mechanism

### **Our Goal**

- Our Goal: develop a methodology to:
  - methodically identify sources of data movement bottlenecks
  - comprehensively compare compute- and memory-centric data movement mitigation techniques

# **Methodology Overview**

# **Step 1: Application Profiling**

Goal: Identify application functions that suffer from data movement bottlenecks

Hardware Profiling Tool: Intel VTune

**MemoryBound:** CPU is stalled due to load/store

# Step 2: Locality-Based Clustering

Goal: analyze application's memory characteristics

#### Spatial Locality<sup>7</sup>

[Figure: stride profile histogram over bins 1, 2, 4, 8, 16, 32, ..., 2<sup>N</sup>, contrasting high and low spatial locality]

#### Temporal Locality<sup>7</sup>

[Figure: reuse profile built from a memory trace (e.g., reuse profile(4) += 1), contrasting low and high temporal locality]

## Step 3: Memory Bottleneck Classification (1/2)

#### **Arithmetic Intensity (AI)**

- floating-point/arithmetic operations per L1 cache line accessed
- → shows computational intensity per memory request

#### **LLC Misses-per-Kilo-Instructions (MPKI)**

- LLC misses per one thousand instructions
- → shows memory intensity

#### **Last-to-First Miss Ratio (LFMR)**

- LLC misses per L1 miss
- → shows if an application benefits from L2/L3 caches

## Step 3: Memory Bottleneck Classification (2/2)

Goal: identify the specific sources
of data movement bottlenecks - Scalability Analysis: - 1, 4, 16, 64, and 256 out-of-order/in-order host and NDP CPU cores - 3D-stacked memory as main memory # **Step 3: Memory Bottleneck Analysis** ## DAMOV is Open Source We open-source our benchmark suite and our toolchain # DAMOV is Open Source We open-source our benchmark suite and our toolchain #### **Get DAMOV at:** #### https://github.com/CMU-SAFARI/DAMOV #### More on DAMOV Analysis Methodology & Workloads #### More on DAMOV Methods & Benchmarks Geraldo F. Oliveira, Juan Gomez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan fernandez, Mohammad Sadrosadati, and Onur Mutlu, "DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks" *IEEE Access*, 8 September 2021. Preprint in arXiv, 8 May 2021. [arXiv preprint] [IEEE Access version] [DAMOV Suite and Simulator Source Code] [SAFARI Live Seminar Video (2 hrs 40 mins)] [Short Talk Video (21 minutes)] # DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks GERALDO F. 
OLIVEIRA, ETH Zürich, Switzerland
JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland
LOIS OROSA, ETH Zürich, Switzerland
SAUGATA GHOSE, University of Illinois at Urbana-Champaign, USA
NANDITA VIJAYKUMAR, University of Toronto, Canada
IVAN FERNANDEZ, University of Malaga, Spain & ETH Zürich, Switzerland
MOHAMMAD SADROSADATI, ETH Zürich, Switzerland

# Challenge and Opportunity for Future

Fundamentally **Energy-Efficient** (Data-Centric) Computing Architectures

# Challenge and Opportunity for Future

Fundamentally **High-Performance** (Data-Centric) Computing Architectures

# Challenge and Opportunity for Future

Computing Architectures with **Minimal Data Movement**

# Concluding Remarks

- We must design systems to be balanced, high-performance, and energy-efficient (all at the same time) → intelligent systems
- Data-centric, data-driven, data-aware
- Enable computation capability inside and close to memory
- This can
  - Lead to orders-of-magnitude improvements
  - Enable new applications & computing platforms
  - Enable better understanding of nature
  - ...
- Future of truly memory-centric computing is bright
- We need to do research & design across the computing stack

# Fundamentally Better Architectures

**Data-centric**

**Data-driven**

**Data-aware**

## We Need to Revisit the Entire Stack

We can get there step by step

# We Need to Exploit Good Principles

- Data-centric system design
- All components intelligent
- Better (cross-layer) communication, better interfaces
- Better-than-worst-case design
- Heterogeneity
- Flexibility, adaptability

# **Open minds**

# PIM Review and Open Problems

# A Modern Primer on Processing in Memory

Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b,c</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>d</sup>
SAFARI Research Group
<sup>a</sup>ETH Zürich <sup>b</sup>Carnegie Mellon University <sup>c</sup>University of Illinois at Urbana-Champaign <sup>d</sup>King Mongkut's University of Technology North Bangkok

Onur Mutlu, Saugata Ghose, Juan
Gómez-Luna, and Rachata Ausavarungnirun,
"A Modern Primer on Processing in Memory"
Invited Book Chapter in <u>Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann</u>, Springer, to be published in 2021.

# Special Research Sessions & Courses (I)

Special Session at ISVLSI 2022: 9 cutting-edge talks

# Special Research Sessions & Courses (II)

Special Session at ISVLSI 2022: 9 cutting-edge talks

# Processing-in-Memory Course (Fall 2024)

- Short weekly lectures
- Hands-on projects

https://safari.ethz.ch/projects_and_seminars/fall2024/doku.php?id=processing_in_memory

https://www.youtube.com/playlist?list=PL5Q2soXY2Zi9DSQg7OAlsNO8dOQKx_F9sl

# Processing-in-Memory Course (Spring 2023)

- Short weekly lectures

https://www.youtube.com/playlist?list=PL5O2soXY2Zi_EObuoAZVSg_o6UvSWQHvZ

https://safari.ethz.ch/projects_and_seminars/spring2023/doku.php?id=processing_in_memory

## PIM Course (Fall 2022)

#### Fall 2022 Edition:

https://safari.ethz.ch/projects_and_seminars/fall2022/doku.php?id=processing_in_memory

#### Spring 2022 Edition:

https://safari.ethz.ch/projects_and_seminars/spring2022/doku.php?id=processing_in_memory

#### Youtube Livestream (Fall 2022):

https://www.youtube.com/watch?v=QLL0wQ9I4Dw&list=PL5Q2soXY2Zi8KzG2CQYRNQOVD0GOBrnKy

#### Youtube Livestream (Spring 2022):

https://www.youtube.com/watch?v=9e4Chnwdovo&list=PL5Q2soXY2Zi-841fUYYUK9EsXKhQKRPyX

#### Project course

- Taken by Bachelor's/Master's students
- Processing-in-Memory lectures
- Hands-on research exploration
- Many research readings

https://www.youtube.com/onurmutlulectures

#### Spring 2022 Meetings/Schedule

| Week | Date | Livestream | Meeting | Learning Materials | Assignment |
|------|------|------------|---------|--------------------|------------|
| W1 | 10.03 Thu. | YouTube Live | M1: P&S PIM Course Presentation (PDF) (PPT) | Required Materials, Recommended Materials | HW 0 Out |
| W2 | 15.03 Tue. | | Hands-on Project Proposals | | |
| | 17.03 Thu. | YouTube Premiere | M2: Real-world PIM: UPMEM PIM | | |
| W3 | 24.03 Thu. | YouTube Live | M3: Real-world PIM: Microbenchmarking of UPMEM PIM (PDF) (PPT) | | |
| W4 | 31.03 Thu. | YouTube Live | M4: Real-world PIM: Samsung HBM-PIM (PDF) (PPT) | | |
| W5 | 07.04 Thu. | YouTube Live | M5: How to Evaluate Data Movement Bottlenecks (PDF) (PPT) | | |
| W6 | 14.04 Thu. | YouTube Live | M6: Real-world PIM: SK Hynix AiM | | |
| W7 | 21.04 Thu. | YouTube Premiere | M7: Programming PIM Architectures (PDF) (PPT) | | |
| W8 | 28.04 Thu. | YouTube Premiere | M8: Benchmarking and Workload Suitability on PIM | | |
| W9 | 05.05 Thu. | YouTube Premiere | M9: Real-world PIM: Samsung AXDIMM (PDF) (PPT) | | |
| W10 | 12.05 Thu. | YouTube Premiere | M10: Real-world PIM: Alibaba HB-PNM | | |
| W11 | 19.05 Thu. | YouTube Live | M11: SpMV on a Real PIM Architecture | | |
| W12 | 26.05 Thu. | YouTube Live | M12: End-to-End Framework for Processing-using-Memory | | |
| W13 | 02.06 Thu. | YouTube Live | M13: Bit-Serial SIMD Processing using DRAM | | |
| W14 | 09.06 Thu. | YouTube Live | M14: Analyzing and Mitigating ML Inference Bottlenecks | | |
| W15 | 15.06 Thu. | YouTube Live | M15: In-Memory HTAP Databases with HW/SW Co-design (PDF) (PPT) | | |
| W16 | 23.06 Thu. | YouTube Live | M16: In-Storage Processing for Genome Analysis (PDF) (PPT) | | |
| W17 | 18.07 Mon. | YouTube Premiere | M17: How to Enable the Adoption of PIM? | | |
| W18 | 09.08 Tue. | YouTube Premiere | SS1: ISVLSI 2022 Special Session on PIM (PDF & PPT) | | |

#### Real PIM Tutorials [ISCA'23, ASPLOS'23, HPCA'23]

June, March, Feb: Lectures + Hands-on labs + Invited talks

## Real PIM Tutorial [ISCA 2023]

#### June 18: Lectures + Hands-on labs + Invited talks

#### **Tutorial Materials**

| Time | Speaker | Title | Materials |
|------|---------|-------|-----------|
| 8:55am-9:00am | Dr. Juan Gómez Luna | Welcome & Agenda | (PDF) (PPT) |
| 9:00am-10:20am | Prof. Onur Mutlu | Memory-Centric Computing | |
| 10:20am-11:00am | Dr. Juan Gómez Luna | Processing-Near-Memory: Real PNM Architectures / Programming General-purpose PIM | (PDF) (PPT) |
| 11:20am-11:50am | Prof. Izzat El Hajj | High-throughput Sequence Alignment using Real Processing-in-Memory Systems | |
| 11:50am-12:30pm | Dr. Christina Giannoula | SparseP: Towards Efficient Sparse Matrix Vector Multiplication for Real Processing-In-Memory Systems | |
| 2:00pm-2:45pm | Dr. Sukhan Lee | Introducing Real-world HBM-PIM Powered System for Memory-bound Applications | (PDF) (PPT) |
| 2:45pm-3:30pm | Dr. Juan Gómez Luna / Ataberk Olgun | Processing-Using-Memory: Exploiting the Analog Operational Properties of Memory Components / PUM Prototypes: PiDRAM | (PDF) (PPT), (PDF) (PPT) |
| 4:00pm-4:40pm | Dr. Juan Gómez Luna | Accelerating Modern Workloads on a General-purpose PIM System | (PDF) (PPT) |
| 4:40pm-5:20pm | Dr. Juan Gómez Luna | Adoption Issues: How to Enable PIM? | (PDF) (PPT) |
| 5:20pm-5:30pm | Dr. Juan Gómez Luna | Hands-on Lab: Programming and Understanding a Real Processing-in-Memory Architecture | (Handout) (PDF) (PPT) |

https://www.youtube.com/live/GIb5EgSrWk0

https://events.safari.ethz.ch/isca-pim-tutorial/

## Real PIM Tutorial [ASPLOS 2023]

#### March 26: Lectures + Hands-on labs + Invited talks

ASPLOS 2023 Tutorial: Real-world Processing-in-Memory Systems for Modern Workloads

#### **Tutorial Materials**

| Time | Speaker | Title | Materials |
|------|---------|-------|-----------|
| 9:00am-10:20am | Prof. Onur Mutlu | Memory-Centric Computing | (PDF) (PPT) |
| 10:40am-12:00pm | Dr. Juan Gómez Luna | Processing-Near-Memory: Real PNM Architectures / Programming General-purpose PIM | (PDF) |
| 1:40pm-2:20pm | Prof. Alexandra (Sasha) Fedorova (UBC) | Processing in Memory in the Wild | (PDF) (PPT) |
| 2:20pm-3:20pm | Dr. Juan Gómez Luna & Ataberk Olgun | Processing-Using-Memory: Exploiting the Analog Operational Properties of Memory Components | (PDF) (PPT), (PDF) (PPT) |
| 3:40pm-4:10pm | Dr. Juan Gómez Luna | Adoption issues: How to enable PIM? / Accelerating Modern Workloads on a General-purpose PIM System | (PDF) (PPT) (PPT) |
| 4:10pm-4:50pm | Dr. Yongkee Kwon & Eddy (Chanwook) Park (SK Hynix) | System Architecture and Software Stack for GDDR6-AiM | (PPT) |
| 4:50pm-5:00pm | Dr. Juan Gómez Luna | Hands-on Lab: Programming and Understanding a Real Processing-in-Memory Architecture | |

https://www.youtube.com/watch?v=oYCaLcT0Kmo

https://events.safari.ethz.ch/asplos-pim-tutorial/

## Real PIM Tutorial [HPCA 2023]

#### February 26: Lectures + Hands-on labs + Invited Talks

https://www.youtube.com/watch?v=f5-nT1tbz5w

https://events.safari.ethz.ch/real-pim-tutorial/

## Real PIM Tutorial [MICRO 2023]

#### October 29: Lectures + Hands-on labs + Invited talks

https://www.youtube.com/live/ohUooNSIxOI

https://events.safari.ethz.ch/micro-pim-tutorial

#### Agenda (Tentative, October 29, 2023)

#### Lectures

1. Introduction: PIM as a paradigm to overcome the data movement bottleneck.
2. PIM taxonomy: PNM (processing near memory) and PUM (processing using memory).
3. General-purpose PNM: UPMEM PIM.
4. PNM for neural networks: Samsung HBM-PIM, SK Hynix AiM.
5. PNM for recommender systems: Samsung AxDIMM, Alibaba PNM.
6. PUM prototypes: PiDRAM, SRAM-based PUM, Flash-based PUM.
7. Other approaches: Neuroblade, Mythic.
8. Adoption issues: How to enable PIM?
9. Hands-on labs: Programming a real PIM system.

#### PIM Tutorial at HEART 2024

#### HEART 2024 Memory-Centric Computing Systems Tutorial

Friday, June 21, Porto, Portugal

Organizers: Geraldo F. Oliveira, Dr.
Mohammad Sadrosadati, Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/heart24-memorycentric-tutorial/

International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies

- Overview of PIM | PIM taxonomy
- PIM in memory & storage
- Real-world PNM systems
- PUM for bulk bitwise operations
- Programming techniques & tools
- Infrastructures for PIM Research
- Research challenges & opportunities

## PIM Tutorial at ISCA 2024

#### ISCA 2024 Memory-Centric Computing Systems Tutorial

Saturday, June 29, Buenos Aires, Argentina

Organizers: Geraldo F. Oliveira, Dr. Mohammad Sadrosadati, Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/isca24-memorycentric-tutorial/

- Overview of PIM | PIM taxonomy
- PIM in memory & storage
- Real-world PNM systems
- PUM for bulk bitwise operations
- Programming techniques & tools
- Infrastructures for PIM Research
- Research challenges & opportunities

#### Tutorial at MICRO 2024

#### MICRO 2024 - Tutorial on Memory-Centric Computing Systems

Saturday, November 2<sup>nd</sup>, Austin, Texas, USA

Organizers: Geraldo F. Oliveira, Dr. Mohammad Sadrosadati, Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/micro24-memorycentric-tutorial/

- Overview of PIM | PIM taxonomy
- PIM in memory & storage
- Real-world PNM systems
- PUM for bulk bitwise operations
- Programming techniques & tools
- Infrastructures for PIM Research
- Research challenges & opportunities

#### PIM Tutorial at PPoPP 2025

#### PPoPP 2025 - Tutorial on Memory-Centric Computing Systems

March 1<sup>st</sup> - March 5<sup>th</sup>, Las Vegas, Nevada, USA

Organizers: Geraldo F. Oliveira, Dr.
Mohammad Sadrosadati, Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/ppopp25-memorycentric-tutorial/

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2025), March 1–5, 2025, Las Vegas, NV, USA

- Overview of PIM | PIM taxonomy
- PIM in memory & storage
- Real-world PNM systems
- PUM for bulk bitwise operations
- Programming techniques & tools
- Infrastructures for PIM Research
- Research challenges & opportunities

# Referenced Papers, Talks, Artifacts

All are available at

https://people.inf.ethz.ch/omutlu/projects.htm

https://www.youtube.com/onurmutlulectures

https://github.com/CMU-SAFARI/

# Open Source Tools: SAFARI GitHub

# 1<sup>st</sup> Workshop Memory-Centric Computing: Research Challenges & Closing Remarks

Geraldo F. Oliveira
https://geraldofojunior.github.io

ASPLOS 2025
30 March 2025
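
Backup illustration for the Step 3 slides: the three classification metrics (AI, LLC MPKI, LFMR) are plain ratios over profiling counters, so they can be sketched in a few lines. The snippet below is a minimal Python sketch, not DAMOV's toolchain: the `Counters` record, its field names, and the sample values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical per-function counter snapshot (illustrative field names)."""
    instructions: int        # retired instructions
    arithmetic_ops: int      # floating-point/arithmetic operations
    l1_lines_accessed: int   # L1 cache lines accessed
    l1_misses: int           # L1 cache misses
    llc_misses: int          # last-level cache misses


def arithmetic_intensity(c: Counters) -> float:
    """AI: arithmetic operations per L1 cache line accessed."""
    return c.arithmetic_ops / c.l1_lines_accessed


def llc_mpki(c: Counters) -> float:
    """LLC MPKI: LLC misses per one thousand instructions."""
    return 1000.0 * c.llc_misses / c.instructions


def lfmr(c: Counters) -> float:
    """LFMR: LLC misses per L1 miss."""
    return c.llc_misses / c.l1_misses


# Made-up sample values, not measured data.
sample = Counters(
    instructions=2_000_000,
    arithmetic_ops=400_000,
    l1_lines_accessed=800_000,
    l1_misses=100_000,
    llc_misses=90_000,
)

print("AI   =", arithmetic_intensity(sample))
print("MPKI =", llc_mpki(sample))
print("LFMR =", lfmr(sample))
```

With these sample counts, nine of every ten L1 misses also miss in the LLC, so the LFMR is 0.9. Per the LFMR definition on the Step 3 slide, this is the kind of ratio that shows whether an application benefits from its L2/L3 caches.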