1st Workshop on Memory-Centric Computing Systems (MCCSys) - March 30th

Workshop Description

Processing-in-Memory (PIM) is a computing paradigm that aims to overcome data movement bottlenecks by making memory systems compute-capable. Explored over several decades since the 1960s, PIM systems are now becoming a reality with the advent of the first commercial products and prototypes. PIM can improve performance and energy efficiency for many modern applications. However, there are many open questions spanning the entire computing stack and many challenges for widespread adoption.

This combined tutorial and workshop will focus on the latest advances in PIM technology, spanning both hardware and software. It will include novel PIM ideas, different tools and frameworks for conducting PIM research, and programming techniques and optimization strategies for PIM kernels. First, a series of lectures and invited talks will introduce PIM, including an overview and a rigorous analysis of existing PIM hardware from industry and academia. Second, we will invite the broad PIM research community to submit and present their ongoing work on memory-centric systems. The program committee will favor papers that bring new insights on memory-centric systems or novel PIM-friendly applications, address key system integration challenges in academic or industry PIM architectures, or put forward controversial points of view on the memory-centric execution paradigm. We also welcome position papers, especially from industry, that outline design and process challenges affecting PIM systems, new PIM architectures, or system solutions for real state-of-the-art PIM devices.

Time & Location: March 30th, from 09:00 AM to 05:30 PM (CET), at the Penn Room II.

Procedure for Selecting Presentations [CLOSED]

This workshop consists of invited talks on the general topic of memory-centric computing systems. There are a limited number of slots for invited talks. If you would like to deliver a talk on related topics, please contact us by filling out this form. The submission deadline is February 28, 2025, 23:59 AoE. We invite abstract submissions related to (but not limited to) the following topics in the context of memory-centric computing systems:

  • Design of novel processing-in-memory (PIM) architectures, including system solutions for real state-of-the-art PIM devices
  • Analysis and mapping of novel applications to state-of-the-art PIM systems
  • Programming models and code generation support for PIM
  • Runtime engines for adaptive code and data scheduling, data mapping, and access control in PIM systems
  • Memory coherence mechanisms for collaborative host–PIM execution
  • Virtual memory support for a unified host and PIM address space
  • Data structures and algorithms for PIM systems
  • Infrastructures to assess the benefits and feasibility of PIM systems, including benchmarks and simulation infrastructures for PIM prototyping
  • Issues related to robustness and security of PIM systems
  • Experimental analysis and benchmarking of real PIM systems

Livestream

YouTube livestream

Organizers

Agenda & Workshop Materials

Time | Speaker | Title | Materials
09:00am | Geraldo F. Oliveira | Logistics | (PDF) (PPT)
09:00am-09:30am | Prof. Onur Mutlu | Memory-Centric Computing Systems | (PDF) (PPT)
09:30am-10:00am | Geraldo F. Oliveira | Processing-Near-Memory Systems: Developments from Academia & Industry | (PDF) (PPT)
10:00am-10:30am | Geraldo F. Oliveira | Programming Processing-Near-Memory Systems | (PDF) (PPT)
10:30am-11:00am | N/A | Coffee Break |
11:00am-11:30am | Geraldo F. Oliveira | Processing-Using-Memory Systems for Bulk Bitwise Operations | (PDF) (PPT)
11:30am-12:00pm | Dr. Mohammad Sadr | Processing-Near-Storage & Processing-Using-Storage | (PDF) (PPT)
12:00pm-12:30pm | Geraldo F. Oliveira | Infrastructure for PIM Research & Research Challenges | (PDF)
12:30pm-02:00pm | N/A | Lunch Break |
02:00pm-02:30pm | Hamid Farzaneh | CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms | (PDF) (PPT)
02:30pm-03:00pm | Theocharis Diamantidis | Harnessing PIM Techniques for Accelerating Sum Operations in FPGA-DRAM Architectures | (PDF) (PPT)
03:00pm-03:30pm | Krystian Chmielewski | Pitfalls of UPMEM Kernel Development | (PDF) (PPT)
03:30pm-04:00pm | N/A | Coffee Break |
04:00pm-04:30pm | Yintao He | PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System | (PDF) (PPT)
04:30pm-05:00pm | Yufeng Gu | PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference | (PDF) (PPT)
05:00pm-05:30pm | Dr. Christina Giannoula | PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures | (PDF)

Invited Speakers

Hamid Farzaneh (TU Dresden)

Talk Title: CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms

Talk Abstract: The rise of data-intensive applications exposed the limitations of conventional processor-centric von Neumann architectures, which struggle to meet the off-chip memory bandwidth demand. Therefore, recent innovations in computer architecture advocate compute-in-memory (CIM) and compute-near-memory (CNM), non-von-Neumann paradigms that achieve orders-of-magnitude improvements in performance and energy consumption. Despite significant technological breakthroughs in the last few years, the programmability of these systems remains a serious challenge: their programming models are too low-level and specific to particular system implementations. Since such future architectures are predicted to be highly heterogeneous, developing novel compiler abstractions and frameworks becomes necessary. To this end, we present CINM (Cinnamon), the first end-to-end compilation flow that leverages hierarchical abstractions to generalize over different CIM and CNM devices and to enable device-agnostic and device-aware optimizations. Cinnamon progressively lowers input programs and performs optimizations at each level of the lowering pipeline. To show its efficacy, we evaluate CINM on a set of benchmarks for the well-known UPMEM CNM system and for memristor-based CIM accelerators. We show that Cinnamon, while supporting multiple hardware targets, generates high-performance code comparable to or better than state-of-the-art implementations.
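
The flow's central idea, progressive lowering through a hierarchy of abstractions, can be pictured with a toy sketch. The C program below is our own illustration, not Cinnamon's actual API: the tile sizes and the printed "dialect" operations are assumptions that merely echo the paper's CNM/CIM targets.

  /* Toy rendition of progressive lowering (illustration only; the real
     Cinnamon is built on MLIR and its dialects differ): one high-level
     op is first tiled device-agnostically, then lowered to
     device-specific ops depending on the target. */
  #include <stdio.h>

  typedef enum { TARGET_CNM_UPMEM, TARGET_CIM_CROSSBAR } target_t;

  static void lower_gemv(int rows, int cols, target_t t) {
      printf("high level: gemv %dx%d\n", rows, cols);
      /* Device-agnostic level: tile to the target's unit of parallelism
         (tile sizes are made up for the example). */
      int tile = (t == TARGET_CNM_UPMEM) ? 64 : 256;
      printf("mid level : %d tiles of %d rows\n", rows / tile, tile);
      /* Device-aware level: emit per-target operations. */
      if (t == TARGET_CNM_UPMEM)
          printf("low level : launch one near-memory kernel per tile\n");
      else
          printf("low level : map each tile onto a %dx%d crossbar gemv\n",
                 tile, cols);
  }

  int main(void) {
      lower_gemv(4096, 4096, TARGET_CNM_UPMEM);    /* CNM path */
      lower_gemv(4096, 4096, TARGET_CIM_CROSSBAR); /* CIM path */
      return 0;
  }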

Bio: Hamid Farzaneh received his bachelor's degree in Computer Engineering from Shiraz University in August 2019, and his master's degree in Computer Systems and Architecture from Shahid Beheshti University in November 2021. In August 2022, he joined the chair as a research assistant. He works on high-level compiler frameworks (such as MLIR) and on optimizing data and computation mapping onto highly heterogeneous systems comprising mainstream CPUs, FPGAs, SRAM, DRAM, and emerging NVMs and accelerators.

Theocharis Diamantidis (National Technical University of Athens)

Talk Title: Harnessing PIM Techniques for Accelerating Sum Operations in FPGA-DRAM Architectures

Talk Abstract: In this work, we present and evaluate a method to exploit circuits originally implemented for storage operations in memory arrays, such that we can perform logic and arithmetic operations inside commercial DRAM chips. That is to say, we explore techniques for performing bit-level calculations by relying exclusively on the analog properties inherent in memory arrays. We start by demonstrating binary logic operations, such as AND/OR/NOT gates, which we then apply to multiple bits in sequential and parallel fashion. By combining distinct gate calculations, we construct a fully functional full-adder procedure that performs single-bit addition entirely within the DRAM chip. Furthermore, we chain multiple full adders to create a complete multi-bit adder, which can be applied in parallel to thousands of numbers and can therefore accelerate huge summations. We evaluate the speed-up factor achieved by our method, alongside the errors induced by the various approaches.
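
The gate-to-adder construction can be modeled in a few lines of host code. The sketch below is our own illustration (not the talk's code): operands are stored bit-sliced, one machine word per bit plane, so each bitwise operation advances 64 additions in lockstep, just as a DRAM subarray would advance thousands of columns at once.

  /* Bit-serial addition over bit-sliced data: a ripple-carry full adder
     evaluated one bit plane at a time, built only from bulk AND/OR/XOR
     operations of the kind the talk realizes inside DRAM. */
  #include <stdint.h>
  #include <stdio.h>

  #define WIDTH 8 /* bits per operand */

  /* a[k], b[k], out[k] hold bit k of 64 packed operands. */
  static void bulk_add(const uint64_t a[WIDTH], const uint64_t b[WIDTH],
                       uint64_t out[WIDTH]) {
      uint64_t carry = 0;
      for (int k = 0; k < WIDTH; k++) {      /* LSB to MSB */
          out[k] = a[k] ^ b[k] ^ carry;      /* sum bit */
          /* carry = majority(a, b, carry), built from AND/OR gates */
          carry = (a[k] & b[k]) | (a[k] & carry) | (b[k] & carry);
      }
  }

  /* Pack bit k of 64 values into one word (one "row" per bit plane). */
  static void to_planes(const uint8_t v[64], uint64_t planes[WIDTH]) {
      for (int k = 0; k < WIDTH; k++) {
          planes[k] = 0;
          for (int j = 0; j < 64; j++)
              planes[k] |= (uint64_t)((v[j] >> k) & 1) << j;
      }
  }

  int main(void) {
      uint8_t x[64], y[64];
      for (int j = 0; j < 64; j++) { x[j] = (uint8_t)j; y[j] = (uint8_t)(2 * j); }
      uint64_t xp[WIDTH], yp[WIDTH], s[WIDTH];
      to_planes(x, xp); to_planes(y, yp);
      bulk_add(xp, yp, s); /* 64 additions per full-adder sweep */
      unsigned r = 0;      /* read back element 10: expect 10 + 20 = 30 */
      for (int k = 0; k < WIDTH; k++) r |= (unsigned)((s[k] >> 10) & 1) << k;
      printf("%u\n", r);
      return 0;
  }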

Bio: Theocharis Diamantidis is a graduate student in the Department of Electrical and Computer Engineering at the National Technical University of Athens (NTUA), where he is completing his master's thesis. He aspires to pursue a doctorate at the MicroLab laboratory at NTUA. His academic interests lie in circuit design and simulation, with a focus on analog and mixed-signal circuits and their applications. His master's thesis explores Compute-in-Memory, specifically leveraging DRAM circuitry to analyze the summation operation and investigate the speed-up it offers.

Krystian Chmielewski (Huawei Warsaw Research Center)

Talk Title: Pitfalls of UPMEM Kernel Development

Talk Abstract: Developing kernels for Processing-In-Memory (PIM) platforms presents unique challenges, particularly in data management and executing tasks concurrently on limited PIM processing units. While software development kits (SDKs) for PIM, such as the UPMEM SDK, provide essential tools for programmers, these platforms are still evolving and offer significant opportunities for performance optimization with existing hardware. In this talk, we share our surprising findings on the inefficiencies within the UPMEM software stack. We will present several straightforward cases where simple modifications to the assembly generated by the UPMEM compiler led to substantial performance improvements, including a staggering 1.6-4.9x speedup in integer multiplication (depending on the integer types used). Additionally, we will demonstrate how minor extensions to the UPMEM API enabled us to better manage NUMA (Non-Uniform Memory Access) effects on our test platform. Consequently, the performance of data transfer operations between the host and PIM (in both directions) became more consistent and improved by 30% over the baseline.
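
For orientation, the sketch below shows a minimal UPMEM host-side program, assuming the host API documented in the UPMEM SDK; the DPU count, kernel binary name, and MRAM symbol are placeholders. The parallel host-to-DPU transfer at its center is the path whose NUMA placement the talk's API extensions help control.

  /* Minimal UPMEM host program (sketch; requires the UPMEM SDK and a
     DPU kernel binary "./mul_kernel" exposing an MRAM symbol "buffer"). */
  #include <dpu.h>
  #include <stdint.h>

  #define NR_DPUS   64
  #define N_PER_DPU 1024

  int main(void) {
      struct dpu_set_t set, dpu;
      uint32_t each;
      static uint32_t in[NR_DPUS][N_PER_DPU]; /* host-side staging buffer */

      DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &set));
      DPU_ASSERT(dpu_load(set, "./mul_kernel", NULL));

      /* Gathered parallel transfer: each DPU receives its own slice.
         Which NUMA node `in` lives on relative to the DPU ranks is what
         the NUMA-aware extensions discussed in the talk help manage. */
      DPU_FOREACH(set, dpu, each) {
          DPU_ASSERT(dpu_prepare_xfer(dpu, in[each]));
      }
      DPU_ASSERT(dpu_push_xfer(set, DPU_XFER_TO_DPU, "buffer", 0,
                               N_PER_DPU * sizeof(uint32_t),
                               DPU_XFER_DEFAULT));

      DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));
      DPU_ASSERT(dpu_free(set));
      return 0;
  }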

Bio: Krystian Chmielewski is a software engineer with 8 years of experience in emerging computing architectures and low-level performance optimizations. Since 2023, he has been working at Huawei Warsaw Research Center in Poland focusing on the enablement of novel Processing-In-Memory architectures and optimizing the JVM's just-in-time compiler. Prior to this, Krystian spent 6 years at Intel, where he specialized in compute runtimes and worked on features such as Mutable Command Lists.

Yintao He (UCAS)

Talk Title: PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

Talk Abstract: Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that the characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes made to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) a one-size-fits-all approach to designing PIM units inefficient due to a large degree of heterogeneity even among memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly used LLMs show that PAPI achieves 1.8× and 11.1× speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.
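
The scheduling idea can be condensed into a few lines. The sketch below is our own illustration, not PAPI's actual mechanism: the balance threshold and the GEMV cost model are assumptions, chosen only to show how a kernel's best home can flip as batch size changes during decoding.

  /* Dispatch a kernel to PIM or to the compute-centric accelerator
     based on its current arithmetic intensity (FLOP/byte). */
  #include <stdio.h>

  typedef enum { ENGINE_PIM, ENGINE_ACCEL } engine_t;

  /* Illustrative machine balance: above this intensity, the
     accelerator is the better home for a kernel. */
  #define ACCEL_BALANCE 40.0

  static engine_t schedule(double flops, double bytes) {
      return (flops / bytes >= ACCEL_BALANCE) ? ENGINE_ACCEL : ENGINE_PIM;
  }

  int main(void) {
      double n = 4096; /* layer dimension (assumed) */
      /* Decode with batch 1: ~2*n*n FLOPs over ~2*n*n weight bytes
         (fp16) -> intensity ~1 FLOP/byte -> memory-bound -> PIM. */
      engine_t e1 = schedule(2 * n * n, 2 * n * n);
      /* Batch 128 amortizes the same weight traffic -> compute-bound. */
      double b = 128;
      engine_t e2 = schedule(2 * b * n * n, 2 * n * n);
      printf("batch 1: %s, batch 128: %s\n",
             e1 == ENGINE_PIM ? "PIM" : "accel",
             e2 == ENGINE_PIM ? "PIM" : "accel");
      return 0;
  }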

Bio: Yintao He received the BE degree in electronic science and technology from Nankai University, Tianjin, China, in 2019. She is currently working toward the PhD degree at the University of Chinese Academy of Sciences, Beijing, China. Her research interests include processing-in-memory and energy-efficient accelerators.

Yufeng Gu (University of Michigan)

Talk Title: PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference

Talk Abstract: Large Language Model (LLM) inference generates one token at a time in an autoregressive manner, which exhibits notably lower operational intensity than earlier Machine Learning (ML) models such as encoder-only transformers and Convolutional Neural Networks. At the same time, LLMs have large parameter sizes and use key-value caches to store context information. Modern LLMs support context windows with up to 1 million tokens to generate versatile text, audio, and video content. A large key-value cache unique to each prompt requires a large memory capacity, limiting the inference batch size. Both low operational intensity and limited batch size necessitate high memory bandwidth. However, contemporary hardware systems for ML model deployment, such as GPUs and TPUs, are primarily optimized for compute throughput. This mismatch challenges the efficient deployment of advanced LLMs and forces users to pay for expensive compute resources that are poorly utilized for memory-bound LLM inference tasks.
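
To see why the key-value cache caps the batch size, a back-of-envelope sizing helps. The sketch below uses a hypothetical model shape (layer count, head configuration, and fp16 precision are our assumptions, not the talk's):

  /* Back-of-envelope KV-cache sizing for a long-context prompt. */
  #include <stdio.h>

  int main(void) {
      double layers = 32, kv_heads = 8, head_dim = 128, bytes = 2; /* fp16 */
      /* Per generated token: one K and one V vector per layer. */
      double per_token = 2 * layers * kv_heads * head_dim * bytes;
      double ctx = 1e6; /* 1M-token context window */
      double per_prompt_gib = per_token * ctx / (1024.0 * 1024 * 1024);
      printf("KV cache: %.0f B/token, %.1f GiB per 1M-token prompt\n",
             per_token, per_prompt_gib);
      /* ~128 KiB/token -> ~122 GiB for a single prompt, more than one
         GPU's HBM, so batch size (and compute utilization) collapses. */
      return 0;
  }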

Bio: Yufeng Gu is a PhD candidate at the University of Michigan, advised by Dr. Reetuparna Das. Prior to the University of Michigan, Yufeng obtained a bachelor's degree from Zhejiang University in 2020. His research focuses on computer architecture, hardware/software co-design, near-memory processing, and quality-of-service optimization. He is developing novel hardware and software solutions for accelerating large-scale emerging applications, such as precision health and generative artificial intelligence (GenAI) workloads.

Dr. Christina Giannoula (University of Toronto)

Talk Title: PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures

Talk Abstract: Graph Neural Networks (GNNs) are emerging ML models for analyzing graph-structured data. GNN execution involves both compute-intensive and memory-intensive kernels; the latter dominate the total execution time, being significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside memory arrays. In this work, we introduce PyGim, an efficient ML library that accelerates GNNs on real PIM systems. We propose intelligent parallelization techniques for the memory-intensive kernels of GNNs tailored to real PIM systems, and develop a handy Python API for them. We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively. We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by 3.04x on average, and achieves higher resource utilization than CPU and GPU systems. Our work provides useful recommendations for software, system, and hardware designers.
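
One such parallelization decision is sketched below in C (our own illustration, not PyGim's Python API): when GNN aggregation reduces to a sparse-matrix kernel, rows should be split across PIM cores by nonzero count rather than row count, since the kernel's work is proportional to nonzeros.

  /* Assign contiguous CSR row ranges to PIM cores with roughly equal
     nonzeros each; boundaries are written to split[0..cores]. */
  #include <stdio.h>

  static void partition_rows(const int *rowptr, int nrows, int cores,
                             int *split) {
      int nnz = rowptr[nrows], r = 0;
      split[0] = 0;
      for (int c = 1; c < cores; c++) {
          long target = (long)nnz * c / cores; /* ideal nnz boundary */
          while (r < nrows && rowptr[r] < target) r++;
          split[c] = r;
      }
      split[cores] = nrows;
  }

  int main(void) {
      /* 6 rows with skewed nonzero counts 1,1,8,1,1,4 (nnz = 16). */
      int rowptr[] = {0, 1, 2, 10, 11, 12, 16};
      int split[3];
      partition_rows(rowptr, 6, 2, split);
      printf("core 0: rows [%d,%d)  core 1: rows [%d,%d)\n",
             split[0], split[1], split[1], split[2]);
      return 0;
  }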

Bio: Christina Giannoula received her Ph.D. degree in October 2022 from the School of Electrical and Computer Engineering, National Technical University of Athens, advised by Prof. Georgios Goumas, Prof. Nectarios Koziris, and Prof. Onur Mutlu. She is currently a Postdoctoral Researcher at the University of Toronto, working with Prof. Gennady Pekhimenko and his research group. She is also affiliated with the SAFARI Research Group of Prof. Onur Mutlu. Her research interests lie at the intersection of computer architecture, computer systems, and high-performance computing. Specifically, her research focuses on the hardware/software co-design of emerging applications, including graph processing, pointer-chasing data structures, machine learning workloads, and sparse linear algebra, with modern computing paradigms, such as large-scale multicore systems, disaggregated memory systems, and near-data processing architectures. She has several publications and awards for her research on the aforementioned topics. She is a member of ACM, ACM-W, and the Technical Chamber of Greece.

Learning Materials

  • Mutlu, O., Ghose, S., Gómez-Luna, J., and Ausavarungnirun, R., “A Modern Primer on Processing in Memory.” In Emerging Computing: From Devices to Systems, 2023.
  • Gómez-Luna, J., El Hajj, I., Fernandez, I., Giannoula, C., Oliveira, G. F., and Mutlu, O., “Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System.” IEEE Access, 2022.
  • Giannoula, C., Fernandez, I., Gómez-Luna, J., Koziris, N., Goumas, G., and Mutlu, O., “SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures,” in SIGMETRICS 2022.
  • Olgun, A., Gómez-Luna, J., Kanellopoulos, K., Salami, B., Hassan, H., Ergin, O., and Mutlu, O., “PiDRAM: A Holistic End-to-End FPGA-Based Framework for Processing-in-DRAM.” ACM TACO, 2022.
  • Oliveira, G. F., Gómez-Luna, J., Orosa, L., Ghose, S., Vijaykumar, N., Fernandez, I., Sadrosadati, M., Mutlu, O., “DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks.” IEEE Access, 2021.
  • Luo, H., Tuğrul, Y. C., Bostancı, F. N., Olgun, A., Yağlıkçı, A. G., Mutlu, O., “Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator.” IEEE CAL, 2023.
  • Olgun, A., Hassan, H., Yağlıkçı, A. G., Tuğrul, Y. C., Orosa, L., Luo, H., Patel, M., Ergin, O., Mutlu, O., “DRAM Bender: An Extensible and Versatile FPGA-Based Infrastructure to Easily Test State-of-the-Art DRAM Chips.” IEEE TCAD, 2023.
  • Oliveira, G. F., Olgun, A., Yaglikci, A. G., Bostanci, N., Gomez-Luna, J., Ghose, S., Mutlu, O., “MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Computing,” in HPCA, 2024.
  • Hajinazar, N., Oliveira, G. F., Gregorio, S., Ferreira, J. D., Ghiasi, N. M., Patel, M., Alser, M., Ghose, S., Gomez-Luna, J., Mutlu. O., “SIMDRAM: An End-to-End Framework for Bit-Serial SIMD Computing in DRAM,” in ASPLOS, 2021.
  • Seshadri, V., Lee, D., Mullins, T., Hassan, H., Boroumand, A., Kim, J., Kozuch, M. A., Mutlu, O., Gibbons, P. B., Mowry, T. C., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
  • Schwedock, B.C., Yoovidhya, P., Seibert, J. and Beckmann, N., “Täkō: A Polymorphic Cache Hierarchy for General-Purpose Optimization of Data Movement,” in ISCA, 2022.
  • Schwedock, B.C. and Beckmann, N., “Leviathan: A Unified System for General-Purpose Near-Data Computing,” in MICRO, 2024.

More Learning Materials

  • Mutlu, O., Memory-Centric Computing (IMACAW Keynote Talk at DAC 2023), July 2023.
  • Processing-in-Memory: A Workload-Driven Perspective (summary paper about recent research in PIM).
  • Processing Data Where It Makes Sense: Enabling In-Memory Computation (summary paper about recent research in PIM).
  • Processing-in-Memory course (Spring 2022).
  • Gómez-Luna, J., and Mutlu, O., Data-Centric Architectures: Fundamentally Improving Performance and Energy (227-0085-37L), ETH Zürich, Fall 2022.