==== Agenda & Workshop Materials (Tentative) ====
| 10:00 AM | Dr. Geraldo F. Oliveira | Processing-Using-Memory (PUM) Systems - Part I | {{geraldo-ics25-lecture2-PUM-part-I-beforelecture.pdf|(PDF)}} {{geraldo-ics25-lecture2-PUM-part-I-beforelecture.pptx|(PPT)}} |
| 10:30 AM | N/A | **Coffee Break** | |
| 10:45 AM | Ismail E. Yüksel | Functionally-Complete Boolean Logic in Real DRAM Chips | {{ics25_mccsys_fcdram_ismail_talk.pdf|(PDF)}} {{ics25_mccsys_fcdram_ismail_talk.pptx|(PPT)}} |
| 11:15 AM | Dr. Geraldo F. Oliveira | Processing-Using-Memory (PUM) Systems - Part II | {{geraldo-ics25-lecture3-PUM-part-II-beforelecture.pdf|(PDF)}} {{geraldo-ics25-lecture3-PUM-part-II-beforelecture.pptx|(PPT)}} |
| 11:45 AM | Dr. Geraldo F. Oliveira | Processing-Near-Memory (PNM) Systems: Academia & Industry Developments - Part I | {{geraldo-ics25-lecture4-PNM-part-I-beforelecture.pdf|(PDF)}} {{geraldo-ics25-lecture4-PNM-part-I-beforelecture.pptx|(PPT)}} |
| 12:00 PM | N/A | **Lunch** | |
| 01:00 PM | Dr. Geraldo F. Oliveira | Processing-Near-Memory (PNM) Systems: Academia & Industry Developments - Part II | {{geraldo-ics25-lecture5-PNM-part-II-beforelecture.pdf|(PDF)}} {{geraldo-ics25-lecture5-PNM-part-II-beforelecture.pptx|(PPT)}} |
| 01:30 PM | Dr. Konstantina Koliogeorgi | PIM Architectures for Bioinformatics | |
| 02:00 PM | Dr. Geraldo F. Oliveira | PIM Adoption & Programmability | {{geraldo-ics25-lecture6-adoption-beforelecture.pdf|(PDF)}} {{geraldo-ics25-lecture6-adoption-beforelecture.pptx|(PPT)}} |
| 02:30 PM | Dr. Geraldo F. Oliveira | // | |
| 03:00 PM | N/A | **Coffee Break** | |
| 03:15 PM | Taewoon Kang | SparsePIM: An Efficient HBM-Based PIM Architecture for Sparse Matrix-Vector Multiplications | |

==== Invited Speakers ====

=== Ismail E. Yüksel ===
**Talk Title:** Functionally-Complete Boolean Logic in Real DRAM Chips {{ ::

**Bio:** [[https://

=== Konstantina Koliogeorgi ===
**Talk Title:** PIM Architectures for Bioinformatics

**Talk Abstract:**

**Bio:** [[https://

=== Prof. Elaheh Sadredini (University of California, Riverside) ===
**Talk Title:** Keep it Close, Keep it Secure! Towards Efficient, Secure, and Programmable

**Talk Abstract:** Processing-in-memory

**Bio:** [[https://

=== Taewoon Kang (Korea University) ===
**Talk Title:** SparsePIM: An Efficient HBM-Based PIM Architecture for Sparse Matrix-Vector Multiplications {{ ::

**Talk Abstract:**
In order to address these challenges, we propose SparsePIM, a novel PIM architecture designed to accelerate SpMV computations efficiently. SparsePIM introduces a DRAM row-aligned format (DRAF) to optimize memory access patterns. SparsePIM exploits K-means-based column group partitioning to achieve a balanced load distribution across memory banks. Furthermore,
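
The abstract names K-means-based column grouping as SparsePIM's load-balancing step. As a rough illustration only — this is not the authors' code, and the clustering features, the DRAF layout, and all names here (e.g. ''balance_columns_across_banks'', ''NUM_BANKS'') are assumptions — the sketch below groups the columns of a sparse matrix by nonzero count with K-means and then deals each group out greedily so every bank receives a similar total number of nonzeros:

<code python>
# Hypothetical sketch of K-means-based column-group partitioning for
# SpMV load balancing, in the spirit of the SparsePIM abstract above.
# Not the authors' implementation; names and parameters are assumed.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans

NUM_BANKS = 16  # assumed number of PIM-enabled HBM banks


def balance_columns_across_banks(mat_csc, num_banks=NUM_BANKS, k=8):
    """Assign each column of a CSC sparse matrix to a memory bank."""
    # Per-column nonzero count: a simple proxy for that column's work.
    nnz_per_col = np.diff(mat_csc.indptr).astype(float).reshape(-1, 1)
    # Group columns whose nonzero counts are similar.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(nnz_per_col)
    bank_of_col = np.empty(mat_csc.shape[1], dtype=int)
    bank_load = np.zeros(num_banks)
    # Within each group, place the next column on the currently
    # least-loaded bank, keeping total nonzeros per bank even.
    for g in range(k):
        for col in np.flatnonzero(labels == g):
            b = int(np.argmin(bank_load))
            bank_of_col[col] = b
            bank_load[b] += nnz_per_col[col, 0]
    return bank_of_col, bank_load


if __name__ == "__main__":
    A = sparse_random(1024, 1024, density=0.02, format="csc", random_state=1)
    _, load = balance_columns_across_banks(A)
    print("nnz per bank:", load.astype(int))  # roughly uniform
</code>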

**Bio:** [[https://