Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders
H Ham, J Hong, G Park, Y Shin, O Woo, W Yang… - arXiv preprint arXiv:2404.19381, 2024 - arxiv.org
To overcome the memory capacity wall of large-scale AI and big data applications, Compute Express Link (CXL) enables cost-efficient memory expansion beyond the local DRAM of processors. While its CXL.mem protocol stack minimizes interconnect latency, CXL memory accesses can still result in significant slowdowns for memory-bound applications. Near-data processing (NDP) in CXL memory can overcome such limitations, but prior works propose application-specific HW units that are not suitable for practical CXL memory-based systems, which must support various applications. On the other hand, existing CPU or GPU cores are not cost-effective for NDP because they are not optimized for memory-bound applications. In addition, the communication between the host processor and the CXL controller for NDP offloading should achieve low latency, but the CXL.io (or PCIe) protocol incurs μs-scale latency and is not suitable for fine-grain NDP. To achieve high-performance NDP end-to-end, we propose a low-overhead, general-purpose NDP architecture for CXL memory referred to as Memory-Mapped NDP (M²NDP), which comprises memory-mapped functions (M²func) and memory-mapped μthreading (M²μthr). M²func is a CXL.mem-compatible, low-overhead communication mechanism between the host processor and the NDP controller in the CXL memory. M²μthr enables a low-cost, general-purpose NDP unit design by introducing lightweight μthreads that support highly concurrent execution of NDP kernels with minimal resource wastage. By combining them, M²NDP achieves significant speedups of up to 128× (11.5× overall) for various applications, including in-memory OLAP, key-value store, large language model, recommendation model, and graph analytics, and reduces energy by up to 87.9% (80.1% overall) compared to a baseline CPU or GPU host with passive CXL memory.
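The key idea behind the memory-mapped offload path, issuing NDP launch commands through ordinary CXL.mem loads and stores to a reserved address window instead of the μs-scale CXL.io/PCIe path, can be pictured with a short host-side sketch in C. The descriptor layout, the ndp_launch_region window, and the completion convention below are hypothetical illustrations of the general memory-mapped offload pattern, not the interface defined in the paper.

/* Host-side sketch of memory-mapped NDP offload over CXL.mem.
 * All names and layouts here are hypothetical; they illustrate the
 * general pattern of launching an NDP kernel with plain loads/stores
 * instead of a CXL.io (PCIe) driver path. */
#include <stdint.h>

/* Hypothetical command descriptor that the NDP controller decodes. */
struct ndp_cmd {
    uint64_t kernel_id;   /* which pre-registered NDP kernel to run    */
    uint64_t arg_base;    /* device address of the kernel's arguments  */
    uint64_t arg_size;    /* size of the argument block in bytes       */
    uint64_t status;      /* nonzero while a launch is pending/running */
};

/* Assumed to point into a reserved window of the CXL memory expander's
 * address space; how it is mapped is platform-specific and omitted here. */
static volatile struct ndp_cmd *ndp_launch_region;

/* Launch a kernel with ordinary stores over the CXL.mem path. */
static void ndp_launch(uint64_t kernel_id, uint64_t arg_base, uint64_t arg_size)
{
    ndp_launch_region->kernel_id = kernel_id;
    ndp_launch_region->arg_base  = arg_base;
    ndp_launch_region->arg_size  = arg_size;
    /* The final store acts as a doorbell; a real design must also pick the
     * right memory type and ordering so the device observes the fields in order. */
    __atomic_store_n(&ndp_launch_region->status, 1, __ATOMIC_RELEASE);
}

/* Wait for completion with ordinary loads: the NDP controller is assumed
 * to clear status back to zero when the kernel finishes. */
static void ndp_wait(void)
{
    while (__atomic_load_n(&ndp_launch_region->status, __ATOMIC_ACQUIRE) != 0)
        ; /* spin; interrupts or a monitor/wait primitive could replace polling */
}

In this sketch, the host packs a kernel ID and argument pointer into a descriptor with plain stores and rings a "doorbell" field; the NDP controller in the expander would decode the descriptor, run the registered kernel near memory, and clear the status field, which the host observes with plain loads over the same CXL.mem path, avoiding the μs-scale CXL.io round trip for each fine-grain offload.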