R&D: Seven Articles on CXL Technologies and Solutions
Offloading to CXL-based computational memory; context-aware Mixture-of-Experts inference on CXL-enabled GPU-NDP systems; guidelines for building indexes on partially cache-coherent CXL shared memory; Beluga, a CXL-based memory architecture for scalable and efficient LLM KVCache management; an FPGA-based distributed shared memory architecture supporting the CXL 2.0+ specification; Sangam, a chiplet-based DRAM-PIM accelerator with CXL integration for LLM inferencing; and pushing the memory bandwidth wall with CXL-enabled idle I/O bandwidth harvesting
This is a Press Release edited by StorageNewsletter.com on December 29, 2025, at 2:00 pm.
R&D: Offloading to CXL-based Computational Memory
Work examines the trade-offs of offloading models based on different CXL protocols and demonstrates their impact on end-to-end performance and system efficiency for workloads with diverse data and processing requirements.
ArXiv has published an article written by Suyeon Lee, School of Computer Science, Georgia Institute of Technology, Atlanta, USA, Kangkyu Park, Kwangsik Shin, Memory Systems Research, SK hynix, Seongnam, Republic of Korea, and Ada Gavrilovska, School of Computer Science, Georgia Institute of Technology, Atlanta, USA.
Abstract: “CXL-based Computational Memory (CCM) enables near-memory processing within expanded remote memory, presenting opportunities to address data movement costs associated with disaggregated memory systems and to accelerate overall performance. However, existing operation offloading mechanisms are not capable of leveraging the trade-offs of different models based on different CXL protocols. This work first examines these tradeoffs and demonstrates their impact on end-to-end performance and system efficiency for workloads with diverse data and processing requirements. We propose a novel ‘Asynchronous Back-Streaming’ protocol by carefully layering data and control transfer operations on top of the underlying CXL protocols. We design KAI, a system that realizes the asynchronous back-streaming model that supports asynchronous data movement and lightweight pipelining in host-CCM interactions. Overall, KAI reduces end-to-end runtime by up to 50.4%, and CCM and host idle times by an average of 22.11x and 3.85x, respectively.”
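The gain comes from overlapping host-side work with results streaming back from the device rather than blocking on each offloaded operation. As a rough illustration only (not the KAI implementation; names, timings, and the workload below are invented), a back-streaming pipeline can be mimicked with a worker thread and a queue:

```python
# Toy illustration only: the host keeps consuming results while a simulated
# CXL computational memory (CCM) device streams partial results back.
# Names, timings, and the workload are invented, not taken from the KAI paper.
import queue
import threading
import time

def ccm_worker(chunks, results):
    for chunk in chunks:
        time.sleep(0.01)              # stand-in for near-memory compute
        results.put(sum(chunk))       # "back-stream" one partial result
    results.put(None)                 # end-of-stream marker

def host_run(chunks):
    results = queue.Queue()
    threading.Thread(target=ccm_worker, args=(chunks, results), daemon=True).start()
    total = 0
    while True:
        partial = results.get()       # waits only for the next streamed result
        if partial is None:
            break
        total += partial              # lightweight host-side aggregation
    return total

if __name__ == "__main__":
    data = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
    print(host_run(data))             # host and CCM work are overlapped
```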
R&D: Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
Authors develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pins hot experts in GPU-side HBM, and maps the remainder to CXL-NDP.
ArXiv has published an article written by Zehao Fan, Zhenyu Liu, Rensselaer Polytechnic Institute, Troy, NY, USA, Yunzhen Liu, University of Massachusetts Amherst, Amherst, MA, USA, Yayue Hou, Rensselaer Polytechnic Institute, Troy, NY, USA, Hadjer Benmeziane, IBM Research Europe, Switzerland, Kaoutar El Maghraoui, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, and Liu Liu, Rensselaer Polytechnic Institute, Troy, NY, USA.
Abstract: “Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed the capacity of GPU memory. In this case, weights must be offloaded to external memory, and fetching them incurs costly and repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier to execute cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems that are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement, dynamically pins hot experts in GPU-side HBM, and maps the remainder to CXL-NDP. To meet NDP’s limited compute throughput, we introduce context-aware mixed-precision quantization that allocates per-expert bitwidths (1-4 bit) based on prefill stage. The resulting MoE inference system overlaps GPU and NDP execution while minimizing cross-device movement. The evaluation on the GPU-NDP system shows that our approach achieves up to an 8.7-fold decoding throughput improvement over the state-of-the-art method, while incurring only a 0.13% average accuracy drop.“
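The placement policy can be pictured with a short sketch. The code below is an illustrative assumption, not the authors' system: it ranks experts by prefill-stage hit counts, pins the hottest ones in GPU HBM at full precision, and assigns progressively lower bitwidths to colder experts mapped to the CXL-NDP tier.

```python
# Illustrative sketch, not the authors' system: rank experts by prefill-stage
# activation counts, pin the hottest in GPU HBM, and map colder experts to a
# CXL-NDP tier with progressively lower bitwidths (4/2/1-bit, per the abstract).
from collections import Counter

def place_experts(prefill_expert_hits, hbm_slots, ndp_bitwidths=(4, 2, 1)):
    ranked = [e for e, _ in Counter(prefill_expert_hits).most_common()]
    placement = {}
    n_cold = max(len(ranked) - hbm_slots, 1)
    for rank, expert in enumerate(ranked):
        if rank < hbm_slots:
            placement[expert] = ("HBM", 16)                 # pinned, e.g. fp16
        else:
            tier = min((rank - hbm_slots) * len(ndp_bitwidths) // n_cold,
                       len(ndp_bitwidths) - 1)
            placement[expert] = ("CXL-NDP", ndp_bitwidths[tier])
    return placement

if __name__ == "__main__":
    prefill_hits = [3, 0, 3, 7, 3, 0, 1, 3, 7, 2, 0, 3]     # expert ids seen in prefill
    print(place_experts(prefill_hits, hbm_slots=2))
```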
R&D: Guidelines for Building Indexes on Partially Cache-Coherent CXL Shared Memory
Paper focuses on building consistent and efficient indexes on PCC platforms.
ArXiv has published an article written by Fangnuo Wu, Mingkai Dong, Wenjun Cai, Jingsheng Yan, and Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University, China.
Abstract: “The Partial Cache-Coherence (PCC) model maintains hardware cache coherence only within subsets of cores, enabling large-scale memory sharing with emerging memory interconnect technologies like Compute Express Link (CXL). However, PCC’s relaxation of global cache coherence compromises the correctness of existing single-machine software.”
“This paper focuses on building consistent and efficient indexes on PCC platforms. We present that existing indexes designed for cache-coherent platforms can be made consistent on PCC platforms following SP guidelines, i.e., we identify sync-data and protected-data according to the index’s concurrency control mechanisms, and synchronize them accordingly. However, conversion with SP guidelines introduces performance overhead. To mitigate the overhead, we identify several unique performance bottlenecks on PCC platforms, and propose P3 guidelines (i.e., using Out-of-Place update, RePlicated shared variable, SPeculative Reading) to improve the efficiency of converted indexes on PCC platforms.”
“With SP and P3 guidelines, we convert and optimize two indexes (CLevelHash and BwTree) for PCC platforms. Evaluation shows that converted indexes’ throughput improves up to 16x following P3 guidelines, and the optimized indexes outperform their message-passing-based and disaggregated-memory-based counterparts by up to 16x and 19x.”
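Two of the P3 guidelines, out-of-place update and speculative reading, can be sketched conceptually. The Python below is only a stand-in; real PCC code would issue explicit cache flushes and invalidations, which are marked in comments as assumptions.

```python
# Conceptual stand-in for two of the P3 guidelines, written against an imagined
# partially cache-coherent (PCC) shared memory; flush/invalidate steps are
# assumptions noted in comments, not a real API.

class SharedSlot:
    """Out-of-place update: a writer never mutates the published record;
    it builds a fresh copy and republishes it with a new version number."""

    def __init__(self, value):
        self.version = 0
        self.record = value

    def update(self, new_value):
        fresh = dict(new_value)          # build the replacement out of place
        # on real PCC hardware: flush 'fresh' to shared memory before publishing
        self.record = fresh              # single pointer-sized publish
        self.version += 1

    def speculative_read(self):
        """Speculative reading: read optimistically and re-check the version,
        rather than taking a lock shared across non-coherent cores."""
        while True:
            v1 = self.version            # on real PCC hardware: invalidate first
            value = self.record
            v2 = self.version
            if v1 == v2:                 # no republish raced with this read
                return value


slot = SharedSlot({"key": 1})
slot.update({"key": 2})
print(slot.speculative_read())           # -> {'key': 2}
```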
R&D: Beluga, CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management
Authors propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches.
ArXiv has published an article written by Xinjun Yang, Alibaba Cloud Computing, Sunnyvale, CA, USA, Qingda Hu, Alibaba Cloud Computing, Hangzhou, China, Junru Li, Alibaba Cloud Computing, Beijing, China, Feifei Li, Yicong Zhu, Alibaba Cloud Computing, Hangzhou, China, Yuqi Zhou, Alibaba Cloud Computing, Beijing, China, Qiuru Lin, Jian Dai, Alibaba Cloud Computing, Hangzhou, China, Yang Kong, Jiayu Zhang, Alibaba Cloud Computing, Shanghai, China, Guoqiang Xu, Alibaba Cloud Computing, Hangzhou, China, and Qiang Liu, Alibaba Cloud Computing, Shenzhen, China.
Abstract: “The rapid increase in LLM model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local memory latency, while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of a commercial CXL switch-based memory pool and propose a set of design guidelines. Based on Beluga, we design and implement Beluga-KVCache, a system tailored for managing the large-scale KVCache in LLM inference. Beluga-KVCache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and 7.35x throughput improvement in the vLLM inference engine compared to RDMA-based solutions. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches, marking a significant step toward low-latency, shared access to vast memory resources by GPUs.“
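The distinguishing property is that the CXL pool is mapped into the process address space and touched with ordinary loads and stores. A minimal sketch, assuming the pool is exposed as a device-DAX region at a hypothetical path, might look like the following; the path, block layout, and sizes are illustrative and not taken from the Beluga paper.

```python
# Minimal sketch of load/store access to a (hypothetical) CXL memory pool
# exposed as a device-DAX region; path and KV block layout are assumptions.
import mmap
import numpy as np

CXL_DAX_PATH = "/dev/dax0.0"          # hypothetical CXL-backed DAX device
BLOCK_BYTES = 2 * 16 * 128 * 2        # e.g. K+V, 16 heads, head dim 128, fp16

def map_pool(size_bytes):
    f = open(CXL_DAX_PATH, "r+b", buffering=0)
    return mmap.mmap(f.fileno(), size_bytes)   # pool appears as ordinary memory

def read_kv_block(pool, block_id):
    off = block_id * BLOCK_BYTES
    # A plain load from the mapped region: no RDMA verbs, no completion queues
    return np.frombuffer(pool, dtype=np.float16, count=BLOCK_BYTES // 2, offset=off)

def write_kv_block(pool, block_id, kv):
    off = block_id * BLOCK_BYTES
    pool[off:off + BLOCK_BYTES] = kv.astype(np.float16).tobytes()
```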
R&D: FPGA-Based Distributed Shared Memory Architecture Supporting CXL 2.0+ Specification
Paper proposes an FPGA-based distributed shared memory architecture supporting the CXL 2.0+ specification.
Network and Parallel Computing has published an article written by Xiuhao Huang, Jinge Ding, Haikun Liu, Zhuohui Duan, Xiaofei Liao, and Hai Jin, National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab/Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, 430074, Wuhan, China.
Abstract: “Big data and AI applications pose significant challenges to traditional distributed shared memory architectures, where network bandwidth and latency constraints have become critical bottlenecks. Although the Compute Express Link (CXL) protocol promises low-latency, high-bandwidth interconnects for memory expansion, existing CXL 1.1 devices still cannot support fine-grained memory sharing across multiple nodes. This paper proposes an FPGA-based distributed shared memory architecture supporting the CXL 2.0+ specification. It features three key innovations for transparent cross-node memory accesses: 1) replacing conventional network stacks with CXL physical links to mitigate the performance overhead of frequent data copying; 2) a hardware-managed memory controller with interleaved access mechanisms to optimize the bandwidth utilization of the CXL-DDR channel; 3) hierarchical queues to ensure memory access orders under high concurrency. This fine-grained memory sharing architecture supports zero-copy data swapping across multiple servers via a pass-by-reference manner. Experimental results show that the end-to-end access latency of our CXL-based shared memory architecture is as low as 1.25 µs, 5x lower than that of one-sided Remote Direct Memory Access (RDMA).”
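The interleaved access mechanism in point 2) amounts to striping consecutive address ranges across DDR channels. A back-of-the-envelope sketch with assumed parameters shows the mapping arithmetic:

```python
# Back-of-the-envelope sketch of channel interleaving, the kind of mapping a
# hardware memory controller applies; granule and channel count are assumptions.
INTERLEAVE_GRANULE = 256        # bytes per stripe
NUM_CHANNELS = 4                # DDR channels behind the CXL endpoint

def channel_of(addr):
    # Consecutive 256 B stripes rotate across channels, so a streaming access
    # pattern spreads load over all DDR channels instead of hammering one.
    return (addr // INTERLEAVE_GRANULE) % NUM_CHANNELS

def channel_offset(addr):
    stripe = addr // (INTERLEAVE_GRANULE * NUM_CHANNELS)
    return stripe * INTERLEAVE_GRANULE + (addr % INTERLEAVE_GRANULE)

if __name__ == "__main__":
    for a in range(0, 2048, 256):
        print(hex(a), "-> channel", channel_of(a), "offset", channel_offset(a))
```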
R&D: Sangam, Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing
Work presents a chiplet-based memory module that addresses limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer.
ArXiv has published an article written by Khyati Kiyawat, Zhenxing Fan, Yasas Seneviratne, Morteza Baradaran, Akhil Shekar, Zihan Xia, Mingu Kang, and Kevin Skadron, University of Virginia, Charlottesville, VA, USA, and University of California, San Diego, CA, USA.
Abstract: “Large Language Models (LLMs) are becoming increasingly data-intensive due to growing model sizes, and they are becoming memory-bound as the context length and, consequently, the key-value (KV) cache size increase. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), making it well-suited for processing-in-memory (PIM) approaches. However, existing in/near-memory solutions face critical limitations such as reduced memory capacity due to the high area cost of integrating processing elements (PEs) within DRAM chips, and limited PE capability due to the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer. The logic chiplets sustain high bandwidth access to the DRAM chiplets, which house the memory banks, and enable the integration of advanced processing components such as systolic arrays and SRAM-based buffers to accelerate memory-bound GEMM kernels, capabilities that were not feasible in prior PIM architectures. We propose Sangam, a CXL-attached PIM-chiplet based memory module that can either act as a drop-in replacement for GPUs or co-execute alongside the GPUs. Sangam achieves 3.93x, 4.22x, and 2.82x speedups in end-to-end query latency, 10.3x, 9.5x, and 6.36x greater decoding throughput, and order-of-magnitude energy savings compared to an H100 GPU for varying input size, output length, and batch size on LLaMA 2-7B, Mistral-7B, and LLaMA 3-70B, respectively.”
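The memory-bound argument in the opening sentences follows from the operational intensity of a GEMV: with fp16 weights it is roughly 1 FLOP per byte, far below what saturates GPU compute. A one-line calculation makes this concrete (illustrative, not from the paper):

```python
# Why decode is memory-bound: operational intensity (FLOPs per byte) of a
# GEMV with fp16 weights is far below what keeps compute units busy.
def gemv_operational_intensity(m, n, bytes_per_weight=2):
    flops = 2 * m * n                      # one multiply + one add per weight
    bytes_moved = m * n * bytes_per_weight # weight matrix dominates traffic
    return flops / bytes_moved

# For any layer size, fp16 GEMV lands at about 1 FLOP/byte, so throughput is
# set by memory bandwidth, which motivates executing these kernels in/near DRAM.
print(gemv_operational_intensity(4096, 4096))   # -> 1.0
```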
R&D: Pushing Memory Bandwidth Wall with CXL-enabled Idle I/O Bandwidth Harvesting
Authors introduce SURGE, a software-supported architectural technique that boosts memory bandwidth availability by salvaging idle I/O bandwidth resources.
ArXiv has published an article written by Divya Kiran Kadiyala, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA, and Alexandros Daglis, School of Computer Science, Georgia Institute of Technology, Atlanta, Georgia, USA.
Abstract: “The continual increase of cores on server-grade CPUs raises demands on memory systems, which are constrained by limited off-chip pin and data transfer rate scalability. As a result, high-end processors typically feature lower memory bandwidth per core, to the detriment of memory-intensive workloads. We propose alleviating this challenge by improving the utility of the CPU’s limited pins. In a typical CPU design process, the available pins are apportioned between memory and I/O traffic, each accounting for about half of the total off-chip bandwidth availability. Consequently, unless both memory and I/O are simultaneously highly utilized, such fragmentation leads to underutilization of the valuable off-chip bandwidth resources. An ideal architecture would offer I/O and memory bandwidth fungibility, allowing use of the aggregate off-chip bandwidth in the form required by each workload.”
“In this work, we introduce SURGE, a software-supported architectural technique that boosts memory bandwidth availability by salvaging idle I/O bandwidth resources. SURGE leverages the capability of versatile interconnect technologies like CXL to dynamically multiplex memory and I/O traffic over the same processor interface. We demonstrate that SURGE-enhanced architectures can accelerate memory-intensive workloads on bandwidth-constrained servers by up to 1.3x.”
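A toy policy, not SURGE's actual mechanism, conveys the harvesting idea: periodically measure I/O lane utilization and lend the idle lanes to memory traffic, keeping a reserve so I/O bursts are not starved.

```python
# Toy policy sketch (not SURGE's actual mechanism): lend idle I/O lanes to
# memory traffic, keeping a reserve so bursts of I/O are not starved.
def lanes_to_lend(io_util, io_lanes_total, reserve_frac=0.25):
    """io_util: observed fraction of I/O lane bandwidth in use (0.0-1.0)."""
    busy = int(round(io_util * io_lanes_total))
    reserve = int(round(reserve_frac * io_lanes_total))
    return max(io_lanes_total - busy - reserve, 0)

if __name__ == "__main__":
    for util in (0.05, 0.3, 0.6, 0.9):
        print(f"I/O util {util:.0%}: lend {lanes_to_lend(util, 16)} of 16 lanes to memory")
```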