
R&D: Four Articles on CXL Technologies and Applications

Hardware-software co-development for emerging CXL architectures; ZipCXL: CXL-based main memory compression at low performance penalty; hierarchical framework for multi-node CXL memory transactions; exploring multi-level cache prefetching for fabric attached memory

R&D: Hardware-Software Co-Development for Emerging CXL Architectures
Authors propose a hardware-software co-development framework for future systems

ACM Digital Library has published, in MemSys ’25: Proceedings of the International Symposium on Memory Systems, an article written by Roberto Gioiosa, Bo Fang, Lenny Guo, and Andres Marquez, Pacific Northwest National Laboratory, Richland, WA, USA.

Abstract: “Modern scientific and graph analytics workloads demand substantial memory resources to accommodate their extensive datasets. While distributed systems connected via high-performance networks have traditionally been employed to address such challenges, CXL technology is emerging as a compelling alternative. CXL systems offer a shared memory abstraction over physically disaggregated memory with load/store programming semantics, simplifying the development of applications that require large memory pools.

However, as CXL hardware is still under development, its internal mechanisms and the performance implications for critical applications remain largely unexplored. To address this gap, we propose a hardware-software co-development framework for future CXL systems. Our approach combines a CXL-enabled full-system emulator with a memory allocator (MemForge) backed by CXL memory devices and a high-level set of APIs.

We demonstrate that our methodology supports the development of essential kernels across scientific and graph analytics domains. Experimental results obtained from two hardware configurations — direct-attached CXL memory and memory accessed via a CXL switch — indicate minimal runtime overhead. Additionally, we highlight the internal introspective capabilities of our memory allocator, which facilitate profiling and debugging.”
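The article includes no code, but the load/store programming model the abstract describes can be sketched in a few lines of C. The snippet below assumes the CXL memory device is exposed to Linux as a memory-only NUMA node (node 1 is an arbitrary assumption) and uses standard libnuma calls; it is not the MemForge allocator or the high-level APIs from the paper.

```c
/* Minimal sketch: load/store access to CXL-attached memory exposed as a
 * memory-only NUMA node. NOT the MemForge API from the paper; the node id
 * and region size are assumptions. Build with: gcc cxl_alloc.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdint.h>

#define CXL_NODE 1                 /* assumed NUMA node backed by the CXL device */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    size_t bytes = 64UL << 20;     /* 64 MiB region placed on the CXL node */
    uint64_t *buf = numa_alloc_onnode(bytes, CXL_NODE);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", CXL_NODE);
        return 1;
    }

    /* Ordinary load/store semantics: no RDMA verbs or message passing. */
    size_t n = bytes / sizeof(uint64_t);
    for (size_t i = 0; i < n; i++)
        buf[i] = i;                /* stores travel over the CXL.mem path */

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];             /* loads likewise */

    printf("checksum: %llu\n", (unsigned long long)sum);
    numa_free(buf, bytes);
    return 0;
}
```

On a real system the right node id can be found by inspecting the NUMA topology, since CXL memory expanders typically appear as nodes without local CPUs.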

 

R&D: ZipCXL, CXL-based Main Memory Compression at Low Performance Penalty
Paper tackles a challenge by introducing three simple yet effective design techniques to enhance the design of compression-capable CXL memory controllers

ACM Digital Library has published, in MemSys ’25: Proceedings of the International Symposium on Memory Systems, an article written by Asad Ul Haq, Rui Xie, Linsen Ma, Yunhua Fang, Liu Liu, and Tong Zhang, ECSE, Rensselaer Polytechnic Institute, Troy, NY, USA.

Abstract: “The escalating cost of DRAM and the typically high compressibility of memory content make main memory compression highly desirable. However, its practical deployment has been hindered by significant challenges, including its adverse impact on performance and, more critically, the substantial integration challenges it poses to computing infrastructure. The emerging Compute Express Link (CXL) ecosystem provides a unique opportunity to implement main memory compression with minimal integration overhead, shifting the primary adoption barrier towards performance impact. This paper tackles this challenge by introducing three simple yet effective design techniques to enhance the design of compression-capable CXL memory controllers. The first two techniques improve the trade-off between compression ratio and speed performance by dynamically adjusting compression configurations in adaptation to runtime data characteristics. The third technique mitigates compression-induced speed performance degradation by decoupling the in-memory placement of compressed data blocks from their associated error correction code (ECC) redundancy. To evaluate these techniques, we performed RTL-level design and synthesis to estimate silicon cost overhead and developed a simulation platform to capture the trade-offs between compression ratio and speed performance. The results demonstrate that the proposed techniques effectively improve compression ratio vs. performance trade-offs with negligible silicon cost overhead.”
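To make the first two techniques more concrete, here is a minimal C sketch of the general idea of adapting a compression configuration to runtime data characteristics. The zero-byte sampling heuristic, the 4 KiB granularity, the threshold, and the two configurations are illustrative assumptions, not the controller logic evaluated in the paper.

```c
/* Conceptual sketch: pick a compression configuration per block based on a
 * cheap, sampled estimate of how compressible the block is.
 */
#include <stddef.h>
#include <stdint.h>

enum cfg { CFG_FAST, CFG_STRONG };   /* low-latency mode vs. high-ratio mode */

/* Cheap compressibility proxy: fraction of zero bytes in a sampled slice. */
static double zero_fraction(const uint8_t *block, size_t len, size_t stride)
{
    size_t zeros = 0, samples = 0;
    for (size_t i = 0; i < len; i += stride, samples++)
        zeros += (block[i] == 0);
    return samples ? (double)zeros / (double)samples : 0.0;
}

/* Choose a configuration for a 4 KiB page (threshold is an assumption). */
enum cfg choose_cfg(const uint8_t *page)
{
    double z = zero_fraction(page, 4096, 16);
    return (z > 0.25) ? CFG_STRONG   /* highly compressible: spend more effort */
                      : CFG_FAST;    /* mostly incompressible: minimize latency */
}
```

The third technique, decoupling the placement of compressed blocks from their ECC redundancy, is orthogonal to this sketch and concerns how the controller lays data out in DRAM.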

 

R&D: Hierarchical Framework for Multi-node Compute eXpress Link Memory Transactions
Authors describe a novel solution for supporting ACID (Atomicity, Consistency, Isolation, Durability) transactions in a CXL-based disaggregated shared-memory architecture

ACM Digital Library has published, in MemSys ’25: Proceedings of the International Symposium on Memory Systems, an article written by Ellis Robinson Giles, Elex Technologies, Houston, TX, USA, and Peter Varman, Rice University, Houston, TX, USA.

Abstract: “There is an increasing need to support high-volume, concurrent transaction processing on shared data in both high-performance and datacenter computing. A recent innovation in server architectures is the adoption of disaggregated memory organizations utilizing the Compute eXpress Link (CXL) interconnect protocol. While CXL memory architectures alleviate many concerns in datacenters, enforcing ACID semantics for transactions in CXL memory faces many challenges.

We describe a novel solution for supporting ACID (Atomicity, Consistency, Isolation, Durability) transactions in a CXL-based disaggregated shared-memory architecture. We call this solution HTCXL for Hierarchical Transactional CXL. HTCXL is implemented in a software library that enforces transaction semantics within a host, along with a back-end controller to detect conflicts across hosts. HTCXL is a modular solution that allows various combinations of HTM or software-based transaction management to be mixed as needed.

We perform an experimental evaluation of HTCXL using micro-architectural processor simulation and several STAMP benchmarks. Our method shows a significant speedup over a software approach on CXL fabric.”
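As a rough illustration of host-side transaction semantics with conflict detection, the toy C sketch below buffers writes, records read versions, and validates them at commit under a single lock. It is a simplification for exposition only, not the HTCXL library: the paper's split between a per-host library and a back-end controller, and its optional use of HTM, are not modeled here.

```c
/* Toy word-granularity transactions over a shared memory region.
 * Versioned words, one global commit lock, and fixed-size read/write sets
 * are simplifying assumptions used only to illustrate conflict detection.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SET 32   /* caller must keep read/write sets below this bound */

typedef struct { uint64_t value; uint64_t version; } word_t;   /* shared word */

typedef struct {
    struct { word_t *w; uint64_t ver; } reads[MAX_SET];
    struct { word_t *w; uint64_t val; } writes[MAX_SET];
    int nr, nw;
} txn_t;

static pthread_mutex_t commit_lock = PTHREAD_MUTEX_INITIALIZER;

void txn_begin(txn_t *t) { t->nr = t->nw = 0; }

uint64_t txn_read(txn_t *t, word_t *w)
{
    t->reads[t->nr].w = w;
    t->reads[t->nr].ver = w->version;   /* remember version for validation */
    t->nr++;
    return w->value;
}

void txn_write(txn_t *t, word_t *w, uint64_t val)
{
    t->writes[t->nw].w = w;             /* buffer the write until commit */
    t->writes[t->nw].val = val;
    t->nw++;
}

bool txn_commit(txn_t *t)
{
    pthread_mutex_lock(&commit_lock);
    for (int i = 0; i < t->nr; i++)     /* conflict check: did any read change? */
        if (t->reads[i].w->version != t->reads[i].ver) {
            pthread_mutex_unlock(&commit_lock);
            return false;               /* abort; caller may retry */
        }
    for (int i = 0; i < t->nw; i++) {   /* apply buffered writes */
        t->writes[i].w->value = t->writes[i].val;
        t->writes[i].w->version++;
    }
    pthread_mutex_unlock(&commit_lock);
    return true;
}
```

A real multi-host design would additionally need durable logging for the D in ACID and coordination with a cross-host conflict detector, the role the abstract assigns to the back-end controller.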

 

R&D: Exploring Multi-level Cache Prefetching for Fabric Attached Memory
Memory disaggregation in data centers has been approaching practicality, owing to the maturity of interconnect standards like Compute Express Link (CXL)

ACM Digital Library has published, in MemSys ’25: Proceedings of the International Symposium on Memory Systems, an article written by Chandrahas Tirumalasetty and Narasimha Reddy, Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA.

Abstract: “Memory disaggregation in data centers has been approaching practicality, owing to the maturity of interconnect standards like Compute Express Link (CXL) [3]. CXL presents a hardware-centric approach for multiple compute nodes to pool memory capacities from a shared Fabric Attached Memory (FAM) node, on a per-need basis. Using FAM for memory provisioning can potentially mitigate resource underutilization and yield cost savings, but can cost the application its performance due to relatively longer access latency.

Modern processors attempt to hide memory access latency by employing sophisticated cache prefetchers. While resourceful, current cache prefetching techniques can be further optimized in light of the long access latency of CXL FAM. To that end, we consider a multi-level cache prefetcher that adds an additional layer of prefetching at the Last Level Cache (LLC). Our multi-level prefetching scheme increases the fraction of requests that hit in the LLC, potentially decreasing the sensitivity of workloads to FAM latency. We implemented our multi-level cache prefetcher using SST simulation components [29], and evaluated it with workloads from standard benchmark suites in single- and multi-node system configurations. Our evaluation reveals that, compared to using only a per-core prefetcher, the multi-level prefetcher resulted in a performance improvement of 2-7%, with the LLC hit fraction increasing by 13%.”
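The C sketch below conveys the flavor of an extra prefetch layer at the LLC: on an LLC miss it confirms a stride against the previous miss and issues a few prefetches ahead. The prefetch degree, the single-stream stride state, and the issue_prefetch hook are assumptions for illustration; they are not the SST-based implementation evaluated in the paper.

```c
/* Conceptual LLC-level stride prefetcher: on each LLC miss, detect a
 * repeating stride and prefetch a few lines ahead toward FAM.
 */
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64
#define DEGREE    4                    /* prefetch distance in cache lines */

/* Stand-in for the simulator hook that would enqueue a prefetch to FAM. */
static void issue_prefetch(uint64_t addr)
{
    printf("prefetch line at 0x%llx\n", (unsigned long long)addr);
}

/* Per-LLC-bank prefetcher state: last miss line and last observed stride. */
typedef struct { uint64_t last_line; int64_t last_stride; } llc_pf_t;

void llc_on_miss(llc_pf_t *pf, uint64_t miss_addr)
{
    uint64_t line = miss_addr / LINE_SIZE;
    int64_t stride = (int64_t)line - (int64_t)pf->last_line;

    /* Issue prefetches only when the stride repeats (simple confirmation). */
    if (stride != 0 && stride == pf->last_stride)
        for (int d = 1; d <= DEGREE; d++)
            issue_prefetch((line + (uint64_t)(d * stride)) * LINE_SIZE);

    pf->last_stride = stride;
    pf->last_line = line;
}
```

In a simulator, llc_on_miss would be called from the LLC miss path, layered on top of the per-core prefetchers rather than replacing them.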
