As the compute demands of high-performance computing (HPC) and artificial intelligence (AI) grow exponentially, traditional system architectures are reaching their limits. Memory bottlenecks, latency spikes, and inefficient resource utilization now stand in the way of scalable performance. Enter Compute Express Link (CXL), a transformative open industry standard that redefines how memory, processors, and accelerators interconnect.
CXL offers low-latency, cache-coherent, and high-bandwidth communication between CPUs, GPUs, memory modules, and accelerators. For developers building performance-critical applications in domains like AI inference, scientific computing, real-time analytics, or large-scale data processing, CXL provides a highly efficient and flexible alternative to traditional memory and I/O subsystems.
This blog explores why CXL is not just another interconnect but a crucial innovation for the future of computing. We’ll cover its architectural value, real-world benefits for developers, support across hardware and software ecosystems, and how CXL changes the game in both HPC and AI infrastructure.
Modern AI models, especially large language models (LLMs), diffusion transformers, and computer vision networks, are memory-hungry beasts. In HPC, complex simulations involving weather modeling, quantum mechanics, or fluid dynamics require frequent data movement between processing units and memory, which introduces bottlenecks.
Legacy interconnects and memory attachments like PCIe, DDR, and NUMA-based architectures suffer from fundamental limitations: PCIe transfers are not cache-coherent and force explicit data copies, DRAM capacity and bandwidth are tied to the CPU's own memory channels, and NUMA topologies add latency whenever data crosses sockets.
CXL was designed to solve these issues by creating a unified memory fabric with cache coherency, load/store semantics, and flexible memory tiering, making these legacy boundaries obsolete.
At the heart of CXL’s utility lie its three protocols: CXL.io, CXL.cache, and CXL.mem. While CXL.io maintains compatibility with PCIe, the real innovation lies in the cache and memory protocols.
CXL.cache lets an accelerator or other device cache host memory with full coherence, so the device and the CPU always see the same data. That means you don’t need to write code to explicitly manage data movement or worry about keeping caches synchronized; CXL handles that transparently in hardware.
This is transformative for AI developers using GPUs, TPUs, or custom accelerators. You no longer need to allocate separate memory buffers, copy inputs and outputs back and forth, or manage stale data issues. Everything is coherently shared.
CXL.mem allows the host processor to access memory on a connected device, like an expansion module or memory pool. Think of it as plug-and-play memory expansion over a PCIe-style link, but with load/store access at latencies far closer to DRAM than to traditional I/O, and with full coherency support.
For HPC workloads that require massive memory bandwidth and capacity, like genome sequencing or seismic data processing, this enables heterogeneous memory architectures with seamless memory extension and sharing between multiple processing elements.
By combining both protocols, CXL enables bidirectional, coherent, low-latency access between CPUs and devices like GPUs, memory expanders, and even FPGAs or NPUs.
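To make this concrete, here is a minimal sketch of how host software might touch CXL-attached memory when the expander is exposed as a devdax character device. The path /dev/dax0.0 is a hypothetical example; on many systems the memory is instead onlined as a regular NUMA node, and you would check `cxl list` or `daxctl list` to see how your platform surfaces it.

```c
/* Minimal sketch: map CXL-attached memory exposed as a devdax device
 * and access it with ordinary loads and stores.
 * Assumption: the CXL memory region appears as /dev/dax0.0 (hypothetical
 * path); verify with `cxl list` / `daxctl list` on your system. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1UL << 30;            /* map 1 GiB of the device */
    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* devdax mappings must be MAP_SHARED and respect the device's
     * mapping granularity (commonly 2 MiB). */
    uint8_t *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Plain load/store access: no DMA setup, no explicit staging copies. */
    memset(mem, 0xA5, 4096);
    printf("first byte: 0x%02x\n", mem[0]);

    munmap(mem, len);
    close(fd);
    return 0;
}
```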
Traditionally, increasing memory capacity means upgrading your entire server, including CPUs and motherboards. With CXL, this model is flipped.
CXL Type-3 devices (aka memory expansion modules) allow you to add DRAM or even byte-addressable non-volatile memory (NVM) to a system as an external resource. This is transparent to the software and accessed just like system memory. Vendors like Micron and SK hynix are already delivering DDR5-based CXL memory modules with high throughput and low latency.
Imagine scaling from 512GB to 2TB of memory in a node without touching the CPU or motherboard. With CXL, this is not only possible, it’s efficient.
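When a Type-3 expander is onlined as system RAM, it typically appears as a CPU-less NUMA node, and you can place data on it explicitly with libnuma. A minimal sketch, assuming the CXL memory shows up as node 2 (the node number is hypothetical; check `numactl -H` on your machine):

```c
/* Sketch: place a buffer on a CXL-backed NUMA node using libnuma.
 * Assumption: the expander has been onlined as system RAM and appears
 * as CPU-less NUMA node 2 (hypothetical; verify with `numactl -H`).
 * Build with: gcc cxl_numa.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 2   /* hypothetical node id of the CXL expander */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    size_t len = 256UL << 20;                      /* 256 MiB */
    char *buf = numa_alloc_onnode(len, CXL_NODE);  /* bind pages to the CXL node */
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    memset(buf, 0, len);   /* touch the pages so they actually fault in */
    printf("allocated %zu MiB on node %d\n", len >> 20, CXL_NODE);

    numa_free(buf, len);
    return 0;
}
```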
CXL 2.0 introduced switch-based memory pooling, and CXL 3.0 extends it with fabric capabilities: memory resources from multiple devices can be pooled, shared, and accessed by multiple hosts. This is ideal for AI training clusters or HPC environments where memory demands vary per workload.
A CXL switch can sit between hosts and memory devices, providing dynamic memory assignment, disaggregation, and sharing, much like how cloud-native apps use shared storage. This leads to higher memory utilization, better resource efficiency, and reduced idle memory, lowering the total cost of ownership (TCO).
CXL isn’t just theoretical; it’s already delivering real gains in performance and efficiency:
Micron’s CXL-based memory expander, the CZ120, has demonstrated gains in effective memory bandwidth and capacity that translate directly into faster model training, quicker simulation runtimes, and less developer time spent tuning memory access patterns.
SK hynix’s DDR5-based CXL Memory Module (CMM) boosts memory bandwidth by 82% and doubles the addressable capacity, improving performance for memory-bound workloads.
These gains make it easier to scale large models without running into memory bottlenecks, enabling developers to build deeper, more accurate neural networks and simulations.
Latency is a critical factor in applications like autonomous driving, financial trading, and real-time machine learning inference. Traditional RDMA-based interconnects offer high bandwidth but still operate in the microsecond latency range.
CXL leverages the latest PCIe 5.0 and 6.0 PHY layers: at 32 GT/s per lane, a 16-lane PCIe 5.0 link delivers roughly 64 GB/s in each direction, and PCIe 6.0 doubles that to about 128 GB/s, while access latency sits in the hundreds of nanoseconds. That is roughly an order of magnitude lower latency than typical RDMA round trips and in the same ballpark as a remote NUMA memory access.
For developers, this means you can offload time-critical logic to accelerators or CXL-attached memory without worrying about excessive delays or jitter, paving the way for more responsive, smarter edge and data center systems.
One of CXL’s most powerful benefits is hardware-managed memory coherence. This simplification propagates up the stack, reducing development time and improving maintainability.
With CXL, developers no longer need to write data-copy logic between devices. Load/store semantics make the same memory accessible to multiple devices simultaneously, without buffering or synchronization headaches.
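The programming-model difference is easiest to see side by side. The sketch below is purely illustrative: process_on_accelerator is a hypothetical stand-in for work done by a coherent CXL device (real devices go through vendor runtimes), and the point is only that the caller hands over a pointer into shared, coherent memory instead of staging copies.

```c
/* Illustrative only: contrasts a copy-in/copy-out style with the
 * pointer-passing style that coherent shared memory allows.
 * process_on_accelerator() is a hypothetical stand-in for work performed
 * by a CXL.cache-capable device. */
#include <stdlib.h>
#include <string.h>

/* Hypothetical device-side routine operating directly on host pointers. */
static void process_on_accelerator(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}

/* Legacy pattern: allocate a device buffer, copy in, compute, copy out. */
static void run_with_staging(float *host, size_t n)
{
    float *staging = malloc(n * sizeof *staging);  /* stands in for device memory */
    memcpy(staging, host, n * sizeof *staging);    /* copy in  */
    process_on_accelerator(staging, n);
    memcpy(host, staging, n * sizeof *staging);    /* copy out */
    free(staging);
}

/* Coherent pattern: the device works on the same memory the host sees. */
static void run_shared(float *host, size_t n)
{
    process_on_accelerator(host, n);   /* no copies, no stale-data management */
}

int main(void)
{
    enum { N = 1024 };
    float *a = malloc(N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = (float)i;
    run_with_staging(a, N);
    run_shared(a, N);
    free(a);
    return 0;
}
```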
Shared memory pools accessed via CXL switches improve overall memory utilization across servers, reducing stranded memory. Less overprovisioning means lower capital spend on DRAM, lower power and cooling draw, and fewer servers deployed purely for their memory capacity.
For data center architects and system integrators, CXL brings a compelling value proposition for green AI and eco-efficient HPC.
CXL isn’t just for hardware vendors; it dramatically changes how software developers and AI/ML engineers build and optimize applications.
Instead of designing around device-local memory and communication APIs (like RDMA or MPI), developers can now use standard pointers and memory allocators to access shared resources.
Linux 6.5+ includes mature CXL support in the kernel, with user-space APIs and memory-aware NUMA balancing. Tools like the cxl utility (from the ndctl project), daxctl, and memkind let you manage devices and test memory allocations programmatically.
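For example, memkind exposes memory that the kernel has onlined through the DAX kmem driver (a common way a CXL Type-3 expander is surfaced as system RAM) via the MEMKIND_DAX_KMEM kind. A minimal sketch, assuming the expander is onlined that way:

```c
/* Sketch: allocate from memory onlined through the DAX kmem driver
 * (a common way a CXL Type-3 expander is exposed as system RAM).
 * Assumption: the kernel has onlined the expander via kmem; otherwise
 * MEMKIND_DAX_KMEM reports as unavailable.
 * Build with: gcc cxl_memkind.c -lmemkind */
#include <memkind.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (memkind_check_available(MEMKIND_DAX_KMEM) != 0) {
        fprintf(stderr, "no DAX kmem (CXL/PMEM) memory available\n");
        return 1;
    }

    size_t len = 64UL << 20;                            /* 64 MiB */
    char *buf = memkind_malloc(MEMKIND_DAX_KMEM, len);  /* lands on the expander */
    if (!buf) { fprintf(stderr, "memkind_malloc failed\n"); return 1; }

    memset(buf, 0, len);   /* ordinary loads/stores from here on, no special API */
    printf("allocated %zu MiB of CXL-backed memory\n", len >> 20);

    memkind_free(MEMKIND_DAX_KMEM, buf);
    return 0;
}
```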
Because CXL devices are memory-mapped into the CPU’s address space, traditional tools like gdb, valgrind, and perf work without modification; no special drivers or communication libraries are required.
Start with CPUs like Intel Xeon Scalable (Granite Rapids), AMD EPYC Genoa/Bergamo, or GPUs/accelerators that advertise CXL compatibility.
Install Linux 6.5+ and enable CONFIG_CXL_MEM, CONFIG_DEV_DAX, and CONFIG_LIBNVDIMM. Use cxl list to enumerate attached memory devices.
Plug in a CXL memory expander (Micron CZ120, SK hynix DDR5-CMM) and test allocation via memkind or custom memory allocators.
Build applications that share pointer-based data between CPU and accelerators using CXL.cache + CXL.mem modes. Test coherence with multi-threaded, shared-state models.
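A simple way to start is to prototype the shared-state pattern with plain threads before pointing it at a device: one thread writes into a shared buffer and publishes a flag, another consumes the data through the same pointer. The sketch below uses pthreads and C11 atomics and is only a host-side stand-in for a coherent CPU/accelerator exchange.

```c
/* Host-side stand-in for a coherent producer/consumer exchange:
 * a "producer" thread writes data and publishes a flag, a "consumer"
 * reads it through the same pointer. With CXL.cache, the consumer role
 * could be played by an accelerator, with coherence handled in hardware.
 * Build with: gcc coherence_demo.c -lpthread */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1024

static int shared_data[N];      /* in a real setup: a CXL-backed allocation */
static atomic_int ready = 0;    /* release/acquire flag publishing the data */

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++)
        shared_data[i] = i * i;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* spin until the producer publishes */
    long long sum = 0;
    for (int i = 0; i < N; i++)
        sum += shared_data[i];
    printf("consumer saw sum = %lld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```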
Explore CXL 3.0 fabrics via switches and pooled memory. Vendors like Astera Labs and Montage Technology are pioneering CXL switch deployments.
With support for up to 4096 devices and multi-host topologies, CXL is enabling composable data centers where compute and memory scale independently.
Hybrid modules with NAND-backed DRAM offer persistent, byte-addressable memory, perfect for large-model storage or failure-resilient simulations.
Vendor-provided SDKs (like SK hynix’s HMSDK or Micron’s famfs, a fabric-attached memory file system) provide abstractions, testing frameworks, and memory management toolkits for CXL-native development.
CXL is more than just a new bus protocol; it’s a rethinking of how modern systems interconnect processing and memory. For developers, this means simpler code, more powerful hardware abstraction, and a direct path to scaling AI and HPC workloads without architectural compromises.
Whether you're a kernel developer, ML engineer, cloud architect, or HPC researcher, now is the time to start exploring and adopting Compute Express Link. It brings scalable memory expansion, efficient pooling, hardware-enforced coherence, and developer-centric tooling, all wrapped into a future-proof ecosystem supported by the biggest players in the industry.