Why CXL Matters for High-Performance Computing and AI

Written by: Founder & CTO
June 25, 2025

As the compute demands of high-performance computing (HPC) and artificial intelligence (AI) grow exponentially, traditional system architectures are reaching their limits. Memory bottlenecks, latency spikes, and inefficient resource utilization are now standing in the way of scalable performance. Enter Compute Express Link (CXL), a transformative open industry standard that redefines how memory, processors, and accelerators interconnect.

CXL offers low-latency, cache-coherent, and high-bandwidth communication between CPUs, GPUs, memory modules, and accelerators. For developers building performance-critical applications in domains like AI inference, scientific computing, real-time analytics, or large-scale data processing, CXL provides a highly efficient and flexible alternative to traditional memory and I/O subsystems.

This blog explores why CXL is not just another interconnect but a crucial innovation for the future of computing. We’ll cover its architectural value, real-world benefits for developers, support across hardware and software ecosystems, and how CXL changes the game in both HPC and AI infrastructure.

Breaking Down Traditional Bottlenecks in HPC and AI

Modern AI models, especially large language models (LLMs), diffusion transformers, and computer vision networks, are memory-hungry beasts. In HPC, complex simulations involving weather modeling, quantum mechanics, or fluid dynamics require frequent data movement between processing units and memory, which introduces bottlenecks.

Legacy interconnects and memory architectures, such as PCIe, DDR-attached DRAM, and NUMA-based designs, suffer from fundamental limitations:

  • Non-coherent memory: CPUs and accelerators maintain separate memory spaces, necessitating redundant data copies and synchronization.

  • Fixed memory allocation: Traditional architectures lack the ability to dynamically allocate memory where it’s needed.

  • Limited scalability: Scaling up memory often means scaling up compute unintentionally, increasing cost and power consumption.

CXL was designed to solve exactly these issues: it creates a unified memory fabric with cache coherency, load/store semantics, and flexible memory tiering that makes these legacy boundaries obsolete.

Unlocking GPU & CPU Collaboration with CXL.cache and CXL.mem

At the heart of CXL’s utility lie its three protocols: CXL.io, CXL.cache, and CXL.mem. While CXL.io maintains compatibility with PCIe, the real innovation lies in the cache and memory protocols.

What Is CXL.cache?

CXL.cache enables an accelerator to cache host memory and access it as if it were its own, with full cache-coherent semantics maintained in hardware. That means you don’t need to write code to explicitly manage data movement or worry about keeping caches synchronized; CXL handles that transparently.

This is transformative for AI developers using GPUs, TPUs, or custom accelerators. You no longer need to allocate separate memory buffers, copy inputs and outputs back and forth, or manage stale data issues. Everything is coherently shared.
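To make the contrast concrete, here is a minimal sketch of the two programming models. Since device APIs vary, a second CPU thread stands in for the accelerator; the point is that with CXL.cache both agents simply dereference the same pointer instead of staging copies.

```c
/* Sketch: legacy staging-copy model vs. the coherent shared-pointer model
 * that CXL.cache enables. A second CPU thread stands in for the accelerator;
 * on real hardware the device would dereference the same shared memory. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 1024

/* Legacy model: the "device" has its own buffer, so inputs must be copied
 * in and results copied back around every offload. */
static void legacy_offload(const float *host_in, float *host_out) {
    float *dev_in  = malloc(N * sizeof(float));   /* device-local buffer */
    float *dev_out = malloc(N * sizeof(float));
    memcpy(dev_in, host_in, N * sizeof(float));   /* copy in  */
    for (int i = 0; i < N; i++) dev_out[i] = dev_in[i] * 2.0f;  /* "kernel" */
    memcpy(host_out, dev_out, N * sizeof(float)); /* copy out */
    free(dev_in);
    free(dev_out);
}

/* CXL.cache model: both agents dereference the same pointer; the hardware
 * keeps their caches coherent, so there is nothing to copy or invalidate. */
static void *worker(void *arg) {
    float *shared = arg;
    for (int i = 0; i < N; i++) shared[i] *= 2.0f; /* operate in place */
    return NULL;
}

int main(void) {
    float in[N], out[N], shared[N];
    for (int i = 0; i < N; i++) in[i] = shared[i] = (float)i;

    legacy_offload(in, out);                      /* two copies + compute */

    pthread_t t;                                  /* zero copies + compute */
    pthread_create(&t, NULL, worker, shared);
    pthread_join(&t, NULL);

    printf("legacy: %f, shared: %f\n", out[N - 1], shared[N - 1]);
    return 0;
}
```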

What Is CXL.mem?

CXL.mem allows the host processor to access memory on a connected device, like an expansion module or memory pool. Think of it as plug-and-play memory expansion over PCIe, but at near-DDR latency and with full coherency support.

For HPC workloads that require massive memory bandwidth and capacity, like genome sequencing or seismic data processing, this enables heterogeneous memory architectures with seamless memory extension and sharing between multiple processing elements.

By combining both protocols, CXL enables bidirectional, coherent, low-latency access between CPUs and devices like GPUs, memory expanders, and even FPGAs or NPUs.

Memory Expansion & Pooling: Scaling Memory Without Scaling CPUs

Traditionally, increasing memory capacity means upgrading your entire server, including CPUs and motherboards. With CXL, this model is flipped.

CXL Memory Expansion Modules

CXL Type-3 devices (aka memory expansion modules) allow you to add DRAM or even byte-addressable non-volatile memory (NVM) to a system as an external resource. The added capacity is transparent to software and is accessed just like system memory. Vendors like Micron and SK hynix are already delivering DDR5-based CXL memory modules with high throughput and low latency.

Imagine scaling from 512GB to 2TB of memory in a node without touching the CPU or motherboard. With CXL, this is not only possible, it’s efficient.
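On Linux, an onlined Type-3 expander typically shows up as a CPU-less NUMA node, so placing data on it can be as simple as a node-targeted allocation. The sketch below uses libnuma; the node id is an assumption and will differ per system.

```c
/* Sketch: placing an allocation on a CXL memory expander. Once the kernel
 * onlines a Type-3 device, it typically appears as a CPU-less NUMA node;
 * the node id (here 2) is system-specific and assumed for illustration.
 * Build with: gcc cxl_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    int cxl_node = 2;                    /* assumed node id of the expander */
    size_t sz = 1UL << 30;               /* 1 GiB */

    /* Allocate directly on the expander-backed node; to the program this is
     * ordinary memory reached with ordinary loads and stores. */
    void *buf = numa_alloc_onnode(sz, cxl_node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, sz);                  /* touch pages so they are placed */
    printf("1 GiB resident on NUMA node %d\n", cxl_node);

    numa_free(buf, sz);
    return 0;
}
```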

Memory Pooling via CXL Fabric

Memory pooling arrived with CXL 2.0 switching, and CXL 3.0 extends it with full fabric capabilities, where memory resources from multiple devices can be pooled, shared, and accessed by multiple hosts. This is ideal for AI training clusters or HPC environments where memory demands vary per workload.

A CXL switch can sit between hosts and memory devices, providing dynamic memory assignment, disaggregation, and sharing, much like how cloud-native apps use shared storage. This leads to higher memory utilization, better resource efficiency, and reduced idle memory, lowering the total cost of ownership (TCO).

Dramatic Performance Gains in Real-World Use Cases

CXL isn’t just theoretical; it’s delivering real gains in performance and efficiency:

Micron’s CZ120: Real-World Benchmarks

Micron’s CXL-based memory expander, the CZ120, shows:

  • +24% improvement in sequential read bandwidth

  • +38% uplift in mixed workload performance

  • Up to 24% speedup in memory-bound HPC and AI workloads

This translates directly into faster model training, quicker simulation runtimes, and less developer time tuning memory access patterns.

SK hynix DDR5-CMM: Accelerated Throughput

SK hynix’s DDR5-based CXL Memory Module (CMM) boosts memory bandwidth by 82%, doubles the addressable capacity, and improves:

  • AI inference token throughput by 31%

  • HPC computation throughput by 33%

These gains make it easier to scale large models without running into memory bottlenecks, enabling developers to build deeper, more accurate neural networks and simulations.

Low-Latency, Real-Time Advantage for Time-Critical Applications

Latency is a critical factor in applications like autonomous driving, financial trading, and real-time machine learning inference. Traditional RDMA-based interconnects offer high bandwidth but still operate in the microsecond latency range.

Built on PCIe 5.0 and 6.0

CXL leverages the latest PCIe 5.0 and 6.0 PHY layers, delivering bandwidths of 64–128 GB/s per 16-lane connection and latency in the hundreds of nanoseconds range. That is roughly an order of magnitude lower latency than RDMA and, in many cases, comparable to a cross-socket DRAM access.

For developers, this means you can offload time-critical logic to accelerators or CXL-attached memory without worrying about excessive delays or jitter, paving the way for more responsive, smarter edge and data center systems.
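If you want to see where your own hardware lands, a pointer-chasing microbenchmark is a quick sanity check: run it once bound to local DRAM and once bound to the CXL node (for example with numactl --membind) and compare the averages. This is a minimal sketch; the working-set size, iteration count, and node ids are arbitrary assumptions.

```c
/* Sketch: pointer-chasing microbenchmark to estimate average load latency.
 * Run it bound to different NUMA nodes, e.g.
 *   numactl --membind=0 ./chase   (local DRAM)
 *   numactl --membind=2 ./chase   (assumed CXL node)
 * and compare the reported averages. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ENTRIES (64UL * 1024 * 1024)   /* 512 MiB of 8-byte indices */
#define HOPS    (50UL * 1000 * 1000)

int main(void) {
    size_t *next = malloc(ENTRIES * sizeof(size_t));
    if (!next) { perror("malloc"); return 1; }

    /* Sattolo's algorithm builds a single random cycle over all entries,
     * so the hardware prefetcher cannot predict the next load address. */
    for (size_t i = 0; i < ENTRIES; i++) next[i] = i;
    srand(42);
    for (size_t i = ENTRIES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    size_t p = 0;
    for (size_t i = 0; i < HOPS; i++) p = next[p];  /* dependent loads */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

    /* p is printed so the chase cannot be optimized away. */
    printf("avg load latency: %.1f ns (final index %zu)\n", ns / HOPS, p);
    free(next);
    return 0;
}
```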

Simplified Software Stack & Total Cost of Ownership Reduction

One of CXL’s most powerful benefits is hardware-managed memory coherence. This simplification propagates up the stack, reducing development time and improving maintainability.

No More Manual Data Movement

With CXL, developers no longer need to write data-copy logic between devices. Load/store semantics make the same memory accessible to multiple devices simultaneously, without buffering or synchronization headaches.

Higher Utilization, Lower Power

Shared memory pools accessed via CXL switches improve overall memory utilization across servers, reducing stranded memory. Less overprovisioning means:

  • Lower upfront hardware costs (CAPEX)

  • Lower ongoing energy and cooling costs (OPEX)

  • Lower carbon footprint for sustainability goals

For data center architects and system integrators, CXL brings a compelling value proposition for green AI and eco-efficient HPC.

Developer Experience: What Changes with CXL?

CXL isn’t just for hardware vendors; it dramatically changes how software developers and AI/ML engineers build and optimize applications.

Simplified Programming Model

Instead of designing around device-local memory and communication APIs (like RDMA or MPI), developers can now use standard pointers and memory allocators to access shared resources.

Support in Linux and Open Source

Linux 6.5+ includes mature, native CXL support in the kernel, with user-space APIs and memory-aware NUMA balancing. Tools like cxl-cli (part of the ndctl project), daxctl, and memkind let you manage and test memory allocations programmatically.

Better Portability and Debugging

Because CXL devices are memory-mapped into the CPU’s address space, traditional tools like gdb, valgrind, and perf work without modification; no special drivers or communication libraries are required.

CXL vs Traditional Alternatives
NUMA/PCIe Architectures
  • Siloed memory per socket

  • High inter-socket latency

  • Manual memory affinity tuning required

RDMA-based Clusters
  • High throughput, but high latency (~2–5μs)

  • Requires special programming models (verbs, libfabric)

  • Poor support for cache coherence

CXL Interconnect
  • Shared, cache-coherent memory with sub-microsecond latency

  • Load/store programming model

  • Dynamically composable infrastructure

  • Developer-friendly, OS-integrated

Getting Started with CXL in Your Stack
1. Choose Hardware That Supports CXL

Start with CXL-capable CPUs such as Intel Xeon Scalable (Sapphire Rapids and later, including Granite Rapids) or AMD EPYC Genoa/Bergamo, or GPUs/accelerators that advertise CXL compatibility.

2. Upgrade Your Linux Kernel

Install Linux 6.5+ and enable CONFIG_CXL_MEM, CONFIG_DEV_DAX, and CONFIG_LIBNVDIMM. Use cxl list to enumerate attached memory devices.
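If you prefer to enumerate devices programmatically rather than shelling out to cxl list, the kernel's CXL core also registers devices under /sys/bus/cxl/devices. A minimal sketch, assuming the kernel options above are enabled:

```c
/* Sketch: enumerate CXL devices by walking the sysfs bus directory that the
 * kernel's CXL core registers. Requires a kernel built with the CXL options
 * above; prints nothing useful if no devices are attached. */
#include <dirent.h>
#include <stdio.h>

int main(void) {
    const char *path = "/sys/bus/cxl/devices";
    DIR *d = opendir(path);
    if (!d) {
        perror(path);       /* kernel lacks CXL support or no devices */
        return 1;
    }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;       /* skip "." and ".." */
        printf("%s/%s\n", path, e->d_name);   /* e.g. mem0, decoder0.0 */
    }
    closedir(d);
    return 0;
}
```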

3. Start with Memory Expansion

Plug in a CXL memory expander (Micron CZ120, SK hynix DDR5-CMM) and test allocation via memkind or custom memory allocators.
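A minimal allocation test might look like the sketch below. It assumes the expander has been onlined through the kernel's dax_kmem path, which memkind exposes as the MEMKIND_DAX_KMEM kind.

```c
/* Sketch: allocating from expander-backed memory with memkind. When CXL
 * memory is onlined via the kernel's dax_kmem driver, memkind exposes it
 * through the MEMKIND_DAX_KMEM kind. Build with: gcc memkind_test.c -lmemkind */
#include <memkind.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (memkind_check_available(MEMKIND_DAX_KMEM) != 0) {
        fprintf(stderr, "no DAX/KMEM (expander) memory online\n");
        return 1;
    }

    size_t sz = 256UL << 20;   /* 256 MiB */
    char *buf = memkind_malloc(MEMKIND_DAX_KMEM, sz);
    if (!buf) {
        fprintf(stderr, "memkind_malloc failed\n");
        return 1;
    }

    memset(buf, 0xA5, sz);     /* ordinary loads/stores, no special API */
    printf("allocated and touched %zu MiB of expander memory\n", sz >> 20);

    memkind_free(MEMKIND_DAX_KMEM, buf);
    return 0;
}
```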

4. Use Shared Pointers in Hybrid Apps

Build applications that share pointer-based data between CPU and accelerators using CXL.cache + CXL.mem modes. Test coherence with multi-threaded, shared-state models.
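A simple coherence smoke test is a producer/consumer handoff over shared state: one agent publishes a record and raises a flag, the other waits on the flag and reads the record. The sketch below uses two CPU threads and C11 atomics as a stand-in; on CXL.cache-capable hardware the consumer role would fall to the accelerator.

```c
/* Sketch: a shared-state coherence test. One thread publishes a record and
 * raises a flag with release semantics; the other spins on the flag with
 * acquire semantics and reads the record. Build with: gcc -pthread test.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct shared_state {
    int payload[4];
    atomic_int ready;
};

static struct shared_state st = { .ready = 0 };

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 4; i++)
        st.payload[i] = (i + 1) * 10;                 /* fill the record */
    atomic_store_explicit(&st.ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&st.ready, memory_order_acquire) == 0)
        ;                                             /* spin until published */
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += st.payload[i];
    printf("consumer saw sum = %d (expect 100)\n", sum);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```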

5. Experiment with Fabrics

Explore CXL 3.0 fabrics via switches and pooled memory. Vendors like Astera Labs and Montage Technology are pioneering CXL switch deployments.

Looking Ahead: What’s Next for Developers
CXL 3.0 Fabric Ecosystems

With support for up to 4096 devices and multi-host topologies, CXL is enabling composable data centers where compute and memory scale independently.

Persistent Memory Integration

Hybrid modules with NAND-backed DRAM offer persistent, byte-addressable memory, perfect for large-model storage or failure-resilient simulations.

Dev Tools & SDKs

Vendor-provided SDKs (like SK hynix HSMDK, Micron’s FAM-FS) provide abstractions, testing frameworks, and memory management toolkits for CXL-native development.

Final Thoughts: Why CXL Is a Game-Changer

CXL is more than just a new bus protocol; it’s a rethinking of how modern systems interconnect processing and memory. For developers, this means simpler code, more powerful hardware abstraction, and a direct path to scaling AI and HPC workloads without architectural compromises.

Whether you're a kernel developer, ML engineer, cloud architect, or HPC researcher, now is the time to start exploring and adopting Compute Express Link. It brings scalable memory expansion, efficient pooling, hardware-enforced coherence, and developer-centric tooling, all wrapped into a future-proof ecosystem supported by the biggest players in the industry.