Xinnor has published new benchmarking research addressing one of the most pressing infrastructure challenges in AI today: GPU memory is critically insufficient for training large language models at scale, and simply adding DRAM is not an economically viable answer.

Training modern Transformer-based models demands 18 bytes or more per parameter once optimizer states, gradients, and mixed-precision master weights are accounted for. As parameter counts climb into the tens of billions, GPU VRAM becomes the hard ceiling. Expanding that ceiling with additional DRAM is increasingly unattractive: DRAM prices have risen sharply, and DRAM’s physical locality to the CPU, not the GPU, means every optimizer update still traverses PCIe, creating a CPU- and memory-bandwidth bottleneck that grows worse as model size increases. The problem, in short, cannot be solved simply by buying more RAM.
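The per-parameter arithmetic can be sketched directly. A few lines of Python using the 18 B/parameter figure above (one common accounting under mixed-precision Adam: 2 B fp16 weights, 2 B fp16 gradients, 4 B fp32 master weights, and 8 B of fp32 optimizer moments, plus overhead) show why VRAM becomes the hard ceiling:

```python
# Rough training-state memory math for mixed-precision Adam training.
# 18 B/param is the figure cited above; a common breakdown is
# 2 B fp16 weights + 2 B fp16 grads + 4 B fp32 master weights
# + 8 B fp32 Adam moments = 16 B, plus ~2 B of overhead.
BYTES_PER_PARAM = 18

def training_memory_gib(num_params: float) -> float:
    """Approximate weight + optimizer state in GiB, excluding activations."""
    return num_params * BYTES_PER_PARAM / 2**30

for billions in (1, 8, 16):
    print(f"{billions}B params -> ~{training_memory_gib(billions * 1e9):.0f} GiB")
```

At 16B parameters the training state alone approaches roughly 268 GiB, well beyond any single GPU's VRAM, before activations are even counted.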
Xinnor’s answer begins with its NVMe-over-RDMA fabric, xiRAID Opus – the company’s high-performance software RAID engine and data protection solution built for disaggregated NVMe environments. xiRAID Opus aggregates local and networked NVMe volumes into high-throughput virtual volumes, delivering what Xinnor believes is the fastest NVMe-oF fabric available. Yet the company is direct about its limits: even the fastest NVMe-over-RDMA fabric cannot match GPU or DRAM memory speeds in raw latency and bandwidth terms. On its own, it is not enough.
That is where ZenFlow changes the equation. ZenFlow, a gradient-priority extension to DeepSpeed, exploits the empirical observation that a small fraction of gradients carries the overwhelming majority of the update signal. It keeps the most impactful gradients (the top 10% by gradient norm) resident on GPU for immediate update, while asynchronously offloading and updating the remaining lower-priority gradients in the background. This approach dramatically reduces the volume of data that must transit the storage fabric on the critical path of each training step.
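The gradient-priority idea can be illustrated with a minimal sketch. ZenFlow's actual selection and asynchronous scheduling live inside DeepSpeed; the function and names below are hypothetical stand-ins:

```python
import math

def partition_gradients(grads: dict[str, list[float]], keep_frac: float = 0.10):
    """Split named gradients into a 'hot' set (largest norms, updated on the
    GPU immediately) and a 'cold' set (deferred to an asynchronous, offloaded
    update). Simplified illustration of gradient prioritization only."""
    norms = {name: math.sqrt(sum(g * g for g in grad)) for name, grad in grads.items()}
    ranked = sorted(norms, key=norms.get, reverse=True)
    n_hot = max(1, int(len(ranked) * keep_frac))  # top 10% by gradient norm
    hot_names = set(ranked[:n_hot])
    return ({n: grads[n] for n in hot_names},
            {n: grads[n] for n in ranked[n_hot:]})

# Toy example: 20 parameter groups with increasing gradient magnitude.
grads = {f"layer{i}": [0.01 * i] * 4 for i in range(20)}
hot, cold = partition_gradients(grads)
print(sorted(hot))   # the two largest-norm groups: ['layer18', 'layer19']
```

Only the hot set stays on the critical path; everything in the cold set can cross the fabric in the background, which is what shrinks per-step traffic.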
To measure the combined effect, Xinnor conducted controlled training experiments on Llama-style models from 1B to 16B parameters, comparing GPU-baseline performance against multiple offloading strategies across CPU RAM, single-drive NVMe, and RAID0 NVMe configurations, with and without ZenFlow.
The results validate the thesis. Full optimizer offload to DRAM does reduce GPU memory pressure, but step time degrades as model size grows because the optimizer update becomes CPU- and memory-bandwidth bound, confirming that DRAM is not a scalable solution. NVMe offload alone introduces substantial latency and bandwidth penalties. However, when RAID0 NVMe aggregation is combined with ZenFlow, the picture changes materially: RAID0 NVMe with ZenFlow operated within approximately 1.6–1.8x of the RAM-plus-ZenFlow configuration and approached GPU-baseline iteration times at higher gradient accumulation settings, a standard production knob. Critically, the data shows that once optimizer states are placed on NVMe, training performance becomes primarily a storage bandwidth engineering problem. RAID0 consistently delivered roughly 2x step-time reduction versus a single drive, confirming the workload is throughput-bound and responds directly to device-level parallelism.
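The throughput-bound behavior is simple to model: when optimizer-state traffic dominates the step, transfer time scales inversely with aggregate stripe bandwidth. The drive speed and model size below are illustrative assumptions, not figures from Xinnor's benchmark:

```python
def offload_transfer_time(state_bytes: float, drive_gbps: float, n_drives: int) -> float:
    """Seconds to move optimizer state when the workload is throughput-bound:
    RAID0 bandwidth scales ~linearly with stripe width. Illustrative model
    only; the numbers used below are assumptions, not benchmark data."""
    return state_bytes / (drive_gbps * 1e9 * n_drives)

state = 8e9 * 18          # hypothetical 8B-parameter model at 18 B/param
single = offload_transfer_time(state, drive_gbps=7.0, n_drives=1)
raid0  = offload_transfer_time(state, drive_gbps=7.0, n_drives=2)
print(f"single drive: {single:.1f} s, 2-drive RAID0: {raid0:.1f} s")
```

Doubling the stripe width halves the transfer time in this model, consistent with the roughly 2x step-time reduction reported for RAID0 versus a single drive.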
The combination of xiRAID Opus and ZenFlow thus achieves what neither can alone: it expands effective training memory capacity far beyond GPU VRAM and DRAM, delivers near-baseline training performance, and keeps GPU utilization high – with only a marginal impact on training quality from the gradient deferral mechanism.
The economics follow naturally. As GPU memory and flash costs rise, organizations that can extend training capacity through fast, aggregated NVMe fabric (rather than proportionally expanding GPU fleets) gain a meaningful cost advantage. xiRAID Opus supports both unprotected RAID0 volumes for offload aggregation and protected RAID5/6+ volumes for checkpoint safety over NVMe-over-RDMA, enabling flexible, high-utilization storage architectures without stranded “dark flash.”
“GPU memory is the bottleneck that is blocking organizations from training the models they need, and DRAM alone doesn’t solve it — the economics and locality constraints work against you,” said Dmitry Livshits, CEO, Xinnor. “We built xiRAID Opus to be the fastest NVMe-oF fabric possible, but we knew raw speed wasn’t sufficient on its own. Combining it with ZenFlow’s gradient prioritization is what closes the gap. Together, they let organizations expand their effective training memory dramatically, maintain strong GPU utilization, and do it without the capital cost of additional GPUs.”
The full benchmark analysis, including detailed step-time data, I/O characterization, and per-model-size results, is available on the Xinnor blog.