PEAK:AIO Solution to Unify KVCache Acceleration and GPU Memory Expansion for Large-scale AI Workloads

Purpose-built for AI: unifying KVCache reuse and GPU memory expansion using CXL to address one of AI’s most persistent infrastructure challenges

PEAK:AIO unveiled the first dedicated solution to unify KVCache acceleration and GPU memory expansion for large-scale AI workloads, including inference, agentic systems, and model creation.

As AI workloads evolve beyond static prompts into dynamic context streams, model creation pipelines, and long-running agents, infrastructure must evolve, too.

“Whether you are deploying agents that think across sessions or scaling toward million-token context windows, where memory demands can exceed 500GB per model, this appliance makes it possible by treating token history as memory, not storage,” said Eyal Lemberger, chief AI strategist and co-founder, PEAK:AIO. “It is time for memory to scale like compute has.”
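The 500GB figure is plausible from first principles: a transformer’s KV cache grows linearly with context length. The back-of-envelope sketch below uses illustrative model dimensions (assumptions, not figures published by PEAK:AIO) to show how a single million-token context crosses that threshold.

    # Back-of-envelope KV-cache sizing for a long-context LLM.
    # All model dimensions are illustrative assumptions, not
    # figures published by PEAK:AIO.

    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                       bytes_per_elem=2):
        """Per-request KV-cache footprint: keys + values at every layer."""
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    # Hypothetical 100B-class model with grouped-query attention, fp16 cache:
    size = kv_cache_bytes(n_layers=96, n_kv_heads=16, head_dim=128,
                          seq_len=1_000_000)
    print(f"{size / 1e9:.0f} GB")  # ~786 GB for one million-token context

Even halving the footprint with an fp8 cache leaves hundreds of gigabytes per request, which is why offloading token history beyond GPU memory becomes unavoidable.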

As transformer models grow in size and context, AI pipelines face two critical limitations: KVCache inefficiency and GPU memory saturation. Until now, vendors have retrofitted legacy storage stacks or overextended NVMe to delay the inevitable. The company’s new 1U Token Memory Feature changes that by building for memory, not files.

First Token-Centric Architecture Built for Scalable AI
Powered by CXL memory and integrated with Gen5 NVMe and GPUDirect RDMA, PEAK:AIO’s feature delivers up to 150GB/s sustained throughput with sub-5 microsecond latency. It enables:

  • KVCache reuse across sessions, models, and nodes (see the sketch after this list)
  • Context-window expansion for longer LLM history
  • GPU memory offload via true CXL tiering
  • Ultra-low latency access using RDMA over NVMe-oF
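The first and third items can be sketched together: token history is keyed by its prompt prefix so it can be reused across sessions, while least-recently-used blocks spill from GPU memory to a CXL tier instead of being discarded. Class and method names below are hypothetical illustrations, not PEAK:AIO’s API.

    # Conceptual sketch of prefix-keyed KVCache reuse with GPU -> CXL spill.
    # Names such as TieredKVCache are hypothetical, not PEAK:AIO's API.
    import hashlib
    from collections import OrderedDict

    class TieredKVCache:
        def __init__(self, gpu_capacity_blocks):
            self.gpu = OrderedDict()   # hot tier: GPU HBM (LRU order)
            self.cxl = {}              # warm tier: CXL-attached memory
            self.gpu_capacity = gpu_capacity_blocks

        @staticmethod
        def key(token_ids):
            """Reuse key: hash of the token prefix, stable across sessions."""
            return hashlib.sha256(str(token_ids).encode()).hexdigest()

        def put(self, token_ids, kv_block):
            k = self.key(token_ids)
            self.gpu[k] = kv_block
            self.gpu.move_to_end(k)
            while len(self.gpu) > self.gpu_capacity:
                # Evict the least-recently-used block to the CXL tier
                # instead of recomputing it later: offload, not discard.
                old_key, old_block = self.gpu.popitem(last=False)
                self.cxl[old_key] = old_block

        def get(self, token_ids):
            k = self.key(token_ids)
            if k in self.gpu:          # hit in HBM: serve immediately
                self.gpu.move_to_end(k)
                return self.gpu[k]
            if k in self.cxl:          # hit in CXL: promote back to HBM
                self.put(token_ids, self.cxl.pop(k))
                return self.gpu[k]
            return None                # miss: prefill must recompute

Keying on the token prefix is what makes the cache session-spanning: any request that shares a prefix, whenever it arrives, maps to the same blocks (a production system would also scope keys per model).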

This is the first feature that treats token memory as infrastructure rather than storage, allowing teams to cache token history, attention maps, and streaming data at memory-class latency.

Unlike passive NVMe-based storage, PEAK:AIO’s architecture aligns directly with NVIDIA’s KVCache reuse and memory reclaim models. This provides plug-in support for teams building on TensorRT-LLM or Triton, accelerating inference with minimal integration effort. By harnessing true CXL memory-class performance, it delivers what others cannot: token memory that behaves like RAM, not files.
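For teams on TensorRT-LLM, KV-cache block reuse is exposed through the LLM API’s KvCacheConfig. The sketch below shows that documented knob; the checkpoint name is a placeholder, and option names can vary between TensorRT-LLM releases.

    # Minimal TensorRT-LLM sketch enabling KV-cache block reuse.
    # The model path is a placeholder; options may differ across releases.
    from tensorrt_llm import LLM, SamplingParams
    from tensorrt_llm.llmapi import KvCacheConfig

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        kv_cache_config=KvCacheConfig(enable_block_reuse=True),
    )

    # Two requests sharing a long prefix: the second reuses cached KV
    # blocks instead of re-running prefill over the shared tokens.
    shared_prefix = "You are a support agent. Conversation so far: ..."
    outputs = llm.generate(
        [shared_prefix + " User: reset my password.",
         shared_prefix + " User: close my account."],
        SamplingParams(max_tokens=64),
    )
    for out in outputs:
        print(out.outputs[0].text)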

“While others are bending file systems to act like memory, we built infrastructure that behaves like memory, because that is what modern AI needs,” continued Lemberger. “At scale, it is not about saving files; it is about keeping every token accessible in microseconds. That is a memory problem, and we solved it by embracing the latest silicon layer.”

“The big vendors are stacking NVMe to fake memory. We went the other way, leveraging CXL to unlock actual memory semantics at rack scale,” said Mark Klarzynski, co-founder and chief strategy officer, PEAK:AIO. “This is the token memory fabric modern AI has been waiting for.”

The fully software-defined solution, which utilizes off-the-shelf servers, is expected to enter production by Q3.
