Nvidia GTC 2026: ScaleFlux, FarmGPU, and Lightbits Labs Preview Solution for Long-Context AI Inference

Promoting a joint solution coupling high-performance SSDs and KV-cache software

ScaleFlux, FarmGPU, and Lightbits Labs announced the public debut of a collaborative architecture designed to solve one of AI inference’s most persistent challenges: the memory and I/O constraints created by long-context workloads.

At Nvidia GTC San Jose, the companies will debut an implementation that brings together ScaleFlux's high-performance NVMe SSDs, FarmGPU's managed inference environment, and Lightbits' LightInferra software. The joint stack addresses how KV-cache data can be persisted, reused, and streamed more efficiently across inference sessions, reducing the GPU stalls caused by repeated context recomputation and opening the door to more predictable, scalable performance and infrastructure efficiency.
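
None of the companies has published code, but the pattern they describe resembles a content-addressed cache keyed by the token prefix: if the attention states for a prompt prefix already exist on fast storage, they are loaded rather than recomputed during prefill. The Python sketch below illustrates that flow; all names are hypothetical and do not reflect the actual LightInferra, ScaleFlux, or FarmGPU APIs.

```python
# Minimal sketch of KV-cache persistence and reuse across sessions.
# Hypothetical names throughout; no vendor API is implied.
import hashlib

class KVCacheStore:
    """Content-addressed store mapping a token-prefix hash to
    serialized attention key/value tensors (stand-in for NVMe)."""

    def __init__(self):
        self._blocks = {}  # in-memory dict stands in for block storage

    @staticmethod
    def _key(token_ids: list[int]) -> str:
        return hashlib.sha256(str(token_ids).encode()).hexdigest()

    def get(self, token_ids):
        return self._blocks.get(self._key(token_ids))

    def put(self, token_ids, kv_tensors):
        self._blocks[self._key(token_ids)] = kv_tensors

def prefill(store: KVCacheStore, token_ids, compute_kv):
    """Reuse persisted attention states when this prefix was seen
    before; otherwise compute once and persist for later sessions."""
    cached = store.get(token_ids)
    if cached is not None:
        return cached               # skip recomputation: lower TTFT
    kv = compute_kv(token_ids)      # expensive prefill on the GPU
    store.put(token_ids, kv)        # persist for future sessions
    return kv
```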

“We’re transforming inference memory from a reactive cache into an intelligent, streamed data layer,” said Arthur Rassmuson, director, AI architecture, Lightbits Labs. “By prefetching only the data that matters and delivering it to GPUs over high-speed RDMA before it’s needed, we eliminate the stalls that traditionally limit long-context performance. The result is lower Time-to-First-Token (TTFT), more stable throughput under real-world load, and significantly higher effective GPU utilization. For enterprises, that means serving larger models and longer conversations at lower infrastructure cost – and for end users, it means faster, smoother, more responsive AI experiences.”
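
The prefetch mechanism Rassmuson describes amounts to pipelining: the next KV block is read from storage while the GPU consumes the current one, so the transfer cost hides behind compute. Below is a minimal sketch of that double-buffering pattern; a thread pool stands in for the RDMA transport, and every name is illustrative rather than taken from Lightbits' software.

```python
# Hedged sketch: overlap storage reads with GPU consumption so that,
# when I/O keeps pace with compute, the GPU never waits on a block.
from concurrent.futures import ThreadPoolExecutor

def stream_kv_blocks(block_ids, fetch_block, consume_block):
    """Pipeline storage reads with GPU work (double buffering)."""
    if not block_ids:
        return
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_block, block_ids[0])
        for next_id in block_ids[1:]:
            block = pending.result()                   # waits only if I/O lags
            pending = io.submit(fetch_block, next_id)  # prefetch the next block
            consume_block(block)                       # GPU work overlaps the read
        consume_block(pending.result())
```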

“Fast networked storage from Lightbits unlocks a lot of new use cases for long context inference,” said Jonmichael Hands, CEO, FarmGPU. “By pairing our managed service with Lightbits’ high-performance storage running on ScaleFlux NVMe, we are able to lower time to first token and increase utilization on GPUs, drastically lowering the TCO for inference.”

Key areas under exploration include:

  • Higher GPU Utilization and Inference Throughput: Extending and sharing the KV cache beyond limited GPU memory, enabling the same GPUs to serve up to 3× more inference requests by eliminating redundant computation
  • Reduced Latency and Increased Stability: Lowering TTFT and Time Per Output Token (TPOT) by retrieving attention states from storage instead of recomputing them to mitigate inference stalls as context windows expand
  • AI-Native Security and Isolation: Providing end-to-end security, including encryption for KV cache blocks, tenant isolation, and integration with Key Management Systems (KMS) and Trusted Platform Modules (TPM) for shared inference environments; a sketch of this pattern follows the list
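
On the last point, one plausible shape for per-tenant KV-block protection is authenticated encryption with tenant-scoped keys, so a block written to shared storage can only be opened by the tenant that produced it. The sketch below uses AES-GCM from Python's cryptography package; the key table stands in for a KMS- or TPM-backed key hierarchy, and nothing here is drawn from the vendors' actual implementation.

```python
# Illustrative only: per-tenant AES-GCM sealing of serialized KV
# blocks before they land on shared storage. The dict stands in for
# a KMS/TPM-backed key store; no vendor API is implied.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

tenant_keys = {"tenant-a": AESGCM.generate_key(bit_length=256)}

def seal_block(tenant: str, block: bytes) -> bytes:
    nonce = os.urandom(12)                  # unique nonce per block
    # Binding the tenant ID as associated data ties the ciphertext
    # to its tenant: decryption under any other ID fails.
    ct = AESGCM(tenant_keys[tenant]).encrypt(nonce, block, tenant.encode())
    return nonce + ct

def open_block(tenant: str, sealed: bytes) -> bytes:
    nonce, ct = sealed[:12], sealed[12:]
    return AESGCM(tenant_keys[tenant]).decrypt(nonce, ct, tenant.encode())
```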

“As members of the Nvidia Magnum IO GPU Direct Network, we see this as an opportunity to collaborate openly with the ecosystem,” said Keith McKay, senior director, solutions architecture and technical partnerships, ScaleFlux. “What we’re showing at GTC is an early look at how smarter data placement and persistent attention state management could help inference systems stay responsive as context windows grow. This is very much a collaboration we want to shape alongside real operators.”

This announcement marks the beginning of a design-partner-driven effort, with the companies actively seeking feedback from AI infrastructure teams, platform builders, and service providers running large-scale or long-context inference workloads.

Conference attendees are invited to visit the ScaleFlux booth 7006 to view live demonstrations, speak with engineers from all three companies, and discuss participation as design partners in the next phase of development.
