GSI Technology Compute-In-Memory APU Achieves GPU-Class AI Performance at Fraction of Energy Cost

The APU's compute-in-memory (CIM) architecture can match GPU-level performance for large-scale AI applications with a substantial reduction in energy consumption, thanks to the high-density, high-bandwidth memory inherent to the CIM design.

GSI Technology, Inc., inventor of the Associative Processing Unit (APU), a paradigm shift in AI and HPC processing providing true compute-in-memory technology, announced the publication of a paper led by researchers at Cornell University. The findings confirm that GSI Technology’s APU compute-in-memory (CIM) architecture can match GPU-level performance for large-scale AI applications with a reduction in energy consumption, due to the high-density, high-bandwidth memory associated with the CIM architecture.

Key findings include:

  • GPU-class performance – The Gemini-I APU delivered comparable throughput to NVIDIA’s A6000 GPU on RAG workloads.
  • Massive energy advantage – The APU delivered over 98% lower energy consumption than the GPU across various large corpus datasets, underscoring its efficiency and sustainability.
  • Faster and more efficient than CPUs – The APU’s unique design allows it to perform retrieval tasks several times faster than standard CPUs, shortening total processing time by up to 80%.
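The retrieval task behind these findings is, at its core, a nearest-neighbor search over document embeddings. The following is a minimal illustrative sketch of that kernel as a plain NumPy CPU baseline; it is not GSI's APU API, and the function name, corpus size, and embedding dimension are invented for illustration.

```python
import numpy as np

def top_k_retrieval(corpus: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k corpus rows most similar to the query,
    ranked by inner-product similarity (highest first)."""
    scores = corpus @ query                      # one dot product per document
    # argpartition finds the k largest scores without a full sort,
    # then we sort only those k candidates
    top = np.argpartition(scores, -k)[-k:]
    return top[np.argsort(scores[top])[::-1]]

# Hypothetical toy corpus: 10,000 documents with 128-dim embeddings
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)
idx = top_k_retrieval(corpus, query, k=5)
```

Every query touches every embedding, which is why this workload is dominated by memory bandwidth rather than arithmetic, and why an in-memory architecture can attack it efficiently.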

“Cornell’s independent validation confirms what we’ve long believed—compute-in-memory has the potential to disrupt the $100 billion AI inference market,” said Lee-Lean Shu, chairman and CEO, GSI Technology. “The APU delivers GPU-class performance at a fraction of the energy cost, thanks to its highly efficient memory-centric architecture.”

Published in the ACM Digital Library and presented at the MICRO ’25 conference, the paper by the Cornell research team, titled ‘Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device,’ represents one of the first comprehensive evaluations of a commercial compute-in-memory device under realistic workloads. The Cornell-led team benchmarked the GSI Gemini-I APU against established CPUs and GPUs, focusing on retrieval-augmented generation (RAG) tasks over datasets ranging from 10GB to 200GB.

The researchers’ findings point to significant opportunities for GSI Technology as customers increasingly require performance-per-watt gains across various industries, including Edge AI for power-constrained robotics, drones, and IoT devices, as well as defense and aerospace applications where the APU can deliver high performance in environments with strict energy and cooling constraints.

Mr. Shu continued, “This tremendous work by Cornell highlights CIM advantages using the Gemini-I silicon. Our recently released second-generation APU silicon, Gemini-II, can deliver roughly 10x faster throughput and even lower latency for memory-intensive AI workloads, while further improving energy efficiency. Looking ahead, Plato represents the next step forward, offering even greater compute capability at lower power for embedded edge applications. The APU’s unique combination of speed, efficiency, and programmability positions us to unlock high-growth opportunities across edge AI, data centers, defense, and other markets where energy efficiency is a critical strategic advantage.”

The Cornell study also introduced a new analytical framework for general-purpose compute-in-memory devices, providing optimization principles that strengthen the APU’s position as a scalable platform for developers and system integrators.

Article: Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device

The ACM Digital Library has published, in MICRO ’25: Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, an article written by Niansong Zhang (Cornell University, Ithaca, New York, USA), Wenbo Zhu (University of Southern California, Los Angeles, California, USA), Courtney Golden (MIT, Cambridge, Massachusetts, USA), Dan Ilan (GSI Inc., Tel Aviv, Israel), and Hongzheng Chen, Christopher Batten, and Zhiru Zhang (Cornell University, Ithaca, New York, USA).

Abstract: Compute-in-SRAM architectures offer a promising approach to achieving higher performance and energy efficiency across a range of data-intensive applications. However, prior evaluations have largely relied on simulators or small prototypes, limiting the understanding of their real-world potential. In this work, we present a comprehensive performance and energy characterization of a commercial compute-in-SRAM device, the GSI APU, under realistic workloads. We compare the GSI APU against established architectures, including CPUs and GPUs, to quantify its energy efficiency and performance potential. We introduce an analytical framework for general-purpose compute-in-SRAM devices that reveals fundamental optimization principles by modeling performance trade-offs, thereby guiding program optimizations.

Exploiting the fine-grained parallelism of tightly integrated memory-compute architectures requires careful data management. We address this by proposing three optimizations: communication-aware reduction mapping, coalesced DMA, and broadcast-friendly data layouts. When applied to retrieval-augmented generation (RAG) over large corpora (10GB–200GB), these optimizations enable our compute-in-SRAM system to accelerate retrieval by 4.8×–6.6× over an optimized CPU baseline, improving end-to-end RAG latency by 1.1×–1.8×. The shared off-chip memory bandwidth is modeled using a simulated HBM, while all other components are measured on the real compute-in-SRAM device. Critically, this system matches the performance of an NVIDIA A6000 GPU for RAG while being significantly more energy-efficient (54.4×–117.9× reduction). These findings validate the viability of compute-in-SRAM for complex, real-world applications and provide guidance for advancing the technology.
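The communication-aware reduction mapping described above follows a general pattern: each memory partition reduces its own slice of the data locally, so only small partial results need to cross the interconnect. The sketch below models that pattern in plain NumPy under invented partition counts and sizes; it illustrates the idea only and does not reflect the APU's actual instruction set or the paper's implementation.

```python
import numpy as np

def partitioned_dot(a: np.ndarray, b: np.ndarray, partitions: int = 8) -> float:
    """Dot product computed as per-partition local reductions followed by
    a small final combine, mimicking a communication-aware mapping."""
    chunks_a = np.array_split(a, partitions)
    chunks_b = np.array_split(b, partitions)
    # Local reductions: each partition produces one scalar, which is
    # cheap to communicate compared with shipping raw elements.
    partials = [float(ca @ cb) for ca, cb in zip(chunks_a, chunks_b)]
    # Final combine runs over `partitions` scalars, not len(a) elements.
    return float(np.sum(partials))

x = np.arange(16, dtype=np.float64)
y = np.ones(16)
assert partitioned_dot(x, y) == float(x @ y)  # both equal 120.0
```

The point of the mapping is that communication volume scales with the number of partitions rather than with the data size, which matters when the reduction spans many small compute-in-memory banks.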
