DataPelago Nucleus Outperforms Nvidia cuDF Raising the Roofline of GPU-Accelerated Data Processing
Benchmarks show DataPelago's universal data processing engine delivers significant speed gains for compute-intensive operations on top of Nvidia GPUs
This is a Press Release edited by StorageNewsletter.com on August 27, 2025 at 2:01 pm
DataPelago, Inc. released new benchmarking results that show DataPelago Nucleus significantly outperforms Nvidia’s cuDF – a widely used open-source software library that runs on CUDA to speed up data processing – for compute-intensive operations on top of Nvidia GPUs. Nucleus, DataPelago’s universal data processing engine, seamlessly executes data processing tasks across heterogeneous hardware (from CPUs to GPUs), dramatically improving price/performance for key data processing workloads without requiring code or infrastructure changes.
As businesses manage growing volumes of complex data for ETL, business intelligence and GenAI workloads, CPU-based data processing alone can no longer keep pace. Nvidia GPUs offer massive parallelism and throughput advantages that make them ideal for accelerating these workloads. However, they also present unique challenges – such as I/O bottlenecks and limited GPU memory – that limit the amount of data that can be processed at once. To fully realize the benefits of GPUs and deliver better performance-per-$ to accelerate adoption, data processing engines must be designed to leverage GPU strengths while compensating for their limitations.
Nucleus’ GPU-optimized execution layer was designed with this objective in mind. While cuDF has long set the performance ceiling for GPU-based data processing, it does not efficiently handle complex, real-world workloads such as multi-key and variable-length string sorts. The benchmark results for these scenarios show larger gains for Nucleus than the results for simple, fixed-length data operations.
Nucleus overcomes these challenges with capabilities such as better parallel algorithms, fast flows for common workloads, optimized multi-column support, kernel fusions to accelerate complex expressions, and end-to-end string optimization with zero copy shared memory management. This enables Nucleus to raise the roofline for performance on GPUs, unlocking greater value from existing accelerated infrastructure.
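As a rough illustration of what kernel fusion means in general, the sketch below evaluates one expression two ways: operator by operator, with a temporary buffer after each step, and in a single fused pass. It is plain Python running on the CPU, not GPU code and not DataPelago's implementation, which the release does not describe. The function names and the sample expression are hypothetical.

```python
# Conceptual sketch of operator fusion. Names and the expression are
# hypothetical; a real GPU engine would generate a single device kernel
# instead of a Python loop.

def unfused(a, b, c):
    """Evaluate (a * b + c) > 0 one operator at a time."""
    product = [x * y for x, y in zip(a, b)]      # pass 1, temporary buffer
    total = [p + z for p, z in zip(product, c)]  # pass 2, temporary buffer
    return [t > 0 for t in total]                # pass 3, final result


def fused(a, b, c):
    """Evaluate the same expression in one pass, with no temporaries."""
    return [(x * y + z) > 0 for x, y, z in zip(a, b, c)]


a = [1.0, -2.0, 3.0]
b = [4.0, 5.0, -6.0]
c = [0.5, 20.0, 1.0]

# Both strategies must agree; only the number of passes and buffers differs.
assert unfused(a, b, c) == fused(a, b, c)
```

On a GPU, each separate pass typically means another kernel launch and another round trip through device memory, which is why fusing a complex expression into one pass can pay off.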
Initial benchmark results for real-world workloads include:
- Complex Expressions: Nucleus is up to 10.5x faster for project operations, up to 10.1x faster for filter operations, and up to 4.3x faster for aggregate operations compared to cuDF.
- Variable-Length String as Data Type: For hash join operations, Nucleus achieves up to 38.6x faster throughput than cuDF for smaller strings and up to 4x faster throughput for larger strings. Nucleus also shows gains of up to 3.8x for hash aggregate operations and up to 5.9x for Top-K.
- Multi-Column Support: Nucleus delivers up to 8.2x faster performance than cuDF for Top-K operations with multi-column keys.
“While organizations deal with a tsunami of complex data, fortunately accelerated hardware like GPUs have become more readily available in today’s cloud environments. To take full advantage of the performance benefits possible with accelerated hardware, new approaches and non-linear thinking are required,” said Rajan Goyal, CEO, DataPelago. “We founded DataPelago to apply this non-linear thinking and create a new data processing standard for the accelerated computing era so that companies can overcome performance, cost and scalability limitations. These latest benchmark results are an example of how DataPelago is continuing to push this new standard forward.”