
Silk Vs. NetApp

Testing how each performs with a hero number – 4K random read on AWS


By Derek Swanson, WW sales engineering organization, senior customer-facing technologist and product evangelist, Silk

 

 

Choosing a platform that consistently delivers both maximum performance and availability in the public cloud is a bit of a challenge. Cloud IaaS is good enough for general workloads, but it isn’t built to support the mission critical transactional apps that are time sensitive and are key to a great customer experience. Beneath the marketing hype, which vendor can actually deliver the highest performance with costs lower than cloud native IaaS?

Silk wants to make that decision a little easier. Using data from performance white papers that NetApp has published on its website in the past year, we ran a series of performance tests going head-to-head against NetApp Cloud Volumes ONTAP (and Cloud Volumes Service too) to show you how we stack up.

In the first blog post in this series, we’ll take a look at how Silk and NetApp compare by testing how each performs with a hero number – 4K random read on AWS. We’ve run 10 more tests covering all sorts of workloads – including VSI, VDI, SQL, and Oracle real-world loads – that we’ll publish, along with a head-to-head comparison of the 5 tests NetApp ran in their paper.

How Silk and NetApp differ
Before we dive in, you should know a bit about how both the Silk and NetApp architectures are built. NetApp Cloud Volumes ONTAP uses a classic single-controller architecture (single node) with 12 EBS disks split into 2 aggregates, plus a high availability (HA) mode which uses 2 controllers (active-active with sync mirror). The HA element kicks in if one controller fails, but they did not test that. So really, the HA-mode workload tests just use 2 controllers serving IO from 2 groups of 6 mirrored (via synchronous replication) EBS disk-based aggregates, and then report the combined results from the host perspective. They also occasionally use a “high write” mode, which turns off write consistency (if there’s a problem, there is data loss). The ‘high’ mode makes all IO faster, but of course there’s no data protection in case of a sudden failure, so it would have limited use in an enterprise production environment – and it certainly has no use case here, except maybe to highlight how large a performance hit write protection inflicts on the NetApp system.

The NetApp controllers run on a variety of EC2 instance types, each with different amounts of vCPU and DRAM. The level of read performance varies wildly with the amount of memory available.

The Silk platform is a symmetric active-active platform using (in these tests) 2, 4, 6, or 8 nodes (c.nodes) in a cluster, which simultaneously serve IO from a single 40TB pool of storage. The storage pool sits on a persistent media node (m.node), which is (virtually) built from a number of data nodes (virtual SSDs created from small VMs using ephemeral local NVMe disks) that are protected by write mirroring, erasure coding, and triple parity. The Silk platform always has HA enabled, with data protection on as a baseline. The number of nodes determines how much IO can be served, and Silk can serve IO from any single volume across any node(s). The node count can be scaled out or in automatically and non-disruptively, from 2 to 8 (2, 3, 4, 5, 6, 7, or 8 nodes are currently supported), to deliver the right amount of performance on demand while keeping costs in line.

For this test, we compared the various configurations of the Silk platform – from the smallest (2 nodes) to the largest (8 nodes) – on AWS against the entire group of NetApp Cloud Volumes ONTAP configurations. For the full report, including details of the entire configuration, environment, and test harness profiles, see our white paper. These blog posts are based on excerpts from that paper.
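For context, a 4K, 100% random read profile like the one used in the first test can be expressed as a small fio job. The sketch below is our own illustration, assuming fio as the load generator; the device path, queue depth, and job count are hypothetical, and the actual harness profiles are documented in the white paper.

```python
# Illustrative sketch of a 4K, 100% random read fio profile (our assumption,
# not the exact harness from the white paper). /dev/sdX is a hypothetical
# volume under test; adjust iodepth/numjobs for the system being driven.
import subprocess
import tempfile

JOB = """
[global]
ioengine=libaio
direct=1
bs=4k
rw=randread
time_based=1
runtime=300
group_reporting=1

[rand-read-4k]
filename=/dev/sdX
iodepth=32
numjobs=8
"""

with tempfile.NamedTemporaryFile("w", suffix=".fio", delete=False) as f:
    f.write(JOB)
    jobfile = f.name

# fio reports IOPS plus completion-latency percentiles for this profile.
subprocess.run(["fio", jobfile], check=True)
```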

Test Results – 4K block size, 100% reads, and 100% random access I/O – single node
The first test NetApp ran was a 4K random read test, as mentioned. This first graph shows IOPS. We first see the 2 smallest configurations – Silk’s 2-cnode system vs. the NetApp c5.9xlarge configuration. Silk pushes 329,000 IO/s, while the NetApp comes in at 50,537 IO/s (85% less). For the big configurations, Silk deploys 8 c.nodes compared to the NetApp c5d.18xlarge. Silk delivers a huge 881,000 IO/s in this config, compared to 200,263 IO/s for the NetApp.

Our next graph shows the IO latency, because while having enough IO is important, the time it takes to deliver that IO to transactional systems is really the more important number. The Silk solution is consistently quicker, with the 2-cnode system delivering IO at 0.80ms while the NetApp clocks in at 9.29ms latency. Almost 10ms latency for 4K reads is… unusable? Just our opinion. The large Silk config is more impressive, delivering almost 900,000 IO/s at 0.56ms latency. The large NetApp does better here but can still only get to 1.52ms latency, roughly 3x slower than the Silk.
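For readers who like to check the math, the percentage and ratio claims above follow directly from the figures quoted in this post; a quick sketch:

```python
# Quick check of the deltas quoted above, using only figures from this post.
silk_2node_iops, netapp_small_iops = 329_000, 50_537
silk_8node_iops, netapp_large_iops = 881_000, 200_263

print(f"Small configs: NetApp delivers {1 - netapp_small_iops / silk_2node_iops:.0%} fewer IO/s")  # ~85%
print(f"Large configs: NetApp delivers {1 - netapp_large_iops / silk_8node_iops:.0%} fewer IO/s")  # ~77%

silk_8node_lat_ms, netapp_large_lat_ms = 0.56, 1.52
print(f"Large configs: NetApp latency is {netapp_large_lat_ms / silk_8node_lat_ms:.1f}x higher")   # ~2.7x
```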

In our expert opinion, the Silk result of 560 microseconds for almost 900,000 IO/s is sparkling – and that includes enterprise data protection, full redundancy, inline compression, thin provisioning, write mirroring, zero detection, encryption and other data services that are always enabled.

Figure 1: IOPS comparison

Figure 2: Latency comparison

(HA) Test Results – 4K block size, 100% reads, and 100% random access I/O – multi node
To be fair, NetApp also tested their High Availability mode (2 controllers with 2 aggregates set up in a sync mirror, with active-active IO on the front end). We will also compare all of our tests against their HA-mode configuration – that mode should fare much better with read-only test profiles (since NetApp’s architecture is basically built to be a read-caching engine, 100% read tests should produce a great result). By contrast, the Silk platform uses no caching whatsoever and is not dependent on cache hit/miss ratios to produce performance. The Silk architecture produces consistently flat latency regardless of the size or variability of ‘hot’ workload skews and different read vs. write ratios.
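To illustrate what we mean by workload skew and read/write ratio, here is a hedged sketch of how such profiles could be swept using fio’s zipf distribution and read-mix options. The tool choice and every parameter value here are our own illustration, not the harness from the white paper.

```python
# Illustrative sweep of read/write mix and hot-spot skew using fio options.
# The device path, queue depth, and runtimes are hypothetical.
import subprocess

DEVICE = "/dev/sdX"  # hypothetical volume under test

for rwmix_read in (100, 70, 50):      # percentage of reads in the mix
    for zipf_theta in (0.8, 1.2):     # higher theta = hotter, more cache-friendly skew
        subprocess.run([
            "fio", "--name", f"mix{rwmix_read}-zipf{zipf_theta}",
            "--filename", DEVICE, "--ioengine=libaio", "--direct=1",
            "--bs=4k", "--rw=randrw", f"--rwmixread={rwmix_read}",
            f"--random_distribution=zipf:{zipf_theta}",
            "--iodepth=32", "--numjobs=8", "--group_reporting",
            "--time_based", "--runtime=120",
        ], check=True)
```

A cache-dependent design will show latency swinging as the skew and read mix change; a cache-free design should stay flat across the sweep.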

Figure 3: HA IOPS comparison

NetApp’s c5.9xlarge HA config produces 210,914 IO/s at 4.3ms latency – still about 40% fewer IO/s than Silk’s smallest 2-cnode configuration, and at ~5x higher latency. 200,000 IO/s is nice for a small config, but over 4ms latency on small-block reads is rough – in our aforementioned opinion. Let’s look at their biggest config, the c5d.18xlarge. This produces a result much more like what we expected: 542,012 IO/s at 0.29ms, a quite respectable result. Silk matches that with our 4-cnode configuration, which pushes 520,000 IO/s at 900 microseconds – a fair comparison. Of course, Silk can continue to scale far beyond this, increasing up to almost 900,000 IOPS at 500 microseconds, leaving NetApp’s strongest configuration in the dust.

Figure 4: HA latency comparison
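To make the scale-out point concrete, here is a quick back-of-the-envelope calculation using only the Silk figures quoted in this post:

```python
# Back-of-the-envelope scale-out check using the Silk figures quoted in this post.
silk_results = {2: 329_000, 4: 520_000, 8: 881_000}  # c.nodes -> IO/s

for nodes, iops in silk_results.items():
    print(f"{nodes} c.nodes: {iops:,} IO/s "
          f"(~{iops // nodes:,} IO/s per node, {iops / silk_results[2]:.1f}x the 2-node result)")
```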

Now, we readily acknowledge that small-block 4K reads are a hero number and not a true test of a data platform’s enterprise ability to service IO while providing rich data services – but it’s a fun place to start! At least it is for Silk, demonstrably so. Our next entries will progressively dig into heavier, more complex, and eventually real-world workloads that showcase what our platform can enable for your most demanding tier-1, mission-critical databases and customer-facing applications. Silk can do what no other platform can – not cloud-native IaaS, not managed DBaaS, not NetApp, nor any other platform – deliver rich data services and autonomous HA combined with monster IO and gonzo throughput at dead-flat low latency that supercharges even the heaviest application and database workloads.

Keep an eye out in the coming weeks as we continue to share the results of additional performance testing we’ve conducted against both NetApp CVO and Cloud Volumes Service.
