Comparative Evaluation of Ethernet RDMA Fabrics for NVMe over Fabrics (NVMe-oF)
By Saqib Jang, Margalla Communications, formerly at Auspex and Sun
This is a Press Release edited by StorageNewsletter.com on November 6, 2019.
Author Saqib Jang is principal consultant and founder at Margalla Communications, Inc., a technology consulting firm specializing in data center networking. Prior to independent consulting, he was responsible for software product management and marketing at Auspex Systems, Inc., including the company's entry into the CIFS market from 1996 to 1998, and spent over 7 years at Sun Microsystems/SunSoft in a range of executive marketing roles, developing and implementing product marketing strategies for storage networking and network security products.
NVMe technology is gaining mindshare in the enterprise and cloud IT markets. The underlying driver is that deploying NVMe for end-to-end storage networking in cloud and enterprise data centers can increase throughput and reduce latency, which, taken together, enable improved response times for enterprise application users.
However, a large proportion of the networked storage solutions promoted as supporting NVMe deliver little of the performance gain the standard makes possible. This is because implementing NVMe in storage products has two distinct aspects: as the front-end NVMe-oF networking protocol between storage initiators and the storage controllers in target systems, and as the back-end protocol between the storage controller and the flash media.
The reason this distinction is important is that an estimated greater than 80% of the performance enhancement NVMe delivers for enterprise applications comes through NVMe-oF networking, which extends the considerable parallelism of NVMe across data center networks.
The NVMe-oF standard was built with support for a range of network transports. As a result, the challenge for enterprise and cloud data center architects is deciding which fabric or network to use for NVMe-oF to optimize performance, cost, and reliability.
Ethernet RDMA for NVMe-oF
The types of networking fabrics supported by NVMe-oF include NVMe over Fibre Channel (FC), NVMe over InfiniBand (IB), and NVMe over Ethernet RDMA (iWARP or RoCE).
Ethernet is by far the most popular networking technology in the enterprise and cloud data center markets, so it is worth reviewing the highlights of the two choices for deploying NVMe-oF over Ethernet: iWARP and RoCE.
Both iWARP and RoCE utilize Remote Direct Memory Access (RDMA), a method of exchanging data between the memory of two networked devices without involving the processor, cache, or operating system of either machine. Because RDMA bypasses the OS, it is generally the fastest and lowest-overhead mechanism for moving data across a network.
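To make that model concrete, below is a minimal client sketch using the librdmacm connection-manager API; the same code runs unchanged over either iWARP or RoCE adapters. It registers a buffer with the RNIC and posts a send that the adapter completes without copying data through the kernel. The peer address and port are hypothetical placeholders, and a listening RDMA peer with a matching receive buffer posted is assumed.

/* Minimal RDMA send using librdmacm; the same code runs over iWARP or
 * RoCE adapters. Peer address/port are hypothetical, and a listening
 * peer with a receive buffer posted is assumed.
 * Build: cc rdma_send.c -lrdmacm -libverbs */
#include <stdio.h>
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>

int main(void)
{
    struct rdma_addrinfo hints = { 0 }, *res;
    struct ibv_qp_init_attr attr = { 0 };
    struct rdma_cm_id *id;
    struct ibv_mr *mr;
    struct ibv_wc wc;
    char buf[16] = "hello via RDMA";
    char node[] = "192.0.2.10", service[] = "7471";   /* hypothetical peer */

    hints.ai_port_space = RDMA_PS_TCP;                /* reliable-connected service */
    if (rdma_getaddrinfo(node, service, &hints, &res)) {
        perror("rdma_getaddrinfo");
        return 1;
    }

    attr.qp_type = IBV_QPT_RC;
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 1;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;                              /* signal every send completion */
    if (rdma_create_ep(&id, res, NULL, &attr)) {      /* allocates PD, CQs and QP */
        perror("rdma_create_ep");
        return 1;
    }

    mr = rdma_reg_msgs(id, buf, sizeof(buf));         /* pin and register with the RNIC */
    if (!mr) {
        perror("rdma_reg_msgs");
        return 1;
    }

    if (rdma_connect(id, NULL)) {
        perror("rdma_connect");
        return 1;
    }

    /* The RNIC moves the registered buffer directly; no kernel data copy. */
    if (rdma_post_send(id, NULL, buf, sizeof(buf), mr, 0)) {
        perror("rdma_post_send");
        return 1;
    }
    while (rdma_get_send_comp(id, &wc) == 0)          /* wait for the work completion */
        ;

    rdma_disconnect(id);
    rdma_dereg_mr(mr);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}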
iWARP RDMA: Utilizes TCP/IP for Scalability and Loss Resilience
iWARP provides RDMA functionality over TCP/IP and inherits loss resilience and congestion management from the underlying TCP/IP layer. It therefore requires no best practices beyond those already in use for TCP/IP: no specific host or switch configuration (such as support for lossless Ethernet) is needed, and it works out of the box across LAN, MAN, and WAN networks.
For example, iWARP has no distance limit and since it is a protocol on top of basic TCP/IP, it can transfer data over wireless links to satellites if need be.
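On the host side there is likewise nothing iWARP-specific to configure: on Linux, an NVMe-oF connection is established over the generic rdma transport exactly as it would be for RoCE. The sketch below shows roughly what the nvme-cli command "nvme connect -t rdma" does under the hood, writing a connect string to the kernel's /dev/nvme-fabrics interface; the target address, port, and subsystem NQN are hypothetical placeholders, and the nvme-fabrics and nvme-rdma kernel modules are assumed to be loaded.

/* Sketch of what "nvme connect -t rdma" does on Linux: write a connect
 * string to the kernel's /dev/nvme-fabrics interface. The transport is
 * "rdma" for both iWARP and RoCE RNICs. The target address, port, and
 * NQN below are hypothetical placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char opts[] =
        "transport=rdma,"
        "traddr=192.0.2.20,"                        /* hypothetical target IP */
        "trsvcid=4420,"                             /* IANA-assigned NVMe-oF port */
        "nqn=nqn.2019-11.com.example:nvme:sub1";    /* hypothetical subsystem NQN */

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) {
        perror("open /dev/nvme-fabrics");
        return 1;
    }

    if (write(fd, opts, strlen(opts)) < 0) {        /* kernel creates the NVMe controller */
        perror("nvme-of connect");
        close(fd);
        return 1;
    }

    char reply[128] = { 0 };
    if (read(fd, reply, sizeof(reply) - 1) > 0)     /* e.g. "instance=0,cntlid=1" */
        printf("connected: %s\n", reply);

    close(fd);
    return 0;
}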
iWARP RDMA NICs (RNICs) are available from a range of vendors, including Intel and Chelsio, and provide latency in the sub-10μs range.
iWARP's use of the underlying TCP/IP layer for loss resilience is particularly valuable on congested networks and long-distance links, and it happens at silicon speed on iWARP adapters with embedded TCP offload engine (TOE) functionality.
iWARP is scalable and well suited for data center deployment of NVMe-oF storage solutions. For example, iWARP is Microsoft's recommended configuration for the Windows Server 2019 Storage Spaces Direct software-defined storage (SDS) solution.
RoCE: Requires Lossless Ethernet Impacting Ease of Deployment
RoCE (RDMA over Converged Ethernet) runs RDMA over an Ethernet network made lossless through a set of data center bridging (DCB) enhancements to the Ethernet protocol. RoCE v1 operates at Layer 2, the data link layer in the Open Systems Interconnection (OSI) model; it therefore cannot route between subnets and only supports communication between hosts on the same Ethernet network. RoCE v2 provides more flexibility because it runs at OSI Layer 3 (L3) atop the User Datagram Protocol (UDP) and thus can be routed.
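The layering difference is visible on the wire: RoCE v1 frames carry their own EtherType (0x8915) directly on Ethernet, while RoCE v2 traffic is ordinary UDP addressed to destination port 4791 inside a routable IP packet. The short C sketch below classifies a raw frame accordingly; it is illustrative only and assumes untagged IPv4 frames, ignoring VLAN tags and IPv6.

/* Classify a raw Ethernet frame as RoCE v1 or RoCE v2 to show the
 * layering difference. Illustrative only: assumes untagged IPv4 frames
 * and ignores VLAN tags and IPv6. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define ETHERTYPE_IPV4   0x0800
#define ETHERTYPE_ROCEV1 0x8915  /* RoCE v1: RDMA headers directly over Ethernet (L2 only) */
#define ROCEV2_UDP_PORT  4791    /* RoCE v2: RDMA headers inside UDP/IP, routable at L3 */

enum roce_kind { NOT_ROCE, ROCE_V1, ROCE_V2 };

enum roce_kind classify(const uint8_t *frame, size_t len)
{
    if (len < 14)                                     /* Ethernet header */
        return NOT_ROCE;

    uint16_t ethertype = (frame[12] << 8) | frame[13];
    if (ethertype == ETHERTYPE_ROCEV1)
        return ROCE_V1;                               /* cannot cross an IP router */
    if (ethertype != ETHERTYPE_IPV4 || len < 14 + 20)
        return NOT_ROCE;

    size_t ihl = (frame[14] & 0x0f) * 4;              /* IPv4 header length in bytes */
    if (ihl < 20 || len < 14 + ihl + 8 || frame[23] != 17 /* UDP */)
        return NOT_ROCE;

    uint16_t dport = (frame[14 + ihl + 2] << 8) | frame[14 + ihl + 3];
    return dport == ROCEV2_UDP_PORT ? ROCE_V2 : NOT_ROCE;
}

int main(void)
{
    /* A bare 14-byte Ethernet header carrying the RoCE v1 EtherType. */
    uint8_t v1_frame[14] = { [12] = 0x89, [13] = 0x15 };
    printf("%s\n", classify(v1_frame, sizeof(v1_frame)) == ROCE_V1 ? "RoCE v1" : "other");
    return 0;
}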
RDMA adapters running RoCE are available from a number of vendors, including Mellanox and Marvell. RoCE adapters also provide latency in the sub-10μs range but, as mentioned earlier, require a lossless Ethernet network for low-latency operation. This means RoCE requires Ethernet switches that support DCB and Priority Flow Control (PFC) mechanisms.
The challenge with DCB/PFC-enabled Ethernet environments is that switch and endpoint configuration is a complex process, and scalable RoCE deployment requires supplementary congestion control mechanisms, such as Data Center Quantized Congestion Notification (DCQCN), that demand highly experienced teams of network engineers and administrators. As a result, practically speaking, RoCE deployment is limited to single-hop environments with a few network connections per system.
Which Ethernet RDMA Fabric to Use?
When latency is a key requirement but ease of use and scalability are also high priorities, iWARP is the best choice. It runs on existing TCP/IP Ethernet network infrastructure and can easily scale between racks and even across long distances between data centers. A great use case for iWARP is as the preferred network connectivity option for Windows Server 2019 Storage Spaces Direct and Storage Replica deployments.
When scalability is not a requirement or when network engineering resources are available to help with network configuration, RoCE may be considered. For example, RoCE is frequently implemented as the back-end network in modern disk arrays, between the storage controllers and NVMe flash drives. RoCE is also deployed within a rack or where there are only one or two top-of-rack switches and subnets to support.