January 18, 2022
The Confusing World of Vendor Benchmarking: Part One – Apples, Oranges, and Eggs
By: Heather Morris
This is the first in a series of articles on performance testing and comparisons, written by Steve McQuerry, Director of Technical Product Management at Pensando Systems.
In this series we will contrast the various architectural approaches to DPUs and demonstrate why the Pensando Distributed Services Card is the clear choice when deploying multiple services on a single card.
If you’ve seen the graphs on social media or a presentation from Pensando and just want to get to the details, Principled Technologies has prepared a validation report which can be found here: Principled Technologies Report (PDF).
Recently both Fungible and NVIDIA claimed to have set world record storage IOPS performance with their respective DPU (Data Processing Unit) products. As a result of these claims, their methodologies have generated some interesting discussions around DPU performance measurement and the resulting comparisons, both in the press and on social media.
N.B. When it’s pointed out that your competitive testing “is not directly comparable” you may end up with some egg on your face (read on – it’ll make sense soon).
When you dig into the details of these tests it’s important to understand if it’s a like-to-like comparison or just another case of apples and oranges. Benchmarks and world records may be interesting and noteworthy, but we should never lose sight of the reason we do performance testing: to gain an understanding of how products should behave in real-world environments. If the premise of a test is flawed, does it really show us anything? For example, how useful is it to have the ability to crush 142 eggs on your forehead in one minute?
Corner case benchmarks and world record performance display a focus and tenacity that may deserve recognition, but does it matter if it doesn’t actually translate to real-world use cases? One of the main use cases for a DPU is to offload multiple infrastructure services from general purpose CPUs. In this sort of deployment, modern infrastructure services such as networking, storage and security don’t run individually or serially, so real-world performance can’t be measured for each service on a case-by-case basis—it has to be measured with multiple services, all running at the same time. This is how workloads execute in the real world and, most importantly, when we model how the platform behaves with multiple services competing for resources, we gain some insight into where we’re likely to see performance issues.
Contrasting DPU Architectures
Most DPUs can be put into one of two categories: sea of cores, or a sea of cores combined with hardware packet processing using a legacy NIC. (Many vendors use Arm cores but some have designs based on other processors.) Both of these solutions lack innovation and just recycle existing technologies in a different package. Instead of using legacy technologies to address the challenges facing modern infrastructure, Pensando has taken a more innovative approach: we have built a domain-specific processor for I/O processing alongside general purpose Arm cores.
Now you might ask: what’s the difference? Well, a domain-specific processor is built from the ground up to allow entire I/O services to execute within the I/O processor, vs. mostly running I/O services in general purpose cores with hardware acceleration of certain functions in the datapath.
Our platform is designed from the ground up to provide optimal performance and flexibility with full programmability for SDN, security, storage and telemetry services. The addition of P4-programmable hardware enables both Pensando and our customers to develop customized software-defined services and applications that run with the performance of hardware, something we refer to as Software-in-Silicon.
Figure 1 – Architectural Comparisons
Figure 1 represents a simplified architectural overview of the components required to deliver basic SDN services. Each vendor listed, including Pensando, incorporates additional hardware offloads that were featured in these tests.
In this series we will examine using DPUs to offload multiple services typically found in cloud environments. To understand the advantage of a programmable architecture compared to a fixed one, let’s start with the basic SDN use case.
Some vendors, like NVIDIA, use hardware acceleration (flow offload) to reduce the burden on the host CPU for SDN pipelines, by providing an API so the SDN software can program a flow entry into the HW table of the ASIC. Since packets are processed on the card instead of the host CPU, this allows for much higher throughput and reduced latency. Others, like AWS Nitro, rely primarily on a card with a sea of cores to offload the bulk of the packet processing functions.
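The flow-offload model described above can be sketched in a few lines. This is an illustrative model only, not NVIDIA’s actual API: the SDN software handles the first packet of a flow in software, then programs an exact-match entry into the hardware table so later packets in the same flow never touch the CPU.

```python
# Illustrative model of hardware flow offload (names are hypothetical,
# not a real vendor API). The first packet of a flow is processed in the
# slow path; the SDN software then installs an exact-match entry so all
# subsequent packets of that flow hit the hardware table on the card.

class FlowOffloadNIC:
    def __init__(self):
        self.hw_table = {}       # 5-tuple -> action, programmed via the API
        self.slow_path_hits = 0  # packets handled by CPU cores
        self.hw_hits = 0         # packets handled on the card

    def install_flow(self, five_tuple, action):
        """What the SDN software does through the vendor's offload API."""
        self.hw_table[five_tuple] = action

    def process(self, five_tuple):
        if five_tuple in self.hw_table:   # fast path: handled in silicon
            self.hw_hits += 1
            return self.hw_table[five_tuple]
        # miss: punt to the host CPU, decide the action, then offload it
        self.slow_path_hits += 1
        action = "forward"                # placeholder policy decision
        self.install_flow(five_tuple, action)
        return action

nic = FlowOffloadNIC()
flow = ("10.0.0.1", "10.0.0.2", 6, 12345, 80)  # src, dst, proto, sport, dport
for _ in range(1000):
    nic.process(flow)
print(nic.slow_path_hits, nic.hw_hits)  # 1 999
```

The key property this models is that performance depends on how much traffic actually lands in the hardware table; anything that misses falls back to general purpose cores.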
Both of these architectural approaches reduce the amount of CPU cycles required on the host for infrastructure service processing, which allows those valuable cycles to be used for business functions. (It also has the side effect of isolating the infrastructure services from the tenant workloads, which improves security.)
It’s been well-established in studies by hyperscalers such as Google and Facebook that moving infrastructure services onto specialized processors can have significant performance and efficiency benefits, quantifying the CPU overhead required for data center microservices to be between 22 and 80 percent. By now, any additional benchmarking demonstrating this is only belaboring the point. A more useful comparison would focus not on the narrow cases a specific architecture happens to do well, but on how it performs when servicing a variety of workloads and functions. With that in mind, let’s take a closer look at a SDN-based comparison for a hardware offload DPU architecture vs. the Pensando Software-in-Silicon architecture.
Pensando Software-in-Silicon vs. NVIDIA Hardware Offload
The SDN benchmark data published by NVIDIA at HotChips 33 (slide 11) for BlueField-2 was based on a relatively small packet size (114 bytes). Tests like this can provide interesting results, but are they useful? (See my reference to eggs above.) That depends on how they compare to the type of traffic you are servicing. It may be similar to certain applications, but how does it relate to providing services across a variety of traffic types in a modern cloud environment?
The most common hardware offload use case in a cloud environment is that of networking services. Each tenant in a cloud creates their own private network, yet the cloud infrastructure must transparently and securely service each customer. This is accomplished through building a SDN pipeline to carry out the required functions.
A more interesting test, then, would be to define a SDN pipeline and implement it on each architecture to compare the performance across various packet sizes using TCP traffic. The first order of business is to define a simple pipeline, so let’s consider what is required to build a secure multi-tenant environment:
- First, we need an encapsulation method to provide tenant isolation. In this case we will use VXLAN; as packets are parsed we match the VXLAN network ID (VNID) to identify the traffic in a specific cloud instance or network.
- Next, we check the security entries for the customer network and provide connection tracking (L4 stateful firewall services).
- Then, we do a lookup on the inner packet destination IP address and rewrite the inner packet destination MAC address accordingly.
- Finally, we re-encapsulate the packet in a VXLAN header and forward it out the appropriate interface.
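The pipeline described above can be sketched stage by stage. This is a hedged sketch only: field names, table layouts, and values are illustrative, and a real implementation (in P4 hardware or OVS) would operate on raw frames rather than dictionaries.

```python
# Illustrative sketch of the test SDN pipeline: VXLAN decap, tenant lookup
# by VNID, stateful L4 policy check, inner DMAC rewrite, VXLAN re-encap.
# All names and structures here are hypothetical.

def decap_vxlan(pkt):
    assert pkt["outer"]["proto"] == "vxlan"
    return pkt["vnid"], pkt["inner"]

def lookup_tenant(vnid, tenant_table):
    return tenant_table[vnid]          # VNID -> tenant security policy

def check_policy(inner, policy, conntrack):
    key = (inner["src_ip"], inner["dst_ip"], inner["dport"])
    if key in conntrack:               # established flow: stateful allow
        return True
    allowed = (inner["dst_ip"], inner["dport"]) in policy
    if allowed:
        conntrack.add(key)             # L4 connection tracking entry
    return allowed

def rewrite_and_encap(vnid, inner, mac_table, out_ports):
    inner["dst_mac"] = mac_table[inner["dst_ip"]]   # inner DMAC rewrite
    return {"outer": {"proto": "vxlan"}, "vnid": vnid, "inner": inner,
            "egress": out_ports[inner["dst_ip"]]}

def sdn_pipeline(pkt, tenant_table, mac_table, out_ports, conntrack):
    vnid, inner = decap_vxlan(pkt)
    policy = lookup_tenant(vnid, tenant_table)
    if not check_policy(inner, policy, conntrack):
        return None                     # denied: drop the packet
    return rewrite_and_encap(vnid, inner, mac_table, out_ports)

# A single allowed flow through the pipeline:
conntrack = set()
pkt = {"outer": {"proto": "vxlan"}, "vnid": 100,
       "inner": {"src_ip": "10.1.1.1", "dst_ip": "10.1.1.2",
                 "dport": 443, "dst_mac": None}}
result = sdn_pipeline(pkt, {100: {("10.1.1.2", 443)}},
                      {"10.1.1.2": "aa:bb:cc:dd:ee:ff"},
                      {"10.1.1.2": "eth1"}, conntrack)
print(result["egress"])  # eth1
```

On the DSC-200 every one of these stages executes in the P4 engines; on the ConnectX-6 Dx setup they are expressed as OVS flows that may or may not be programmable into the ASIC.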
Figure 2 – Test SDN Pipeline
In order to understand how the Pensando architecture compares to that of NVIDIA’s BlueField-2 we built an environment to measure the basic performance of this simple pipeline between two bare metal servers. This testing was independently validated by Principled Technologies; a more detailed report including setup and configuration can be found here: Principled Technologies Report (PDF).
For this test we compared the latest commercially available products, namely the Pensando DSC-200 and NVIDIA’s ConnectX-6 Dx, both running at 100G.
Why not the BlueField-2?
Despite numerous attempts we have been unable to purchase any BlueField-2 100G DPUs from NVIDIA or their reseller channel. We have, however, been able to get access to multiple ConnectX-6 Dx cards, which contain the same silicon NVIDIA uses as the packet processing engine in BlueField-2, so we have focused this set of tests on that.
This is the point where I expect you to point out that this is just another case of apples and oranges, but bear with me: The ConnectX-6 Dx ASIC is exactly the same as that used in BlueField-2. In addition, we are leveraging the cores in the host to replace those in the BlueField-2, and we’ve stacked the deck a little in favor of NVIDIA by enabling four times the number of cores found in BlueField-2. Before we go any further: if anyone can provide me or Principled Technologies with a BlueField-2 100G, we can do an exact apples to apples comparison.
To summarize: The tests we are running compare the performance of the Pensando DSC-200 and NVIDIA BlueField-2 packet engines. Since we do not have a BlueField-2 DPU we are leveraging the processing function of the host system’s 32 cores (four times the cores available with the BlueField-2 100G) which provides a more favorable outcome for packets that are not offloaded by the hardware.
That’s Enough Fine Print, Let’s Get Back To The Testing
For the Pensando DSC, the SDN pipeline was loaded into the firmware of the card and executed on the P4 processing engines. For the NVIDIA ConnectX-6 Dx the SDN pipeline was built using OVS on the target server. Figure 3 shows the basic setup where Server 1 is the sender and Server 2 is the target server.
Figure 3 – Testbed Setup (DUT – Device Under Test)
We used a kernel-based tool (iperf3) to generate TCP traffic in order to approximate the type of traffic that may be generated by applications running on two hosts in a provider environment. Because of this, these results do not measure or represent the limits of either architecture, but instead show how each architecture performs when subjected to various packet sizes running traffic that approximates that of a kernel-based application.
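A packet-size sweep like the one described can be driven by varying the TCP maximum segment size, which bounds the payload carried in each packet. The sketch below builds iperf3 command lines for such a sweep; the exact flags, durations, and stream counts used in the published report may differ, so treat this as an assumed harness rather than the actual test procedure.

```python
# Sketch of an iperf3 packet-size sweep (assumed harness, not the exact
# procedure from the report). -M/--set-mss caps the TCP segment size,
# -P sets parallel streams, -t the duration, --json structured output.
import shlex

def iperf3_cmd(server_ip, mss, seconds=30, streams=8):
    return (f"iperf3 -c {server_ip} -t {seconds} -P {streams} "
            f"-M {mss} --json")

for mss in (88, 256, 512, 1024, 1448):
    cmd = shlex.split(iperf3_cmd("192.0.2.10", mss))
    # subprocess.run(cmd, capture_output=True) would execute one sweep point
    print(cmd[0], "MSS =", cmd[cmd.index("-M") + 1])
```

Because iperf3 runs through the kernel TCP stack on both ends, the numbers it produces reflect application-like traffic rather than the raw limits of either card.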
What we discovered was that for smaller packet sizes the CX-6 Dx architecture performed well for traffic that was accelerated by hardware offload. However, as the packet size increased above 256 bytes, the DSC-200 pulled away, performing up to 1.5X better than the NVIDIA ASIC.
Another noteworthy measurement was for traffic that was not offloaded. For traffic that could not be hardware accelerated, the DSC-200 performed up to thirteen times better than the CX-6 Dx setup. This is due to the nature of the architecture: the DSC handled all of the stateful firewall services, including connection tracking, in the P4 engines in hardware. Even though the CX-6 Dx supports hardware offload for connection tracking, we were unable to make this work, despite using the latest drivers and software available for the setup at the time of testing. This was likely due to the dependency between the SDN software and the hardware: even though the hardware capabilities exist for the CX-6 Dx, the software must be able to program the flow. When either the software or the hardware is incapable of performing such functions, packet processing falls back to the CPU cores, which cannot match the performance of the ASIC.
Figure 4 – Test Results
Why Does This Matter?
This brings us back to our original point about architectural differentiation and the value of Software-in-Silicon. The graph above highlights the performance gap between architectures when workloads are accelerated versus being processed via a sea of cores. The Pensando Software-in-Silicon approach means that we can add support for new protocols and services while maintaining a superior level of performance. Other platforms and vendors will leave you with a stark choice: either live with reduced performance or deploy new hardware (which typically takes 18-24 months to develop, let alone qualify and deploy). This can be seen in the evolution of BlueField products: BlueField (CX-5), BlueField-2 (CX-6), & BlueField-3 (CX-7).
This is the key differentiator between the architectures. The CX-6 Dx (and by extension BlueField-2) architecture relies on hardware offload: it is built on a static ASIC, an API, compatible drivers, and software that can push flows to the ASIC. This approach introduces a number of compromises, most notably:
- The offload functionality must be built into the chip;
- Software that runs on the host CPU or Arm cores must be able to use that functionality.
If either condition is not met, then packet processing occurs on the cores, which greatly reduces performance.
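The two conditions above can be expressed as a simple predicate. This is purely illustrative shorthand for the fallback behavior just described:

```python
# Illustrative: a flow runs in hardware only when both conditions hold;
# otherwise processing falls back to the (much slower) general purpose cores.
def processing_path(chip_supports_function, software_can_program_flow):
    if chip_supports_function and software_can_program_flow:
        return "hardware"
    return "cpu-fallback"

# e.g. connection tracking in our CX-6 Dx test: the silicon supports it,
# but the software could not program the flow, so traffic fell back to cores.
print(processing_path(True, False))  # cpu-fallback
```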
For Pensando, that functionality is programmed in the P4 packet engines. There is minimal dependency on software running on the Arm cores and new functionality can easily be programmed into the silicon.
Summary
Performance testing and benchmarking provide a valuable data point when evaluating technology, but it is important to look at the methodology and translate that into real-world applications. What really matters when comparing DPUs like BlueField and the Pensando DSC is their ability to perform across varying traffic sizes while providing complex SDN pipelines, security, and storage services simultaneously.
The differentiator for Pensando is Software-in-Silicon, which provides a customizable hardware acceleration engine. This enables Pensando customers to run packet processing software developed specifically for their business use cases as if it was built in silicon. This innovation eliminates the delay in feature development created by the legacy model where packet processing functions are built into the ASIC and enabled through third party software like OVS. It also greatly improves performance by eliminating the need to utilize cores to provide packet processing for services that cannot be accelerated or offloaded by the ASIC.
As we continue to build solutions at Pensando we will also continue to look for meaningful ways to demonstrate the architectural superiority of our platform through performance testing. In the next blog we will look at combining multiple services (SDN + Storage) in a cloud environment, as deployed by many cloud providers.
One Final Note
Performance testing in general tends to be a very emotive subject, particularly when it’s used as a means to compare products. If you’re interested in having a look at the test platforms, and potentially even seeing the tests or running them yourself, let us know via the Contact form below and we’ll be happy to get something scheduled.