How smartNIC architecture supports scalable infrastructure
In this Q&A, author Silvano Gai discusses how smartNICs can benefit enterprises by providing more granular telemetry and supporting distributed cloud infrastructure.
Over the last few years, networking software has outshined its hardware companions. While routers, switches and chipsets may not currently boast the most exciting innovations, advancements are happening.
One such change is occurring with the network interface card (NIC). A traditional NIC receives and sends packets between a network and a server. It might not be fancy, but it gets the job done. Recent innovations, however, are adding programmability, memory, compute and other capabilities onto NICs to bolster the hardware. The result is a smartNIC.
Cloud providers have implemented smartNIC architecture to support their increasingly distributed cloud environments and reduce the stress on their servers. As they scale out their infrastructures, many use smartNICs to provide connectivity, add storage capabilities and run other functions, such as processing and telemetry.
While enterprises don't need the same scale as cloud providers, they can still benefit from smartNIC architecture. One of the compelling benefits for enterprises is telemetry, according to Silvano Gai, author of Building a Future-Proof Cloud Infrastructure from Pearson.
"Telemetry is very important because, in enterprises, there is a lot of finger-pointing," Gai said. "Without objective data, it's really difficult to diagnose." SmartNICs can embed telemetry and pull in valuable network data that helps enterprises gain more granular visibility to diagnose and fix network issues.
In his book, Gai discusses how distributed services platforms have evolved and explores the different components required to support that infrastructure.
Editor's note: The following interview was edited for length and clarity.
How are distributed platforms changing among enterprises?
Silvano Gai: If you look at the classic architecture of an enterprise data center, it's clearly a distributed architecture. But it's mostly still a siloed architecture. You have silos of server networks for HR, engineering or whatever. There isn't really a concept of multi-tenancy. Multi-tenancy is basically obtained by creating multiple silos in the network.
Public cloud is completely different. It's multi-tenant by definition with multi-tenancy built in from day one. When you build multi-tenancy in your network from day one, then you don't need a silo. You can put different users on the same server but with different [virtual] machines or different containers, and they can share their resources. Then, you can have policy to guarantee these resources.
Normally, people speak about scaling out the compute. One CPU is not enough, so you put in 10, 100 or 1,000 to scale out the CPU. But the cloud also scales out the services. They said, 'The only way we are going to survive in this multi-tenant environment is if, each time we install a new server, we also scale the services that are related to that new server. We scale out not only the compute, but also the services.' And, when I say services, I mean the classic firewall, load balancer, encryption, things like that.
When you look at enterprises, they don't do that. Enterprises have all these silos, and they put in appliances -- like Palo Alto firewalls, an F5 load balancer, Cisco, Juniper, Arista, whatever -- to basically keep the silos separate. It's a much less scalable architecture. It also distorts the network with an effect called traffic tromboning, in which traffic crosses the network multiple times to reach the appliance and bounce off it, so nothing is really optimized.
Now, how did the cloud scale out the service? Well, they tried in software, and that didn't really work. They said, 'We need a footprint where we can run the services.' That footprint was basically identified at the border between the server and the network. With that device -- people call it smartNIC, DPU [data processing unit], EPU [energy processing unit], and there are more names than products -- you not only provide the connectivity for the network and possibly for storage, but you also provide the implementation of public services. And, if you already support [the services], then you also get the performance.
Can you speak more about those smartNIC developments?
Gai: I associate the transition from NIC to smartNIC with the fact that companies that build smartNICs have started to put processing inside the smartNIC. And, 99% of the [time], that's in the form of an Arm core [processor]. Of course, to run a processor, you also need dedicated memory.
So, the cost of the smartNIC, due to the processor plus memory, is significantly higher than the cost of a NIC, typically two to three times more. But, by putting in the processor, now you can write software and implement services, and then you have the performance. You basically do everything in Arm processors -- it's all software. That is part of the advantage and disadvantage: it's easy to program, but the performance you get isn't so great, and the same goes for latency and jitter.
There are other approaches. Other companies have tried, for example, to use an FPGA -- field-programmable gate array -- and to program that. That also has advantages and disadvantages. FPGAs are power-hungry, and the density is very low because you need all this programmable logic and so on. The results have been mixed.
Other companies, like Pensando, have tried to adhere to a P4 architecture. P4 is a language for programming the data path. So, you use P4 for the smartNIC data path and use an Arm core for the control path and management path. There are combinations of these techniques. Intel, with the acquisition of Barefoot, is also probably working on or has announced a P4 smartNIC. But, basically, the transition from a NIC to a smartNIC is when you add the dimension of programmability.
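To make that division of labor concrete, here is a rough sketch, in Python rather than actual P4, of the split Gai describes: the match-action lookup stands in for what a P4 data path does per packet in hardware, and the policy-installation function stands in for the control and management path that runs on the Arm cores. All of the names, fields and actions below are invented for illustration.

```python
# Conceptual sketch in Python (not actual P4) of the split described above.
# The "data path" is a match-action table lookup, which a real P4 pipeline
# performs per packet in programmable hardware at line rate. The "control
# path" installs entries into that table, the role of the Arm cores on the
# smartNIC. All names, fields and actions are invented for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class FlowKey:
    src_ip: str
    dst_ip: str
    dst_port: int

ALLOW = "allow"
DROP = "drop"

# Data path: the match-action table consulted for every packet.
flow_table: dict[FlowKey, str] = {}

def process_packet(key: FlowKey) -> str:
    # Default action on a table miss; here we simply drop the packet.
    return flow_table.get(key, DROP)

# Control path: policy logic (running on the Arm cores) that decides
# which entries to program into the data-path table.
def install_policy(src_ip: str, dst_ip: str, dst_port: int, action: str) -> None:
    flow_table[FlowKey(src_ip, dst_ip, dst_port)] = action

if __name__ == "__main__":
    install_policy("10.0.0.5", "10.0.1.9", 443, ALLOW)
    print(process_packet(FlowKey("10.0.0.5", "10.0.1.9", 443)))  # allow
    print(process_packet(FlowKey("10.0.0.7", "10.0.1.9", 22)))   # drop, no entry
```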
Is there an enterprise use case for smartNICs?
Gai: The market is clearly led by the cloud providers. But, in the enterprise, there is this big desire to build a private cloud to mimic what cloud providers did in the public cloud. So, it has come into enterprises, with many installing [smartNICs] to get some low-hanging fruit.
The biggest low-hanging fruit, believe it or not, is telemetry, measuring what's happening on the network. After that, it's what is called a network tap, where you implement the capability to observe what is going on everywhere in a distributed fashion. And, of course, the enterprise is more price-sensitive than the cloud.
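As a hedged illustration of why telemetry is the low-hanging fruit, the sketch below correlates hypothetical per-flow packet counters exported by smartNICs on two servers to show where packets disappear, the kind of objective data Gai says ends the finger-pointing. The server names, flow IDs and counts are all made up, and real smartNICs each expose telemetry in their own formats.

```python
# Hypothetical illustration: correlate per-flow packet counters exported by
# the smartNICs on a sending and a receiving server to see where loss occurs.
# The server names, flow IDs and counts are invented for this example; real
# smartNICs export telemetry in their own vendor-specific formats.

from collections import defaultdict

# (server, flow_id) -> packets that server's smartNIC has seen for the flow
counters = {
    ("web-01", "flow-a"): 10_000,
    ("db-01", "flow-a"): 10_000,   # no loss on this flow
    ("web-01", "flow-b"): 8_500,
    ("db-01", "flow-b"): 7_900,    # 600 packets lost in between
}

# Group the counters by flow so sender and receiver can be compared.
per_flow = defaultdict(dict)
for (server, flow), packets in counters.items():
    per_flow[flow][server] = packets

for flow, by_server in sorted(per_flow.items()):
    sent = by_server.get("web-01", 0)
    received = by_server.get("db-01", 0)
    if received < sent:
        print(f"{flow}: {sent - received} packets lost between web-01 and db-01")
    else:
        print(f"{flow}: no loss observed")
```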
Who manages the smartNIC?
Gai: There are basically two distinct modes. The original mode, which I think won't survive, is that the OS on which the smartNIC is installed manages it. The reason it won't survive, in my view, is that, if the OS is compromised, the smartNIC is compromised, because all your security is stored on the OS.
Most smartNICs now have an external interface, either a gRPC interface or a REST API interface, and they can be managed through the network. They basically present a PCIe [Peripheral Component Interconnect Express] firewall to the OS, so the OS cannot compromise that. If you implement that successfully, the firewall on the smartNIC will not get compromised even if the OS is, so you have the possibility to contain an attack.
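As a sketch of what managing a smartNIC through that network-facing interface might look like, the snippet below pushes a firewall rule to a hypothetical REST endpoint on the card. The management address, URL path, payload schema and token are all invented for illustration, since each vendor defines its own gRPC or REST API; the point is simply that policy reaches the card over the network, not through the host OS.

```python
# Hypothetical sketch of managing a smartNIC through its network-facing REST
# interface rather than through the host OS. The management address, URL path,
# payload schema and token are invented; real devices define their own APIs.

import requests

SMARTNIC_MGMT = "https://192.0.2.10:8443"      # management address of the card itself
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

rule = {
    "name": "allow-web-to-db",
    "match": {"src": "10.0.0.0/24", "dst": "10.0.1.9/32", "dst_port": 5432},
    "action": "allow",
}

# The rule is pushed directly to the smartNIC over the network. Even if the
# host OS is compromised, it has no path to alter this policy, which is the
# containment property described above.
response = requests.post(
    f"{SMARTNIC_MGMT}/api/v1/firewall/rules",
    json=rule,
    headers=HEADERS,
    timeout=5,
)
response.raise_for_status()
print("installed rule:", rule["name"])
```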
Do you think most network teams are open to that change?
Gai: Everybody's conceptually open. Pragmatically, it's a bit more difficult, particularly among the network team, the security team and the server team. In the enterprise, the security team has, for a long time, relied on an appliance over which it has 100% control. Now, this appliance is extremely expensive -- for a modern enterprise, an appliance can cost $1 million.
The smartNIC solution costs way less than that. But, on the other hand, it means the security team now needs to control these smaller form factors, which exist in much greater quantity, and it needs to do that in some sort of coordination with the network team. Before, when installing an appliance, the network team would give the security team a bunch of IP addresses on different subnets, VLANs [virtual LANs] or VXLANs [virtual extensible LANs], and that was the extent of the coordination.
Now, you need to coordinate the management. So, I think that resistance is largely organizational. People realize this is coming. But the fact that it's coming doesn't mean they are quick to implement it.