Building networks for AI workloads

Conventional and high-performance computing networks cannot adequately support AI workloads, so network engineers must build specialized networks that accommodate the massive scale of AI traffic.

The rapid rise of AI highlights the need for powerful, efficient networks dedicated to AI workloads and the massive data sets used to train AI models.

Data centers built for AI workloads have different requirements than their conventional and even high-performance computing (HPC) counterparts. These workloads don't rely solely on legacy server components; compute and storage hardware must also integrate GPUs, data processing units (DPUs) and smartNICs to accelerate AI training and other AI workloads.

With these components in place, the network must stitch them together and handle workloads with widely different parameters and requirements. Data center and cloud networks designed for AI must therefore adhere to a unique set of conditions.

To support AI data flows, network engineers must meet critical workload requirements, such as high throughput and dense port connectivity. That means building data center networks with the right connectivity, protocols, architecture and management tools.

AI workload network requirements

AI data flows differ from those in client-server, hyperconverged infrastructure and even HPC architectures. The three critical requirements for AI networks are the following:

  1. Low latency, high network throughput. Up to half the time spent processing an AI workload occurs in the network. HPC network architectures are built to process thousands of small, simultaneous workloads. By contrast, AI flows are few in number but massive in size.
  2. Horizontally scalable port density. AI training uses large numbers of network-connected GPUs that process data in parallel. As a result, the number of network connections can be eight to 16 times that of a conventional data center (see the sizing sketch after this list). Rapid transmission between GPUs and storage requires a fully meshed switch fabric with nonblocking ports to deliver the best east-west performance.
  3. Elimination of human-caused errors. GPUs must complete all processing on training data before AI applications can use the resulting information, so any disruption or slowdown during transport -- no matter how minor -- can cause significant delays. The biggest culprit behind network outages and degradation is manual configuration. AI infrastructure must be resilient and free of human error.
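To illustrate the port density requirement, the following sketch estimates fabric port counts for a hypothetical GPU cluster in which each GPU gets a dedicated fabric port, a common rail-optimized design. The server count, GPUs per server and storage port figures are assumptions chosen for illustration, not sizing guidance.

    # Hypothetical sizing sketch: one dedicated fabric port per GPU,
    # plus ports facing the storage fabric. All figures are assumptions.
    GPUS_PER_SERVER = 8      # assumption: typical GPU server
    SERVERS = 64             # assumption: cluster size
    STORAGE_PORTS = 32       # assumption: storage-facing ports

    gpu_ports = GPUS_PER_SERVER * SERVERS    # one fabric port per GPU
    total_ports = gpu_ports + STORAGE_PORTS
    print(f"GPU-facing ports: {gpu_ports}, total fabric ports: {total_ports}")

Even this modest 64-server cluster needs more than 500 high-speed fabric ports, which illustrates how the eight- to 16-times multiplier arises.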

AI network design

To address these needs, modern data center networks built for AI increasingly combine specialized network transport, Clos architectures and intelligent automation.

Specialized network transport and accelerators

Specialized physical and logical transport mechanisms minimize network latency in AI workload processing. InfiniBand offers speed, latency and reliability improvements over standard Ethernet for AI workloads. The drawback is that InfiniBand requires dedicated adapters, switches and cabling available from only a small set of vendors, which increases deployment costs versus Ethernet.

An alternative to InfiniBand already exists in the data center: standard Ethernet cabling and switching hardware. Ethernet can transport AI workloads using an optimized network protocol, such as RDMA over Converged Ethernet, commonly called RoCE. This Ethernet-based protocol delivers low-latency, high-throughput data transport -- the exact requirements for AI workloads. RoCE typically relies on congestion controls, such as priority flow control and explicit congestion notification, to keep the Ethernet fabric lossless.
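A back-of-the-envelope model shows why a kernel-bypass transport such as RoCE matters. The link speed and per-message overhead figures below are illustrative assumptions, not measurements of any specific network stack.

    # Rough latency model: wire serialization time plus a fixed
    # per-message software overhead. All figures are assumptions.
    LINK_GBPS = 400           # assumption: 400GbE fabric link
    TCP_OVERHEAD_US = 30.0    # assumption: kernel TCP stack cost per message
    RDMA_OVERHEAD_US = 3.0    # assumption: RDMA kernel-bypass cost per message

    def transfer_time_us(message_bytes: int, overhead_us: float) -> float:
        """Serialization time on the wire plus fixed per-message overhead."""
        wire_us = message_bytes * 8 / (LINK_GBPS * 1e3)  # Gbit/s = 1e3 bit/us
        return wire_us + overhead_us

    for size in (4_096, 1_048_576):  # small control message vs. gradient chunk
        tcp = transfer_time_us(size, TCP_OVERHEAD_US)
        roce = transfer_time_us(size, RDMA_OVERHEAD_US)
        print(f"{size:>9,} B: TCP ~{tcp:.1f} us, RoCE ~{roce:.1f} us")

Under these assumed numbers, bypassing the kernel cuts small-message latency by roughly an order of magnitude, and the savings compound across the millions of messages a training run exchanges.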

Accelerators and smartNICs also support AI workloads at the data processing level. DPUs are programmable processors that move data and handle networking, storage and security tasks in parallel. Network teams can deploy DPUs independently or as part of smartNICs, which offload network tasks from the host CPU and free up computational resources for AI training and workloads.
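As a rough illustration of the offload benefit, the sketch below models how many host cores a DPU might free for AI work. Both percentages are assumptions for illustration only.

    # Toy offload model; both fractions are assumptions, not benchmarks.
    HOST_CORES = 64
    NET_STACK_SHARE = 0.20   # assumption: host CPU share spent on packet
                             # processing and storage I/O without offload
    OFFLOAD_FRACTION = 0.9   # assumption: share of that work a DPU absorbs

    cores_freed = HOST_CORES * NET_STACK_SHARE * OFFLOAD_FRACTION
    print(f"Roughly {cores_freed:.0f} of {HOST_CORES} host cores freed for AI")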

3-stage and 5-stage Clos networks

Networks designed to transport AI workloads commonly use a nonblocking three-stage or five-stage Clos network architecture. This design provides the east-west bandwidth that parallel GPU processing requires and accommodates the eight- to 16-times increase in port density over conventional data center networks. The Clos design also moves data efficiently between GPUs and storage.
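The nonblocking property follows from simple arithmetic. The sketch below computes the host-port capacity of a three-stage leaf-spine Clos fabric built from identical switches; the 64-port radix is an assumption.

    # Host-port capacity of a nonblocking three-stage (leaf-spine) Clos.
    RADIX = 64  # assumption: ports per switch (e.g., 64 x 400GbE)

    leaf_down = RADIX // 2        # host-facing ports per leaf (1:1, nonblocking)
    leaf_up = RADIX - leaf_down   # uplinks per leaf, one to each spine
    spines = leaf_up              # one spine per leaf uplink
    max_leaves = RADIX            # each spine port connects one leaf
    host_ports = max_leaves * leaf_down

    print(f"{spines} spines, up to {max_leaves} leaves, "
          f"{host_ports} nonblocking host ports")  # 64-port radix -> 2,048

When a cluster outgrows a single three-stage fabric, the five-stage design adds a super-spine tier that interconnects multiple such pods.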

Intelligent automation in network management tools

Eliminating human error from data center network operations is an increasingly important goal for enterprise IT. Network orchestration tools address this issue with intelligent automation, replacing manual configuration processes with built-in AI capabilities that perform configuration tasks.

AI-enhanced network orchestration tools can make configurations uniform across the entire network fabric and identify whether configuration changes will disrupt other parts of the network. These network orchestration platforms continually audit and validate existing network configurations. They can analyze network component health and performance data for optimization. If the system identifies configuration changes to optimize data flow transport, it can make those changes without human intervention.
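Conceptually, the audit-and-validate loop reduces to comparing a source of truth against live device state. The sketch below is a minimal illustration; the device names and settings are hypothetical, and a real orchestration platform would gather live state through its own telemetry before remediating.

    # Minimal drift-audit sketch: compare intended config (source of truth)
    # against observed state and flag every deviation. Data is hypothetical.
    intended = {"leaf1": {"mtu": 9216, "ecn": True, "pfc_priority": 3}}
    observed = {"leaf1": {"mtu": 1500, "ecn": True, "pfc_priority": 3}}

    def find_drift(intended: dict, observed: dict) -> list[str]:
        """Return a description of every setting that deviates from intent."""
        issues = []
        for device, settings in intended.items():
            live = observed.get(device, {})
            for key, want in settings.items():
                have = live.get(key)
                if have != want:
                    issues.append(f"{device}: {key} is {have!r}, expected {want!r}")
        return issues

    for issue in find_drift(intended, observed):
        print("DRIFT:", issue)  # an orchestrator would auto-remediate here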

Andrew Froehlich is founder of InfraMomentum, an enterprise IT research and analyst firm, and president of West Gate Networks, an IT consulting company. He has been involved in enterprise IT for more than 20 years.
