Bare-metal container clusters: Infrastructure for next-gen apps
Learn the requirements, designs and software stacks used to build bare-metal clusters for containers using dedicated 1-2U HCI hardware instead of existing virtual machine servers.
One of the most controversial topics in DevOps and IT engineering circles concerns the best infrastructure for containers, where there are three architectural choices: private vs. public cloud infrastructure, cloud VM instances vs. containers as a service, and container nodes on VMs vs. bare-metal servers.
The last item -- the bare-metal container cluster or container nodes on VMs -- is relevant for both cloud operators and enterprise IT. It is only a realistic consideration when operating private container clusters, however, as hyperscale operators play by a different set of rules than your typical IT organization.
The decision on which direction to take -- bare-metal vs. VM container deployments -- comes down to the management convenience and flexibility of VMs compared to the performance and relative simplicity (particularly in networking) of bare-metal implementations. Let's explore this debate further and then move on to bare-metal container cluster best practices.
The VM vs. bare-metal container debate
The choice between VMs or bare metal for containers is often determined by the type of application you intend to run. Monolithic legacy applications written for the mainframe or client-server era that have since been containerized can't feasibly exploit many of the workload orchestration features of Kubernetes, for example.
Here, you are better off using VM clusters, where resilience features, like image snapshots and workload migration (e.g., vMotion), are valuable tools for maximizing application availability and outweigh any performance overhead of using two levels of virtualization. By contrast, cloud- and container-native applications are well suited to bare-metal environments: built from microservices and designed for distribution across multiple container nodes, they can be cloned, restarted and moved with little to no disruption to overall application availability.
It's not surprising that vendors of container software vociferously argue the benefits of VMs vs. bare metal depending on their chosen system design. VMware has been the most adamant advocate for Kubernetes on VMs. The vendor has poured sizable R&D resources into the technology underlying its Tanzu product, which minimizes the performance overhead of VMs and integrates Kubernetes and vSphere management features behind a single interface.
VM vs. bare-metal performance gap depends on the workload
To demonstrate the viability of VM-hosted containers, VMware conducted tests of a large Java workload requiring eight virtual CPUs (vCPUs) and 42 GB of RAM on a single 44-core server with 512 GB of RAM. It deployed 10 Kubernetes pods on identical machines, one running the precursor to VMware Tanzu (previously codenamed Project Pacific) and another running a bare-metal Linux distribution supporting Docker.
Without tuning, testing found the VMware implementation delivered 8% more application throughput, a gap that reversed when container workloads were pinned to a particular node. Admittedly, workload pinning defeats the point of a clustered environment, and an eight-vCPU, 42 GB workload isn't representative of modern applications, as it violates the design principles of lean, granular microservices.
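For readers who don't live in Kubernetes manifests, the sketch below shows what "pinning" a workload means in practice. It is a minimal illustration, not VMware's actual test configuration: the pod name, image and node hostname are hypothetical, while the CPU and memory requests mirror the test's eight-vCPU, 42 GB workload.

```python
# Minimal sketch of "pinning" a container workload to one node in Kubernetes.
# The resource figures mirror the eight-vCPU, 42 GB test workload; the pod name,
# image and node hostname are hypothetical.
pinned_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "java-benchmark"},
    "spec": {
        # kubernetes.io/hostname is a standard node label; selecting on it ties
        # the pod to a single machine and removes the scheduler's freedom.
        "nodeSelector": {"kubernetes.io/hostname": "worker-07"},
        "containers": [{
            "name": "java-app",
            "image": "registry.example.com/java-benchmark:latest",  # placeholder
            "resources": {
                "requests": {"cpu": "8", "memory": "42Gi"},
                "limits": {"cpu": "8", "memory": "42Gi"},
            },
        }],
    },
}
```

Once tied to a single host this way, the workload can no longer be rebalanced by the scheduler, which is why pinning undercuts the value of running a cluster in the first place.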
By contrast, an older performance study, "Running Containers on Bare Metal vs. VMs: Performance and Benefits," from hyper-converged infrastructure (HCI) vendor Stratoscale showed that bare-metal instances significantly outperformed VM-based containers:
"Running Kubernetes and containers on the bare-metal machines achieved significantly lower latency -- around 3x lower than running Kubernetes on VMs. We can also see that, in several cases, the CPU utilization can be pretty high when running on VMs in comparison to bare metal."
Stratoscale also noted how workloads requiring direct access to hardware like GPUs might not even run in containers on VMs. However, this point is mitigated, if not eliminated, by hardware virtualization technologies, like single-root I/O virtualization (SR-IOV) for network interfaces and Bitfusion for GPUs.
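As a rough illustration of how containers reach hardware directly under Kubernetes, the sketch below requests a GPU through the device plugin mechanism. The pod and image names are hypothetical; nvidia.com/gpu is the conventional resource name registered by NVIDIA's device plugin, and the commented-out SR-IOV resource name is an example that varies with how a cluster's SR-IOV device plugin is configured.

```python
# Sketch of a pod requesting hardware directly through Kubernetes device plugins.
# "nvidia.com/gpu" is the conventional resource name registered by NVIDIA's
# device plugin; the SR-IOV resource name is illustrative only.
gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-inference"},
    "spec": {
        "containers": [{
            "name": "inference",
            "image": "registry.example.com/inference:latest",  # placeholder
            "resources": {
                "limits": {
                    "nvidia.com/gpu": 1,              # one whole GPU for the container
                    # "intel.com/sriov_netdevice": 1,  # example SR-IOV virtual function
                },
            },
        }],
    },
}
```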
Diamanti further argued in its white paper, "Five Reasons You Should Run Containers on Bare Metal, Not VMs," that containers running in VMs can require as much as five times the infrastructure needed to run the same workload in containers on bare metal.
There are two primary factors behind Diamanti's contention:
- Bare-metal systems can achieve as high as 90% resource utilization for containerized workloads, versus 15% for VM-based container hosts, due to the overhead of running a full guest OS for each node (see the sizing sketch after this list).
- System (particularly network) resource contention, aka "the noisy neighbor problem," in heavily loaded servers is more pronounced on VM hosts -- again, due to hardware being shared among multiple full OSes.
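The back-of-envelope arithmetic below shows how those utilization figures translate into node counts. Only the 90% and 15% utilization figures come from Diamanti; the 2,000-core aggregate demand and 40-core node size are assumed values chosen for illustration.

```python
# Back-of-envelope math behind the "up to five times the infrastructure" claim,
# using Diamanti's utilization figures. The aggregate demand is an arbitrary example.
import math

cores_per_node = 40      # assumed 1U container node (see configurations below)
demand_cores = 2000      # hypothetical aggregate container CPU demand

bare_metal_util = 0.90   # Diamanti: up to 90% utilization on bare metal
vm_util = 0.15           # Diamanti: ~15% on VM-based container hosts

bare_metal_nodes = math.ceil(demand_cores / (cores_per_node * bare_metal_util))
vm_nodes = math.ceil(demand_cores / (cores_per_node * vm_util))

print(bare_metal_nodes, vm_nodes, round(vm_nodes / bare_metal_nodes, 1))
# -> 56 334 6.0  (the white paper's 5x figure is of the same order)
```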
Although the performance claims on both sides are debatable, the most concrete advantage for bare-metal container environments is cost. While the number of required hosts might not decrease by the 80% claimed by Diamanti, the added licensing fees and administrative layer are tangible financial and operational costs of hosting containers on an otherwise unnecessary VM environment. Since bare-metal servers typically run a lightweight Linux distribution, like Fedora CoreOS, they also eliminate expensive hypervisor licensing fees, aka the "VMware tax."
VMware seeks to mitigate the administrative disadvantage by integrating Kubernetes management features into vSphere. But adding a virtual abstraction layer means that VM-based clusters will always carry administrative tasks not required in bare-metal environments.
Note that these operational costs are easier to justify should an organization require resilience features, like migrating workloads to other clusters (VMs) and taking image snapshots. Pure container environments can meet these requirements in other ways, however, such as by running Kubernetes across multiple zones and using centralized container repositories for application images.
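As an illustration of the multi-zone approach, the Deployment fragment below spreads replicas across failure zones using a standard topology spread constraint; the application name, image and replica count are hypothetical.

```python
# Illustrative Deployment fragment: zone-level resilience without VM migration,
# achieved by spreading replicas across failure zones. Names are examples only.
zone_spread_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web-frontend"},
    "spec": {
        "replicas": 6,
        "selector": {"matchLabels": {"app": "web-frontend"}},
        "template": {
            "metadata": {"labels": {"app": "web-frontend"}},
            "spec": {
                "topologySpreadConstraints": [{
                    "maxSkew": 1,
                    # topology.kubernetes.io/zone is the standard zone label
                    "topologyKey": "topology.kubernetes.io/zone",
                    "whenUnsatisfiable": "DoNotSchedule",
                    "labelSelector": {"matchLabels": {"app": "web-frontend"}},
                }],
                "containers": [{
                    "name": "web",
                    "image": "registry.example.com/web-frontend:latest",  # placeholder
                }],
            },
        },
    },
}
```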
Building bare-metal clusters
System design is one way in which bare-metal and VM-based clusters are more similar than different.
Since Kubernetes is so effective at distributing workloads across nodes, it is best to keep those nodes as dense, standardized and straightforward as possible. Therefore, container clusters usually comprise 1U or 2U nodes, with or without local storage.
However, since modern container stacks, like Diamanti, OpenShift and Platform9, include container-native storage and network layers -- Container Storage Interface and Container Network Interface in Kubernetes-speak -- many bare-metal container clusters are built from HCI with local NVMe or SATA drives.
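For context, the sketch below shows the stock Kubernetes local-volume pattern that such storage layers build on for node-attached drives. Vendor stacks substitute their own CSI provisioners, and the class name here is an example.

```python
# Sketch of the stock Kubernetes "local volume" pattern used for node-attached
# NVMe/SATA drives. Vendor CSI drivers replace this with their own provisioners;
# the class name is an example.
local_nvme_storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "local-nvme"},
    # "no-provisioner" means PersistentVolumes are created statically per drive
    "provisioner": "kubernetes.io/no-provisioner",
    # Delay binding until a pod is scheduled, so the volume lands on the node
    # that actually owns the drive.
    "volumeBindingMode": "WaitForFirstConsumer",
}
```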
Both server vendors, like Cisco, Dell EMC and HPE, and Kubernetes specialists, like Diamanti, offer hyper-converged products, often with a prepackaged Kubernetes distribution, designed for container clusters. These are 1U or 2U server chassis with integrated network interface cards (NICs) and storage.
The following is a representative sample of specifications:
To minimize the overhead of virtual overlay networks on the host processor and enable the carving of a single NIC into virtual slices, a typical configuration includes dual (for redundancy) SmartNICs, such as the Mellanox ConnectX-4 and Diamanti Ultima listed above. These support SR-IOV for hardware virtualization, overlay network packet processing -- e.g., NVGRE (Network Virtualization using Generic Routing Encapsulation) -- and quality-of-service offloading to maximize the CPU resources available for application workloads.
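To make the "carving" concrete, the sketch below uses the kernel's standard SR-IOV sysfs interface to split a physical NIC port into virtual functions, which a device plugin or CNI can then hand out to pods. The interface name and VF count are hypothetical, and the operation requires root on an SR-IOV-capable NIC.

```python
# Rough illustration of SR-IOV NIC carving via the kernel's sysfs interface.
# The interface name and VF count are examples; run as root on capable hardware.
from pathlib import Path

IFACE = "enp94s0f0"   # hypothetical SmartNIC port name
NUM_VFS = 8           # number of virtual slices to create

totalvfs = Path(f"/sys/class/net/{IFACE}/device/sriov_totalvfs")
numvfs = Path(f"/sys/class/net/{IFACE}/device/sriov_numvfs")

max_vfs = int(totalvfs.read_text())
if NUM_VFS > max_vfs:
    raise SystemExit(f"{IFACE} supports at most {max_vfs} virtual functions")

# Equivalent to: echo 8 > /sys/class/net/enp94s0f0/device/sriov_numvfs
numvfs.write_text(str(NUM_VFS))
```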
Massive compute density key for bare-metal container clusters
The configurations above deliver 40 to 80 CPU cores and up to 32 TB of data storage per rack unit, with 9.6 GB of RAM per core. Each cluster requires at least three master nodes -- for redundancy, it's recommended to have an odd number of master nodes to establish a quorum, even in the event of node or network failures -- and two 1U, 48-port 10/25/100 Gigabit Ethernet switches.
Additionally, each container environment -- for example, a pod of racks -- requires at least one network boot server and one cluster management server. So, a maximally configured one-rack cluster with 36 worker nodes would provide 1,440 cores of compute, each with 9.6 GB of RAM, and a total of 0.4 to 1.4 petabytes of NVMe storage.
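The arithmetic behind that one-rack figure is worth making explicit. The sketch below assumes a 42U rack housing the workers, masters, switches and a management server together, and uses the low end of the per-node specifications above.

```python
# The arithmetic behind the one-rack example. The 42U layout (36 workers +
# 3 masters + 2 switches + 1 boot/management server) is an assumption; the
# per-node figures use the low end of the configurations described above.
RACK_UNITS = 42
WORKERS, MASTERS, SWITCHES, MGMT = 36, 3, 2, 1
assert WORKERS + MASTERS + SWITCHES + MGMT == RACK_UNITS

cores_per_worker = 40          # 40-80 cores per 1U node
ram_per_core_gb = 9.6
storage_per_worker_tb = 32     # up to 32 TB NVMe per rack unit

total_cores = WORKERS * cores_per_worker                   # 1,440 cores
total_ram_tb = total_cores * ram_per_core_gb / 1000        # ~13.8 TB RAM
max_storage_pb = WORKERS * storage_per_worker_tb / 1000    # ~1.15 PB NVMe

print(total_cores, round(total_ram_tb, 1), round(max_storage_pb, 2))
# -> 1440 13.8 1.15  (the 0.4-1.4 PB range reflects smaller vs. larger drive configs)
```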
When planning a container environment, VMware or Windows shops already paying hefty licensing fees will probably find the convenience of a single administrative environment with the VM workload migration and backup features of a VM stack too compelling to ignore. All others building modern applications using container-based microservices will likely bypass the VM overhead and go straight to bare-metal container clusters, where they will find many more hardware and software options.