
Enterprises reckon with Kubernetes cluster scale

Kubernetes has landed in enterprise production -- the next task is for it to expand to accommodate very large clusters, as well as many small clusters for multi-cloud and edge computing.

The core Kubernetes cluster platform has mostly become 'boring,' but that doesn't mean it's done growing.

Kubernetes has now seen 20 incremental releases since it reached version 1.0 in 2015; even a year ago, its sixteenth such release went mostly unheralded among IT pros.

The container orchestration platform has also become ubiquitous. Three times as many respondents to the Cloud Native Computing Foundation (CNCF) 2020 end-user survey use containers in production as did in 2016 -- 92% of this year's 1,324 respondents. A reported 91% of respondents use Kubernetes, and the share of organizations running 5,000 or more containers in production grew from 11% in 2016 to 23% in 2020.

This latter trend represents what may be the final frontier for Kubernetes development heading into 2021: enhancing users' ability to scale deployments, both within large clusters and in environments with many clusters.

Financial software maker Intuit, which recently finished migrating its TurboTax environment to Kubernetes, is among the enterprises testing the limits of Kubernetes cluster scalability.

"TurboTax brings a level of scale that is quite unique," said Pratik Wadher, vice president of product development at Intuit. "It's one big burst [of traffic] that happens over a week when we have 14 million users that come in and start doing their taxes, mostly during evening hours, and then on Tax Day."


The goal for the company's wholesale shift to Kubernetes, which began in 2018, was to avoid provisioning for that peak capacity, and instead dynamically scale the IT infrastructure to accommodate it only when needed, cutting infrastructure costs. Kubernetes has been an improvement over static VM-based infrastructure, Wadher said, but it hasn't been without its own scalability challenges.

"We immediately ran into scale issues, whether it was pod scaling, node scaling -- actually DNS, which we never thought we'd run into, was a major pain that we encountered very quickly," he said.

Intuit addressed many of these issues internally, through adjustments to the size of its DNS cache and through its Keiko Kubernetes cluster management project, but there are still challenges with the Horizontal Pod Autoscaler (HPA) that the company will work with the community to improve, Wadher said.

"Kubernetes autoscales pods very easily, but what we were discovering is that we didn't have enough worker nodes -- it takes time to bring up a new node into the system," Wadher said. "The easiest thing would be to overprovision, which we ended up doing in some cases, but then if you have the nodes available, how do you make sure that you can automatically scale the pods [to include them]?"

Intuit uses the Kubernetes pod priority mechanism to keep very low-priority pods on standby within Kubernetes clusters; these are preempted by higher-priority workloads when traffic starts to spike.
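Reserving headroom with preemptible placeholder pods is a general pattern in the Kubernetes community, not unique to Intuit. The sketch below, written with the official Kubernetes Python client, is an illustrative outline of that general approach rather than Intuit's implementation; the priority class name, replica count, resource requests and pause image are all assumptions.

# Illustrative sketch: reserve spare capacity with low-priority placeholder pods
# that higher-priority workloads preempt when traffic spikes. Not Intuit's code;
# names, sizes and the container image are assumptions.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

# A negative-priority class so placeholder pods are always the first evicted.
standby = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="standby-placeholder"),
    value=-10,
    global_default=False,
    description="Placeholder pods that hold spare node capacity",
)
client.SchedulingV1Api().create_priority_class(body=standby)

# Idle pods that request real CPU and memory, keeping extra nodes warm. When a
# real workload needs the space, the scheduler preempts these pods immediately.
placeholder = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="capacity-placeholder"),
    spec=client.V1DeploymentSpec(
        replicas=10,
        selector=client.V1LabelSelector(match_labels={"app": "capacity-placeholder"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "capacity-placeholder"}),
            spec=client.V1PodSpec(
                priority_class_name="standby-placeholder",
                containers=[
                    client.V1Container(
                        name="pause",
                        image="k8s.gcr.io/pause:3.2",  # minimal do-nothing container
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "500m", "memory": "512Mi"}
                        ),
                    )
                ],
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=placeholder)

The placeholder deployment keeps the cluster autoscaler from shrinking the node pool below the reserved headroom, so new production pods can schedule as soon as the placeholders are evicted rather than waiting for fresh nodes.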

"That algorithm, that modeling, is something we feel will be useful for other companies," he said. The company has not presented this idea formally to SIG Autoscaling, but plans to, according to Wadher.

Kubernetes users call for open source multicluster UI

There are extreme cases, such as Intuit's, that challenge vertical scalability within Kubernetes clusters, and some enterprises believe large multi-tenant clusters are the best approach to Kubernetes management. But a recently disclosed security vulnerability is among the factors making multicluster management the more popular approach for now.

This issue, publicized by the CNCF Dec. 7, concerns "man in the middle" attacks in multi-tenant Kubernetes clusters. The vulnerability makes it possible for potential attackers that can create or edit services and pods to intercept traffic between pods or nodes in the cluster.

"This issue is a design flaw that cannot be mitigated without user-facing changes," the CNCF announcement stated. "With this public announcement, we can begin conversations about a long-term fix."

As of mid-2020, most Kubernetes vendors had begun expanding centralized multicluster management features, from IBM Red Hat's Advanced Cluster Management to VMware's Tanzu Mission Control and SUSE's Rancher Enterprise. But some enterprises would also like to see the community offer an open source interface for multicluster management.


"There are some specific requirements in the research area, which triggered the creation of an end-user group after the last KubeCon dedicated to research use cases," said Ricardo Rocha, computing engineer at CERN, a European particle physics research center based in Geneva, Switzerland. "We would like the community to focus on ... improved scheduling, improved support for multicluster and hybrid deployments, not so much tightly interconnected services, but more for burst workloads into multiple clusters running in different clouds."

There is a Kubernetes special interest group (SIG) for multicluster management, but its work so far has been focused on Kubernetes federation, in which separate clusters are abstracted into a single pool of resources.

SIG Multicluster also oversees the Cluster Registry subproject, a nascent effort that may serve as the basis for multicluster Kubernetes controllers. SIG Cluster Lifecycle continues to develop Cluster API, which would standardize multi-cloud Kubernetes cluster installations, but none of those efforts to date includes a multicluster management UI. There is a web-based Kubernetes dashboard UI that can manage multiple virtual clusters within the same physical infrastructure, but not multiple physically separate ones.
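In the absence of a standard interface, the simplest form of multicluster management is a tool that fans the same operation out to every cluster defined in a kubeconfig file. The fragment below, using the official Kubernetes Python client, is a hypothetical illustration of that pattern, not code from any of the projects mentioned here.

# Hypothetical multicluster fan-out: report node count and server version for
# every cluster defined as a context in the local kubeconfig.
from kubernetes import client, config

contexts, _active = config.list_kube_config_contexts()

for ctx in contexts:
    name = ctx["name"]
    api_client = config.new_client_from_config(context=name)  # one client per cluster
    nodes = client.CoreV1Api(api_client=api_client).list_node().items
    version = client.VersionApi(api_client=api_client).get_code()
    print(f"{name}: {len(nodes)} nodes, server version {version.git_version}")

An internal tool of the kind enterprises describe would layer upgrades, configuration management and sizing decisions on top of this sort of per-cluster client, but the fan-out loop is the common core that a community UI would also have to provide.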

Still, Kubernetes users remain hopeful that the community can deliver a Kubernetes cluster management UI that would handle the full lifecycle for multiple physically separate deployments across public cloud and on-premises environments.

"We've been using kops and had to develop an internal tool for managing our clusters, because we have over 200 clusters now," Intuit's Wadher said. This tool handles upgrades, configuration management and cluster sizing centrally for all the company's Kubernetes clusters.

"We've been bandying about the thought that we should just contribute what we have," he added. "But it's so intertwined with a lot of our internal workings ... that it's probably going to take us a while."

Edge computing adds Kubernetes management challenges

For now, the most advanced development of Kubernetes multicluster management tools remains the domain of vendors focused on supporting Kubernetes as an edge computing platform. This area has begun to explode in 2020 because of the proliferation of mobile devices, internet of things (IoT) networks and a desire among companies to automate IT management at remote and branch office locations using Kubernetes as a common platform.

This year, Kubernetes vendors have introduced and expanded distros specifically tuned for minimal footprints in resource-constrained edge computing environments, such as Rancher k3s and Mirantis k0s. Kubernetes management vendors, including Red Hat, with OpenShift, and GitOps Kubernetes vendor Weaveworks, have expanded their support for edge computing, including larger numbers of clusters managed within a single interface and enhanced support for disconnected environments.


The emergence of 5G mobile networks has contributed to the growth of Kubernetes for edge computing this year, said Steve George, COO at Weaveworks, and will continue to challenge Kubernetes maintainers to support the platform as a way to manage an ever-expanding number of small clusters without on-site human intervention.

"5G and the general trend for telcos around software-defined networking ... in the next couple of years is really driving that thinking," George said. "To run 5G, you need a lot of network nodes and you don't want to send humans out to reboot them. You need a mass management capability."

Such environments will require a fresh approach to Kubernetes networking and remote updates, George predicted.

"There's a variability and complexity problem, and as Kubernetes becomes a sort of universal backplane layer, you have to accept that there's going to be a lot of variability [among edge environments]," he said. "They have much more complicated networking environments than standard application platforms, with multiple different networks that may be segmented, with multiple different traffic types."

Kubernetes networking for edge environments also presents a unique reliability challenge, George added, whether that's rolling out staggered upgrades to disconnected and unreliably connected locations or rolling back faulty ones.

"Customers want to do fast deployments, but the real question is, what happens when something goes wrong?" he said.
