Avoid colocation and cloud noisy neighbor issues
In any multi-tenant IT environment, noisy neighbors can be an issue. Here's a closer look at how the challenges differ in the public cloud versus colocation.
Noisy neighbor issues are a major IT problem that can be difficult to deal with. A noisy neighbor is a workload that consumes one or more resources on a platform in a way that restricts the availability of those resources to other workloads.
On dedicated physical platforms, the problem is usually confined to a single workload. A memory leak or overly chatty storage setup can bring that workload to a halt but won't affect other workloads, as they have their own dedicated resources to use. The exceptions are shared resources, such as a SAN, NAS or a shared WAN link.
It's only when you use more virtualized resources that noisy neighbor workloads become a widespread issue. Because virtual platforms share resources and manage them dynamically, one workload can use up all available resources, leaving none for other workloads when they need them most.
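The effect can be illustrated with a minimal sketch. It assumes a hypothetical shared pool that grants requests first come, first served with no per-workload cap; the workload names and the figures (GB of memory on a 64 GB host) are invented for illustration, and real hypervisors and schedulers are far more sophisticated.

```python
def allocate_first_come(capacity, demands):
    """Grant each workload's demand in order until the shared pool is empty."""
    remaining = capacity
    grants = {}
    for name, demand in demands.items():
        granted = min(demand, remaining)  # no per-workload limit is applied
        grants[name] = granted
        remaining -= granted
    return grants


# Hypothetical demands in GB on a 64 GB shared host.
normal = allocate_first_come(64, {"leaky": 16, "web": 16, "db": 16})
leaking = allocate_first_come(64, {"leaky": 60, "web": 16, "db": 16})

print(normal)   # every workload gets what it asked for
print(leaking)  # the leaking workload starves the other two
```

With nothing to cap the leaking workload, the other two receive whatever is left over, which may be nothing at all.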
Administrators need a way to prevent noisy neighbor issues in the first place and to deal with them if they become an issue.
For the purposes of this article, we'll consider two different multi-tenant environments: a relatively controlled colocation environment and a public cloud platform.
Managing noisy neighbors in a colocation facility
In a colocation facility, an organization has its own dedicated cage or area where it installs, provisions and manages its own servers, storage and local area networks. This contained environment uses the facility's broader network capabilities to reach third-party services and the internet, as well as to provide external access to the services the managed platform delivers.
Here are the main ways to prevent and manage noisy neighbor issues:
- Ensure code is correctly written. Even in a DevOps environment, it's vital to test code for memory leaks and to keep network chatter as low as possible. Stress testing is also necessary to show how the workload deals with peak loads and which resources it attempts to use.
- Set resource limits. These parameters should define how the platform should react as the workload flexes. For example, how much extra CPU or storage resource should it be allowed to claim? Monitoring systems must be tied to real-time alerts so any issues are flagged for administrators as soon as possible. Administrators can then override the rules if necessary or address the issues indicated.
- Prioritize workloads. The last thing an organization needs is for a low-priority workload that generates no profit to use up resources while highly important workloads are unable to gain the extra resources they require. In addition to setting resource limits, prioritize your workloads to best meet the organization's needs.
- Redeploy workloads to alternative platforms. A badly behaving workload might still need to run while the underlying problem is dealt with. It should be possible to spin the workload up on a less-used or less important part of your overall platform using virtual machines or containers, or to offload it to the cloud, while developers deal with the issue.
- Enable human intervention. As mentioned, monitoring systems must be in place to identify any issues and flag them. Alongside this, administrators must be able to set off events that can throttle a specific workload so its demands on the platform's resources are minimized to allow other workloads to use the resources they need.
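As a rough illustration of how resource limits, priorities and alerts can work together, here is a minimal sketch. Everything in it is an assumption made for the example: the `Workload` fields, the two-pass allocation policy and the CPU-core figures are hypothetical and do not describe any particular hypervisor or orchestrator.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    priority: int  # lower number = more important (assumed convention)
    baseline: int  # units guaranteed under normal load
    limit: int     # hard cap the workload may never exceed
    demand: int    # units currently requested


def allocate(capacity, workloads):
    """Return {name: units granted}, honoring caps and priority order."""
    ordered = sorted(workloads, key=lambda w: w.priority)
    grants = {}
    remaining = capacity
    # Pass 1: each workload gets up to its baseline, most important first.
    for w in ordered:
        granted = min(w.demand, w.baseline, remaining)
        grants[w.name] = granted
        remaining -= granted
    # Pass 2: spare capacity covers bursts, but never beyond the limit.
    for w in ordered:
        wanted = min(w.demand, w.limit) - grants[w.name]
        extra = max(0, min(wanted, remaining))
        grants[w.name] += extra
        remaining -= extra
    return grants


def over_limit(workloads):
    """Names of workloads asking for more than their cap: alert candidates."""
    return [w.name for w in workloads if w.demand > w.limit]


# Hypothetical workloads sharing 32 CPU cores.
workloads = [
    Workload("batch-report", priority=3, baseline=4, limit=8, demand=40),
    Workload("billing", priority=1, baseline=8, limit=16, demand=14),
    Workload("web", priority=2, baseline=8, limit=16, demand=10),
]
grants = allocate(32, workloads)
alerts = over_limit(workloads)
print(grants)
print(alerts)
```

In this example, the runaway batch job is held to its cap of 8 cores while the two higher-priority workloads receive everything they request. The alert list gives an administrator a starting point for deciding whether to throttle the workload further, override the rule or redeploy it elsewhere.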
Managing public cloud noisy neighbors
With a public cloud platform, your organization has less control. In a colocation facility, workloads run on your own platform and are far easier to control. A public cloud platform is shared among many different customers, who run countless more workloads on it.
Such massive hyperscale public cloud platforms should be able to manage the need for dynamic compute and storage resource requirements without too much of a problem. However, this isn't always the case, depending on how the part of the cloud you're using is configured and how your contract is written. There's also an issue when it comes to WAN resource availability, as it's likely many organizations will be using the same link you are.
In addition to the ways to manage noisy neighbor issues in a colocation facility -- e.g., ensuring base code is clean, stress testing -- the more specific areas that an organization must look at when dealing with public cloud noisy neighbor issues include:
- Understand what you get. The contract with the cloud provider must state how your workload will be dealt with when it requires more resources. Don't expect a carte blanche allowance for as many resources as the workload demands. Remember that your workload could be the one in the wrong, so consider how you would want a badly behaving workload to be dealt with if it were someone else's. Also, understand what steps the provider will take to throttle or deprovision your workload to prevent it from becoming an issue for others. A secondary issue is ensuring you understand the cost implications of resource overages.
- Gain as much visibility of the cloud platform as you can. You can't gain visibility of other organizations' workloads on a granular basis, but you should be able to ask for a level of visibility into overall performance so you can decide whether a resource problem is likely to arise due to trends in overall utilization. This enables you to approach the cloud provider and ask them to investigate the issue at a greater depth, as they should have access to more granular data to figure out what's causing the problem.
- Learn how well-behaving workloads will be supported. When your workloads can be hit for an extended period through no fault of your own, there is something deeply wrong with how things are set up. The cloud provider should be able to throttle badly behaving workloads, reallocate resources to them from other sources, redeploy the workload to a different area of the platform or shut down the workload depending on the severity of the issue. Ensure trigger events and timescales are detailed in the contract.
- Avoid initial overage on resources. Some cloud providers offer more static environments with a high ceiling around initial resource availability. However, they might not offer dynamic resource allocation -- and even a minor adjustment in resource needs might not be possible. Look for resource provisioning that is flexible overall but still controlled.
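To make the point about utilization trends concrete, the following sketch estimates how long it might take a rising trend to cross a threshold. It assumes the provider exposes overall utilization as evenly spaced percentage samples; the function names, sample values and 90% threshold are hypothetical, and a straight-line fit is only one simple way to read a trend.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples, in units per interval."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def intervals_until(samples, threshold):
    """Estimate intervals until the trend reaches threshold.

    Returns 0.0 if the latest sample is already at or above the threshold
    and None if utilization is flat or falling.
    """
    if samples and samples[-1] >= threshold:
        return 0.0
    rate = slope(samples)
    if rate <= 0:
        return None
    return (threshold - samples[-1]) / rate


# Hypothetical hourly utilization readings, in percent.
rising = [50, 55, 60, 65, 70]
steady = [70, 70, 70]
print(intervals_until(rising, 90))  # 4.0 intervals at the current rate
print(intervals_until(steady, 90))  # None: no upward trend
```

An estimate like this is only a prompt for a conversation with the provider, which holds the granular data needed to identify the actual cause.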
Noisy neighbor issues can bring other workloads to a grinding halt. However, dealing with such issues should be relatively easy if you approach the problems with as much information as you can glean from the deployment type and situation at hand.