A look at HCI hardware maintenance
Resource forecasting, capacity planning and upgrade compatibility are essential components of hyper-converged infrastructure maintenance. Learn why admins must implement them.
Hyper-converged infrastructure simplifies day-to-day operational tasks, but it does not eliminate the need for general hardware upkeep.
Hyper-converged infrastructure (HCI) hardware will eventually fail or run low on capacity; all hardware has a finite lifespan. To keep your HCI operational and delivering business value, you must keep the hardware platform healthy with regular component maintenance.
If you have a small HCI deployment, you may not see a single failure in the three- to five-year lifespan of your servers. The more servers you manage, however, the higher the probability of a failure. If you are running hundreds of HCI hardware nodes, a component may fail every couple of months, although modern servers are designed to tolerate such failures.
Systems usually have redundant fans and power supplies so a single component failure doesn't cause an outage. That said, your HCI maintenance plan should still cover replacement hardware, whether you keep spares on premises or rely entirely on the vendor's support service.
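To see how quickly failure odds climb with cluster size, here is a rough sketch in Python. It assumes every node fails independently at the same constant annual rate; the 2% figure and the function names are illustrative assumptions, not vendor data.

```python
def prob_at_least_one_failure(nodes: int, annual_failure_rate: float, years: float) -> float:
    """Chance that at least one of `nodes` servers fails within `years`.

    Assumes independent failures and a constant per-node annual failure rate.
    """
    p_node_survives = (1.0 - annual_failure_rate) ** years
    return 1.0 - p_node_survives ** nodes


def expected_failures_per_year(nodes: int, annual_failure_rate: float) -> float:
    """Average number of node failures to plan for in one year."""
    return nodes * annual_failure_rate


if __name__ == "__main__":
    AFR = 0.02  # hypothetical 2% annual failure rate per node
    for n in (4, 32, 300):
        p = prob_at_least_one_failure(n, AFR, years=1)
        print(f"{n:>3} nodes: P(at least one failure in a year) = {p:.1%}, "
              f"expected failures per year = {expected_failures_per_year(n, AFR):.1f}")
```

Under these assumptions, 300 nodes would average about six failures a year, or roughly one every two months, while a four-node cluster would usually get through a year without one.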
HCI hardware fills up eventually
HCI requires ongoing capacity management. Resource demand grows over time, and each cluster resource is a finite pool when it comes out of the box.
Capacity monitoring should be a core part of your HCI hardware management plan -- preferably with forecasting -- to predict when you need more resources. When you create your budget forecasts, include time for the financial approval, ordering, fulfilment and hardware deployment.
It's poor operations -- and stressful -- to run out of capacity while extra hardware is still on a delivery truck. Be mindful of resource balances, because HCI platforms are purchased as a combination of compute and storage. This makes it trickier to expand compute on its own than it is with a regular, hot-swappable server. To track resource availability, you can use HCI management software to get regular reports, or alerts when resources reach a certain threshold.
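The forecasting and lead-time advice above can be sketched in a few lines of Python. This is a minimal example, assuming usage grows roughly linearly and that you export periodic usage samples from your HCI management software; the function names, the 80% threshold and the sample figures are all hypothetical.

```python
from datetime import date, timedelta


def fit_daily_growth(samples):
    """Least-squares line through (day_index, used_capacity) samples.

    Returns (slope_per_day, intercept).
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need samples from at least two different days")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(samples, capacity, threshold=0.8):
    """Days after the last sample until usage is forecast to reach
    threshold * capacity. Returns None if usage is flat or shrinking."""
    slope, intercept = fit_daily_growth(samples)
    if slope <= 0:
        return None
    crossing_day = (threshold * capacity - intercept) / slope
    return max(0.0, crossing_day - samples[-1][0])


def order_by_date(today, days_left, lead_time_days):
    """Latest date to start procurement so hardware is deployed in time.

    lead_time_days covers approval, ordering, fulfilment and deployment.
    """
    return today + timedelta(days=round(days_left) - lead_time_days)


if __name__ == "__main__":
    # Hypothetical storage samples: (day index, TB used) on an 80 TB cluster.
    usage = [(0, 40.0), (30, 43.0), (60, 46.0), (90, 49.0)]
    left = days_until_threshold(usage, capacity=80.0, threshold=0.8)
    print("Days until 80% full:", round(left))
    print("Start procurement by:", order_by_date(date.today(), left, lead_time_days=60))
```

A straight-line fit is the simplest possible model; if your demand is seasonal or grows in steps, treat the result as a rough lower bound on urgency rather than a precise date.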
If your workload has an uneven distribution of compute and storage consumption, then you could be paying for resources you do not use, making your HCI less cost-effective.
Consider whether adding compute-only or storage-only nodes is the more cost-effective way to expand your HCI hardware setup. Also remember that maintenance activities can take resources away from the HCI cluster; you may need to shut down a node to replace parts such as fans or hard drives.
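One simple way to spot an uneven workload is to compare how full each resource pool is. The Python sketch below is only a starting point: the function name, the 20-point utilization gap and the sample figures are assumptions for illustration, and whether compute-only or storage-only nodes are available depends on your HCI vendor.

```python
def suggest_expansion(cpu_used, cpu_total, storage_used, storage_total, gap=0.20):
    """Suggest a node type based on which resource pool is filling faster.

    `gap` is a hypothetical cutoff: the difference in utilization
    (as a fraction) beyond which the cluster is treated as imbalanced.
    """
    cpu_util = cpu_used / cpu_total
    storage_util = storage_used / storage_total
    if cpu_util - storage_util > gap:
        return "compute-only"   # compute is the bottleneck; storage sits idle
    if storage_util - cpu_util > gap:
        return "storage-only"   # storage is the bottleneck; compute sits idle
    return "balanced"           # standard combined nodes are a reasonable fit


if __name__ == "__main__":
    # Hypothetical cluster: 85% of CPU in use, 40% of storage in use.
    print(suggest_expansion(cpu_used=85, cpu_total=100,
                            storage_used=40, storage_total=100))
```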
Cluster expansion considerations
When the time comes to expand your HCI cluster, consider the effects new hardware has on resource availability. If you continue to expand with similar HCI hardware nodes from the same vendor, you will likely not affect overall performance.
If you expand with nodes that have significantly different storage and processing resources, there may be an imbalance in performance across your infrastructure. For example, a cluster with four older, medium-sized HCI nodes with 256GB of RAM each may be expanded with two newer and much more powerful 768GB nodes.
If your cluster expands from 1TB of RAM to 2.5TB of RAM and one of the new nodes fails, the cluster loses nearly a third of its RAM; if one of the older nodes fails, you lose only 10% of the RAM. A similar imbalance can affect CPU or storage capacity and lead to maintenance or compatibility issues on the newer nodes.
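The arithmetic behind that example is easy to check. This short Python sketch uses the node sizes from the example above; the function name is illustrative.

```python
def ram_loss_by_node(node_ram_gb):
    """Fraction of total cluster RAM lost if each node fails on its own."""
    total = sum(node_ram_gb)
    return [ram / total for ram in node_ram_gb]


if __name__ == "__main__":
    # Four older 256GB nodes plus two newer 768GB nodes.
    cluster = [256] * 4 + [768] * 2
    print("Total RAM (GB):", sum(cluster))            # 2560GB, i.e. 2.5TB
    losses = ram_loss_by_node(cluster)
    print(f"Older node fails: {losses[0]:.0%} lost")  # 10%
    print(f"Newer node fails: {losses[-1]:.0%} lost") # 30%
```

The same calculation works for CPU cores or storage capacity: substitute per-node figures for the RAM values to see which single failure would hurt most.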
The next step after rolling cluster expansion is rolling component replacement. When your HCI nodes reach the end of their lives, you can deploy new nodes into the cluster and then retire the older ones.
Figuring out if an asset is at the end of its life is a business decision. End of life can be when the asset's value depreciates to zero, when you decide to remove the risk of failure from old hardware or when new hardware improvements make older hardware expensive to run.