
How do you verify that an HA server or cluster is working?

I want to ensure server high availability in my production IT environment. Is it better to test server HA and risk disruption, or trust in the systems' reliability?

A server or any IT system that's configured as HA is one that management deems important to running the business. Simply put, when an HA server fails, the business suffers. So should you test HA systems? Two experts weigh in.

Kevin Tolly: While the question of how to verify high availability (HA) is specific to clustered servers, it is probably beneficial to consider HA testing principles in general.

Two concerns must be balanced when deciding on HA verification frequency:

  • What is the level of effect on the business if an HA system fails? If it is high, this is a vote for frequent tests.
  • What level of effect does a failed test have on the live user base? If it is high, this is a vote for infrequent tests.

Many businesses will answer "high" to both questions. IT organizations want to avoid a failed system and a failed test, so they need to find a balance.

If the budget is available, build a shadow system to mirror your HA server environment. Include servers, switches, firewalls and whatever other HA infrastructure exists in production. The mirror setup should use the same components as the live environment right down to the software version levels of operating systems, drivers and so forth. This testing environment lets the IT team frequently strain aspects of the high-availability cluster -- even doing so daily -- without any failures affecting the actual production environment.
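One way to check that a shadow system still matches production "right down to the software version levels" is to compare component inventories from both sides. The sketch below is only an illustration: the function name, the component names and the version strings are all hypothetical, and it assumes you can already export each environment's inventory as a simple name-to-version mapping.

```python
def find_version_drift(production, shadow):
    """Return components whose versions differ, or that exist on only one side.

    Both arguments are dicts mapping a component name to a version string.
    The result maps each drifting component to (production_version, shadow_version),
    with None standing in for a component that is missing on that side.
    """
    drift = {}
    for name in sorted(set(production) | set(shadow)):
        prod_version = production.get(name)
        shadow_version = shadow.get(name)
        if prod_version != shadow_version:
            drift[name] = (prod_version, shadow_version)
    return drift


# Hypothetical inventories, for illustration only.
production = {"os": "8.6", "nic_driver": "2.1.4", "cluster_software": "5.0.2"}
shadow = {"os": "8.6", "nic_driver": "2.1.3", "cluster_software": "5.0.2"}

print(find_version_drift(production, shadow))
# {'nic_driver': ('2.1.4', '2.1.3')}
```

An empty result suggests the two environments agree on every listed component; anything else is a reason to bring the shadow system back in line before trusting test results from it.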

Even with a shadow server farm for testing, IT organizations need to test their live production servers. I recommend a two-pronged strategy: test HA servers periodically, and test them again after any significant hardware or software change or update.

Give business managers control over the window of user disruption that live HA server testing will cause. The longer you wait between tests, the greater the likelihood that the HA safeguards in place won't work. Aim for a live test at least once per quarter. In addition, whenever significant new elements, components or software versions are introduced into the HA environment, schedule an HA test to be sure that the "upgrade" hasn't broken the availability systems.
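The two triggers for a live test, elapsed time and significant change, can be written down as a simple rule. This is a minimal sketch under stated assumptions: the function name is hypothetical, a quarter is approximated as 91 days, and deciding whether a change counts as "significant" is left to whoever sets the flag.

```python
from datetime import date, timedelta

# Assumption: a quarter is approximated as 91 days.
QUARTER = timedelta(days=91)


def live_test_due(last_test, today, changed_since_last_test):
    """Return True if a live HA test should be scheduled.

    A test is due if a significant hardware or software change has been
    made since the last test, or if a quarter or more has passed.
    """
    if changed_since_last_test:
        return True
    return today - last_test >= QUARTER


# One month after the last test, nothing changed: not due yet.
print(live_test_due(date(2024, 1, 1), date(2024, 2, 1), False))  # False
# One month after the last test, but a significant update was applied: due.
print(live_test_due(date(2024, 1, 1), date(2024, 2, 1), True))   # True
# Four months after the last test: due.
print(live_test_due(date(2024, 1, 1), date(2024, 5, 1), False))  # True
```

The rule only says when a test is due; the actual test window would still be agreed with business managers.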

Joe Clabby: The only time I'd mess with a high-availability cluster is if it had a performance problem, which necessitates some tests to identify the cause and fix the problem.

Regardless of the quality of service level you expect from an HA cluster, testing shouldn't be needed often, because:

  • If you require 100% uptime on HA servers, the cluster should already be fault tolerant, which means you can perform tests rarely. 
  • If the uptime requirement is four nines (99.99%), you can still test the HA server environment infrequently because the uptime requirement allows for some downtime.
  • Five nines (99.999%) uptime indicates the IT organization expects excellent availability. The server cluster is configured to provide that; I recommend testing rarely.
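To put numbers on the uptime targets in the list above, the yearly downtime each one allows can be worked out directly. The sketch below assumes a 365-day year and counts every minute of unavailability against the budget; the function name is illustrative.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(uptime_percent):
    """Return the minutes of downtime per year that an uptime target permits."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


for target in (100, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
# 100% uptime allows 0.00 minutes of downtime per year
# 99.99% uptime allows 52.56 minutes of downtime per year
# 99.999% uptime allows 5.26 minutes of downtime per year
```

Four nines leaves a little under an hour per year and five nines a little over five minutes, so any disruption caused by a live test draws on a small budget.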

About the experts:
Kevin Tolly is founder of The Tolly Group, which provides third-party validation/testing services. Tolly is also the founder and CEO of Tolly Research, which provides research services to IT vendors and end-user companies.

Joe Clabby is the president of Clabby Analytics and has more than 32 years of experience in the IT industry, with positions in marketing, research and analysis. Clabby is an expert in application reengineering services, systems and storage design, data center infrastructure and integrated service management. He has produced in-depth technical reports on various technologies, providing guidance on numerous topics, such as virtualization, provisioning, cloud computing and application design.
