Hospital system addresses IBM SVC performance, stability problems with alternative configuration

Spectrum Health breaks up IBM SVC eight-node cluster, taking uncommon tiered approach to improve stability and performance of storage virtualization system.

Prioritizing stability over ease of management, Spectrum Health System took an uncommon if not unprecedented approach when upgrading the server hardware for the IBM SAN Volume Controller (SVC) software that virtualizes about 2 PB of its data.

The not-for-profit organization, which is based in Grand Rapids, Mich., and operates nine hospitals, had encountered infrequent but memorably disruptive outages over the last five years, most often in connection with IBM SVC code updates. So, the IT team last year made plans to break up its eight-node SVC cluster into four two-node clusters, or tiers, each with its own service level of availability.

On its face, the new approach might not appear so radically different. Last month, at the completion of the project, Spectrum still had a large storage virtualization environment of four pairs of SVC nodes, or "I/O groups," each with a primary node backed up by a partner node.

But, by dividing SVC into four distinct clusters, the IT team hopes to restrict any problem to a single two-node cluster and, ideally, discover and resolve issues on a test tier before they can affect the organization's top-tier applications, such as the Cerner Corp. clinical system.

Architectural trade-offs

The Cerner application had been accessing disks from each of the node pairs in the eight-node cluster under the original architecture, which Spectrum's value-added reseller (VAR), IBM and Cerner had recommended for performance reasons. But, as Spectrum learned, the setup carried the unintended risk of exposing the Cerner system to an outage, even if a problem affected only one of the SVC node pairs.

"That was really more of an architecture for ease of administration," said Scott Dresen, vice president of enterprise technology services at Spectrum. "It's one of the sides of virtualization that people need to be very careful about."

Under the old setup, the eight IBM SVC nodes operated logically as a single cluster, and Spectrum could manage its two IBM DS8300s, three XIVs and a DS5300 from the same interface as one storage pool. With the SVC management layer in place, the IT team could move data between storage systems without taking down applications. During the past five years, Spectrum has relied on SVC to migrate to a new data center and new storage systems, move hundreds of terabytes of data and implement several hundred terabytes of tier 2 storage, all with no impact on applications or end users.

"[Users] didn't see any kind of outage or forced downtime or maintenance window," said Dan Lawrence, manager of storage at Spectrum. "That's really been the power of virtualization."

The trade-off with the new architecture is that Spectrum will be able to shift data non-disruptively only among the disks managed by a single SVC cluster, since each of the four clusters manages only its own storage.

"We are compromising to some degree the opportunity virtualization can give us," said Dresen. "We cannot seamlessly migrate from the tier 1 cluster to the tier 2 cluster without a service interruption of the application. But, that is OK by us."

Despite the lack of one big pool of storage, the SVC storage virtualization layer will still bring benefits. For instance, the IT team can manage all of the storage using the same tools. It also can continue to use the same multipath drivers on all of the servers.

More importantly, Spectrum can test SVC code updates on the fourth tier before installing them on the production systems. Among the production tiers, the third will get the updates first, the second on a less frequent basis, and the first only semiannually or even less often.

"Having three [production] tiers gives us a chance to roll code out to progressively more sensitive applications with increasing levels of confidence, because we have deployed the application updates on different systems of lower sensitivity," said Dresen.

Spectrum reserved the first tier for storage of the most important applications. The second has the bulk of the rest of the organization's 650 or so applications; the third handles archive and near-line data; and the fourth is for testing.
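
Laid out as a simple sketch, the tier assignments and update cadence described above look roughly like this; the structure and labels are illustrative, not Spectrum's actual configuration.

# Illustrative summary of the four-tier layout and code-update cadence described above
# (field names and ordering are our own sketch, not Spectrum's actual configuration).

TIERS = [
    {"tier": 1, "nodes": 2, "workload": "most critical applications (e.g., Cerner)",
     "svc_code_updates": "semiannual or less often"},
    {"tier": 2, "nodes": 2, "workload": "bulk of the ~650 other applications",
     "svc_code_updates": "less frequent than tier 3"},
    {"tier": 3, "nodes": 2, "workload": "archive and near-line data",
     "svc_code_updates": "first production tier to receive new code"},
    {"tier": 4, "nodes": 2, "workload": "testing",
     "svc_code_updates": "validates new code before any production tier"},
]

def rollout_order(tiers):
    """New SVC code flows from the test tier toward the most sensitive tier."""
    return sorted(tiers, key=lambda t: t["tier"], reverse=True)

for t in rollout_order(TIERS):
    print(f"tier {t['tier']}: {t['workload']} -> {t['svc_code_updates']}")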

"They're being smartly pragmatic," said Marc Staimer, president of Dragon Slayer Consulting in Beaverton, Ore. "I'm a big fan of people looking at what they have and saying, ‘Alright, this is what we know. How do we make things better without introducing more variables that we don't know how to deal with?' "

Yet, Spectrum's approach is unusual, according to IBM's Chris Saul, market segment manager for storage virtualization and midrange disk. Saul said large customers with hundreds of terabytes of storage quite commonly have more than one SVC cluster because they need more performance than a single cluster can deliver. They generally don't move from a single eight-node cluster to four two-node clusters and typically don't encounter stability problems, he said.

"For more than the last three years, our entire installed [IBM SVC] base has been delivering better than five 9s of availability, which is an average of about five minutes [of] downtime a year," Saul claimed. "So, the design is that the system is extremely redundant and very highly available."

Node failures

SVC uses IBM System x servers deployed in redundant pairs for high availability, and, indeed, Spectrum on many occasions has been able to address a hardware problem on an SVC node with no application outage, as the workload failed over from the downed node to its partner.

"When it works well, it works really well," said Dresen.

But, he added that Spectrum never experienced even four 9s of availability with the eight-node cluster and instead encountered "the predictable annual outage," most recently in April, when an SVC bug contributed to an eight-hour system stoppage while the organization was partway through its infrastructure upgrade. The outage affected several key applications.

"Essentially, you took the storage out from right underneath the applications, and the applications responded differently to that loss of storage," Dresen said. "Some apps came up right away. Other apps had some corruption that we had to deal with and manage through to get them back up and running."

Despite the misfortune, Dresen saw a bit of a silver lining: the validation of Spectrum's new SVC approach. The outage affected only business applications that the IT team hadn't yet migrated to the new configuration. The major Cerner clinical system was not affected because it was already running the latest IBM SVC code on the updated hardware in the newly segregated top-tier cluster.

"Our new architecture has been rock-solid since we put it in," said Dresen.

HBA driver compatibility issues

The April outage illustrated one of the challenges that users can face in a large IBM SVC environment. The fix for the SVC bug was in the latest version of the software. Spectrum, however, couldn't upgrade to new SVC code until the IT team checked whether the host bus adapter (HBA) drivers on its servers were up to date to ensure compatibility.

"Had we been able to upgrade to the current version of the SVC, the way we would have liked to have done, that issue wouldn't have occurred," Dresen said. "But, it takes us a long time to go through all the different servers we have and make sure that they're upgraded."

Prior to April, Spectrum hadn't experienced a major outage since 2009, when the IT team orchestrated a major hardware platform uplift of the Cerner system, according to Dresen. Any problems tended to stem from incompatibilities between the SVC code and HBA code, he said.

"As long as you keep your code levels consistent between the SVC and the servers and everything in between, you're going to be fine," said Mark LaBelle, manager of database and midrange system at Spectrum. "But, if you let those code levels get out of sync, you're playing with fire."

Outage forensics

Dresen said the team looked to IBM for an analysis of the IBM equipment but wasn't always able to get useful information. "Candidly, what they would say to us very frequently is, ‘If the version of the adapter that you're running works now, when you do the code update to SVC, it should work after,' " he said. "It became very frustrating to hear because too often we've found that was not the case."

Lawrence said the system occasionally failed due to client-side pathing issues with drivers. When one SVC node was down, the host servers didn't always handle the failover to the other path, and the application might shut down, he said.

The code upgrade process sometimes proved challenging. Updating code in an eight-node IBM SVC cluster is a structured affair, starting with the first pair of nodes. As Dresen described it, one node flips from active to inactive mode to receive the upgrade, after its "old-code" partner becomes the active node.

As soon as one of the nodes in each of the four pairs has the upgrade, the freshly updated nodes flip from inactive to active state to run the cluster, while the partner nodes return to an inactive state to receive their code upgrades. Dresen said the process could take six to eight hours.
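
In outline, the rolling upgrade works in two passes: one node per I/O group goes inactive and takes the new code while its partner carries the load, then the pairs swap roles so the remaining nodes can be upgraded. The sketch below is a simplified model of that sequence as described here, not IBM's actual upgrade code.

# Simplified model of a rolling code upgrade across an eight-node cluster
# (four I/O groups of two nodes each). One node per pair goes inactive and takes
# the new code, then the pair swaps roles so its partner can be upgraded in turn.
# Node names and the two-pass loop are illustrative only.

io_groups = [{"active": f"node{2*i+1}", "standby": f"node{2*i+2}"} for i in range(4)]

def upgrade_node(node, code_level):
    print(f"  upgrading {node} to {code_level}")

def rolling_upgrade(io_groups, code_level):
    # Pass 1: upgrade the inactive node in every pair while its partner serves I/O.
    for pair in io_groups:
        upgrade_node(pair["standby"], code_level)
    # Pass 2: swap roles so the freshly upgraded nodes run the cluster,
    # then upgrade the nodes that had been active on the old code.
    for pair in io_groups:
        pair["active"], pair["standby"] = pair["standby"], pair["active"]
        upgrade_node(pair["standby"], code_level)

rolling_upgrade(io_groups, "new code level")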

IBM's Saul said the system is designed to upgrade a single node at a time and move on to the next node only if the prior node's update was successful. If a problem materializes, the system backs out the code upgrade and returns to its previous state, he said.

 "If there's any sort of failure within the upgrade process, we go back to where we were before so that we give the customer the most stable system," Saul said.

But, Spectrum's IT team at least once reached the seventh node only to get stuck, neither able to finish the upgrade nor roll back, according to Dresen. The active node shut down on the assumption that its partner was alive; but, because the partner node wasn't running yet, the system went down, Dresen said.

"If you have a bug in that update process, you get into a state where the only thing you can do is take the whole cluster down to fix the issue," said Dresen. "And, if you've got all your applications presented through it, you're going to introduce downtime."

Another outage was simply a fluke. An IT staffer had to physically move a server box after the hardware indicated a memory chip failure, even though a code problem was the actual culprit. In the process, he accidentally dislodged the power cord on the partner node, taking down the redundant node pair and the Cerner system with it. (IBM didn't offer dual power supplies with SVC servers back in 2006, when Spectrum purchased its original system; it added the hardware feature in October 2009.)

Spectrum simply wants to do everything that it can to eliminate even the remotest chance of an outage, knowing that any storage system can fail, no matter how reliable it purports to be.

"I tell people here all the time, in my 20-plus years now in the health care IT industry, how many times over and over I've tried to put a dependency on a technology that shouldn't go down," Dresen said, "and it always finds a way to go down."
