Build a data center shutdown procedure to prepare for the worst

A data center shutdown checklist helps IT teams focus on backup, testing and system verification before pulling the plug, so that valuable information is not lost.

Although policy and process are critical for modern IT, data center admins are often unprepared to shut things off when the need arises. The need may be as dramatic as an approaching storm, or as mundane as a municipal power grid upgrade. But the way a business prepares for and responds to a facility shutdown can either invite or avert a costly disaster.

A well-conceived and tested data center shutdown procedure plays a vital role in business continuity planning. It defines the best process to migrate or close applications, protect valuable data, shut down physical systems -- and then restart them successfully later. Let's consider the major elements found in a basic shutdown document.

Verify and update system documentation

Every data center shutdown procedure is a prelude to an eventual restart, so proper preparation is key to ensuring a successful restart once the outage period has passed. Create a comprehensive -- or at least current -- documentation set that captures each system's volume, operating system and application configurations, paying special attention to anything that could change unexpectedly during a reboot. There are countless tools to create this documentation, and most modern configuration management and enforcement tools can capture and report system states. Don't forget to record the configuration of any networking equipment and storage arrays.
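As a rough illustration of what a per-system record might contain, the sketch below captures a few basic facts about the host it runs on and serializes them as JSON. The function name, the field names and the choice to report only the root volume are assumptions made for this example; in practice, a configuration management tool would gather far more detail.

```python
import json
import platform
import shutil
from datetime import datetime, timezone


def capture_system_snapshot(volume_path="/"):
    """Return a small dictionary describing this host.

    The field names are hypothetical; adapt them to your own documentation set.
    """
    usage = shutil.disk_usage(volume_path)
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "os": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
        "root_volume": {
            "path": volume_path,
            "total_bytes": usage.total,
            "used_bytes": usage.used,
        },
    }


if __name__ == "__main__":
    # Print the snapshot so it can be redirected to a file and archived.
    print(json.dumps(capture_system_snapshot(), indent=2))
```

Storing one such file per system before the shutdown gives staff a baseline to compare against after the restart.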

Manage dependencies

Actual dependencies vary greatly between organizations and facilities, so IT planners need to decide on a startup sequence that includes network equipment, storage arrays, DNS servers, backup servers and schedulers. Once all of the necessary server, storage, network and critical services, such as DNS, are back online, the startup sequence can move to restart applications, such as databases, followed by dependent applications, such as the corporate sales system. Then, start up any processes that depend on those applications, such as company storefront websites.

During preparation, also identify and understand the many dependencies within your data center. Documenting dependencies allows IT staff to reboot systems, services and applications in their proper order to avoid disruption and lost startup time. For example, you don't want to start a server before starting up the storage array that it depends on.
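Once dependencies are documented, a startup sequence can be derived from them mechanically. The following is a minimal sketch using a topological sort from Python's standard library (Python 3.9 or later). The dependency map is a hypothetical example loosely based on the systems named above, not a recommended layout.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on every item in its set.
DEPENDS_ON = {
    "network": set(),
    "storage_array": {"network"},
    "dns": {"network"},
    "database": {"storage_array", "dns"},
    "sales_system": {"database"},
    "storefront_website": {"sales_system"},
}


def startup_order(depends_on):
    """Return a sequence in which every item comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, name in enumerate(startup_order(DEPENDS_ON), start=1):
        print(f"{step}. {name}")
```

If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding to resolve before a real outage.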

Perform and verify backups

Backups are an important process within any data center, but solid backups are critical before a planned facility outage. Complete and verify any regularly scheduled backups before a shutdown begins, and manually back up any systems that aren't regularly backed up or that have long recovery point objectives.

Traditional backup approaches might seek to capture each server's operating system state along with separate data backups, such as data on a SAN. Virtualized data centers may opt for more recent VM-aware backups, such as snapshots and remote replication. There is no single means or measure for a proper backup -- the process and underlying tools must be suitable for your own data center and business needs -- but the key is to make sure that everything is backed up, and to test those backups to verify they are complete and recoverable.

If preparation time is limited, concentrate on mission-critical backups. However, any system or data not backed up will present a risk to the application and the business.
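One small part of backup verification can be automated by comparing each backup file against a checksum recorded when the backup was taken. The sketch below assumes a hypothetical manifest that maps file paths to expected SHA-256 digests. A matching checksum only shows that the file is intact; it does not prove the backup can be restored, so a test restore is still needed.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backups(manifest):
    """Check each backup file against its expected digest.

    manifest maps a file path to its expected SHA-256 hex digest
    (a hypothetical format). Returns a list of (path, reason) failures.
    """
    problems = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            problems.append((path, "missing"))
        elif sha256_of(path) != expected:
            problems.append((path, "checksum mismatch"))
    return problems
```

An empty result means every listed file is present and unchanged; any other result should be resolved before the shutdown begins.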

Check and verify system hardware

The next step in preparing a data center shutdown checklist is to inspect hardware status and identify any hardware failures. Modern systems management tools can send error reports to email or messaging systems, record events to log files and even track events on comprehensive, real-time dashboards. But not all incidents are addressed immediately. For example, when a disk in a RAID 5 or RAID 6 group fails, its contents may be rebuilt onto a spare disk, but it might take some time before a technician is able to replace the failed disk. Similar issues occur on servers, which may migrate or restart VM workloads on other available systems -- yet the troubled system remains a problem because it hasn't yet been dealt with.

A review of error logs and dashboards won't fix these problems, but it will uncover them before the shutdown, so IT staff will know afterward that they were not caused by the downtime or restart. IT staff can then make an informed decision to address pending incidents before the shutdown, or at least ensure that pending problems will not affect the restart.
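A pre-shutdown review can be as simple as listing every incident that is still open, grouped by system. The sketch below assumes incidents have already been exported from a monitoring tool as dictionaries with `system`, `summary` and `resolved` keys; that record format is hypothetical and will differ between tools.

```python
from collections import defaultdict


def open_incidents_by_system(incidents):
    """Group the summaries of unresolved incidents by the system they affect."""
    pending = defaultdict(list)
    for incident in incidents:
        if not incident["resolved"]:
            pending[incident["system"]].append(incident["summary"])
    return dict(pending)


if __name__ == "__main__":
    # Hypothetical sample records for illustration only.
    sample = [
        {"system": "san-01", "summary": "failed disk awaiting replacement", "resolved": False},
        {"system": "host-07", "summary": "fan alarm", "resolved": True},
        {"system": "host-12", "summary": "memory errors, VMs migrated away", "resolved": False},
    ]
    for system, summaries in open_incidents_by_system(sample).items():
        print(system, "->", "; ".join(summaries))
```

Keeping this list with the shutdown documentation gives staff a record of which problems existed before the power went off.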

Shut down systems in the proper order

Generally, a successful data center shutdown procedure starts at the periphery of the IT environment and works inward. An organization may first log off end users, then shut down applications such as web servers, services such as Exchange, and then databases and middleware. Virtualized environments may then quiesce and shut down VMs, followed by management tools like VMware vCenter or Microsoft System Center. Only then should the IT team shut down the physical servers. Once servers are off, the IT team can shut down storage and network devices. IT teams might also wrap up the shutdown by securing uninterruptible power supply systems, monitors, power distribution units and other ancillary equipment.
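The outside-in sequence can be written down as an ordered list of tiers and walked one tier at a time. In the sketch below, the tier names paraphrase the paragraph above, the placement of storage before network devices is an assumption, and `stop_tier` is a hypothetical callback standing in for whatever tooling actually stops each tier.

```python
# Tiers ordered from the periphery inward; storage-before-network is an assumption.
SHUTDOWN_TIERS = [
    "end-user sessions",
    "applications",
    "services",
    "databases and middleware",
    "virtual machines",
    "virtualization management tools",
    "physical servers",
    "storage arrays",
    "network devices",
    "UPS, PDUs and ancillary equipment",
]


def run_shutdown(tiers, stop_tier):
    """Stop each tier in order, halting at the first tier that fails to stop.

    stop_tier is a hypothetical callable that takes a tier name and returns
    True when that tier has stopped cleanly.
    """
    completed = []
    for name in tiers:
        if not stop_tier(name):
            raise RuntimeError(f"{name!r} did not stop cleanly; halted after {completed}")
        completed.append(name)
    return completed
```

Halting on the first failure keeps an inner tier, such as storage, from being powered off while something that depends on it is still running.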

Restore and verify systems

When the planned outage is over, the IT team can implement the restart process. Ideally, a restart would be the exact reverse of the shutdown, but this isn't always the case. Restarts are often carefully paced to stagger the return of electrical load to the facility and prevent enormous surges that can trip circuit breakers and damage equipment. Each major step also involves some amount of verification or testing to ensure that the gear or software is operating properly before implementing the next startup step.

For example, turn the network gear on and verify that it has booted properly before attempting to start up any storage arrays. Once storage arrays are turned on, check for any failed disks, problematic disk groups and other possible problems.
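The start-verify-pause pattern described above can be sketched as a loop over stages. The stage tuples, the `start` and `verify` callables and the pause length are all hypothetical placeholders; real checks would query the equipment or its management interface.

```python
import time


def staged_restart(stages, pause_seconds=0.0, sleep=time.sleep):
    """Start each stage, verify it, then pause before starting the next one.

    stages is a list of (name, start, verify) tuples, where start and verify
    are callables and verify returns True when the stage is healthy.
    The sleep parameter exists so tests can skip real waiting.
    """
    started = []
    for name, start, verify in stages:
        start()
        if not verify():
            raise RuntimeError(f"{name} failed verification; later stages not started")
        started.append(name)
        sleep(pause_seconds)  # pacing between stages to avoid a power surge
    return started
```

A call might pass stages such as network gear, storage arrays and servers, in that order, with a pause long enough for each stage to finish drawing its startup power.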
