How to prepare for long-lasting cloud outages

Cloud outages are more than a technical inconvenience -- they can threaten business continuity, customer satisfaction, security and innovation. Will your business survive one?

Brian Kirsch, Milwaukee Area Technical College

Published: 06 Dec 2024

In an era when businesses rely heavily on the cloud, the harsh reality is that cloud outages still occur more frequently -- and for longer -- than many anticipate.

The allure of the cloud's subscription-based pricing model accelerated the removal of on-site products, and with them, organizations' direct control over the resources they depend on. That control is now in the hands of cloud providers and vendors. These providers and vendors promise reliability, but interruptions can strike like lightning, disrupting services and leaving companies scrambling.

The question isn't whether cloud outages could happen -- because they certainly will -- but whether businesses are prepared for the storms that lie ahead.

The CrowdStrike outage: A wake-up call

The CrowdStrike incident in July 2024 showed that many companies couldn't handle an outage. They had no contingency plans for systems that could not boot up. Cloud outages typically involve the loss of applications; while that is bad, there are some workarounds. For example, contingency plans can rely on backup systems or paper trails. But the CrowdStrike outage was different. This time, core infrastructure went down.

Consider software that needs an active internet connection to a vendor's cloud in order to work. Companies like Microsoft and Adobe have products that can be nonfunctional when users aren't connected to their cloud. Infrastructure is now following suit, and the CrowdStrike incident was a clear wake-up call to everyone. When core infrastructure is dependent on someone else, it can cost organizations hundreds of millions of dollars, as Delta Air Lines found out.

Graphic showing the timeline of CrowdStrike outage events.

A bad update caused the CrowdStrike outage, but that failure is only a short step away from losing the consistent internet connection that cloud-dependent software needs to operate. The event prompted many companies to look at which updates are out of their control. Customers do not control cloud vendor updates. Businesses receive an email telling them about forthcoming updates. Then, users cross their fingers and hope their applications work after any changes. The CrowdStrike outage now has people looking for alternatives and advice on how to handle these cloud outages. But all is not lost.

Open the lines of communication

While customers can't control what happens with their cloud services, they can control how they communicate about them. One of the biggest issues when things go wrong is a lack of information. People get frustrated when they don't know what is going on. Staff doesn't know what to tell customers, and that frustration snowballs into chaos.

The CrowdStrike outage was unusual because the failed infrastructure meant many of the traditional communication methods were inaccessible. I have seen this happen during a data center power outage, where the network operations center (NOC) could not call anyone because the phone list was saved on the computers. The entire raised floor, including the NOC, went down because someone pushed a power shunt button. Of course, a paper printout fixed that, but who has paper printouts anymore?

Communication plans, such as old-fashioned call trees, still work. While I might call my boss at their desk, I still have their cellphone number as a backup. It is also a good idea to seek out alternative communication methods hosted on public services, such as Discord, Google Chat and Slack. While these tools might be frowned upon in normal working conditions, they can be a lifesaver when there is a large-scale outage and the primary forms of communication are down.
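A call tree with fallback channels can be kept as plain data and rendered to text for a printout. The following is a minimal sketch in Python, not a prescribed tool. The names, numbers, channel labels and the `Contact`, `first_reachable`, `call_plan` and `render` helpers are all hypothetical, and the sketch assumes someone already knows which channels are down.

```python
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    # (channel, address) pairs in order of preference. All values are made up.
    channels: list
    reports: list = field(default_factory=list)


def first_reachable(contact, down):
    """Return the first (channel, address) whose channel is not down, else None."""
    for channel, address in contact.channels:
        if channel not in down:
            return channel, address
    return None


def call_plan(root, down):
    """Walk the tree top-down, pairing each person with a usable channel."""
    plan = []
    stack = [root]
    while stack:
        person = stack.pop()
        plan.append((person.name, first_reachable(person, down)))
        # Reverse so reports are visited in the order they were listed.
        stack.extend(reversed(person.reports))
    return plan


def render(plan):
    """Format the plan as plain text suitable for a paper printout."""
    lines = []
    for name, route in plan:
        if route is None:
            lines.append(f"{name}: NO WORKING CHANNEL - escalate in person")
        else:
            lines.append(f"{name}: {route[0]} {route[1]}")
    return "\n".join(lines)


# Hypothetical example tree: a NOC lead with two reports.
EXAMPLE_TREE = Contact(
    "NOC lead",
    [("desk_phone", "x1001"), ("cellphone", "555-0100"), ("discord", "noc-lead")],
    reports=[
        Contact("On-call engineer", [("desk_phone", "x1002"), ("cellphone", "555-0101")]),
        Contact("Help desk", [("desk_phone", "x1003")]),
    ],
)

if __name__ == "__main__":
    # Assume the office phone system went down with the rest of the infrastructure.
    print(render(call_plan(EXAMPLE_TREE, down={"desk_phone"})))
```

Running the plan against a set of failed channels also exposes anyone who has no working fallback, which is the gap to close before an outage rather than during one.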

Don't take a back seat

It's up to customers to be proactive with their contingency planning. Unlike disaster recovery situations, there are few options that can protect users from these outages unless they're willing to invest in dual infrastructure with different software up and down the stacks. That is not going to be a cost-effective option. Companies must also worry about new licensing terms, as Broadcom's acquisition of VMware, and even AT&T's lawsuit against Broadcom, could affect these terms. This is another case of infrastructure vendors holding companies hostage and putting them in a situation where changes or outages can dramatically affect core business functions.

True cloud outage planning calls for companies to examine all their paid services. Vendors will always work to lock their customers into the cloud subscription model, but when enough customers ask what happens to their paid services during cloud outages, it has an effect. Currently, CrowdStrike is looking at implementing more customer control functionality.
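One way to start that examination is a simple inventory that records, for each paid service, whether it works offline and whether the customer controls updates. The sketch below is illustrative only: the `PaidService` fields, the question wording, the ordering rule and the example entries are assumptions for this example, not an established method.

```python
from dataclasses import dataclass


@dataclass
class PaidService:
    name: str
    works_offline: bool              # usable without reaching the vendor's cloud?
    customer_controls_updates: bool  # can we defer or stage vendor updates?
    core_infrastructure: bool        # does losing it stop machines or the network?


def vendor_questions(service):
    """List the questions still open for one service."""
    questions = []
    if not service.works_offline:
        questions.append("What still works if we cannot reach your cloud?")
    if not service.customer_controls_updates:
        questions.append("Can we delay or stage your updates?")
    return questions


def review_order(services):
    """Core infrastructure first, then services with the most open questions."""
    return sorted(
        services,
        key=lambda s: (not s.core_infrastructure, -len(vendor_questions(s)), s.name),
    )


# Hypothetical inventory; replace with the organization's real paid services.
EXAMPLE_SERVICES = [
    PaidService("Office suite", False, True, False),
    PaidService("Endpoint security agent", False, False, True),
    PaidService("On-site backup", True, True, True),
    PaidService("Design tools", False, False, False),
]

if __name__ == "__main__":
    for service in review_order(EXAMPLE_SERVICES):
        print(service.name)
        for question in vendor_questions(service):
            print("  ask vendor:", question)
```

The output is a prioritized list of questions to put to vendors, starting with the services whose failure would stop machines from running at all.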

Keep in mind that vendors want that monthly payment. It is the biggest lever customers can pull to get the information they're going to need. That information won't help them recover when the cloud they use goes down, but it does give valuable insight into how they're going to be affected. That communication is really the key to dealing with, and recovering from, a cloud outage.

Brian Kirsch, an IT architect and Milwaukee Area Technical College instructor, has been in IT for 30 years and holds multiple certifications.

