Massive Slack outage caused by AWS gateway failure
An AWS Transit Gateway failure led to Slack being down for nearly five hours earlier this month. The companies said they made changes to prevent such outages in the future.
Millions of Slack subscribers returning to work from the holiday break earlier this month overloaded a gateway at cloud provider AWS, setting off a series of events that took the messaging service down for hours.
Slack released a root cause analysis report to the media this week, detailing how AWS problems set off a domino effect that left the service inaccessible. Slack relies entirely on AWS for its cloud hosting.
Slack declined to discuss the problems related to the AWS Transit Gateway. However, a source familiar with the matter confirmed that the gateway failed to scale up fast enough to handle the incoming traffic.
The nearly five-hour outage on Jan. 4 began about 9 a.m. EST, with customers immediately experiencing occasional errors. By 10 a.m., the service was unusable for all subscribers.
The gateway problem contributed to packet loss between servers within the AWS network, which worsened over time. That led to an increase in error rates from Slack's back-end servers. Slack's IT team did not discover the escalating problem until almost an hour after it started.
At the same time, Slack experienced network problems among its back-end servers, other service hosts and its database servers. The trouble left the back-end servers handling too many high-latency requests. While those requests made up only 1% of incoming traffic, they consumed about 40% of back-end server time, pushing the servers into an "unhealthy" state.
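Slack's report does not break down the latencies involved, but a rough worked example shows how a 1% slice of slow requests can eat roughly 40% of server time. The per-request service times below are assumptions chosen only to reproduce the reported figures, not numbers from Slack.

```python
# Hypothetical back-of-the-envelope illustration: with a fixed pool of
# workers, a small fraction of very slow requests can dominate total
# processing time. The 50 ms / 3,300 ms service times are assumptions.
fast_share, slow_share = 0.99, 0.01      # 99% fast requests, 1% slow
fast_ms, slow_ms = 50, 3300              # assumed per-request service times

fast_time = fast_share * fast_ms         # time spent serving fast requests
slow_time = slow_share * slow_ms         # time spent serving slow requests
total = fast_time + slow_time

print(f"slow requests consume {slow_time / total:.0%} of server time")
# -> slow requests consume 40% of server time
```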
"Our load balancers entered an emergency routing mode where they routed traffic to healthy and unhealthy hosts alike," Slack said. "The network problems worsened, which significantly reduced the number of healthy servers."
The result was too few servers to meet Slack's capacity needs, which left customers receiving error messages or unable to load Slack at all.
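Slack has not published its load balancer configuration, but the "emergency routing mode" it describes follows a common panic-routing pattern: once the share of healthy hosts drops below a threshold, the balancer spreads traffic across every host rather than overload the few that remain. The sketch below illustrates that general pattern only; the threshold, host list and health data are hypothetical.

```python
import random

# Illustrative sketch of "panic" (emergency) routing: when too few hosts
# pass health checks, route to all hosts, healthy or not, instead of
# piling all traffic onto the remaining healthy ones.
PANIC_THRESHOLD = 0.5  # assumed: fall back to all hosts if < 50% are healthy

def pick_backend(hosts: list[dict]) -> dict:
    healthy = [h for h in hosts if h["healthy"]]
    if len(healthy) / len(hosts) < PANIC_THRESHOLD:
        # Emergency routing mode: include unhealthy hosts, on the theory
        # that the health checks themselves may be unreliable.
        return random.choice(hosts)
    return random.choice(healthy)

# Example pool: every third host is marked unhealthy.
hosts = [{"name": f"backend-{i}", "healthy": i % 3 != 0} for i in range(9)]
print(pick_backend(hosts))
```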
The network instability prevented Slack engineers from accessing their observability platform, a type of network management system, which complicated the debugging process.
Amazon eventually helped Slack fix the problem, increasing network capacity and lifting a rate limit on the AWS Transit Gateway that had prevented Slack from provisioning new back-end servers to handle the traffic.
To prevent such problems from happening again, Amazon increased its network traffic systems' capacity and moved Slack to a dedicated network.
"It's a great idea from the Slack perspective," said Irwin Lazar, principal analyst at Metrigy. "They're not fighting over other providers for resources."
Slack's report outlined the measures it took to avoid similar mishaps in the future. Slack documented new procedures for debugging its systems without its observability platform and prepared methods to configure some services to reduce network traffic. By Feb. 12, Slack plans to create an alert system for packet rate limits on the AWS network, increase the number of workers provisioning servers and improve its network management system.
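The report does not say how the packet rate alerts will be built. The following is a minimal sketch of the idea, with a hypothetical per-host limit, alert threshold and metric reading standing in for whatever Slack actually monitors.

```python
# Minimal sketch of a packet-rate alert. The limit, threshold and sample
# reading are assumptions; Slack has not published its thresholds or tooling.
PACKET_RATE_LIMIT = 1_000_000   # assumed packets/sec ceiling on the network path
ALERT_AT = 0.8                  # warn at 80% of the limit

def check_packet_rate(current_pps: float) -> None:
    if current_pps >= ALERT_AT * PACKET_RATE_LIMIT:
        print(f"ALERT: packet rate {current_pps:,.0f} pps is at "
              f"{current_pps / PACKET_RATE_LIMIT:.0%} of the limit")

check_packet_rate(850_000)  # example reading -> fires the alert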
Amazon and Slack announced a partnership last June. The messaging app became the de facto communication standard for Amazon, and Amazon Chime became Slack's audio and video calling service. However, Chime has not experienced the growth that Microsoft Teams and Zoom did during the COVID-19 pandemic.
Salesforce has since agreed to acquire Slack, but that shouldn't affect the Amazon and Slack partnership, Lazar said. Amazon does not compete directly with Salesforce.
"The biggest challenge that companies like Slack have is they have to be careful about being too reliant on a single cloud provider," Lazar said. "Cloud providers have outages. That's just the nature of the beast."