Bluesky, Squarespace pros give incident, project postmortems

Bluesky saw a massive surge in requests per second after the 2024 U.S. election, while Squarespace engineers had 10 months to migrate 9 million domains from Google.

SANTA CLARA, Calif. -- The Bluesky social media platform and its bare-metal back end had to cope with a mass exodus from X, while Squarespace had 10 months to absorb Google Domains -- and with 90 days to go, it was well behind schedule.

Those were the settings and stakes for the incident and project postmortems that engineers presented at SREcon on Wednesday, along with takeaways for their fellow IT pros. The bottom line: Planning is important, presenters said, but it's crucial to have a collaborative team that's willing to be flexible, adapt quickly and make sometimes risky changes on the fly.

"Oftentimes during an incident, if you make no decision, it can be worse than making the wrong decision," said Jaz Volpert, back-end Go engineer at Bluesky Social PBC. "Decisiveness is very important, and you must weigh the costs of doing nothing versus the cost of trying the thing that you can try."

Bluesky survives Election Day and Backhoe Day

Bluesky has seen steep growth since Volpert joined the company about two years ago. Then, the fledgling, decentralized social network had about 100,000 users. Now, it has 33 million.

Much of that growth was spurred by changes to X, formerly Twitter, after its purchase by billionaire Elon Musk in 2022, since Bluesky offers a similar microblogging interface. When X announced a change to its terms of service in October 2024, mandating that user data help train its AI models, Bluesky's hyperscale growth shifted into its highest gear, according to Volpert.

In less than a month between October and November, in the immediate wake of the U.S. election, Bluesky went from a daily peak of 5,000 requests per second to 50,000.

"I was at the bar one night, and I was watching a report from a Golden State Warriors game, and I saw a Bluesky post on TV, and I said, 'Huh, that's strange,'" Volpert recalled. "A few days later, we saw Bluesky rocket to the top of the free apps [list] in every app store in the U.S., Canada, the U.K. ... lots of major markets."

Bluesky still operates with a small team of 21 full-time employees. Volpert said that in that month of explosive growth, including 11 straight "days of hell" following the U.S. election, about half a dozen people spent more than 16 hours per day in a situation room. Compounding matters was an event that one of Volpert's slides called "Great American Backhoe Day," in which a fiber cable was cut at one of Bluesky's data centers, affecting 50% of its users.

[Photo: Jaz Volpert, Bluesky back-end Go developer, presents an incident postmortem at SREcon Americas 2025.]

This was the point where the cost of indecision was greatest, according to Volpert. An initial attempt to fail over traffic to another data center thrashed its databases to the point where service was degraded for all the platform's users, and the failover had to be rolled back.

"What have we learned from that? Well, it's best to roll with punches," they said. "We had tried this failover before at smaller scales. It worked fine. We had never tried it at this scale before, but there's a first time for everything, so we learned the hard way."

The company also runs its back end on bare-metal servers, which meant it had to absorb that month's spike in demand with fixed server capacity. Bluesky had previously run in the cloud, but its bandwidth-heavy, compute-heavy service made the cloud dramatically more expensive, according to Volpert. And some of the most significant failures the Bluesky team encountered had to do with systems and software the company didn't control.

Other issues Bluesky engineers encountered during their 11-day situation room marathon arose out of stress and exhaustion, such as misconfigurations in proxy server deployments.

"From that [we learned]: Automate all future proxy deploys with human approval. Make sure that's automated. Make sure at least two people take a look at it," Volpert said. "I'm sure we're going to break this rule at some point in the future when something crazy happens again, but we can at least try to practice it so that it's easy to do under pressure."
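The rule Volpert describes, an automated deploy that still waits for sign-off from at least two people, can be sketched as a simple approval gate. The following is a minimal illustration and not Bluesky's tooling: the `ProxyDeploy` class, its fields and the choice not to count the author as a reviewer are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# From the talk: "Make sure at least two people take a look at it."
REQUIRED_APPROVALS = 2


@dataclass
class ProxyDeploy:
    """Hypothetical record of a pending proxy configuration change."""

    config_version: str
    author: str
    approvers: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # Assumption: the author's own sign-off does not count as a review.
        if reviewer == self.author:
            raise ValueError("author cannot approve their own deploy")
        # A set means the same reviewer approving twice counts once.
        self.approvers.add(reviewer)

    def can_ship(self) -> bool:
        # Automation proceeds only once enough distinct people have signed off.
        return len(self.approvers) >= REQUIRED_APPROVALS


deploy = ProxyDeploy(config_version="v42", author="alice")
deploy.approve("bob")
deploy.approve("carol")
ready = deploy.can_ship()  # True: two distinct reviewers approved
```

Putting the check in code rather than in a runbook is one way to make the rule "easy to do under pressure," since the pipeline refuses to proceed until the condition holds.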

The small team also performed admirably when it came to dividing up work into pairs and trios to try multiple solutions in parallel, which allowed for quick responses, they said.

Squarespace absorbs Google Domains on deadline

Unlike Bluesky, website building and hosting company Squarespace had some warning that a major surge in system load was headed its way in 2024. But that didn't eliminate the need to improvise, according to another SREcon presentation.

"It was April 2023 when I went into a one-on-one with the now-CTO of Squarespace, and he proceeded to inform me of this potential acquisition of a thing called Google Domains," said Franklin Angulo, vice president of engineering at Squarespace. "And he also proceeded to inform me that the company needed me to lead the acquisition [migration]."

The terms of the deal, under which Squarespace acquired the Google Domains business and went from web domain reseller to registrar, included a strict 10-month window to nondisruptively migrate all 9 million domains from Google's business to Squarespace, including links to their owners' payment and Google Workspace accounts.

"We were going to transition upward of 9 million domains, which was going to quadruple the number of domains under management for us, and those domains came with ... upward of a million Workspace seats, [which] was going to triple the number of Workspace seats under management," Angulo said.

That wasn't all. In fact, six major projects had to be completed within that 10-month window, said co-presenter Divya Kamat, Squarespace senior manager of engineering. In addition to building a migration engine, engineering teams would also have to integrate with Google's domain reseller API; cope with a massive influx of traffic to a new web front end; adapt to the expanded responsibilities of being a domain registrar in addition to being a reseller; achieve feature parity with Google Domains services that Squarespace didn't have before, such as domain forwarding; and finally, launch a new domains product.

[Photo: Divya Kamat, Squarespace senior software engineering manager, presents a project postmortem at SREcon Americas 2025.]

Within this massive workload, the team tasked with building a new event-driven migration engine to take in streaming domain data from Google consisted of 10 people: nine engineers and one product manager, Kamat said.

The deal closed in September 2023, kicking off the migration window. Squarespace separated the migration project into two phases: making domains eligible for migration by supporting feature parity with Google Domains, and then performing the migration itself. By April 2024, the migration project had reached "a stressful checkpoint," Angulo recalled. With 90 days left, 1.2 million domains were eligible for migration and fewer than half a million had been migrated.

"The implications of not finishing after 10 months was [that] the [Google] servers were going to get shut off, so the domains were going to be dropped on the floor, or things were going to stop working," Angulo said. But at the rate things were going, the migration system would need to run 18 hours a day, five days per week for nine weeks straight to meet the deadline.

"That left a lot of room for things to go wrong," Angulo said. It was time to rethink the system.

The migration team cleared bottlenecks in its migration engine by increasing the number of partitions and consumers in an Apache Kafka data pipeline and consolidating domain forwarding rules into batches to reduce the number of reads and writes to APIs. At the same time, the team working on domain eligibility brought in engineers from other parts of the company to prioritize building features that would make the most domains eligible at a time.
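Both fixes described above follow general patterns that a short sketch can illustrate. The code below is not Squarespace's migration engine: the function names, the batch size of 500 and the `write_batch` callable are hypothetical. It shows how grouping forwarding rules into batches cuts the number of API calls, and why adding Kafka consumers only helps up to the partition count, since each partition is read by at most one consumer in a consumer group.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[dict], size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def write_rules(
    rules: Iterable[dict],
    write_batch: Callable[[list], None],  # hypothetical bulk-write API call
    batch_size: int = 500,                # assumed value, not from the talk
) -> int:
    """Send rules in batches and return the number of API calls made."""
    calls = 0
    for chunk in batched(rules, batch_size):
        write_batch(chunk)
        calls += 1
    return calls


def effective_parallelism(partitions: int, consumers: int) -> int:
    """Consumers beyond the partition count sit idle in a consumer group."""
    return min(partitions, consumers)
```

With these assumptions, 1,050 forwarding rules written one at a time would take 1,050 calls, while `write_rules` with a batch size of 500 makes three.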

In April, the team was migrating 3.2 domains per second, for 11,520 domains per hour. Within three weeks, that increased to nine domains per second. By the end of the project, that number increased to 12 domains per second, up to 1 million per day.
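The throughput figures can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this article and assumes a constant migration rate, which is a simplification; the function and variable names are made up for the example.

```python
SECONDS_PER_HOUR = 3600


def domains_per_hour(rate_per_second: float) -> int:
    """Convert a per-second migration rate to whole domains per hour."""
    return round(rate_per_second * SECONDS_PER_HOUR)


def domains_in_schedule(
    rate_per_second: float, hours_per_day: int, days_per_week: int, weeks: int
) -> int:
    """Total domains migrated at a constant rate over a fixed schedule."""
    return domains_per_hour(rate_per_second) * hours_per_day * days_per_week * weeks


# April rate: 3.2 domains per second.
april_hourly = domains_per_hour(3.2)  # 11,520 per hour

# The schedule Angulo described: 18 hours a day, five days a week, nine weeks.
april_schedule_total = domains_in_schedule(3.2, 18, 5, 9)  # 9,331,200

# Final rate: 12 domains per second, sustained around the clock.
final_daily = domains_per_hour(12) * 24  # 1,036,800 per day
```

At the April rate, the nine-week schedule covers about 9.3 million domains, only slightly more than the 9 million that had to move, which is consistent with Angulo's remark that it left a lot of room for things to go wrong.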

"In the second week of June, we migrated, in a single week, 2.5 million domains, [where] before the acquisition, we had only 2 million domains under management at Squarespace," Angulo said. "And that did not affect any of our systems. There were no outages. There was no effect to our customers."

As with Bluesky's situation room stint, Kamat said there was no substitute for organizational grit during the intense final days of the project.

"At no point did anyone shy away from having to work hard," she said. "I never had to have an awkward conversation with someone saying, 'You just need to work a little bit longer than you normally do.' Everyone felt that onus on their own."

The Squarespace presentation did not address an issue that arose with DNS hijacking for dozens of newly migrated domains, most of them belonging to Web3 companies, in July 2024. A Squarespace statement released July 9 said the migration of domains involved no changes to multifactor authentication, which had been reported by some industry sources as a cause of the issue.

Beth Pariseau, senior news writer for Informa TechTarget, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.
