GrubHub's infrastructure-as-code feeds pandemic resiliency

GrubHub SREs fostered more consistent deployment patterns for resources such as DNS and AWS load balancers with a move to self-service infrastructure-as-code tools for DevOps teams last year.

GrubHub's adoption of SRE practices paid off quickly when the COVID-19 pandemic struck the U.S., and those practices continued to expand along with the company's customer base throughout last year.

The pandemic sparked a major increase in customer traffic on the online food delivery business as consumers increasingly stayed out of public places such as restaurants and instead ordered takeout online. In the first quarter of 2020, GrubHub reported 23.9 million active diners, an increase of 24% over the first quarter of 2019.

That number reached 30 million in the third quarter, according to the company's earnings releases. The company's first-quarter gross food sales rose 8% over the first quarter of 2019, to $1.6 billion; by the third quarter, it reported gross food sales of $2.4 billion, up 68% over the third quarter of 2019.

"Before the pandemic… we would see higher order volume in, you know, urban centers, less so suburban [areas]," said Alex Trevino, technical lead in GrubHub's SRE team. "Now there's higher suburban-originating traffic."

This surge in activity represented an equally large increase in demand for back-end IT services, which GrubHub uses to connect customers to more than 300,000 restaurants in the U.S. and London via web and mobile apps. It serves those apps from a combination of two AWS regions, US-East and US-West, as well as a legacy self-owned data center. GrubHub engineers have also written their own container orchestration tooling to manage approximately 9,000 Docker containers in the cloud.

However, engineering teams had already designed GrubHub's AWS infrastructure and 300 application microservices to automatically scale to accommodate massive growth.

"Preparing for higher traffic and building resiliency is a continuous exercise, part of day-to-day culture here," Trevino said. "One of the things that we do, since our business is more active when the weather is colder, leading up to Labor Day, we go through the exercise of reviewing all of our systems to make sure that we're scaled appropriately."

Thus, instead of having to respond to day-to-day scalability issues, GrubHub site reliability engineers (SREs) were able to focus more strategically amid the pandemic on projects such as expanding DevOps teams' use of infrastructure-as-code tools from Pulumi.

Infrastructure-as-code tool speaks developers' language

GrubHub's IT staff didn't experience drastic upheavals during the pandemic, but furthering the use of infrastructure-as-code, in which infrastructure resources are provisioned and updated alongside application code through CI/CD pipelines, helped them accommodate some of the changes that did occur.

"Documentation being correct and updated more frequently has become immensely important, especially since we have folks that are no longer working in the same time zone," said Andrew Blum, senior SRE at GrubHub who led the infrastructure-as-code rollout. "We [can't] stop by each other's desks and pick each other's brains."

Infrastructure-as-code centralizes both infrastructure provisioning and documentation within the company's Git source control system, where SREs have also built in enforcement for documentation updates alongside pull requests.

"We store our documentation with the code," Blum said. "When you go to make a change … you also have a thing in the [pull request] that says, 'Did you change documentation for this?'"
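The article does not describe how GrubHub implements that prompt. As a rough illustration only, a pull request check of this kind could compare the list of changed files against the folders where infrastructure code and documentation live. The directory names, file suffixes and function names below are hypothetical, not GrubHub's.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Hypothetical repository layout; these names are assumptions for illustration.
INFRA_DIRS = ("infra",)
DOC_DIRS = ("docs",)
DOC_SUFFIXES = (".md", ".rst")


def _under(path: str, roots: tuple[str, ...]) -> bool:
    """Return True if the path's top-level directory is one of the roots."""
    parts = PurePosixPath(path).parts
    return bool(parts) and parts[0] in roots


def is_doc(path: str) -> bool:
    """Treat files under a docs directory, or with a doc suffix, as documentation."""
    return _under(path, DOC_DIRS) or PurePosixPath(path).suffix in DOC_SUFFIXES


def needs_doc_reminder(changed_files: list[str]) -> bool:
    """Flag a pull request that changes infrastructure code but no documentation."""
    touches_infra = any(
        _under(p, INFRA_DIRS) and not is_doc(p) for p in changed_files
    )
    touches_docs = any(is_doc(p) for p in changed_files)
    return touches_infra and not touches_docs


if __name__ == "__main__":
    print(needs_doc_reminder(["infra/dns.py"]))
    print(needs_doc_reminder(["infra/dns.py", "docs/dns.md"]))
```

A CI job could feed the changed-file list from the pull request into `needs_doc_reminder` and post the "Did you change documentation for this?" question when it returns `True`.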

Before Pulumi, GrubHub SREs used custom Python scripts to automate infrastructure. Pulumi offered a more systematic alternative to this custom scripting while preserving the Python interface, as opposed to using a domain-specific language (DSL), which is the approach taken by competitors such as HashiCorp's Terraform.

"It's very nice to be able to use our own paradigms, and a natural programming language versus some specific DSL," Blum said.

GrubHub SREs first adopted Pulumi's infrastructure-as-code tools in 2019. In mid-2020, they began to steer developers toward using those tools themselves, rather than requesting infrastructure resources from the SRE team through help desk tickets.

There was some initial resistance to this change among developers, Blum said, but the familiar programming language helped ease the transition.

"They're using their same tools and workflows to interact with this," he said. "And it gives them control and power to do the things they need to do to push their features and products out."

GrubHub's move to infrastructure-as-code also helped SRE teams delegate repetitive infrastructure management work to developers while improving system reliability through repeatable, automated deployments that were subject to quality checks and other tests in CI/CD pipelines.

Infrastructure-as-code improves network management, reliability

One of the most significant systems to make the transition to infrastructure-as-code last year was the company's NS1 Domain Name System (DNS), which translates human-readable web addresses, such as "GrubHub.com," to IP addresses associated with the back-end infrastructure.

In the past, under a previous DNS provider, SREs created and updated DNS servers and records through a help desk ticketing system and a traditional console UI, rather than through infrastructure-as-code. Using Pulumi to update DNS cut down on manual errors and offered consistent centralized management, improving the system's reliability.
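The general idea behind managing DNS as code is that records are declared in a file, and tooling compares that declaration with what is live to produce a reviewable list of changes. The sketch below models that comparison in plain Python. It does not use the Pulumi or NS1 APIs, and the record names, addresses and the `plan` function are hypothetical examples rather than GrubHub's configuration.

```python
from typing import Dict, List, NamedTuple, Tuple

# A record is identified by (name, type), e.g. ("www.example.com", "A").
RecordKey = Tuple[str, str]


class Change(NamedTuple):
    action: str  # "create", "update" or "delete"
    name: str
    rtype: str
    value: str


def plan(current: Dict[RecordKey, str], desired: Dict[RecordKey, str]) -> List[Change]:
    """Compare live records with declared records and list the changes needed."""
    changes: List[Change] = []
    for key in sorted(desired):
        name, rtype = key
        if key not in current:
            changes.append(Change("create", name, rtype, desired[key]))
        elif current[key] != desired[key]:
            changes.append(Change("update", name, rtype, desired[key]))
    for key in sorted(current):
        if key not in desired:
            name, rtype = key
            changes.append(Change("delete", name, rtype, current[key]))
    return changes


if __name__ == "__main__":
    # Example data only: documentation domains and address ranges.
    live = {
        ("www.example.com", "A"): "192.0.2.10",
        ("old.example.com", "CNAME"): "legacy.example.com",
    }
    declared = {
        ("www.example.com", "A"): "192.0.2.20",
        ("api.example.com", "A"): "192.0.2.30",
    }
    for change in plan(live, declared):
        print(change.action, change.name, change.rtype, change.value)
```

Because the declared records live in source control, a change list like this can be attached to a pull request and reviewed before anything is applied.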

Before adopting infrastructure-as-code, a select number of engineers had access to the previous DNS provider's console, but not all of them knew the full context of DNS changes, Blum said in a 2020 NS1 conference presentation.

Under the Pulumi system, by contrast, every change to DNS must go through peer review. This encourages collaboration between app developers and those more steeped in the intricacies of DNS, improving the accuracy and efficiency of updates.

"There [are] no surprise DNS changes," Trevino said in an interview. "We have a process that lets us see who did it, why and when."

Infrastructure-as-code has also made security certificate updates and other changes to the company's AWS load balancers more consistent, Blum added in the interview. As with DNS, only a few engineers have access to the production AWS console. Updating through Pulumi requires a change to a single file that corresponds to a grouping of resources that may cross AWS regions, rather than multiple manual changes to each resource through the console.
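To illustrate the single-file pattern in general terms, the plain-Python sketch below expands one group definition into a per-region configuration. It is not Pulumi or AWS SDK code, and the class names, region identifiers and certificate labels are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ListenerGroup:
    """One definition that covers load balancer listeners in several regions."""
    name: str
    port: int
    certificate_by_region: Dict[str, str]  # region -> certificate label


@dataclass
class ListenerConfig:
    """The concrete settings for a single region's load balancer listener."""
    region: str
    load_balancer: str
    port: int
    certificate: str


def expand(group: ListenerGroup) -> List[ListenerConfig]:
    """Turn one group definition into one config per region, in sorted order."""
    return [
        ListenerConfig(
            region=region,
            load_balancer=f"{group.name}-{region}",
            port=group.port,
            certificate=certificate,
        )
        for region, certificate in sorted(group.certificate_by_region.items())
    ]


if __name__ == "__main__":
    # Placeholder values; real region names and certificate IDs would differ.
    web = ListenerGroup(
        name="web",
        port=443,
        certificate_by_region={"us-west-2": "cert-b", "us-east-1": "cert-a"},
    )
    for config in expand(web):
        print(config)
```

In this model, rotating a certificate means editing one entry in `certificate_by_region`; every regional configuration is regenerated from the same definition, which is where the uniformity comes from.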

"There's a lot of uniformity and other things that [infrastructure-as-code] brings to the table that you can totally miss when you're doing it by hand," Blum said.

GrubHub SREs aspire to a full-fledged GitOps approach to application and infrastructure updates, the purist definition of which requires any updates to Git code repositories to be immediately and automatically deployed.

That's still a work in progress, Blum said.

"There's a lot to cover -- we have a lot of resources that were originally created by hand [and] part of this project is to create tooling to import everything, in addition creating new resources," he said. "This is a long project that we are undertaking."
