SRE model requires technical, organizational optimization skills
Practitioners define the SRE job description in two ways -- by the design of rapidly evolving distributed systems architectures, and by the ability to optimize human workflows.
BOSTON -- In some IT circles, technical DevOps skills and so-called 'soft' skills concerned with how teams collaborate are separate discussions, but under the SRE model they are inseparable DevOps components.
Make no mistake, technical skills are essential to build the complex yet flexible distributed systems platforms that are the hallmark of the site reliability engineering (SRE) model. However, site reliability engineers must also use a scientific mindset to address how to optimize human work hours. To maintain distributed systems at scale, the overall DevOps organization -- not just the hardware and software -- must run as effectively and efficiently as possible.
"Grumpy humans are really bad at running systems," said Liz Fong-Jones, developer advocate at Google and former leader of the Google SRE team responsible for Bigtable. Fong-Jones spoke from experience about how to optimize human labor at an SRE conference here last week. "Unfair distribution of work prevents system scale," she said.
But optimal human work distribution, in Google's case, does not mean emotional hand-holding for employees. Instead, SREs continually take a dispassionate look at how to organize teams and divide work, the same way they optimize distributed systems of computational machines for high data throughput. For example, Google coined the term "toil budget" when it pioneered the SRE model, to quantify the amount of time DevOps pros spend on manual, repetitive and potentially automatable tasks.
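As a rough illustration of what quantifying toil can look like, the sketch below computes the share of each engineer's logged hours that went to manual, repetitive work and flags anyone over a budget. The `WorkLog` record, the function names and the 50% default threshold are assumptions made for this example; they are not taken from Fong-Jones' talk.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    """Hypothetical record of one engineer's hours for a period."""
    engineer: str
    toil_hours: float      # manual, repetitive, potentially automatable work
    project_hours: float   # proactive engineering work


def toil_fraction(log: WorkLog) -> float:
    """Return the share of logged hours spent on toil (0.0 if nothing logged)."""
    total = log.toil_hours + log.project_hours
    return log.toil_hours / total if total else 0.0


def over_budget(logs, budget=0.5):
    """Return the engineers whose toil share exceeds the budget.

    The 0.5 default is an assumed threshold, used here only for illustration.
    """
    return [log.engineer for log in logs if toil_fraction(log) > budget]


logs = [
    WorkLog("engineer-a", toil_hours=30, project_hours=10),
    WorkLog("engineer-b", toil_hours=10, project_hours=30),
]
flagged = over_budget(logs)
```

A team could run a check like this per quarter to see whether break/fix work is crowding out project time for particular people.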
"Moreover, what you think works -- the naïve solution -- doesn't," said Fong-Jones in her presentation. Fledgling SRE teams often assume that the fairest way to handle IT service desk tickets is to assign them round-robin or to divide the queue evenly among workers, but that didn't work well at Google, because break/fix work inevitably interrupts workers' flow and concentration.
Through trial and error, Google arrived at a better division of labor for Bigtable SRE teams in Dublin and New York, Fong-Jones said. One team would work on sets of break/fix tickets while the other conducted what Google calls interruption reduction projects to reduce the number of tickets and time spent on toil, and the teams switched these roles quarterly. A balance between reactive and proactive work meant less burnout among teams and better system performance for Bigtable, Fong-Jones said.
SREs impart distributed systems wisdom to developers
What unifies the organizational and technical optimization skills demanded by the SRE job description is the need to manage the technical debt that's an inevitable side effect as complex systems rapidly evolve. Platform SREs must continually iterate on infrastructure designs, and constantly refactor automated cloud platforms while production workloads run on them, a job akin to rebuilding a train as it speeds down a track.
"The SRE takes on the burden of technical debt for the business, which can block organizations like Wayfair from moving very fast," said Hemant Kapoor, global head of platform, SRE and cloud engineering at Wayfair, the Boston-based home goods e-commerce company, which hosted the SRE conference last week and presented there on its SRE work. "SREs must be able to quickly come up with little adjustments that enable scale."
One such adjustment is to change the way application servers connect to database servers through connection pools, so that response time doesn't degrade under heavy load as systems grow. This means SREs must know the fundamentals of Unix system design inside and out, Kapoor said.
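The idea behind a connection pool can be sketched in a few lines: a fixed set of connections is opened once and reused, so a traffic spike waits briefly for a free connection instead of opening new ones against the database. This is a minimal sketch under stated assumptions, not Wayfair's implementation; `ConnectionPool`, `fake_connect` and the pool size are hypothetical names and values chosen for the example.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool that reuses connections across requests."""

    def __init__(self, factory, size):
        # Open every connection up front and park it in a thread-safe queue.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=1.0):
        # Block until a connection is free; raises queue.Empty on timeout,
        # which caps the load the database sees instead of growing without bound.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)


created = []


def fake_connect():
    """Stand-in for a real database connect call (an assumption for the demo)."""
    conn = object()
    created.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)

# Ten requests are served by the same two connections.
for _ in range(10):
    with pool.connection() as conn:
        pass
```

The design choice here is to fail or wait at the application tier when the pool is exhausted, rather than pass unbounded connection growth on to the database server.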
Thus, as the lines blur between developer and IT ops roles at Wayfair, skills upgrades aren't just a one-way street. IT ops pros teach important distributed systems skills to developers as they hand over the reins of self-service infrastructure. One SRE team at Wayfair incorporated chaos engineering concepts in the company's disaster recovery tests, so that developers would learn how to design apps for reliability on ephemeral distributed systems.
Under DevOps, both organizational and technical SRE work should be iterative and cyclical. As Wayfair's hybrid cloud platform evolves, so too must the configuration of the teams that service it. Thus, another SRE team at the company focuses entirely on the organizational dynamics of the SRE model, and looks to automate the traditional IT ops break/fix role and its constant interruptions out of existence. Then, those SREs will transition to an embedded-SRE job description within software engineering groups.
"You don't solve the reliability problem just through infrastructure or just through software -- it has to be a joint effort," said Bill Lincoln, associate director of site reliability engineering at Wayfair.