alphaspirit - Fotolia
ICYMI: Advice for aspiring site reliability engineers
IT ops pros looking to land an SRE job can use these four articles to absorb the key responsibilities and challenges that site reliability engineering entails.
As DevOps takes hold in the enterprise, organizations lean on site reliability engineers to help tear down that metaphorical wall between development and operations teams.
The discipline of site reliability engineering -- or SRE, an acronym that's interchangeably used for the job title of site reliability engineer -- aims to automate repetitive, manual operations tasks, as well as measure and maintain IT systems and software reliability. Other responsibilities range from the design and management of complex, distributed architectures to the optimization of the human side of DevOps workflows.
But what, exactly, does the path to an SRE role look like? And what can IT pros expect once they're there?
In case you missed it, veteran TechTarget journalist Beth Pariseau uncovered a number of telling themes -- from the hurdles admins face during a transition to the role to some general dos and don'ts for infrastructure automation -- with input from SREs in diverse industries. Get up to speed with this article roundup.
Prepare for cultural roadblocks
Site reliability engineering might emphasize automation, but the process to establish an SRE team is far from automatic.
The challenge for many IT ops professionals who want to embrace the SRE model is a tough, almost cyclical one -- they're simply too bogged down with manual, day-to-day processes to implement the levels of automation that would reduce, and ultimately eliminate, those same tedious tasks.
Like so many other IT-related challenges, this issue is just as much cultural as it is technical. SREs at the SREcon event in March 2019 in New York cited two important steps to jump-start a transition to, or an implementation of, the role. First, sell upper management on the benefits of reliability engineering, such as increased efficiency through automation, to get them on board and invested. And, second, get your priorities straight.
"Even if 80% of the team is dedicated to firefighting, the rest can tap into automation to get rid of tedious work," said Arnaud Lawson, senior infrastructure software engineer at Squarespace, a website creation company in New York, at SREcon.
For example, IT ops pros should look for opportunities to give runbooks to help desk admins to streamline break/fix processes. This practice, in turn, affords the ops team more time to work on automating those processes down the line.
Know the limits of automation
As the company credited with defining the SRE role, Google and its engineering team know a thing or two about infrastructure automation -- including its limitations and risks.
Aspiring SREs should be careful not to view infrastructure automation, in and of itself, as the end goal of any IT project. This outlook downplays the importance of human judgment and might cause teams to overlook the business objectives that drive IT initiatives.
For example, while a team of Google site reliability engineers used automation to deploy a new Google Cloud Platform region in Mumbai in 2018, the primary goal of that project was not automation itself; it was to enable Google to meet its cloud service-level objective for its users. According to an SREcon presentation from Max Luebbe, a Google SRE, the team actually chose not to automate certain tasks, such as load balancer configurations, because humans could perform them quickly enough.
"Instead of thinking about automation as a way to replace people, I like to think of it as a way to write programs to amplify people," Luebbe said at the conference. "Human judgment has value -- it's not something you should look to eliminate but look to how you can get more use and impact out of it."
More advice for successful infrastructure automation? Don't try to tackle it all at once. Instead, maximize productivity and satisfy project stakeholders by implementing automation tools and practices in iterations, Luebbe advised.
Enlist the help of SRE software
While most SRE processes evolve naturally from DevOps teams' tribal knowledge, third-party software tools can -- and, in some cases, should -- play a role.
That was the case at least with Procore Technologies, a construction management software firm in Carpinteria, Calif. The company adopted Blameless SRE software, which integrates with collaboration platform Slack, to streamline incident management and post-mortems.
Blameless defines and divvies up responsibilities between SRE team members as incidents occur and provides them with a summary of the issue, as well as status updates, in Slack. For post-mortems, the tool records the events of an incident -- something that, previously, a Procore SRE team member would have to do -- and also creates a follow-up list of action items, captures relevant code and uses SREs' Slack messages to generate incident timelines.
Other tools with similar capabilities for incident management automation are on the market, including Atlassian Jira Ops and BigPanda.
Carefully unleash distributed systems monitoring
In complex, distributed IT environments, monitoring tools must ensure and maintain high performance. Without careful evaluation and testing, these tools might do just as much harm as good.
Site reliability engineers should focus on log monitoring and synthetic monitoring tools. These tools present highly detailed information, potentially at the expense of system performance. SRE teams should evaluate monitoring tools' demanding data extraction and collection methods and weigh the benefits of information with the downsides of gathering it, site reliability experts warned at SREcon.
"I'm not saying don't log; I'm saying log smarter," said Danny Chen, a trading solutions SRE at Bloomberg, the financial and media company based in New York, in a presentation at SREcon. "Understand what the costs are as you make decisions."
SRE teams should test exactly how a monitoring tool might affect distributed system performance under different loads. They should also carefully manage synthetic monitoring tools, which simulate user interaction with a system, and rein in any scripts that degrade performance.