An introduction to SRE documentation best practices

SRE documentation stands distinct from other types of IT documentation, not least because it's a core responsibility for site reliability engineers. What else makes it distinct?

Site reliability engineering documentation practices continue to mature as more large enterprises move their business-critical applications to the cloud and the site reliability engineer role goes mainstream.

SRE documentation vs. traditional IT documentation

Traditional documentation efforts, such as IT documentation, fall under the support function umbrella. Consequently, they are easy to deprioritize and often end up short on resources. SRE documentation, in contrast, is part of the site reliability engineer's job description, so teams can't ignore it as an integral element of SRE best practices.

DevOps and IT documentation might have varied authors, such as developers, QA admins or technical writers. A great SRE should also be a talented communicator. Infrastructure architecture SREs are the primary authors of runbooks, internal tools documentation and procedures. Consider this type of SRE as the communicator of the SRE team.

Foundational SRE documentation

Some common documentation types are foundational to the SRE role and operational best practices.

Production readiness review documentation

When SREs onboard new systems, they conduct a production readiness review (PRR) to ensure that new systems meet their organization's readiness standards. Take the time to create a PRR template that captures those standards. To minimize human error, start each review from the template rather than reusing a previous PRR document.

Writing a PRR requires IT teams to be as descriptive as possible when documenting system readiness. If you lack the information to complete a section, document why you don't have it.
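
As a minimal sketch of the "document why you don't have it" rule, a PRR template could be modeled so that every section carries either content or a stated reason for the gap. The class names, section titles and the `checkout-api` system below are hypothetical illustrations, not part of any standard PRR format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PRRSection:
    title: str
    content: str = ""
    missing_reason: str = ""  # why the information is unavailable, if it is

    def is_resolved(self) -> bool:
        # A section counts as resolved if it has content OR an explanation.
        return bool(self.content.strip() or self.missing_reason.strip())


@dataclass
class PRRDocument:
    system: str
    sections: List[PRRSection] = field(default_factory=list)

    def unresolved(self) -> List[str]:
        """Titles of sections with neither content nor a stated reason."""
        return [s.title for s in self.sections if not s.is_resolved()]


# Hypothetical example data for illustration only.
prr = PRRDocument(
    system="checkout-api",
    sections=[
        PRRSection("Architecture and dependencies",
                   content="Stateless service behind a load balancer."),
        PRRSection("Capacity plan",
                   missing_reason="Load test is scheduled; no traffic data yet."),
        PRRSection("Monitoring and alerting"),
    ],
)

print(prr.unresolved())  # ['Monitoring and alerting']
```

A reviewer could run a check like this before the review meeting to see which sections still need attention.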

Choose a reviewer for your team's PRR documentation based on who has the expertise to ask the right questions about system readiness. The reviewer should collaborate with the PRR creator to determine whether the information in the PRR documentation is complete enough to attest to production readiness.

Service overviews

A service overview document guides SRE troubleshooting. SREs across all shifts must understand the system architecture, components, dependencies and service contracts for every service they support. Consequently, service overviews are high-priority documents. Common elements of a service overview include the following:

  • service description;
  • links to other information sources, such as monitoring dashboards and operations documentation; and
  • reference architecture.

Creating a service overview should be a collaborative effort between the development and SRE teams. This enables both teams to design an overview that prioritizes how the SRE team approaches troubleshooting. Choose an internal platform, such as a wiki, for publishing your service overview to ensure that it remains accessible to SRE teams.

Service overviews aren't a one-time effort. Teams must invest time in updating service overviews as services change -- and as new dependencies appear. SRE teams often produce service overviews as an output of the PRR process.
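
One way to keep overviews from going stale is to record a review date alongside the common elements and flag anything that hasn't been revisited recently. The sketch below assumes a 90-day review window and a `last_reviewed` field; both are illustrative choices, as are the service names and wiki URLs.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class ServiceOverview:
    service: str
    description: str
    links: List[str]              # dashboards, operations documentation
    reference_architecture: str   # for example, a link to a diagram
    last_reviewed: date           # assumed field, not a required element


def stale_overviews(overviews: List[ServiceOverview],
                    today: date,
                    max_age: timedelta = timedelta(days=90)) -> List[str]:
    """Return the services whose overview is older than max_age."""
    return [o.service for o in overviews if today - o.last_reviewed > max_age]


# Hypothetical example data for illustration only.
overviews = [
    ServiceOverview("checkout-api", "Handles order submission.",
                    ["https://wiki.example.com/checkout/dashboards"],
                    "https://wiki.example.com/checkout/architecture",
                    last_reviewed=date(2024, 5, 1)),
    ServiceOverview("billing", "Issues invoices and refunds.",
                    ["https://wiki.example.com/billing/dashboards"],
                    "https://wiki.example.com/billing/architecture",
                    last_reviewed=date(2024, 1, 15)),
]

print(stale_overviews(overviews, today=date(2024, 6, 1)))  # ['billing']
```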

Playbooks

Playbooks, sometimes called runbooks, are core operations documents that enable on-call engineers to respond to service monitoring alerts. A well-crafted, well-maintained playbook reduces the time it takes to mitigate an incident because it contains troubleshooting procedures and links to operations and monitoring consoles.

SRE teams are increasingly turning to automation to create playbooks. Popular tools include Siemplify (now part of Google Cloud), Swimlane and Jupyter.

Common elements of an SRE playbook include the following:

  • definition of an incident for your organization;
  • designation of incident response roles and responsibilities;
  • standardized incident response procedures and workflows, reviewed and tested by an SRE; and
  • cheat sheets and checklists for SRE incident response.

Playbook creation best practices include the following:

  • Begin each playbook with a trigger, such as a monitoring alert.
  • Structure playbook entries based on severity, impact, metric, background, mitigation and discovery.
  • Automate every action -- including simple steps -- to remove as much human error as possible from the incident response process.
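
The entry structure above can be sketched as a small data type that always renders the trigger first, followed by severity, impact, metric, background, mitigation and discovery. The field contents, the `SEV2` label and the alert name are hypothetical; a real team would substitute its own conventions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlaybookEntry:
    trigger: str           # the alert that sends an engineer to this entry
    severity: str
    impact: str
    metric: str
    background: str
    mitigation: List[str]  # ordered steps; each is a candidate for automation
    discovery: str

    def render(self) -> str:
        """Return the entry as plain text, trigger first."""
        lines = [
            f"Trigger: {self.trigger}",
            f"Severity: {self.severity}",
            f"Impact: {self.impact}",
            f"Metric: {self.metric}",
            f"Background: {self.background}",
            "Mitigation:",
        ]
        lines.extend(f"  {n}. {step}"
                     for n, step in enumerate(self.mitigation, start=1))
        lines.append(f"Discovery: {self.discovery}")
        return "\n".join(lines)


# Hypothetical example data for illustration only.
entry = PlaybookEntry(
    trigger="HighErrorRate alert on checkout-api",
    severity="SEV2",
    impact="Some customers cannot complete purchases.",
    metric="HTTP 5xx ratio above 5% for 10 minutes",
    background="Usually follows a bad deploy or a failing dependency.",
    mitigation=[
        "Check the latest deploy.",
        "Roll back if the deploy correlates with the alert.",
    ],
    discovery="Compare error logs across instances to find the failing component.",
)

print(entry.render())
```

Keeping mitigation as an ordered list of discrete steps makes it easier to replace individual steps with automation later.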

Playbooks tend to be culturally dependent, so consider any template that you find online to be a starting point, not a roadmap. Invest time in getting feedback from both SREs and the operations team about their playbook requirements, and craft your template to meet your team's needs.

Post-mortems

The bigger the cloud gets, the harder it can fall. In an era of significant cloud outages, your organization must define its post-mortem criteria before a triggering incident occurs. A typical post-mortem document includes the following:

  • a management summary capturing the incident's effects and root cause;
  • a technical summary of the incident's effects on the business, users, teams and systems, including approximate response times, detection method and the solution the SRE team applied to resolve the incident;
  • incident background with additional detection details and screen captures of monitoring graphs, timeline, root causes and resolution information; and
  • lessons learned, including details about what went well and what went poorly during incident resolution.
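
So that authors don't start from a blank page, the four parts listed above could be generated as a skeleton for each new incident. This is only a sketch: the function name, the prompt wording and the plain-text layout are assumptions to adapt to your own wiki or collaboration platform.

```python
# Section headings follow the list above; the prompts are illustrative.
POSTMORTEM_SECTIONS = [
    ("Management summary",
     "Effects of the incident and its root cause."),
    ("Technical summary",
     "Effects on the business, users, teams and systems; approximate "
     "response times; detection method; solution applied."),
    ("Incident background",
     "Detection details, monitoring graphs, timeline, root causes "
     "and resolution information."),
    ("Lessons learned",
     "What went well and what went poorly during resolution."),
]


def postmortem_skeleton(incident_title: str) -> str:
    """Build a plain-text post-mortem skeleton with a prompt per section."""
    parts = [f"Post-mortem: {incident_title}", ""]
    for heading, prompt in POSTMORTEM_SECTIONS:
        parts.extend([heading, f"  [{prompt}]", ""])
    return "\n".join(parts).rstrip("\n") + "\n"


# Hypothetical incident title for illustration only.
skeleton = postmortem_skeleton("Checkout outage")
print(skeleton)
```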

Establish clear and comprehensive templates for post-mortem procedures to set organization-wide standards. The best templates live on a wiki or another online platform and include writing tips and guidance for SREs to follow. SREs shouldn't start from a blank page when writing SRE documentation. Post-mortems should be accessible and searchable via your organization's internal collaboration platform.

Culture drives reviews, not the mere existence of documentation. Google's SRE guidelines, for example, dictate that an organization must establish a post-mortem culture beyond documentation to avoid the anxiety of a team's walk of shame after an incident. Post-mortems aren't meant to sink to the bottom of email inboxes -- these documents warrant stakeholder review.

Policy documentation

Operating large-scale complex systems demands both technical and nontechnical policies for production. Policy documentation details mandates for production tasks, such as change logging, log retention, internal service naming, and emergency credential access and use.

Developing policy documentation involves creating and maintaining standard documentation templates. Some organizations train SREs on writing policy documentation as part of their onboarding process.

Because SRE documentation best practices are integral to the SRE job role, SRE documentation often escapes many of the challenges that traditional IT and DevOps documentation faces when competing for resources and engineer attention. As with other SRE and DevOps practices, organizations should grant SRE teams the time to continuously refine their documentation strategy, processes and tools to ensure that SRE documentation grows as an asset for overall systems operations.
