Site Reliability Engineering

How Google Runs Production Systems

To check whether my internet connection is working, I usually type «test» into my browser and see if Google returns any results. On closer reflection, this action reveals a surprising assumption of mine: I consider it far more likely that the connection between my laptop and my internet service provider is broken than that the Google search service – involving Google’s DNS server, its Global Software Load Balancer, the Google Frontend, the application frontend, and the application backend – is down. And for good reason: it usually is my internet connection.

In the book Site Reliability Engineering: How Google Runs Production Systems, Google Site Reliability Engineers (SREs) outline the core ideas, practices, and guidelines that enable them to manage their systems with such reliability. The SRE discipline itself was invented at Google. I have read the book and summarized it on my personal knowledge hub, Brain. If you are interested in a full summary of the book, including the many technical ideas it contains that I will not delve into here, you can find it there.

This blog post is a reflection on how three core SRE principles have shaped my own work as a DevOps engineer at a startup client.

Rapid Innovation versus Product Stability

The Idea

There is an inherent conflict between the goal of rapid innovation, such as developing new features, and product stability, such as performance and availability. While rolling out many changes to production increases feature velocity, it makes running services reliably more difficult. To optimize user happiness, it is important to strike a balance between the two. In practice, an informal balance is often adopted – based not least on the teams’ negotiation skills – which may not be ideal.

The SRE approach strikes this balance using an objective measure: a quarterly error budget derived from the service-level objective (SLO), as discussed in Chapter 3 of the book. As long as the measured uptime is above the SLO, new features can be released. Once the error budget is depleted, Google blocks further releases, prompting engineers to reconsider (and likely improve) the software’s fault tolerance, testing, and release processes. Thus, the error budget provides a common incentive to find the right «innovation versus stability» trade-off.
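As a quick illustration (my own back-of-the-envelope sketch, not a formula from the book, and assuming a 91-day quarter), the downtime an availability SLO permits per quarter can be computed directly:

```python
# Quarterly error budget implied by an availability SLO.
# Assumes a 91-day quarter; the function name and figures are illustrative.
def error_budget_minutes(slo: float, days: int = 91) -> float:
    """Return the downtime (in minutes) allowed per quarter at the given SLO."""
    total_minutes = days * 24 * 60
    return (1.0 - slo) * total_minutes

# A 99.9% availability SLO leaves roughly 131 minutes of downtime per quarter;
# a 99.99% SLO leaves only about 13 minutes.
print(round(error_budget_minutes(0.999)))
print(round(error_budget_minutes(0.9999)))
```

Once the measured downtime exceeds this number, the release block kicks in until the next quarter resets the budget.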

As an additional benefit, this process highlights essential aspects of reliably running a service. First, sensible SLOs must be defined (Chapter 4). This requires thinking deeply about (and finding out) what users actually care about. Only once the definition of «reliable» is clear for a particular service can one work toward that goal. Second, proper monitoring must be in place: to establish an objective quarterly error budget, there must be an objective measurement of uptime (Chapters 6 and 10). Third, the mean time between failures (MTBF) and the mean time to repair (MTTR) become relevant in order to avoid depleting the error budget. Topics such as testing (Chapter 17), on-call duties (Chapter 11), effective emergency response (Chapter 13), and effective incident management (Chapter 14) are emphasized.

Applying It

At my client, there is a strong focus on rapid innovation. After all, this is to be expected from a start-up whose main selling point is innovation. At the same time, given that critical infrastructure is involved, reliability and security must not be neglected.

Early on, we prioritised improving the monitoring of deployments. Initially, we relied on the monitoring offered by the cloud provider, implementing basic dashboards and alerts within their system. Since then, we have built our own monitoring stack based on OpenTelemetry, Prometheus, Elasticsearch and Grafana, which allows us to export custom application-level metrics for our stack. Achieving good observability is hard, and I am still not entirely satisfied with our current setup. Some alerts are still too noisy, some metrics are exported that are not used, and some dashboards are too cluttered. However, having basic observability in place that is properly persisted is indispensable as a foundation for reliability.
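To make the idea of custom application-level metrics concrete, here is a minimal, dependency-free sketch of the Prometheus text exposition format in which such metrics are scraped (the metric name and labels are invented for illustration; in practice a client library such as prometheus_client or an OpenTelemetry exporter generates this):

```python
# Render a counter metric in the Prometheus text exposition format.
# Metric name and labels are illustrative, not our actual metrics.
def render_counter(name: str, help_text: str, samples: dict) -> str:
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label, value in sorted(samples.items()):
        # Each sample line has the form: metric_name{label="value"} sample_value
        lines.append(f'{name}{{status="{label}"}} {value}')
    return "\n".join(lines) + "\n"

print(render_counter(
    "orders_processed_total",
    "Number of orders processed.",
    {"success": 41.0, "failure": 1.0},
))
```

Exposing a few such counters per service is often enough to turn vague «the app feels slow» reports into dashboards and alerts.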

We currently do not work with enforced internal error budgets. Stop-the-line policies seem daunting to implement, particularly for smaller companies, even though I believe they would be beneficial (e.g. introducing clean-up weeks). I was told that the official SLAs use uptime as the underlying SLI, where the software is considered up when its core features are functional. In practice, we DevOps engineers have defined up as «all Kubernetes deployments are ready». To this end, we have implemented and tuned the livenessProbes and readinessProbes of our deployments, thereby improving the MTTR via Kubernetes’ self-healing capabilities. Now that we also have proper monitoring in place, we have started tracking additional custom health checks for some services. 
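Probe tuning of the kind described above might look as follows in a Deployment’s container spec (a hedged sketch: the endpoints, port, and timings are placeholders, not our actual values):

```yaml
# Illustrative probe configuration; paths, port, and thresholds are placeholders.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3   # kubelet restarts the container after ~30s of failures
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2   # traffic is withheld quickly and resumes once ready
```

The liveness probe drives the self-healing (and thus the MTTR improvement) mentioned above, while the readiness probe feeds directly into our «all Kubernetes deployments are ready» definition of up.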

In our weekly meetings with the CTO, we document all downtime. Although this does not result in policies as strict as a release block, we believe that raising awareness of outages is an important first step in optimising the «innovation versus stability» balance.

Automating Toil

The Idea

SRE replaces the traditional sysadmin approach of manually running services and handling events by applying software engineering to operations. First, this approach keeps the development (Dev) and operations (Ops) teams together, further helping to establish the common incentive discussed in the previous section. Second, it enables SREs to automate manual tasks with no enduring business value, known as toil (Chapter 5). This benefits the company by leading to more consistent systems, a platform that can be extended or even spun out for profit, and time savings (Chapter 7). While manual work scales linearly with the number of employees, automation allows for sublinear growth. Employees benefit by having more interesting work and better career prospects.

Google limits the operational work of SREs to 50% per person. They spend the remaining work hours automating toil through engineering. Due to their extensive company-specific production knowledge, SREs are particularly well suited to designing highly effective production environment software (Chapter 18). Additionally, capping operational work helps distribute it evenly among the SREs on a team, preventing both operational overload and underload (Chapter 11). While operational overload is clearly undesirable, operational underload is arguably worse, because losing touch with the production system reduces an SRE’s ability to respond effectively to emergencies.

Applying It

I find the idea of devoting 50% of one’s work time to operational work and 50% to engineering work powerful. Initially, we were drowning in operational work (toil, mostly). Although it was difficult not to spend all day putting out fires, it was possible to invest some time in automation here and there. I took on the task of implementing infrastructure as code, automating the provisioning of cloud resources via Terraform. This now enables us to provision resources much faster, improves the consistency in our setups and ensures that the infrastructure is properly documented in version control. Another engineer automated our Helm deployments using a generalised helper script that knows the location of a deployment, the version of the chart it is based on, and the configuration it uses. This means we can redeploy in seconds instead of having to figure out all these details manually. Further automation work is ongoing, including automatically scanning components and flagging known CVEs or misconfigurations, improving CI/CD pipelines with pre-commit hooks and an AI agent that reviews MRs, and implementing GitOps.
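For flavour, infrastructure as code of the kind described above looks roughly like this in Terraform (a hypothetical sketch: the provider, resource, names, and settings are placeholders, not our actual infrastructure):

```hcl
# Illustrative Terraform resource; provider, names, and settings are placeholders.
resource "google_storage_bucket" "backups" {
  name     = "example-backups"
  location = "EU"

  versioning {
    enabled = true
  }
}
```

Because the desired state lives in version control, every change is reviewed, reproducible, and documented by construction, which is exactly the consistency benefit the book attributes to automation.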

While operational work is still distributed somewhat unevenly within our DevOps team – some engineers focus primarily on Ops, while others rarely engage with deployments – I have personally tried to strike a balance between the two. In line with the SRE book, I find that combining first-hand experience of the current production system with engineering work to improve it through automation is fruitful. Furthermore, it is crucial for the company that more than one person is familiar with any given deployment setup, to avoid a single point of failure among employees. Despite a few hiccups here and there, we generally manage to share operational know-how well within the team.

Fly the Plane

The Idea

Few people naturally respond well to emergencies; doing so requires preparation and periodic, pertinent, hands-on training. When an incident occurs, engineers tend to focus sharply on the technical problem at hand. Yet, as the saying goes, «A pilot’s first responsibility in an emergency is to fly the plane.» Similarly, engineers should first triage the problem and consider how to minimize user impact (e.g. by diverting traffic to another datacenter) before searching for the technical root cause. Google uses the Incident Command System, which clearly separates responsibility into the roles of incident command, operational work, communication, and planning (Chapter 14). This ensures that not only the technical problem, but also users and management, receive the attention they need.

Once the technical problem is resolved, the incident response process is not complete. Google takes great pride in its blameless postmortem culture. A written report is created for each incident, detailing its impact, the actions taken to mitigate or resolve it, the root causes, and the actions taken to prevent recurrence (Chapter 15). The focus should be on improving the system rather than blaming people for potential mistakes. Properly documenting incidents makes it possible to identify common themes, preserve solutions for posterity, and reflect on how to improve system reliability.

Applying It

A good example of the «fly the plane» approach arose when our automatic TLS certificate renewal tool malfunctioned when used with the combination of an HTTP challenge and the Kubernetes Gateway API. At the time, we had no proper monitoring in place for either TLS certificate expiration or the health of the certificate manager. When an engineer is alerted that a certificate has expired, knowing that a tool should have renewed it automatically, the first instinct is to figure out why the tool is not working. However, the main goal should clearly be to obtain a valid certificate as quickly as possible. We issued a new certificate manually, informed the relevant parties that the problem had been resolved, and only then took the time to thoroughly fix the erroneous configuration of the certificate manager.

While incident management worked well in this case, which was not particularly complex, we generally do not have an official incident management system in place. Some of the aspects mentioned in the SRE book, such as having a clear command post and a central communication tool, are less relevant to us, since we are a small team working in one office. Nevertheless, this is an area in which we should improve in the future.

We have started documenting incidents internally on a shared page, detailing what happened, how it was fixed, and how we can prevent it from recurring. Additionally, we create tickets for tasks that arise from dealing with incidents (e.g. «fix the certificate manager configuration»). While these cannot replace a proper postmortem with peer review, they are a first step in that direction.

The above examples demonstrate that the principles of SRE are not only useful in the context of a global corporation such as Google, but can also benefit smaller organisations, such as start-ups, when applied appropriately. When I was constantly operating in firefighting mode, I found it valuable to read (outside of work hours) about Google’s approach to DevOps and the lessons its SREs have learned over the years. SRE is not a checklist or a tooling choice. It is a way of thinking about trade-offs, incentives, and responsibility. Even in a startup, those principles matter – perhaps even more.

Sources

Featured image: Photo by İsmail Enes Ayhan on Unsplash.

