The Modern SRE

February 4, 2022

By Alex Grote

Site reliability engineers (SREs) have traditionally been viewed as the final step before software launch. They improve software availability, performance, latency, capacity, and resiliency. They bridge the gap between development and operations to ensure that the application that reaches production works exactly as intended.

Since the title was coined almost 20 years ago, the role has predictably evolved. The discipline is ever-changing and is now entering a shift-left phase where reliability is baked into every step of the DevOps process. This shows up in everything from identifying issues earlier in the development process to making architectural changes that enable applications to scale better in production.

So, what’s the next step in this evolution?

SREs: Managers of fragility

More than anything, SREs need to manage how strong an application is when placed under stress. In other words, they gauge how fragile, or antifragile, the application is. How many concurrent users can an application handle? What happens when an error does occur? One of the main functions of the SRE is to determine if an application can “take a punch” and still stand.

The measurement of fragility feeds into larger discussions around risk management. Combined DevOps and SRE teams need to have transparent and frank environments that foster good conversations around how much risk is too much. Things like unscheduled interruptions and budget issues are part of these discussions. Users will likely accept downtime, but how much is too much?
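One way to make the "how much is too much?" question concrete is to turn an availability target into minutes of allowed downtime. The sketch below is only an illustration, not a description of how NetApp IT measures risk: the function names, the 30-day window, and the example targets are all assumptions.

```python
# Illustrative sketch: translate an availability target into a downtime allowance.
# Function names, the 30-day window, and the sample targets are assumptions.


def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Return the minutes of downtime a target permits over the period."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in the range (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def budget_remaining(
    availability_target: float,
    downtime_so_far_minutes: float,
    period_days: int = 30,
) -> float:
    """Return the unused downtime allowance; a negative value means it is overspent."""
    budget = allowed_downtime_minutes(availability_target, period_days)
    return budget - downtime_so_far_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} availability allows {minutes:.1f} minutes per 30 days")
```

Under these assumptions, a 99% target leaves roughly 432 minutes of downtime in a 30-day window, 99.9% leaves about 43, and 99.99% leaves a little over 4. Putting numbers like these on the table gives DevOps and SRE teams a shared starting point for the risk conversation.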

For a long time, SREs were considered a result of DevOps. I’d counter this by saying they very much work in concert with each other. This symbiotic relationship means better products and more frequent releases. The modern SRE works hand-in-hand with DevOps to plan for and react to fragility.

Accept and plan for failure

Barring something very unexpected, a development cycle will have moments of failure. This is the natural order of things with increasingly complex applications. However, failure can be planned for within the development cycle if SREs and DevOps avoid silos and work as a single, holistic unit.

At NetApp IT, we believe that success depends on shared ownership of a project, with DevOps and SREs carrying equal responsibility for the outcome. Shared ownership creates a better environment for reacting to failures.

Closer teams mean better relationships.

Better relationships mean better communication.

Better communication means better problem-solving.

Balancing dev and operations mindsets

SREs can sometimes find themselves in the middle of the DevOps process, creating a bit of a Dev-SRE-Ops dynamic. The development side creates business value by releasing code into production. The operations side creates business value by preventing SLA violations. The SRE just wants everything to work right and sometimes gets stuck in the middle of competing goals.

It’s not easy.

However, it’s also an opportunity for SREs to take on an entirely new and meaningful role. By being a mediator and champion for the project, SREs take on a new leadership position.

This also requires a new way of leading. A top-down approach will fail – most DevOps teams aren’t set up this way and you’re working with side-to-side peers. It requires a humble approach, steeped in a servant mentality. Both sides have the same end goal, although how they get there may have different detours. Leaders in this position want both sides to take on a greater sense of ownership. This isn’t about just meeting internal metrics. It’s about doing your best work. Everything else will fall in line if everyone wants to put out something truly amazing.

SREs can help foster this by:

– Asking how you can help the DevOps team

– Encouraging out-of-the-box thinking

– Embracing a humble mindset

The modern SRE role is much more dynamic than it was in the early 2000s. It’s not just about tooling or tech skills. It’s about having a generalist mindset and being good at a lot of things. With that breadth, an SRE’s impact can be immeasurable.


Alex Grote is a DevOps Architect with NetApp IT.
