What is an SRE anyway?
To understand the answer to this question, it helps to know a bit of history. Let’s talk about the traditional approach to system management. Prior to Google’s creation of the SRE position, system administrators ran company operations.
What is a system administrator?
- A system administrator or ‘sysadmin’ is someone who is responsible for the configuration, upkeep and reliability of complex computing systems.
- They assemble software components written by developers and deploy them to produce a service.
- They monitor these services and respond to any events that affect them.
System Administrators worked on the “operations” side of things, whereas engineers worked on the “development” side of things.
What’s so bad about this approach?
According to the SRE Book, this approach caused division and conflict between developers and sysadmins. Because the two had different backgrounds, skills, and incentives, they used different vocabulary and thought about reliability very differently. Developers wanted new features to get out to users as quickly as possible, whereas the operations team members (sysadmins) wanted to avoid breaking anything. Google saw the problems with this approach and created the idea of “Site Reliability Engineering.”
So again, WHAT IS AN SRE?
Ben Treynor, the creator of the position at Google, defines SRE in this interview as:
“Fundamentally, it’s what happens when you ask a software engineer to design an operations function…So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”
A few months ago, I had the opportunity to visit a data center just like the one you see pictured here. I toured several large, warehouse-sized rooms filled with thousands of machines. The magnitude of this space is remarkable.
Now
let’s say that one of the servers in the data center went down and
needed to be replaced. With the “old way,” a new server would be
configured manually by a system administrator. What this means is that
the sysadmin would manually
make sure the new machine has the proper operating system, software,
tags, etc. Now imagine that 1,000 servers need to be replaced. See where
I am going with this? It would take forever, or the company would need a
lot of sysadmins to do the labor.
Now consider the “new way” as described in this bullet point that I took from Dropbox’s Site Reliability Engineer Job posting:
“You will automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement.”
Without any human involvement.
In this example, an SRE would be responsible for writing the software that automates the server configuration process. Cool, right? This example really helped me to understand what an SRE truly is:
Site Reliability Engineer = Software Engineer + Systems Enthusiast
According to Tammy Butow, SRE Manager at Dropbox,
“SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.”
By
eliminating human interaction through automation, SREs make systems
more reliable. So essentially, an SRE’s job is to automate themselves
out of a job.
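To make the “new way” a bit more concrete, here is a minimal sketch in Python. It is not how Dropbox or Google actually provision machines. The `Server` class, the `DESIRED_*` values, and both functions are made up for illustration, and each step is a stand-in for real work like installing an operating system. The point is only that the same code path handles one server or 1,000.

```python
from dataclasses import dataclass, field

# Hypothetical desired state; a real fleet would define far more than this.
DESIRED_OS = "linux-lts"
DESIRED_PACKAGES = {"monitoring-agent", "log-shipper"}
DESIRED_TAGS = {"role": "web"}


@dataclass
class Server:
    hostname: str
    os: str = ""
    packages: set = field(default_factory=set)
    tags: dict = field(default_factory=dict)


def provision(server: Server) -> Server:
    """Bring one server to the desired state. Safe to run more than once."""
    if server.os != DESIRED_OS:
        server.os = DESIRED_OS           # stand-in for installing the OS
    server.packages |= DESIRED_PACKAGES  # stand-in for installing software
    server.tags.update(DESIRED_TAGS)     # stand-in for tagging the machine
    return server


def provision_fleet(hostnames):
    """Provision every server with the same code, whether 1 or 1,000."""
    return [provision(Server(hostname=name)) for name in hostnames]


fleet = provision_fleet(f"server-{i:04d}" for i in range(1000))
print(len(fleet), "servers provisioned")
```

Notice that `provision` can be run again on a machine that is already set up without changing anything. That property (often called idempotence) is what makes it safe to let software, rather than a person, decide when to run it.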
But Krishelle, why do you think this is cool?
The reason I found this to be really cool is the same reason I decided to study math. Math allows you to use functions and rules to solve large-scale problems. One of my favorite lessons from when I was teaching is based on this problem:

“You own a landscaping business and one of your specialties is outdoor brick staircases. How many bricks would you need to bring if a customer ordered a 10-high stairwell? How many bricks would a customer need for a 20-high stairwell? How many bricks would a customer need for a 38-high stairwell?”

My students quickly realized that counting the bricks was an okay strategy for the smaller staircases. But as I increased the height all the way to a 100-high stairwell, they were forced to find another way. They realized that math can be used as a tool to solve large-scale problems, avoiding a brute-force approach. (In my Algebra 1 courses, I would get students to discover they could use the formula n(n+1)/2 for the staircase problem.)
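Here is the staircase problem as a short Python sketch, comparing the counting strategy with the formula. It assumes the simplest version of the staircase: step k is a column of k bricks, so an n-high stairwell needs 1 + 2 + ... + n bricks. The function names are my own.

```python
def bricks_by_counting(height: int) -> int:
    """Brute force: add up the bricks one step at a time."""
    total = 0
    for step in range(1, height + 1):
        total += step  # step k is assumed to be a column of k bricks
    return total


def bricks_by_formula(height: int) -> int:
    """Closed form n(n+1)/2; one calculation no matter how tall."""
    return height * (height + 1) // 2


for height in (10, 20, 38, 100):
    # Both strategies must agree; the formula just gets there faster.
    assert bricks_by_counting(height) == bricks_by_formula(height)
    print(f"{height}-high stairwell: {bricks_by_formula(height)} bricks")
```

Under that assumption, the 10-, 20-, 38-, and 100-high stairwells need 55, 210, 741, and 5,050 bricks. The counting version does more work as the staircase grows, while the formula does the same small amount of work every time.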
Just as math is a tool for solving large-scale problems, in the world of computers, code is a tool for managing large-scale systems. It allows for automating tasks through software and eliminating the need for manual human labor. Site Reliability Engineers are behind this work: they manage and automate these systems using their systems knowledge and their code, making the system more reliable with every bit.
How do I know if SRE is right for me?
This
is a big question that comes up when I speak to job seekers considering
pursuing SRE roles. I put together some important questions to ask
yourself before you commit completely.
SRE Compatibility Quiz
- Do you like thinking about large-scale problems that have a lot of moving parts?
- Do you like thinking about how to make large systems more reliable?
- Are you okay with working on software that will likely never be overtly seen by an external user?
- Do you enjoy looking at a terminal for large amounts of time?
- Do you enjoy the process of diagnosing and fixing a problem? If yes, what if the diagnosis involves system level problems that you cannot always see?
- Do you enjoy thinking about system information (e.g. disk space, CPU, OS, kernel, etc.) and system-level functionality (e.g. ssh, proc, cron, swap, etc.)?
- Are you comfortable with the idea of being “on-call,” in which you are likely to be in a high-stakes scenario where something needs to be fixed?
- Are you able to stay calm under pressure?
- Do you approach problems in a logical, process-oriented way?
- Are you comfortable attempting a problem that has never been solved before?
- Are you someone who thinks about how you can make things better?
If
you answered yes to at least 8 of these questions, SRE could be a good
position for you. Read on to find more resources on SRE and a list of
companies that offer SRE roles.
So I really want to be an SRE, now what?
There are many resources out there for learning more about SRE and for gaining the skills needed to land a role. Here are a few that I recommend starting with.
Understanding SRE Role and Responsibilities
Still trying to wrap your mind around what SRE means? Check out these resources:
- Google’s SRE Resources: A website that contains Google’s definition of SRE, the transcript of an interview with the creator of the position, as well as other resources (including the online version of the SRE Book).
- SRE Book Notes: If you’re not ready to go out and spend $40-$50 on the SRE book, this is an awesome set of notes on each chapter of the book by Dan Luu.
- Keys to SRE: A talk given by the creator of the SRE role, Ben Treynor of Google.
- Site Reliability Engineers: Keeping Google up and running 24/7: A webinar with Google SREs.
- Site Reliability Engineering at Dropbox: A talk given by Tammy Butow, SRE Manager at Dropbox.