online-php-experts: you want to be an SRE?

What is an SRE anyway?

To understand the answer to this question, it’s important that you learn a bit of history. Let’s talk about the traditional approach to system management. Prior to Google’s creation of the SRE position, System Administrators ran company operations.

What is a system administrator?

A system administrator or ‘sysadmin’ is someone who is responsible for the configuration, upkeep and reliability of complex computing systems.
They assemble software components (that are written by developers) and deploy them to produce a service.
They monitor these services and respond to any events that affect them.

System Administrators worked on the “operations” side of things, whereas engineers worked on the “development” side of things.

What’s so bad about this approach?

According to the SRE Book, this approach caused division and conflict between developers and sysadmins. Because the two had different backgrounds, skills, and incentives, they used different vocabulary and thought about reliability very differently. Developers wanted new features to get out to users as quickly as possible, whereas the operations team members (sysadmins) wanted to avoid breaking anything. Google saw the problems with this approach and created the idea of “Site Reliability Engineering.”

So again, WHAT IS AN SRE?

Ben Treynor, the creator of the position at Google, defines SRE in this interview as:

“Fundamentally, it’s what happens when you ask a software engineer to design an operations function…So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”

A few months ago, I had the opportunity to visit a data center just like the one you see pictured here. I toured several large, warehouse-sized rooms filled with thousands of machines. The magnitude of this space is remarkable.

Now let’s say that one of the servers in the data center went down and needed to be replaced. With the “old way,” a new server would be configured manually by a system administrator. What this means is that the sysadmin would manually make sure the new machine has the proper operating system, software, tags, etc. Now imagine that 1,000 servers need to be replaced. See where I am going with this? It would take forever, or the company would need a lot of sysadmins to do the labor.

Now consider the “new way” as described in this bullet point that I took from Dropbox’s Site Reliability Engineer Job posting:

“You will automate the server provisioning process to reduce the labor of our networking engineering and datacenter operations teams. Once we plug a new server in, it walks itself through all aspects of provisioning to join the fleet without any human involvement.”

Without any human involvement.

In this example, an SRE would be responsible for writing the software that automates the server configuration process. Cool, right? This example really helped me to understand what an SRE truly is:

Site Reliability Engineer = Software Engineer + Systems Enthusiast
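
To make the provisioning example above a bit more concrete, here is a minimal sketch of what "a server walks itself through provisioning" could look like in code. This is not Dropbox's system: the `Server` class, the `provision` function, and the package and tag names are all hypothetical, and the steps only flip flags in memory rather than touching real machines. The point it illustrates is that each step checks the current state first, so the same function can safely run against 1 server or 1,000.

```python
from dataclasses import dataclass, field

# Hypothetical baseline every machine should end up with (names are made up).
REQUIRED_PACKAGES = frozenset({"ssh-server", "monitoring-agent"})
REQUIRED_TAGS = frozenset({"role:web"})


@dataclass
class Server:
    """In-memory stand-in for a physical machine (illustration only)."""
    hostname: str
    os_installed: bool = False
    packages: set = field(default_factory=set)
    tags: set = field(default_factory=set)
    in_fleet: bool = False


def provision(server: Server) -> list:
    """Bring one server to the baseline and return the steps that ran.

    Every step checks state before acting, so running this twice on the
    same server does nothing the second time.
    """
    steps = []
    if not server.os_installed:
        server.os_installed = True
        steps.append("install_os")
    missing_packages = REQUIRED_PACKAGES - server.packages
    if missing_packages:
        server.packages |= missing_packages
        steps.append("install_packages")
    missing_tags = REQUIRED_TAGS - server.tags
    if missing_tags:
        server.tags |= missing_tags
        steps.append("apply_tags")
    if not server.in_fleet:
        server.in_fleet = True
        steps.append("join_fleet")
    return steps


# Replacing 1,000 servers becomes a loop instead of 1,000 manual setups.
fleet = [Server(hostname=f"host-{i:04d}") for i in range(1000)]
for server in fleet:
    provision(server)

print(sum(s.in_fleet for s in fleet))  # prints 1000
```

In a real data center each step would call out to actual tooling, and failures would need to be handled and retried; the sketch leaves all of that out on purpose.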

According to Tammy Butow, SRE Manager at Dropbox,

“SREs are Software Engineers who specialize in reliability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.”

By eliminating human interaction through automation, SREs make systems more reliable. So essentially, an SRE’s job is to automate themselves out of a job.

But Krishelle, why do you think this is cool?

The reason I found this to be really cool is the same reason I decided to study math. Math allows you to use functions and rules to solve large-scale problems. One of my favorite lessons from when I was teaching is based on this problem:

“You own a landscaping business and one of your specialties is outdoor brick staircases. How many bricks would you need to bring if a customer ordered a 10-high stairwell? How many bricks would you need for a 20-high stairwell? How many bricks would you need for a 38-high stairwell?”

Gauss’ Trick for summing the numbers from 1 to n

My students quickly realized that counting the bricks was an okay strategy for the smaller staircases. But as I increased the height all the way to a 100-high stairwell, they were forced to find another way. They realized that math can be used as a tool to calculate large-scale problems, avoiding a brute-force approach. (In my Algebra 1 courses, I would get students to discover they could use the equation n(n+1)/2 for the staircase problem.)
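
Here is a small sketch of the staircase problem in code, comparing the counting approach with the n(n+1)/2 formula. It assumes the staircase is built so that step k uses k bricks (1 brick for the first step, 2 for the second, and so on), which is the reading that matches the formula; the function names are just for illustration.

```python
def bricks_by_counting(height: int) -> int:
    """Brute force: add up the bricks step by step (1 + 2 + ... + height)."""
    total = 0
    for step in range(1, height + 1):
        total += step
    return total


def bricks_by_formula(height: int) -> int:
    """Gauss's trick: the sum of 1..n is n(n+1)/2."""
    return height * (height + 1) // 2


for height in (10, 20, 38, 100):
    # Both approaches agree; the formula just skips the loop.
    assert bricks_by_counting(height) == bricks_by_formula(height)
    print(height, bricks_by_formula(height))
# prints:
# 10 55
# 20 210
# 38 741
# 100 5050
```

The counting version does more work as the staircase grows, while the formula takes the same single calculation whether the staircase is 10 steps or 10 million.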

Just as math is a tool for solving large-scale problems, in the world of computers, code is a tool for managing large-scale systems. It is a tool that allows for automating tasks through software and eliminating the need for manual human labor. Site Reliability Engineers are behind this work. They manage and automate these systems using their systems knowledge and their code, making the system more reliable with every bit.

How do I know if SRE is right for me?

This is a big question that comes up when I speak to job seekers considering pursuing SRE roles. I put together some important questions to ask yourself before you commit completely.

SRE Compatibility Quiz

Do you like thinking about large-scale problems that have a lot of moving parts?
Do you like thinking about how to make large systems more reliable?
Are you okay with working on software that will likely never be overtly seen by an external user?
Do you enjoy looking at a terminal for large amounts of time?
Do you enjoy the process of diagnosing and fixing a problem? If yes, what if the diagnosis involves system level problems that you cannot always see?
Do you enjoy thinking about system information (e.g. disk space, cpu, os, kernel, etc.) and system level functionality (e.g. ssh, proc, cron, swaps, etc.)?
Are you comfortable with the idea of being “on-call,” in which you are likely to be in a high-stakes scenario where something needs to be fixed?
Are you able to stay calm under pressure?
Do you approach problems in a logical, process-oriented way?
Are you comfortable attempting a problem that has never been solved before?
Are you someone who thinks about how you can make things better?

If you answered yes to at least 8 of these questions, SRE could be a good position for you. Read on to find more resources on SRE and a list of companies that offer SRE roles.

So I really want to be an SRE, now what?

There are many useful resources out there for learning more about SRE and gaining the skills needed to land a role. Here are a few that I recommend starting with.

Understanding SRE Role and Responsibilities

Still trying to wrap your mind around what SRE means? Check out these resources:

🌐 Google’s SRE Resources — A website that contains Google’s definition of SRE, the transcript of an interview with the creator of the position, as well as other resources (including the online version of the SRE Book).

🌐 SRE Book Notes — If you are not ready to go out and spend $40-$50 on the SRE book, this is an awesome set of notes on each chapter of the book by Dan Luu.

🎥 Keys to SRE — A talk given by the creator of the SRE role, Ben Treynor of Google.

🎥 Site Reliability Engineers — Keeping Google up and running 24/7 — A Webinar with Google SREs.

🎥 Site Reliability Engineering at Dropbox — A talk given by Tammy Butow, SRE Manager at Dropbox.


Tuesday, 23 April 2019
