Inside the Sexy Field of Site Reliability
In the words of Ben Treynor, site reliability engineering is “what happens when you ask a software engineer to design an operations function.”
He should know.
Treynor serves as Google’s VP of Engineering and is widely recognized as the creator of the site reliability practice, having introduced it at Google in 2003 and grown the department from seven engineers to around 1,200 today.
Since then, other organizations have followed suit. According to the 2020 DevOps Skill Survey, the adoption of SRE teams grew from 10 percent in 2019 to 15 percent in 2020.
One of the companies that started up its SRE team last year was BillGO, the bill management and payments platform out of Fort Collins. They started off by hiring Orion Cook, a former full-stack developer who was used to creating Java applications, working with interfaces and writing code to support those applications.
“The big difference between that work and site reliability is that site reliability is focused on supporting the developers who create those applications,” Cook said. “We develop tools for making the internal applications observable and build processes and tools to keep our production services healthy. It’s been a very cool opportunity to get into this niche field of engineering.”
A few months later, Technical Lead Manager Travis Ellett joined to set the direction and grow the team. Shortly after, Software Engineer Matt Cramer signed on. Today, the three make up BillGO’s SRE department, which supports about 170 engineers at BillGO. They work quickly, collaborate often and find pleasure in proactively breaking systems to make sure they don’t fail in the future.
“Our saying is we teach people how to fight fires, and then we go and set fires,” Cramer said.
Consider us intrigued. To learn what this looks like in action and how this will help BillGO continue to scale, we turned to Cramer, Cook and Ellett.
Matt, you’re the most recent to join the team. What led to this decision?
Cramer: A good friend of mine and former colleague joined BillGO a couple of years ago. As soon as he got here, he called me and told me to join him. He described BillGO’s culture of collaboration and the shared determination to build something bigger, better and faster that hadn’t been built before. He called out this collective drive and said it’s unlike anything he’d ever experienced before. Since joining, I would agree.
What are your responsibilities within SRE?
Ellett: I set the direction and build teams that focus on the tooling and processes for observability and incident management. Now we’re starting to prepare for capacity planning and chaos engineering. Ultimately, our goals are to keep production stable, identify how we can measure and incentivize the other engineering teams to meet the benchmarks we’re suggesting, and provide them the tools they need to solve the problems we’ve identified.
Cramer: We each have initiatives that we own and drive. Right now mine are centered on functional initiatives to stabilize production and comparative analyses between versions that are coming and versions that are currently running in production. That’s taken up a lot of my time in recent months and will probably carry on for several more before we get it fully ironed out.
What makes this work — as Orion said — “niche”?
Ellett: We’re like an engineering support team and we operate in very short cycles. For example, BillGO’s large-scale engineering projects take about a quarter to complete through design, inception, scoping, prototyping, implementation, rollout and support. But site reliability runs on a tight cycle and is very hands-on with our customers, who are our colleagues. The tools we build will most likely never be seen by a BillGO customer, and I think that’s fascinating. I have a real passion for quick feedback loops, so that we can adapt quickly. The work is all very compressed.
Cramer: I come from a DevOps role, where I worked with and supported dev teams by delivering software to servers and test systems. Here, we’re actually touching the application’s code. We’re working more closely with the teams rather than working adjacently. There’s always this element of surprise and the opportunity to learn something new. A lot of engineers hate bugs or the things they missed while designing, but I like finding these things and coming up with ways to design better in the future. Plus, with chaos engineering as one of our future goals, our team will have some fun cut out for us.
Site Reliability Technology
Can you share some examples of the work your team has completed?
Ellett: We work on time series databases for infrastructure and observability. We’ll instrument applications and the underlying infrastructure and send the resulting data off to a database for statistical analysis. In addition, we’ll scrape all the logs from the applications and systems and pipe them up. Right now, we’re parsing over 1 billion logs per month. We’re also looking at 40,000 time series metrics that are ingested per day, such as CPU, RAM, disk usage and business-specific metrics that measure the health of our API from our customers’ perspective.
Matt, you mentioned culture was something your friend sold you on at BillGO. What’s the culture like?
Cramer: Our shared ownership permeates the organization. On the engineering team specifically, we all take on-call rotations and take care of production to make sure it’s running. We all hold the pager a week at a time. This is a great way to get to know your teammates and go through trials together. When something breaks, we all jump on a call, find the problem and fix it. That draws teams together and is a powerful tool here.
Lastly, what are you most proud of since joining BillGO?
Cook: The culture at BillGO made it really easy for us to create reliable systems without having to go through formal channels and all the bureaucracy. We’ve gone from a small team that had very little influence to being able to get a lot of different engineering teams on board through socialization of what we have and how we can help.
Cramer: I’m really proud of the impact that such a small team can have on such a large group of engineers who are building things at incredible scale and speed. Ordinarily, we might be viewed as the annoying fly that keeps splashing into the soup. Instead, there’s this culture of our teammates coming to us and asking how we can make something better together. We host weekly office hours and engineers across the organization drop in so we can kick ideas around and come up with new designs. That’s a lot of fun and it’s very rewarding.