Site Reliability Engineer

TextUs

Remote

Sorry, this job was removed at 10:28 a.m. (MST) on Friday, March 5, 2021


TextUs is the leading conversational messaging platform for mobile-first customer interactions. We improve business outcomes by allowing organizations to have amazing, message-based conversations with their prospects, customers, and employees across their entire journey with the organization.

WHAT YOU'LL DO

TextUs has a strong DevOps culture, and it’s now time to expand our team with an SRE dedicated to developing systems and adding tooling that continue to increase site reliability and performance. You’ll have ownership in areas like availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

As a Site Reliability Engineer (SRE), you’ll be responsible for keeping all user-facing services and other TextUs production systems running smoothly. You’re a pragmatic operator and software craftsperson who applies sound engineering principles and operational discipline to instrumenting best-in-class visibility into our environments. You’ll work in partnership with DevOps and Architecture to identify areas within the system that do not scale, provide immediate palliative measures, and drive toward long-term, proactive resolution.

The outcomes of this team feed back into other engineering groups within the company, continuing to evolve our culture of continuous learning as we strive toward excellence.

WHO YOU ARE

Are you the type of person who sees an inefficient or repetitive task and finds a way to make it easier? The perfect addition to our team has experience operating at scale, loves the opportunity to learn new technologies or approaches, and is comfortable embracing change. You actively look for opportunities to improve the availability and performance of the system by applying what you learn from monitoring and observation.

You have experience operating in an SRE role in an at-scale environment, and now you want to lead by example in that role.

In this role you will:

Think about systems: edge cases, failure modes, behaviors, specific implementations.
Know your way around Heroku and AWS.
Have strong programming skills (our platform is Ruby; Python and/or Golang experience is a plus).
Have experience with time-series databases and with developing alerts and graphs from their data (Prometheus, Influx, Grafana, etc.).
Embrace collaboration and asynchronous communication.
Embrace documenting all-the-things so we don't need to learn the same thing twice.
Have a positive, enthusiastic, continuous-learning attitude. When you see something broken, you feel a strong desire to fix it.
Feel the urge to deliver quickly and embrace delivering iteratively.
Be inspired by our values and work in accordance with them.
Have experience in problem-solving and in analyzing global-scale distributed systems.
Be proficient in algorithms, data structures, complexity analysis, and software design, and/or have expertise in performance and application issues.
Be capable of technical deep-dives into code, infrastructure, and data storage, yet verbally and cognitively agile enough to hold your own in a strategy discussion with our executive team.

RESPONSIBILITIES

SREs collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability. As the champion for these concerns within Engineering, you will:
Improve the system issue diagnosis process by making it as boring as possible.
Lead with commitment to refining and delivering on our SLOs, SLIs, and SLAs.
Architect our monitoring and alerting instrumentation to be proactive - warn on symptoms before there are outages.
Research production issues across services and levels of the stack.
Document your findings so that we can turn them into repeatable actions – and into automation.
Design, build, and maintain core infrastructure pieces that help TextUs scale to support hundreds of thousands of concurrent users.
Plan the growth of our infrastructure in partnership with Architecture and DevOps
Propose ideas and solutions to optimize workload through automation.
Lead and contribute to designs for issues, epics, OKRs
Complete Root Cause Analysis (RCA) investigations
Contribute and drive handbook, runbooks, and general documentation
Seek out areas for improvement in Engineering practices
Embody accountability, self-awareness, and conflict resolution
Plan and execute configuration change operations both at the application and the infrastructure level.
Proactively identify significant projects that result in substantial cost savings or revenue.
Use a rigorous, data-driven approach to identify productive changes within the product architecture.
Embrace the concepts of ‘Infrastructure as Code’ and CI/CD.

COMPENSATION RANGE:

Salary: $145K - $185K. Title and compensation will be aligned with experience.

REPORTING TO

CTO

LOCATION

Headquartered in Colorado. Remote (US).

WHY TEXTUS

Our team members have a clear voice and an impact on the success of TextUs. We are a collaborative, learning and data-driven culture with an experienced and proven leadership team.

TextUs Benefits include:

Competitive pay
Equity
Health/Dental/Vision insurance
401K with company match
Flex vacation policy
Headquartered in Colorado and Remote (US)
