Site Reliability Engineer
TextUs is the leading conversational messaging platform for mobile-first customer interactions. We improve business outcomes by allowing organizations to have amazing, message-based conversations with their prospects, customers, and employees across their entire journey with the organization.
WHAT YOU'LL DO
TextUs has a strong DevOps culture, and it’s now time to expand our team with an SRE dedicated to developing systems and adding tooling that continue to increase site reliability and performance. You’ll have ownership in areas like availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
As a Site Reliability Engineer (SRE), you’ll be responsible for keeping all user-facing services and other TextUs production systems running smoothly. You’re a pragmatic operator and software craftsperson who applies sound engineering principles and operational discipline to building best-in-class visibility into our environments. You’ll work in partnership with DevOps and Architecture to identify areas of the system that do not scale, provide immediate palliative measures, and drive toward long-term, proactive resolution.
The outcomes of this team feed back into other engineering groups within the company, helping to evolve our culture of continuous learning as we strive toward excellence.
WHO YOU ARE
Are you the type of person who sees an inefficient or repetitive task and finds a way to make it easier? The perfect addition to our team has experience operating at scale, loves the opportunity to learn new technologies or approaches, and is comfortable embracing change. You actively look for opportunities to improve the availability and performance of the system by applying what you learn from monitoring and observation.
You have experience working in an SRE role in an at-scale environment, and now you want to lead by example in that role.
In this role you will:
- Think about systems - edge cases, failure modes, behaviors, specific implementations.
- Know your way around Heroku and AWS.
- Have strong programming skills (our platform is Ruby; Python and/or Golang experience is a plus).
- Have experience with time-series databases and with developing alerts and graphs from the data (Prometheus, Influx, Grafana, etc.).
- Embrace collaboration and asynchronous communication.
- Embrace documenting all-the-things so we don't need to learn the same thing twice.
- Have a positive, enthusiastic, continuous learning attitude. When you see something broken, you feel a strong desire to fix it.
- Feel the urge to deliver quickly and embrace delivering iteratively.
- Be inspired by our values and work in accordance with them.
- Have experience in problem-solving and analyzing global-scale distributed systems.
- Be proficient in algorithms, data structures, complexity analysis, and software design, and/or have expertise in performance and application issues.
- Be capable of technical deep-dives into code, infrastructure, and data storage, yet verbally and cognitively agile enough to hold your own in a strategy discussion with our executive team.
RESPONSIBILITIES
SREs collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability. As the champion for these requirements within Engineering, you will:
- Improve the system issue diagnosis process by making it as boring as possible.
- Lead with commitment to refining and delivering on our SLOs, SLIs, and SLAs.
- Architect our monitoring and alerting instrumentation to be proactive - warn on symptoms before there are outages.
- Research production issues across services and levels of the stack.
- Document your findings so that we can turn them into repeatable actions – and into automation.
- Design, build, and maintain core infrastructure pieces that enable TextUs to scale to hundreds of thousands of concurrent users.
- Plan the growth of our infrastructure in partnership with Architecture and DevOps.
- Propose ideas and solutions to optimize workload through automation.
- Lead and contribute to designs for issues, epics, and OKRs.
- Complete Root Cause Analysis (RCA) investigations.
- Contribute to and drive the handbook, runbooks, and general documentation.
- Seek out areas for improvement in Engineering practices.
- Embody accountability and self-awareness, and resolve conflict constructively.
- Plan and execute configuration change operations both at the application and the infrastructure level.
- Proactively identify significant projects that result in substantial cost savings or revenue.
- Use a rigorous, data-driven approach to identify productive changes within the product architecture.
- Embrace the concepts of ‘Infrastructure as Code’ and CI/CD.
COMPENSATION RANGE
- Salary: $145K - $185K. Title and compensation will be aligned with experience.
REPORTING TO
- CTO
LOCATION
- Headquartered in Colorado. Remote (US).
WHY TEXTUS
- Our team members have a clear voice and an impact on the success of TextUs. We have a collaborative, learning-oriented, data-driven culture and an experienced, proven leadership team.
TextUs Benefits include:
- Competitive pay
- Equity
- Health/Dental/Vision insurance
- 401K with company match
- Flex vacation policy
- Headquartered in Colorado and Remote (US)