Job Title: Senior Site Reliability Engineer
Level: Senior
Working Hours: Full Time (40h/Week)
Contract: Contractor
Location: Remote
Your Team 👥
You will report to our Head of Infrastructure and Deployment and join the Engineering team. The Site Reliability Engineering (SRE) team is dedicated to engineering, maintaining, and continuously improving the reliability, scalability, and performance of all critical Rocket.Chat systems and services. Our mission is to ensure an exceptional and uninterrupted experience for our users and customers, bridging the gap between development and operations to deliver value efficiently and automatically. On TheOrg you can view the complete structure of our organisation, including information about every team member, hiring managers, and the size of each department.
Your Responsibilities ✏️
As a Senior Site Reliability Engineer, you will play a critical role in enhancing the reliability, performance, and scalability of Rocket.Chat's entire ecosystem. You will apply software engineering principles to infrastructure and operations, proactively preventing outages, optimizing system efficiency, and ensuring that new features and services are delivered with the highest standards of stability. Your expertise will be instrumental in delivering exceptional user experiences across our core platform, internal infrastructure, and customer-facing services.
Mandatory Hard Skills 🎯
- Strong background in software engineering with expertise in large-scale distributed systems.
- Expertise in Kubernetes, including operator development, and cloud platforms (e.g., AWS, GCP, Azure, OVH).
- Proficiency in programming/scripting languages such as Go, Python, or Bash for tooling and operator development.
- Deep, hands-on experience with monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Loki).
- Experience with Infrastructure as Code (IaC) tools like Terraform, Pulumi or Ansible and CI/CD pipelines using tools like ArgoCD.
- Solid understanding of networking fundamentals (TCP/IP, DNS, routing) and security principles.
- Familiarity with database technologies such as MongoDB or Redis.
Desirable Hard Skills 💕
- Practical experience with chaos engineering principles and tools.
- Experience with disaster recovery planning, testing, and implementation.
- Familiarity with agile management tools such as Jira.
Soft Skills ✨
- Proactive Mindset: Anticipate and address potential issues before they impact users.
- Collaboration: Work seamlessly with other teams, sharing knowledge and expertise to drive reliability.
- Problem-Solving: Strong troubleshooting and analytical skills to identify the root cause of complex issues across diverse technical stacks.
- Leadership: Guide and inspire team members, especially during incidents, and effectively communicate with both technical and non-technical stakeholders.
- Data-Driven Decisions: Base decisions on metrics and data to drive improvements.
- Passion: Bring genuine enthusiasm for what you do and how it contributes to our company's mission.
- Dream: Proactively seek out opportunities and challenges to achieve extraordinary results. If you're someone who takes initiative and is always striving to improve, you'll fit right in.
- Own: Take ownership of your work, set high standards for yourself, and be accountable for outcomes, demonstrating a strong sense of responsibility and commitment. Take full responsibility for the reliability and performance of all Rocket.Chat services and infrastructure.
- Trust: Recognize the importance of trust and support, and actively work towards a collaborative and inclusive workplace.
- Share: Communicate openly and transparently, ensuring clarity and honesty in interactions.
What You'll Do 🖥️
- Engineer & Operate Deployment & Platform Services: Design, develop, and maintain the Kubernetes Operators at the core of our managed hosting offerings, ensuring their reliability, scalability, and robust error handling.
- Manage & Optimize Core Infrastructure: Oversee the reliability and performance of foundational infrastructure, including multiple Kubernetes clusters and critical services like ArgoCD, Traefik, and our monitoring stack.
- Ensure Service Reliability & Uptime: Define, monitor, and enforce SLOs for all critical services, manage error budgets, and implement robust monitoring, alerting, and logging solutions.
- Automate Operations & Reduce Toil: Develop and maintain automation frameworks for deployment, configuration, and operational tasks, building internal tools to streamline SRE workflows.
- Lead Incident Management & On-Call Response: Act as a primary responder for critical alerts, lead blameless post-mortems, and continuously improve runbook documentation and disaster recovery plans.
- Foster Cross-Functional Collaboration: Engage early in the product lifecycle to ensure reliability is built-in, and collaborate with Engineering, Security, and QA to integrate reliability best practices.
- Implement Advanced Reliability Practices: Conduct proactive load testing, performance analysis, and chaos engineering experiments to identify system weaknesses and improve fault tolerance.
Benefits ✨
- Fully Remote & Flexible Working Hours
- Flexible Paid Time Off, Holidays and Vacation
- Company Laptop
- Remote Benefit
- iTalki, Courses and Books
- Stock Options
- Multicultural Environment
- Vibrant Company Culture
Check out our handbook to dive into each of our awesome benefits! At Rocket.Chat, we have tailored base pay ranges according to work locations. This approach ensures that we can competitively and consistently compensate our employees across different geographic markets.
Note: While we define an initial seniority level and budget for each role, this can be adjusted during the hiring process. The selection process itself — including interviews and assessments — helps us better understand where the candidate fits within our career framework and which grade they should be positioned in.
About Rocket.Chat 🚀
Rocket.Chat is the world's largest open-source communications platform. Built for organizations needing more control over their communications, Rocket.Chat Secure CommsOS™ is a communication platform that unifies messaging, voice, video, AI, and mission-critical applications, ensuring uncompromising security, compliance, and operational efficiency for governments, defense, and critical infrastructure organizations operating in highly regulated environments.
Tens of millions of users in over 150 countries and organizations such as Deutsche Bahn, the U.S. Navy, and Credit Suisse trust Rocket.Chat every day to keep their communications completely private and secure. At Rocket.Chat, we believe in reconnecting the world, one conversation at a time!
See yourself in that? Apply now! Check out our handbook for more information about our rocket.