SandboxAQ

Staff/Senior Staff Site Reliability Engineer

Posted Yesterday
Remote
Hiring Remotely in USA
183K-304K
Senior level
As a Senior Staff Site Reliability Engineer, you will enhance system reliability and performance, lead incident management, analyze capacity planning, and mentor junior engineers.
The summary above was generated by AI
About SandboxAQ

SandboxAQ is a high-growth company delivering AI solutions that address some of the world's greatest challenges. The company’s Large Quantitative Models (LQMs) power advances in life sciences, financial services, navigation, cybersecurity, and other sectors.
We are a global team that is tech-focused and includes experts in AI, chemistry, cybersecurity, physics, mathematics, medicine, engineering, and other specialties. The company emerged from Alphabet Inc. as an independent, growth capital-backed company in 2022, funded by leading investors and supported by a braintrust of industry leaders. 
At SandboxAQ, we’ve cultivated an environment that encourages creativity, collaboration, and impact. By investing deeply in our people, we’re building a thriving, global workforce poised to tackle the world's epic challenges. Join us to advance your career in pursuit of an inspiring mission, in a community of like-minded people who value entrepreneurialism, ownership, and transformative impact. 

About the Role

As a Senior Staff Site Reliability Engineer at SandboxAQ, you will be responsible for maintaining and improving the reliability, performance, and scalability of our infrastructure and services. You will work closely with engineering teams to ensure that our systems are resilient, highly available, and optimized for performance. Your expertise will guide the development of reliable software, and you will play a key role in shaping the reliability culture within the organization.

What You'll Do

  • Incident Management: Lead efforts in incident response, root cause analysis, and postmortem processes, while developing strategies to minimize incidents and reduce recovery times.
  • Capacity Planning: Analyze system performance and growth trends, and create capacity plans to ensure systems scale appropriately as demand increases.
  • Monitoring & Observability: Design and maintain comprehensive monitoring, logging, and alerting solutions to ensure quick detection and resolution of system anomalies.
  • Collaboration with Engineering Teams: Partner with software engineers, product teams, and DevOps to design systems that are both reliable and performant.
  • Cost Optimization: Identify opportunities to optimize infrastructure costs while maintaining system reliability and performance. 
  • Automation & Tools Development: Build and improve automation tools, monitoring systems, and deployment pipelines to streamline operations and increase efficiency.
  • Mentorship & Leadership: Mentor junior and mid-level engineers, providing technical leadership and guidance on SRE best practices, incident management, and system design.
  • On-Call Rotation: Participate in an on-call rotation to respond to system outages and provide support for mission-critical systems.

About You

  • 10+ years of experience in Site Reliability Engineering, DevOps, or similar roles.
  • Strong experience with cloud platforms (AWS, GCP, or Azure), containerization (Docker, Kubernetes), and infrastructure-as-code (Terraform, CloudFormation).
  • Proven ability to lead post-incident reviews and drive continuous improvement in system reliability.
  • Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams.
  • Expertise in systems administration, networking, and security in a cloud-native environment.
  • Deep understanding of monitoring, observability, and logging tools (Prometheus, Grafana, ELK, Datadog, etc.).
  • Proficiency in scripting languages (e.g., Python, Go, Bash) and configuration management tools (e.g., Ansible, Chef, Puppet).
  • Experience designing and implementing scalable and reliable microservices architectures.
  • Strong knowledge of CI/CD pipelines and related tools (CircleCI, Jenkins, GitLab, etc.).

Nice to Haves

  • Experience with large-scale distributed systems and databases (e.g., Kafka, PostgreSQL, Cassandra, MySQL).
  • Experience with service mesh (e.g., Istio, Linkerd) and serverless architectures.
  • Strong understanding of compliance and security frameworks.
  • Familiarity with chaos engineering practices and tools (e.g., Gremlin, Chaos Monkey).

The US base salary range for this full-time position is expected to be $183k-$304k per year. Our salary ranges are determined by role and level. Within the range, individual pay is determined by factors including job-related skills, experience, and relevant education or training. This role may be eligible for annual discretionary bonuses and equity.

SandboxAQ welcomes all.

We are committed to creating an inclusive culture where we have zero tolerance for discrimination. We invest in our employees' personal and professional growth. Once you work with us, you can’t go back to normalcy because great breakthroughs come from great teams and we are the best in AI and quantum technology.

We offer competitive salaries, stock options depending on employment type, generous learning opportunities, medical/dental/vision, family planning/fertility, PTO (summer and winter breaks), financial wellness resources, 401(k) plans, and more.

Equal Employment Opportunity: All qualified applicants will receive consideration regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or Veteran status.

Accommodations: We provide reasonable accommodations for individuals with disabilities in job application procedures for open roles. If you need such an accommodation, please let a member of our Recruiting team know.

Top Skills

Ansible
AWS
Azure
Bash
Chef
CircleCI
CloudFormation
Datadog
Docker
ELK
GCP
GitLab
Go
Grafana
Jenkins
Kubernetes
Prometheus
Puppet
Python
Terraform
