Netskope

Staff Site Reliability Engineer, FedRAMP

Posted 4 Days Ago
Remote
Hiring Remotely in United States
Expert/Leader
The Staff Site Reliability Engineer will enhance AI/ML infrastructure, manage CI/CD pipelines, ensure system reliability, and troubleshoot applications, focusing on cloud-based operations.
The summary above was generated by AI
About Netskope

Today, there's more data and users outside the enterprise than inside, causing the network perimeter as we know it to dissolve. We realized a new perimeter was needed, one that is built in the cloud and follows and protects data wherever it goes, so we started Netskope to redefine Cloud, Network and Data Security. 

Since 2012, we have built the market-leading cloud security company and an award-winning culture powered by hundreds of employees spread across offices in Santa Clara, St. Louis, Bangalore, London, Paris, Melbourne, Taipei, and Tokyo. Our core values are openness, honesty, and transparency, and we purposely developed our open desk layouts and large meeting spaces to support and promote partnerships, collaboration, and teamwork. From catered lunches and office celebrations to employee recognition events and social professional groups such as the Awesome Women of Netskope (AWON), we strive to keep work fun, supportive and interactive. Visit us at Netskope Careers. Please follow us on LinkedIn and Twitter @Netskope.

About the role

We are a team of software engineers focused on improving availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the engineering stacks. If you are passionate about solving complex problems and developing cloud services at scale, we would like to speak with you.

As an SRE, you will be critical to deploying and managing cutting-edge infrastructure crucial for AI/ML operations, and you will collaborate with AI/ML engineers and researchers to develop a robust CI/CD pipeline that supports safe and reproducible experiments. Your expertise will also extend to setting up and maintaining monitoring, logging, and alerting systems to oversee extensive training runs and client-facing APIs. You will ensure that training environments are optimally available and efficiently managed across multiple clusters, enhancing our containerization and orchestration systems with advanced tools like Docker and Kubernetes.

  • Work closely with AI/ML engineers and researchers on the design and architecture of AI/ML applications for scale and reliability. Design and deploy a CI/CD pipeline that ensures safe and reproducible experiments.
  • Take part in production troubleshooting of AI/ML application code as well as infrastructure configurations.
  • Set up and manage monitoring, logging, and alerting systems for extensive training runs and client-facing APIs.
  • Ensure training environments are consistently available and prepared across multiple clusters.
  • Develop and manage containerization and orchestration systems utilizing tools such as Docker and Kubernetes.
  • Operate and oversee large Kubernetes clusters with GPU workloads.
  • Improve reliability, quality, and time-to-market of our suite of software solutions
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement
  • Provide primary operational support and engineering for multiple large-scale distributed software applications

 Is this you?

  • You have professional experience with:
    • Model training
    • Hugging Face Transformers
    • PyTorch
    • LLMs
    • TensorRT
    • Infrastructure as code tools like Terraform
    • Scripting languages such as Python or Bash
    • Cloud platforms such as Google Cloud, AWS or Azure
    • Git and GitHub workflows
    • Tracing and Monitoring
  • You are familiar with high-performance, large-scale ML systems
  • You have a knack for troubleshooting complex systems and enjoy solving challenging problems
  • You are proactive in identifying problems, performance bottlenecks, and areas for improvement
  • You take pride in building and operating scalable, reliable, secure systems
  • You are familiar with monitoring tools such as Prometheus, Grafana, or similar
  • You are comfortable with ambiguity and rapid change

Preferred skills and experience:

  • 8+ years building core infrastructure
  • Experience running inference clusters at scale
  • Experience operating orchestration systems such as Kubernetes at scale

Netskope is committed to implementing equal employment opportunities for all employees and applicants for employment. Netskope does not discriminate in employment opportunities or practices based on religion, race, color, sex, marital or veteran status, age, national origin, ancestry, physical or mental disability, medical condition, sexual orientation, gender identity/expression, genetic information, pregnancy (including childbirth, lactation and related medical conditions), or any other characteristic protected by the laws or regulations of any jurisdiction in which we operate.

Netskope respects your privacy and is committed to protecting the personal information you share with us. Please refer to Netskope's Privacy Policy for more details.

Top Skills

AWS
Azure
Bash
Docker
Git
GitHub
GCP
Grafana
Hugging Face Transformers
Kubernetes
LLM
Prometheus
Python
PyTorch
TensorRT
Terraform
