fal

Operations Engineer, HPC Networking

Posted 2 Days Ago
Remote
Hiring Remotely in USA
Mid level
As an Operations Engineer, you will manage HPC networking, monitor and debug InfiniBand and Ethernet fabrics, and support fabric bring-up while enhancing operational tools and runbooks.

fal is the generative media ecosystem powering the next generation of AI products. We build the infrastructure, tools, and model access that teams need to move from idea to production, and do it at scale without compromise. For developers and enterprises, fal is the foundation that makes generative media not just possible, but practical: a unified platform where high-performance inference, orchestration, and observability come together to unlock new categories of AI-native products.

As generative media reshapes industries across a market projected to grow by hundreds of billions over the next decade, fal is becoming the ecosystem that ambitious teams build on.

About the role

We're hiring an Operations Engineer for HPC Networking to keep our InfiniBand and Ethernet fabrics healthy as we scale.

This is a hands-on role. You'll bring up new fabrics alongside DC ops, monitor the ones in production, and chase down the weird stuff: link flaps, congestion, NCCL stalls, firmware bugs that only show up at scale. 

You're a fit if you've:
  • Operated InfiniBand fabrics in production: subnet manager, routing, partitioning, monitoring.
  • Debugged the full stack: cables, transceivers, switch firmware, HCAs, drivers, NCCL.
  • Brought up new fabrics from cable pull through validation.
  • Scripted your way through repetitive operational work (Bash, Python, Go, whatever).
  • Nice to have: Ethernet RoCE, Spectrum-X, or large-scale GPU cluster networking.

Who you are:
  • Detail-oriented. Cable plant hygiene is a personality trait.
  • Calm under fire. A fabric incident during a customer training run doesn't rattle you.
  • You read vendor release notes for fun, or at least out of self-defense.
  • You'd rather find the root cause than reboot the switch.

Responsibilities:
  • Monitor health and performance of InfiniBand and Ethernet fabrics: switches, HCAs, transceivers, links.
  • Investigate and resolve fabric issues: connectivity, congestion, performance regressions.
  • Support fabric bring-up alongside DC ops and customer-facing teams.
  • Run maintenance and upgrades on switches and control plane components.
  • Partner with cluster ops on cross-domain incidents where the line between compute and network is blurry.
  • Improve the tooling and runbooks so the next incident resolves faster than the last.
