Utilidata

AI Infrastructure Engineer

Posted 4 Hours Ago

Remote

Hiring Remotely in United States

170K-210K Annually

Senior level

The AI Infrastructure Engineer designs and builds infrastructure for AI and ML models across various environments, optimizing performance and reliability.

Utilidata is a fast-growing NVIDIA-backed edge AI company enabling greater visibility and control of power utilization in energy-intensive infrastructure, like the electric grid and data centers. Karman, the company’s distributed AI platform powered by a custom NVIDIA module, is transforming the way utility companies operate the grid edge and will enable data centers to unlock more compute for the same provisioned power.
The AI Infrastructure Engineer is responsible for designing, building, and owning the end-to-end infrastructure that serves Utilidata's AI and ML models across edge deployments, cloud environments, and data center integrations. They are also responsible for designing, building, and owning the integration of power data with AI inference software. This is Utilidata's first dedicated role of this kind, and will serve as the foundational function for how the company deploys and operates AI capabilities in production. The role requires deep technical expertise in ML model serving, distributed systems, and GPU infrastructure, with a strong emphasis on reliability, performance, and scalability. This position works cross-functionally with product, engineering, and data science teams and is open to fully remote candidates, with periodic travel expected for company retreats and key on-site engagements.
Responsibilities

Lead the design and build of Utilidata's AI inference platform — establishing architecture patterns, deployment standards, and operational practices that will scale with the company
Own the end-to-end model serving infrastructure for Utilidata's AI workloads, both on-prem and in the data center
Build and maintain fault-tolerant, high-performance systems for serving AI models at scale, with a focus on low latency, reliability, and cost efficiency
Collaborate closely with algorithms engineers to integrate AI inference data and configuration with power optimization algorithms
Optimize GPU utilization and inference performance across our hardware fleet, including NVIDIA accelerators central to Utilidata's edge AI platform
Establish MLOps best practices including CI/CD pipelines for model deployment, monitoring, and rollback across environments
Contribute to infrastructure roadmap decisions, including build vs. buy tradeoffs, tooling selection, and platform evolution as the team grows

Minimum Qualifications

5+ years of software engineering experience with a strong focus on AI infrastructure, backend systems, or distributed systems
Hands-on experience with AI model serving frameworks (e.g., vLLM, SGLang, Triton, TensorRT, TorchServe, or similar)
Understanding of container orchestration and cluster management (Kubernetes, Docker)
Experience deploying and operating infrastructure across both datacenter and on-prem environments
Strong knowledge of GPU workloads and the tradeoffs that come with them — you understand how inference differs from training, and why it matters
Proficiency in Python; C++, CUDA, Go, or Rust a plus
Excellent communication skills and comfort working cross-functionally in a lean, fast-moving environment
Willingness to travel up to 10% of the time

Enhanced Qualifications (Nice to Have)

Experience with Dynamo
Experience with edge AI deployments or constrained compute environments
Familiarity with infrastructure as code (Terraform, Helm)
Experience with observability platforms (Datadog, Prometheus, Grafana)
Background in energy, utilities, or industrial IoT
Contributions to open-source ML infrastructure projects

Salary Range: $170,000 to $210,000 base compensation depending on experience plus stock options. Salary will be commensurate with an individual's skills, training, years of experience, and in line with internal compensation bands.
Location: This position can be performed remotely from anywhere in the United States.
Our Commitments:
Utilidata values the diversity of our team. We provide equal employment opportunities without regard to race, color, religion, creed, sex, gender, sexual orientation, gender identity or expression, national origin, age, physical disability, mental disability, medical condition, pregnancy or childbirth, genetics, genetic information, marital status, status as a covered veteran, or any other basis protected by applicable federal, state, and local laws.
We are committed to:

Creating a diverse and inclusive workplace that is welcoming, supportive, affirming and respectful
Empowering employees to solve problems and work together to make a difference
Providing mentorship and growth opportunities as part of a collaborative team
A flexible work environment with flexible paid time off
Competitive compensation and benefits, including health, dental, vision, and employer-match 401k

Top Skills

AI Infrastructure

C++

CUDA

Datadog

Docker

Grafana

Helm

Kubernetes

ML Model Serving Frameworks

Prometheus

Python

Rust

Terraform
