Edgescale AI

Principal Core Engineer — Infra / SRE

Posted 4 Days Ago
Hybrid
Denver, CO, USA
190K-215K Annually
Expert/Leader
Own fleet-scale reliability, upgradeability, and operational excellence for an edge platform. Design and operate automated, secure lifecycle systems, observability, and incident response. Lead cross-domain, high-severity incident ownership, set production standards (SLOs/SLIs, change management), mentor engineers, and apply AI to accelerate diagnostics and operational workflows.
The Opportunity

We’re looking for a Core Engineer at the Principal Infra / SRE level to own the reliability, scalability, upgradeability, and operational excellence of our edge platform at fleet scale.

In this role, you’ll be the technical authority for designing and operating compound capabilities that span software, infrastructure, networking, security, data, and hardware—ensuring we can reliably deploy, upgrade, and manage fleets of thousands of devices with the highest technical rigor. You will set and enforce production standards, and you have the authority to stop changes that would put fleet safety or reliability at risk. During high-severity incidents, you are the technical owner—leading root-cause analysis and driving fixes across teams.

This is a hands-on role for someone who thrives in a high-ownership setting and wants to build the infrastructure that makes real-world AI possible. You’ll operate in an AI-native way, using AI to assist diagnostics and operations while ensuring all production changes remain governed, reviewed, and auditable.

What You’ll Do
  • Own platform-wide reliability and scalability architecture across the fleet, including upgradeability, rollback safety, resilience, observability, and incident response.

  • Lead the design and delivery of compound capabilities that span multiple specialist domains (hardware, networking, security, data, infrastructure, and AI runtime).

  • Set and enforce production-grade standards for operational excellence, including SLOs/SLIs, error budgets, on-call readiness, change management, incident management, and postmortem practices, with the authority to stop changes that introduce unacceptable risk.

  • Serve as the technical owner during high-severity incidents, leading diagnosis, root-cause analysis, and coordinated remediation across teams.

  • Design and operate secure, automated fleet lifecycle systems for deployment, updates, configuration management, and health management at scale.

  • Drive the evolution of observability and telemetry systems (metrics, logs, traces, audit, fleet state) so issues are detectable, diagnosable, and preventable.

  • Partner with engineering and commercial teams to translate real-world constraints into platform-level requirements and prioritization decisions.

  • Operate in an AI-native way: develop and use AI systems to accelerate diagnostics, automate operational workflows, and increase engineering velocity, while ensuring all production changes remain governed, reviewed, and auditable.

  • Mentor senior engineers across domains, review technical designs, and raise the quality bar for architecture and reliability across the organization.

What Success Looks Like

In your first 3 months, you will have:

  • Taken full ownership of a platform-wide reliability, upgradeability, or incident reduction initiative and delivered measurable improvements in fleet stability, deployment safety, and operational clarity.

  • Established or strengthened production standards that reduce risk and improve consistency across releases and fleet operations.

  • Demonstrated strong incident ownership by leading at least one high-severity investigation through root cause and durable remediation.

In your first year, you will be:

  • Owning the fleet-scale operational architecture end-to-end, with clear accountability for reliability, upgradeability, scalability, and security posture across thousands of deployed systems.

  • Delivering step-function improvements in platform resilience and operational excellence through durable systems (automated lifecycle management, observability, incident reduction, reliability standards).

  • Raising engineering rigor across the organization by enforcing standards, mentoring technical leaders, and driving cross-domain architectural decisions that compound over time.

Who You Are
  • 10+ years building and operating production infrastructure and distributed systems, including reliability engineering at scale across complex, multi-tenant or fleet environments.

  • Deep experience with SRE practices: SLOs/SLIs, error budgets, observability, incident response, postmortems, and operational automation (e.g., Kubernetes-based platforms, Linux systems, and automation through infrastructure-as-code).

  • Strong systems thinking across software, infrastructure, networking, and security, with the ability to drive outcomes across multiple domains and enforce production standards.

  • Proven ability to lead ambiguous, high-impact initiatives end-to-end with strong technical judgment, crisp execution, and disciplined change management.

  • Clear communicator and trusted technical partner to engineering leadership, with the ability to lead high-severity incident response and drive cross-team alignment.

  • Ownership mindset: outcomes over tasks.

Unique Experiences We Value
  • Designing and operating fleet management and upgrade systems at scale, including safe rollout/rollback, configuration management, and health monitoring (e.g., canary deployments, staged rollouts, and verifiable rollback mechanisms).

  • Building observability platforms that make complex systems diagnosable and measurable across large distributed deployments (e.g., metrics/logs/tracing pipelines, alerting, and dashboards that drive action).

  • Security-first operations experience (secure boot, signed updates, audit logging, default-deny posture) and working in compliance-sensitive environments with governed production changes.

  • Experience operating systems under real-world edge constraints (limited connectivity, bandwidth limits, variable environments, high reliability requirements) and building automation that reduces operational variance.

  • Applying AI to operations and engineering workflows (automated diagnostics, agentic triage, runbook generation, anomaly detection) to increase rigor and speed while keeping production pathways reviewed and auditable.

Benefits
  • We work in a high-ownership, real-world startup environment where you’ll move fast, build new systems, and see your impact immediately—what you ship runs in the field and drives measurable customer outcomes.

  • We work alongside AI every day. Writing static code, docs, or plans “by hand” is no longer accepted—here you’ll use the latest AI tools to iterate and ship faster and to apply AI with our customers at scale.

  • You’ll take on elite technical challenges at the frontier of infrastructure, including next-generation cloud and IoT, hardware/software/networking in real-world edge environments, the foundation for data and AI inference, and industry-leading secure systems in demanding operational (OT) settings.

  • You’ll learn fast by working with exceptional teammates and collaborating directly with industry leaders as partners in software, AI, and infrastructure.

  • Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. This role has a base salary range of $190,000–$215,000.

  • Total compensation for this role includes meaningful equity, granted as stock options in an early-stage, high-growth company.

  • You are eligible to participate in company benefit plans, which may include health, dental, and vision coverage, a 401(k) with company match, flexible PTO, paid parental leave, commuter benefits, and relocation and visa support for eligible roles.

Edgescale AI

At Edgescale AI, we’re deploying AI in the real world—helping customers apply this technology to unlock transformative productivity gains. Our work sits at the intersection of infrastructure, security, networking, and AI, where reliability and performance are non-negotiable and where solutions demand deep, distributed systems thinking.

We’re intensely AI-native. We build with AI, we ship AI, and we use it every day to accelerate how we design, test, deploy, and operate complex systems. If you want to help pave the way for applying AI in the real world, at global scale, we want to hear from you.

Edgescale AI is building an inclusive, merit-based organization. We are an equal opportunity employer and do not discriminate on the basis of any legally protected status. We value diversity, inclusion, and a shared passion for creating real-world impact.
