P-1 AI

AI Evals Technical Lead

Posted 7 Days Ago
Remote
Hiring Remotely in United States
165K-225K Annually
Senior level
Responsible for designing, implementing, and validating eval benchmarks that verify the AI system can perform its engineering tasks and that measure it against industry skill expectations.
The summary above was generated by AI

About you:

  • feel an unshakeable pull to work on agentic AI

  • can usually break an AI or a piece of software in under a minute (if you want to)

  • are a skilled developer yourself

  • always develop an interest in the subject matter you’re building tests for, and are eager to do the same for the industrial products that run the world

  • believe in manifesting the future of physical engineering

About P-1 AI:

We are building an engineering AGI. We founded P-1 AI with the conviction that the greatest impact of artificial intelligence will be on the built world—helping mankind conquer nature and bend it to our will. Our first product is Archie, an AI engineer capable of quantitative and spatial reasoning over physical product domains that performs at the level of an entry-level design engineer. We aim to put an Archie on every engineering team at every industrial company on earth.

Our founding team includes the top minds in model-based engineering, deep learning, and industries that are our customers. We just closed a $23 million seed round led by Radical Ventures that includes a number of other AI and industrial luminaries. We invite you to join our team of the world’s best engineers and AI researchers, building AI’s most impactful use case.

About the Role:

In this role, you’ll be responsible for the evals that we use to ensure that Archie is learning and retaining the skills needed to perform its engineering work, and to benchmark it against industry skill expectations. Working within a small, tightly-knit team of high-performers, you’ll be principally responsible for clearly defining, implementing, and validating these evals, incorporating input from our engineering experts and industrial partners. You’ll also be responsible for translating the evals into multiple formats for use with different types of AI and non-AI systems and agents.

This role is remote: you can be based anywhere in the US or Canada, provided you have existing work authorization there. You will be expected to travel to our San Francisco office for co-working sessions approximately one week out of every six. If you are already located in the SF Bay Area or are interested in relocating, you are welcome to work out of our SF office. Our AI team is based in the SF office, so there would be some benefit to being in-office at least part of the time.

Responsibilities:

  • Implement the system for organizing, transforming, running, grading, and reporting on eval benchmarks.

  • Design and execute the process by which we develop and QA our evals, incorporating contributions from our own engineering team, industrial partners, and subject-matter experts.

  • Ensure that evals run effectively within our CI/CD system, continuously benchmarking our evolving AI platform and the experiments we’re performing around it.

  • Create methods for detecting and testing for common quality challenges of AI, including hallucinations, undesirable stochasticity, and regressions.

  • Be a technical leader in the consistent implementation and organization of automated tests across other areas of our technology stacks.

Skills and Experience:

  • Experience in constructing comprehensive test suites for software and/or AI systems, including coordinating the contributions of others.

  • Experience designing metrics to evaluate systems and visualize their performance, including differences across successive generations.

  • Experience in developing, managing, and running evals against LLM-based systems is a strong plus.

  • Good communication skills with a variety of stakeholders (AI researchers, domain experts, application developers).

  • Proficiency in Python, including work with complex modules, and in modern software development tools and practices (Git, CI/CD, etc.).

  • Ability to thrive in a fast-paced, dynamic startup environment.

Interview Process:

  • Initial screening - with Head of Talent (30 mins)

  • Hiring manager interview - with co-founder & Head of Engineering (45 mins)

  • Programming interview - with member of technical staff & Head of Engineering (60 mins)

    • bring your own dev environment and tools

  • Culture fit / Q&A - with co-founder & CEO (45 mins)

Top Skills

CI/CD
Git
LLM
Python
