P-1 AI

AI Evals Technical Lead

Reposted 5 Days Ago

Remote

Hiring Remotely in United States

200K-250K Annually

Senior level

Responsible for designing, implementing, and validating eval benchmarks that verify AI systems can perform engineering tasks and that measure them against industry standards.

The summary above was generated by AI

About P-1 AI:

We are building an engineering AGI. We founded P-1 AI with the conviction that the greatest impact of artificial intelligence will be on the built world. Our first product is Archie, an AI engineer capable of quantitative intuition over physical product domains and engineering tool use. Archie initially performs at the level of an entry-level design engineer but rapidly gets smarter and more capable. We aim to put an Archie on every engineering team at every industrial company on earth.

Our founding team includes top minds in deep learning, model-based engineering, and the industries our customers operate in. We closed a $23 million seed round led by Radical Ventures that includes a number of other AI and industrial luminaries (from OpenAI, DeepMind, etc.).

About the role:

In this role, you’ll be responsible for the evals that we use to ensure that Archie is learning and retaining the skills needed to successfully perform its engineering work, and to benchmark it against industry skill expectations. Working within a small, tightly-knit team of high performers, you’ll be principally responsible for clearly defining, implementing, and validating these evals, incorporating input from our engineering experts and industrial partners. You’ll also be responsible for translating these evals into multiple formats for use with different types of AI and non-AI systems and agents.

This role is remote and you can be based anywhere in the US or Canada, where you must have existing work authorization. You will be expected to travel to our San Mateo office for co-working sessions approximately one week out of every six. If you are already located in the Bay Area or are interested in relocation, you are of course welcome to work out of our San Mateo office. Our AI team is based in the San Mateo office, so there would be some benefit to you being in-office at least part of the time.

What you’ll do:

Implement the system for organizing, transforming, running, grading, and reporting on eval benchmarks.
Design and execute the process by which we develop and QA our evals, incorporating contributions from our own engineering team, industrial partners, and subject-matter experts.
Ensure that evals run effectively within our CI/CD system, continuously benchmarking our evolving AI platform and the experiments we’re performing around it.
Create methods for detecting and testing for common quality challenges of AI, including hallucinations, undesirable stochasticity, and regressions.
Be a technical leader in the consistent implementation and organization of automated tests across other areas of our technology stacks.

Who you are:

Experience in constructing comprehensive test suites for software and/or AI systems, including coordinating the contributions of others.
Experience designing metrics to evaluate systems and visualize their performance, including differences across successive generations.
Experience in developing, managing, and running evals against LLM-based systems is a strong plus.
Good communication skills with a variety of stakeholders (AI researchers, domain experts, application developers).
Proficiency in Python programming, including complex modules, and in modern software development tools and practices (Git, CI/CD, etc.).
Ability to thrive in a fast-paced, dynamic startup environment.

Our values:

Mission obsession & urgency: We are obsessed with building engineering AGI as quickly as possible. We also recognize that as a startup, speed is our most precious competitive advantage. We are constantly asking ourselves what we can do to go faster. We make tradeoffs and sacrifices (personally and in the workplace) in exchange for speed.

Intellectual excellence & curiosity: We ask “what if?” and experiment liberally. We always look for better ways of doing something. We read voraciously. We challenge each other to be better. We surround ourselves with A players and we actively and unapologetically reject B players (and even B+ players, because they tend to surround themselves with C players).

Shipping discipline: We treat production with respect. We test and demo our product constantly. We listen attentively to our customers, users, and stakeholders, and we respect our commitments to them. We also respect our commitments to each other and will go the extra mile (or ten or one hundred) to honor them.

Ownership: We all have significant ownership stakes in the company and operate in founder mode. We believe in hierarchical requirements but not in hierarchical information flows. If we see that something is broken or can be done better, we flag it and we fix it. We encourage each other to play with and fix anything and everything... but there’s a clear owner for everything.

Interview process:

Initial screening call (30 mins)
Biographical/behavioural interview (45 mins)
Technical interview (60 mins)
CEO interview (30 mins)

Compensation:

Salary: $200k - $250k.

This role includes a significant equity component. We are an early-stage startup, so we favor equity over cash in our current compensation philosophy. This role is best suited for candidates who value long-term ownership and impact over short-term cash optimization. Our benefits include healthcare, dental, and vision insurance, 401(k) with employer matching, and unlimited PTO.

Top Skills: CI/CD, Git, LLM, Python
