UP.Labs

Sr. AI Quality Engineer

Posted Yesterday
Remote
Hiring Remotely in USA
Senior level
The Company:
Groundtruth is building a first-of-its-kind AI billing platform backed by one of the largest companies in the transportation and logistics industry. Our platform automates accuracy and trust in complex freight transactions by transforming messy, unstructured data – like complex email chains, documents, telematics, and contracts – into a clear, reliable system of record that drives accurate billing.

The Role:
We’re hiring a hybrid AI QA + Product Analyst to own end-to-end quality for our AI-powered inference system. This role sits at the intersection of LLM inference quality, event-driven backend state machines, and freight domain logic.
You will define what “correct” means, build the quality measurement and regression approach to enforce it, and lead deep-dive investigations when edge cases or customer-specific rules break downstream behavior. The goal is to make our system more accurate, more diagnosable, and more reliable as email volume and customer complexity scale.

What you’ll do:
Own end-to-end system quality
  • Develop and maintain a quality rubric for key use cases and exception types (what “right” looks like, and what failure looks like).
  • Build and curate golden datasets (representative emails + expected structured output + expected final outcome), including customer-specific variations.
  • Own ongoing quality review in dev and production: regularly inspect high-volume outputs, diagnose what’s breaking and why, and convert discoveries into concrete roadmap items and regression coverage.
  • Define and execute regression tests for new model changes, backend logic changes, or customer-specific use cases.

Investigate and diagnose issues across the full stack of the product
  • Triage quality incidents and ambiguous failures by tracing through:
    • email ingestion/parsing
    • prompts / model outputs / normalization steps / data contracts
    • intermediate structured representations
    • event streams and state-machine transitions
    • final audit exception generation and downstream reporting
  • Use logs, traces, event histories, and data queries to isolate root cause.
  • Produce high-signal findings reports: minimal reproduction, suspected component, evidence, impact, and recommended fix.

Build scalable quality operations
  • Create a repeatable triage playbook and classification system for quality issues.
  • Define monitoring & dashboards for quality signals (volume anomalies, exception drift, per-customer error hotspots).
  • Partner with engineering/AI to improve observability (correlation IDs, structured logging, traceability from email → state transitions).

Act as a product/domain translator
  • Understand freight billing workflows and how real-world documents and communication map to our system’s model of “truth”.
  • Convert customer-specific requirements into testable rules and expected outcomes.
  • Identify systemic gaps where “reality” doesn’t fit the current schema, and propose product changes.

Required qualifications:
  • Experience in roles that blend quality + investigation + systems thinking (examples: QA engineer in distributed systems, product analyst with deep debugging, LLM quality analyst, solutions engineer owning incident triage).
  • Demonstrated experience evaluating AI/LLM output quality (extraction/classification, structured outputs, tool calling, RAG, prompt-driven pipelines, or similar).
  • Strong technical ability to debug production issues using:
    • log/trace tools (Datadog, ELK, Honeycomb, OpenTelemetry/Jaeger, etc.)
    • SQL and/or Python for analysis and repro
    • working knowledge of event-driven architectures and workflows/state machines (or similar distributed workflow systems)
  • Ability to write crisp requirements and acceptance criteria, and translate ambiguity into test cases.
  • Comfort operating in messy, high-volume, edge-case-heavy environments.

Nice-to-have qualifications:
  • Freight/logistics/audit/billing domain experience (carrier invoices, accessorials, detention, lumper, fuel surcharge, tenders, BOLs, rate confirmations, PODs, etc.).
  • Experience designing evaluation metrics (precision/recall, drift detection, per-customer or per-use-case scorecards).
  • Familiarity with workflow engines/state machines and distributed systems failure modes (event ordering, retries, dedupe, idempotency, partial failure).
  • Experience with annotation/labeling workflows, taxonomy design, and building human-in-the-loop QA processes.

Traits that matter in this role:
  • High ownership: you don’t stop at “it’s broken,” you drive it to root cause and resolution.
  • Comfortable with ambiguity and edge cases; systematic in building clarity.
  • Able to communicate across product, engineering, ML, and operations.

Top Skills

AI
Datadog
ELK
Honeycomb
Jaeger
LLM
OpenTelemetry
Python
SQL
