Company Overview
We are building Protege to solve the biggest unmet need in AI — getting access to the right training data. The process today is time-intensive, incredibly expensive, and often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data.
Solving AI's data problem is a generational opportunity. The company that succeeds will be one of the largest in AI — and in tech.
Role Overview
Data is the foundation of AI performance, and we believe model quality starts with data quality. You’ll be at the heart of shaping how we curate, assess, and prepare the training data that powers real-world AI systems.
We’re seeking a Senior Member of the Core Data Team / Principal Scientist to lead the evaluation and optimization of large-scale datasets used to train state-of-the-art AI models. In this role, you’ll help define what "high-quality data" means in practice, using statistical, computational, and ML-driven methods to ensure our data is diverse, representative, and high-impact. You’ll work closely with research and engineering teams to improve model performance through better data. This is an ideal role for someone with a PhD in machine learning, computer science, or a related applied field who is passionate about the role of data in AI training and excited to advance Protege’s mission to become the ubiquitous platform for AI training data.
Key Responsibilities
Design and apply statistical and machine learning methods to curate, filter, and enrich large-scale unstructured datasets
Develop frameworks to assess data diversity, duplication, and informativeness. Design statistical approaches to de-risk training datasets.
Collaborate with model training teams to identify data bottlenecks and optimize dataset performance, working effectively with both large foundation-model labs and smaller startups.
Provide leadership on data quality strategy and shape internal best practices
Evaluate external datasets for integration, focusing on scalability, quality, and relevance to model performance. Help build data scorecards.
Contribute to research and development of tools that automate data preprocessing and validation
About You
PhD, or a Master's degree plus 4+ years of industry experience, in machine learning, economics, mathematics, engineering, computer science, statistics, or a related quantitative field
Strong understanding of AI model training pipelines, including pre-processing and evaluation
Experience working with large, unstructured datasets, especially text
Background in statistical analysis, bias detection, and data validation
Able to identify high-impact problems and drive independent solutions
Bonus if you have these attributes
Experience with synthetic data generation or augmentation strategies
Publications or open-source contributions in data-centric AI or related areas
Experience developing evaluation frameworks or performance metrics for training data
Cross-functional collaboration with product, infrastructure, or partnership teams
What you need to know about the Colorado Tech Scene
Key Facts About Colorado Tech
- Number of Tech Workers: 260,000; 8.5% of overall workforce (2024 CompTIA survey)
- Major Tech Employers: Lockheed Martin, CenturyLink, Comcast, BAE Systems, Level 3
- Key Industries: Software, artificial intelligence, aerospace, e-commerce, fintech, healthtech
- Funding Landscape: $4.9 billion in VC funding in 2024 (Pitchbook)
- Notable Investors: Access Venture Partners, Ridgeline Ventures, Techstars, Blackhorn Ventures
- Research Centers and Universities: Colorado School of Mines, University of Colorado Boulder, University of Denver, Colorado State University, Mesa Laboratory, Space Science Institute, National Center for Atmospheric Research, National Renewable Energy Laboratory, Gottlieb Institute