Arena (arena.ai)

Site Reliability Engineer, Platform

Posted 24 Days Ago
Remote or Hybrid
Hiring Remotely in CA
Senior level
The role involves defining and evolving technical foundations for AI evaluation, optimizing performance, designing resilient systems, and collaborating with various teams for infrastructure improvements.
About Arena Intelligence

Arena Intelligence is the open platform for evaluating how AI models perform in the real world. Created by researchers from UC Berkeley’s SkyLab, our mission is to measure and advance the frontier of AI for real-world use.

Millions of people use Arena Intelligence each month to explore how frontier systems perform — and we use our community’s feedback to build transparent, rigorous, and human-centered model evaluations. Leading enterprises and AI labs rely on our evaluations to understand real-world reliability, alignment, and impact. Our leaderboards are the gold standard for AI performance — trusted by leaders across the AI community and shaping the global conversation on model reliability and progress.

We’re a team of researchers, engineers, academics, and builders from places like UC Berkeley, Google, Stanford, DeepMind, and Discord. We seek truth, move fast, and value craftsmanship, curiosity, and impact over hierarchy. We’re building a company where thoughtful, curious people from all backgrounds can do their best work. Everyone on our team is a deep expert in their field — our office radiates excellence, energy, and focus.

About the Role

Arena Intelligence is seeking a Site Reliability Engineer to own the reliability, performance, and operational security of the platform that millions of people depend on to evaluate frontier AI. This is the first dedicated SRE hire on the team — you'll build observability, incident response, and infrastructure hardening practices from scratch while also owning the CI/CD and developer tooling that keeps our engineering team moving fast.

Our stack runs on Vercel (Next.js, Hono API on Nitro), Supabase (Postgres, GoTrue auth), Cloudflare (Workers, R2, bot management), and AWS (CloudFront, Lambda). You'll work across the full request path — from edge-layer DDoS mitigation to auth hardening to production monitoring — partnering closely with security and product engineering to keep the platform fast, reliable, and resilient under adversarial traffic conditions.

You’ll
  • Harden auth infrastructure against volumetric attacks — edge-layer rate limiting in front of Supabase GoTrue, connection pool tuning, token caching, and origin shielding so DDoS traffic is filtered before it reaches the database

  • Extend CloudFront WAF rules and Cloudflare Worker bot management to cover auth endpoints and close gaps in application-layer rate limiting

  • Define and implement SLOs/SLIs across the full request path — CDN edge through serverless functions to Supabase

  • Build monitoring, alerting, and dashboards on top of existing Datadog and PostHog instrumentation that surface degradations before users notice them

  • Collaborate with security engineering to ensure clean handoff between edge-layer defenses and application-layer anti-abuse systems

  • Own and improve CI/CD pipelines (GitHub Actions, Turborepo) and expand infrastructure-as-code (Terraform) across cloud environments

  • Proactively load-test and stress-test infrastructure, model its capacity limits, and drive cost optimization across our multi-cloud footprint

  • Enhance developer workflows to make building, testing, and deploying faster and more reliable

  • Mentor engineers across the company on building reliable, performant, and observable systems

You’ll have
  • 6+ years of experience in SRE, platform engineering, or infrastructure engineering, including operating production systems at scale (millions of users / billions of requests)

  • Direct experience mitigating DDoS attacks and configuring edge security — WAF rules, CDN architecture, rate limiting, and traffic analysis

  • Hands-on experience building observability systems (Datadog, Grafana, Prometheus, or similar) and running incident response processes

  • Strong understanding of auth infrastructure under adversarial load — connection pooling, token caching, and rate limiting on login/signup endpoints

  • Experience with serverless architectures and managed platforms — you know how to make them reliable and observable at scale

  • Experience with infrastructure-as-code (Terraform, Pulumi) and CI/CD pipeline design

  • Track record of collaborating with security and product engineering to deliver both foundational systems and user-facing reliability improvements

Bonus Experience
  • Experience with Vercel, Supabase (GoTrue, Supavisor), Cloudflare Workers, or CloudFront specifically.

  • Experience with Node.js, TypeScript, Python, or Go in production backend environments.

  • Background in platforms with voting, reputation, or community-driven systems.

  • Experience being the first or early infrastructure hire at a startup.

  • Experience hardening auth systems under load (OAuth, JWT, PKCE flows, connection pooling).

What we offer
  • Competitive compensation and equity aligned to the markets where our team members are based. The base salary range depends on the candidate’s permanent work location.

  • Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs.

  • The opportunity to work on cutting-edge AI with a small, mission-driven team

  • A culture that values transparency, trust, and community impact

Come help build the space where anyone can explore and help shape the future of AI.

Arena Intelligence provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, genetics, sexual orientation, gender identity, or gender expression. We are committed to a diverse and inclusive workforce and welcome people from all backgrounds, experiences, perspectives, and abilities.
