Senior Site Reliability Engineer - Data at Angi
About the Role
Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other HomeAdvisor production systems running smoothly. SREs are a hybrid of operators and software engineers who apply engineering principles, operational experience, and automation to our environments. You will help shape our infrastructure and build the foundation our team relies on for the rapid, reliable delivery of our product. We’ll rely on you to instill best practices for building scalable distributed systems, with a keen focus on observability and fault tolerance. Our stack consists of technologies such as Kubernetes, Java Spring Boot, Oracle, Postgres, Coherence, Redis, and Elasticsearch, running in a hybrid cloud.
We are looking for experienced Site Reliability Engineers who meet the following criteria:
- Breadth of knowledge across our infrastructure and application stack.
- Contributes small improvements across all codebases to resolve issues.
- Experience with container orchestration technologies like Kubernetes, Mesos, or Nomad. (We use Kubernetes.)
- A track record of leveraging automation whenever and wherever possible.
- An appreciation of and enthusiasm for software engineering best practices, such as infrastructure as code, testing, and continuous delivery.
- Data Engineering/Administration or production infrastructure and operations background.
- Experience with SQL data stores like Oracle, MSSQL, MariaDB, and PostgreSQL, particularly with performance challenges in migrations from on-prem to cloud (RDS, Aurora, etc.).
- Identifies changes to the product or infrastructure architecture from a reliability, performance, and availability perspective, using a data-driven approach.
- Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, so that HomeAdvisor operates with cost as a discipline.
- Identifies parts of the system that do not scale, and provides immediate and long-term resolutions for the resulting incidents.
- Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.
Collaboration and Communication:
- Know a domain really well and spread that knowledge across the rest of the engineering organization.
- Run blameless root cause analyses (RCAs) on incidents and outages, and drive efforts to prevent them from recurring.
- Show ownership of a major part of the infrastructure.
As an SRE you will:
- Be part of an on-call rotation to respond to incidents and provide support for software engineers across HomeAdvisor initiative teams.
- Build visibility into SLIs, SLOs, SLAs, and dependency graphs to reduce operational burden and toil.
- Drive instrumentation patterns that alert on symptoms rather than on outages, using our monitoring stack of Grafana, Prometheus, and Elasticsearch.
- Use your on-call shift to prevent incidents from occurring.
- Run our infrastructure with Terraform and Kubernetes.
- Take a data-driven approach to findings, turning them into repeatable actions and then into automation.
- Improve the deployment process to make it as quick and dependable as possible.
- Design, build and maintain core infrastructure pieces that allow HomeAdvisor to scale to meet its market demand.
- Debug production issues across the full stack.
- Plan and shape the growth of HomeAdvisor’s ever-evolving infrastructure.
You may be a fit for this role if you:
- Think about systems - edge cases, failure modes, behaviors, specific implementations.
- Have an understanding of large-scale system design, monitoring, and operational practices.
- Have strong programming skills in Ruby and/or Go.
- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
- Have a burning desire to deliver quickly and iterate fast.
- Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar technologies.
Projects you could work on:
- Building automation around testing & performance for database services.
- Creating monitoring for critical data services, along with tools for remediation, failover, and high availability (HA).
- Migrating legacy SQL databases to modern cloud-native implementations, e.g. MSQL → Aurora.
- Building tooling to help reduce toil across the engineering organization and in the use of large, mission-critical data transaction services.
Compensation & Benefits:
- The salary band for this position ranges from 140K to 200K, plus bonus and equity, commensurate with experience and performance.
- Full medical, dental, and vision package to fit your needs.
- Flexible vacation policy; work hard and take time when you need it.
- Pet discount plans and a 401(k) retirement plan with company match.
- The rare opportunity to work with sharp, motivated teammates solving some of the most unique challenges and changing the world.