Senior Site Reliability Engineer at Angi
Angi® is transforming the home services industry, creating an environment for homeowners, service professionals, and employees to feel right at “home.” For any home maintenance need, our platform makes it easier than ever to find a qualified service professional for indoor and outdoor jobs and home renovations (or anything in between!). We are on a mission to become the home for everything home by helping small businesses thrive and providing solutions to financing and booking home jobs with just a few clicks.
Over the last 25 years, we have opened our doors to a network of 250K+ service professionals and helped over 150 million homeowners love where they live. We believe home is the most important place on earth and are embarking on a journey to redefine how people care for their homes. Angi is an amazing place to build your dream career. Join us; we cannot wait to welcome you home!
About the Role
Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other HomeAdvisor production systems running smoothly. SREs are a hybrid of operators and software engineers who apply engineering principles, operational experience, and automation to our environments. You will help shape our infrastructure and build the foundation our team relies on for the rapid, reliable delivery of our product. We’ll rely on you to instill best practices for building scalable distributed systems, with a keen focus on observability and fault tolerance. Our stack consists of technologies such as Kubernetes, Java Spring Boot, Oracle, Postgres, Coherence, Redis, and Elasticsearch, running in a hybrid cloud.
We are looking for experienced Site Reliability Engineers who meet the following criteria:
- Breadth of knowledge across our infrastructure and application stack.
- Contributes small improvements to all codebases to resolve issues.
- Experience with container orchestration technologies like Kubernetes, Mesos, or Nomad. (We use Kubernetes.)
- A track record of leveraging automation whenever and wherever possible.
- An appreciation of and enthusiasm for software engineering best practices, such as infrastructure as code, testing, and continuous delivery.
- Identifies changes to the product or infrastructure architecture from a reliability, performance, and availability perspective, using a data-driven approach.
- Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, so that HomeAdvisor operates with cost as a discipline.
- Identifies parts of the system that do not scale and provides both immediate and long-term resolutions for the resulting incidents.
- Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.
Collaboration and Communication:
- Know a domain really well and spread that knowledge across the rest of the engineering organization.
- Run blameless root cause analyses (RCAs) on incidents and outages, and drive the work needed to prevent them from recurring.
- Show ownership of a major part of the infrastructure.
As an SRE you will:
- Be part of an on-call rotation to respond to incidents and provide support for software engineers across HomeAdvisor initiative teams.
- Build visibility into SLIs, SLOs, SLAs, and dependency graphs to reduce operational burden and toil.
- Drive instrumentation patterns that alert on symptoms rather than on outages, leveraging our monitoring stack of Grafana, Prometheus, and Elasticsearch.
- Use your on-call shift to prevent incidents from occurring.
- Run our infrastructure with Terraform and Kubernetes.
- Take a data-driven approach to findings, turning them into repeatable actions and then into automation.
- Improve the deployment process to make it as quick and dependable as possible.
- Design, build and maintain core infrastructure pieces that allow HomeAdvisor to scale to meet its market demand.
- Debug production issues across the full stack.
- Plan and shape the growth of HomeAdvisor’s ever-evolving infrastructure.
You may be a fit for this role if you:
- Think about systems: edge cases, failure modes, behaviors, and specific implementations.
- Have an understanding of large scale system design, monitoring, and operational practices.
- Have strong programming skills in Ruby and/or Go.
- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.
- Have a burning desire for delivering quickly and iterating fast.
- Have experience with Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar technologies.
Projects you could work on:
- Improving our monitoring stack across the board.
- Migrating our ingress controllers to a more cloud-native paradigm (Istio, Envoy, Traefik).
- Instrumenting our Rails app to collect important information about our applications.
- Automating an immutable Kubernetes upgrade pattern.
- Building tooling to help reduce toil across the engineering organization.
Compensation & Benefits:
- The salary band for this position ranges from 140k to 200k, plus bonus and equity, commensurate with experience and performance.
- Full medical, dental, and vision package to fit your needs
- Flexible vacation policy; work hard and take time when you need it
- Pet discount plans & retirement plan with company match (401(k))
- The rare opportunity to work with sharp, motivated teammates solving some of the most unique challenges and changing the world