Site Reliability Engineer IV - ePay at GHX (Remote)
Sorry, this job was removed at 8:17 a.m. (MST) on Tuesday, November 30, 2021
The Site Reliability Engineer IV is responsible for ensuring that the ePay application is highly available, resilient, secure, and scalable. Our ideal candidate is well-versed in modern cloud-based architecture and experienced in designing systems for reliability as well as implementing monitoring, alerting, and ops automation. The candidate will have proven experience in change management and emergency response, along with experience working with development teams to create the automated pipelines and solutions required for continuous delivery in an Agile DevOps environment.
Principal duties and responsibilities:
- Automate anything and everything! (Infrastructure build-out, testing, deploying, monitoring, etc.)
- Design and assist in the authoring of software tools that reliably manage application delivery
- Implement proactive monitoring, alerting, trend analysis, and self-healing systems; perform quality reviews and manage operational issues
- Partner with development team in defining and implementing improvements in service architecture
- Ensure services are designed with 24/7 availability and operational readiness and rigor
- Improve predictability and reliability of software releases, workflows and operating software.
- Collaborate with Product and Support teams to plan and deploy frequent product releases
- Reduce application deployment windows by leading the company toward the automated pipelines and solutions required for continuous delivery in an Agile environment
- Reduce mean time to recovery (MTTR) by helping to troubleshoot, monitor, and alert on incidents and by automating recovery
- Implement SRE tools, processes, and best practices
- Work with Dev, Product, and Ops teams to perform root cause analysis and re-instrument triggers to prevent future network degradation and outages
- Define non-functional requirements as part of the product lifecycle to influence the new designs, standards, and methods for scalable, highly available distributed systems
- Explore and innovate new cloud and HA technologies, features, and tools
Requirements:
- Bachelor's degree in Computer Science or equivalent practical experience
- Solid knowledge of Linux and experience in production support activities
- Fundamental understanding of TCP/IP, load balancing, routing, firewalling, clustering basics, DNS, and HTTP/HTTPS
- Fluency with at least one current-generation scripting language used by DevOps professionals (Bash, Python)
- Experience with Continuous Integration and Continuous Delivery concepts and best practices, including Infrastructure as Code, using tools like Terraform, CloudFormation, Ansible, Chef, Puppet, or an equivalent
- Proven experience with DevOps log, monitoring, and metric collection toolsets such as ELK and Thanos/Prometheus
- Deep understanding of the software delivery process with the ability to implement and enforce that process across the organization
- Hands-on experience with AWS
- Development experience is a must
- Experience in taking application code and third-party products and building full end-to-end pipelines to build, test, and deploy complex systems
- Ability to containerize an application and build a process around creating containers and pushing them to an artifact repository
- Understanding of general networking concepts, connectivity, systems architecture, and disaster recovery