Site Reliability Engineer
Department Summary
DISH is a Fortune 200 company that continues to redefine the communications industry. Our legacy is innovation and a willingness to challenge the status quo, including reinventing ourselves. We disrupted the pay-TV industry in the mid-90s with the launch of the DISH satellite TV service, taking on some of the largest U.S. corporations in the process, and grew to be the fourth-largest pay-TV provider. We are doing it again with the first live, internet-delivered TV service - Sling TV - that bucks traditional pay-TV norms and gives consumers a truly new way to access and watch television.
Now we have our sights set on upending the wireless industry and unseating the entrenched incumbent carriers.
We are driven by curiosity, pride, adventure, and a desire to win - it's in our DNA. We're looking for people with boundless energy, intelligence, and an overwhelming need to achieve, to join our team as we embark on the next chapter of our story.
Opportunity is here. We are DISH.
Job Duties and Responsibilities
The Site Reliability Engineer (SRE) will be responsible for both uplifting and maintaining our evolving technology platforms, infrastructure and technology controls. The role includes both oversight of production operations for our systems and development/engineering of solutions to maximize system reliability and automation. The role will address three dimensions:
- Tools Coverage: Assess tool coverage and ensure sufficient monitoring is in place to enable mature observability and data-driven decision making
- Defining and Educating: Define processes, procedures, guardrails and best practices, and educate Engineering teams on them
- Culture: Instill a culture of high-performing teams and drive adoption of SRE-influenced ways of working
The SRE will work with a global team responsible for a mission-critical business function and will partner with the Infrastructure, DevOps and Core practice teams (such as Security, Identity, ProdOps, Cloud Platform and Tools) to identify and implement automation opportunities that drive down toil, reduce technical debt and improve system reliability.
Key Responsibilities
- Work with DevOps teams to build, release, monitor and run services to improve service reliability
- Write software to automate API-driven tasks at scale and contribute to the product codebase in Java/Java Native, JS, React, Node, Go and Python
- Work with Ansible, Puppet, Chef, Terraform or another config management / orchestration suite, identify where it is broken, work towards fixing it and explore new alternatives
- Define and accelerate implementation of support processes, tools and best practices
- Maintain services once they are live by measuring and monitoring availability, latency and overall system reliability
- Handle cross-team performance issues end to end: identify the cause, determine the areas of improvement and drive the resulting actions to closure
- Baseline the performance and maturity of DevOps processes, tool maturity and coverage, metrics, technology and engineering practices
- Define, measure and improve reliability metrics (SLO/SLI), observability (monitoring, logging and tracing solutions) and Ops processes (Incident and Problem Management), and streamline and automate release management
- Be a strong practitioner of automation, driving sustained continuous improvement by automating toil and runbooks and by improving the ability of applications to auto-heal, leading to improved reliability
- Troubleshoot, debug, and diagnose operational issues and drive them to closure
- Knowledge of one or more of the following key areas: Ops maturity (performance testing, monitoring, operations - SIP), APM, Performance Benchmarking, Software Design and lifecycle (planning - discovery to provision), Infosec (including compliance, security)
- Experience building monitoring, metrics and alerting tools (APM tools) and custom dashboards for each application stack in each supported environment
- Expertise with Python-related Technologies and Frameworks
- Experience with at least one cloud computing infrastructure: GCP, Azure or AWS
- Familiarity with containerization (Kubernetes, Docker, Rancher, etc.), Kafka, Yarn, Elasticsearch, etc., as well as source code management and implementation of security best practices
- Tech stack: Python, Falcon, Elasticsearch, MongoDB, AWS (SQS, S3), MapReduce
- Be a subject matter expert, able to upskill and cross-skill engineering teams on SRE principles, tools and execution
Skills, Experience and Requirements
Applicable Skills and Requirements:
- Engineering degree with 3+ years of experience in Application Support
- Strong understanding of modern monitoring and logging technologies (Logz.io, CloudWatch, Splunk, Dynatrace, New Relic, AppDynamics, etc.)
- Understanding of microservice architecture
- Experience in Unix, Shell scripting/Python, SQL, AWS, etc.
- Experience in troubleshooting complex application and environment issues
- Excellent communication, presentation and documentation skills
- Strong experience with Intake, Problem Management, and Service Availability Management
- Basic knowledge of CI/CD tools and concepts
- Good knowledge of ITIL processes
- Willingness to work in shifts
- Ability to lead a team and mentor its members
Salary Range
Compensation: $106,250.00/Year - $143,750.00/Year
Benefits
From versatile health perks to new career opportunities, check out our benefits on our careers website. Successful completion of a pre-employment screen, including a drug test and criminal background check, is required.