Site Reliability Engineer
About Us
We are a passionate team of open source developers with a desire to build a successful and sustainable business that can impact the world at large. Our mission is to create open source, enterprise-grade products that help individuals and organizations unlock their potential and become top performers in their respective domains. To achieve this, we are building a suite of tools that span the entire web development lifecycle, ranging from a best-in-class local development experience all the way through multi-cloud, high-availability hosting (PaaS or self-hosted). To learn more, please visit https://www.drud.com/, our GitHub (https://github.com/drud/), and governance (https://github.com/drud/community) pages.
Roles and Responsibilities
Be professional, courteous, kind and responsive to others you engage with.
Integrate with a fast-paced engineering team to design, develop and deliver our local development and hosting products.
Help maintain 24×7 uptime on public cloud-based infrastructure.
Be a first responder during outages for managed hosting clients and for self-hosting clients with a support package.
Help design, build, and maintain solutions around logging, networking, monitoring, security, disaster recovery, etc.
Requirements
A team-centric philosophy and a strong emotional intelligence score are absolute musts. Through Project Aristotle, Google spent a tremendous amount of effort discovering the keys to high-performing teams, and we feel that we have a lot to gain by standing on the shoulders of giants when building out our team. We have a strong affinity for cloud-native technologies, and so should you. You must love highly distributed, mission-critical computing using modern technologies and languages.
Qualifications
- Experience managing production Kubernetes clusters.
- Must be fluent in at least one programming language such as Python, Go, or Ruby.
- 3+ years in a combination of DevOps, SRE, or Systems Operations roles.
- 3+ years of experience managing Linux-based servers. CoreOS is a big plus.
- Demonstrated understanding of containers and container orchestration.
- Troubleshooting skills that span systems, network (TCP/IP), and code.
- Must have experience building or managing large-scale systems and application architectures.
- Solid understanding of system performance and monitoring.
- Working knowledge of cloud computing including virtualization, hosted services, multi-tenant cloud infrastructures, distributed storage systems and content delivery networks.
- Experience working with source control management tools; GitHub is a huge plus.
- Excellent verbal and written communication skills.
Nice to Haves
- Production experience with federated Kubernetes clusters.
- Experience with service meshes such as Istio or Linkerd.
- Experience with multiple large cloud hosting providers: AWS, GCP, and Azure.
- Experience with load balancers such as Elastic Load Balancer, NGINX, Envoy, HAProxy, or Google Cloud Load Balancer.
- Experience with messaging technologies: Kafka, RabbitMQ, NATS.
- Experience with infrastructure configuration and automation processes and tools: Ansible, Fabric, Terraform, Puppet, Chef.
- Experience with monitoring solutions: Prometheus, ELK, Splunk, Sumo Logic, Nagios, or Fluentd.
- Experience with various data technologies including relational and nonrelational databases and message queues.
- Experience with distributed storage systems: Ceph, GlusterFS, EFS, EBS, or Rook.
Benefits
- Flexible vacation/time-off.
- Competitive salaries and performance-based raises.
- Health, vision and dental insurance.
- Professional development opportunities.
- An amazing team of like-minded individuals to create with.
Applications (including a resume, a cover letter, and any additional information that would be relevant to the position) can be sent to [email protected].