The Baldwin Group

Observability (SRE) Engineer

Posted 23 Days Ago

Remote

Hiring Remotely in US

Mid level

Looking for an Observability (SRE) Engineer to contribute to the observability, APM, monitoring, and logging strategy, support day-to-day operations, and advocate for Site Reliability Engineering principles within the Platform team.

The summary above was generated by AI

The Baldwin Group is an award-winning entrepreneur-led and inspired insurance brokerage firm delivering expertly crafted Commercial Insurance and Risk Management, Private Insurance and Risk Management, Employee Benefits and Benefit Administration, Asset and Income Protection, and Risk Mitigation strategies to clients wherever their passions and businesses take them throughout the U.S. and abroad. The Baldwin Group has award-winning industry expertise, colleagues, competencies, insurers, and most importantly, a highly differentiated culture that our clients consider an invaluable expansion of their business. The Baldwin Group (NASDAQ: BWIN) takes a holistic and tailored approach to insurance and risk management.

We’re looking for a highly motivated, practical and responsible Observability/Site Reliability Engineer who is excited to play a critical role in our rapidly growing Platform team. The Observability Engineer role will make significant contributions to our Observability, APM, Monitoring and Logging strategy, be integral to our day-to-day operations, and be an advocate for designing and implementing Site Reliability Engineering principles within the company.

The successful candidate will have experience with CI/CD, Observability, APM, Monitoring, Logging, Infrastructure-as-Code, and On-Call Support. An understanding of Cloud (AWS/Azure), SRE practices, version control, configuration management, and automation is also required.

Principal Responsibilities:

Develop and maintain comprehensive observability solutions for infrastructure, applications, and services, and implement APM tools and frameworks to monitor application performance, user experience, and system health.
Implement and maintain tools and systems that provide insight into the health and performance of applications and infrastructure, including metrics, logs, and traces to monitor system behavior.
Proactively analyze performance metrics and logs to identify bottlenecks, failures, and areas for improvement, ensuring systems are consistently reliable, highly available, and optimally performing by addressing potential issues before they impact users.
Strategically assess system capacity requirements and plan for future growth to ensure seamless scalability, working closely with development and operations teams to implement robust and effective scaling strategies.
Create automated solutions for monitoring, deployment, scaling, and recovery operations, and develop custom tools and scripts to enhance observability and monitoring capabilities.
Collaborate closely with software engineers, QA teams, and operations staff to integrate observability and reliability best practices into the development lifecycle, providing expert guidance and support for instrumenting code and services with comprehensive monitoring and logging solutions.
Develop and maintain incident response plans, including alerting, escalation, and communication protocols, and lead efforts to resolve production incidents, minimizing downtime and ensuring thorough root cause analysis and post-mortem reviews.

Education, Experience, Skills and Abilities Requirements:

3+ years of experience in an Observability or Site Reliability Engineer role.
Experience with cloud infrastructure platforms such as AWS or Azure.
Proven experience administering observability and monitoring tools (Datadog or similar).
Experience with containerized and serverless compute technology (Docker, ECS, Kubernetes, Lambda, etc.).
Experience with DevOps and CI/CD processes and tools (GitHub, Terraform, Ansible, etc.).
Experience integrating DevOps, SRE, and testing tools to generate DORA metrics and reports and to create dashboards.
Understanding of SRE principles, including SLOs, SLIs, KPIs, metrics, logging, and tracing.
Proficient in writing scripts (Bash, PowerShell) and programming in one or more languages (Python, JavaScript, Go, Java, or similar).
Experience in capacity planning and scaling resource requirements based on traffic patterns and performance metrics.
Experience in preparing, executing, and improving incident response plans.
Strong understanding of on-call rotation practices and incident escalation processes.
Knowledge of security best practices and compliance standards relevant to observability and monitoring (e.g., GDPR, HIPAA).
Datadog or other relevant certifications preferred.
Highly self-motivated, highly available, and driven to exceed colleague expectations.
Ability to think critically and logically under pressure.
Strong technical experience with a proven history of troubleshooting complex, cross-segment, cross-office, and cross-team problems.
Demonstrates the organization’s core values, exuding behavior that is aligned with the firm’s culture.


The Baldwin Group will not accept unsolicited resumes from any source other than directly from a candidate who applies on our career site. Any unsolicited resumes sent to The Baldwin Group, including unsolicited resumes sent via any source from an Agency, will not be considered and are not subject to any fees for any placement resulting from the receipt of an unsolicited resume.

Top Skills

AWS, Azure, GitHub, Datadog, Terraform, Ansible, Docker, ECS, Kubernetes, Python, JavaScript, Go, Java
