The Baldwin Group

Observability (SRE) Engineer

Posted 16 Days Ago

Remote

Hiring Remotely in US

Mid level

The Baldwin Group is an award-winning entrepreneur-led and inspired insurance brokerage firm delivering expertly crafted Commercial Insurance and Risk Management, Private Insurance and Risk Management, Employee Benefits and Benefit Administration, Asset and Income Protection, and Risk Mitigation strategies to clients wherever their passions and businesses take them throughout the U.S. and abroad. The Baldwin Group has award-winning industry expertise, colleagues, competencies, insurers, and most importantly, a highly differentiated culture that our clients consider an invaluable expansion of their business. The Baldwin Group (NASDAQ: BWIN) takes a holistic and tailored approach to insurance and risk management.

We’re looking for a highly motivated, practical and responsible Observability/Site Reliability Engineer who is excited to play a critical role in our rapidly growing Platform team. The Observability Engineer role will make significant contributions to our Observability, APM, Monitoring and Logging strategy, be integral to our day-to-day operations, and be an advocate for designing and implementing Site Reliability Engineering principles within the company.

The successful candidate will have experience with CI/CD, Observability, APM, Monitoring, Logging, Infrastructure-as-Code, and On-Call Support. An understanding of Cloud (AWS/Azure), SRE practices, version control, configuration management, and automation is also required.

Principal Responsibilities:

Develop and maintain comprehensive observability solutions for infrastructure, applications, and services, and implement APM tools and frameworks to monitor application performance, user experience, and system health.
Implement and maintain tools and systems that provide insights into the health and performance of applications and infrastructure, including metrics, logs, and traces to monitor system behavior.
Proactively analyze performance metrics and logs to identify bottlenecks, failures, and areas for improvement, ensuring systems are consistently reliable, highly available, and optimally performing by addressing potential issues before they impact users.
Strategically assess system capacity requirements and plan for future growth to ensure seamless scalability, working closely with development and operations teams to implement robust and effective scaling strategies.
Create automated solutions for monitoring, deployment, scaling, and recovery operations, and develop custom tools and scripts to enhance observability and monitoring capabilities.
Collaborate closely with software engineers, QA teams, and operations staff to seamlessly integrate observability and reliability best practices into the development lifecycle, providing expert guidance and support for instrumenting code and services with comprehensive monitoring and logging solutions.
Develop and maintain incident response plans, including alerting, escalation, and communication protocols, and lead efforts to resolve production incidents, minimizing downtime and ensuring thorough root cause analysis and post-mortem reviews.

Education, Experience, Skills and Abilities Requirements:

3+ years of experience in an Observability or Site Reliability Engineer role.
Experience with cloud infrastructure platforms such as AWS or Azure.
Proven experience administering observability and monitoring tools (Datadog or similar).
Experience with containerized and serverless compute technology (Docker, ECS, Kubernetes, Lambda, etc.).
Experience with DevOps and CI/CD processes and tools (GitHub, Terraform, Ansible, etc.).
Experience integrating DevOps, SRE, and testing tools to generate DORA metrics and reports and to create dashboards.
Understanding of SRE principles, including SLOs, SLIs, KPIs, metrics, logging, tracing, etc.
Proficient in writing scripts (Bash, PowerShell) and programming in one or more languages (Python, JavaScript, Go, Java, or similar).
Experience in capacity planning and scaling resource requirements based on traffic patterns and performance metrics.
Experience in preparing, executing, and improving incident response plans.
Strong understanding of on-call rotation practices and incident escalation processes.
Knowledge of security best practices and compliance standards relevant to observability and monitoring (e.g., GDPR, HIPAA).
Datadog or other relevant certifications preferred.
Highly self-motivated, highly available, and driven to exceed colleague expectations.
Ability to think critically and logically under pressure.
Strong technical experience with a proven history of troubleshooting complex, cross-segment, cross-office, and cross-team problems.
Demonstrates the organization’s core values, exuding behavior that is aligned with the firm’s culture.

The Baldwin Group will not accept unsolicited resumes from any source other than directly from a candidate who applies on our career site. Any unsolicited resumes sent to The Baldwin Group, including unsolicited resumes sent via any source from an Agency, will not be considered and are not subject to any fees for any placement resulting from the receipt of an unsolicited resume.

Top Skills

AWS, Azure, GitHub, Datadog, Terraform, Ansible, Docker, ECS, Kubernetes, Python, JavaScript, Go, Java
