The Reliability Engineer will be a key contributor within the Site Reliability Engineering (SRE) and Incident Management team, focused on the availability, reliability, and performance of critical systems and services. This role manages and facilitates major incident response so that service disruptions are quickly identified, triaged, and resolved.
Requirements
- Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent professional experience)
- 3+ years of professional experience in Site Reliability Engineering (SRE), System Engineering, DevOps, or IT Operations roles
- Extensive experience as a major incident manager, incident commander, or in a similar role, with a proven ability to facilitate, communicate, and drive resolution of technical incidents
- Strong understanding of ITIL principles and their application in incident management
- Experience with observability tools such as Datadog, Grafana, LogicMonitor, Splunk, New Relic, or similar technologies
- Experience with synthetic monitoring, infrastructure monitoring, and metrics and tracing tools
- Experience with hybrid infrastructure environments and an understanding of monitoring signals from static on-premises infrastructure, cloud-based ephemeral infrastructure, and SaaS applications
- Strong understanding of telemetry, logging, tracing, and their roles in system monitoring and observability pipelines
- Experience using Python, Go, Bash, or a similar language to develop and maintain monitoring and automation scripts
- Proven ability to remain calm and effective during high-pressure situations, facilitating resolution in a methodical, professional manner
Benefits
- Competitive salary
- Benefits package
- Opportunities for career growth and development
- Collaborative and dynamic work environment
- Diverse and inclusive company culture