The Reliability Engineer will be a critical contributor within the Site Reliability Engineering (SRE) and Incident Management team, focusing on ensuring the availability, reliability, and performance of critical systems and services. This role is responsible for managing and facilitating major incident response efforts, ensuring that service disruptions are quickly identified, triaged, and resolved.

Requirements

Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent professional experience)
3+ years of professional experience in Site Reliability Engineering (SRE), System Engineering, DevOps, or IT Operations roles
Extensive experience as a major incident manager, incident commander, or similar role, with a proven ability to facilitate, communicate, and drive resolution of technical incidents
Strong understanding of ITIL principles and their application in incident management
Experience with observability tools such as DataDog, Grafana, LogicMonitor, Splunk, New Relic, or similar technologies
Experience with synthetic monitoring, infrastructure monitoring, and metrics and tracing tools
Experience with hybrid infrastructure environments and an understanding of monitoring signals from static on-premises infrastructure, cloud-based ephemeral infrastructure, and SaaS applications
Strong understanding of telemetry, logging, tracing, and their roles in system monitoring and observability pipelines
Experience with Python, Go, Bash, or a similar language to develop and maintain monitoring and automation scripts
Proven ability to remain calm and effective during high-pressure situations, facilitating resolution in a methodical, professional manner

Benefits

Competitive salary
Benefits package
Opportunities for career growth and development
Collaborative and dynamic work environment
Diverse and inclusive company culture

Site Reliability Engineer - Incident Management

Job Details

About F5