We are seeking a Staff Site Reliability Engineer to play a critical role in building and scaling the infrastructure behind ServiceTitan’s new AI platform. The ideal candidate combines technical depth with strategic thinking and brings expertise in Azure, Terraform, Kubernetes, and modern infrastructure-as-code (IaC) and container orchestration best practices.
Responsibilities
- Lead the design, implementation, and optimization of scalable, resilient infrastructure for cloud-native AI services on Azure.
- Establish true continuous delivery (CD) pipelines supporting blue-green deployments, automatic rollbacks, and progressive delivery patterns.
- Champion observability excellence: define best practices for metrics, tracing, and logging, and help product teams design meaningful SLIs, SLOs, and error budgets.
- Drive automation across the entire lifecycle: infrastructure provisioning, testing, deployment, and recovery.
- Partner with the engineering team to design reliable, fault-tolerant services and perform resilience and capacity reviews.
- Mentor engineers and foster a reliability culture across teams, enabling others to build self-healing, observable systems.
Benefits
- Flextime, recognition, and support for autonomous work
- Holistic health and wellness benefits
- Support for Titans at all stages of life