As a Systems Reliability Engineer (SRE), you’ll own the reliability, scalability, and security posture of the platforms that power our agentic workflows. You’ll build the guardrails and operational foundations that let product and AI teams ship quickly without sacrificing uptime, observability, or customer trust.
Requirements
- 4+ years in SRE/DevOps/Infrastructure roles supporting production systems with meaningful uptime requirements
- AWS Expertise: Strong hands-on experience operating workloads in AWS (IAM, VPC/networking, compute, storage, monitoring, and security controls)
- Solid understanding of distributed systems failure modes (timeouts, retries, cascading failures), and how to design for resilience
- Strong incident leadership instincts; comfortable being the calm, methodical driver during outages
- Automation Mindset: You automate first, favoring repeatable environments, scripted operations, and minimal manual toil
- Clear Communicator: You write crisp runbooks, postmortems, and technical proposals, and you can align engineering, product, and ops on priorities
- Proven ability to improve security posture and reliability without blocking delivery
Benefits
- Equity & Ownership: Competitive equity so you grow alongside the company
- Impact & Visibility: Direct access to co-founders; your work directly improves customer trust and operational outcomes
- Collaborative Culture: Tight-knit team of seasoned operators and AI experts
- Flexible Work: Hybrid with core Bay Area presence and remote flexibility