We're seeking a Site Reliability Engineer (SRE) to join one of our Scrum teams and help ensure the reliability, scalability, and performance of the Florence platform. As an SRE, you'll work closely with product engineers while actively leveraging AI to improve observability, incident response, automation, and overall platform reliability.
Responsibilities
- Be an embedded member of a Scrum team, participating in planning, refinement, reviews, and retrospectives
- Use AI-powered tools to enhance system reliability, operational efficiency, and developer productivity
- Design, build, and operate reliable, scalable cloud infrastructure supporting platform and product services
- Apply AI-assisted analysis to monitoring, alerting, and observability data to detect, predict, and prevent incidents
- Define and maintain SLOs, SLIs, and error budgets to guide reliability decisions
- Collaborate with software engineers to embed reliability and AI-driven automation into the software development lifecycle
- Lead and participate in incident response, root cause analysis, and postmortems, leveraging AI insights where appropriate
- Automate operational tasks and reduce toil through AI-enabled and traditional automation approaches
- Contribute to disaster recovery planning, testing, and operational readiness
- Produce and maintain documentation such as runbooks, operational guides, and system diagrams
- Contribute code as a secondary responsibility, focused on reliability tooling, automation, and integrations built with AI-assisted development practices
Benefits
- Competitive compensation package
- Medical and dental insurance
- Office space in the heart of the city