Pathway is a leading AI company that is shaking the foundations of artificial intelligence. We are looking for a Senior ML Infrastructure / DevOps Engineer to own the infrastructure that powers our ML training and inference workloads across multiple cloud providers.
Responsibilities
- Design, operate, and scale GPU and CPU clusters for ML training and inference
- Automate infrastructure provisioning and configuration using infrastructure-as-code
- Build and maintain robust ML pipelines with strong guarantees around reproducibility, traceability, and rollback
- Implement and evolve ML-centric CI/CD pipelines
- Own monitoring, logging, and alerting across training and serving
- Work with terabyte-scale datasets and associated storage, networking, and performance challenges
- Partner closely with ML engineers and researchers to productionize their work
- Participate in the on-call rotation for critical ML infrastructure, and lead incident response and post-mortems
Benefits
- Competitive salary
- Permanent employment contract
- Inclusive workplace culture
- Opportunity to work with cutting-edge AI technology