Thoughtworks Singapore is seeking a Lead AI Infrastructure Engineer to design, maintain, and scale high-performance AI infrastructure. This role involves partnering with ML engineers and clients to deliver optimized solutions for AI workloads, focusing on throughput, latency, availability, and compliance. The ideal candidate will have deep expertise in GPU-based inference, DevOps practices, and platform engineering.
Requirements
- Expertise in GPU-based infrastructure for AI workloads
- Strong knowledge of orchestration frameworks (Kubernetes, Ray, Slurm)
- Experience with inference-serving frameworks (vLLM, NVIDIA Triton, DeepSpeed)
- Proficiency in infrastructure automation (Terraform, Helm, CI/CD pipelines)
- Experience building resilient, high-throughput, low-latency systems for AI inference
- Solid background in observability and monitoring (Prometheus, Grafana, OpenTelemetry)
- Familiarity with security, compliance, and governance concerns in AI infrastructure
- Understanding of DevOps, cloud-native architectures, and Infrastructure as Code
- Exposure to multi-cloud and hybrid deployments (AWS, GCP, Azure, sovereign/private cloud)
- Experience with benchmarking and cost/performance tuning for AI systems
- Background in MLOps, or experience collaborating with ML teams on large-scale AI production systems