Together AI is a research-driven AI cloud infrastructure provider offering a purpose-built GPU platform for training and running advanced AI models. Serving leading SaaS companies and pioneering startups, Together AI champions open source AI and decentralized computing, advocating for transparency to drive innovation and societal benefits.
Together AI is seeking a Large-scale Training Resilience Engineer to ensure the reliability, fault tolerance, and scalability of its large-scale training infrastructure. The role focuses on designing and implementing resilience strategies, optimizing error detection, and maintaining observability systems. This is a hands-on position for individuals passionate about solving complex distributed systems problems.