As a Data Center System Software Engineer at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our state-of-the-art data center infrastructure.
Responsibilities
- Maintain and improve the reliability and uptime of xAI’s on-premises and cloud-based data center environments
- Design, implement, and manage monitoring, logging, and alerting systems
- Develop and maintain infrastructure-as-code and continuous deployment pipelines
- Participate in on-call rotations, respond to incidents, perform root cause analysis, and drive post-mortem processes
- Analyze system performance, forecast capacity needs, and optimize resource utilization for massive AI/ML workloads
- Collaborate with hardware, networking, and software engineering teams to design and implement resilient, scalable solutions
- Create and maintain documentation and standard operating procedures
- Improve the efficiency of AI training pipelines by identifying and mitigating compute, storage, and networking bottlenecks at very large scale
Benefits
- Paid Time Off
- Health Insurance
- 401(k) Matching
- Retirement Plan
- Tuition Reimbursement