xAI is a small, highly motivated team focused on engineering excellence, working to create AI systems that can accurately understand the universe and aid humanity. We're looking for a Site Reliability Engineer - Automation Specialist to focus on automating firmware upgrades, scripting solutions for hardware, and proactively identifying issues to implement automated fixes.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
- 5+ years of experience in site reliability engineering or automation roles, preferably in datacenter or cloud environments
- Proficiency in Python, Bash, Linux, and Kubernetes for scripting, automation, and orchestration
- Hands-on experience with firmware packages, including writing scripts for upgrades and automating deployment processes
- Familiarity with hardware from vendors like NVIDIA, Dell, Supermicro, and HP, including integration and troubleshooting in production settings
- Strong problem-solving skills with a proven ability to identify issues and automate fixes to improve system efficiency