xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence.
Requirements
- Bachelor's degree in Systems Engineering, Electrical Engineering, Computer Science, or a related field (or equivalent experience)
- 5+ years of experience in hardware reliability engineering, preferably in high-performance computing or datacenter environments
- Proven expertise in firmware analysis, hardware specifications review, and release validation
- Strong experience with RMA processes, including filing claims, vendor negotiations, and pushing for resolutions outside standard protocols
- Demonstrated ability to diagnose and prove complex hardware failures, including grey or intermittent issues, using tools, logic analyzers, or diagnostic software
- Familiarity with datacenter hardware components (e.g., servers, GPUs, networking equipment) and emerging technologies
- Proficiency in scripting languages (e.g., Python, Bash) for automation and analysis
- Excellent problem-solving skills with a data-driven approach to reliability engineering
- Ability to work collaboratively with cross-functional teams, including operations technicians