Cornelis Networks is seeking an AI Performance Engineer to optimize training and multi-node inference across next-gen networking silicon and systems. The ideal candidate will have experience with AI/ML at cluster scale, a deep understanding of message passing, collectives, and bottleneck hunting, and hands-on experience with distributed training and inference architectures.
Requirements
- B.S. in CS/EE/CE/Math or related
- 5–7+ years running AI/ML at cluster scale
- Proven ability to set up, run, and analyze AI benchmarks
- Deep intuition for message passing, collectives, scaling efficiency, and bottleneck hunting for both training and low-latency serving
- Hands-on with distributed training beyond single-GPU (DP/TP/PP, ZeRO, FSDP, sharded optimizers) and distributed inference architectures (replicated vs sharded, tensor/KV parallel, MoE)
- Practical experience across AI stacks and communication libraries: PyTorch, DeepSpeed, Megatron-LM, PyTorch Lightning; RCCL/NCCL, MPI/Horovod; Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, KServe
- Comfortable with compilers (GCC, LLVM, Intel oneAPI) and MPI stacks; Python and shell power user
- Familiarity with network architectures (Omni-Path/OPA, InfiniBand, Ethernet/RDMA/RoCE) and Linux systems at the performance-tuning level, including NIC offloads, CQ moderation, pacing, and ECN/RED
- Excellent written and verbal communication—able to turn measurements into persuasive, SLO-driven narratives for inference performance
Benefits
- Medical, dental, and vision coverage
- Disability and life insurance
- Dependent care flexible spending account
- Accidental injury insurance
- Pet insurance
- Generous paid holidays
- 401(k) with company match
- Open Time Off (OTO) for regular full-time exempt employees
- Sick time
- Bonding leave
- Pregnancy disability leave