Cornelis Networks is seeking an AI Performance Engineer to optimize training and multi-node inference across next-gen networking silicon and systems. The ideal candidate will have experience with AI/ML at cluster scale, a deep understanding of message passing, collectives, and bottleneck hunting, and hands-on experience with distributed training and inference architectures.
Requirements
- B.S. in CS/EE/CE/Math or related
- 5–7+ years running AI/ML at cluster scale
- Proven ability to set up, run, and analyze AI benchmarks
- Deep intuition for message passing, collectives, scaling efficiency, and bottleneck hunting for both training and low-latency serving
- Hands-on with distributed training beyond single-GPU (DP/TP/PP, ZeRO, FSDP, sharded optimizers) and distributed inference architectures (replicated vs sharded, tensor/KV parallel, MoE)
- Practical experience across AI stacks and communication libraries: PyTorch, DeepSpeed, Megatron-LM, PyTorch Lightning; RCCL/NCCL, MPI/Horovod; Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, KServe
- Comfortable with compilers (GCC, LLVM, Intel oneAPI) and MPI stacks; Python and shell power user
- Familiarity with network architectures (Omni-Path/OPA, InfiniBand, Ethernet/RDMA/RoCE) and Linux systems at the performance-tuning level, including NIC offloads, CQ moderation, pacing, and ECN/RED
- Excellent written and verbal communication—able to turn measurements into persuasive, SLO-driven narratives for inference performance
Benefits
- Medical, dental, and vision coverage
- Disability and life insurance
- Dependent care flexible spending account
- Accidental injury insurance
- Pet insurance
- Generous paid holidays
- 401(k) with company match
- Open Time Off (OTO) for regular full-time exempt employees
- Sick time
- Bonding leave
- Pregnancy disability leave