We are seeking a highly skilled Site Reliability Engineer (SRE) with strong experience in Kubernetes troubleshooting and incident response, and deep knowledge of monitoring and alerting systems.
Requirements
- 2+ years in an SRE, DevOps, or Infrastructure Engineer role.
- Bachelor's degree in computer science, IT, or related technical field.
- Hands-on experience with AWS and GCP.
- Deep hands-on experience with Kubernetes (EKS, AKS, GKE).
- Strong understanding of Linux internals, container orchestration, and microservice architecture.
- Hands-on experience with monitoring and logging tools: Prometheus, Grafana, InfluxDB, and the ELK stack (Elasticsearch, Logstash, Kibana).
- Proficient with incident response and alerting tools (e.g., PagerDuty).
- Basic knowledge of:
  - Kafka: topic monitoring, consumer health
  - ElastiCache / Redis: caching patterns and troubleshooting
  - InfluxDB: time-series metrics storage
- Experience writing and maintaining automation scripts in Bash, Python, or Go.