Site Reliability Engineer (SRE)

A US enterprise SaaS company with 99.99% uptime SLAs serving Fortune 500 customers is hiring a Site Reliability Engineer to join its platform reliability team. You will embed SRE principles across engineering, owning SLOs, incident response, capacity planning and the toil reduction programme that lets the company scale without scaling headcount.

Role & Responsibilities:
• Define, measure and report on Service Level Objectives (SLOs) and error budgets across all production services
• Build and maintain observability infrastructure: Prometheus, Grafana, PagerDuty, Jaeger and centralised logging
• Lead incident response: on-call, escalation, war room coordination and post-mortem facilitation
• Identify and eliminate toil through automation, building self-healing systems and automating runbooks
• Perform capacity planning and load testing ahead of customer onboarding and product launches
• Partner with development teams on reliability reviews, deployment readiness and chaos engineering
• Manage Kubernetes infrastructure on AWS EKS: autoscaling, node group management and cost optimisation
• Contribute to the disaster recovery programme: runbooks, RTO/RPO testing and failover automation

Required Skills & Experience:
• 4+ years of SRE, platform engineering or senior DevOps experience
• Strong observability expertise: Prometheus, Grafana, distributed tracing and log aggregation at scale
• Production Kubernetes experience (EKS, AKS or GKE): cluster operations, autoscaling and security
• Python or Go for automation, tooling and reliability engineering scripts
• Deep understanding of SRE principles: SLOs, error budgets, toil, blameless post-mortems
• Experience with AWS infrastructure: EC2, RDS, S3, Lambda, CloudWatch
• Incident command experience in a high-stakes production environment
• AWS Certified DevOps Engineer or CKA certification preferred

What We Offer:
• Fully remote role on IST hours; overlap with US EST required
• Salary $110,000–$140,000 based on experience
• Work on infrastructure that enterprise customers depend on for critical business operations
• Strong SRE culture: you will be a reliability engineer, not a ticket-taker

This role suits an engineer who has moved past basic DevOps and wants to practise real SRE, where reliability is an engineering discipline with maths, targets and accountability behind it.

Remote · IST Hours | $110,000–$140,000

SRE
Kubernetes
Observability
Python
AWS