As a Senior Site Reliability Engineer (SRE), you will be instrumental in architecting, deploying, and managing complex cloud-native infrastructure in large-scale SaaS environments. Your role includes designing and maintaining Kubernetes clusters (AWS EKS, Azure AKS) using Infrastructure as Code tools such as Terraform, ensuring seamless scalability and reliability.
You will be responsible for implementing observability solutions across monitoring, logging, tracing, and metrics collection using Prometheus, Grafana, Datadog, and ELK ecosystem to maintain robust system health and performance. Managing data ingestion pipelines with high-volume technologies like Clickhouse and Kafka will be part of your core tasks.
Collaboration with cross-functional teams to drive GitOps CI/CD processes, writing automation scripts in Python, Go (Golang), and Bash to streamline deployments and operations, and enforcing security best practices involving encryption, key management, and policy enforcement are key duties. In addition, you will design and maintain disaster recovery strategies for critical components such as MySQL, Kafka, and Zookeeper, ensuring business continuity and data integrity.
This role requires active participation in a large, dynamic team environment where your expertise will elevate operational standards and foster innovation around high availability SaaS platforms.