Job Description
Must Have Technical/Functional Skills
- 7+ years of experience in SRE, platform engineering, or cloud infrastructure engineering in large-scale enterprise environments (10,000+ employees or equivalent complexity).
- Deep, hands-on expertise with Microsoft Azure — minimum 4 years in a primary Azure cloud engineering role.
- Expert-level proficiency with AKS: cluster lifecycle management, RBAC, network policies, pod security standards, cluster autoscaler, and Workload Identity.
- Strong infrastructure-as-code skills: Terraform (required) and/or Bicep; experience managing Azure Landing Zones or Enterprise-Scale architecture.
- Proficiency in at least one systems programming/scripting language: Python (preferred), Go, or PowerShell.
- Experience designing and operating enterprise observability platforms using Azure Monitor, Log Analytics, and Application Insights at scale.
- Demonstrable track record of owning SLOs/SLIs and delivering measurable reliability improvements in production.
- Strong knowledge of enterprise networking in Azure: Hub-and-Spoke/Virtual WAN, ExpressRoute, Azure Firewall, NSGs, Private Endpoints, and Private DNS Zones.
Required/Preferred Certifications:
- AZ-104 | AZ-305 (Preferred) | AZ-400 (Preferred) | CKA | ITIL 4 Foundation
Roles & Responsibilities
Reliability & Availability Engineering
- Define, own, and enforce enterprise-wide SLOs, SLIs, and Error Budgets across all Tier-0 and Tier-1 Azure-hosted services; report SLA compliance to executive stakeholders monthly.
- Lead architectural reviews for new services and ensure reliability non-functional requirements (availability targets, RTO/RPO) are embedded from design through to production.
- Champion and implement chaos engineering practices using Azure Chaos Studio and custom fault injection frameworks to proactively surface reliability risks.
- Drive Disaster Recovery (DR) design and conduct quarterly DR drills across Azure paired regions.
Incident Management & On-Call
- Serve as Incident Commander for P1/P2 major incidents; own the end-to-end incident lifecycle from detection through resolution and Post-Incident Review (PIR).
- Participate in a structured On-Call rotation with follow-the-sun global coverage; maintain response SLAs of <5 minutes for Tier-0 services.
- Drive a blameless post-mortem culture and ensure all action items from PIRs are tracked and delivered within the agreed SLA.
Observability & Platform Engineering
- Design and operate the enterprise observability stack: Azure Monitor, Log Analytics Workspaces, Application Insights, and Azure Managed Grafana; ensure full MELT (Metrics, Events, Logs, Traces) coverage.
- Build and maintain alerting frameworks using Azure Monitor Alert Rules and Azure Action Groups integrated with PagerDuty and ServiceNow.
- Develop and operate platform automation, runbooks, and self-healing capabilities using Azure Automation, Logic Apps, and Python/PowerShell scripting.
CI/CD & Infrastructure Reliability
- Collaborate with DevOps and development teams to embed reliability gates into Azure DevOps pipelines: automated performance testing, synthetic monitoring, and progressive deployment (canary/blue-green) strategies.
- Manage reliability of AKS clusters across multiple Azure regions; own node pool scaling, upgrade strategy, and cluster hardening in alignment with CIS Benchmarks.
- Contribute to infrastructure-as-code reliability reviews using Terraform/Bicep to enforce standards across Azure Landing Zones.