Real-time systems (preferred): WebSocket lifecycle, low-latency audio/streaming constraints
Responsibilities
Deploy and run services reliably on EKS using Helm, environment-driven configurations, health checks, and rollout/rollback strategies
Debug deployment, connectivity, and latency issues (network timeouts, firewall rules, CPU/memory sizing, and other infrastructure components) using CloudWatch, logs, and metrics
Build dashboards for application observability
Tune runtime performance
Automate operations using infrastructure as code
Create automated alerts for performance thresholds
Secure runtime configurations, including secrets integration with deployment pipelines, auditability, and encrypted storage
Core Competencies
Systems thinking: Connect application behavior with infrastructure behavior
Incident response: Quickly triage and isolate issues, then identify resolution paths
Reliability: Improve system reliability proactively, before issues become incidents
Cross-team communication: Coordinate environment changes to minimize impact on teams and uptime
Change safety: Execute gradual changes with proper testing, validation, and rollback readiness
Nice-to-Have
Understanding of Infrastructure as Code (IaC) concepts