Site Reliability Engineer

Envision Technology Solutions
Atlanta, GA

Principal SRE Engineer (Individual Contributor) – AWS and AI exposure

Atlanta, GA (onsite)

Long-term contract


Role Summary

We are looking for a Principal SRE Engineer who combines strong operational expertise with hands‑on development skills. This role requires deep experience managing large‑scale distributed systems, improving reliability, eliminating toil through automation, and guiding domain teams in maturing their SRE practices. The ideal candidate is equally comfortable debugging production issues, writing automation or REST APIs, interpreting code repositories, and implementing resiliency patterns such as circuit breakers. This is a pure Individual Contributor role with high technical depth and ownership.


Key Responsibilities

  • Perform SRE operations for distributed systems, ensuring high availability, reliability, and operational excellence.
  • Apply AI within SRE workflows.
  • Partner with application/domain teams to strengthen their SRE maturity and operational readiness.
  • Write automation, scripts, and REST APIs to integrate with external systems and eliminate repetitive tasks.
  • Onboard services to Dynatrace/observability platforms; define dashboards, alerts, SLIs, SLOs.
  • Architect and implement resiliency patterns including failover strategies, circuit breakers, graceful degradation.
  • Drive cost optimization (FinOps) initiatives across cloud workloads.
  • Support AWS (or other cloud platforms) operations and engineering needs.
  • Work with ROSA/container platforms for deployment, scaling, and reliability.
  • Recommend improvements in technology, architecture, and domain-specific reliability areas.
  • Manage and support large‑scale systems.
  • Reduce toil by identifying repetitive tasks and automating them.
  • Contribute code, read/interpret service repositories, and assist teams with engineering tasks as needed.


Required Skills & Experience

  • Strong background in SRE operations for distributed systems.
  • Proficiency in development/coding (Python, Go, shell scripting, or similar).
  • Ability to read/interpret codebases and build REST APIs.
  • Experience with Dynatrace/observability onboarding and ecosystem.
  • Deep knowledge of resiliency engineering and failover strategies.
  • Strong understanding of FinOps principles and cloud cost optimization.
  • Hands-on experience with AWS or any other cloud provider.
  • Experience with ROSA or Kubernetes-based container platforms.
  • Proven automation skills to eliminate operational toil.
  • Experience managing large-scale systems in production.
  • Capability to suggest architecture and domain improvements.
  • Strong analytical, troubleshooting, and collaboration skills.
