As a Site Reliability Engineer at JPMorgan Chase within the Enterprise Technology, Liquidity Risk team, you are the owner and champion of non-functional requirements for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.
Job responsibilities
• Lead SRE adoption across teams, balancing feature delivery with efficiency and system stability
• Partner with peers and senior stakeholders to align on reliability goals and make trade-offs that improve outcomes
• Set and track reliability and stability metrics, and use data to drive measurable improvements
• Build a continuous-improvement culture by collecting real-time feedback and turning it into customer-impacting changes
• Coordinate with other teams to share solutions and prevent duplicated work
• Run blameless, data-driven post-mortems and regular debriefs to turn incidents (and wins) into learning
• Coach and develop entry- to mid-level engineers through hands-on guidance and feedback
Required qualifications, capabilities, and skills
• Formal training or certification in software engineering concepts and 5+ years of applied experience
• Advanced SRE knowledge and a proven track record of implementing SRE practices across application and platform teams (including avoiding common pitfalls)
• Experience leading technologists to resolve complex, firmwide technology issues
• Ability to influence team culture by championing innovation and change
• Experience hiring, developing, and recognizing talent
• Proficiency in at least one programming language, with preference for JavaScript, Go, or Python
• Hands-on experience with CI/CD and infrastructure-as-code tools (e.g., Jenkins, GitLab, Terraform)
• Experience with containers and orchestration (e.g., Docker, Kubernetes, ECS)
• Troubleshooting experience with common networking technologies and issues
• Strong fundamentals across modern architectures and observability, including GraphQL (schema design, federation/supergraph), event-driven systems (Kafka concepts like partitions/consumer groups, DLQs, replay), microservices patterns (API gateways/routers, CQRS/event sourcing), and end-to-end telemetry using OpenTelemetry (metrics/logs/traces)
Preferred qualifications, capabilities, and skills
• Strong hands-on ability to code and troubleshoot, with solid data fluency