Senior Site Reliability Engineer

JPMC Candidate Experience page
Ciudad Autonoma Buenos Aires, AR

We are seeking a dedicated Site Reliability Engineer (SRE) III to join our high-performing Production Management team. This role is ideal for individuals who are naturally curious, motivated to learn, and energized by the opportunity to make meaningful improvements. You will play a key role in enhancing the reliability and efficiency of the platforms that support our Sales and Research professionals.

While some responsibilities involve routine activities, you will have the opportunity to identify these recurring tasks and turn them into streamlined processes that improve the experience for both clients and colleagues. The satisfaction of this role comes from making a tangible difference: eliminating obstacles and enabling smoother operations.

Key Responsibilities
  • Production Support: Provide day-to-day support for application processes and workflows, ensuring stability, availability, and timely resolution of issues. Serve as the primary contact for all production support matters.
  • Process Improvement: Identify and prioritize repetitive or high-impact tasks, and implement solutions to automate or eliminate them, driving greater efficiency and consistency.
  • Monitoring and Insights: Design and implement comprehensive monitoring and alerting systems, ensuring clear visibility into platform health and performance.
  • Incident Response: Develop and maintain automated solutions for incident detection and resolution, reducing downtime and improving response times.
  • Team Development: Mentor and support colleagues in adopting best practices, fostering a culture of continuous improvement and shared responsibility for reliability.
  • Operational Readiness: Collaborate with teams to enhance supportability through robust processes and effective configuration management.
  • Tool Integration: Ensure applications are integrated with standard monitoring and alerting tools, providing reliable coverage and actionable insights.
  • Incident Management: Lead effective incident management practices, including rapid detection, clear communication, thorough analysis, and implementation of preventative measures.
  • Resilience and Recovery: Contribute to the ongoing resilience of our platforms through rigorous analysis, testing, and validation of recovery procedures.

Qualifications and Skills
  • Minimum 3 years of experience supporting and maintaining technology services in production environments.
  • Familiarity with large-scale applications and infrastructure, both on-premises and in the cloud.
  • Strong analytical and problem-solving abilities, with experience resolving complex technical issues.
  • Proficiency in at least one programming language (such as Java or Python) for automation and process improvement.
  • Experience with monitoring and alerting tools, and a solid understanding of metrics and system health indicators.
  • Knowledge of incident, problem, change, and service request processes.
  • Understanding of reliability engineering principles, including automation and incident management.
  • Experience with networking and modern delivery tools (such as Jenkins, GitLab, or Terraform) is an advantage.
 