SRE Application Support

Yeshnex IT Solutions
San Francisco, CA

*Not looking for DevOps or Cloud engineers*


Job Title: SRE Application Support (Retail industry experience only)

Location: San Francisco, CA (Onsite)

Duration: 12+ months (Contract)


Job Description:

  • Provide production support for retail applications and microservices built on Spring Boot.
  • Ensure high availability, reliability, and performance of business-critical retail systems and services.
  • Apply Site Reliability Engineering (SRE) principles to improve system stability, scalability, and operational efficiency.
  • Define, implement, and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
  • Perform real-time monitoring, troubleshooting, and incident resolution for microservices and retail applications.
  • Use Splunk for log analysis, alerting, and operational intelligence to diagnose and resolve production issues.
  • Use Dynatrace for end-to-end application performance monitoring, distributed tracing, and root cause analysis.
  • Investigate performance bottlenecks, latency issues, and system anomalies across the microservices architecture.
  • Build and maintain dashboards, alerts, and monitoring strategies for proactive issue detection.
  • Participate in incident management processes, including on-call rotations, major incident response, and post-incident reviews.
  • Conduct root cause analysis (RCA) and implement preventive measures to reduce recurrence of incidents.
  • Work closely with development, DevOps, and infrastructure teams to improve system reliability and observability.
  • Provide technical troubleshooting support to retail store associates and operations teams through calls or remote sessions.
  • Ensure effective communication and coordination during incidents involving multiple teams and stakeholders.
  • Drive automation and operational improvements to reduce manual intervention and improve system resilience.
  • Support CI/CD pipelines and deployment monitoring for microservices applications.
  • Analyze system logs, metrics, traces, and events to identify trends and proactively prevent outages.
  • Document runbooks, troubleshooting guides, and operational procedures for retail application support.
  • Demonstrate proactive learning, continuous improvement, and knowledge sharing within the SRE team.
  • Mentor team members and contribute to best practices for monitoring, observability, and reliability engineering.
  • Collaborate with engineering teams to improve system design for reliability, fault tolerance, and scalability.
  • Ensure compliance with operational standards, security guidelines, and change management processes.
