Site Reliability Engineer

Verisk
Hyderabad, IN

We’re a small engineering team building and operating production services that must stay available across multiple regions, even when things go wrong. We’re looking for a pragmatic Site Reliability Engineer who can design, build, and operate resilient systems without unnecessary complexity.

This role is hands-on and collaborative: you’ll work closely with application engineers to make reliability a shared responsibility, not a gate.

Multi-Region Reliability & Availability (Primary Focus)

  • Design and operate multi-region architectures (active/active or active/passive)
  • Implement and improve automated failover and traffic routing
  • Identify and eliminate single points of failure
  • Ensure regional isolation and graceful degradation when dependencies fail

High Availability & Disaster Recovery

  • Define realistic availability goals and failure scenarios
  • Design and test backup and restore processes
  • Own disaster recovery plans and validate them through regular testing
  • Help the team understand RTO/RPO trade-offs

Observability & Incident Response

  • Build and maintain clear, actionable observability (metrics, logs, traces)
  • Create alerts that detect real problems without noise
  • Participate in on-call and help improve incident response
  • Lead or contribute to blameless postmortems and follow-up fixes

Automation & Operations

  • Reduce manual operational work through automation
  • Improve deployment safety (rollbacks, health checks, canaries where appropriate)
  • Manage infrastructure using infrastructure as code
  • Design systems that recover automatically whenever possible

Performance & Capacity

  • Monitor performance and saturation across regions
  • Help with capacity planning and load testing
  • Balance reliability, performance, and cost