We’re a small engineering team building and operating production services that must stay available across multiple regions, even when a region or a dependency fails. We’re looking for a pragmatic Site Reliability Engineer who can design, build, and operate resilient systems without unnecessary complexity.
This role is hands-on and collaborative: you’ll work closely with application engineers to make reliability a shared responsibility, not a gate.
Multi-Region Reliability & Availability (Primary Focus)
- Design and operate multi-region architectures (active/active or active/passive)
- Implement and improve automated failover and traffic routing
- Identify and eliminate single points of failure
- Ensure regional isolation and graceful degradation when dependencies fail
High Availability & Disaster Recovery
- Define realistic availability goals and failure scenarios
- Design and test backup and restore processes
- Own disaster recovery plans and validate them through regular testing
- Help the team understand the trade-offs between recovery time objective (RTO) and recovery point objective (RPO)
Observability & Incident Response
- Build and maintain clear, actionable observability (metrics, logs, traces)
- Create alerts that catch real problems without generating noise
- Participate in on-call and help improve incident response
- Lead or contribute to blameless postmortems and follow-up fixes
Automation & Operations
- Reduce manual operational work through automation
- Improve deployment safety (rollbacks, health checks, canaries where appropriate)
- Manage infrastructure using infrastructure as code
- Design systems that recover automatically whenever possible
Performance & Capacity
- Monitor performance and saturation across regions
- Help with capacity planning and load testing
- Balance reliability, performance, and cost