Job Summary
We are seeking a highly capable Senior DevOps Engineer / Platform Engineer to build, operationalize, and scale the infrastructure and deployment foundation for a strategic site-builder / network automation platform. This role will focus on creating reliable CI/CD pipelines, production-grade Kubernetes deployment patterns, managed database services, observability, environment reproducibility, secrets management, and Infrastructure as Code across development, testing, staging, and production environments.
This engineer will play a critical role in moving the platform from an early-stage, partially manual operating model to a repeatable, supportable, and production-ready DevOps model. The environment includes Kubernetes-hosted services, AWS managed services, workflow orchestration with Temporal, integration with Nautobot, Argo-based promotion flows, and the supporting tooling required for debugging, snapshotting, local development, and production support.
This is a hands-on engineering role for someone who can design the right platform patterns, implement them directly, and establish a durable operating model between development and DevOps teams.
Key Responsibilities
Platform Deployment & CI/CD
• Design, implement, and maintain CI/CD pipelines for testing, staging, and production environments.
• Build and maintain deployment workflows that support safe and seamless promotion across environments.
• Improve and maintain Argo-based deployment workflows to enable controlled release progression from test to staging to production.
• Establish baseline deployment mechanisms for the site-builder application and related services.
• Standardize Kubernetes application packaging and deployment patterns, with a strong preference for Helm-based lifecycle management for complex services and third-party components.
• Migrate existing deployments to Helm charts where appropriate.
Kubernetes & Runtime Platform Engineering
• Support the deployment and ongoing operation of services running in Kubernetes.
• Improve runtime reliability, resiliency, and troubleshooting for distributed services operating inside shared Kubernetes clusters.
• Investigate and harden service-to-service connectivity patterns, especially for workflow components such as workers connecting to the Temporal engine.
• Partner with development teams to define production-grade runtime requirements, resource sizing, restart policies, and platform support boundaries.
Infrastructure as Code & Cloud Services
• Design and implement fully declarative Infrastructure as Code for managed cloud services, especially in AWS.
• Provision and maintain managed data services such as RDS/PostgreSQL and MongoDB-compatible document databases across all environments.
• Eliminate manual infrastructure setup where possible and replace it with reproducible, version-controlled deployment patterns.
• Prepare the platform for future scale across multiple environments and regions through repeatable IaC and GitOps-aligned practices.
Data Services, Snapshots & Developer Enablement
• Set up and maintain RDS, MongoDB, Redis/cache services, and related dependencies for all environments.
• Build tooling and operational processes for:
◦ production and staging database snapshots,
◦ restoring snapshots into development environments,
◦ enabling local debugging and development from realistic data states.
• Support creation of local and development environments, including Minikube-based environment-as-code approaches that mirror production behavior as closely as practical.
• Improve platform reproducibility so engineers can quickly stand up close-to-production development environments.
Workflow Orchestration & Temporal Support
• Lead the setup, deployment, and operational support of Temporal for workflow orchestration.
• Support production operations for Temporal, including troubleshooting performance issues, restarts, scaling concerns, and resource shortages.
• Establish maintainable deployment patterns for Temporal using supported packaging and lifecycle management approaches.
• Partner with engineering teams to ensure workflow platform reliability and upgradeability over time.
Observability, Reliability & Incident Readiness
• Design and maintain observability across testing, staging, and production using tools such as Prometheus and Grafana.
• Define and implement monitoring for:
◦ service and cluster utilization,
◦ CPU, memory, and storage,
◦ IOPS / throughput metrics,
◦ database connections and session counts,
◦ cache hit / miss / coverage metrics,
◦ RDS and MongoDB utilization,
◦ service health and alerting.
• Build and maintain logging, tracing, and correlation capabilities, separated appropriately by environment.
• Create tools to support deep debugging and operational inspection, including raw database reads, cleanup of unused volumes, and emergency cache invalidation.
Security, Access & Secrets Management
• Maintain secrets management processes across environments.
• Build tooling for short-lived internal token generation and long-lived secret rotation.
• Support secure access from deployed services to active production devices and southbound systems.
• Help establish credential management patterns for southbound integrations and device-facing access.
• Partner with related teams to define safe operational limits and controls for service integrations.
External Integrations & Platform Support
• Support integration patterns with Nautobot and help define safe client-side behaviors such as rate limiting, retry/backoff, and service protection mechanisms.
• Partner with application teams to understand and mitigate integration issues such as rate limiting or request rejection.
• Support staging and testing by enabling virtual device environments where needed.
• Contribute to end-to-end acceptance testing and production readiness activities.
Operating Model & Cross-Functional Execution
• Help define an effective operating model between Development and DevOps, whether via RACI, embedded Agile delivery, or a hybrid support model.
• Support deployment readiness, incident management, environment ownership boundaries, and lifecycle responsibilities.
• Work closely with software engineering, infrastructure, application owners, and partner teams to drive production readiness and sustainable operations.
Required Qualifications
• Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
• 7+ years of experience in DevOps, Platform Engineering, SRE, or Infrastructure Engineering roles.
• Strong hands-on experience with Kubernetes in production environments.
• Strong experience building and maintaining CI/CD pipelines for multi-environment software delivery.
• Strong experience with Argo CD, GitOps workflows, or equivalent deployment tooling.
• Strong experience with Helm and Kubernetes package/deployment lifecycle management.
• Experience with AWS managed services, especially RDS/PostgreSQL, document databases, and related infrastructure.
• Strong experience with Infrastructure as Code, such as Terraform and/or similar declarative tooling.
• Experience with Prometheus, Grafana, and modern observability practices.
• Experience with Redis/cache services, secrets management, and operational debugging.
• Strong Linux, networking, and distributed systems troubleshooting skills.
• Strong scripting and automation skills in one or more languages such as Python, Bash, or Go.
• Proven ability to work cross-functionally and operate effectively in environments where ownership boundaries are still evolving.
Preferred Qualifications
• Experience with Temporal deployment and production operations.
• Experience supporting developer platforms with local environment reproducibility using Minikube, kind, or similar tools.
• Experience with MongoDB / DocumentDB operations and restore workflows.
• Experience integrating with Nautobot, NetBox, or similar infrastructure source-of-truth platforms.
• Experience operating in shared-cluster environments with multi-team tenancy and constrained access models.
• Experience designing platform patterns for internal products that must scale across regions or multiple deployment footprints.
• Familiarity with network automation or infrastructure orchestration platforms.
What Success Looks Like
• CI/CD pipelines are reliable, repeatable, and support safe promotion across all environments.
• Kubernetes deployments are standardized, maintainable, and production-ready.
• Managed infrastructure is defined as code rather than through manual setup.
• Temporal, databases, cache layers, and observability tooling are stable and supportable.
• Development teams can reproduce realistic environments locally for faster debugging and delivery.
• Secrets, access patterns, and operational tooling are mature enough to support production-scale operations.
• The DevOps operating model is clearly defined and enables faster deployments with less operational risk.