DevOps Engineer

Pentangle Tech Services | P5 Group
Santa Clara, CA

Job Summary

We are seeking a highly capable Senior DevOps Engineer / Platform Engineer to build, operationalize, and scale the infrastructure and deployment foundation for a strategic site-builder / network automation platform. This role will focus on creating reliable CI/CD pipelines, production-grade Kubernetes deployment patterns, managed database services, observability, environment reproducibility, secrets management, and Infrastructure as Code across development, testing, staging, and production environments.

This engineer will play a critical role in moving the platform from an early-stage, partially manual operating model into a repeatable, supportable, and production-ready DevOps model. The environment includes Kubernetes-hosted services, AWS managed services, workflow orchestration with Temporal, integration with Nautobot, Argo-based promotion flows, and the supporting tooling required for debugging, snapshotting, local development, and production support.

This is a hands-on engineering role for someone who can design the right platform patterns, implement them directly, and establish a durable operating model between development and DevOps teams.

Key Responsibilities

Platform Deployment & CI/CD

• Design, implement, and maintain CI/CD pipelines for testing, staging, and production environments.

• Build and maintain deployment workflows that support safe and seamless promotion across environments.

• Improve and maintain Argo-based deployment workflows to enable controlled release progression from test to staging to production.

• Establish baseline deployment mechanisms for the site-builder application and related services.

• Standardize Kubernetes application packaging and deployment patterns, with a strong preference for Helm-based lifecycle management for complex services and third-party components.

• Migrate existing deployments to Helm charts where appropriate.

Kubernetes & Runtime Platform Engineering

• Support the deployment and ongoing operation of services running in Kubernetes.

• Improve runtime reliability, resiliency, and troubleshooting for distributed services operating inside shared Kubernetes clusters.

• Investigate and harden service-to-service connectivity patterns, especially for workflow components such as workers connecting to the Temporal engine.

• Partner with development teams to define production-grade runtime requirements, resource sizing, restart policies, and platform support boundaries.

Infrastructure as Code & Cloud Services

• Design and implement fully declarative Infrastructure as Code for managed cloud services, especially in AWS.

• Provision and maintain managed data services such as RDS/PostgreSQL and MongoDB-compatible document databases across all environments.

• Eliminate manual infrastructure setup where possible and replace it with reproducible, version-controlled deployment patterns.

• Prepare the platform for future scale across multiple environments and regions through repeatable IaC and GitOps-aligned practices.

Data Services, Snapshots & Developer Enablement

• Set up and maintain RDS, MongoDB, Redis/cache services, and related dependencies for all environments.

• Build tooling and operational processes for:

◦ production and staging database snapshots,

◦ restoring snapshots into development environments,

◦ enabling local debugging and development from realistic data states.

• Support creation of local and development environments, including Minikube-based environment-as-code approaches that mirror production behavior as closely as practical.

• Improve platform reproducibility so engineers can quickly stand up close-to-production development environments.

Workflow Orchestration & Temporal Support

• Lead the setup, deployment, and operational support of Temporal for workflow orchestration.

• Support production operations for Temporal, including troubleshooting performance issues, restarts, scaling concerns, and resource shortages.

• Establish maintainable deployment patterns for Temporal using supported packaging and lifecycle management approaches.

• Partner with engineering teams to ensure workflow platform reliability and upgradeability over time.

Observability, Reliability & Incident Readiness

• Design and maintain observability across testing, staging, and production using tools such as Prometheus and Grafana.

• Define and implement monitoring for:

◦ service and cluster utilization,

◦ CPU, memory, and storage,

◦ IOPS / throughput metrics,

◦ database connections and session counts,

◦ cache hit / miss / coverage metrics,

◦ RDS and MongoDB utilization,

◦ service health and alerting.

• Build and maintain logging, tracing, and correlation capabilities, separated appropriately by environment.

• Create tools to support deep debugging and operational inspection, including raw database reads, cleanup of unused volumes, and emergency cache invalidation.

Security, Access & Secrets Management

• Maintain secrets management processes across environments.

• Build tooling for short-lived internal token generation and long-lived secret rotation.

• Support secure access from deployed services to active production devices and southbound systems.

• Help establish credential management patterns for southbound integrations and device-facing access.

• Partner with related teams to define safe operational limits and controls for service integrations.

External Integrations & Platform Support

• Support integration patterns with Nautobot and help define safe client-side behaviors such as rate limiting, retry/backoff, and service protection mechanisms.

• Partner with application teams to understand and mitigate integration issues such as rate limiting or request rejection.

• Support staging and testing by enabling virtual device environments where needed.

• Contribute to end-to-end acceptance testing and production readiness activities.

Operating Model & Cross-Functional Execution

• Help define an effective operating model between Development and DevOps, whether via RACI, embedded Agile delivery, or a hybrid support model.

• Support deployment readiness, incident management, environment ownership boundaries, and lifecycle responsibilities.

• Work closely with software engineering, infrastructure, application owners, and partner teams to drive production readiness and sustainable operations.

Required Qualifications

• Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.

• 7+ years of experience in DevOps, Platform Engineering, SRE, or Infrastructure Engineering roles.

• Strong hands-on experience with Kubernetes in production environments.

• Strong experience building and maintaining CI/CD pipelines for multi-environment software delivery.

• Strong experience with Argo CD, GitOps workflows, or equivalent deployment tooling.

• Strong experience with Helm and Kubernetes package/deployment lifecycle management.

• Experience with AWS managed services, especially RDS/PostgreSQL, document databases, and related infrastructure.

• Strong experience with Infrastructure as Code, such as Terraform and/or similar declarative tooling.

• Experience with Prometheus, Grafana, and modern observability practices.

• Experience with Redis/cache services, secrets management, and operational debugging.

• Strong Linux, networking, and distributed systems troubleshooting skills.

• Strong scripting and automation skills in one or more languages such as Python, Bash, or Go.

• Proven ability to work cross-functionally and operate effectively in environments where ownership boundaries are still evolving.

Preferred Qualifications

• Experience with Temporal deployment and production operations.

• Experience supporting developer platforms with local environment reproducibility using Minikube, kind, or similar tools.

• Experience with MongoDB / DocumentDB operations and restore workflows.

• Experience integrating with Nautobot, NetBox, or similar infrastructure source-of-truth platforms.

• Experience operating in shared-cluster environments with multi-team tenancy and constrained access models.

• Experience designing platform patterns for internal products that must scale across regions or multiple deployment footprints.

• Familiarity with network automation or infrastructure orchestration platforms is a plus.

What Success Looks Like

• CI/CD pipelines are reliable, repeatable, and support safe promotion across all environments.

• Kubernetes deployments are standardized, maintainable, and production-ready.

• Managed infrastructure is defined as code rather than through manual setup.

• Temporal, databases, cache layers, and observability tooling are stable and supportable.

• Development teams can reproduce realistic environments locally for faster debugging and delivery.

• Secrets, access patterns, and operational tooling are mature enough to support production-scale operations.

• The DevOps operating model is clearly defined and enables faster deployments with less operational risk.
