Senior Kubernetes Engineer (GPU / AI Platforms)

GTN Technical Staffing

Location: Dallas, TX (Hybrid)

Type: Direct Hire

• Competitive base salary + performance bonus

• 100% company-paid benefits


**This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship. We are unable to sponsor or take over sponsorship of employment visas at this time.**


Overview

We are seeking a Senior Kubernetes Engineer to design, implement, and optimize GPU-accelerated container platforms at scale within a high-performance computing environment.

This role focuses on enabling AI/ML, HPC, and large-scale training workloads across hybrid and on-prem infrastructure. The position requires deep expertise across both Kubernetes and NVIDIA ecosystems, with a strong emphasis on GPU scheduling, performance optimization, and platform automation.

The ideal candidate brings hands-on experience building production-grade Kubernetes platforms for GPU-intensive workloads, along with strong development skills and a passion for scalable, high-performance infrastructure.

Key Responsibilities

Kubernetes Platform Engineering

• Architect and operate Kubernetes clusters optimized for GPU workloads

• Leverage NVIDIA GPU Operator, Network Operator, and DCGM for cluster performance and observability

• Ensure platform scalability, reliability, and performance for high-throughput workloads

GPU Enablement & Scheduling

• Integrate NVIDIA device plugins, Multi-Instance GPU (MIG), and GPU sharing capabilities into Kubernetes scheduling

• Optimize GPU utilization and workload placement using kube-scheduler plugins and batch schedulers such as Volcano and Slurm

• Support GPU-intensive workloads including LLM training, AI/ML pipelines, and scientific computing

Automation & Operator Development

• Develop, deploy, and maintain custom Kubernetes operators and controllers

• Automate infrastructure services and platform operations using Go or Python

• Contribute to Infrastructure-as-Code practices using Terraform, Helm, and Kustomize

Observability & Performance

• Implement monitoring and telemetry solutions using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry

• Drive performance tuning and capacity optimization across GPU-enabled clusters

• Participate in incident response and production readiness reviews

Security & Multi-Tenancy

• Implement secure multi-tenant environments with RBAC and policy enforcement (OPA, Gatekeeper)

• Ensure proper isolation across users, namespaces, and workloads

DevOps & CI/CD

• Maintain and enhance CI/CD pipelines using GitOps tools such as Argo CD and Flux

• Support continuous deployment and lifecycle management of Kubernetes infrastructure

Cross-Functional Collaboration

• Partner with HPC, ML, DevOps, and platform engineering teams to support high-performance workloads

• Collaborate on infrastructure design, optimization, and operational best practices

Required Experience

• Extensive experience operating Kubernetes in production-grade environments

• Deep expertise with NVIDIA and Kubernetes ecosystems including GPU Operator, device plugins, NVML, MIG, and DCGM

• Strong understanding of Kubernetes internals including CRDs, RBAC, custom controllers, and scheduler extensions

• Proficiency in Go or Python for operator development and automation

• Experience supporting GPU-intensive workloads such as LLM training, AI/ML pipelines, or HPC workloads

• Hands-on experience with Helm, Kustomize, and GitOps workflows

Technical Skills

• Experience with Prometheus, Grafana, DCGM Exporter, and OpenTelemetry for monitoring and observability

• Familiarity with CNI plugins including NVIDIA CNI and Multus

• Experience with Infrastructure-as-Code tools such as Terraform

• Knowledge of CI/CD pipelines and Git-based workflows

Preferred Experience

• Experience with container runtimes such as containerd and CRI-O, and with the NVIDIA Container Toolkit

• Exposure to Cilium or advanced CNI networking solutions

• Contributions to open-source projects within Kubernetes or NVIDIA ecosystems

• Experience working in HPC or large-scale AI infrastructure environments
