Site Reliability Engineer / OpenShift & Kubernetes Engineer (Skan AI Exposure Mandatory)

Kaav Inc
Concord, CA

Job Title: Site Reliability Engineer / OpenShift & Kubernetes Engineer (Skan AI Exposure Mandatory)

Location: Concord, California (Hybrid - 3 Days Onsite & 2 Days Remote)

Experience: 10–15+ Years

Employment Type: W2 Only


Note: Only candidates currently residing in California are eligible.


This role manages the technical deployment and upkeep of our on-prem infrastructure, including the Virtual Assistant, Skan Portal, and Gateway components. In this critical role, you’ll lead customer-facing implementations, validate infrastructure environments, perform connectivity assurance, and support continuous reliability and scalability.


Responsibilities:

• Deployment Planning & Coordination: Lead end-to-end customer Gateway deployments, from strategic planning and infrastructure readiness validation to installation and activation.

• Infrastructure Validation: Confirm that customer environments meet requirements for networking (TCP/IP, firewall, DNS), storage, and compute before deployment.

• Connectivity Testing & Assurance: Conduct thorough end-to-end connectivity testing across Skan stack components (VA, Portal, Gateway).

• Technical Transition Management: Facilitate smooth transitions from development to production, liaising with stakeholders to set and meet enterprise technical standards.

• Documentation & Standards: Produce and maintain comprehensive documentation, including architecture diagrams, network configurations, troubleshooting guides, and deployment workflows.

• System Monitoring & Health Maintenance: Set up and utilize monitoring and logging tools to proactively identify and resolve performance or stability issues.


Qualifications:

• 10-15 years in Systems or Infrastructure Engineering with a strong track record in large-scale deployment projects.

• 3+ years of hands-on experience with OpenShift/Kubernetes, preferably in on-prem environments. Experience with containerization (Docker, Kubernetes).

• Demonstrated understanding of OpenShift Container Platform (OCP) architecture in a large enterprise environment.

• Knowledge of the resources in an OCP environment, such as:

  • Pods, Deployments, StatefulSets, Services, Routes, Namespaces / Projects

  • ConfigMaps, Secrets

  • Persistent Volumes / Persistent Volume Claims

  • Resource requests/limits

• Node, cluster, and container troubleshooting

• Strong experience performing root cause analysis (RCA) for:

  • Pod crashes

  • Restart loops

  • Containerized database instances such as Postgres and StarRocks

• Networking & Security: Solid knowledge of networking (TCP/IP, DNS, DHCP), firewall configurations, and security best practices.

• Monitoring Tools: Familiarity with system monitoring, logging, and incident remediation frameworks.

• Core Competencies: Strong project coordination, proactive problem-solving, and cross-functional communication skills are essential.

• 3+ years of experience working with PostgreSQL

• Strong scripting/automation abilities in Python or a similar language.

Preferred:

• Familiarity with Flink and StarRocks for data layer integration.

• Knowledge of Citrix-based architecture.

Tech Stack Alignment:

• OpenShift, Kubernetes, Docker, GCP, Grafana, Splunk logs, Redis, PostgreSQL, StarRocks, S3, Auth0, Superset.


A Day in the Life of the Role (Key Responsibilities)

• Must have demonstrated prior experience working independently.

• Monitor application behavior consistently across all 3 environments

• Troubleshoot incidents for RCA end-to-end with minimal supervision

• Work directly with infrastructure, platform, DBA and application teams as needed

• Play a key role in capacity usage monitoring, management, and expansion when required.

• Drive reliability improvements, alert tuning, and operational maturity


IMPORTANT NOTE

Candidates without Skan AI experience will NOT be considered
