Job Title: Site Reliability Engineer / OpenShift & Kubernetes Engineer (Skan AI Exposure Mandatory)
Location: Concord, California (Hybrid - 3 Days Onsite & 2 Days Remote)
Experience: 10–15+ Years
Employment Type: W2 Only
Note: Only local candidates currently living in California are eligible.
You will manage the technical deployment and upkeep of our on-prem infrastructure, including the Virtual Assistant, Skan Portal, and Gateway components. In this critical role, you’ll lead customer-facing implementations, validate infrastructure environments, perform connectivity assurance, and support continuous reliability and scalability.
Responsibilities:
• Deployment Planning & Coordination: Lead end-to-end customer Gateway deployments, from strategic planning and infrastructure readiness validation to installation and activation.
• Infrastructure Validation: Confirm customer environments meet requirements for networking (TCP/IP, firewall, DNS), storage, and compute before deployment.
• Connectivity Testing & Assurance: Conduct thorough end-to-end connectivity testing across Skan stack components (VA, Portal, Gateway).
• Technical Transition Management: Facilitate smooth transitions from development to production, liaising with stakeholders to set and meet enterprise technical standards.
• Documentation & Standards: Produce and maintain comprehensive documentation - architecture diagrams, network configurations, troubleshooting guides, and deployment workflows.
• System Monitoring & Health Maintenance: Set up and utilize monitoring and logging tools to proactively identify and resolve performance or stability issues.
Qualifications:
• 10–15 years in Systems or Infrastructure Engineering with a strong track record in large-scale deployment projects.
• 3+ years of hands-on experience with OpenShift/Kubernetes, preferably in on-prem environments. Experience with containerization (Docker, Kubernetes).
• Demonstrated understanding of OpenShift Container Platform (OCP) architecture in large enterprise environments.
• Knowledge of the various resources in an OCP environment, such as:
    • Pods, Deployments, StatefulSets, Services, Routes, Namespaces / Projects
    • ConfigMaps, Secrets
    • Persistent Volumes / Persistent Volume Claims
    • Resource requests/limits
• Node, cluster, and container troubleshooting
• Strong experience performing root cause analysis (RCA) for:
    • Pod crashes
    • Restart loops
    • Containerized database instances such as Postgres and StarRocks
• Networking & Security: Solid knowledge of networking (TCP/IP, DNS, DHCP), firewall configurations, and security best practices.
• Monitoring Tools: Familiarity with system monitoring, logging, and incident remediation frameworks.
• Core Competencies: Strong project coordination, proactive problem-solving, and cross-functional communication skills are essential.
• 3+ years of experience working with PostgreSQL
• Strong scripting/automation abilities in Python or a similar language.
Preferred:
• Familiarity with Flink and StarRocks for data layer integration.
• Knowledge of Citrix-based architecture.
Tech Stack Alignment:
• OpenShift, Kubernetes, Docker, GCP, Grafana, Splunk logs, Redis, PostgreSQL, StarRocks, S3, Auth0, Superset.
A Day in the Life of the Role (Key Responsibilities)
• Must have demonstrated experience working independently.
• Monitor application behavior consistently across all 3 environments
• Troubleshoot incidents for RCA end-to-end with minimal supervision
• Work directly with infrastructure, platform, DBA and application teams as needed
• Play a key role in capacity usage monitoring, management, and expansion when required.
• Drive reliability improvements, alert tuning, and operational maturity
IMPORTANT NOTE
Candidates without Skan AI experience will NOT be considered