ML Infrastructure & Security Engineer

Omni Instrument
San Francisco, CA

Company Overview


Omni Instrument is an early-stage startup serving the manufacturing industry. We build autonomous manufacturing tools using edge devices and custom hardware for perception and control. We are building custom data management and ML infrastructure to scale our computer vision models.


What You’ll Do


As an ML Infrastructure & Security Engineer, you will own and operate multi-GPU cloud infrastructure and Kubernetes clusters for large-scale model training and fleet deployment. You will work closely with Software, Computer Vision, and SLAM engineers on scalable data management and model deployment for our customers. We are looking for strong collaborators with solid engineering discipline who are passionate about building secure, reliable systems and scaling deep learning models in production.


This is an on-site role in San Francisco, CA, and you will report directly to the CEO.


Key Responsibilities


●     Scale and manage AWS infrastructure for fleet deployment.

●     Optimize GPU utilization and scheduling within Kubernetes clusters for scalable training and inference workloads.

●     Own Kubernetes clusters and data annotation software, keeping datasets secure.

●     Design and implement secure cloud architecture including IAM policies, VPC isolation, secrets management (KMS/Vault), and compliance with GovCloud/CMMC Level 2.

●     Own observability across infrastructure and ML systems (metrics, logging, alerting) to ensure reliability and performance.


You Have

●     A degree in Computer Science, Cybersecurity, or a related field.

●     Strong command of Linux, virtualization, containers, cluster architecture, and secrets management.

●     Experience with AWS, including IAM, VPC, EC2, S3, and/or GovCloud.

●     Coding expertise in Python or Go, and the ability to scale Kubernetes clusters, whether self-managed or on EKS.

●     Experience implementing dataset versioning, annotation tooling, and access control.

●     Knowledge of TCP/IP, routing, firewalls, VPNs, and other networking technologies.

●     Experience with monitoring, logging, patching, and incident response.

●     Experience designing and optimizing model serving pipelines (e.g., Triton Inference Server, KServe) for real-time and batch inference workloads.


We Prefer

●     Experience managing Linux servers and secure ML training environments.

●     Experience with Terraform or CloudFormation.

●     Experience building secure data pipelines from edge devices to cloud storage, including ingestion, dataset versioning, annotation workflows, and access control.
