Job Description
We are seeking a Cloud MLOps Engineer to build and operate the cloud infrastructure that powers machine learning for a robotics platform. This role sits at the intersection of ML research, production systems, and end-user applications, with a strong focus on robot telemetry data, model lifecycle management, and production deployment.
You will enable researchers and applied ML engineers to reliably train, evaluate, and deploy models at scale, while ensuring telemetry-driven insights flow from robots in the real world back into continuous learning systems.
What You’ll Do
Design, deploy, and maintain cloud-native MLOps platforms supporting large-scale ML training, evaluation, and inference workloads
Operate Kubernetes-based infrastructure (self-managed or managed services such as GKE, EKS, or AKS) for ML workloads and data applications
Build and maintain end-to-end ML pipelines that bridge research workflows with production systems
Support robot telemetry ingestion, processing, and analytics, enabling model feedback loops from deployed humanoid robots
Integrate and operate ML tooling such as MLflow, Weights & Biases, Slurm, or similar systems for experiment tracking, scheduling, and reproducibility
Enable model deployment to production, including CI/CD for models, versioning, monitoring, and rollback strategies
Partner closely with ML research, perception, controls, and applications teams to productionize models safely and efficiently
Implement observability across ML systems, including model performance, data drift, and system health
Improve reliability, scalability, and security of cloud ML infrastructure supporting real-world robotic systems