DevOps Engineer, AI Infrastructure | NYC | Confidential — Global Alternative Asset Manager
We're partnering with a leading alternative asset management holding company to find a DevOps Engineer who will own the infrastructure layer underneath their AI platform. This firm operates across asset management, reinsurance, alternative credit, and energy — and they are actively shipping AI agents into production across all of it.
The modern AWS foundation is in place and the AI platform is taking shape. What's being built now is the deployment, runtime, and operational backbone that the rest of the firm will build on. This is not a support role. You will own meaningful, hard problems at the center of a fast-moving AI engineering org.
The Role
You'll sit on the AI team at the holdings level, embedded at the center of the firm's AI buildout. You'll partner closely with platform engineers on shared infrastructure decisions, with forward-deployed AI engineers on what colleagues across the firm need to ship, and with existing infrastructure and security teams on how AI workloads fit the firm's broader posture.
This is a hands-on, high-ownership role. You will carry real responsibility from day one.
What You'll Own
Deployment Infrastructure: Build and operate deployment pipelines purpose-built for agentic systems, where what's shipping isn't a deterministic service but a system whose behavior depends on prompts, tools, models, and context that all change independently.
Runtime & Orchestration: Build runtime infrastructure for agentic workloads on Kubernetes, including orchestration of long-running multi-step jobs, autoscaling for bursty agent traffic, and the lifecycle management these workloads demand.
Observability: Make agent behavior observable end-to-end. When an agent takes ten steps to accomplish something, you can see every step, every tool call, and every input and output, and trace failures to root cause in minutes, not hours.
Security Posture: Own the security posture of agentic systems in production, including the secrets and permissioning model that scopes what agents can do, defenses against prompt injection and data exfiltration, tool-use sandboxing, and clear audit trails for every action an agent takes.
Incident Response & On-Call: Carry your share of incident response for AI workloads in production and write the runbooks that let the rest of the team respond as confidently as you do.
Platform Primitives: Spot where infrastructure should be productized into shared tooling, and partner with platform engineers to build it once, well.
What We're Looking For
Nice to Have