DevOps Engineer, AI Infrastructure

Oscar Faye

Manhattan, NY

DevOps Engineer, AI Infrastructure | NYC | Confidential — Global Alternative Asset Manager

We're partnering with a leading alternative asset management holding company to find a DevOps Engineer who will own the infrastructure layer underneath their AI platform. This firm operates across asset management, reinsurance, alternative credit, and energy — and they are actively shipping AI agents into production across all of it.

The modern AWS foundation is in place and the AI platform is taking shape. What's being built now is the deployment, runtime, and operational backbone that the rest of the firm will build on. This is not a support role. You will own meaningful, hard problems at the center of a fast-moving AI engineering org.

The Role

You'll sit on the AI team at the holdings level, embedded at the center of the firm's AI buildout. You'll partner closely with platform engineers on shared infrastructure decisions, with forward-deployed AI engineers on what colleagues across the firm need to ship, and with existing infrastructure and security teams on how AI workloads fit the firm's broader posture.

This is a hands-on, high-ownership role. You will carry real responsibility from day one.

What You'll Own

Deployment Infrastructure: Build and operate deployment pipelines purpose-built for agentic systems, where what's shipping isn't a deterministic service but a system whose behavior depends on prompts, tools, models, and context that all change independently.

Runtime & Orchestration: Build runtime infrastructure for agentic workloads on Kubernetes, including orchestration of long-running multi-step jobs, autoscaling for bursty agent traffic, and the lifecycle management these workloads demand.

Observability: Make agent behavior observable end-to-end. When an agent takes ten steps to accomplish something, you can see every step, every tool call, and every input and output, and trace failures to root cause in minutes, not hours.

Security Posture: Own the security posture of agentic systems in production: the secrets and permissioning model that scopes what agents can do, defenses against prompt injection and data exfiltration, tool-use sandboxing, and clear audit trails for every action an agent takes.

Incident Response & On-Call: Carry your share of incident response for AI workloads in production and write the runbooks that let the rest of the team respond as confidently as you do.

Platform Primitives: Spot where infrastructure should be productized into shared tooling, and partner with platform engineers to build it once, well.

What We're Looking For

8-12 years running production infrastructure that real users depend on, with hands-on experience owning deploys, on-call, and incident response
Deep AWS experience and infrastructure-as-code discipline, with the judgment to know when to use a managed service versus building your own
Strong Kubernetes fluency — operating clusters in production, debugging workload issues, reasoning about networking, scheduling, and security primitives
First-principles debugging instincts: when something fails intermittently, you can trace it through the load balancer, TLS handshake, DNS resolver, OIDC flow, and upstream API without guessing
Security mindset built in from the start — you think about blast radius and least privilege before you ship, not after
Strong scripting and automation skills, with enough range to read and contribute to application code when the problem calls for it
Clear communicator who can translate what AI engineers need into infrastructure that actually serves them
Must have existing work authorization
Requires on-site work 5 days a week (Manhattan)

Nice to Have

Hands-on experience building agents, including tool-use orchestration and multi-step workflows
Familiarity with workflow orchestration for durable, long-running execution (Temporal or similar)
Experience deploying and operating LLM applications in production, including evaluation harnesses, guardrails, and rollback strategies
Experience building developer platforms or internal tooling that engineers actually enjoy using
Familiarity with Snowflake
