DevOps Engineer, AI Infrastructure

Oscar Faye
Manhattan, NY

DevOps Engineer, AI Infrastructure | NYC | Confidential — Global Alternative Asset Manager

We're partnering with a leading alternative asset management holding company to find a DevOps Engineer who will own the infrastructure layer underneath their AI platform. This firm operates across asset management, reinsurance, alternative credit, and energy — and they are actively shipping AI agents into production across all of it.

The modern AWS foundation is in place and the AI platform is taking shape. What's being built now is the deployment, runtime, and operational backbone that the rest of the firm will build on. This is not a support role. You will own meaningful, hard problems at the center of a fast-moving AI engineering org.

The Role

You'll sit on the AI team at the holdings level, embedded at the center of the firm's AI buildout. You'll partner closely with platform engineers on shared infrastructure decisions, with forward-deployed AI engineers on what colleagues across the firm need to ship, and with existing infrastructure and security teams on how AI workloads fit the firm's broader posture.

This is a hands-on, high-ownership role. You will carry real responsibility from day one.

What You'll Own

Deployment Infrastructure: Build and operate deployment pipelines purpose-built for agentic systems, where what's shipping isn't a deterministic service but a system whose behavior depends on prompts, tools, models, and context that all change independently.

Runtime & Orchestration: Build runtime infrastructure for agentic workloads on Kubernetes, including orchestration of long-running multi-step jobs, autoscaling for bursty agent traffic, and the lifecycle management these workloads demand.

Observability: Make agent behavior observable end-to-end. When an agent takes ten steps to accomplish something, you can see every step, every tool call, and every input and output, and trace failures to root cause in minutes, not hours.

Security Posture: Own the security posture of agentic systems in production: the secrets and permissioning model that scopes what agents can do, defenses against prompt injection and data exfiltration, tool-use sandboxing, and clear audit trails for every action an agent takes.

Incident Response & On-Call: Carry your share of incident response for AI workloads in production and write the runbooks that let the rest of the team respond as confidently as you do.

Platform Primitives: Spot where infrastructure should be productized into shared tooling, and partner with platform engineers to build it once, well.

What We're Looking For

  • 8-12 years running production infrastructure that real users depend on, with hands-on experience owning deploys, on-call, and incident response
  • Deep AWS experience and infrastructure-as-code discipline, with the judgment to know when to use a managed service versus building your own
  • Strong Kubernetes fluency — operating clusters in production, debugging workload issues, reasoning about networking, scheduling, and security primitives
  • First-principles debugging instincts: when something fails intermittently, you can trace it through the load balancer, TLS handshake, DNS resolver, OIDC flow, and upstream API without guessing
  • Security mindset built in from the start — you think about blast radius and least privilege before you ship, not after
  • Strong scripting and automation skills, with enough range to read and contribute to application code when the problem calls for it
  • Clear communicator who can translate what AI engineers need into infrastructure that actually serves them
  • Must have existing work authorization
  • On-site required 5 days a week (Manhattan)

Nice to Have

  • Hands-on experience building agents, including tool-use orchestration and multi-step workflows
  • Familiarity with workflow orchestration for durable, long-running execution (Temporal or similar)
  • Experience deploying and operating LLM applications in production, including evaluation harnesses, guardrails, and rollback strategies
  • Experience building developer platforms or internal tooling that engineers actually enjoy using
  • Familiarity with Snowflake
