Machine Learning Engineer (Inference)
San Francisco, On-Site
$200,000–$300,000 + equity
Why this role
Early-stage infra company building a next-gen AI cloud (a "neocloud"), rethinking how models run across heterogeneous hardware.
You’ll own the layer that actually executes models in production.
🧠 What you’ll do
- Build end-to-end inference systems (request → runtime → response)
- Optimise for latency, throughput, and concurrency under real load
- Design batching, scheduling, and queuing systems (see the sketch after this list)
- Manage KV cache + memory at scale
- Debug performance across model → runtime → hardware
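To make the batching/queuing bullet concrete, here is a minimal sketch of a dynamic batcher in plain asyncio. Every name in it (DynamicBatcher, run_batch, max_wait_ms) is illustrative, not the company's actual stack; a real serving system would add continuous batching, preemption, and backpressure on top of this pattern.

```python
import asyncio

class DynamicBatcher:
    """Collects concurrent requests into batches for one fused forward pass."""

    def __init__(self, run_batch, max_batch_size=8, max_wait_ms=5):
        self.run_batch = run_batch            # list[request] -> list[response]
        self.max_batch_size = max_batch_size  # cap on batch width
        self.max_wait_ms = max_wait_ms        # latency budget spent waiting to batch
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one request and await its response."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Scheduler loop: drain the queue into batches and dispatch them."""
        while True:
            batch = [await self.queue.get()]   # block until at least one request
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self.max_wait_ms / 1000
            # Grow the batch until it is full or the wait budget is spent:
            # this is the core latency-vs-throughput trade-off.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.run_batch([item for item, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

async def main():
    # Stand-in "model": uppercases each request in one batched call.
    batcher = DynamicBatcher(run_batch=lambda xs: [x.upper() for x in xs])
    scheduler = asyncio.create_task(batcher.run())
    print(await asyncio.gather(*(batcher.submit(s) for s in ["a", "b", "c"])))
    scheduler.cancel()

asyncio.run(main())
```

The `max_wait_ms` knob is the whole game: wait longer and throughput rises as batches fill, but tail latency grows with it.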
The fun technical bits
- Deep dives into LLM inference (prefill, decode, attention)
- Solving tail latency + throughput trade-offs
- Working across systems, ML, and hardware layers
- Optimising across GPUs + next-gen accelerators
- Hands-on with vLLM, TensorRT-LLM, or custom runtimes (a minimal vLLM example follows)
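Since vLLM is named, here is its standard offline-inference entry point; the model name is just a placeholder, and production deployments would tune scheduling and KV-cache settings well beyond this.

```python
# Minimal vLLM offline-inference sketch; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # loads weights, pre-allocates paged KV-cache blocks
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() runs prefill + decode with continuous batching under the hood.
outputs = llm.generate(["The key to low-latency inference is"], params)
print(outputs[0].outputs[0].text)
```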
🎯 What they want
- Experience with ML inference / model serving systems
- Strong systems or backend engineering fundamentals
- Comfortable with performance, memory, and scaling challenges
- Python + C++