NPU Operator Engineer

Black Sesame Technologies Inc.
San Jose, CA

We are looking for a Junior NPU Kernel/Operator Engineer to develop and optimize deep learning operators for a custom AI accelerator (NPU). The role focuses on kernel/operator implementation, performance tuning, and correctness validation across a broad range of neural network workloads.

This is a good fit for candidates with strong C/C++ and Python skills who are interested in hardware-aware software optimization. Prior NPU experience is helpful but not required.

Responsibilities

  • Implement and optimize NPU operators such as normalization, reduction, transpose, reshape, gather/scatter, quant/dequant, and fused elementwise kernels.
  • Tune kernels for memory bandwidth, SRAM usage, data reuse, DMA latency, bank conflicts, and compute utilization.
  • Validate operator correctness against PyTorch, NumPy, or framework reference results.
  • Benchmark operator performance on the simulator and on silicon.
  • Debug correctness, precision, memory layout, and performance issues.
  • Work with compiler, runtime, hardware, and model teams.
  • Document operator behavior, tensor layout, tiling strategy, and performance results.
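To illustrate the kind of correctness validation described above, here is a minimal Python sketch that checks a kernel's output against a NumPy golden reference. All names (`reference_layernorm`, `check_operator`) and the tolerances are illustrative assumptions, not part of any actual test harness; a real device kernel launch is stood in for by an FP16 round-trip that mimics reduced on-chip precision.

```python
import numpy as np

def reference_layernorm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """NumPy golden reference: layer normalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def check_operator(device_out: np.ndarray, golden: np.ndarray,
                   rtol: float = 1e-3, atol: float = 1e-2) -> float:
    """Compare kernel output against the golden reference and return the
    max absolute error. Tolerances are deliberately looser than FP64
    defaults to allow for reduced-precision on-chip accumulation."""
    assert device_out.shape == golden.shape, "layout/shape mismatch"
    max_abs_err = float(np.max(np.abs(device_out - golden)))
    assert np.allclose(device_out, golden, rtol=rtol, atol=atol), \
        f"max abs error {max_abs_err:.3e} exceeds tolerance"
    return max_abs_err

# Stand-in for a real kernel launch: FP16 round-trip the inputs to
# simulate the precision loss a half-precision datapath would introduce.
x = np.random.default_rng(0).standard_normal((4, 256)).astype(np.float32)
golden = reference_layernorm(x)
device_out = reference_layernorm(x.astype(np.float16).astype(np.float32))
err = check_operator(device_out, golden)
```

In practice the per-dtype tolerances would be derived from the operator's error analysis rather than hard-coded, and mismatches would be reported with the failing indices to speed up layout and precision debugging.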

Requirements

  • BS/MS in CS, EE, Computer Engineering, or related field.
  • Strong C/C++ and Python programming skills.
  • Basic understanding of tensor computation and neural network operators.
  • Familiarity with basic computer architecture concepts such as memory hierarchy, bandwidth, latency, cache/SRAM, and parallelism.
  • Good debugging and problem-solving skills.

Preferred

  • Experience with any of the following: CUDA, Triton, OpenCL, TVM, MLIR, or Halide; SIMD, DSP, embedded C/C++, GPU, NPU, FPGA, or HPC programming; compiler/runtime development.
  • Understanding of tiling, vectorization, memory access optimization, or mixed precision.
  • Experience with FP32, FP16, BF16, INT8, or other numerical formats.
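As a small example of the quant/dequant operators and numerical formats mentioned above, the sketch below shows symmetric per-tensor INT8 quantization in NumPy. The function names and the per-tensor (rather than per-channel) scheme are illustrative assumptions only.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: the scale maps the max
    magnitude to 127; values are rounded and clipped to [-128, 127]."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0  # avoid scale == 0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map back to FP32; round-off error is at most scale / 2 per element."""
    return q.astype(np.float32) * scale

x = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
max_err = float(np.max(np.abs(x - x_hat)))  # bounded by scale / 2
```

Production quantization kernels typically add per-channel scales, zero points for asymmetric schemes, and saturating fixed-point requantization, but the round-and-clip core stays the same.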
