NPU Operator Engineer

Black Sesame Technologies Inc.
San Jose, CA

We are looking for a Junior NPU Kernel/Operator Engineer to develop and optimize deep learning operators for a custom AI accelerator (NPU). The role focuses on kernel/operator implementation, performance tuning, and correctness validation across a broad range of neural network workloads.

This is a good fit for candidates with strong C/C++ and Python skills who are interested in hardware-aware software optimization. Prior NPU experience is helpful but not required.

Responsibilities

  • Implement and optimize NPU operators such as normalization, reduction, transpose, reshape, gather/scatter, quant/dequant, and fused elementwise kernels.
  • Tune kernels for memory bandwidth, SRAM usage, data reuse, DMA latency, bank conflicts, and compute utilization.
  • Validate operator correctness against PyTorch, NumPy, or framework reference results.
  • Benchmark operator performance on the simulator and on silicon.
  • Debug correctness, precision, memory layout, and performance issues.
  • Work with compiler, runtime, hardware, and model teams.
  • Document operator behavior, tensor layout, tiling strategy, and performance results.
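To illustrate the kind of correctness validation described above, here is a minimal Python sketch that checks a kernel's output against a NumPy golden reference. All names (`reference_layernorm`, `check_operator`) and the tolerances are illustrative assumptions, not part of any actual test harness; a real device kernel launch is stood in for by an FP16 round-trip that mimics reduced on-chip precision.

```python
import numpy as np

def reference_layernorm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """NumPy golden reference: layer normalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def check_operator(device_out: np.ndarray, golden: np.ndarray,
                   rtol: float = 1e-3, atol: float = 1e-2) -> float:
    """Compare kernel output against the golden reference and return the
    max absolute error. Tolerances are deliberately looser than FP64
    defaults to allow for reduced-precision on-chip accumulation."""
    assert device_out.shape == golden.shape, "layout/shape mismatch"
    max_abs_err = float(np.max(np.abs(device_out - golden)))
    assert np.allclose(device_out, golden, rtol=rtol, atol=atol), \
        f"max abs error {max_abs_err:.3e} exceeds tolerance"
    return max_abs_err

# Stand-in for a real kernel launch: FP16 round-trip the inputs to
# simulate the precision loss a half-precision datapath would introduce.
x = np.random.default_rng(0).standard_normal((4, 256)).astype(np.float32)
golden = reference_layernorm(x)
device_out = reference_layernorm(x.astype(np.float16).astype(np.float32))
err = check_operator(device_out, golden)
```

In practice the per-dtype tolerances would be derived from the operator's error analysis rather than hard-coded, and mismatches would be reported with the failing indices to speed up layout and precision debugging.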

Requirements

  • BS/MS in CS, EE, Computer Engineering, or related field.
  • Strong C/C++ and Python programming skills.
  • Basic understanding of tensor computation and neural network operators.
  • Familiarity with basic computer architecture concepts such as memory hierarchy, bandwidth, latency, cache/SRAM, and parallelism.
  • Good debugging and problem-solving skills.

Preferred

  • Experience with any of the following: CUDA, Triton, OpenCL, TVM, MLIR, or Halide; SIMD, DSP, embedded C/C++, GPU, NPU, FPGA, or HPC programming; compiler/runtime development.
  • Understanding of tiling, vectorization, memory access optimization, or mixed precision.
  • Experience with FP32, FP16, BF16, INT8, or other numerical formats.
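As a small example of the quant/dequant operators and numerical formats mentioned above, the sketch below shows symmetric per-tensor INT8 quantization in NumPy. The function names and the per-tensor (rather than per-channel) scheme are illustrative assumptions only.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: the scale maps the max
    magnitude to 127; values are rounded and clipped to [-128, 127]."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0  # avoid scale == 0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map back to FP32; round-off error is at most scale / 2 per element."""
    return q.astype(np.float32) * scale

x = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
max_err = float(np.max(np.abs(x - x_hat)))  # bounded by scale / 2
```

Production quantization kernels typically add per-channel scales, zero points for asymmetric schemes, and saturating fixed-point requantization, but the round-and-clip core stays the same.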
