Key responsibilities:
Run RMA intake/triage: capture symptoms/configuration, ensure traceability (serial/lot/date-code), prioritize by severity, and meet turnaround SLAs.
Perform debug and failure analysis (FA) at rack/server/board/GPU module level: reproduce issues, isolate root cause across power/clock/reset/PCIe, thermals, firmware/BIOS/BMC, and driver interactions; generate clear FA reports.
Lead RCA/CAPA using structured methods (8D/5-Whys): containment, corrective and preventive actions; drive fix validation and closure.
Analyze return and defect trends (top offenders, correlation by lot/vendor/assembly line), publish metrics, and recommend design/process/test improvements.
Coordinate with manufacturing/test, NPI, supplier quality, and reliability teams on yield issues, incoming quality, and systemic escapes.
Qualifications:
BS/MS in EE/CE (preferred) or a related field; equivalent experience considered.
Hands-on experience debugging server platforms and GPU modules (PCIe, power delivery/VRs, high-speed IO, thermals).
Familiarity with RMA/quality workflows and tools (issue tracking, reporting/dashboards).
Strong documentation and cross-functional communication; able to drive issues to closure under time pressure.