Senior Optical Network Engineer

Oracle
Austin, TX

AI2NE strives to be a global leader in the RDMA cluster networking domain and to enable seamless, accelerated advancements in High-Performance Computing (HPC), Artificial Intelligence (AI), and Machine Learning (ML). We envision a future where artificial intelligence and machine learning revolutionize industries, reshape societies, and unlock limitless possibilities. Our vision is to be a pioneering force, driving the development and design of state-of-the-art RDMA clusters tailored specifically for AI, ML, and HPC workloads.

We strive to be the go-to experts in RDMA cluster network architecture, leveraging our deep understanding of the unique demands of AI/ML and HPC applications. By staying at the forefront of technological advancements, we aim to redefine the boundaries of what is possible, pushing the envelope of computational capabilities and unlocking unprecedented performance.

This role supports the design, deployment, and operation of large-scale, global Oracle Cloud Infrastructure (OCI). It focuses primarily on developing and supporting high-speed fiber optic network fabric links and systems. The role combines a deep understanding of optical cables of various types (patch cords, shuffle, bulk/trunk, etc.) and of high-speed optical transceivers used as interconnects in leaf-spine RDMA cluster networks, at the L0/L1 physical layer and the L2 protocol level, with troubleshooting and automation/programming skills. Because OCI is a cloud-based network with a global footprint, this support covers millions of optical links for hundreds of thousands of network devices supporting millions of servers, connected over a mix of dedicated backbone infrastructure, Clos networks, and the Internet.

  • Collaborate with engineers from the L1 optical engineering, network design, delivery, AI Ops, DC Ops, and DC build teams, and with program/project managers, to develop milestones and deliverables for validating the build quality of optical cabling and optical transceivers in AI data center builds against OCI standards for RDMA backend networks.
  • Primarily use existing procedures and tools to develop and safely execute DC network builds and changes, developing new procedures when needed.
  • Provide break-fix support for optical links to meet RDMA cluster performance criteria (pre-FEC BER, Rx power, FEC bin, BOL and EOL margins, etc.).
  • Serve as the escalation point for event remediation and lead post-event root cause analysis.
  • Frequently develop MOPs or scripts to automate routine tasks for the team and business units and to improve build quality.
  • Support building dashboards, with requirements for representing data by L1 layer and device role, that help identify link-level issues and anomalies such as link flaps and link-down events.
  • Serve as SME on data center build standards for the DC build environment, including optical cabling and optical transceiver installation and troubleshooting.
  • Participate in AI DC deployment rotations at DC build sites, with up to 50% domestic travel, for optical link validation of new clusters, and provide recommendations to various teams for improvement and enforcement.
  • Support Ops to stabilize RDMA networks after turn-up.   
  • Network Engineering
  • Network Infrastructure
  • Optical Systems