Major Incident Management (MIM) & NOC Lead

Kodeva LLC
Wilmington, DE

Job Title: Major Incident Management (MIM) & NOC Lead (10+ Years)

Location: Onsite – Wilmington, DE (onsite from Day 1)

Employment Type: Full-time

Experience: 10+ years in IT Operations / NOC / Major Incident Management, including leadership responsibility.


Role Summary:

The Major Incident Management & NOC Lead is responsible for end-to-end command and control of the enterprise’s 24x7 operational monitoring and incident response. This role leads the MIM and NOC function, drives Major Incident (P1/P2) execution, ensures rapid service restoration, and continuously improves operational maturity through problem management, automation, observability enhancements, and SLA governance.

This role requires a mix of strong incident leadership, technical depth across infrastructure and applications, and people/process management to ensure stability, availability, and performance across critical services.

Key Responsibilities:

A) Major Incident Management (Command & Control)

  • Own the Major Incident (P1/P2) process from detection to resolution, including war-room leadership, stakeholder updates, and closure.
  • Act as the Incident Commander and ensure structured triage, containment, workaround, and restoration.
  • Drive cross-functional coordination (App, Infra, Network, Security, DB, Cloud, Vendor teams) to reduce MTTR.
  • Ensure high-quality incident communications: executive summaries, impact analysis, ETAs, customer/business comms.
  • Lead and facilitate Post Incident Reviews (PIR/RCA); ensure actionable corrective/preventive actions (CAPA).
  • Identify recurring issues and trigger Problem Management with measurable reduction plans.

B) NOC Leadership & Operations

  • Lead the NOC team responsible for 24x7 monitoring, alert triage, event correlation, escalation, and ticket quality.
  • Establish/maintain standard operating procedures (SOPs), runbooks, escalation matrices, and on-call models.
  • Ensure NOC meets SLAs/OLAs, improves alert fidelity, and reduces noise through tuning and automation.
  • Manage handover governance between shifts; maintain service continuity and operational hygiene.

C) Service Reliability & Continuous Improvement

  • Drive operational improvements: monitoring coverage, SLO/SLA alignment, incident prevention, and resiliency initiatives.
  • Partner with Engineering/Platform teams on observability strategy, proactive detection, and reliability patterns.
  • Track and report operational metrics: MTTD, MTTR, incident volume, re-open rate, SLA compliance, and trends.
  • Support readiness for audits and compliance: evidence collection, process adherence, and risk mitigation.

D) Stakeholder & Vendor Management

  • Interface with business stakeholders, service owners, and leadership to provide incident status, risk, and remediation plans.
  • Manage vendor escalations and ensure timely resolution aligned to contractual SLAs.

Managerial / Leadership Skills (Must Have):

  • Proven experience leading MIM & NOC Operations teams (shift-based or on-call models).
  • Strong Incident Commander capability: calm under pressure, structured decision-making, priority trade-offs.
  • Excellent stakeholder management across technical teams and business leadership.
  • Ability to build and enforce process discipline (ITIL-aligned), while improving speed and quality.
  • Strong coaching/mentoring: performance management, skill development, hiring support as needed.
  • Effective communication: concise executive updates, clear action plans, facilitation of PIR/RCA sessions.
  • Data-driven mindset: uses metrics and trend analysis to drive operational outcomes.

Technical Skills (Must Have):

A) Monitoring / Observability

  • Hands-on experience with NOC tooling and observability platforms such as:
      • Splunk / ELK, Datadog, Dynatrace, New Relic, AppDynamics
      • Prometheus/Grafana, CloudWatch/Azure Monitor
  • Strong understanding of event correlation, alert tuning, noise reduction, and dashboarding.

B) Incident / ITSM Platforms

  • Strong working knowledge of ServiceNow (Incident, Problem, Change, Knowledge, CMDB) or equivalent ITSM tools.
  • Experience designing workflows, SLAs/OLAs, routing rules, and automation integrations.

C) Infrastructure & Platform Breadth

  • Solid understanding across:
      • Windows/Linux administration basics
      • Network fundamentals (DNS, DHCP, TCP/IP, routing, load balancers, firewalls)
      • Compute/virtualization (VMware/Hyper-V) and storage concepts
      • Database fundamentals (SQL/Oracle, replication, performance symptoms)
      • Cloud fundamentals and operational support for AWS/Azure/GCP: IAM basics, networking (VPC/VNet), scaling, logging/monitoring, and common failure patterns

D) Automation & Scripting (Good to Have / Preferred)

  • Scripting knowledge: PowerShell / Python / Bash
  • Familiarity with automation tools: Ansible, Terraform, CI/CD operational workflows.
  • Ability to create/maintain runbook automation and self-healing patterns.

E) Security & Resilience (Preferred)

  • Awareness of security operations touchpoints: DDoS symptoms, certificate expiries, IAM issues, endpoint/EDR alerts.
  • Familiarity with BCP/DR processes, failover testing, and resilience design collaboration.

F) ITIL / Process Expectations

  • Strong ITIL understanding across Incident, Problem, Change, Knowledge, and Service Level Management.
  • Ability to implement governance around:
      • Change risk assessment, change windows, incident-change correlation
      • RCA quality, action item tracking, and effectiveness validation

Qualifications:

  • Bachelor’s degree in Computer Science / IT / Engineering, or equivalent experience.
  • ITIL v4 Foundation (preferred).
  • Cloud certifications (preferred): AWS/Azure fundamentals or associate level.
  • Experience in enterprise production environments with stringent availability requirements.

Success Metrics / KPIs:

  • Reduced MTTD and MTTR for P1/P2 incidents.
  • Improved SLA compliance and reduction in escalation breaches.
  • Reduced repeat incidents via problem management and preventive actions.
  • Improved alert quality: lower false positives, better signal-to-noise ratio.
  • Strong PIR/RCA compliance: on-time RCAs with measurable preventive outcomes.
  • Improved NOC operational maturity: SOP adherence, shift handover quality, audit readiness.

Nice-to-Have Industry Contexts:

  • Transportation / financial services / healthcare / e-commerce / SaaS environments with high availability targets.
  • Experience supporting microservices, Kubernetes, and distributed systems.
