Job Title: Major Incident Management (MIM) & NOC Lead (10+ Years)
Location: Onsite – Wilmington, DE (onsite from Day 1)
Employment Type: Full-time
Experience: 10+ years in IT Operations / NOC / Major Incident Management, including team leadership and ownership of the function.
Role Summary:
The Major Incident Management & NOC Lead is responsible for end-to-end command and control of the enterprise’s 24x7 operational monitoring and incident response. This role leads the MIM and NOC functions, drives Major Incident (P1/P2) execution, ensures rapid service restoration, and continuously improves operational maturity through problem management, automation, observability enhancements, and SLA governance.
This role requires a mix of strong incident leadership, technical depth across infrastructure and applications, and people/process management to ensure stability, availability, and performance across critical services.
Key Responsibilities:
A) Major Incident Management (Command & Control)
- Own the Major Incident (P1/P2) process from detection to resolution, including war-room leadership, stakeholder updates, and closure.
- Act as the Incident Commander and ensure structured triage, containment, workaround, and restoration.
- Drive cross-functional coordination (App, Infra, Network, Security, DB, Cloud, Vendor teams) to reduce MTTR.
- Ensure high-quality incident communications: executive summaries, impact analysis, ETAs, and customer/business communications.
- Lead and facilitate Post Incident Reviews (PIR/RCA); ensure actionable corrective/preventive actions (CAPA).
- Identify recurring issues and trigger Problem Management with measurable reduction plans.
B) NOC Leadership & Operations
- Lead the NOC team responsible for 24x7 monitoring, alert triage, event correlation, escalation, and ticket quality.
- Establish/maintain standard operating procedures (SOPs), runbooks, escalation matrices, and on-call models.
- Ensure NOC meets SLAs/OLAs, improves alert fidelity, and reduces noise through tuning and automation.
- Manage handover governance between shifts; maintain service continuity and operational hygiene.
C) Service Reliability & Continuous Improvement
- Drive operational improvements: monitoring coverage, SLO/SLA alignment, incident prevention, and resiliency initiatives.
- Partner with Engineering/Platform teams on observability strategy, proactive detection, and reliability patterns.
- Track and report operational metrics: MTTD, MTTR, incident volume, re-open rate, SLA compliance, and trends.
- Support readiness for audits and compliance: evidence collection, process adherence, and risk mitigation.
D) Stakeholder & Vendor Management
- Interface with business stakeholders, service owners, and leadership to provide incident status, risk, and remediation plans.
- Manage vendor escalations and ensure timely resolution aligned to contractual SLAs.
Managerial / Leadership Skills (Must Have):
- Proven experience leading MIM & NOC Operations teams (shift-based or on-call models).
- Strong Incident Commander capability: calm under pressure, structured decision-making, priority trade-offs.
- Excellent stakeholder management across technical teams and business leadership.
- Ability to build and enforce process discipline (ITIL-aligned), while improving speed and quality.
- Strong coaching/mentoring: performance management, skill development, hiring support as needed.
- Effective communication: concise executive updates, clear action plans, facilitation of PIR/RCA sessions.
- Data-driven mindset: uses metrics and trend analysis to drive operational outcomes.
Technical Skills (Must Have):
A) Monitoring / Observability
- Hands-on experience with NOC tooling and observability platforms such as:
  - Splunk / ELK, Datadog, Dynatrace, New Relic, AppDynamics
  - Prometheus/Grafana, CloudWatch/Azure Monitor
- Strong understanding of event correlation, alert tuning, noise reduction, and dashboarding.
B) Incident / ITSM Platforms
- Strong working knowledge of ServiceNow (Incident, Problem, Change, Knowledge, CMDB) or equivalent ITSM tools.
- Experience designing workflows, SLAs/OLAs, routing rules, and automation integrations.
C) Infrastructure & Platform Breadth
- Solid understanding across:
  - Windows/Linux administration basics
  - Network fundamentals (DNS, DHCP, TCP/IP, routing, load balancers, firewalls)
  - Compute/virtualization (VMware/Hyper-V) and storage concepts
  - Database fundamentals (SQL/Oracle, replication, performance symptoms)
- Cloud fundamentals and operational support for AWS/Azure/GCP:
  - IAM basics, networking (VPC/VNet), scaling, logging/monitoring, common failure patterns.
D) Automation & Scripting (Good to Have / Preferred)
- Scripting knowledge: PowerShell / Python / Bash.
- Familiarity with automation tools: Ansible, Terraform, CI/CD operational workflows.
- Ability to create/maintain runbook automation and self-healing patterns.
E) Security & Resilience (Preferred)
- Awareness of security operations touchpoints: DDoS symptoms, certificate expiries, IAM issues, endpoint/EDR alerts.
- Familiarity with BCP/DR processes, failover testing, and resilience design collaboration.
F) ITIL / Process Expectations
- Strong ITIL understanding across Incident, Problem, Change, Knowledge, and Service Level Management.
- Ability to implement governance around:
  - Change risk assessment, change windows, incident-change correlation
  - RCA quality, action item tracking, and effectiveness validation
Qualifications:
- Bachelor’s degree in Computer Science / IT / Engineering, or equivalent experience.
- ITIL v4 Foundation (preferred).
- Cloud certifications (preferred): AWS/Azure fundamentals or associate level.
- Experience in enterprise production environments with stringent availability requirements.
Success Metrics / KPIs:
- Reduced MTTD and MTTR for P1/P2 incidents.
- Improved SLA compliance and fewer escalation breaches.
- Reduced repeat incidents via problem management and preventive actions.
- Improved alert quality: lower false positives, better signal-to-noise ratio.
- Strong PIR/RCA compliance: on-time RCAs with measurable preventive outcomes.
- Improved NOC operational maturity: SOP adherence, shift handover quality, audit readiness.
Nice-to-Have Industry Contexts:
- Transportation / financial services / healthcare / e-commerce / SaaS environments with high availability targets.
- Experience supporting microservices, Kubernetes, and distributed systems.