|
The Principal Engineer, Intelligent Operations and Observability serves in a principal-level technical role within the McCormick Technology Operations Center, providing active leadership during incidents, service degradations, and other operational events affecting the company’s technology environment. This role maintains real-time awareness of the operational state across infrastructure, cloud platforms, applications, networks, and end-user services, and leads coordinated response efforts to ensure the appropriate teams and service providers are engaged quickly and working with urgency toward restoration.
In parallel, the role serves as the principal technical expert for the company’s observability and operational tooling environment, driving improvements in monitoring, event management, service visibility, and reporting.
This position also acts as a key stakeholder in shaping incident response practices and strengthening operational maturity across the organization through insight, influence, and continuous improvement.
Key Responsibilities
Operational Oversight and Service Restoration
Provides active operational leadership within the McCormick Technology Operations Center during incidents, service degradations, and other events affecting the company’s technology environment. Maintains real-time awareness of the operational state across infrastructure, cloud platforms, applications, networks, and end-user services, and leads coordinated response efforts to ensure the appropriate teams and service providers are engaged quickly and working with urgency toward restoration. Drives incident response by reinforcing priorities, tracking progress, challenging assumptions, escalating when needed, and helping ensure the right level of visibility, accountability, and stakeholder attention is maintained through recovery and stabilization.
Observability and Operational Tooling
Implements, administers, and continuously improves the operational tooling environment that supports enterprise monitoring, event management, and service visibility. Serves as a principal technical expert across application performance monitoring, infrastructure and systems monitoring, event aggregation, alerting, dashboards, and reporting, ensuring the tool suite delivers meaningful and actionable insight. Partners across infrastructure, cloud, application, and service teams to expand monitoring coverage, improve signal quality, reduce alert noise, and strengthen overall observability and operational decision-making.
Continuous Improvement and Operational Maturity
Contributes to the ongoing maturity of the Technology Operations Center by identifying and implementing improvements in operational processes, tooling, visibility, and service assurance practices. Supports the development of runbooks, standards, and response procedures, and uses operational data and trends to recommend opportunities for automation, standardization, and stronger preventive controls. Works within established service management and SIAM-aligned practices to improve coordination across internal teams and service providers, while helping advance a more proactive, insight-driven, and resilient operations model.
Incident Response Process Contribution
Serves as a senior stakeholder in the ongoing development and improvement of incident response practices, standards, and supporting procedures by providing operational insight and practical recommendations based on real-world experience. Helps shape how incident prioritization, escalation, communications, and restoration processes are refined over time, while reinforcing adherence to established practices across service providers and internal teams. Supports governance and operational reviews by identifying execution gaps, highlighting response trends, and recommending improvements that strengthen consistency, discipline, and overall response effectiveness.
Required Qualifications
- Bachelor’s degree in Information Technology, Computer Science, Engineering, Information Systems, or a related technical discipline required.
- Advanced degree in a related field preferred.
- Minimum of 10-12 years of relevant experience
- Relevant certifications in IT service management, cloud, reliability engineering, or operational disciplines are preferred, such as ITIL, SRE, or major platform certifications.
- Deep experience administering and optimizing tools related to application performance monitoring, systems monitoring, event aggregation, alerting, dashboards, and reporting
- Significant experience leading coordinated response during incidents, service degradations, and major operational events
- Experience working across internal support teams and third-party service providers to drive accountability, process adherence, and timely service restoration
- Strong working knowledge of incident management, event management, service assurance, and service management practices
- Experience identifying operational trends and using data to drive improvements in resilience, service quality, and response effectiveness
- Experience operating within ITIL, SIAM, or similarly structured service management environments preferred
- Strong operational judgment and ability to remain effective and decisive in high-pressure situations
- Strong analytical and problem-solving skills, with the ability to interpret events, assess business impact, identify patterns, and recommend practical improvements
- Demonstrated ability to lead through influence and drive action across technical teams and service providers without direct authority
- Strong verbal and written communication skills, with the ability to provide clear direction during incidents and translate technical situations into business-relevant terms
- Strong technical aptitude across observability, monitoring, event correlation, dashboarding, and operational reporting
- Ability to improve signal quality by reducing noise, refining thresholds, and increasing the value of alerts and event data
- Strong organizational skills, attention to detail, and ability to manage multiple priorities in a fast-moving operational environment
- Ability to work collaboratively across infrastructure, cloud, application, service management, and supplier teams
- Ability to identify broader operational risks and recommend improvements that strengthen enterprise-wide service resilience
|