Senior Engineer, Production Engineering & Incident Response

Plantation, FL

Full Time Senior-level / Expert
Magic Leap

Posted 2 weeks ago

Magic Leap is looking for a senior engineer to focus on live site operations and incident response management.

Job Description

In this role, you will be responsible for the day-to-day operations of our production live site systems, coordinate the response to outages, and build incident management engineering systems based on industry standards and ITSM principles.

The ideal candidate is highly knowledgeable about ITSM and experienced in IT incident management engineering and process improvement, with a proven track record of resolving critical incidents affecting engineering services built on a microservice architecture.

Responsibilities

  • Oversight of 24x7 Major Incident Response
  • Continually improve the engineering, efficiency and effectiveness of the Incident Response program
  • Develop, measure, and report process performance and functional metrics in order to identify opportunities, measure success, or validate expected outcomes
  • Tightly integrate incident management tools & processes with monitoring & observability platforms, production engineering dashboards, and other ITSM tools
  • Define SLO & SLA metrics with engineering service owners & work with monitoring team to
  • Bring continuous improvement to support and operational practices.
  • Handle escalations and communicate clearly and effectively with all stakeholders, including senior company leaders

Qualifications

  • 10+ years of incident management in a fast-paced technology company
  • Track record of managing complex incidents
  • 8+ years of experience managing production systems for build & release tools and large-scale, public-cloud-based microservices with 100K+ concurrent users
  • Prior experience working in production engineering with regional NOC & SOC
  • Prior experience instrumenting mission-critical services at a globally distributed level, using cloud hosting providers such as AWS, GCP, and others
  • Prior experience integrating event management systems such as PagerDuty and other production engineering systems
  • Prior experience with CloudWatch, Stackdriver, Prometheus, Datadog, and Sumo Logic

Education

  • BA/BS in Computer Science or related field and equivalent experience

Additional Information

All your information will be kept confidential according to Equal Employment Opportunities guidelines.

Job tags: AWS Incident response
Job region(s): North America