Technical Program Manager - Manager of Incident Response and Remediation

New York, Boston, Remote-US

Posted 1 month ago

About Datadog:

We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale (trillions of data points per day), providing always-on alerting, metrics visualization, logs, and application tracing for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.


The opportunity:

Reporting to the VP of Infrastructure, as Manager of Incident Response and Remediation Programs you will own the team that creates, maintains, and optimizes Datadog’s global, company-wide incident response and remediation programs and processes. You will build your team from scratch: Datadog has long handled incident response and remediation as a side function of Reliability Engineering management. However, with our ongoing rapid growth in both new customers and new products, we are now seeking to build a specialized team, starting with you as its leader. Your team will own the following aspects of incident response and remediation:

  • Our 24/7 Incident Commander program, especially training our company’s engineers and management to be effective in the program, and working out an effective mechanism for communicating with our customers, both big and small
  • Our Incident Post-Mortem review and dissemination programs
  • Our Reliability Risk Review and Prioritization programs


You will:

  • Work with leadership and stakeholders to determine how best to staff your team to deliver its mission, then work with recruiting to hire your initial team of 3-5
  • Identify and remove obstacles that slow programs down or prevent them from delivering
  • Establish credibility and rapport with senior technical and non-technical team members alike
  • Take a hands-on approach
  • Communicate confidently and comfortably with senior-level executives
  • Be comfortable in an ambiguous environment and respond well to frequent change


Requirements:

  • 3-5 years of experience at a SaaS company leading teams in technical/network/site operations, technical program management, or equivalent roles
  • 7-10 years of experience in technical/network/site operations, technical program management, or equivalent roles
  • You can navigate technical conversations regarding cloud-based infrastructure
  • Thorough understanding of the software development lifecycle, and the ability to adapt and apply this knowledge in a dynamic environment using agile methodologies
  • Strong quantitative and analytical skills, and a proven ability to track and successfully complete complex programs
  • Degree in Computer Science, another engineering discipline, or Information Systems


Bonus points:

  • You’re familiar with problems that are solved by monitoring tools
  • Experience working in a company in a hyper-growth stage
  • Experience growing a team (identifying roles, partnering with recruiting to hire)


Equal Opportunity at Datadog:

Datadog is an Affirmative Action and Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.


Your Privacy:

For more information on how we maintain the privacy of the information you submit as part of your application, please refer to our Applicant and Candidate Privacy Notice.
