Job summary
Amazon strives to be the world’s most customer-centric company. To succeed, our products and services must be available at all times to our customers. The Consumer Incident Response (IR) team is responsible for improving the availability of our shopping experiences (website, mobile experience, in-store) worldwide (across 21 different marketplaces).

As a Software Development Manager on the IR team, your contributions will directly impact and reduce the total Mean-Time-to-Repair (MTTR) for retail customer-impacting outages. You will lead the development of new, greenfield systems which classify and pinpoint software outages in a service graph of tens of thousands of services and automatically facilitate engagement of precisely the right teams within seconds to triage and repair those outages. Machine learning will be used to identify patterns in similar types of problems and to recommend the next-best team to engage if we need to keep searching. You’ll determine which initiatives are the most important at any given time based on data and how the retail software ecosystem is currently evolving. Your contributions will have a force multiplier effect on the immediate team and larger organization by automating solutions, instituting operational excellence and leading through others.

This role is a perfect fit for an experienced engineer who is passionate about availability, alerting, metrics, machine learning, site reliability engineering (SRE) and automation. You thrive in a fast-paced, startup-like environment, build new systems end-to-end from the ground up, communicate effectively to all stakeholders (tech, non-tech) and ship complex software in quick iterations at scale. You quickly learn new technologies/concepts and are able to make the right technical decisions for the products and business you support to provide the best experience for Amazon customers. You will shape the future of how Amazon Retail (and beyond) detects and responds to issues proactively.

Why join our team? You will have the opportunity to:
• Deliver high-impact, high-visibility projects that are used by thousands of Amazon Retail services, driving career advancement.
• Invent processes, tools, and technology to force multiply the effect of your contributions across many organizations.
• Be responsible for owning, scoping, leading and delivering projects and experiments end-to-end, leveraging statistical evaluation, pattern recognition, and machine learning.
• Dive deep into a wide variety of problems and technologies to guide the right decisions for the products and customers you support.

As an ideal candidate, you will:
• Bring multiple years of engineering experience (DevOps or SRE specialization preferred) in both owning and operating solutions at massive scale, and leverage your unique experience and fresh perspective to drive innovation, simplification and new experiments.
• Be a strong communicator, with proven abilities in both architectural and software solutioning, and a proven track record of shipping complex software solutions through an agile methodology.
• Act as a vocal leader and mentor for other engineers. Raise the bar when it comes to best practices, processes, and technical quality.
• Interact with developers across the company to understand their challenges, and work with leaders on the team to develop a roadmap for our portfolio and platform.

Basic Qualifications

• 7+ years of experience working directly within engineering teams
• Experience partnering with product OR program management teams
• 3+ years of people management experience, managing engineers
• 3+ years of experience architecting and designing new and current systems (architecture, design patterns, reliability and scaling)
• Experience building and maintaining large-scale, high-availability distributed systems
• Excellent oral and written communication skills with both technical and non-technical stakeholders
• Experience identifying and prioritizing the most important software initiatives for the business/customer, taking them from scoping through launch and into daily operation
• Ability to identify and solve ambiguous problems in short timeframes with limited oversight/direction
• Experience influencing software engineers, infrastructure engineers and operators on best practices (full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations)
• Computer Science fundamentals in object-oriented design, data structures, algorithms, problem solving and complexity analysis
• Understanding of CI/CD, test automation and robust system health monitoring (metrics, monitors, alarms)
• Experience with Unix/Linux environments
• Understanding and ability to apply Agile (or similar) software development practices to improve software development reliability and velocity
• Bachelor's degree in Computer Science, a related technical field OR equivalent training and industry experience

Preferred Qualifications

• 8+ years of software development experience, preferably in building large-scale end-to-end distributed systems
• Specific production experience with AWS services and tools (IaaS, PaaS, APIs)
• Experience with Site Reliability Engineering (SRE) concepts and practices
• Experience with statistical analysis and machine learning
• Experience with full-stack development

