Site Reliability Engineer
Chewy is looking to hire Site Reliability Engineers at our Dania Beach, FL or Boston, MA locations. Site Reliability Engineers are a cross between system and software engineers who are responsible for all operational aspects of Chewy’s ecommerce platform. The team is responsible for designing, building, monitoring, and maintaining the infrastructure of our internet-facing and internal services. We're looking for engineers who want to be a part of developing infrastructure software, maintaining it, and scaling Chewy’s technology stack. Come help us build a bigger and better Chewy as a Site Reliability Engineer. You will be part of a small family within Chewy that has a huge impact on our incredible growth.
Ideal candidates will possess the ability to discuss complex technical concepts with a diverse audience across all areas of the organization. They will remain calm under pressure and always strive to add structure to high-pressure, fast-paced tasks and projects.
What you'll do:
- Focus on service stability and reliability by working with application owners to set SLOs, error budgets, and backup and disaster recovery (DR) strategies
- Define application monitoring and alerting strategy
- Perform capacity planning and production readiness assessment
- Embed with product teams during the design and requirements phase of new product development through to initial production launch
- Identify requirements for other operational teams (release engineering, automation, etc.) during the application development phase
- Be a technology and DevOps evangelist for the rest of the company
- Participate in the on-call rotation for level 3 support escalations
What you'll need:
- At least 5 years of experience working in an SRE role or similar.
- Hands-on experience with orchestration and system configuration tools such as Ansible, Puppet, Chef, or Terraform.
- Expertise in building and maintaining highly available applications, including redundancy, failover, scalability, monitoring, and performance.
- Strong experience with virtualization, monitoring and automation.
- Software development experience (both scripting and “programming” languages).
- Experience working with the open source community (troubleshooting, patch submission, etc.).
- At least 5 years of demonstrated Linux system administration experience.
- Experience with CI tools such as Bamboo, Jenkins, or Hudson.
- Ability to organize, troubleshoot and continuously learn.
- Previous experience working within compliance controls such as SOX and PCI.
- This position requires travel.