Director, Site Reliability Engineering
Greater Boston Area
Engineering at Klaviyo
Klaviyo is a fast-growing and profitable startup located in the heart of downtown Boston. Our mission is to use data science to help ecommerce brands grow faster. We love taking on tough engineering problems such as building real-time analytics systems to process billions of events every day.
We believe in ownership and autonomy and look for engineers who are passionate about building, operating & scaling features end to end and breaking through technical challenges to have outsized impact on our customers and on Klaviyo. We pride ourselves on shipping code dozens of times daily to enhance the product that our 10,000+ paying customers rely on to meaningfully engage with more than 1 billion global consumers.
Klaviyo’s most important asset is our people and we are committed to always raising the bar for what it means to be a Klaviyo and to investing in and leveling up our people.
Read more about our tech and teams at https://klaviyo.tech
About the Role
Klaviyo is looking for a Director of Site Reliability Engineering to lead our SRE teams as they grow in size and scale over the coming years. In 2017 we tripled the size of our engineering teams, helped influence a billion dollars in ecommerce transactions, sent billions of personalized emails over the holiday season, and ingested 100 billion data points into petabyte-scale clusters. The Director of Site Reliability Engineering will help us break through the next scalability challenges and will be particularly instrumental in leading initiatives such as:
Scaling the SRE organization, processes and teams into industry-leading Site Reliability, Infrastructure, Security and Velocity engineering teams—more than tripling the size of the existing team over the next year and streamlining how SREs interface with the rest of engineering
Handling Klaviyo systems’ growth and holiday traffic scaling--ensuring SRE-controlled internal services (monitoring, performance monitoring, logging, load balancing, etc.) operate reliably and SREs effectively pair with other engineering teams to scale the platform 10x year over year while continually improving SLAs
Launching internal platforms and infrastructure services to help our teams move faster and our workloads run more efficiently, leveraging technologies such as Kubernetes to improve developer velocity as Klaviyo evolves from a monolithic app into a microservices-driven architecture
Improving Klaviyo's resiliency, self-healing and operational rigor by leading post-mortems, conducting readiness assessments and leveraging chaos engineering techniques to continually harden our services
Evangelizing a DevOps culture by ensuring that all engineers own their own services’ infrastructure and uptime, that we prioritize automation over toil, and that we foster a collaborative environment where we strive to maximize developer productivity and job satisfaction
We have no pure people “managers” at Klaviyo and we expect everyone to be very technical -- this means everyone, from the CEO to the VP of Engineering and team leads, is hands-on with the code. Directors are expected to be subject matter experts on specific technologies and DevOps practices and spend the majority of their time coding while also working to level up the engineers on their team.
- Lead your team by example, working with them to level up their code quality through code reviews and 1:1 mentorship, creating & vetting technical architectures, and writing code yourself that raises the bar for Klaviyo engineering.
- Be responsible for optimizing cloud systems processing billions of events in real-time running across thousands of servers
- Grow the Site Reliability team by recruiting the best engineers and leveling up existing Klaviyo engineers in key SRE dimensions such as systems & performance engineering, defensive programming, high availability, secure coding practices as well as leadership of people and projects
- Leverage your operational experience in handling large-scale traffic events, scaling complex architectures and learning from distributed systems outages to minimize future service disruptions and continually improve SLAs
- Create and guide new mission-driven SRE teams to help all Klaviyo engineers ship an awesome amount of impactful software ever faster and more reliably and securely
- Contribute to the company as a subject matter expert in multiple areas, constantly pushing yourself to be a better engineer and to level up all of your peers within your team and within Klaviyo.
You:
- Are passionate about DevOps being a culture, not a role. You have experience leading engineering teams building and scaling products and understand that the best performing teams own their systems end to end
- Want to take what you’ve learned about high performance systems and teams and apply it in a hyper growth environment to unlock new learnings and breakthrough innovations
- Have proven that you can build & scale complex distributed systems, including solving performance bottlenecks, rolling out disaster recovery plans, eliminating single points of failure, achieving security and compliance objectives and keeping systems humming by leading teams through crises and working side by side with them in the trenches
- Act as a servant leader for your team, balancing multiple important priorities and projects and keeping the team organized and focused on achieving their goals
- Have 10+ years of experience building products that matter, growing teams, and pushing yourself and your people to be better engineers and build indispensable always-on products with passionate advocates
- Enjoy working with new technologies, and are particularly passionate about, and expert in, multiple areas of the stack. You show this by having strong and opinionated experience with various technologies and knowing how to pick the right tool for any job.
Technologies we use:
- Python, Django, Celery, Gunicorn, Nginx, Apache Kafka, Apache Flink
- MySQL, Cassandra, RabbitMQ, Redis, Elasticsearch
- Amazon Web Services (EC2, RDS, Aurora, etc.)
- Terraform, Packer, Jenkins, Nagios, StatsD/Graphite, Docker and other infrastructure tooling
Klaviyo is a team of people who are crazy motivated by growth.
It’s what we help our customers do: grow their businesses by making it possible and easy for them to use their data to power better marketing.
It’s how we behave as individuals: we’re all deeply passionate about learning.
It’s how we manage our business: we have thousands of paying customers, we’re profitable, and we’re growing insanely fast.
And it’s what our culture is all about. Working at Klaviyo means you’ll work on things you never imagined you would; you’ll grow in ways you didn’t consider possible; and you’ll do the best work of your career with people who are just as motivated and talented as you are.
Your curiosity has led you this far, so if this sounds like your ideal place to work, apply now!