Site Reliability Engineering
Greater Boston Area
Collaborating with Engineering and Product Managers to define SLOs and monitor well-designed SLIs.
Embedding with Engineering teams and completing architectural improvements, either independently or in collaboration with them.
Being the primary escalation point for major incidents involving assigned services.
Participating in an on-call rotation.
Owning our Incident Response Process, including conducting blameless postmortems.
Increasing robustness by automating workflows, improving processes and CI/CD pipelines, and integrating modern toolsets.
Refusing to accept manual work as a solution to areas of weakness.
Partnering with Engineering teams to ensure new services are production-ready.
Championing our organizational standards for designing, deploying, and scaling our products.
Making data-driven decisions to drive continuous improvement.
Evolving our tooling, logging, monitoring, and alerting systems to increase observability and transparency.