Site Reliability Engineer/DevOps
Greater Boston Area
6 days ago
Lead development of the processes and software needed to maintain services after deployment, using data collection and monitoring to ensure overall service health.
Address service and infrastructure monitoring alerts.
Develop new metrics and monitoring dashboards as additional coverage becomes necessary.
Monitor and continuously improve the availability and performance of infrastructure, systems and applications.
Create and maintain documentation for processes, supported infrastructure resources and services.
Drive supportability improvements through better automation, automated alerting, and self-healing architectures.
Create new alerts, find anomalies, fix things, and ask why something broke.
Manage, monitor, and troubleshoot daily production operations processes, and improve them where needed.
Capture and analyze data on system availability, MTBF, and MTTR across all digital channels; identify patterns and drive changes to both systems and processes that deliver sustained improvements.
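As a rough illustration of the availability, MTBF, and MTTR analysis mentioned in the last item, here is a minimal Python sketch. It is not taken from the posting: the `Outage` record and the `reliability_summary` function are hypothetical names, MTTR is read as "mean time to repair", MTBF is taken as total uptime divided by the number of failures, and outages are assumed not to overlap and to fall entirely inside the observation window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Outage:
    """Hypothetical record of one service outage (not from the posting)."""
    start: datetime
    end: datetime


def reliability_summary(window_start, window_end, outages):
    """Compute availability, MTBF, and MTTR for one observation window.

    Assumes outages do not overlap and lie fully inside the window.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")

    downtime = sum((o.end - o.start).total_seconds() for o in outages)
    uptime = total - downtime
    failures = len(outages)

    return {
        # Fraction of the window during which the service was up.
        "availability": uptime / total,
        # Mean time between failures: uptime per failure, in hours.
        "mtbf_hours": uptime / failures / 3600 if failures else None,
        # Mean time to repair: downtime per failure, in hours.
        "mttr_hours": downtime / failures / 3600 if failures else None,
    }


# Illustrative data only: a 30-day window with three outages (2 h, 1 h, 3 h).
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(days=30)
outages = [
    Outage(datetime(2024, 1, 3, 10), datetime(2024, 1, 3, 12)),
    Outage(datetime(2024, 1, 12, 8), datetime(2024, 1, 12, 9)),
    Outage(datetime(2024, 1, 25, 20), datetime(2024, 1, 25, 23)),
]

summary = reliability_summary(window_start, window_end, outages)
print(summary)
```

With this made-up data the window is 720 hours with 6 hours of downtime, so the sketch reports an MTTR of 2 hours, an MTBF of 238 hours, and availability of roughly 99.17%. In practice the outage records would come from an incident tracker or monitoring system, and teams differ on exactly how they define these metrics.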