What you will be doing:
- Acting as a conduit between product development and platform engineering teams to ensure services meet defined SLAs.
- Facilitating the identification of Service Level Indicators (SLIs) and the definition of Service Level Objectives (SLOs) to assess the stability and reliability of all our applications.
- Bringing efficiency and standardization to an Incident Management practice that thrives on continuous improvement and a blameless culture, and owning the tooling used throughout the incident response process.
- Providing CI/CD stability and automation so the organization can build and deploy safely and reliably.
- Empowering teams across the organization to monitor and observe their services, ensuring reporting, transparency, and SLO tracking.
- Enhancing existing services and applications to increase availability, reliability, and scalability in a microservices environment.
- Building and improving engineering tooling, processes, and standards to enable faster, more consistent, more reliable, and highly repeatable application delivery.
What you should bring:
- 7+ years of experience working as an SRE.
- History of working on large-scale products in either Java or Node.
- Experience with container technologies such as Kubernetes or Docker.
- Experience implementing monitoring and alerting and establishing SLOs for services.
- Established experience working in a cloud-native ecosystem (AWS experience is a plus).
- Strong understanding of software engineering principles and experience with software development best practices, including version control, automation, and building/troubleshooting/maintaining continuous integration and delivery.
- Excellent analytical and problem-solving skills, with the ability to work collaboratively within SRE and across Engineering.
- Strong communication and interpersonal skills.
- Experience working with Terraform is a plus.