Site Reliability Engineer (SRE)
Technologies we use:
- AWS, Terraform (IaC)
- MySQL, DynamoDB, Redis
- GitHub Actions for CI pipelines
- Kubernetes (specifically EKS)
- Ambassador, Helm, Argo CD, Linkerd
- REST, gRPC, GraphQL
- React, Redux, Swift, Node.js, Kotlin, Java, Go, Python
- Datadog, Prometheus
What you will be doing:
- Acting as a conduit between product development and platform engineering teams to ensure services meet their defined SLAs.
- Helping identify Service Level Indicators (SLIs) and define Service Level Objectives (SLOs) to assess the stability and reliability of all our applications.
- Continuously influencing our engineering practices to consistently improve our MTTR for production incidents.
- Working to bring efficiency and standardization to our incident command practices and norms.
- Collaborating cross-functionally to ensure our CI/CD pipeline is efficient and automated.
- Being accountable for metrics and monitoring of team services in production, providing the data and transparency needed to track a product's SLOs.
- Enhancing existing services and applications to increase availability, reliability, and scalability in a microservices environment.
- Building and improving engineering tooling, processes, and standards to enable faster, more consistent, more reliable, and highly repeatable application delivery.
What you should bring:
- At least 3 years of previous experience working as an SRE.
- A history of working on a large-scale product in either Java or Node.js.
- Experience implementing monitoring and alerting for services and establishing SLOs for them. (We use Datadog and Prometheus, but experience with other tools is fine.)
- Strong understanding of working in a cloud-native ecosystem. (We use AWS, but we’ll consider other cloud experience.)
- Knowledge of building, troubleshooting, and maintaining CI/CD pipelines. (We use GitHub Actions and runners, along with Argo CD, but will consider experience with other CI/CD tooling.)
- Previous experience working with Terraform is a plus!