Site Reliability Engineer
Candescent
Candescent is the leading cloud-based digital banking solutions provider for financial institutions. We are transforming digital banking with intelligent, cloud-powered solutions that connect account opening, digital banking, and branch experiences. Our advanced technology and developer tools enable seamless, differentiated customer journeys that elevate trust, service, and innovation. Success here requires flexibility in a fast-paced environment, a client-first mindset, and a commitment to delivering consistent, reliable results as part of a performance-driven, values-led team. Candescent has team members around the world and is an equal opportunity employer.
- You will be responsible for maintaining and scaling production services and servers for complex, high-throughput cloud services.
- You will bridge and own the intersection of development, quality, security, and operations.
- You will improve scalability, service reliability, capacity, and performance.
- You will write automation code for provisioning and operating infrastructure at massive scale.
- You are not just an operator; you’re an experienced software engineer focused on operations.
- You will initiate and contribute to the continuous improvement of our software delivery processes and practices in a multi-location, multidisciplinary team to empower and accelerate product development.
- You will use automation extensively to design, configure, manage, and monitor systems in support of our product development teams.
- You will participate in disaster recovery planning and execution.
- You will be responsible for maintaining and patching servers supporting SaaS products. This includes Windows and Linux servers running in in-house data centers and/or on cloud PaaS providers (GCP and Azure).
- You’ll work hand-in-hand with all teams to ship our code to production using Continuous Integration / Continuous Deployment (CI/CD) and AppSec tooling.
- You will collaborate with development teams and use intuition, experience, and understanding to create SLIs, SLOs, and SLAs.
- You will provide timely assistance and remediation during critical situations and production incidents to help resolve service problems.
- You will develop monitoring architecture, implement monitoring agents, build dashboards, and manage escalations and alerts.
- You will participate in incident management and drive root cause analysis (RCA) and risk management processes.
- You will participate in a rotating on-call schedule during off-hours, during which you may periodically need to remote into systems if a production outage occurs.
Statement to Third-Party Agencies
To ALL recruitment agencies: Candescent only accepts resumes from agencies on the preferred supplier list. Please do not forward resumes to our applicant tracking system, Candescent employees, or any Candescent facility. Candescent is not responsible for any fees or charges associated with unsolicited resumes.