Lead Site Reliability Engineer
Salary: $150,000 - $180,000 USD annually
Description:
We are seeking a Lead Site Reliability Engineer (SRE) who combines deep technical expertise with strong leadership and client-facing capabilities. This is a high-impact role responsible for ensuring the reliability, scalability, and performance of our cloud infrastructure and kiosk platform.
You will lead a team of engineers while remaining hands-on, owning uptime, SLAs, and incident management, and driving long-term improvements in system resilience and operational maturity. This role also requires working closely with Fortune 500 clients, translating complex technical concepts into clear, business-friendly insights.
What Makes This Role Unique
This is a rare opportunity for a hybrid leader who can:
- Operate as a hands-on SRE expert
- Lead and mentor a team of engineers
- Act as a client-facing technical advisor
- Drive both real-time operations and long-term reliability strategy
Key Responsibilities:
Reliability & Operations
- Own platform uptime, SLAs, and overall system reliability
- Lead incident response, root cause analysis, and postmortems
- Develop and maintain disaster recovery and business continuity plans
- Design, build, and optimize cloud infrastructure and Kubernetes environments
- Automate deployments and operational tasks using CI/CD and Infrastructure-as-Code (Terraform preferred)
- Improve system scalability, performance, and resilience
- Implement and enhance monitoring, alerting, and observability tools (e.g., Prometheus, Grafana, New Relic)
- Establish operational standards, runbooks, and best practices
- Lead, mentor, and develop a team of ~6 engineers
- Partner with platform engineering, QA, and development teams to ensure operational readiness
- Serve as a technical point of contact for clients, clearly communicating system health, risks, and solutions
Required Qualifications:
- 8+ years of experience in SRE, DevOps, or Platform Engineering
- 2+ years in a lead or managerial role
- Strong expertise in:
  - Cloud infrastructure (AWS, Azure, or Google Cloud Platform)
  - Kubernetes and containerized environments
  - CI/CD pipelines and release engineering
  - Infrastructure-as-Code (Terraform preferred)
- Proficiency in scripting/automation (Python, Bash, or Go)
- Deep understanding of observability, monitoring, and logging systems
- Experience with GitOps workflows (e.g., ArgoCD)
- Proven experience managing production systems with strict uptime requirements
Preferred Experience:
- Client-facing experience in enterprise or SaaS environments (required)
- Experience communicating with non-technical stakeholders and Fortune 500 clients
- Background in high-availability systems and large-scale distributed environments
What We're Looking For:
- A hands-on technical leader who can balance execution and strategy
- Strong communicator with executive presence
- Someone who thrives in high-ownership, fast-paced environments
- A mentor who can elevate team performance and operational excellence
Contact:
This job and many more are available through The Judge Group. Please apply with us today!