Lead Site Reliability Engineer

EPAM Systems, New York, 4/25/26

Join our team as a Lead Site Reliability Engineer to drive system reliability, observability, and performance monitoring for mission-critical digital trading products. You will lead monitoring initiatives in a high-availability trading environment, ensuring stable connectivity to external partners while proactively identifying opportunities for continuous improvement.

At EPAM, you'll work on cutting-edge technologies, solve complex challenges, and shape the future of digital innovation. With access to continuous learning, mentorship, and global projects, your expertise will drive meaningful change.

Req# 968473077

Responsibilities

• Define and implement a strategic reliability vision for the trading portfolio, covering infrastructure, network connectivity, application performance, and throughput
• Lead and oversee a team of SRE engineers, providing technical direction, mentorship, and performance guidance
• Own and evolve the SLA/SLO/SLI framework, including error budgets and service health reporting
• Configure and optimize comprehensive monitoring and alerting systems across infrastructure and applications
• Drive observability best practices using APM and monitoring platforms (e.g., Dynatrace)
• Analyze application and infrastructure performance to isolate fault domains and determine root causes of critical incidents
• Lead major incident management, coordinate resolution efforts, and conduct blameless postmortems
• Participate in 24x7x365 support rotation and ensure operational excellence across the team
• Identify automation opportunities to improve reliability, scalability, and operational efficiency

Requirements

• 8+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
• Proven leadership experience (technical lead or team lead), with ability to oversee and mentor engineers
• Strong hands-on experience with SLA/SLO/SLI definition, governance, and reporting
• Solid experience working in Microsoft Azure environments (IaaS, PaaS, networking, monitoring)
• Hands-on experience with Dynatrace (configuration, alerting, dashboards, performance analysis)
• Experience with observability, monitoring, and APM tools in production environments
• Ability to operate effectively under pressure in time-sensitive, high-impact environments

Urgent

Site Reliability Engineer (SRE) - Remote

Kforce Technology Staffing, New York

RESPONSIBILITIES: Kforce has a client seeking a remote Site Reliability Engineer (SRE) to join their team. Summary: The main function of a Site Reliability Engineer is to streamline IT infrastructure management through the use of code, software...

Immediate start

Site Reliability Engineer

Mercor, New York

Have • 3+ years of full-time experience at a major hyperscaler (AWS, Azure, GCP, Oracle Cloud), a cloud-data platform (Snowflake, Databricks), or a Fortune 500 platform/infrastructure team. • Background in cloud architecture, site reliability...

High salary

Maintenance & Reliability Engineer

Rahway, 16 mi from New York

and Reliability Engineer, to serve as a key technical resource supporting clients in optimizing maintenance and reliability programs, driving performance improvements, and sustaining asset health. You will balance hands-on maintenance leadership with analytical...
