Director, Site Reliability Engineering
Location: Remote - USA
Time type: Full time
Posted: 9 days ago
Job requisition ID: JR100098
Position Purpose:
We are seeking a Director of Site Reliability Engineering (SRE) to lead our SRE team in ensuring the availability, performance, and scalability of our critical systems. This role is responsible for defining and driving reliability strategies, operational excellence, and incident response processes at scale. You will collaborate closely with engineering, DevOps, and product teams to establish best practices and implement processes that enhance system resilience and service performance.
Responsibilities:
Leadership & Strategy:
- Define and execute the vision for site reliability, balancing innovation with operational stability.
- Lead, mentor, and grow a high-performing SRE team, fostering a culture of ownership and continuous improvement.
- Partner with Engineering, DevOps, and Product teams to embed reliability best practices into the development lifecycle.
Operational Excellence:
- Establish and refine SLIs, SLOs, and error budgets to measure and improve service reliability.
- Develop and drive incident management processes, including real-time incident response, on-call coordination, and postmortem analysis to prevent recurring issues.
- Implement and standardize operational readiness reviews and escalation procedures to ensure teams are equipped to handle incidents effectively.
- Drive initiatives to reduce operational toil, leveraging automation where applicable to enhance team efficiency.
- Collaborate with engineering teams to define performance testing and capacity planning strategies to proactively mitigate reliability risks.
- Champion the adoption of observability, logging, and monitoring best practices, ensuring visibility into system health and performance.
Qualifications:
- 8+ years of experience in Site Reliability Engineering, DevOps, or related fields, including at least 3 years in a leadership role.
- Proven track record of driving operational excellence in large-scale, distributed systems.
- Expertise in defining and implementing SLIs, SLOs, error budgets, and incident management processes.
- Strong knowledge of observability tools such as Prometheus, Grafana, Datadog, New Relic, or similar.
- Experience leading on-call rotations, postmortems, and operational readiness programs.
- Excellent leadership, communication, and stakeholder management skills.
Preferred Qualifications:
- Deep experience with AWS cloud environments, including operational best practices for high availability and reliability.
- AWS certifications such as AWS Certified DevOps Engineer – Professional, AWS Certified Solutions Architect – Professional, or AWS Certified Advanced Networking – Specialty.
- Experience with AWS monitoring and logging tools (CloudWatch, X-Ray, AWS Config, GuardDuty).
- Experience scaling SRE practices in high-growth or regulated environments.
- Hands-on background in software engineering with Python, Bash, or similar languages.
About Us
Benchmark Education Company is a leading publisher of core, supplemental, and intervention literacy and language resources in English and Spanish, both print and digital, as well as world-class professional development. Since its founding in 1998, our company has proven to be one of the most nimble and innovative content creators on the cutting edge of pedagogy and technology. The digital content in our many learning programs delivers all the rigor of its print counterpart and is designed for virtual and blended learning contexts.
Benchmark Education Publishing (BEC) and its affiliates are proud to be an Equal Opportunity Employer.
For further information, visit us at: https://www.benchmarkeducation.com