Senior Site Reliability Engineer

Ellucian
🇺🇸 In-Person - VA | $155K–$215K/yr | 2h ago

Role Snapshot

Senior Site Reliability Engineer responsible for ensuring reliability, performance, and cost-efficiency of Ellucian's production systems serving 3,000+ higher education customers. This role focuses on observability with DataDog, incident management, and infrastructure optimization across cloud environments.

Key Responsibilities: Own system reliability and performance for production environments, design and manage DataDog-based monitoring and alerting, lead incident response and root cause analysis efforts, and automate operational processes. Partner with engineering teams to build scalable infrastructure and continuously optimize deployment practices and cloud costs.
Skills & Tools: Strong hands-on expertise with DataDog (APM, logs, metrics, dashboards, alerting), experience with a cloud platform (AWS, Azure, or GCP), DevOps practices (CI/CD, Infrastructure as Code such as Terraform), Docker/Kubernetes, and scripting languages (Python, Bash, or similar). Excellent troubleshooting and root cause analysis skills, with a proven ability to optimize cloud costs and collaborate across teams.
Qualifications: 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles with demonstrated expertise in distributed systems troubleshooting and cloud infrastructure management. Strong understanding of high-availability systems, SLI/SLO definition, and cloud cost optimization preferred.
Location: In-Person - VA
Compensation: $155K–$215K/yr (estimated)

Job Description

About Ellucian

Ellucian powers innovation for higher education, partnering with approximately 3,000 customers across 50 countries and serving more than 21 million students. Ellucian's AI-powered platform, trained on the richest dataset available in higher education, drives efficiency, personalized experiences, and strengthened engagement for all students, faculty, and staff. Fueled by decades of experience with a singular focus on the unique needs of learning institutions, the Ellucian platform features best-in-class SaaS capabilities and delivers insights needed now and into the future. These solutions and services span the entire student lifecycle, from data-rich tools for student recruitment, enrollment, and retention to workforce analytics, fundraising, and alumni engagement. Ellucian's innovative solutions, vast ecosystem of partners, and user community of more than 45,000 provide best practices that lead to greater institutional success and better student outcomes.


About the Opportunity

We are seeking a Senior Site Reliability Engineer (SRE) to ensure the reliability, performance, and cost-efficiency of our production systems. This role requires deep expertise in DataDog for observability and will focus on DevOps practices, incident management, root cause analysis, and cost optimization across cloud infrastructure and services.

Where You Will Make an Impact

  • Own and improve system reliability, availability, and performance for production environments
  • Design, implement, and manage monitoring, alerting, and observability using DataDog (required)
  • Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
  • Perform detailed root cause analysis (RCA) and drive permanent resolutions
  • Partner with engineering and DevOps teams to build scalable, resilient infrastructure
  • Automate operational processes to improve efficiency and reduce risk
  • Analyze and optimize infrastructure and application costs
  • Define and manage SLIs/SLOs to meet reliability targets
  • Continuously improve deployment, monitoring, and operational practices

What You Will Bring

  • 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
  • Mandatory: Strong, hands-on expertise with DataDog (APM, logs, metrics, dashboards, alerting)
  • Experience with cloud platforms (AWS, Azure, or GCP)
  • Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
  • Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
  • Experience with containers and orchestration (Docker, Kubernetes)
  • Scripting or programming experience (Python, Bash, or similar)
  • Proven ability to analyze and optimize cloud costs

Preferred Qualifications

  • Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
  • Familiarity with cloud security and compliance best practices
  • Experience supporting high-availability, customer-facing systems
  • Strong collaboration and communication skills

What Success Looks Like

  • Improved system reliability and reduced incident frequency
  • Faster incident detection and resolution (lower MTTD and MTTR)
  • Effective, actionable observability driven by DataDog
  • Measurable cost savings and optimized infrastructure usage

What makes #Ellucianlife

  • Comprehensive health coverage: medical, dental, and vision
  • Flexible time off
  • Thrive Flex Lifestyle Account (LSA) that allows you to contribute towards your health, financial or learning interests
  • 401(k) with match and BrightPlan to help you save for the future
  • Parental Leave
  • 5 charitable days to support the community that supports us
  • Telemedicine
  • Wellness
    • Headspace Care (mental health)
    • Wellbeats (virtual fitness classes)
  • RethinkCare & Wellthy - caregiver support
  • Diversity and inclusion programs which provide access to internal employee resource groups
  • Employee referral bonuses to encourage the addition of great new people to the team
  • We foster a learning culture with:
    • Education Assistance Program
    • Professional development opportunities
