Senior Software Engineer - Model Training & AI Evals

🇮🇳 Remote (India) · ₹2M–₹4.5M/yr · 3h ago

Corporate Engineering Machine Learning AI Development Model Training Data Engineering Software Engineering

Role Snapshot

Senior Engineer to own end-to-end evaluation and benchmarking infrastructure for LLMs while contributing to post-training pipelines for vertical foundation models. This role is critical for driving model improvements through rigorous evaluation science and synthetic data generation.

Key Responsibilities: Design and own task-level evaluation frameworks for LLM agents and base models, build comparative benchmarking pipelines against frontier models, produce capability gap reports, and develop domain-specific benchmarks tailored to vertical use-cases. Lead or contribute directly to SFT, RLHF, RLAIF, and DPO training runs while curating high-quality instruction-tuning datasets and translating eval insights into post-training decisions.

Skills & Tools: Expert-level proficiency in LLM evaluation methodologies, benchmark design, synthetic data generation, post-training techniques (SFT, RLHF, RLAIF, DPO), and Python/ML infrastructure. Strong ability to work at the intersection of evaluation science, research, and product with excellent communication skills to partner across teams.

Qualifications: 5+ years of ML/AI engineering experience, including direct hands-on work inside an LLM lab or foundation model team. Deep understanding of rigorous evaluation at scale and of model training pipelines, plus a proven ability to design and execute complex ML projects from conception to deployment.

Location: Remote (India)

Compensation: ₹2M–₹4.5M/yr (estimated)

Job Description

ABOUT THE ROLE

We are looking for a Senior Engineer to join our AI team at the intersection of evaluation science, post-training, and foundation model development. You will own our end-to-end eval and benchmarking infrastructure, the critical feedback loop that drives every major model improvement, while contributing hands-on to post-training pipelines for industry-specific vertical foundation models.

This role is ideal for someone who has worked directly inside an LLM lab and understands what rigorous evaluation looks like at scale: designing the taxonomy of skills being measured, identifying failure modes, engineering synthetic data to close capability gaps, and translating eval signals into actionable training decisions.

WHAT YOU'LL DO

Evaluation & Benchmarking

Design and own task-level evaluation frameworks for LLM agents and base models, covering multi-step reasoning, tool/API use, instruction following, and domain knowledge, grounded in real user failure modes rather than off-the-shelf benchmark suites.

Build comparative benchmarking pipelines to assess leading frontier models (GPT-4o, Gemini, Claude, Llama, Mistral, etc.) against each other and against internal models, with structured analysis of where each model family fails, regresses, or excels across subjects, topics, and task types.

Produce capability gap reports that quantify performance deltas across dimensions such as subject-matter accuracy, reasoning depth, factual consistency, and refusal behaviour.

Track model version regressions across provider releases to maintain a living competitive intelligence benchmark.

Develop domain-specific benchmarks tailored to vertical use-cases (e.g., STEM tutoring, legal, finance, healthcare), including problem taxonomy design, rubric definition, and inter-annotator agreement pipelines.

Define and drive synthetic data generation strategies to systematically address model shortcomings in specific subjects, topics, and skill areas:
Identify low-performance clusters from eval results and translate them into targeted data generation prompts and pipelines.
Design LLM-assisted pipelines for generating high-quality, diverse, and verifiable synthetic training and evaluation data at scale.
Validate synthetic data quality through auto-eval, human review, and downstream model performance lift experiments.

Build automated regression suites integrated into CI/CD workflows to detect capability degradation across fine-tuning runs and model updates.

Partner with product, curriculum, and research teams to translate eval insights into prioritized post-training and data flywheel decisions.

Post-Training & Fine-Tuning

Lead or directly contribute to SFT, RLHF, RLAIF, and DPO training runs on industry-specific vertical foundation models, from dataset design through training execution and eval-gated release.

Curate and engineer high-quality instruction-tuning and preference datasets for domain adaptation, with hands-on experience distinguishing signal from noise in annotation pipelines.

Define data quality criteria, rejection sampling strategies, and deduplication pipelines for SFT corpora.

Design preference pair construction methodologies and reward model training setups grounded in domain-specific quality rubrics.

Implement and experiment with alignment techniques including reward modelling, process reward models (PRMs), and constitutional/RLAIF approaches.

Run ablation studies and controlled experiments to attribute model behaviour changes to specific data or training interventions, not just report final numbers.

Contribute to continual pre-training and domain-adaptive fine-tuning pipelines for vertical models, including domain data sourcing, mixing strategies, and curriculum design.

Infrastructure & Tooling

Build scalable eval pipelines that run automatically on every training checkpoint and integrate into CI/CD for continuous model quality tracking.

Maintain model cards, eval leaderboards, and internal dashboards providing visibility across experiments for both technical and non-technical stakeholders.

Ensure reproducibility through rigorous experiment tracking (W&B, MLflow, or equivalent), versioned datasets, and documented training configs.

WHO YOU ARE

Required

5+ years of ML/AI engineering experience, with at least 2–3 years focused on large language models.

Lab pedigree: Direct, hands-on experience at an LLM lab, AI research organization, or equivalent frontier AI team — you have shipped models, not just called APIs.

Familiarity with the full model lifecycle: pre-training data, post-training alignment, eval, and production deployment.

Deep practical expertise in post-training methods (SFT, RLHF, RLAIF, DPO, PPO), from dataset construction through training and eval-gated release.

Experience with reward modeling, preference data curation, and quality control for alignment pipelines.

Demonstrated experience designing LLM evaluation frameworks beyond standard benchmarks, including task-level evals for agentic or multi-step workflows.

Hands-on experience building synthetic data generation pipelines to address specific model capability gaps:
Designing targeted generation prompts based on eval failure analysis.
Validating synthetic data quality through downstream model performance experiments.

Proven track record of comparative benchmarking across leading foundation models, with structured analysis of capability shortcomings by subject, skill, or task type.

Experience training or fine-tuning vertical/industry-specific foundation models: domain data curation, continual pre-training, or domain-adaptive SFT.

Strong software engineering fundamentals: Python, PyTorch or JAX, distributed training.

Preferred

Publications or applied research contributions in LLM evaluation, alignment, or post-training.

Experience with multi-modal models or agents with external tool/API use.

Exposure to red-teaming, adversarial evaluation, or safety benchmarking.

Model distillation, speculative decoding, or inference optimization experience.

Prior experience in an education, STEM, legal, biomedical, or enterprise software vertical.

WHAT SUCCESS LOOKS LIKE

30 Days: Fully onboarded into training infra and eval repos. Running existing benchmarks end-to-end and producing a written gap analysis identifying missing coverage.

60 Days: Shipped at least one new domain-specific benchmark and one synthetic data generation pipeline addressing a known model gap. CI-integrated eval running on every checkpoint.

3 Months: Standardized the model evaluation framework for foundation models. Owning golden dataset strategies for fine-tuning, with measurable subject-accuracy gains.

6 Months: Recognized internally as the authority on model quality and competitive benchmarking. Eval insights are directly driving roadmap prioritization.

Why do we exist?

Students are working harder than ever before to stabilize their future. Our recent research study, State of the Student, shows that nearly 3 out of 4 students are working to support themselves through college and 1 in 3 students feel pressure to spend more than they can afford. We founded our business on providing affordable textbook rental options to address these issues. Since then, we’ve expanded our offerings to supplement many facets of higher education through Chegg Study, Chegg Math, Chegg Writing, Chegg Internships, Chegg Skills, and more to support students beyond their college experience. These offerings lower financial concerns for students by modernizing their learning experience.

We exist so students everywhere have a smarter, faster, more affordable way to student.

Video Shorts

Life at Chegg: http://youtu.be/Fwf90zgaOLA

Chegg Corporate Career Page: https://jobs.chegg.com/

Chegg India: http://www.cheggindia.com/

Chegg Israel: http://www.chegg.com/about/working-at-chegg/israel/

Chegg Skills: https://www.chegg.com/skills

Chegg out our culture and benefits! http://www.chegg.com/about/working-at-chegg/benefits/

Chegg is an equal opportunity employer

What is Chegg?

An ‘always on’ digital learning platform. Chegg puts students first… Everything we build in this company is student-focused, making us the leading student-first connected learning platform. Chegg strives to improve the overall return on investment in education by helping students learn more in less time and at a lower cost. This is achieved by providing students a multitude of educational tools from affordable textbook rentals to Chegg Study which supplements their learning through 24/7 tutor access, step-by-step help with questions, and more. Chegg is a publicly-held company based in Santa Clara, California and trades on the NYSE under the symbol CHGG.