Created GRPOTrainerWithEval subclass for different evaluation reward functions #9
base: working-grpo-2025-03-12
Conversation
PR Overview
This PR introduces a new subclass, GRPOTrainerWithEval, which extends the GRPOTrainer functionality to support evaluation reward functions while maintaining backward compatibility.
- New GRPOTrainerWithEval subclass accepts separate evaluation reward functions and processing classes.
- Configuration handling is unified through the use of an instance attribute (_model_init_kwargs) and a dedicated helper method (_make_reward_processing_classes).
- The diff adds strict checking in zip calls to enforce matching lengths of reward functions and processing classes (see the sketch below).
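As a rough illustration only, not the PR's actual diff, a subclass along these lines could wire eval-specific reward functions into the trainer. The constructor signature, the `eval_reward_funcs` keyword, and the attribute names are assumptions; only `GRPOTrainerWithEval` and `_make_reward_processing_classes` come from the overview above.

```python
# Rough sketch, assuming trl's GRPOTrainer exposes reward_funcs and
# reward_processing_classes attributes; not the actual implementation.
from trl import GRPOTrainer


class GRPOTrainerWithEval(GRPOTrainer):
    def __init__(self, *args, eval_reward_funcs=None,
                 eval_reward_processing_classes=None, **kwargs):
        super().__init__(*args, **kwargs)
        if eval_reward_funcs is None:
            # No eval-specific reward functions: fall back to the training ones,
            # which keeps the subclass backward compatible with GRPOTrainer.
            self.eval_reward_funcs = self.reward_funcs
            self.eval_reward_processing_classes = self.reward_processing_classes
        else:
            if not isinstance(eval_reward_funcs, list):
                eval_reward_funcs = [eval_reward_funcs]
            self.eval_reward_funcs = eval_reward_funcs
            self.eval_reward_processing_classes = self._make_reward_processing_classes(
                eval_reward_funcs, eval_reward_processing_classes
            )

    @staticmethod
    def _make_reward_processing_classes(reward_funcs, reward_processing_classes=None):
        # Pair one processing class (e.g. a tokenizer) with each reward function.
        if reward_processing_classes is None:
            reward_processing_classes = [None] * len(reward_funcs)
        elif not isinstance(reward_processing_classes, list):
            reward_processing_classes = [reward_processing_classes]
        # strict=True raises ValueError when the two lists differ in length,
        # mirroring the stricter zip checks described in the overview.
        return [cls for _, cls in zip(reward_funcs, reward_processing_classes, strict=True)]
```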
Reviewed Changes
| File | Description |
| --- | --- |
| trl/trainer/grpo_trainer.py | Introduces GRPOTrainerWithEval and refactors reward processing and model init kwargs |
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Have you tested this against the new multi-task reward_func setup?
Hi @shirinyamani, thanks for the comment. No, we stopped rebasing atop … Perhaps if we rebased for newer features in …
This PR creates a `GRPOTrainer` subclass `GRPOTrainerWithEval` that adds support for optional `eval_reward_processing_classes`. It should be backwards compatible with `GRPOTrainer`. The only caveat here is I didn't comprehensively think about `args.reward_weights`.
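To make the intended interface concrete, here is a hypothetical usage sketch. The reward functions, the toy dataset, and the `eval_reward_funcs` keyword are illustrative assumptions rather than part of the PR; it presumes `GRPOTrainerWithEval` is importable from the modified `trl/trainer/grpo_trainer.py`.

```python
# Hypothetical usage, not taken from the PR; keyword names mirror the
# GRPOTrainer API, and eval_reward_funcs is an assumed argument.
from datasets import Dataset
from trl import GRPOConfig


def train_reward(completions, **kwargs):
    # Toy training reward: favor completions that mention "answer".
    return [1.0 if "answer" in c else 0.0 for c in completions]


def eval_reward(completions, **kwargs):
    # Toy evaluation-only reward: favor completions that end with a period.
    return [float(c.strip().endswith(".")) for c in completions]


dataset = Dataset.from_dict({"prompt": ["What is 2 + 2?", "Name a color."]})

trainer = GRPOTrainerWithEval(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=train_reward,
    eval_reward_funcs=eval_reward,  # assumed keyword for eval-time rewards
    args=GRPOConfig(output_dir="grpo-eval-demo"),
    train_dataset=dataset,
    eval_dataset=dataset,
)
trainer.train()
```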