Renaming and Organization of RL algorithms in preparation for Development #83
Conversation
Looks good - it would be good to run a test for DPO / GRPO with main and with this branch (with the appropriate YAML changes) to show things still work.
LGTM! Thank you! As mentioned above, let's just test RMs, DPO, and an online RL training flow to make sure things still work.
Discussed offline, and overall LGTM! Thank you for this! I'd also want @gupta-abhay or @dakinggg to stamp just to be sure, but from my end LGTM.
LGTM! Please update the YAMLs here (and let's figure out a path away from the backward-compatible stuff ASAP - unless TAO production is hinging on that old setup). Huge thanks for doing this :)
I apologize in advance, since this shuffling/renaming will require a bit of updating of YAMLs and dependencies. My main goal is to make the organization of the code more semantically meaningful and, hopefully, easier to onboard to. Here is a summary of the main changes:
- Moved the RL algorithm implementations into an `algorithms` directory
- `buffers.py` => `data`, rather than keeping it only in `online`
- `model_methods.py` to implement `forward` and `loss` variants
- `online_rl_loss` => `policy_loss` and `critic_loss`
- `OnPolicyEnum` and `Algorithm_Type` enums for `loss_type`, similar to `PairwiseOfflineEnum`
- `online_rl_loss` now just returns `return_dict`, similar to other algorithm pipelines (see the sketch after this list)
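For concreteness, here is a minimal sketch of what the enum-driven split might look like. Everything in it is an illustrative assumption (the enum members, the function signatures, the `return_dict` keys, the 0.5 critic coefficient), not the repo's actual API:

```python
from enum import Enum

import torch


class OnPolicyEnum(Enum):
    """Hypothetical loss_type values for online RL, analogous to PairwiseOfflineEnum."""

    PPO = "ppo"
    GRPO = "grpo"


def policy_loss(logprobs, old_logprobs, advantages, clip_range=0.2):
    """Clipped policy-gradient objective (illustrative PPO-style clipping)."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()


def critic_loss(values, returns):
    """Value-function regression term (illustrative)."""
    return torch.nn.functional.mse_loss(values, returns)


def online_rl_loss(batch, loss_type):
    """Dispatch on loss_type and return a return_dict, like the other pipelines."""
    p_loss = policy_loss(batch["logprobs"], batch["old_logprobs"], batch["advantages"])
    return_dict = {"policy_loss": p_loss, "total": p_loss}
    if loss_type is OnPolicyEnum.PPO:
        # GRPO has no learned critic, so only PPO adds the critic term here.
        c_loss = critic_loss(batch["values"], batch["returns"])
        return_dict["critic_loss"] = c_loss
        return_dict["total"] = p_loss + 0.5 * c_loss
    return return_dict
```

One payoff of this shape, if I'm reading the reorg right, is that `policy_loss` and `critic_loss` can be tested independently while `online_rl_loss` stays a thin dispatcher, matching the offline pipelines.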
MCLI run names for each pipeline:
- Reward Training: `reward-reorg-aWY7m3`
- Offline Training: `rebel-reorg-nboVUO`
- Online Training: `grpo-reorg-fK2HDm`
MLFlow Link