Renaming and Organization of RL algorithms in preparation for Development #83

Merged: 37 commits merged into main on Jun 16, 2025

Conversation

@jdchang1 commented Jun 5, 2025

I apologize in advance, since this shuffling/renaming will require a bit of updating of YAMLs and dependencies. My main goal is to make the organization of the code more semantically meaningful and, hopefully, easier to onboard onto. Here is a summary of the main changes:

  • Removed algorithm-specific references such as DPO and PPO where we use these as general-purpose abstractions over families of algorithms
  • Moved algorithm code into an algorithms directory
  • Shuffled some files into more meaningful directories, e.g., buffers.py => data rather than keeping it only in online
  • Unified the offline, online, and reward_modeling directories to use model_methods.py for implementing forward and loss variants
  • Split online_rl_loss into policy_loss and critic_loss
  • Added OnPolicyEnum and Algorithm_Type enums for loss_type, similar to PairwiseOfflineEnum
  • Made online_rl_loss just return a return_dict, matching the other algorithm pipelines (see the sketch below)
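
For concreteness, here is a minimal sketch of what the split loss interface and the new enum could look like. Only the names OnPolicyEnum, policy_loss, critic_loss, online_rl_loss, loss_type, and return_dict come from the bullets above; the enum members, signatures, and loss formulas below are illustrative assumptions, not the actual repository code.

```python
from enum import Enum

import torch
import torch.nn.functional as F


class OnPolicyEnum(Enum):
    """Hypothetical loss_type enum for the online RL family (members are assumptions)."""
    PPO = 'ppo'
    GRPO = 'grpo'


def policy_loss(
    log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    clip_range: float = 0.2,
) -> torch.Tensor:
    """Clipped surrogate policy objective (illustrative only)."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()


def critic_loss(values: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Value-function regression loss (illustrative only)."""
    return F.mse_loss(values, returns)


def online_rl_loss(batch: dict, loss_type: OnPolicyEnum) -> dict:
    """Compute the named losses and return them in a return_dict,
    mirroring the other algorithm pipelines."""
    return_dict = {
        'policy_loss': policy_loss(
            batch['log_probs'],
            batch['old_log_probs'],
            batch['advantages'],
        ),
    }
    # Group-relative methods like GRPO typically have no learned critic,
    # so only add the critic term for PPO-style losses in this sketch.
    if loss_type == OnPolicyEnum.PPO:
        return_dict['critic_loss'] = critic_loss(batch['values'], batch['returns'])
    return_dict['total'] = sum(return_dict.values())
    return return_dict
```

With this shape, callers can log each named loss separately and backprop through return_dict['total'], which is what makes the online pipeline line up with the offline and reward-modeling ones.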

MCLI run names for each pipeline:

Reward Training: reward-reorg-aWY7m3
Offline Training: rebel-reorg-nboVUO
Online Training: grpo-reorg-fK2HDm

MLFlow Link

@jdchang1 marked this pull request as ready for review June 5, 2025 18:16

@gupta-abhay left a comment

Looks good. It would be good to run a test for DPO/GRPO with main and with this branch (with the appropriate YAML changes) to show things still look good.


@bcui-db left a comment

LGTM! Thank you! As mentioned above, let's just test RMs, DPO, and an online RL training flow to make sure things still work.

@bcui-db left a comment

Discussed offline, and overall LGTM! Thank you for this! I'd also want @gupta-abhay or @dakinggg to stamp just to be sure, but from my end LGTM.

@gupta-abhay commented Jun 16, 2025

LGTM! Please update the YAMLs here (and let's figure out a path away from the backward-compatible stuff ASAP, unless TAO production is hinging on that old setup).

Huge thanks for doing this :)

@jdchang1 merged commit 6bcfeba into main Jun 16, 2025
4 checks passed