[Draft] TP Overlap of Micro Batches #4963


Closed
wants to merge 1 commit into from

inkcherry
Contributor

@inkcherry inkcherry commented Apr 1, 2025

Motivation

Draft of TP Overlap of Micro-Batches (LLaMA)
This covers the forward pass as described in arXiv:2409.15241.
It is currently a rough code hack at the POC stage. Functional and MMLU tests pass with the torch kernel backend and the torch comm backend.
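
For readers unfamiliar with the idea, the following is a minimal, library-free sketch of overlapping one micro-batch's tensor-parallel communication with the next micro-batch's compute. It is not the code in this PR: `TP_SIZE`, `local_compute`, `all_reduce`, and `forward_overlapped` are hypothetical names, and a single-worker thread pool stands in for an asynchronous communication stream.

```python
from concurrent.futures import ThreadPoolExecutor

TP_SIZE = 2  # hypothetical tensor-parallel degree


def local_compute(micro_batch, rank):
    # Stand-in for one rank's partial layer output (e.g. a row-parallel matmul).
    return [x * (rank + 1) for x in micro_batch]


def all_reduce(partials):
    # Stand-in for a sum all-reduce across the TP ranks.
    return [sum(vals) for vals in zip(*partials)]


def forward_overlapped(micro_batches):
    """Run each micro-batch's all-reduce while the next micro-batch computes."""
    results = []
    # One worker plays the role of a dedicated communication stream.
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        pending = None
        for mb in micro_batches:
            # Compute for this micro-batch; the previous all-reduce may
            # still be in flight on the communication worker.
            partials = [local_compute(mb, rank) for rank in range(TP_SIZE)]
            if pending is not None:
                results.append(pending.result())  # wait for the previous comm
            pending = comm_stream.submit(all_reduce, partials)
        if pending is not None:
            results.append(pending.result())
    return results


if __name__ == "__main__":
    print(forward_overlapped([[1.0, 2.0], [3.0, 4.0]]))
```

In a real implementation the communication would be an asynchronous collective on a separate stream rather than a Python thread; the sketch only illustrates the ordering of compute and communication.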

Modifications

Checklist

@inkcherry inkcherry marked this pull request as draft April 1, 2025 06:09
@fzyzcjy
Collaborator

fzyzcjy commented Apr 1, 2025

Wondering whether code in #4068 may be a bit helpful. For example,

  • Code to split a batch into two micro batches
  • Code to allow users to write two-batch-overlap code as if one-batch code
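
As a rough illustration of these two points (not the actual API in #4068), the sketch below splits a batch in two and lets a per-batch forward function be written as ordinary sequential code, with `yield` marking the points where a scheduler may switch to the other micro-batch. All names (`split_batch`, `layer_forward`, `run_interleaved`) are hypothetical.

```python
def split_batch(batch):
    """Split a batch into two micro-batches; the first gets the extra item."""
    mid = (len(batch) + 1) // 2
    return batch[:mid], batch[mid:]


def layer_forward(name, batch, log):
    # Written as if only one batch existed. Each `yield` marks a point
    # (e.g. where communication would be launched) at which the scheduler
    # may switch to the other micro-batch.
    log.append(f"{name}: attention compute")
    yield
    log.append(f"{name}: mlp compute")
    yield
    return [x * 2 for x in batch]  # placeholder for the layer output


def run_interleaved(generators):
    """Advance the generators round-robin and collect their return values."""
    results = [None] * len(generators)
    active = list(enumerate(generators))
    while active:
        still_running = []
        for index, gen in active:
            try:
                next(gen)
                still_running.append((index, gen))
            except StopIteration as stop:
                results[index] = stop.value
        active = still_running
    return results


if __name__ == "__main__":
    log = []
    first, second = split_batch([1, 2, 3, 4, 5])
    outputs = run_interleaved(
        [layer_forward("mb0", first, log), layer_forward("mb1", second, log)]
    )
    print(outputs)
    print(log)
```

The generator form keeps the model code readable as single-batch code while the interleaving policy lives entirely in the scheduler.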

@zhyncs
Member

zhyncs commented Apr 1, 2025

Amazing!

@fzyzcjy
Collaborator

fzyzcjy commented Apr 1, 2025

Extracting: #4965

@inkcherry
Contributor Author

> Wondering whether code in #4068 may be a bit helpful. For example,
>
>   • Code to split a batch into two micro batches
>   • Code to allow users to write two-batch-overlap code as if one-batch code

Thanks! This PR is only at the POC stage and I haven't considered the design yet.
#4068 has a good design; let me work out how to use it.

@fzyzcjy
Collaborator

fzyzcjy commented Apr 1, 2025

@inkcherry Btw, many things in the diff of #4068 are unrelated to it; they come from other not-yet-merged PRs that #4068 builds on, so you can ignore them when reading.

@inkcherry
Contributor Author

inkcherry commented Apr 11, 2025

Hi, @fzyzcjy,
I was able to adopt the API you designed and have verified MMLU in local tests, covering:

  • Splitting a batch into two micro batches
  • The two-batch stage pipeline layout design

That's very helpful. Once the related code is merged, I'll continue with further updates : )

@fzyzcjy
Collaborator

fzyzcjy commented Apr 11, 2025

Looks great! I hope my PRs can be merged soon.

@merrymercy merrymercy closed this Apr 27, 2025

4 participants