Add profiler to full finetune recipes #1288


Merged: 4 commits into pytorch:main on Aug 9, 2024

Conversation

gau-nernst
Contributor

@gau-nernst gau-nernst commented Aug 8, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses. #1279

Changelog

What are the changes made in this PR?

Add the profiler to full_finetune_distributed.py and full_finetune_single_device.py, essentially copied over from the LoRA recipes.
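For context on what gets copied over: torch.profiler drives its capture window with a step-based schedule of wait, warmup, and active phases. A minimal pure-Python sketch of that state machine (phase lengths here are illustrative, not the recipe defaults) might look like:

```python
from enum import Enum

class ProfilerAction(Enum):
    NONE = 0     # profiler idle
    WARMUP = 1   # collecting but discarding results
    RECORD = 2   # actively recording the trace

def schedule(step, *, wait, warmup, active):
    """Map a global step to a profiler action, mimicking the
    wait -> warmup -> active cycle of torch.profiler.schedule."""
    cycle_len = wait + warmup + active
    pos = step % cycle_len
    if pos < wait:
        return ProfilerAction.NONE
    if pos < wait + warmup:
        return ProfilerAction.WARMUP
    return ProfilerAction.RECORD

# With these (made-up) phase lengths, recording happens on steps 10-11.
actions = [schedule(s, wait=5, warmup=5, active=2) for s in range(12)]
print(actions[0], actions[5], actions[10])
```

In a training loop, the recipe would call the profiler's `step()` once per iteration so the profiler can advance through this schedule.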

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any of these, just ask and we will happily help. We also have a contributing page with some guidance on contributing.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Just like the LoRA recipe, to profile, run:

tune run full_finetune_single_device \
  --config llama2/7B_full_low_memory \
  profiler.enabled=True \
  profiler.output_dir=./profile-test

Then we can open the trace under profile-test/iteration_12 with something like https://ui.perfetto.dev

[screenshot: profiler trace opened in Perfetto]

The trace reveals that, in my setup, most of the time is spent transferring data back and forth between GPU and CPU for the paged Adam optimizer.
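Perfetto (and chrome://tracing) consume the Chrome Trace Event JSON format, which is also what the PyTorch profiler's chrome-trace export produces. As a rough illustration of the shape of such a file (the event names and durations below are made up, not taken from the real trace):

```python
import json

# Minimal Chrome Trace Event payload: a list of "complete" ("X") events,
# each with a name, category, start timestamp (us), and duration (us).
events = [
    {"name": "forward", "cat": "op", "ph": "X", "ts": 0, "dur": 1200, "pid": 0, "tid": 0},
    {"name": "backward", "cat": "op", "ph": "X", "ts": 1200, "dur": 2400, "pid": 0, "tid": 0},
    {"name": "optimizer.step (CPU<->GPU copy)", "cat": "op", "ph": "X", "ts": 3600, "dur": 5000, "pid": 0, "tid": 0},
]
trace_json = json.dumps({"traceEvents": events})

# The widest bar in the viewer is simply the longest event, analogous
# to the paged-Adam transfers dominating the real trace.
longest = max(events, key=lambda e: e["dur"])
print(longest["name"])
```

Saving such a payload to a `.json` file and dragging it into https://ui.perfetto.dev renders the timeline.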


pytorch-bot bot commented Aug 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1288

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 56f4adc with merge base 150a011:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 8, 2024
self._profiler = self._setup_profiler(cfg.get(PROFILER_KEY, None))

def _setup_profiler(
    self, cfg_profiler: Optional[DictConfig]
Collaborator


nit:

Suggested change
self, cfg_profiler: Optional[DictConfig]
self, cfg_profiler: Optional[DictConfig] = None

Parses the `profiler` section of top-level `cfg` and sets up profiler

Args:
cfg_profiler (DictConfig): `profiler` section of the top-level `cfg` (the main config passed to `recipe.main`)
Collaborator

@SalmanMohammadi SalmanMohammadi Aug 8, 2024


nit:

Suggested change
cfg_profiler (DictConfig): `profiler` section of the top-level `cfg` (the main config passed to `recipe.main`)
cfg_profiler (Optional[DictConfig]): ``profiler`` section of the top-level ``cfg`` (the main config passed to ``recipe.main``). Default None.

Collaborator


I think this is the case in all the recipes. Sorry about our annoying linter : )
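The pattern under discussion: when the config has no `profiler` section, `cfg.get(PROFILER_KEY, None)` yields None and setup should fall back to a no-op profiler, which is why `= None` is the natural default for the signature. A dependency-free sketch of that fallback (the `DummyProfiler` name and plain-dict config here are illustrative, not the actual torchtune classes):

```python
from typing import Optional

class DummyProfiler:
    """No-op stand-in used when profiling is disabled or unconfigured.

    Supports the same context-manager + step() surface as a real
    profiler, so the training loop needs no branching.
    """
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def step(self):
        pass  # intentionally does nothing

def setup_profiler(cfg_profiler: Optional[dict] = None):
    # Missing section or enabled=False -> no-op profiler.
    if cfg_profiler is None or not cfg_profiler.get("enabled", False):
        return DummyProfiler()
    # A real recipe would construct torch.profiler.profile(...) here.
    raise NotImplementedError("real profiler construction elided")

prof = setup_profiler(None)
with prof:
    prof.step()  # safe no-op when profiling is off
print(type(prof).__name__)
```

The training loop can then call `prof.step()` unconditionally on every iteration, whether profiling is enabled or not.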

@SalmanMohammadi
Collaborator

Minor nit to appease our linter but LGTM. Thanks for getting this out!!

@SalmanMohammadi
Collaborator

cc @ebsmothers @joecummings to double check whether a quick example profile would be good to see

@ebsmothers
Contributor

Thanks @gau-nernst, this looks great! Re @SalmanMohammadi's comment: can you include a screenshot of one of the traces you got in the PR summary just to explicitly show that it works via an example? Also regarding the failing CI signals, can you check the linter one? (You can ignore the build signals, there is an unrelated breakage we are sorting out right now.) Let me know if you need any pointers on the linter and I'll be happy to provide them. After that I think this is good to go!

@gau-nernst
Contributor Author

Fixed the linter issue. Sorry, I didn't install pre-commit before making the first commit, so it slipped through subsequent checks. I manually ran pylintdoc to double check and there are no issues now. Also added the full finetune low memory trace to the PR description.

Contributor

@ebsmothers ebsmothers left a comment


Looks great, thank you for adding this!

@ebsmothers ebsmothers merged commit 18962f3 into pytorch:main Aug 9, 2024
20 checks passed
@gau-nernst gau-nernst deleted the profile_more_recipes branch August 9, 2024 04:06