Add profiler to full finetune recipes #1288


Merged: 4 commits into pytorch:main on Aug 9, 2024

Conversation

gau-nernst
Contributor

@gau-nernst gau-nernst commented Aug 8, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses. #1279

Changelog

What are the changes made in this PR?

Add the profiler to full_finetune_distributed.py and full_finetune_single_device.py, essentially copied over from the LoRA recipes.
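For context on what gets copied over: torch.profiler drives its capture window with a step-based schedule of wait, warmup, and active phases. A minimal pure-Python sketch of that state machine (phase lengths here are illustrative, not the recipe defaults) might look like:

```python
from enum import Enum

class ProfilerAction(Enum):
    NONE = 0     # profiler idle
    WARMUP = 1   # collecting but discarding results
    RECORD = 2   # actively recording the trace

def schedule(step, *, wait, warmup, active):
    """Map a global step to a profiler action, mimicking the
    wait -> warmup -> active cycle of torch.profiler.schedule."""
    cycle_len = wait + warmup + active
    pos = step % cycle_len
    if pos < wait:
        return ProfilerAction.NONE
    if pos < wait + warmup:
        return ProfilerAction.WARMUP
    return ProfilerAction.RECORD

# With these (made-up) phase lengths, recording happens on steps 10-11.
actions = [schedule(s, wait=5, warmup=5, active=2) for s in range(12)]
print(actions[0], actions[5], actions[10])
```

In a training loop, the recipe would call the profiler's `step()` once per iteration so the profiler can advance through this schedule.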

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any of these, just ask and we will happily help. We also have a contributing page with some guidance on contributing.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Just like the LoRA recipe, to profile, run:

tune run full_finetune_single_device \
  --config llama2/7B_full_low_memory \
  profiler.enabled=True \
  profiler.output_dir=./profile-test

Then we can open the trace under profile-test/iteration_12 with something like https://ui.perfetto.dev

[screenshot: profiler trace opened in Perfetto]

The trace reveals that, in my setup, most of the time is spent transferring data back and forth between GPU and CPU for the paged Adam optimizer.
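Perfetto (and chrome://tracing) consume the Chrome Trace Event JSON format, which is also what the PyTorch profiler's chrome-trace export produces. As a rough illustration of the shape of such a file (the event names and durations below are made up, not taken from the real trace):

```python
import json

# Minimal Chrome Trace Event payload: a list of "complete" ("X") events,
# each with a name, category, start timestamp (us), and duration (us).
events = [
    {"name": "forward", "cat": "op", "ph": "X", "ts": 0, "dur": 1200, "pid": 0, "tid": 0},
    {"name": "backward", "cat": "op", "ph": "X", "ts": 1200, "dur": 2400, "pid": 0, "tid": 0},
    {"name": "optimizer.step (CPU<->GPU copy)", "cat": "op", "ph": "X", "ts": 3600, "dur": 5000, "pid": 0, "tid": 0},
]
trace_json = json.dumps({"traceEvents": events})

# The widest bar in the viewer is simply the longest event, analogous
# to the paged-Adam transfers dominating the real trace.
longest = max(events, key=lambda e: e["dur"])
print(longest["name"])
```

Saving such a payload to a `.json` file and dragging it into https://ui.perfetto.dev renders the timeline.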


pytorch-bot bot commented Aug 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1288

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 56f4adc with merge base 150a011:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 8, 2024
self._profiler = self._setup_profiler(cfg.get(PROFILER_KEY, None))

def _setup_profiler(
    self, cfg_profiler: Optional[DictConfig]
Collaborator


nit:

Suggested change
self, cfg_profiler: Optional[DictConfig]
self, cfg_profiler: Optional[DictConfig] = None

Parses the `profiler` section of top-level `cfg` and sets up profiler

Args:
cfg_profiler (DictConfig): `profiler` section of the top-level `cfg` (the main config passed to `recipe.main`)
Collaborator

@SalmanMohammadi SalmanMohammadi Aug 8, 2024


nit:

Suggested change
cfg_profiler (DictConfig): `profiler` section of the top-level `cfg` (the main config passed to `recipe.main`)
cfg_profiler (Optional[DictConfig]): ``profiler`` section of the top-level ``cfg`` (the main config passed to ``recipe.main``). Default None.

Collaborator


I think this is the case in all the recipes. Sorry about our annoying linter : )
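The pattern under discussion: when the config has no `profiler` section, `cfg.get(PROFILER_KEY, None)` yields None and setup should fall back to a no-op profiler, which is why `= None` is the natural default for the signature. A dependency-free sketch of that fallback (the `DummyProfiler` name and plain-dict config here are illustrative, not the actual torchtune classes):

```python
from typing import Optional

class DummyProfiler:
    """No-op stand-in used when profiling is disabled or unconfigured.

    Supports the same context-manager + step() surface as a real
    profiler, so the training loop needs no branching.
    """
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def step(self):
        pass  # intentionally does nothing

def setup_profiler(cfg_profiler: Optional[dict] = None):
    # Missing section or enabled=False -> no-op profiler.
    if cfg_profiler is None or not cfg_profiler.get("enabled", False):
        return DummyProfiler()
    # A real recipe would construct torch.profiler.profile(...) here.
    raise NotImplementedError("real profiler construction elided")

prof = setup_profiler(None)
with prof:
    prof.step()  # safe no-op when profiling is off
print(type(prof).__name__)
```

The training loop can then call `prof.step()` unconditionally on every iteration, whether profiling is enabled or not.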

@SalmanMohammadi
Collaborator

Minor nit to appease our linter but LGTM. Thanks for getting this out!!

@SalmanMohammadi
Collaborator

cc @ebsmothers @joecummings to double check whether a quick example profile would be good to see

@ebsmothers
Contributor

Thanks @gau-nernst, this looks great! Re @SalmanMohammadi's comment: can you include a screenshot of one of the traces you got in the PR summary just to explicitly show that it works via an example? Also regarding the failing CI signals, can you check the linter one? (You can ignore the build signals, there is an unrelated breakage we are sorting out right now.) Let me know if you need any pointers on the linter and I'll be happy to provide them. After that I think this is good to go!

@gau-nernst
Contributor Author

Fixed the linter issue. Sorry, I didn't install pre-commit before making the first commit, so it slipped through subsequent checks. I manually ran pylintdoc to double check and there are no issues now. Also added the full finetune low memory trace to the PR description.

Contributor

@ebsmothers ebsmothers left a comment


Looks great, thank you for adding this!

@ebsmothers ebsmothers merged commit 18962f3 into pytorch:main Aug 9, 2024
20 checks passed
@gau-nernst gau-nernst deleted the profile_more_recipes branch August 9, 2024 04:06