
FSDP Llama3 wrapping improvements for full finetune #865


Merged 30 commits into main on May 7, 2024

Conversation

@rohan-varma (Member) commented Apr 25, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Changelog

This PR primarily seeks to improve memory efficiency specifically for llama3 full distributed training and enable a distributed finetune in 4x 24GB of memory. We do this with a new FSDP wrapping policy that wraps the token embedding and output projections. These are much larger for llama3 due to the increased vocab size, so sharding them across GPUs has more of an effect.

  • Added a new API to retrieve a memory-efficient FSDP wrapping policy (a rough sketch of the idea follows this list). To control whether the memory-efficient wrapping policy is retrieved, we introduce a new flag memory_efficient_wrapping in our configs. Currently, this is only set to True for llama3 distributed full finetuning. As follow-up work, we'll investigate other workloads with this wrapping and enable it where beneficial.
  • Added appropriate unittests
  • Integrated in full_finetune_distributed. Did a bit of study on potential integration into LoRA, but memory savings were less pronounced there - this needs further investigation.
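
To make the idea concrete, here is a minimal sketch of the kind of FSDP auto_wrap_policy described above, assuming a decoder-layer class like torchtune's TransformerDecoderLayer. This is illustrative, not the exact API added in this PR (which additionally tags the output projection so it gets its own unit, since an isinstance check on nn.Linear would wrap every linear layer):

from functools import partial
from typing import Set, Type

import torch.nn as nn
from torchtune import modules


def _memory_efficient_wrap_policy(
    module: nn.Module, recurse: bool, modules_to_wrap: Set[Type], **kwargs
) -> bool:
    # FSDP calls the policy repeatedly while traversing the module tree: with
    # recurse=True it asks whether to keep descending into children, and with
    # recurse=False it asks whether this module should become its own FSDP unit.
    if recurse:
        return True
    return isinstance(module, tuple(modules_to_wrap))


# Shard the large token embedding (and, in the PR, the output projection) as
# their own units in addition to per-layer wrapping; llama3's ~128k-entry vocab
# makes these parameters large enough that sharding them matters.
# NOTE: modules.TransformerDecoderLayer is assumed to be the decoder-layer
# class here; substitute the model's actual layer class as needed.
auto_wrap_policy = partial(
    _memory_efficient_wrap_policy,
    modules_to_wrap={nn.Embedding, modules.TransformerDecoderLayer},
)

The resulting callable would then be passed to FullyShardedDataParallel as the auto_wrap_policy argument.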

Test plan

  • Added unittests.

This PR only seeks to ship improvements to llama3 training.

Docs

[Screenshots of the rendered docs]

Full finetune

Run command for full finetune: CUDA_VISIBLE_DEVICES=0,3,6,7 tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full batch_size=1

  • With this PR: peak_memory_active:20.830272 peak_memory_alloc:19.085376 peak_memory_reserved:23.699914752, 1.06it/s
  • Without this PR: peak_memory_active:24.170446336 peak_memory_alloc:21.988057088 peak_memory_reserved:27.908898816, 1.08it/s
  • About 13% savings in allocated memory and 15% in reserved memory. This allows us to get a 4x 24GB finetune. (A sketch of how these peak-memory stats can be read out follows this list.)
  • NOTE: A previous version of this PR also wrapped the token embedding and output projection in their own activation checkpointing units, but this is not needed. Vocabulary size is increased, but the activations generated are proportional to sequence length, not vocab size, so checkpointing these won't help more for llama3 compared to llama2. A quick study checkpointing these versus not shows roughly the same memory efficiency. In particular, with checkpointing the token embedding and output proj, we achieve peak_memory_active:20.880037376 peak_memory_alloc:19.135141376 peak_memory_reserved:24.13821952, while without it we achieve the numbers reported above; they are very comparable.
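
As an aside, the peak-memory numbers above divide evenly by 1e9, i.e. they are byte counts reported in GB. A minimal sketch of how such stats can be read from the CUDA caching allocator (torchtune has its own memory-stats helper, so this is illustrative rather than the exact logging code used here):

import torch


def peak_memory_gb(device: torch.device) -> dict:
    # Peak "active" bytes come from the caching allocator's stats dict;
    # the allocated and reserved peaks have dedicated helpers.
    stats = torch.cuda.memory_stats(device)
    return {
        "peak_memory_active": stats["active_bytes.all.peak"] / 1e9,
        "peak_memory_alloc": torch.cuda.max_memory_allocated(device) / 1e9,
        "peak_memory_reserved": torch.cuda.max_memory_reserved(device) / 1e9,
    }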

pytorch-bot (bot) commented Apr 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/865

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6124dd5 with merge base 7d05579:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Apr 25, 2024
@rohan-varma marked this pull request as draft April 25, 2024 00:49
def _llama3_ac_policy(module: nn.Module, recurse: bool, modules_to_wrap, **kwargs):
    # Label that output_proj should be wrapped individually.
    if isinstance(module, modules.TransformerDecoder):
        targ = module.output.module
@rohan-varma (Member, Author):
this should probably use a helper function called _get_fsdp_wrapped_module or something that is aware of whether module.output is wrapped in FSDP or not and intelligently unwraps it.

@rohan-varma changed the title from "[WIP] Llama wrapping improvements" to "AC and FSDP Llama3 wrapping improvements" Apr 26, 2024
@rohan-varma marked this pull request as ready for review April 26, 2024 18:36
@rohan-varma requested a review from ebsmothers April 26, 2024 18:36
@ebsmothers (Contributor) left a comment:
Overall this makes sense to me and the memory savings for full finetune are great. My main question is around whether model type is the most natural way to expose this feature.

There's nothing about this functionality that is unique to Llama3, it's just that it proves most beneficial there. By doing things this way we are kinda making the decision for people that only Llama3 should have this feature, and supporting other models with large vocab sizes will then require updating the wrapping internals instead of just flipping a config. I know we had discussed Gemma as one specific model where this is a challenge, but I wonder if we can do an assert on the backend to raise an error if the model class is not TransformerDecoder as we would expect.

def _llama3_ac_policy(module: nn.Module, recurse: bool, modules_to_wrap, **kwargs):
    # Label that output_proj should be wrapped individually.
    if isinstance(module, modules.TransformerDecoder):
        targ = _maybe_fsdp_unwrap(module.output)
Contributor:
Sorry, dumb question here: is this just because we want to support both non-FSDP and FSDP models? Cause rn we are only integrating into distributed recipe, in which case we could (not saying should) assume FSDP

@rohan-varma (Member, Author):
So this isn't really because we want to support FSDP and non FSDP models, but we do happen to have this support with this function.

The reason we need this unwrap call is that the model may already have been wrapped in FSDP by the time we wrap in AC. We want to AC-wrap module.output, but accessing module.output on an FSDP-wrapped model may give us back the FSDP class, so we further unwrap to retrieve the local nn.Module.
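
A rough sketch of what such an unwrap helper could look like (the body here is a guess for illustration, not necessarily the PR's implementation):

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def _maybe_fsdp_unwrap(module: nn.Module) -> nn.Module:
    # FSDP exposes the original module via its ``module`` attribute, so peel
    # off the wrapper before tagging the submodule for AC wrapping.
    return module.module if isinstance(module, FSDP) else module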

@ebsmothers (Contributor):

> CUDA_VISIBLE_DEVICES=0,3,6,7

👀

@rohan-varma (Member, Author):

> Overall this makes sense to me and the memory savings for full finetune are great. My main question is around whether model type is the most natural way to expose this feature.
>
> There's nothing about this functionality that is unique to Llama3, it's just that it proves most beneficial there. By doing things this way we are kinda making the decision for people that only Llama3 should have this feature, and supporting other models with large vocab sizes will then require updating the wrapping internals instead of just flipping a config. I know we had discussed Gemma as one specific model where this is a challenge, but I wonder if we can do an assert on the backend to raise an error if the model class is not TransformerDecoder as we would expect.

@ebsmothers I definitely agree here. My proposal for a way forward would be to decouple the model type from the checkpointer and offer it as a general accessor to determine which model is being trained - there's currently no easy way to go about this. I'd also like this change to stay focused on llama3 (so the initial rollout of these policies will only be done for llama3). As follow-up work we should enable it for llama2 and investigate other models, although verifying these improvements is currently a long-running process and should ideally be done iteratively and/or by multiple folks, IMO.

@joecummings could you chime in on ModelType for this sort of use case, and on any alternative ways you might have to achieve this sort of gating based on specific models?

@joecummings (Contributor):

@rohan-varma I could totally be missing something here, but why can't we include embedding in the modules to wrap within the config for Llama3, rather than tie this directly to ModelType? That way, you can expand this to any new models that have this large embedding space, which is starting to become more popular.

cc: @ebsmothers

@musabgultekin (Contributor) commented Apr 29, 2024

I haven't tested this, but will this allow 70B on 8x80GB? I was only able to full finetune 70B with CPU offloading.

@rohan-varma (Member, Author):

> @rohan-varma I could totally be missing something here, but why can't we include embedding in the modules to wrap within the config for Llama3, rather than tie this directly to ModelType? That way, you can expand this to any new models that have this large embedding space, which is starting to become more popular.
>
> cc: @ebsmothers

@joecummings This is because modules_to_wrap is not configurable right now, and making it configurable would be a little tricky (i.e. we'd have to parse a string like torch.nn.Embedding and turn it into a class).
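
For illustration only (not torchtune code), resolving a dotted string from a config back into a class could look like this:

import importlib


def _resolve_class(path: str) -> type:
    # "torch.nn.Embedding" -> import torch.nn, then look up Embedding on it.
    module_path, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)


# e.g. _resolve_class("torch.nn.Embedding") is torch.nn.Embedding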

@rohan-varma (Member, Author):

> I haven't tested this, but will this allow 70B on 8x80GB? I was only able to full finetune 70B with CPU offloading.

This unfortunately won't allow 70B on 8x80GB without CPU offloading, from my experiments, but I can do a bit more testing. Our current thinking is to enable full finetune for 70B models with CPU offload.

@rohan-varma requested a review from ebsmothers April 30, 2024 22:49
@rohan-varma changed the title from "FSDP Llama3 wrapping improvements" to "FSDP Llama3 wrapping improvements for full finetune" May 3, 2024
    have not been verified and may not see the same improvements.
    Returns:
        FSDPPolicyType: Wrapping policy that can be passed into ``FullyShardedDataParallel`` as the ``auto_wrap_policy``
            argument. Please see documentation for `torchtune.utils.FSDPPolicyType` for additional details.
@rohan-varma (Member, Author):
Is there any way to link directly to this docstr? @ebsmothers or @NicolasHug maybe?

Contributor:
Can try this? ref

:const:`~torchtune.utils.FSDPPolicyType`

Contributor:
Bumping this

@rohan-varma (Member, Author):

> Thanks for adding this!
>
> I don't think I fully understand this:
>
> > New AC wrapping policy that checkpoints the token embedding and output projections as well. Similar reason to above - they generate larger activations so it would be useful to not store those in memory.
>
> Irrespective of the size of the vocab, the output of the embedding table would just depend on the sequence length? So why does this have anything to do with the vocab size? Or am I misunderstanding?

@kartikayk Thanks for the feedback and the review! You're totally right that this doesn't have anything to do with the vocab size and this was an oversight on my end. I verified that if we remove the modified AC wrapping, we don't change anything about the memory improvements we're shipping here. So this PR is now only limited to FSDP wrapping changes.

Also added a bunch more documentation to FSDPPolicyType to clearly explain it to the user and link back to the FSDP wrapping docs where useful. Thanks!

    have not been verified and may not see the same improvements.
    Returns:
        FSDPPolicyType: Wrapping policy that can be passed into ``FullyShardedDataParallel`` as the ``auto_wrap_policy``
            argument. Please see documentation for `torchtune.utils.FSDPPolicyType` for additional details.
Contributor:
Can try this? ref

:const:`~torchtune.utils.FSDPPolicyType`

"""
A default policy for wrapping Llama-3 style models for full finetuning using FSDP. Specifically,
this will wrap the model's token embedding and output projection into their own FSDP units to
maximize memory savings. After this is done, model will also be hierarchically wrapped
Contributor:
Can we be a little bit more explicit about why this maximizes memory savings here? (At least say that this helps because the embedding and output layers are quite large)

@rohan-varma (Member, Author):
Added

def llama3_wrap(module: nn.Module, recurse: bool, **kwargs):
    # Label that output_proj should be wrapped individually.
    if isinstance(module, modules.TransformerDecoder):
        module.output._wrap = True
Contributor:
Re the transformer decoder changes, I think the main thing is that the if isinstance check may need to be generalized (since now e.g. our TransformerDecoderLM or TransformerDecoderClassifier classes will both have output layers). But realistically I think the main use case will still be for when we're projecting to vocab_size, so maybe just directly replacing with TransformerDecoderLM (or whatever we're calling it) will be sufficient here.

    return ModuleWrapPolicy(modules_to_wrap)


def _llama3_full_fsdp_wrap_policy(modules_to_wrap: Set[Type]) -> FSDPPolicyType:
Contributor:
Also bump: are we still naming this based on llama3?

@rohan-varma requested a review from ebsmothers May 6, 2024 19:28
@ebsmothers (Contributor) left a comment:
OK a few more comments but overall no major concerns from my side

def llama3_wrap(module: nn.Module, recurse: bool, **kwargs):
    # Label that output_proj should be wrapped individually.
    if isinstance(module, modules.TransformerDecoder):
        module.output._wrap = True
Contributor:
Oh yeah to clarify there are likely to be some inbound changes to the TransformerDecoder class itself. Basically we will probably have

(a) a base class without any output layer (so just token embeddings, decoder layers, and norm),
(b) a class equivalent to our existing TransformerDecoder, but with (a) as a component + the usual output projection to vocab size, and
(c) a separate classifier composing (a) with a more general head module.

Personally I don't think you should optimize for changes that haven't landed yet, but we should at least have an idea of how we'd need to change the util to support this.
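
Purely as a sketch of the decomposition described above (these class names are hypothetical; only TransformerDecoderLM is floated in the comment, and none of this exists in torchtune as of this PR):

import torch.nn as nn


class TransformerBackbone(nn.Module):
    # (a) token embeddings + decoder layers + final norm, no output layer
    def __init__(self, tok_embeddings: nn.Embedding, layers: nn.ModuleList, norm: nn.Module):
        super().__init__()
        self.tok_embeddings = tok_embeddings
        self.layers = layers
        self.norm = norm


class TransformerDecoderLM(nn.Module):
    # (b) backbone + projection to vocab size; a wrapping util could check
    # for this class when deciding to shard the output projection separately
    def __init__(self, backbone: TransformerBackbone, output: nn.Linear):
        super().__init__()
        self.backbone = backbone
        self.output = output


class TransformerClassifier(nn.Module):
    # (c) backbone composed with a more general head module
    def __init__(self, backbone: TransformerBackbone, head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.head = head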


    Args:
        memory_efficient_fsdp_wrap (bool): If ``True``, will also wrap embedding and output projection layers
            with FSDP.
Contributor:
I think you need to indent or something. This is rendering weirdly in the live docs (can be seen in the screenshot you attached for the summary)

@rohan-varma (Member, Author):
Oh oops, I changed this for LoRA on L226 but didn't change it here, thanks!

    have not been verified and may not see the same improvements.
    Returns:
        FSDPPolicyType: Wrapping policy that can be passed into ``FullyShardedDataParallel`` as the ``auto_wrap_policy``
            argument. Please see documentation for `torchtune.utils.FSDPPolicyType` for additional details.
Contributor:
Bumping this

@codecov-commenter:

Codecov Report

Attention: Patch coverage is 24.32432%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 26.67%. Comparing base (a978956) to head (6124dd5).
Report is 29 commits behind head on main.

Files Patch % Lines
tests/torchtune/utils/test_distributed.py 21.05% 15 Missing ⚠️
torchtune/utils/_distributed.py 27.77% 13 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #865       +/-   ##
===========================================
- Coverage   66.39%   26.67%   -39.72%     
===========================================
  Files         155      172       +17     
  Lines        6484     7182      +698     
===========================================
- Hits         4305     1916     -2389     
- Misses       2179     5266     +3087     

☔ View full report in Codecov by Sentry.

@rohan-varma merged commit fa1392b into main May 7, 2024
andrewor14 added a commit to andrewor14/torchtune that referenced this pull request May 14, 2024
@joecummings deleted the wrapping_imp branch May 14, 2024 19:51