update memory optimization tutorial #1948

Merged: 19 commits merged into pytorch:main on Nov 7, 2024

Conversation

felipemello1 (Contributor) commented on Nov 4, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)
    (screenshot)

pytorch-bot (bot) commented on Nov 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1948

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1e4a198 with merge base 9eced21:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Nov 4, 2024
codecov-commenter commented on Nov 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.74%. Comparing base (9eced21) to head (6ec2c8f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1948      +/-   ##
==========================================
- Coverage   68.40%   65.74%   -2.66%     
==========================================
  Files         311      311              
  Lines       16973    16973              
==========================================
- Hits        11610    11159     -451     
- Misses       5363     5814     +451     

☔ View full report in Codecov by Sentry.

Comment on lines 18 to 21
":ref:`glossary_act_ckpt`", "Use when you're memory constrained and want to use a larger model, batch size or context lengths. Be aware that it will slow down training speed."
":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed due to the overhead of moving tensors between GPU VRAM and CPU. We minimize it by using a different stream, so you may not experience any slow down. This **should** be used alongside activation checkpointing."
":ref:`glossary_grad_accm`", "Helpful when memory-constrained to simulate larger batch sizes. However, it is not compatible with optimizer in backward. Use it when training adapters, e.g. LoRA, which don't benefit from optimizer in backward, or when you can already fit a batch in memory, but not enough of them."
":ref:`glossary_low_precision_opt`", "When you have a large model (optimizer state is 2x model size) and need to further reduce memory. Note that lower precision optimizers may reduce training stability/accuracy."
Contributor

I feel like we are making these so-called summaries longer rather than shorter, which I do not like. Like honestly I have half a mind to just give columns like "memory", "tokens per second", "when to use" (maybe "constraints" too?) with up/down arrows for the first two columns and < 10 words on "when to use" (e.g. long sequence length, increase effective batch size, etc). The individual items further down should be responsible for getting more into the weeds.

Contributor

Also the offloading blurb is too handwavy. (Again would benefit from being moved down to the more detailed section to give it the proper treatment, as a first-time reader just trying to understand my options I definitely do not care that it's using a separate stream)

Collaborator

> Like honestly I have half a mind to just give columns like "memory", "tokens per second", "when to use" (maybe "constraints" too?) with up/down arrows for the first two columns and < 10 words on "when to use" (e.g. long sequence length, increase effective batch size, etc).

I'd be up for this, happy to do it in a follow up.

Contributor Author

@SalmanMohammadi, I am gathering data to write a blog and add a table. After I have a draft, I would love to have your input. I should have it probably in a week.

Contributor Author

I simplified it and removed some info.
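
For readers skimming this thread, here is a minimal plain-PyTorch sketch of the gradient accumulation technique summarized in the quoted rows above (this is not torchtune's recipe code; the model, data, and `accum_steps` values are placeholders for illustration):

```python
# Gradient accumulation: sum gradients over `accum_steps` micro-batches before
# a single optimizer step, simulating a batch size of accum_steps * micro_bsz.
# Note this is incompatible with fusing the optimizer step into the backward
# pass, because gradients must persist across micro-batches.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4  # placeholder value

data = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(8)]

for step, (inputs, labels) in enumerate(data):
    # Scale the loss so the accumulated gradient matches one full batch.
    loss = F.cross_entropy(model(inputs), labels) / accum_steps
    loss.backward()  # gradients accumulate in param.grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```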

":ref:`glossary_qlora`", "When you need even more memory savings than LoRA, at the potential cost of some training speed. Useful for very large models or limited hardware."
":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed, as CPU optimizer steps can be slow and bottleneck training performance. Prioritize using it only if the other techniques are not enough."
":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training. This may reduce training accuracy"
":ref:`glossary_qlora`", "When you are training a large model and quantizing it will save significant memory, at the potential cost of some training speed and accuracy."
":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly when using small rank updates."
Contributor

I know this was already here, but I think we may wanna take it out. DoRA will increase memory and reduce training speed compared to LoRA, so I think it does not fit in as well with the other techniques mentioned here.

Collaborator

Is this to say we should take DoRA out of this whole doc?

Contributor

Yeah, the more I think about it, the more I think we should, since it's more of a modeling improvement on top of an existing memory-saving technique. But let me know if you strongly disagree.

Collaborator

In general I like to see features that we offer documented somewhere rather than not documented, so whether we put this here or e.g. move the section to the LoRA recipe docs to let users know they can use it, or as another option under the LoRA section, I'm easy.

Contributor Author

I left DoRA as "a variant of LoRA" as a middle ground.

Contributor

Eh I like this even less. Now it is both (a) there and (b) uninformative. Sorry I am being a PITA. Let's say "a variant of LoRA that can improve model performance at the cost of slightly more memory"?

* Gradient accumulation should always be set to 1 when ``offload_gradients=True``, as gradients are cleared on GPU every backward pass.
* This optimizer works by keeping a copy of parameters and pre-allocating gradient memory on CPU. Therefore, expect your RAM usage to increase by 4x model size.
* This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` in any distributed recipe. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details
* This optimizer is only supported for single-device recipes. To use CPU-offloading in distributed recipes, use ``fsdp_cpu_offload=True`` instead. See :class:`torch.distributed.fsdp.FullyShardedDataParallel` for more details
Contributor

Is :class:`torch.distributed.fsdp.FullyShardedDataParallel` still the best API ref here?

Contributor Author

yes
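
For context, a hedged sketch of what constructing the CPU-offload optimizer described in the bullets above can look like, assuming the torchao prototype API of that time (the exact signature may have changed since this PR, and the toy model and learning rate are placeholders):

```python
# Optimizer states live on CPU and the optimizer step runs on CPU. Per the
# notes above, a CPU copy of the parameters and pre-allocated gradient memory
# mean host RAM usage grows by roughly 4x the model size. With
# offload_gradients=True, gradients are moved off the GPU every backward pass,
# which is why gradient accumulation should stay at 1.
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(4096, 4096, device="cuda")
optimizer = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,       # underlying optimizer that steps on CPU
    offload_gradients=True,  # optionally free GPU gradient memory each backward
    lr=1e-4,
)
```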


Comment on lines 409 to 410
When considering using QLoRA to reduce memory usage, it's worth noting that a) QLoRA is slower than LoRA and may not be worth it if
the model you are finetuning is small; b) QLoRA prevents accuracy degradation during quantization by up-casting quantized parameters
Contributor

I think this inline enumeration will not look great in the live docs. Also, you can probably be more explicit about the memory savings, right? Doesn't it save roughly a constant 1.5 bytes * (# of model parameters) compared to LoRA? In general I would try to be less handwavy about memory/perf statements in this doc wherever possible.

Contributor Author

Removed the a/b enumeration and added the 1.5 bytes figure.
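
To make the 1.5 bytes figure concrete, a quick back-of-the-envelope sketch (the 7B parameter count is just an illustrative example, not tied to any config in this PR):

```python
# NF4 weights take roughly 0.5 bytes/param versus 2 bytes/param for frozen
# bf16 LoRA weights, i.e. about 1.5 bytes * (# of model parameters) saved.
num_params = 7e9  # illustrative 7B-parameter model
savings_gb = 1.5 * num_params / 1e9
print(f"~{savings_gb:.1f} GB of GPU memory saved vs. bf16 LoRA")  # ~10.5 GB
```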

@ebsmothers (Contributor) left a comment

Looks like the doc build job is failing.

":ref:`glossary_dora`", "Like LoRA, DoRA can provide significant memory savings and training speed-ups. DoRA may improve performance over LoRA, particularly when using small rank updates."
":ref:`glossary_precision`", "You'll usually want to leave this as its default ``bfloat16``. It uses 2 bytes per model parameter, half of fp32."
":ref:`glossary_act_ckpt`", "Use when you're memory constrained and want to use a larger model, batch size or context lengths. Be aware that it will slow down training speed."
":ref:`glossary_act_off`", "Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed. This **should** be used alongside activation checkpointing."
Collaborator

Is this right? Should you always use activation offloading when using activation checkpointing?

Contributor Author

It's the other way around: you always need checkpointing when using offloading. Yes, otherwise it's painfully slow.
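
For readers following along, here is a minimal plain-PyTorch sketch of why the two techniques compose; torchtune's actual implementation uses its own offloading context manager and CUDA streams, so treat this only as an illustration (the toy module and sizes are placeholders):

```python
# Activation checkpointing drops most intermediate activations and recomputes
# them during backward; activation offloading moves whatever is still saved to
# CPU. Offloading without checkpointing would ship every activation over PCIe,
# which is why offloading is only recommended on top of checkpointing.
import torch
from torch.utils.checkpoint import checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
block = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

# save_on_cpu offloads tensors saved for backward; pinned memory enables
# faster host<->device copies when a GPU is available.
with torch.autograd.graph.save_on_cpu(pin_memory=torch.cuda.is_available()):
    y = checkpoint(block, x, use_reentrant=False)  # checkpoint the block
y.sum().backward()
```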

":ref:`glossary_cpu_offload`", "Offloads optimizer states and (optionally) gradients to CPU, performing optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough."
":ref:`glossary_lora`", "When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training. This may reduce training accuracy"
":ref:`glossary_qlora`", "When you are training a large model, since quantization will save 1.5 bytes * (# of model parameters), at the potential cost of some training speed and accuracy."
":ref:`glossary_dora`", "A variant of LoRA."
Collaborator

Which may use slightly more memory and reduce training speed compared to LoRA? idk

Contributor Author

Maybe this is ok to keep in the section with the details?

@SalmanMohammadi (Collaborator) left a comment

Thanks for your patience here :)

All of our recipes support lower-precision optimizers from the `torchao <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim>`_ library.
For single device recipes, we also support `bitsandbytes <https://huggingface.co/docs/bitsandbytes/main/en/index>`_ .

A good place to start might be the :class:`torchao.prototype.low_bit_optim.torchao.AdamW8bit` and :class:`bitsandbytes.optim.PagedAdamW8bit` optimizers.
Contributor

Not a huge deal, but these :class: annotations don't actually render. If we want to link out to something, it's probably better to just use a hyperlink with a GitHub URL pinned to a specific commit.
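
A hedged sketch of dropping in one of these low-bit optimizers, using the import paths referenced in the quoted docs (both APIs were prototype-stage around this PR and may have moved since; the toy model and learning rate are placeholders):

```python
# 8-bit AdamW: the optimizer state (exp_avg / exp_avg_sq) is stored in 8 bits
# instead of fp32, cutting optimizer-state memory by roughly 4x.
import torch
from torchao.prototype.low_bit_optim import AdamW8bit

model = torch.nn.Linear(4096, 4096)
optimizer = AdamW8bit(model.parameters(), lr=1e-4)

# Single-device alternative via bitsandbytes; the paged variant can spill
# optimizer state to CPU under GPU memory pressure.
# import bitsandbytes as bnb
# optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-4)
```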

@ebsmothers (Contributor) left a comment

Left a few more small comments, otherwise looks great. Thanks for making these changes!

@felipemello1 merged commit 05232a0 into pytorch:main on Nov 7, 2024
11 checks passed
@felipemello1 deleted the upate_memory_opt_tutorial branch on November 7, 2024 at 00:24
joecummings pushed a commit that referenced this pull request Nov 11, 2024
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Salman Mohammadi <[email protected]>
Co-authored-by: ebsmothers <[email protected]>
@ebsmothers mentioned this pull request on Nov 26, 2024
Labels
CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.